ESEC/FSE 2023
31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023)

31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023), December 3–9, 2023, San Francisco, CA, USA

ESEC/FSE 2023 – Proceedings

Twitter: https://twitter.com/esecfse

Frontmatter

Title Page


Welcome from the Chairs
We are pleased to welcome all delegates to ESEC/FSE 2023, the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ESEC/FSE is an internationally renowned forum for researchers, practitioners, and educators to present and discuss the most recent innovations, trends, experiences, and challenges in the field of software engineering. ESEC/FSE brings together experts from academia and industry to exchange the latest research results and trends as well as their practical application in all areas of software engineering.

ESEC/FSE 2023 Organization
Committee Listings

Sponsors



Keynotes

Towards AI-Driven Software Development: Challenges and Lessons from the Field (Keynote)
Eran Yahav
(Technion, Israel)
AI is changing the way we develop software. AI is becoming powerful enough to change the nature of interaction between humans and machines and not only to raise the level of abstraction. AI-driven software development is poised to transform the entire software development lifecycle (SDLC). As we move towards AI-driven software development, we must revisit some fundamental assumptions and address the following challenges:
• How does the SDLC change when autonomous agents can handle some tasks? What is the role of code and version control?
• Interaction model: What is the right human-machine interaction? How do we best communicate intent to the AI? How to best consume results?
• Contextual awareness: How do we make the AI contextually aware of our development environment? Can we make the AI hyper-local and tailored to our problem and solution domains?
• Trust: How can we trust the suggested results? How can we trust results that are not provided as code?
In this talk, we will start with practical AI-assisted software development, including lessons from the field, based on our experience serving millions of users with Tabnine. We will cover different tasks in the SDLC and various techniques for addressing them in the face of the challenges above.

Getting Outside the Bug Boxes (Keynote)
Margaret Burnett
(Oregon State University, USA)
Sometimes, we humans find ourselves a bit slow to abandon the comfort of sitting “inside the box”, and this can detract from our ability to innovate. In this talk, I’ll share some outside-the-box perspectives, gleaned from decades of software engineering work, on boxes I’ve seen when thinking about bugs — from failures to faults, from finding to fixing, and from traditional to very non-traditional notions of “what counts” as a bug. I’ll consider the intellectually freeing perspectives that can come from moving outside the “mechanisms” box to policies; the enhancement to applicability from moving outside sub-sub-area boxes to the whole software lifecycle; the differences revealed when moving outside the “typical developer” box to diverse humans; and the plethora of possibilities arising from moving outside the “buggy code” box to a wide range of bug types.


Research Papers

Human Aspects I

A Four-Year Study of Student Contributions to OSS vs. OSS4SG with a Lightweight Intervention
Zihan Fang, Madeline Endres, Thomas Zimmermann, Denae Ford, Westley Weimer, Kevin Leach, and Yu Huang
(Vanderbilt University, USA; University of Michigan, USA; Microsoft Research, USA)
Modern software engineering practice and training increasingly rely on Open Source Software (OSS). The recent growth in demand for professional software engineers has led to increased contributions to, and usage of, OSS. However, there is limited understanding of the factors affecting how developers, and how new or student developers in particular, decide which OSS projects to contribute to, a process critical to OSS sustainability, access, adoption, and growth. To better understand OSS contributions from the developers of tomorrow, we conducted a four-year study with 1,361 students investigating the life cycle of their contributions (from project selection to pull request acceptance). During the study, we also delivered a lightweight intervention to promote the awareness of open source projects for social good (OSS4SG), OSS projects that have positive impacts in other domains. Using both quantitative and qualitative methods, we analyze student experience reports and the pull requests they submit. Compared to general OSS projects, we find significant differences in project selection (𝑝 < 0.0001, effect size = 0.84), student motivation (𝑝 < 0.01, effect size = 0.13), and increased pull-request acceptance rates for OSS4SG contributions. We also find that our intervention correlates with increased student contributions to OSS4SG (𝑝 < 0.0001, effect size = 0.38). Finally, we analyze correlations of factors such as gender or working with a partner. Our findings may help improve the experience for new developers participating in OSS4SG and the quality of their contributions. We also hope our work helps educators, project leaders, and contributors to build a mutually-beneficial framework for the future growth of OSS4SG.

Do CONTRIBUTING Files Provide Information about OSS Newcomers’ Onboarding Barriers?
Felipe Fronchetti, David C. Shepherd, Igor Wiese, Christoph Treude, Marco Aurélio Gerosa, and Igor Steinmacher
(Virginia Commonwealth University, USA; Louisiana State University, USA; Federal University of Technology Paraná, Brazil; University of Melbourne, Australia; Northern Arizona University, USA)
Effectively onboarding newcomers is essential for the success of open source projects. These projects often provide onboarding guidelines in their CONTRIBUTING files (e.g., CONTRIBUTING.md on GitHub). These files explain, for example, how to find open tasks, implement solutions, and submit code for review. However, these files often do not follow a standard structure, can be too large, and miss barriers commonly found by newcomers. In this paper, we propose an automated approach to parse these CONTRIBUTING files and assess how they address onboarding barriers. We manually classified a sample of files according to a model of onboarding barriers from the literature, trained a machine learning classifier that automatically predicts the categories of each paragraph (precision: 0.655, recall: 0.662), and surveyed developers to investigate their perspective of the predictions’ adequacy (75% of the predictions were considered adequate). We found that CONTRIBUTING files typically do not cover the barriers newcomers face (52% of the analyzed projects missed at least 3 out of the 6 barriers faced by newcomers; 84% missed at least 2). Our analysis also revealed that information about choosing a task and talking with the community, two of the most recurrent barriers newcomers face, is neglected in more than 75% of the projects. We made our classifier available as an online service that analyzes the content of a given CONTRIBUTING file. Our approach may help community builders identify missing information in the project ecosystem they maintain and help newcomers understand what to expect in CONTRIBUTING files.

How Early Participation Determines Long-Term Sustained Activity in GitHub Projects?
Wenxin Xiao, Hao He, Weiwei Xu, Yuxia Zhang, and Minghui Zhou
(Peking University, China; Beijing Institute of Technology, China)
Although the open source model bears many advantages in software development, open source projects are always hard to sustain. Previous research on open source sustainability mainly focuses on projects that have already reached a certain level of maturity (e.g., with communities, releases, and downstream projects). However, limited attention is paid to the development of (sustainable) open source projects in their infancy, and we believe an understanding of early sustainability determinants is crucial for project initiators, incubators, newcomers, and users.
In this paper, we aim to explore the relationship between early participation factors and long-term project sustainability. We leverage a novel methodology combining the Blumberg model of performance and machine learning to predict the sustainability of 290,255 GitHub projects. Specifically, we train an XGBoost model based on early participation (the first three months of activity) in these projects and interpret the model using LIME. We quantitatively show that early participants have a positive effect on a project's future sustained activity if they have prior experience in OSS project incubation and demonstrate concentrated focus and steady commitment. Participation from non-code contributors and detailed contribution documentation also promote a project's sustained activity. Compared with individual projects, building a community that consists of more experienced core developers and more active peripheral developers is important for organizational projects. This study provides unique insights into the incubation and recognition of sustainable open source projects, and our interpretable prediction approach can also offer guidance to open source project initiators and newcomers.

Matching Skills, Past Collaboration, and Limited Competition: Modeling When Open-Source Projects Attract Contributors
Hongbo Fang, James Herbsleb, and Bogdan Vasilescu
(Carnegie Mellon University, USA)
Attracting and retaining new developers is often at the heart of open-source project sustainability and success. Previous research found many intrinsic (or endogenous) project characteristics associated with the attractiveness of projects to new developers, but the impact of factors external to the project itself has largely been overlooked. In this work, we focus on one such external factor, a project's labor pool, which is defined as the set of contributors active in the overall open-source ecosystem that the project could plausibly attempt to recruit from at a given time. How are the size and characteristics of the labor pool associated with a project's attractiveness to new contributors? Through an empirical study of over 516,893 Python projects, we found that the size of the project's labor pool, the technical skill match, and the social connection between the project's labor pool and members of the focal project all significantly influence the number of new developers that the focal project attracts, with the competition between projects with overlapping labor pools also playing a role. Overall, the labor pool factors add considerable explanatory power compared to models with only project-level characteristics.


Testing I

Accelerating Continuous Integration with Parallel Batch Testing
Emad Fallahzadeh, Amir Hossein Bavand, and Peter C. Rigby
(Concordia University, Canada)
Continuous integration at scale is costly but essential to software development. Various test optimization techniques, including test selection and prioritization, aim to reduce the cost. Test batching is an effective but overlooked alternative. This study evaluates the effect of parallelization by adjusting the machine count for test batching and introduces two novel approaches.
We establish TestAll as a baseline to study the impact of parallelism and machine count on feedback time. We re-evaluate ConstantBatching and introduce DynamicBatching, which adapts batch size based on the remaining changes in the queue. We also propose TestCaseBatching, enabling new builds to join a batch before full test execution, thus speeding up continuous integration. Our evaluations utilize Ericsson’s results and 276 million test outcomes from open-source Chrome, assessing feedback time and execution reduction, and we provide access to the Chrome project scripts and data.
The results reveal a non-linear impact of test parallelization on feedback time, as each test delay compounds across the entire test queue. ConstantBatching, with a batch size of 4, utilizes up to 72% fewer machines to maintain the actual average feedback time and provides a constant execution reduction of up to 75%. Similarly, DynamicBatching maintains the actual average feedback time with up to 91% fewer machines and exhibits variable execution reduction of up to 99%. TestCaseBatching holds the line of the actual average feedback time with up to 81% fewer machines and demonstrates variable execution reduction of up to 67%. We recommend practitioners use DynamicBatching and TestCaseBatching to reduce the required testing machines efficiently. Analyzing historical data to find the threshold where adding more machines has minimal impact on feedback time is also crucial for resource-effective testing.
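The 75% execution reduction reported for ConstantBatching with a batch size of 4 is consistent with a simple best-case calculation: if every batch passes, one test execution covers four changes. The sketch below illustrates only that arithmetic and is not the authors' implementation. The function names are hypothetical, and it assumes that no batch fails (a failing batch would need extra executions to find the culprit change) and ignores machine counts.

```python
def constant_batches(queue, batch_size):
    """Split a queue of pending changes into fixed-size batches."""
    return [queue[i:i + batch_size] for i in range(0, len(queue), batch_size)]


def execution_reduction(num_changes, num_executions):
    """Fraction of test executions saved relative to testing every change alone."""
    return 1 - num_executions / num_changes


# Assumption: all batches pass, so each batch costs exactly one execution.
changes = [f"change-{i}" for i in range(8)]
batches = constant_batches(changes, 4)
reduction = execution_reduction(len(changes), len(batches))
print(len(batches), reduction)
```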

DistXplore: Distribution-Guided Testing for Evaluating and Enhancing Deep Learning Systems
Longtian Wang, Xiaofei Xie, Xiaoning Du, Meng Tian, Qing Guo, Zheng Yang, and Chao Shen
(Xi’an Jiaotong University, China; Singapore Management University, Singapore; Monash University, Australia; A*STAR, Singapore; Huawei, China)
Deep learning (DL) models are trained on sampled data, where the distribution of training data differs from that of real-world data (i.e., the distribution shift), which reduces the model's robustness. Various testing techniques have been proposed, including distribution-unaware and distribution-aware methods. However, distribution-unaware testing lacks effectiveness because it does not explicitly consider the distribution of test cases and may generate redundant errors (within the same distribution). Distribution-aware testing techniques primarily focus on generating test cases that follow the training distribution, missing out-of-distribution data that may also be valid and should be considered in the testing process. In this paper, we propose a novel distribution-guided approach for generating valid test cases with diverse distributions, which can better evaluate the model's robustness (i.e., generating hard-to-detect errors) and enhance the model's robustness (i.e., enriching training data). Unlike existing testing techniques that optimize individual test cases, DistXplore optimizes test suites that represent specific distributions. To evaluate and enhance the model's robustness, we design two metrics: distribution difference, which maximizes the similarity in distribution between two different classes of data to generate hard-to-detect errors, and distribution diversity, which increases the distribution diversity of generated test cases for enhancing the model's robustness. To evaluate the effectiveness of DistXplore in model evaluation and enhancement, we compare DistXplore with 14 state-of-the-art baselines on 10 models across 4 datasets. The evaluation results show that DistXplore not only detects a larger number of errors (e.g., 2×+ on average) but also achieves a higher improvement in empirical robustness (e.g., 5.2% more accuracy improvement than the baselines on average).

CAmpactor: A Novel and Effective Local Search Algorithm for Optimizing Pairwise Covering Arrays
Qiyuan Zhao, Chuan Luo, Shaowei Cai, Wei Wu, Jinkun Lin, Hongyu Zhang, and Chunming Hu
(Beihang University, China; Institute of Software at Chinese Academy of Sciences, China; Central South University, China; Xiangjiang Laboratory, China; SeedMath Technology, China; Chongqing University, China)
The increasing demand for software customization has led to the development of highly configurable systems. Combinatorial interaction testing (CIT) is an effective method for testing these types of systems. The ultimate goal of CIT is to generate a test suite of acceptable size, called a t-wise covering array (CA), where t is the testing strength. Pairwise testing (i.e., CIT with t=2) is recognized to be the most widely-used CIT technique and has strong fault detection capability. In pairwise testing, the most important problem is pairwise CA generation (PCAG), which is to generate a pairwise CA (PCA) of minimum size. However, existing state-of-the-art PCAG algorithms suffer from a severe scalability challenge; that is, they cannot tackle large-scale PCAG instances effectively, resulting in PCAs of large sizes. To alleviate this challenge, in this paper we propose CAmpactor, a novel and effective local search algorithm for compacting given PCAs into smaller sizes. Extensive experiments on a large number of real-world, public PCAG instances show that the sizes of CAmpactor's generated PCAs are around 45% smaller than the sizes of PCAs constructed by existing state-of-the-art PCAG algorithms, indicating its superiority. Also, our evaluation confirms the generality of CAmpactor, since CAmpactor can reduce the sizes of PCAs generated by a variety of PCAG algorithms.
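To make the notion of a pairwise covering array concrete, the sketch below checks whether a test suite covers every pair of option values. It illustrates the definition only and is not CAmpactor's local search. The helper name `uncovered_pairs` is hypothetical, and the check assumes there are no constraints between options, which real PCAG instances usually have.

```python
from itertools import combinations, product


def uncovered_pairs(array, domains):
    """Return the (option_i, option_j, value_pair) triples that no row covers.

    domains[i] lists the allowed values of option i. Constraints between
    options are ignored in this simplified check.
    """
    missing = []
    for i, j in combinations(range(len(domains)), 2):
        seen = {(row[i], row[j]) for row in array}
        for pair in product(domains[i], domains[j]):
            if pair not in seen:
                missing.append((i, j, pair))
    return missing


# Three binary options: exhaustive testing needs 8 rows, pairwise needs only 4.
domains = [[0, 1], [0, 1], [0, 1]]
pca = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(uncovered_pairs(pca, domains))
```

Dropping any row from `pca` leaves at least one pair uncovered, so no smaller suite is obtained by deleting rows from this array.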

Artifacts Available, Artifacts Reusable

Machine Learning I

Design by Contract for Deep Learning APIs
Shibbir Ahmed, Sayem Mohammad Imtiaz, Samantha Syeda Khairunnesa, Breno Dantas Cruz, and Hridesh Rajan
(Iowa State University, USA; Bradley University, USA)
Deep Learning (DL) techniques are increasingly being incorporated in critical software systems today. DL software is buggy too. Recent work in SE has characterized these bugs, studied fix patterns, and proposed detection and localization strategies. In this work, we introduce a preventative measure. We propose design by contract for DL libraries, DL Contract for short, to document the properties of DL libraries and provide developers with a mechanism to identify bugs during development. While DL Contract builds on the traditional design by contract techniques, we need to address unique challenges. In particular, we need to document properties of the training process that are not visible at the functional interface of the DL libraries. To solve these problems, we have introduced mechanisms that allow developers to specify properties of the model architecture, data, and training process. We have designed and implemented DL Contract for Python-based DL libraries and used it to document the properties of Keras, a well-known DL library. We evaluate DL Contract in terms of effectiveness, runtime overhead, and usability. To evaluate the utility of DL Contract, we have developed 15 sample contracts specifically for training problems and structural bugs. We have adopted four well-vetted benchmarks from prior works on DL bug detection and repair. In terms of effectiveness, DL Contract correctly detects 259 bugs in 272 real-world buggy programs from these benchmarks. We found that the DL Contract overhead is fairly minimal for the benchmarks used. Lastly, to evaluate usability, we conducted a survey of twenty participants who have used DL Contract to find and fix bugs. The results reveal that DL Contract can be very helpful to DL application developers when debugging their code.
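The general idea of attaching a precondition to a training call can be sketched in plain Python. This is an illustration of design by contract only, not the DL Contract API: the `requires` decorator, `ContractViolation`, and the stand-in `fit` function are all hypothetical, and the chosen property (features scaled to [0, 1]) is an assumed example of a data contract.

```python
import functools


class ContractViolation(Exception):
    """Raised when a precondition on a call does not hold."""


def requires(predicate, message):
    """Attach a precondition to a function (hypothetical helper)."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not predicate(*args, **kwargs):
                raise ContractViolation(message)
            return func(*args, **kwargs)
        return wrapper
    return decorate


def inputs_normalized(features, labels):
    """Assumed data property: every feature value lies in [0, 1]."""
    return all(0.0 <= x <= 1.0 for row in features for x in row)


@requires(inputs_normalized, "training features should be scaled to [0, 1]")
def fit(features, labels):
    """Stand-in for a library training call; it only reports the sample count."""
    return len(features)


print(fit([[0.1, 0.9]], [1]))
```

Calling `fit([[0.0, 255.0]], [0])` raises `ContractViolation` before any training work starts, which is the point at which a contract is meant to surface the bug.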

Artifacts Available, Artifacts Reusable
Testing Coreference Resolution Systems without Labeled Test Sets
Jialun Cao, Yaojie Lu, Ming Wen, and Shing-Chi Cheung
(Hong Kong University of Science and Technology, China; Guangzhou HKUST Fok Ying Tung Research Institute, China; Institute of Software at Chinese Academy of Sciences, China; Huazhong University of Science and Technology, China)
Coreference resolution (CR) is a task to resolve different expressions (e.g., named entities, pronouns) that refer to the same real-world entity/event. It is a core natural language processing (NLP) component that underlies and empowers major downstream NLP applications such as machine translation, chatbots, and question-answering. Despite its broad impact, the problem of testing CR systems has rarely been studied. A major difficulty is the shortage of a labeled dataset for testing. While it is possible to feed arbitrary sentences as test inputs to a CR system, a test oracle that captures their expected test outputs (coreference relations) is hard to define automatically. To address the challenge, we propose Crest, an automated testing methodology for CR systems. Crest uses constituency and dependency relations to construct pairs of test inputs subject to the same coreference. These relations can be leveraged to define the metamorphic relation for metamorphic testing. We compare Crest with five state-of-the-art test generation baselines on two popular CR systems, and apply them to generate tests from 1,000 sentences randomly sampled from CoNLL-2012, a popular dataset for coreference resolution. Experimental results show that Crest outperforms baselines significantly. The issues reported by Crest are all true positives (i.e., 100% precision), compared with 63% to 75% achieved by the baselines.

Artifacts Available
Neural-Based Test Oracle Generation: A Large-Scale Evaluation and Lessons Learned
Soneya Binta Hossain, Antonio Filieri, Matthew B. Dwyer, Sebastian Elbaum, and Willem Visser
(University of Virginia, USA; Amazon Web Services, USA)
Defining test oracles is crucial and central to test development, but manual construction of oracles is expensive. While recent neural-based automated test oracle generation techniques have shown promise, their real-world effectiveness remains a compelling question requiring further exploration and understanding. This paper investigates the effectiveness of TOGA, a recently developed neural-based method for automatic test oracle generation. TOGA utilizes EvoSuite-generated test inputs and generates both exception and assertion oracles. In a Defects4j study, TOGA outperformed specification, search, and neural-based techniques, detecting 57 bugs, including 30 unique bugs not detected by other methods. To gain a deeper understanding of its applicability in real-world settings, we conducted a series of external, extended, and conceptual replication studies of TOGA.
In a large-scale study involving 25 real-world Java systems, 223.5K test cases, and 51K injected faults, we evaluate TOGA’s ability to improve fault-detection effectiveness relative to the state-of-the-practice and the state-of-the-art. We find that TOGA misclassifies the type of oracle needed 24% of the time and that, when it classifies correctly, around 62% of the time it is not confident enough to generate any assertion oracle. When it does generate an assertion oracle, more than 47% of them are false positives, and the true positive assertions only increase fault detection by 0.3% relative to prior work. These findings expose limitations of the state-of-the-art neural-based oracle generation technique, provide valuable insights for improvement, and offer lessons for evaluating future automated oracle generation methods.

Artifacts Available
Revisiting Neural Program Smoothing for Fuzzing
Maria-Irina Nicolae, Max Eisele, and Andreas Zeller
(Bosch, Germany; Saarland University, Germany; CISPA Helmholtz Center for Information Security, Germany)
Testing with randomly generated inputs (fuzzing) has gained significant traction due to its capacity to expose program vulnerabilities automatically. Fuzz testing campaigns generate large amounts of data, making them ideal for the application of machine learning (ML). Neural program smoothing, a specific family of ML-guided fuzzers, aims to use a neural network as a smooth approximation of the program target for new test case generation.
In this paper, we conduct the most extensive evaluation of neural program smoothing (NPS) fuzzers against standard gray-box fuzzers (>11 CPU years and >5.5 GPU years), and make the following contributions: We find that the original performance claims for NPS fuzzers do not hold, a gap we relate to fundamental, implementation, and experimental limitations of prior works. We contribute the first in-depth analysis of the contribution of machine learning and gradient-based mutations in NPS. We implement Neuzz++, which shows that addressing the practical limitations of NPS fuzzers improves performance, but that standard gray-box fuzzers almost always surpass NPS-based fuzzers. As a consequence, we propose new guidelines targeted at benchmarking fuzzing based on machine learning, and present MLFuzz, a platform with GPU access for easy and reproducible evaluation of ML-based fuzzers.
Neuzz++, MLFuzz, and all our data are public.

Artifacts Available, Artifacts Reusable

Automated Repair I

RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair
Weishi Wang, Yue Wang, Shafiq Joty, and Steven C.H. Hoi
(Nanyang Technological University, Singapore; Salesforce AI Research, Singapore; Salesforce AI Research, USA)
Automatic program repair (APR) is crucial to reduce manual debugging efforts for developers and improve software reliability. While conventional search-based techniques typically rely on heuristic rules or a redundancy assumption to mine fix patterns, recent years have witnessed the surge of deep learning (DL) based approaches to automate the program repair process in a data-driven manner. However, their performance is often limited by a fixed set of parameters to model the highly complex search space of APR.
To ease such burden on the parametric models, in this work, we propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen) by explicitly leveraging relevant fix patterns retrieved from a codebase of previous bug-fix pairs. Specifically, we build a hybrid patch retriever to account for both lexical and semantic matching based on the raw source code in a language-agnostic manner, which does not rely on any code-specific features. In addition, we adapt a code-aware language model CodeT5 as our foundation model to facilitate both patch retrieval and generation tasks in a unified manner. We adopt a stage-wise approach where the patch retriever first retrieves a relevant external bug-fix pair to augment the buggy input for the CodeT5 patch generator, which synthesizes a ranked list of repair patch candidates. Notably, RAP-Gen is a generic APR framework that can flexibly integrate different patch retrievers and generators to repair various types of bugs.
We thoroughly evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java, where the bug localization information may or may not be provided. Experimental results show that RAP-Gen significantly outperforms previous state-of-the-art (SoTA) approaches on all benchmarks, e.g., boosting the accuracy of T5-large on TFix from 49.70% to 54.15% (repairing 478 more bugs) and repairing 15 more bugs on 818 Defects4J bugs. Further analysis reveals that our patch retriever can search for relevant fix patterns to guide the APR systems.

From Leaks to Fixes: Automated Repairs for Resource Leak Warnings
Akshay Utture and Jens Palsberg
(University of California at Los Angeles, USA)
Resource leaks are a common and elusive source of bugs that can result in crashes and security vulnerabilities. The most effective technique to identify such leaks during development is static analysis. However, empirical studies show that in addition to leak warnings, developers often need help in the form of automated fix suggestions to correctly repair such leaks. The only existing tool that can suggest resource-leak fixes is the general-purpose tool Footpatch. Footpatch, however, performs poorly at this task; it generates fixes for only 6% of the leaks, out of which only 27% are correct.
In this paper, we introduce RLFixer, a specialized repair tool that generates high-quality fixes for resource leaks identified by any resource-leak detector. A major challenge for RLFixer is that the most general version of the resource-leak repair problem is at least as hard as compile-time object deallocation, a well-known hard problem for compilers. RLFixer tackles this issue by separating the resource-leaks that are infeasible for a compile-time tool to fix from those that are feasible to fix. RLFixer achieves this separation by using a data-flow analysis of resource objects to classify how they escape the context of their methods. The same analysis also enables RLFixer to generate correct repairs for the feasible-to-fix leaks. RLFixer is demand-driven and hence only analyzes statements relevant to the leak, thereby keeping overhead low.
We evaluated RLFixer by applying it to warnings generated by five popular Java resource-leak detectors. We show that, on average, RLFixer generates repairs for 66% of their warnings, out of which 95% are correct. It has an average repair time of 14 seconds.

Artifacts Available, Artifacts Functional
Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair
Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang
(University of Illinois at Urbana-Champaign, USA)
During Automated Program Repair (APR), it can be challenging to synthesize correct patches for real-world systems in general-purpose programming languages. Recent Large Language Models (LLMs) have been shown to be helpful “copilots” in assisting developers with various coding tasks, and have also been directly applied for patch synthesis. However, most LLMs treat programs as sequences of tokens, meaning that they are ignorant of the underlying semantic constraints of the target programming language. This results in plenty of statically invalid generated patches, impeding the practicality of the technique. Therefore, we propose Repilot, a framework to further copilot the AI “copilots” (i.e., LLMs) by synthesizing more valid patches during the repair process. Our key insight is that many LLMs produce outputs autoregressively (i.e., token by token), resembling how humans write programs, and that this process can be significantly boosted and guided through a Completion Engine. Repilot synergistically synthesizes a candidate patch through the interaction between an LLM and a Completion Engine, which 1) prunes away infeasible tokens suggested by the LLM and 2) proactively completes the token based on the suggestions provided by the Completion Engine. Our evaluation on a subset of the widely-used Defects4j 1.2 and 2.0 datasets shows that Repilot fixes 66 and 50 bugs, respectively, surpassing the best-performing baseline by 14 and 16 bugs fixed. More importantly, Repilot is capable of producing more valid and correct patches than the base LLM when given the same generation budget.
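The two interactions named above, pruning infeasible tokens and proactively completing a token, can be illustrated with a toy example. This is a minimal sketch and not Repilot's implementation: the "LLM" is a hard-coded ranked list of suggestions, the "completion engine" is a list of known member names, and `is_feasible` and `next_token` are hypothetical helpers.

```python
def is_feasible(prefix, token, known_names):
    """A token is feasible if prefix + token can still grow into a known name."""
    candidate = prefix + token
    return any(name.startswith(candidate) for name in known_names)


def next_token(prefix, ranked_tokens, known_names):
    """Pick the next token for an identifier that currently reads `prefix`."""
    matches = [name for name in known_names if name.startswith(prefix)]
    # Proactive completion: only one known name fits, so finish it directly.
    if prefix and len(matches) == 1:
        return matches[0][len(prefix):]
    # Pruning: skip suggestions that cannot lead to any known name.
    for token, _score in ranked_tokens:
        if token and is_feasible(prefix, token, known_names):
            return token
    return None


known = ["length", "lastIndexOf"]                       # toy completion engine
suggestions = [("size", 0.6), ("len", 0.3), ("la", 0.1)]  # toy ranked LLM output

first = next_token("", suggestions, known)      # "size" is pruned
second = next_token(first, suggestions, known)  # unique match is completed
print(first + second)
```

Here the highest-ranked suggestion is rejected because no known member starts with it, and the remainder of the identifier is filled in without consulting the ranked list again.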

Artifacts Available, Artifacts Reusable
SmartFix: Fixing Vulnerable Smart Contracts by Accelerating Generate-and-Verify Repair using Statistical Models
Sunbeom So and Hakjoo Oh
(Korea University, South Korea)
We present SmartFix, a new technique for repairing vulnerable smart contracts. There is an urgent need to develop automatic bug-repair techniques for smart contracts, as smart contracts are safety-critical software and manual debugging is burdensome and error-prone. While several repair approaches have been proposed recently, they are unsatisfactory since no existing technique achieves high repairability, full automation, and safety guarantees at the same time, posing significant problems for practical use. SmartFix aims to address these shortcomings by using a “generate-and-verify” approach that iteratively enumerates candidate patches while validating their correctness by invoking a safety verifier. However, in this approach, a technical challenge arises as the search space is huge and the verification-based patch validation is expensive. To address this challenge, we present a novel technique for accelerating the generate-and-verify repair procedure using statistical models derived from the verifier’s feedback. Experimental results on real-world Ethereum smart contracts show that SmartFix is able to achieve a fix success rate of 94.8% for critical classes of vulnerabilities, far outperforming sGuard, the existing state-of-the-art technique whose success rate is 65.4%.
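A generic generate-and-verify loop, and the reason a good candidate ordering saves verifier calls, can be sketched as follows. This is not SmartFix itself: the candidate patches are placeholder strings, the verifier is a stub, and the scoring function stands in for a statistical model whose construction the abstract does not detail.

```python
def repair(candidates, verify, score):
    """Try patches in descending score order.

    Returns the first patch the verifier accepts and the number of
    verifier calls spent, or (None, calls) if no candidate is accepted.
    """
    calls = 0
    for patch in sorted(candidates, key=score, reverse=True):
        calls += 1
        if verify(patch):
            return patch, calls
    return None, calls


# Placeholder candidates and a stub verifier (assumptions for illustration).
candidates = ["rename variable", "add reentrancy guard", "add overflow check"]


def verify(patch):
    return patch == "add overflow check"


# Hypothetical scores, standing in for a model learned from verifier feedback.
learned_scores = {"add overflow check": 0.9, "add reentrancy guard": 0.4}

uninformed = repair(candidates, verify, lambda patch: 0.0)
informed = repair(candidates, verify, lambda patch: learned_scores.get(patch, 0.0))
print(uninformed, informed)
```

With uniform scores the stable sort keeps the original enumeration order and the verifier runs three times; with the assumed scores the accepted patch is tried first.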

Publisher's Version Published Artifact Archive submitted (500 kB) Artifacts Available Artifacts Reusable
Automatically Resolving Dependency-Conflict Building Failures via Behavior-Consistent Loosening of Library Version Constraints
Huiyan Wang, Shuguan Liu, Lingyu Zhang, and Chang Xu
(Nanjing University, China)
Python projects grow quickly by code reuse and building automation based on third-party libraries. However, the version constraints associated with these libraries are prone to mal-configuration, and this forms a major obstacle to correct project building (known as dependency-conflict (DC) building failure). Our empirical findings suggest that such mal-configured version constraints were mainly prepared manually, and could essentially be refined for better quality to improve the chance of successful project building. We propose a LooCo approach to refining Python projects’ library version constraints by automatically loosening them to maximize their solutions, while keeping the libraries to observe their original behaviors. Our experimental results with real-life Python projects report that LooCo could efficiently refine library version constraints (0.4s per version loosening) by effective loosening (5.5 new versions expanded on average) automatically, and transform 54.8% originally unsolvable cases into solvable ones (i.e., successful building) and significantly increase solutions (21 more on average) for originally solvable cases.

Publisher's Version

Empirical Studies I

On the Relationship between Code Verifiability and Understandability
Kobi Feldman, Martin Kellogg, and Oscar Chaparro
(College of William and Mary, USA; New Jersey Institute of Technology, USA)
Proponents of software verification have argued that simpler code is easier to verify: that is, that verification tools issue fewer false positives and require less human intervention when analyzing simpler code. We empirically validate this assumption by comparing the number of warnings produced by four state-of-the-art verification tools on 211 snippets of Java code with 20 metrics of code comprehensibility from human subjects in six prior studies. Our experiments, based on a statistical (meta-)analysis, show that, in aggregate, there is a small correlation (r = 0.23) between understandability and verifiability. The results support the claim that easy-to-verify code is often easier to understand than code that requires more effort to verify. Our work has implications for the users and designers of verification tools and for future attempts to automatically measure code comprehensibility: verification tools may have ancillary benefits to understandability, and measuring understandability may require reasoning about semantic, not just syntactic, code properties.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study
Xiaokai Wei, Sujan Kumar Gonugondla, Shiqi Wang, Wasi Ahmad, Baishakhi Ray, Haifeng Qian, Xiaopeng Li, Varun Kumar, Zijian Wang, Yuchen Tian, Qing Sun, Ben Athiwaratkun, Mingyue Shang, Murali Krishna Ramanathan, Parminder Bhatia, and Bing Xiang
(AWS AI Labs, USA)
ML-powered code generation aims to assist developers to write code in a more productive manner by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have pushed the boundary of code generation and achieved impressive performance. However, the huge number of model parameters poses a significant challenge to their adoption in a typical software development environment, where a developer might use a standard laptop or mid-size server to develop code. Such large models cost significant resources in terms of memory, latency, dollars, as well as carbon footprint.
Model compression is a promising approach to address these challenges. We have identified quantization as one of the most promising compression techniques for code generation, as it avoids expensive retraining costs. As quantization represents model parameters with lower-bit integers (e.g., int8), the model size and runtime latency would both benefit. We empirically evaluate quantized models on code generation tasks across different dimensions: (i) resource usage and carbon footprint, (ii) accuracy, and (iii) robustness. Through systematic experiments we find a code-aware quantization recipe that could run even a 6-billion-parameter model on a regular laptop without significant accuracy or robustness degradation. We find that the recipe is readily applicable to the code summarization task as well.
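
To make the idea of representing parameters with lower-bit integers concrete, the following is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustrative assumption rather than the code-aware recipe evaluated in the paper; the function names `quantize_int8` and `dequantize` are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value differs from the original by at most half a step.
print(q, [round(r, 4) for r in restored])
```

Each parameter is stored in one byte instead of four (for float32), at the cost of a rounding error bounded by half the scale.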

Publisher's Version

Testing II

Statfier: Automated Testing of Static Analyzers via Semantic-Preserving Program Transformations
Huaien Zhang, Yu Pei, Junjie Chen, and Shin Hwei Tan
(Southern University of Science and Technology, China; Hong Kong Polytechnic University, China; Tianjin University, China; Concordia University, Canada)
Static analyzers reason about the behaviors of programs without executing them and report issues when they violate pre-defined desirable properties. One of the key limitations of static analyzers is their tendency to produce inaccurate and incomplete analysis results, i.e., they often generate too many spurious warnings and miss important issues. To help enhance the reliability of a static analyzer, developers usually manually write tests involving input programs and the corresponding expected analysis results for the analyzers. Meanwhile, a static analyzer often includes example programs in its documentation to demonstrate the desirable properties and/or their violations. Our key insight is that we can reuse programs extracted either from the official test suite or documentation and apply semantic-preserving transformations to them to generate variants. We studied the quality of input programs from these two sources and found that most rules in static analyzers are covered by at least one input program, implying the potential of using these programs as the basis for test generation. We present Statfier, a heuristic-based automated testing approach for static analyzers that generates program variants via semantic-preserving transformations and detects inconsistencies between the original program and variants (which indicate inaccurate analysis results in the static analyzer). To select variants that are more likely to reveal new bugs, Statfier uses two key heuristics: (1) analysis report guided location selection that uses program locations in the reports produced by static analyzers to perform transformations and (2) structure diversity driven variant selection that chooses variants with different program contexts and diverse types of transformations. Our experiments with five popular static analyzers show that Statfier can find 79 bugs in these analyzers, of which 46 have been confirmed.

Publisher's Version
Contextual Predictive Mutation Testing
Kush Jain, Uri Alon, Alex Groce, and Claire Le Goues
(Carnegie Mellon University, USA; Northern Arizona University, USA)
Mutation testing is a powerful technique for assessing and improving test suite quality that artificially introduces bugs and checks whether the test suites catch them. However, it is also computationally expensive and thus does not scale to large systems and projects. One promising recent approach to tackling this scalability problem uses machine learning to predict whether the tests will detect the synthetic bugs, without actually running those tests. However, existing predictive mutation testing approaches still misclassify 33% of detection outcomes on a randomly sampled set of mutant-test suite pairs. We introduce MutationBERT, an approach for predictive mutation testing that simultaneously encodes the source method mutation and test method, capturing key context in the input representation. Thanks to its higher precision, MutationBERT saves 33% of the time spent by a prior approach on checking/verifying live mutants. MutationBERT also outperforms the state-of-the-art in both same-project and cross-project settings, with meaningful improvements in precision, recall, and F1 score. We validate our input representation and our aggregation approaches for lifting predictions from the test matrix level to the test suite level, finding similar improvements in performance. MutationBERT not only enhances the state-of-the-art in predictive mutation testing, but also presents practical benefits for real-world applications, both in saving developer time and finding hard-to-detect mutants that prior approaches do not.

Publisher's Version
𝜇Akka: Mutation Testing for Actor Concurrency in Akka using Real-World Bugs
Mohsen Moradi Moghadam, Mehdi Bagherzadeh, Raffi Khatchadourian, and Hamid Bagheri
(Oakland University, USA; City University of New York, USA; University of Nebraska-Lincoln, USA)
Actor concurrency is becoming increasingly important in real-world and mission-critical software. This requires these applications to be free from the actor bugs that occur in the real world and to have tests that are effective in finding these bugs. Mutation testing is a well-established technique that transforms an application to induce its likely bugs and evaluate the effectiveness of its tests in finding these bugs. Mutation testing is available for a broad spectrum of applications and their bugs, ranging from web to mobile to machine learning, and is used at scale in companies like Google and Facebook. However, there still is no mutation testing for actor concurrency that uses real-world actor bugs. In this paper, we propose 𝜇Akka, a framework for mutation testing of Akka actor concurrency using real actor bugs. Akka is a popular industrial-strength implementation of actor concurrency. To design, implement, and evaluate 𝜇Akka, we take the following major steps: (1) manually analyze a recent set of 186 real Akka bugs from Stack Overflow and GitHub to understand their causes; (2) design a set of 32 mutation operators, with 138 source code changes in Akka API, to emulate these causes and induce their bugs; (3) implement these operators in an Eclipse plugin for Java Akka; (4) use the plugin to generate 11.7k mutants of 10 real GitHub applications, with 446.4k lines of code and 7.9k tests; (5) run these tests on these mutants to measure the quality of mutants and effectiveness of tests; (6) use PIT, a popular mutation testing tool with traditional operators, to generate 26.2k mutants and compare 𝜇Akka and PIT mutant quality and test effectiveness; (7) manually analyze the bug coverage and overlap of 𝜇Akka, PIT, and actor operators in a previous work; and (8) discuss a few implications of our findings. Among others, we find that 𝜇Akka mutants are of higher quality and cover more bugs, and that tests are less effective in detecting them.

Publisher's Version

Software Evolution I

EvaCRC: Evaluating Code Review Comments
Lanxin Yang, Jinwei Xu, Yifan Zhang, He Zhang, and Alberto Bacchelli
(Nanjing University, China; University of Zurich, Switzerland)
In code reviews, developers examine code changes authored by peers and provide feedback through comments. Despite the importance of these comments, no accepted approach currently exists for assessing their quality. Therefore, this study has two main objectives: (1) to devise a conceptual model for an explainable evaluation of review comment quality, and (2) to develop models for the automated evaluation of comments according to the conceptual model. To do so, we conduct mixed-method studies and propose a new approach: EvaCRC (Evaluating Code Review Comments). To achieve the first goal, we collect and synthesize quality attributes of review comments, by triangulating data from both authoritative documentation on code review standards and academic literature. We then validate these attributes using real-world instances. Finally, we establish mappings between quality attributes and grades by inquiring domain experts, thus defining our final explainable conceptual model. To achieve the second goal, EvaCRC leverages multi-label learning. To evaluate and refine EvaCRC, we conduct an industrial case study with a global ICT enterprise. The results indicate that EvaCRC can effectively evaluate review comments while offering reasons for the grades. Data and materials: https://doi.org/10.5281/zenodo.8297481

Publisher's Version Info
HyperDiff: Computing Source Code Diffs at Scale
Quentin Le Dilavrec, Djamel Eddine Khelladi, Arnaud Blouin, and Jean-Marc Jézéquel
(University of Rennes, France; CNRS - University of Rennes, France; INSA Rennes, France)
With the advent of fast software evolution and multistage releases, temporal code analysis is becoming useful for various purposes, such as bug cause identification, bug prediction or code evolution analysis. Temporal code analyses can consist of analyzing multiple Abstract Syntax Trees (ASTs) extracted from code evolutions, e.g. one AST for each commit or release. A core feature of temporal analysis is code differencing: the computation of the so-called Diff or edit script between two given versions of the code. However, jointly analyzing and computing the difference on thousands of versions of code faces scalability issues, mainly because of the cost of: 1) parsing the original and evolved code into source and target ASTs; 2) wasting resources by not reusing intermediate computation results that can be shared between versions. This paper details a novel approach based on time-oriented data structures that makes code differencing scale up to large software codebases. In particular, we leverage the HyperAST, a novel representation of code histories, to propose an incremental and memory-efficient approach by lazifying the well-known GumTree diffing algorithm, a mainstream code differencing algorithm and tool. We evaluated our approach on a curated list of 19 large software projects and compared it to GumTree. Our approach outperforms it in scalability both in time and memory. We observed an order-of-magnitude difference: 1) in CPU time from x1.2 to x12.7 for the total time of diff computation and up to x226 in intermediate phases of the diff computation, and 2) in memory footprint of x4.5 per AST node. The approach produced diffs identical to those of GumTree in 99.3% of cases.

Publisher's Version Published Artifact Info Artifacts Available
Understanding Solidity Event Logging Practices in the Wild
Lantian Li, Yejian Liang, Zhihao Liu, and Zhongxing Yu
(Shandong University, China)
Writing logging messages is a well-established conventional programming practice, and it is of vital importance for a wide variety of software development activities. The logging mechanism in Solidity programming is enabled by the high-level event feature, but there has so far been no study of Solidity event logging practices in the wild. To fill this gap, this paper provides the first quantitative characteristic study of current Solidity event logging practices using 2,915 popular Solidity projects hosted on GitHub. The study methodically explores the pervasiveness of event logging, the goodness of current event logging practices, and in particular the reasons for event logging code evolution, and delivers 8 original and important findings. The findings notably include the existence of a large percentage of independent event logging code modifications and the diversity of the underlying reasons for different categories of such modifications (for instance, bug fixing and gas saving). We additionally give the implications of our findings, and these implications can enlighten developers, researchers, tool builders, and language designers to improve the event logging practices. To illustrate the potential benefits of our study, we develop a proof-of-concept checker on top of one of our findings. The checker effectively detects problematic event logging code that consumes extra gas in 35 popular GitHub projects, and 9 project owners have already confirmed the detected issues.

Publisher's Version

Program Analysis I

An Automated Approach to Extracting Local Variables
Xiaye Chi, Hui Liu, Guangjie Li, Weixiao Wang, Yunni Xia, Yanjie Jiang, Yuxia Zhang, and Weixing Ji
(Beijing Institute of Technology, China; National Innovation Institute of Defense Technology, China; Chongqing University, China)
Extract local variable is a well-known and widely used refactoring. It is frequently employed to replace one or more occurrences of a complex expression with simple accesses to a newly added variable. Although most IDEs provide tool support for extract local variables, such tools without deep analysis of the refactorings may result in semantic errors. To this end, in this paper, we propose a novel and more reliable approach, called ValExtractor, to conduct extract variable refactorings automatically. The major challenge of automated extract local variable refactorings is how to efficiently and accurately identify the side effect of the extracted expressions and the potential interaction between the extracted expressions and their contexts without time-consuming dynamic execution of the involved programs. To resolve this challenge, ValExtractor leverages a lightweight static source code analysis to validate the side effect of the selected expression, and to identify which occurrences of the selected expression could be extracted together without changing the semantics of the program or introducing potential new exceptions. Our evaluation results on open-source Java applications suggest that Eclipse and IntelliJ IDEA, the state-of-the-practice refactoring engines, resulted in a large number of faulty extract variable refactorings whereas ValExtractor successfully avoided all such errors. The proposed approach has been merged into (and distributed with) Eclipse to improve the safety of extract local variable refactoring.

Publisher's Version Published Artifact Artifacts Available
Statistical Reachability Analysis
Seongmin Lee and Marcel Böhme
(MPI-SP, Germany)
Given a target program state (or statement) s, what is the probability that an input reaches s? This is the quantitative reachability analysis problem. For instance, quantitative reachability analysis can be used to approximate the reliability of a program (where s is a bad state). Traditionally, quantitative reachability analysis is solved as a model counting problem for a formal constraint that represents the (approximate) reachability of s along paths in the program, i.e., probabilistic reachability analysis. However, in preliminary experiments, we failed to run state-of-the-art probabilistic reachability analysis on reasonably large programs.
In this paper, we explore statistical methods to estimate reachability probability. An advantage of statistical reasoning is that the size and composition of the program are insubstantial as long as the program can be executed. We are particularly interested in the error compared to the state-of-the-art probabilistic reachability analysis. We realize that existing estimators do not exploit the inherent structure of the program and develop structure-aware estimators to further reduce the estimation error given the same number of samples. Our empirical evaluation on previous and new benchmark programs shows that (i) our statistical reachability analysis outperforms state-of-the-art probabilistic reachability analysis tools in terms of accuracy, efficiency, and scalability, and (ii) our structure-aware estimators further outperform (blackbox) estimators that do not exploit the inherent program structure. We also identify multiple program properties that limit the applicability of the existing probabilistic analysis techniques.
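
The basic statistical idea (estimate the probability of reaching s by executing the program on sampled inputs) can be sketched as follows. This is a plain empirical-frequency estimator on a toy program, shown only as an assumption-laden illustration; it is not one of the structure-aware estimators developed in the paper, and the names `program`, `sample_input`, and `estimate_reachability` are hypothetical.

```python
import random


def program(x: int) -> bool:
    """Toy program; returns True when the target statement s is reached."""
    if x % 7 == 0:
        if x > 500:
            return True  # target statement s
    return False


def sample_input(rng: random.Random) -> int:
    """Assumed input distribution: uniform over the integers 0..999."""
    return rng.randrange(1000)


def estimate_reachability(run, sampler, n: int, rng: random.Random) -> float:
    """Fraction of n sampled executions that reach the target."""
    hits = sum(1 for _ in range(n) if run(sampler(rng)))
    return hits / n


rng = random.Random(0)
estimate = estimate_reachability(program, sample_input, 20000, rng)
exact = sum(program(x) for x in range(1000)) / 1000  # 71 of 1000 inputs reach s
print(f"estimate={estimate:.4f} exact={exact:.4f}")
```

The estimator only needs the program to be executable, which is the advantage the abstract points out. Its error shrinks with the number of samples rather than depending on program size.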

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
PPR: Pairwise Program Reduction
Mengxiao Zhang, Zhenyang Xu, Yongqiang Tian, Yu Jiang, and Chengnian Sun
(University of Waterloo, Canada; Tsinghua University, China)
Program reduction is a practical technique widely used for debugging compilers. To report a compiler bug with a bug-triggering program, one needs to minimize the program by removing bug-irrelevant program elements first. Though existing program reduction techniques, such as C-Reduce and Perses, can reduce a bug-triggering program as a whole, they overlook the fact that the degree of relevance of each remaining token to the bug varies. To this end, we propose Pairwise Program Reduction (PPR), a new program reduction technique for minimizing a pair of programs w.r.t. certain properties. Given a seed program 𝑃𝑠, a variant 𝑃𝑣 derived from 𝑃𝑠, and the properties 𝑃𝑠 and 𝑃𝑣 exhibit separately (e.g., 𝑃𝑣 crashes a compiler whereas 𝑃𝑠 does not), PPR not only reduces the sizes of 𝑃𝑠 and 𝑃𝑣, but also minimizes the differences between 𝑃𝑠 and 𝑃𝑣. The final result of PPR is a pair of minimized programs that still preserve the properties, but the minimized differences between the pair highlight the critical program elements that are highly related to the bug. To thoroughly evaluate PPR, we manually constructed the first pairwise benchmark suite from real-world compiler bugs (20 bugs in GCC and LLVM, 9 bugs in Rustc and 9 bugs in JerryScript). The evaluation results show that PPR significantly outperforms the baseline: DD, a variant of Delta Debugging. Specifically, on large and complex programs, PPR’s reduction results are only 0.6% of those by DD w.r.t. program size. The sizes of the minimized variants (i.e., 𝑃𝑣) by PPR are also comparable to those by Perses and C-Reduce; but PPR offers more for debugging by highlighting the critical, bug-inducing changes via the minimized differences. Evaluation on Rust and JavaScript demonstrates PPR’s strong generality to other languages.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
When Function Inlining Meets WebAssembly: Counterintuitive Impacts on Runtime Performance
Alan Romano and Weihang Wang
(University of Southern California, USA)
The WebAssembly standard defines a bytecode format serving as a compilation target for languages such as C, C++, and Rust. WebAssembly compilers are built on top of existing compiler infrastructures such as LLVM and newly developed compiler toolchains such as Binaryen, handling various new features of the WebAssembly language. However, we observe that both these new and existing infrastructures implicitly assume that the execution environments of native and WebAssembly applications are the same, ignoring the presence of browser compilers in the WebAssembly pipeline. This incorrect assumption often misguides function inlining optimizations, resulting in a slower WebAssembly module when function inlining is applied. This paper is the first to investigate the counterintuitive impacts of function inlining on WebAssembly runtime performance. We inspect the inlining optimization passes of the LLVM and Binaryen infrastructures used in the Emscripten C/C++-to-WebAssembly compiler. Our investigation on 127 C/C++ samples from the LLVM test suite shows that 66 samples exhibit counterintuitive behavior due to function inlining, particularly from inlining hot functions into long-running functions. We hope our findings motivate further work on revising existing optimizations with the unique characteristics of WebAssembly environments in mind.

Publisher's Version Published Artifact Artifacts Available

Code Search and Text to Code

Self-Supervised Query Reformulation for Code Search
Yuetian Mao, Chengcheng Wan, Yuze Jiang, and Xiaodong Gu
(Shanghai Jiao Tong University, China; East China Normal University, China)
Automatic query reformulation is a widely utilized technology for enriching user requirements and enhancing the outcomes of code search. It can be conceptualized as a machine translation task, wherein the objective is to rephrase a given query into a more comprehensive alternative. While showing promising results, training such a model typically requires a large parallel corpus of query pairs (i.e., the original query and a reformulated query) that are confidential and unpublished by online code search engines. This restricts its practicality in software development processes. In this paper, we propose SSQR, a self-supervised query reformulation method that does not rely on any parallel query corpus. Inspired by pre-trained models, SSQR treats query reformulation as a masked language modeling task conducted on an extensive unannotated corpus of queries. SSQR extends T5 (a sequence-to-sequence model based on Transformer) with a new pre-training objective named corrupted query completion (CQC), which randomly masks words within a complete query and trains T5 to predict the masked content. Subsequently, for a given query to be reformulated, SSQR identifies potential locations for expansion and leverages the pre-trained T5 model to generate appropriate content to fill these gaps. The selection of expansions is then based on the information gain associated with each candidate. Evaluation results demonstrate that SSQR outperforms unsupervised baselines significantly and achieves competitive performance compared to supervised methods.

Publisher's Version
Natural Language to Code: How Far Are We?
Shangwen Wang, Mingyang Geng, Bo Lin, Zhensu Sun, Ming Wen, Yepang Liu, Li Li, Tegawendé F. Bissyandé, and Xiaoguang Mao
(National University of Defense Technology, China; Singapore Management University, Singapore; Huazhong University of Science and Technology, China; Southern University of Science and Technology, China; Beihang University, China; University of Luxembourg, Luxembourg)
A longstanding dream in software engineering research is to devise effective approaches for automating development tasks based on developers' informally-specified intentions. Such intentions are generally in the form of natural language descriptions. In recent literature, a number of approaches have been proposed to automate tasks such as code search and even code generation based on natural language inputs. While these approaches vary in terms of technical designs, their objective is the same: transforming a developer's intention into source code. The literature, however, lacks a comprehensive understanding towards the effectiveness of existing techniques as well as their complementarity to each other. We propose to fill this gap through a large-scale empirical study where we systematically evaluate natural language to code techniques. Specifically, we consider six state-of-the-art techniques targeting code search, and four targeting code generation. Through extensive evaluations on a dataset of 22K+ natural language queries, our study reveals the following major findings: (1) code search techniques based on model pre-training are so far the most effective while code generation techniques can also provide promising results; (2) complementarity widely exists among the existing techniques; and (3) combining the ten techniques together can enhance the performance for 35% compared with the most effective standalone technique. Finally, we propose a post-processing strategy to automatically integrate different techniques based on their generated code. Experimental results show that our devised strategy is both effective and extensible.

Publisher's Version Published Artifact Artifacts Available
Efficient Text-to-Code Retrieval with Cascaded Fast and Slow Transformer Models
Akhilesh Deepak Gotmare, Junnan Li, Shafiq Joty, and Steven C.H. Hoi
(Salesforce AI Research, Singapore; Salesforce AI Research, USA)
The goal of semantic code search or text-to-code search is to retrieve a semantically relevant code snippet from an existing code database using a natural language query. When constructing a practical semantic code search system, existing approaches fail to provide an optimal balance between retrieval speed and the relevance of the retrieved results. We propose an efficient and effective text-to-code search framework with cascaded fast and slow models, in which a fast transformer encoder model is learned to optimize a scalable index for fast retrieval followed by learning a slow classification-based re-ranking model to improve the accuracy of the top K results from the fast retrieval. To further reduce the high memory cost of deploying two separate models in practice, we propose to jointly train the fast and slow model based on a single transformer encoder with shared parameters. Empirically our cascaded method is not only efficient and scalable, but also achieves state-of-the-art results with an average mean reciprocal ranking (MRR) score of 0.7795 (across 6 programming languages) on the CodeSearchNet benchmark as opposed to the prior state-of-the-art result of 0.744 MRR. Our codebase can be found at this link.
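
The two ideas in this abstract, a fast shortlist followed by a slow re-ranking of the top K results, and evaluation by mean reciprocal rank (MRR), can be sketched generically. This is a minimal illustration with stand-in scoring functions, not the authors' transformer models; `cascade_search`, `mean_reciprocal_rank`, and the toy score tables are hypothetical.

```python
from typing import Callable, List, Sequence


def cascade_search(query: str,
                   corpus: Sequence[str],
                   fast_score: Callable[[str, str], float],
                   slow_score: Callable[[str, str], float],
                   k: int) -> List[str]:
    """Shortlist the top-k items with the cheap scorer, then re-rank them."""
    shortlist = sorted(corpus, key=lambda doc: fast_score(query, doc),
                       reverse=True)[:k]
    return sorted(shortlist, key=lambda doc: slow_score(query, doc),
                  reverse=True)


def mean_reciprocal_rank(ranked_results: Sequence[Sequence[str]],
                         relevant: Sequence[str]) -> float:
    """Average of 1/rank of the relevant item per query (0 if it is absent)."""
    total = 0.0
    for results, target in zip(ranked_results, relevant):
        for rank, item in enumerate(results, start=1):
            if item == target:
                total += 1.0 / rank
                break
    return total / len(ranked_results)


# Toy score tables standing in for the fast encoder and the slow re-ranker.
FAST = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7}
SLOW = {"a": 0.2, "b": 0.9, "c": 1.0, "d": 0.5}

ranking = cascade_search("q", ["a", "b", "c", "d"],
                         lambda q, d: FAST[d], lambda q, d: SLOW[d], k=3)
print(ranking)  # prints ['b', 'd', 'a']

# Relevant items found at rank 1, rank 2, and not at all: (1 + 0.5 + 0) / 3.
print(mean_reciprocal_rank([["x", "y"], ["y", "x"], ["y", "z"]],
                           ["x", "x", "x"]))  # prints 0.5
```

Item `c` has the highest slow score but never reaches the re-ranker because the fast scorer excludes it from the shortlist, which shows the speed versus relevance trade-off that the choice of K controls.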

Publisher's Version
PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model
Xiangzhe Xu, Zhou Xuan, Shiwei Feng, Siyuan Cheng, Yapeng Ye, Qingkai Shi, Guanhong Tao, Le Yu, Zhuo Zhang, and Xiangyu Zhang
(Purdue University, USA)
Binary similarity analysis determines if two binary executables are from the same source program. Existing techniques leverage static and dynamic program features and may utilize advanced Deep Learning techniques. Although they have demonstrated great potential, the community believes that a more effective representation of program semantics can further improve similarity analysis. In this paper, we propose a new method to represent binary program semantics. It is based on a novel probabilistic execution engine that can effectively sample the input space and the program path space of subject binaries. More importantly, it ensures that the collected samples are comparable across binaries, addressing the substantial variations of input specifications. Our evaluation on 9 real-world projects with 35k functions, and comparison with 6 state-of-the-art techniques show that PEM can achieve a precision of 96% with common settings, outperforming the baselines by 10-20%.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable

Log Analysis and Debugging

Hue: A User-Adaptive Parser for Hybrid Logs
Junjielong Xu, Qiuai Fu, Zhouruixing Zhu, Yutong Cheng, Zhijing Li, Yuchi Ma, and Pinjia He
(Chinese University of Hong Kong, Shenzhen, China; Huawei Cloud Computing Technologies, China)
Log parsing, which extracts log templates from semi-structured logs and produces structured logs, is the first and the most critical step in automated log analysis. While existing log parsers have achieved decent results, they suffer from two major limitations by design. First, they do not natively support hybrid logs that consist of both single-line logs and multi-line logs (Java Exception and Hadoop Counters). Second, they fall short in integrating domain knowledge in parsing, making it hard to identify ambiguous tokens in logs. This paper defines a new research problem, hybrid log parsing, as a superset of traditional log parsing tasks, and proposes Hue, the first attempt at hybrid log parsing in a user-adaptive manner. Specifically, Hue converts each log message to a sequence of special wildcards using a key casting table and determines the log types via line aggregating and pattern extracting. In addition, Hue can effectively utilize user feedback via a novel merge-reject strategy, making it possible to quickly adapt to complex and changing log templates. We evaluated Hue on three hybrid log datasets and sixteen widely-used single-line log datasets (Loghub). The results show that Hue achieves an average grouping accuracy of 0.845 on hybrid logs, which largely outperforms the best results (0.563 on average) obtained by existing parsers. Hue also exhibits SOTA performance on single-line log datasets.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Log Parsing with Generalization Ability under New Log Types
Siyu Yu, Yifan Wu, Zhijing Li, Pinjia He, Ningjiang Chen, and Changjian Liu
(Guangxi University, China; Peking University, China; Chinese University of Hong Kong, China)
Log parsing, which converts semi-structured logs into structured logs, is the first step for automated log analysis. Existing parsers are still unsatisfactory in real-world systems due to new log types in newly arriving logs. In practice, available logs collected during system runtime often do not contain all the possible log types of a system because log types related to infrequently activated system states are unlikely to be recorded and new log types are frequently introduced with system updates. Meanwhile, most existing parsers require preprocessing to extract variables in advance, but preprocessing is based on the operator’s prior knowledge of available logs and therefore may not work well on new log types. In addition, parser parameters set based on available logs are difficult to generalize to new log types. To support new log types, we propose a variable generation imitation strategy to craft a novel log parsing approach with generalization ability, called Log3T. Log3T employs a pre-trained transformer encoder-based model to extract log templates and can update parameters at parsing time to adapt to new log types by a modified test-time training. Experimental results on 16 benchmark datasets show that Log3T outperforms the state-of-the-art parsers in terms of parsing accuracy. In addition, Log3T can automatically adapt to new log types in newly arriving logs.

Publisher's Version
Semantic Debugging
Martin Eberlein, Marius Smytzek, Dominic Steinhöfel, Lars Grunske, and Andreas Zeller
(Humboldt University of Berlin, Germany; CISPA Helmholtz Center for Information Security, Germany)
Why does my program fail? We present a novel and general technique to automatically determine failure causes and conditions, using logical properties over input elements: “The program fails if and only if int(<length>) > len(<payload>) holds—that is, the given <length> is larger than the <payload> length.” Our AVICENNA prototype uses modern techniques for inferring properties of passing and failing inputs and validating and refining hypotheses by having a constraint solver generate supporting test cases to obtain such diagnoses. As a result, AVICENNA produces crisp and expressive diagnoses even for complex failure conditions, considerably improving over the state of the art with diagnoses close to those of human experts.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable
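
The diagnosis quoted in the Semantic Debugging abstract, that the program fails if and only if int(<length>) > len(<payload>), can be made concrete with a small sketch. It is not the AVICENNA prototype: the toy program `echo`, the `Request` type, the sample inputs, and the helper `holds_iff` are assumptions, and the sample inputs are hand-written here rather than generated by a constraint solver.

```python
from typing import Callable, List, NamedTuple


class Request(NamedTuple):
    """Hypothetical parsed input holding the two elements named in the abstract."""
    length: str   # the <length> element, kept as text
    payload: str  # the <payload> element


def echo(req: Request) -> str:
    """Toy program under test: returns the first int(<length>) characters of the payload."""
    n = int(req.length)
    # Indexing past the end of the payload raises IndexError (the failure).
    return "".join(req.payload[i] for i in range(n))


def fails(req: Request) -> bool:
    """Failure oracle: run the program and report whether it crashed."""
    try:
        echo(req)
    except IndexError:
        return True
    return False


def holds_iff(predicate: Callable[[Request], bool], inputs: List[Request]) -> bool:
    """Check 'fails if and only if predicate holds' on every sample input."""
    return all(predicate(req) == fails(req) for req in inputs)


INPUTS = [
    Request("3", "abc"), Request("5", "abc"), Request("0", ""),
    Request("1", ""), Request("2", "hello"), Request("6", "hello"),
]


def diagnosis(req: Request) -> bool:
    """Candidate diagnosis taken from the abstract."""
    return int(req.length) > len(req.payload)


def weak_hypothesis(req: Request) -> bool:
    """A competing, weaker hypothesis, invented here to show refutation."""
    return int(req.length) > 4
```

On these inputs `holds_iff(diagnosis, INPUTS)` is true, while `holds_iff(weak_hypothesis, INPUTS)` is false because `Request("1", "")` fails without satisfying the weaker condition. Inputs of that kind are what lets a hypothesis be refined or rejected.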
Demystifying Dependency Bugs in Deep Learning Stack
Kaifeng Huang, Bihuan Chen, Susheng Wu, Junming Cao, Lei Ma, and Xin Peng
(Fudan University, China; University of Tokyo, Japan; University of Alberta, Canada)
Deep learning (DL) applications, built upon a heterogeneous and complex DL stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow), are subject to software and hardware dependencies across the DL stack. One challenge in dependency management across the entire engineering lifecycle is posed by the asynchronous and radical evolution and the complex version constraints among dependencies. Developers may introduce dependency bugs (DBs) in selecting, using and maintaining dependencies. However, the characteristics of DBs in the DL stack are still under-investigated, hindering practical solutions to dependency management in the DL stack. To bridge this gap, this paper presents the first comprehensive study to characterize symptoms, root causes and fix patterns of DBs across the whole DL stack with 446 DBs collected from StackOverflow posts and GitHub issues. For each DB, we first investigate the symptom as well as the lifecycle stage and dependency where the symptom is exposed. Then, we analyze the root cause as well as the lifecycle stage and dependency where the root cause is introduced. Finally, we explore the fix pattern and the knowledge sources that are used to fix it. Our findings from this study shed light on practical implications for dependency management.

Publisher's Version

Machine Learning II

Can Machine Learning Pipelines Be Better Configured?
Yibo Wang, Ying Wang, Tingwei Zhang, Yue Yu, Shing-Chi Cheung, Hai Yu, and Zhiliang Zhu
(Northeastern University, China; Hong Kong University of Science and Technology, China; National University of Defense Technology, China)
A Machine Learning (ML) pipeline configures the workflow of a learning task using the APIs provided by ML libraries. However, a pipeline’s performance can vary significantly across different configurations of ML library versions. Misconfigured pipelines can result in inferior performance, such as poor execution time and memory usage, numeric errors and even crashes. A pipeline is subject to misconfiguration if it exhibits significantly inconsistent performance upon changes in the versions of its configured libraries or the combination of these libraries. We refer to such performance inconsistency as a pipeline configuration (PLC) issue.
There is no prior systematic study on the pervasiveness, impact and root causes of PLC issues. A systematic understanding of these issues helps configure effective ML pipelines and identify misconfigured ones. In this paper, we conduct the first empirical study of PLC issues. To better dig into the problem, we propose Piecer, an infrastructure that automatically generates a set of pipeline variants by varying different version combinations of ML libraries and compares their performance inconsistencies. We apply Piecer to the 3,380 pipelines that can be deployed out of the 11,363 ML pipelines collected from multiple ML competitions on the Kaggle platform. The empirical study results show that 1,092 (32.3

Publisher's Version Info
Compatibility Issues in Deep Learning Systems: Problems and Opportunities
Jun Wang, Guanping Xiao, Shuai Zhang, Huashan Lei, Yepang Liu, and Yulei Sui
(Nanjing University of Aeronautics and Astronautics, China; Southern University of Science and Technology, China; UNSW, Australia)
Deep learning (DL) systems are complex component-based systems, which consist of core program (code implementation and data), Python (language and interpreter), third-party libraries, low-level libraries, development tools, OS, and hardware environments. Incompatible interaction between components would cause serious compatibility issues, substantially affecting the development and deployment processes. What types of compatibility issues are frequently exposed in DL systems? What are the root causes of such issues and how do developers fix them? How far are we from automatically detecting and fixing DL compatibility issues? Although there are many existing studies on DL bugs, the characteristics of DL compatibility issues have rarely been systematically studied and the above questions remain largely unexplored. To fill this gap, we conduct the first comprehensive empirical study to characterize compatibility issues in DL systems. Through analyzing 352 DL compatibility issues classified from 3,072 posts in Stack Overflow, we present their types, manifestation stages, and symptoms. We further summarize the root causes and common fixing strategies, and conduct a tool survey on the current research status of automated detection and repair of DL compatibility issues. Our study allows researchers and practitioners to gain a better understanding of DL compatibility issues and can facilitate future tool development.

Publisher's Version Published Artifact Artifacts Available
An Extensive Study on Adversarial Attack against Pre-trained Models of Code
Xiaohu Du, Ming Wen, Zichao Wei, Shangwen Wang, and Hai Jin
(Huazhong University of Science and Technology, China; National University of Defense Technology, China)
Transformer-based pre-trained models of code (PTMC) have been widely utilized and have achieved state-of-the-art performance in many mission-critical applications. However, they can be vulnerable to adversarial attacks through identifier substitution or coding style transformation, which can significantly degrade accuracy and may further incur security concerns. Although several approaches have been proposed to generate adversarial examples for PTMC, the effectiveness and efficiency of such approaches, especially on different code intelligence tasks, have not been well understood. To bridge this gap, this study systematically analyzes five state-of-the-art adversarial attack approaches from three perspectives: effectiveness, efficiency, and the quality of generated examples. The results show that none of the five approaches balances all these perspectives. Particularly, approaches with a high attack success rate tend to be time-consuming; the adversarial code they generate often lacks naturalness, and vice versa. To address this limitation, we explore the impact of perturbing identifiers under different contexts and find that identifier substitution within for and if statements is the most effective. Based on these findings, we propose a new approach that prioritizes different types of statements for various tasks and further utilizes beam search to generate adversarial examples. Evaluation results show that it outperforms the state-of-the-art ALERT in terms of both effectiveness and efficiency while preserving the naturalness of the generated adversarial examples.

Publisher's Version Published Artifact Artifacts Available
Fix Fairness, Don’t Ruin Accuracy: Performance Aware Fairness Repair using AutoML
Giang Nguyen, Sumon Biswas, and Hridesh Rajan
(Iowa State University, USA; Carnegie Mellon University, USA)
Machine learning (ML) is increasingly being used in critical decision-making software, but incidents have raised questions about the fairness of ML predictions. To address this issue, new tools and methods are needed to mitigate bias in ML-based software. Previous studies have proposed bias mitigation algorithms that only work in specific situations and often result in a loss of accuracy. Our proposed solution is a novel approach that utilizes automated machine learning (AutoML) techniques to mitigate bias. Our approach includes two key innovations: a novel optimization function and a fairness-aware search space. By improving the default optimization function of AutoML and incorporating fairness objectives, we are able to mitigate bias with little to no loss of accuracy. Additionally, we propose a fairness-aware search space pruning method for AutoML to reduce computational cost and repair time. Our approach, built on the state-of-the-art Auto-Sklearn tool, is designed to reduce bias in real-world scenarios. In order to demonstrate the effectiveness of our approach, we evaluated our approach on four fairness problems and 16 different ML models, and our results show a significant improvement over the baseline and existing bias mitigation techniques. Our approach, Fair-AutoML, successfully repaired 60 out of 64 buggy cases, while existing bias mitigation techniques only repaired up to 44 out of 64 cases.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
BiasAsker: Measuring the Bias in Conversational AI System
Yuxuan Wan, Wenxuan Wang, Pinjia He, Jiazhen Gu, Haonan Bai, and Michael R. Lyu
(Chinese University of Hong Kong, China)
Powered by advanced Artificial Intelligence (AI) techniques, conversational AI systems, such as ChatGPT, and digital assistants like Siri, have been widely deployed in daily life. However, such systems may still produce content containing biases and stereotypes, causing potential social problems. Due to modern AI techniques’ data-driven, black-box nature, comprehensively identifying and measuring biases in conversational systems remains challenging. Particularly, it is hard to generate inputs that can comprehensively trigger potential bias due to the lack of data containing both social groups and biased properties. In addition, modern conversational systems can produce diverse responses (e.g., chatting and explanation), which makes existing bias detection methods based solely on sentiment and toxicity hard to adopt. In this paper, we propose BiasAsker, an automated framework to identify and measure social bias in conversational AI systems. To obtain social groups and biased properties, we construct a comprehensive social bias dataset containing a total of 841 groups and 5,021 biased properties. Given the dataset, BiasAsker automatically generates questions and adopts a novel method based on existence measurement to identify two types of biases (i.e., absolute bias and related bias) in conversational systems. Extensive experiments on eight commercial systems and two famous research models, such as ChatGPT and GPT-3, show that 32.83% of the questions generated by BiasAsker can trigger biased behaviors in these widely deployed conversational systems. All the code, data, and experimental results have been released to facilitate future research.

Publisher's Version
Pitfalls in Experiments with DNN4SE: An Analysis of the State of the Practice
Sira Vegas and Sebastian Elbaum
(Universidad Politécnica de Madrid, Spain; University of Virginia, USA)
Software engineering (SE) techniques are increasingly relying on deep learning approaches to support many SE tasks, from bug triaging to code generation. To assess the efficacy of such techniques researchers typically perform controlled experiments. Conducting these experiments, however, is particularly challenging given the complexity of the space of variables involved, from specialized and intricate architectures and algorithms to a large number of training hyper-parameters and choices of evolving datasets, all compounded by how rapidly the machine learning technology is advancing, and the inherent sources of randomness in the training process. In this work we conduct a mapping study, examining 194 experiments with techniques that rely on deep neural networks (DNNs) appearing in 55 papers published in premier SE venues to provide a characterization of the state of the practice, pinpointing experiments’ common trends and pitfalls. Our study reveals that most of the experiments, including those that have received ACM artifact badges, have fundamental limitations that raise doubts about the reliability of their findings. More specifically, we find: 1) weak analyses to determine that there is a true relationship between independent and dependent variables (87% of the experiments), 2) limited control over the space of DNN relevant variables, which can render a relationship between dependent variables and treatments that may not be causal but rather correlational (100% of the experiments), and 3) lack of specificity in terms of what are the DNN variables and their values utilized in the experiments (86% of the experiments) to define the treatments being applied, which makes it unclear whether the techniques designed are the ones being assessed, or how the sources of extraneous variation are controlled. We provide some practical recommendations to address these limitations.

Publisher's Version Published Artifact Info Artifacts Available
DecompoVision: Reliability Analysis of Machine Vision Components through Decomposition and Reuse
Boyue Caroline Hu, Lina Marsso, Nikita Dvornik, Huakun Shen, and Marsha Chechik
(University of Toronto, Canada; Waabi, Canada)
Analyzing reliability of Machine Vision Components (MVC) against scene changes (such as rain or fog) in their operational environment is crucial for safety-critical applications. Safety analysis relies on the availability of precisely specified and, ideally, machine-verifiable requirements. The state-of-the-art reliability framework ICRAF developed machine-verifiable requirements obtained using human performance data. However, ICRAF is limited to analyzing reliability of MVCs solving simple vision tasks, such as image classification. Yet, many real-world safety-critical systems require solving more complex vision tasks, such as object detection and instance segmentation. Fortunately, many complex vision tasks (which we call “c-tasks”) can be represented as a sequence of simple vision subtasks. For instance, object detection can be decomposed as object localization followed by classification. Based on this fact, in this paper, we show that the analysis of c-tasks can also be decomposed as a sequential analysis of their simple subtasks, which allows us to apply existing techniques for analyzing simple vision tasks. Specifically, we propose a modular reliability framework, DecompoVision, that decomposes: (1) the problem of solving a c-task, (2) the reliability requirements, and (3) the reliability analysis, and, as a result, provides deeper insights into MVC reliability. DecompoVision extends ICRAF to handle complex vision tasks and enables reuse of existing artifacts across different c-tasks. We capture new reliability gaps by checking our requirements on 13 widely used object detection MVCs, and, for the first time, benchmark segmentation MVCs.

Publisher's Version

Fault Diagnosis and Root Cause Analysis I

Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data
Guangba Yu, Pengfei Chen, Yufeng Li, Hongyang Chen, Xiaoyun Li, and Zibin Zheng
(Sun Yat-sen University, China)
Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging task. To understand and localize root causes of unexpected faults, modern observability tools collect and preserve multi-modal observability data, including metrics, traces, and logs. Since system faults may manifest as anomalies in different data sources, existing RCA approaches that rely on single-modal data are constrained in the granularity and interpretability of root causes. In this study, we present Nezha, an interpretable and fine-grained RCA approach that pinpoints root causes at the code region and resource type level by incorporative analysis of multi-modal data. Nezha transforms heterogeneous multi-modal data into a homogeneous event representation and extracts event patterns by constructing and mining event graphs. The core idea of Nezha is to compare event patterns in the fault-free phase with those in the fault-suffering phase to localize root causes in an interpretable way. Practical implementation and experimental evaluations on two microservice applications show that Nezha achieves a high top-1 accuracy (89.77%) on average at the code region and resource type level and outperforms state-of-the-art approaches by a large margin. Two ablation studies further confirm the contributions of incorporating multi-modal data.

Publisher's Version Published Artifact Artifacts Available
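
The core idea stated in the Nezha abstract, comparing event patterns from the fault-free phase with those from the fault-suffering phase, can be sketched as follows. This is an illustrative simplification and not Nezha's algorithm: here a "pattern" is assumed to be a pair of adjacent events in a trace, the score is an assumed difference in support between the two phases, and the event names and both helper functions are invented.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pattern = Tuple[str, str]


def pattern_support(traces: List[List[str]]) -> Dict[Pattern, float]:
    """Fraction of traces that contain each adjacent event pair at least once."""
    counts: Counter = Counter()
    for trace in traces:
        counts.update(set(zip(trace, trace[1:])))
    return {pattern: count / len(traces) for pattern, count in counts.items()}


def rank_suspicious(normal: List[List[str]], faulty: List[List[str]]) -> List[Pattern]:
    """Rank patterns by how much their support differs between the two phases."""
    normal_support = pattern_support(normal)
    faulty_support = pattern_support(faulty)
    scored = []
    for pattern in set(normal_support) | set(faulty_support):
        score = abs(faulty_support.get(pattern, 0.0) - normal_support.get(pattern, 0.0))
        if score > 0:
            scored.append((score, pattern))
    # Highest score first; ties broken by pattern name for a stable order.
    return [pattern for score, pattern in sorted(scored, key=lambda item: (-item[0], item[1]))]


# Hypothetical unified events: trace spans, a metric alert, and a log line.
NORMAL = [["recv_request", "query_db", "send_response"]] * 2
FAULTY = [
    ["recv_request", "query_db", "cpu_spike", "log:timeout"],
    ["recv_request", "query_db", "send_response"],
]
RANKED = rank_suspicious(NORMAL, FAULTY)
```

Patterns that are equally common in both phases, such as `("recv_request", "query_db")`, drop out of the ranking, while those that appear or disappear under the fault remain as candidates that a human can read directly.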
DiagConfig: Configuration Diagnosis of Performance Violations in Configurable Software Systems
Zhiming Chen, Pengfei Chen, Peipei Wang, Guangba Yu, Zilong He, and Genting Mai
(Sun Yat-sen University, China; ByteDance Infrastructure System Lab, USA)
Performance degradation due to misconfiguration in software systems that violates SLOs (service-level objectives) is commonplace. Diagnosing and explaining the root causes of such performance violations in configurable software systems is often challenging due to their increasing complexity. Although there are many tools and techniques for diagnosing performance violations, they provide limited evidence to attribute causes of observed performance violations to specific configurations. This is because the configuration is not originally considered in those tools. This paper proposes DiagConfig, specifically designed to conduct configuration diagnosis of performance violations. It leverages static code analysis to track configuration option propagation, identifies performance-sensitive options, detects performance violations, and constructs cause-effect chains that help stakeholders better understand the relationship between configuration and performance violations. Experimental evaluations with eight real-world software systems demonstrate that DiagConfig produces fewer false positives than a state-of-the-art documentation analysis-based tool (i.e., 5 vs 41) in the identification of performance-sensitive options, and outperforms a statistics-based debugging tool in the diagnosis of performance violations caused by configuration changes, offering more comprehensive results (recall: 0.892 vs 0.289). Moreover, we also show that DiagConfig can accelerate auto-tuning by compressing configuration space.

Publisher's Version Published Artifact Artifacts Available
Pre-training Code Representation with Semantic Flow Graph for Effective Bug Localization
Yali Du and Zhongxing Yu
(Shandong University, China)
Enlightened by the big success of pre-training in natural language processing, pre-trained models for programming languages have been widely used to promote code intelligence in recent years. In particular, BERT has been used for bug localization tasks and impressive results have been obtained. However, these BERT-based bug localization techniques suffer from two issues. First, the pre-trained BERT model on source code does not adequately capture the deep semantics of program code. Second, the overall bug localization models neglect the necessity of large-scale negative samples in contrastive learning for representations of changesets and ignore the lexical similarity between bug reports and changesets during similarity estimation. We address these two issues by 1) proposing a novel directed, multiple-label code graph representation named Semantic Flow Graph (SFG), which compactly and adequately captures code semantics, 2) designing and training SemanticCodeBERT based on SFG, and 3) designing a novel Hierarchical Momentum Contrastive Bug Localization technique (HMCBL). Evaluation results show that our method achieves state-of-the-art performance in bug localization.

Publisher's Version Published Artifact Artifacts Available
Automata-Based Trace Analysis for Aiding Diagnosing GUI Testing Tools for Android
Enze Ma, Shan Huang, Weigang He, Ting Su, Jue Wang, Huiyu Liu, Geguang Pu, and Zhendong Su
(East China Normal University, China; Nanjing University, China; ETH Zurich, Switzerland)
Benchmarking software testing tools against known bugs is a classic approach to evaluating the tools’ bug finding abilities. However, this approach offers few clues about the bugs a tool misses, which makes it hard to diagnose the testing tools. As a result, heavy and ad hoc manual analysis is needed. In this work, in the setting of GUI testing for Android apps, we introduce an automata-based trace analysis approach to tackling the key challenge of manual analysis, i.e., how to analyze the lengthy event traces generated by a testing tool against a missed bug to find the clues. Our key idea is to model a bug as a finite automaton that captures its bug-triggering traces, and to match the event traces generated by the testing tool (which misses this bug) against this automaton to obtain the clues. Specifically, the clues are presented in the form of three designated automata-based coverage values. We apply our approach to enhance Themis, a representative benchmark suite for Android, to aid diagnosing GUI testing tools. Our extensive evaluation on nine state-of-the-art GUI testing tools and the involvement of several tool developers show that our approach is feasible and useful. Our approach enables Themis+ (the enhanced benchmark suite) to provide the clues on the tool-missed bugs, and all of Themis+’s clues are identical or useful, compared to the manual analysis results of tool developers. Moreover, the clues have helped find several tool weaknesses, which were unknown or unclear before. Based on the clues, two actively developed industrial testing tools in our study have quickly made several optimizations and demonstrated their improved bug finding abilities. All the tool developers give positive feedback on the usefulness and usability of Themis+’s clues. Themis+ is available at https://github.com/DDroid-Android/home.

Publisher's Version
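
The automata-based trace analysis described in the abstract above (model a bug as a finite automaton, then match a tool's event trace against it) can be illustrated with a toy example. This is a sketch under assumptions, not the Themis+ implementation: the automaton, the event names, the rule that unmatched events are skipped, and the single "state coverage" value are all hypothetical, since the abstract does not define its three coverage values.

```python
from typing import Dict, List, Set

# Hypothetical bug automaton: state -> {event -> next state}.
BUG_AUTOMATON: Dict[str, Dict[str, str]] = {
    "start": {"open_settings": "settings"},
    "settings": {"toggle_dark_mode": "dark"},
    "dark": {"rotate_screen": "crash"},
    "crash": {},
}
INITIAL = "start"
ACCEPTING = "crash"  # reaching this state means the bug was triggered


def replay(trace: List[str]) -> Set[str]:
    """Run a tool's event trace through the automaton and return the states visited.

    Assumption: events with no outgoing transition from the current state are skipped.
    """
    state = INITIAL
    visited = {state}
    for event in trace:
        next_state = BUG_AUTOMATON[state].get(event)
        if next_state is not None:
            state = next_state
            visited.add(state)
    return visited


def state_coverage(trace: List[str]) -> float:
    """Share of automaton states the trace reached (one possible coverage value)."""
    return len(replay(trace)) / len(BUG_AUTOMATON)


MISSED = ["open_settings", "scroll", "toggle_dark_mode", "back"]
TRIGGERED = ["open_settings", "toggle_dark_mode", "rotate_screen"]
```

Here `state_coverage(MISSED)` is 0.75: the tool got as far as the `dark` state but never issued `rotate_screen`, which is the kind of clue that tells a tool developer where exploration stopped short.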
A Practical Human Labeling Method for Online Just-in-Time Software Defect Prediction
Liyan Song, Leandro Lei Minku, Cong Teng, and Xin Yao
(Southern University of Science and Technology, China; University of Birmingham, UK)
Just-in-Time Software Defect Prediction (JIT-SDP) can be seen as an online learning problem where additional software changes produced over time may be labeled and used to create training examples. These training examples form a data stream that can be used to update JIT-SDP models in an attempt to avoid models becoming obsolete and poorly performing. However, labeling procedures adopted in existing online JIT-SDP studies implicitly assume that practitioners would not inspect software changes upon a defect-inducing prediction, delaying the production of training examples. This is inconsistent with a real-world scenario where practitioners would adopt JIT-SDP models and inspect certain software changes predicted as defect-inducing to check whether they really induce defects. Such inspection means that some software changes would be labeled much earlier than assumed in existing work, potentially leading to different JIT-SDP models and performance results. This paper aims at formulating a more practical human labeling procedure that takes into account the adoption of JIT-SDP models during the software development process. It then analyses whether and to what extent it would impact the predictive performance of JIT-SDP models. We also propose a new method to target the labeling of software changes with the aim of saving human inspection effort. Experiments based on 14 GitHub projects revealed that adopting a more realistic labeling procedure led to significantly higher predictive performance than when delaying the labeling process, meaning that existing work may have been underestimating the performance of JIT-SDP. In addition, our proposed method to target the labeling process was able to reduce human effort while maintaining predictive performance by recommending practitioners to inspect software changes that are more likely to induce defects. 
We encourage the adoption of more realistic human labeling methods in research studies to obtain an evaluation of JIT-SDP predictive performance that is closer to reality.

Publisher's Version Published Artifact Info Artifacts Available

Human Aspects II

Flow Experience in Software Engineering
Saima Ritonummi, Valtteri Siitonen, Markus Salo, Henri Pirkkalainen, and Anu Sivunen
(University of Jyväskylä, Finland; Tampere University, Finland)
Software engineering (SE) requires high analytical skills and creativity, which makes it an excellent context for experiencing flow. Although previous work in the SE context has identified how positive affect and development tools can support the flow experience, there is still much to uncover about the characteristics of software developers’ flow experiences. To address this gap in knowledge, we conducted a qualitative critical incident technique (CIT) questionnaire (n = 401) on the flow-facilitating factors and characteristics of flow in the SE context. The most important flow-facilitating factors in developers’ work included optimal challenge, high motivation, positive developer experience (DX), and no distractions or interruptions. The flow experiences were characterized by absorption, effortless control, intrinsic reward, and high performance. Our study identifies the features of flow commonly addressed in flow research; however, it also highlights how IT use, especially development tools that provide positive DX, as well as being able to work without excessive distractions and interruptions are important facilitators of developers’ flow.

Publisher's Version
Building and Sustaining Ethnically, Racially, and Gender Diverse Software Engineering Teams: A Study at Google
Ella Dagan, Anita Sarma, Alison Chang, Sarah D’Angelo, Jill Dicker, and Emerson Murphy-Hill
(Google, USA; Oregon State University, USA)
Teams that build software are largely demographically homogeneous. Without diversity, homogeneous perspectives dominate how, why, and for whom software is designed. To understand how teams can successfully build and sustain diversity, we interviewed 11 engineers and 9 managers from some of the most gender and racially diverse teams at Google, a large software company. Qualitatively analyzing the interviews, we found shared approaches to recruiting, hiring, and promoting an inclusive environment, all of which create a positive feedback loop. Our findings produce actionable practices that every member of the team can take to increase diversity by fostering a more inclusive software engineering environment.

Publisher's Version
Towards Automated Detection of Unethical Behavior in Open-Source Software Projects
Hsu Myat Win, Haibo Wang, and Shin Hwei Tan
(Southern University of Science and Technology, China; Concordia University, Canada)
Given the rapid growth of Open-Source Software (OSS) projects, ethical considerations are becoming more important. Past studies focused on specific ethical issues (e.g., gender bias and fairness in OSS). There has been little to no study of the different types of unethical behavior in OSS projects. We present the first study of unethical behavior in OSS projects from the stakeholders’ perspective. Our study of 316 GitHub issues provides a taxonomy of 15 types of unethical behavior guided by six ethical principles (e.g., autonomy). Examples of new unethical behavior include soft forking (copying a repository without forking) and self-promotion (promoting a repository without self-identifying as a contributor to the repository). We also identify 18 types of software artifacts affected by the unethical behavior. The diverse types of unethical behavior identified in our study (1) call for the attention of developers and researchers when making contributions on GitHub, and (2) point to future research on automated detection of unethical behavior in OSS projects. From our study, we propose Etor, an approach that can automatically detect six types of unethical behavior by using ontological engineering and Semantic Web Rule Language (SWRL) rules to model GitHub attributes and software artifacts. Our evaluation on 195,621 GitHub issues (1,765 GitHub repositories) shows that Etor can automatically detect 548 instances of unethical behavior with 74.8% average true positive rate (up to 100% true positive rate). This shows the feasibility of automated detection of unethical behavior in OSS projects.

Publisher's Version

Testing III

NeuRI: Diversifying DNN Generation via Inductive Rule Inference
Jiawei Liu, Jinjun Peng, Yuyao Wang, and Lingming Zhang
(University of Illinois at Urbana-Champaign, USA; Columbia University, USA; Nanjing University, China)
Deep Learning (DL) is prevalently used in various industries to improve decision-making and automate processes, driven by the ever-evolving DL libraries and compilers. The correctness of DL systems is crucial for trust in DL applications. As such, the recent wave of research has been studying the automated synthesis of test-cases (i.e., DNN models and their inputs) for fuzzing DL systems. However, existing model generators only subsume a limited number of operators, lacking the ability to pervasively model operator constraints. To address this challenge, we propose NeuRI, a fully automated approach for generating valid and diverse DL models composed of hundreds of types of operators. NeuRI adopts a three-step process: (i) collecting valid and invalid API traces from various sources; (ii) applying inductive program synthesis over the traces to infer the constraints for constructing valid models; and (iii) using hybrid model generation which incorporates both symbolic and concrete operators. Our evaluation shows that NeuRI improves branch coverage of TensorFlow and PyTorch by 24% and 15% over the state-of-the-art model-level fuzzers. NeuRI finds 100 new bugs for PyTorch and TensorFlow in four months, with 81 already fixed or confirmed. Of these, 9 bugs are labelled as high priority or security vulnerability, constituting 10% of all high-priority bugs of the period. Open-source developers regard error-inducing tests reported by us as "high-quality" and "common in practice".

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Heterogeneous Testing for Coverage Profilers Empowered with Debugging Support
Yibiao Yang, Maolin Sun, Yang Wang, Qingyang Li, Ming Wen, and Yuming Zhou
(Nanjing University, China; Huazhong University of Science and Technology, China)
Ensuring the correctness of code coverage profilers is crucial, given the widespread adoption of code coverage for various software engineering tasks. Existing validation techniques, such as differential testing and metamorphic testing, have shown effectiveness in uncovering bugs in coverage profilers. However, these techniques have limitations as they primarily rely on homogeneous sources, i.e., different coverage profilers or the profilers themselves, for validation. In this paper, we propose Decov, a novel heterogeneous testing technique, to validate coverage profilers using the information provided by debuggers as a heterogeneous source. Coverage profilers record execution counts for each source line in the program, while debuggers monitor hit counts for each source line when running the program in debug mode. Our key insight is that the execution counts obtained from coverage profilers should align with the hit counts monitored by debuggers, without conflicts. Decov constructs multiple heterogeneous relations and utilizes them to uncover bugs in coverage profilers. Through experiments on Gcov and LLVM-cov, two widely used code coverage profilers, we discovered 21 new bug reports, with 19 of them directly confirmed by developers. Notably, developers have resolved 5 bugs in the latest trunk version. Decov serves as a simple yet effective coverage profiler validator and offers a complementary approach to existing techniques.
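The key insight above (profiler execution counts should agree with debugger hit counts) can be illustrated with a minimal sketch. This is not the Decov implementation: the abstract mentions multiple heterogeneous relations, while the sketch below checks only plain per-line equality. The function name `find_conflicts` and the sample counts are hypothetical.

```python
from typing import Dict, List, Tuple


def find_conflicts(
    coverage_counts: Dict[int, int],
    debugger_hits: Dict[int, int],
) -> List[Tuple[int, int, int]]:
    """Return (line, coverage_count, debugger_hits) for every disagreeing line."""
    conflicts = []
    # Only compare lines that both sources report on.
    for line in sorted(coverage_counts.keys() & debugger_hits.keys()):
        if coverage_counts[line] != debugger_hits[line]:
            conflicts.append((line, coverage_counts[line], debugger_hits[line]))
    return conflicts


# Hypothetical per-line counts for one small program.
coverage = {1: 1, 2: 10, 3: 0, 4: 1}   # as reported by a coverage profiler
debugger = {1: 1, 2: 10, 3: 2, 4: 1}   # as observed via breakpoint hit counts
conflicts = find_conflicts(coverage, debugger)
```

In this made-up example, line 3 is reported as never executed by the profiler but hit twice under the debugger, so it would be flagged as a candidate profiler bug for manual inspection.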

Publisher's Version Published Artifact Artifacts Available
Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer
Shubham Agarwal, Sarthak Chakraborty, Shaddy Garg, Sumit Bisht, Chahat Jain, Ashritha Gonuguntla, and Shiv Saini
(Adobe Research, India; University of Illinois at Urbana-Champaign, USA; Adobe, India; Amazon, India; Traceable.ai, India; Cisco, India)
Cloud services are omnipresent, and critical cloud service failure is a fact of life. In order to retain customers and prevent revenue loss, it is important to provide high reliability guarantees for these services. One way to do this is by predicting outages in advance, which can help in reducing the severity as well as the time to recovery. It is difficult to forecast critical failures due to the rarity of these events. Moreover, critical failures are ill-defined in terms of observable data. Our proposed method, Outage-Watch, defines critical service outages as deteriorations in the Quality of Service (QoS) captured by a set of metrics. Outage-Watch detects such outages in advance by using the current system state to predict whether the QoS metrics will cross a threshold and initiate an extreme event. A mixture of Gaussians is used to model the distribution of the QoS metrics for flexibility, and an extreme event regularizer helps improve learning in the tail of the distribution. An outage is predicted if the probability of any one of the QoS metrics crossing its threshold changes significantly. Our evaluation on a real-world SaaS company dataset shows that Outage-Watch significantly outperforms traditional methods with an average AUC of 0.98. Additionally, Outage-Watch detects all the outages exhibiting a change in service metrics and reduces the Mean Time To Detection (MTTD) of outages by up to 88% when deployed in an enterprise cloud-service system, demonstrating the efficacy of our proposed method.
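As a rough illustration of the threshold-crossing probability described above, the following sketch computes the probability that a metric exceeds a threshold under a Gaussian mixture. It is only an assumption-laden example: in the paper the mixture is predicted from the system state and trained with an extreme event regularizer, neither of which is reproduced here, and the weights, means, and standard deviations below are invented.

```python
import math
from typing import Sequence


def gaussian_tail(threshold: float, mean: float, std: float) -> float:
    """P(X > threshold) for X ~ N(mean, std^2), via the complementary error function."""
    z = (threshold - mean) / (std * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def mixture_tail(
    threshold: float,
    weights: Sequence[float],
    means: Sequence[float],
    stds: Sequence[float],
) -> float:
    """P(X > threshold) when X follows a mixture of Gaussians."""
    return sum(
        w * gaussian_tail(threshold, m, s)
        for w, m, s in zip(weights, means, stds)
    )


# Hypothetical latency metric (ms): mostly healthy, with a small degraded mode.
p_cross = mixture_tail(
    threshold=300.0,
    weights=[0.9, 0.1],
    means=[100.0, 400.0],
    stds=[10.0, 50.0],
)
```

A monitoring loop could recompute such a probability as new state arrives and raise an alert when it changes significantly; how "significantly" is defined is left open here.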

Publisher's Version

Software Evolution II

Multilingual Code Co-evolution using Large Language Models
Jiyang Zhang, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric
(University of Texas at Austin, USA)
Many software projects implement APIs and algorithms in multiple programming languages. Maintaining such projects is tiresome, as developers have to ensure that any change (e.g., a bug fix or a new feature) is propagated, promptly and without errors, to implementations in other programming languages. In the world of ever-changing software, using rule-based translation tools (i.e., transpilers) or machine learning models for translating code from one language to another provides limited value. Translating the entire codebase from one language to another each time is not the way developers work. In this paper, we target a novel task: translating code changes from one programming language to another using large language models (LLMs). We design and implement the first LLM, dubbed Codeditor, to tackle this task. Codeditor explicitly models code changes as edit sequences and learns to correlate changes across programming languages. To evaluate Codeditor, we collect a corpus of 6,613 aligned code changes from 8 pairs of open-source software projects implementing similar functionalities in two programming languages (Java and C#). Results show that Codeditor outperforms the state-of-the-art approaches by a large margin on all commonly used automatic metrics. Our work also reveals that Codeditor is complementary to the existing generation-based models, and their combination ensures even greater performance.
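To make the notion of "code changes as edit sequences" concrete, here is a small sketch that derives token-level edits between two versions of a statement. The abstract does not specify Codeditor's edit representation, so this uses Python's standard `difflib.SequenceMatcher` as a stand-in; the function name `edit_sequence` and the example statement are hypothetical.

```python
import difflib
from typing import List, Tuple

Edit = Tuple[str, List[str], List[str]]


def edit_sequence(old_tokens: List[str], new_tokens: List[str]) -> List[Edit]:
    """Return the non-trivial edits (tag, removed tokens, added tokens)."""
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace / insert / delete operations
            edits.append((tag, old_tokens[i1:i2], new_tokens[j1:j2]))
    return edits


# Hypothetical change, tokenized on whitespace.
old = "int x = a + b ;".split()
new = "long x = a + b + c ;".split()
edits = edit_sequence(old, new)
# edits == [("replace", ["int"], ["long"]), ("insert", [], ["+", "c"])]
```

An edit-based model would consume or emit such compact operations instead of regenerating the whole statement, which is the contrast with generation-based models drawn in the abstract.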

Publisher's Version
Knowledge-Based Version Incompatibility Detection for Deep Learning
Zhongkai Zhao, Bonan Kou, Mohamed Yilmaz Ibrahim, Muhao Chen, and Tianyi Zhang
(Tongji University, China; Purdue University, USA; University of California at Davis, USA)
Version incompatibility issues are rampant when reusing or reproducing deep learning models and applications. Existing techniques are limited to library dependency specifications declared in PyPI. Therefore, these techniques cannot detect version issues due to undocumented version constraints or issues involving hardware drivers or OS. To address this challenge, we propose to leverage the abundant discussions of DL version issues from Stack Overflow to facilitate version incompatibility detection. We reformulate the problem of knowledge extraction as a Question-Answering (QA) problem and use a pre-trained QA model to extract version compatibility knowledge from online discussions. The extracted knowledge is further consolidated into a weighted knowledge graph to detect potential version incompatibilities when reusing a DL project. Our evaluation results show that (1) our approach can accurately extract version knowledge with 84% accuracy, and (2) our approach can accurately identify 65% of known version issues in 10 popular DL projects with a high precision (92%), while two state-of-the-art approaches can only detect 29% and 6% of these issues with 33% and 17% precision respectively.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable

Program Analysis II

Statistical Type Inference for Incomplete Programs
Yaohui Peng, Jing Xie, Qiongling Yang, Hanwen Guo, Qingan Li, Jingling Xue, and Mengting Yuan
(Wuhan University, China; UNSW, Australia)
We propose a novel two-stage approach, Stir, for inferring types in incomplete programs that may be ill-formed, where whole-program syntactic analysis often fails. In the first stage, Stir predicts a type tag for each token by using neural networks, and consequently, infers all the simple types in the program. In the second stage, Stir refines the complex types for the tokens with predicted complex type tags. Unlike existing machine-learning-based approaches, which solve type inference as a classification problem, Stir reduces it to a sequence-to-graph parsing problem. According to our experimental results, Stir achieves an accuracy of 97.37% for simple types. By representing complex types as directed graphs (type graphs), Stir achieves a type similarity score of 77.36% and 59.61% for complex types and zero-shot complex types, respectively.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
OOM-Guard: Towards Improving the Ergonomics of Rust OOM Handling via a Reservation-Based Approach
Chengjun Chen, Zhicong Zhang, Hongliang Tian, Shoumeng Yan, and Hui Xu
(Fudan University, China; Ant Group, China)
Out of memory (OOM) is an exceptional system state where any further memory allocation requests may fail. Such allocation failures would crash the process or system if not handled properly, and they may also lead to an inconsistent program state that cannot be recovered easily. Current mechanisms for preventing such hazards rely heavily on the manual effort of the programmers themselves. This paper studies the OOM issues of Rust, which is an emerging systems programming language that stresses the importance of memory safety but still lacks handy mechanisms to handle OOM well. Even worse, Rust employs an infallible mode of memory allocation by default. As a result, a program written in Rust would simply abort when OOM occurs. Such crashes would lead to critical robustness issues for services or modules of operating systems. We propose OOM-Guard, a handy approach for Rust programmers to handle OOM. OOM-Guard is by nature a reservation-based approach that aims to convert the handling of many possibly failing memory allocations into the handling of a smaller number of reservations. In order to achieve efficient reservation, OOM-Guard incorporates a subtle cost analysis algorithm based on static analysis and a proxy allocator. We then apply OOM-Guard to two well-known Rust projects, Bento and rCore. Results show that OOM-Guard can largely reduce developers' efforts for handling OOM and incurs trivial overhead in both memory space and execution time.

Publisher's Version
DeepInfer: Deep Type Inference from Smart Contract Bytecode
Kunsong Zhao, Zihao Li, Jianfeng Li, He Ye, Xiapu Luo, and Ting Chen
(Hong Kong Polytechnic University, China; Xi’an Jiaotong University, China; KTH Royal Institute of Technology, Sweden; University of Electronic Science and Technology of China, China)
Smart contracts play an increasingly important role on the Ethereum platform. They provide various functions implementing numerous services, whose bytecode runs on the Ethereum Virtual Machine. To use services by invoking corresponding functions, the callers need to know the function signatures. Moreover, such signatures provide crucial information for many downstream applications, e.g., identifying smart contracts, fuzzing, detecting vulnerabilities, etc. However, it is challenging to infer function signatures from the bytecode due to a lack of type information. Existing work solving this problem depended heavily on limited databases or hard-coded heuristic patterns. However, these approaches are hard to adapt to semantic differences in distinct languages and various compiler versions when developing smart contracts. In this paper, we propose a novel framework, DeepInfer, which is the first to leverage deep learning techniques to automatically infer function signatures and returns. The novelties of DeepInfer are: 1) DeepInfer lifts the bytecode into the Intermediate Representation (IR) to preserve code semantics; 2) DeepInfer extracts the type-related knowledge (e.g., critical data flows, constant values, and control flow graphs) from the IR to recover function signatures and returns. We conduct experiments on Solidity and Vyper smart contracts and the results show that DeepInfer is faster and more accurate than existing tools, while being immune to changes in different languages and various compiler versions.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
DeMinify: Neural Variable Name Recovery and Type Inference
Yi Li, Aashish Yadavally, Jiaxing Zhang, Shaohua Wang, and Tien N. Nguyen
(New Jersey Institute of Technology, USA; University of Texas at Dallas, USA)
To avoid the exposure of original source code, the variable names deployed in the wild are often replaced by short, meaningless names, thus making the code difficult to understand and analyze. We introduce DeMinify, a Deep-Learning (DL)-based approach that formulates this recovery problem as the prediction of missing features in a Graph Convolutional Network–Missing Features. The graph represents both the relations among the variables and the relations among their types, in which the names or types of some nodes are missing. Moreover, DeMinify leverages dual-task learning to propagate the mutual impact between the learning of the variable names and that of their types. We conducted experiments to evaluate DeMinify in both name recovery and type prediction on a Python dataset with 180k methods and a JavaScript (JS) dataset with 322k files. For variable name prediction, in 76.7% and 81.6% of the cases in Python and JS code respectively, DeMinify can correctly predict the variables' names with a single suggested name. DeMinify achieves relative improvements of 15.3%–40.7% and 7.7%–49.7% in top-1 accuracy over the state-of-the-art variable name recovery approaches for Python and JS code, respectively. It also achieves relative improvements of 14.5%–51.9% in top-1 accuracy over the existing type prediction approaches. Our experimental results showed that learning of data types helps improve variable name recovery and vice versa.

Publisher's Version

Clone and Similarity Detection

Tritor: Detecting Semantic Code Clones by Building Social Network-Based Triads Model
Deqing Zou, Siyue Feng, Yueming Wu, Wenqi Suo, and Hai Jin
(Huazhong University of Science and Technology, China; Nanyang Technological University, Singapore)
Code clone detection refers to finding the functional similarities between two code fragments, which is becoming increasingly important with the evolution of software engineering. This matters because code cloning can increase maintenance costs and even cause the propagation of vulnerabilities, which can have a negative impact on software security. Numerous code clone detection methods have been proposed, including tree-based methods that are capable of detecting semantic code clones. However, since tree structures are complex, these methods are difficult to apply to large-scale clone detection. In this paper, we propose a scalable semantic code clone detector based on a semantically enhanced abstract syntax tree. Specifically, we add the control flow and data flow details into the original tree and regard the enhanced tree as a social network. Then we build a social network-based triads model to collect the similarity features between the two methods by analyzing different types of triads within the network. After obtaining all features, we use them to train a machine learning-based code clone detector (i.e., Tritor). Our comparative experimental results show that Tritor is superior to SourcererCC, RtvNN, Deckard, ASTNN, TBCNN, CDLH, and SCDetector, and is on par with DeepSim and FCCA. As for scalability, Tritor is about 39 times faster than another current state-of-the-art tree-based code clone detector, ASTNN.

Publisher's Version Info
Gitor: Scalable Code Clone Detection by Building Global Sample Graph
Junjie Shan, Shihan Dou, Yueming Wu, Hairu Wu, and Yang Liu
(Westlake University, China; Fudan University, China; Nanyang Technological University, Singapore)
Code clone detection is about finding similar code fragments and has drawn much attention in software engineering since it is important for software maintenance and evolution. Researchers have proposed many techniques and tools for source code clone detection, but current detection methods concentrate on analyzing or processing code samples individually without exploring the underlying connections among code samples.
In this paper, we propose Gitor to capture the underlying connections among different code samples. Specifically, given a source code database, we first tokenize all code samples to extract the pre-defined individual information (keywords). After obtaining all samples’ individual information, we leverage them to build a large global sample graph where each node is a code sample or a type of individual information. Then we apply a node embedding technique on the global sample graph to extract all the samples’ vector representations. After collecting all code samples’ vectors, we can simply compare the similarity between any two samples to detect possible clone pairs. More importantly, since the obtained vector of a sample is from a global sample graph, we can combine it with its own code features to improve the code clone detection performance. To demonstrate the effectiveness of Gitor, we evaluate it on a widely used dataset, namely BigCloneBench. Our experimental results show that Gitor has higher accuracy in terms of code clone detection and excellent execution time for inputs of various sizes (1–100 MLOC) compared to existing state-of-the-art tools. Moreover, we also evaluate the combination of Gitor with other traditional vector-based clone detection methods; the results show that the use of Gitor enables them to detect more code clones with higher F1.

Publisher's Version
Demystifying the Composition and Code Reuse in Solidity Smart Contracts
Kairan Sun, Zhengzi Xu, Chengwei Liu, Kaixuan Li, and Yang Liu
(Nanyang Technological University, Singapore; East China Normal University, China)
As the development of Solidity smart contracts has increased in popularity, the reliance on external sources such as third-party packages increases to reduce development costs. However, despite the use of external sources bringing flexibility and efficiency to the development, they could also complicate the process of assuring the security of downstream applications due to the lack of package managers for standardized ways and sources. While previous studies have only focused on code clones without considering how the external components are introduced, the compositions of a smart contract and their characteristics still remain puzzling.
To fill these gaps, we conducted an empirical study with over 350,000 Solidity smart contracts to uncover their compositions, conduct code reuse analysis, and identify prevalent development patterns. Our findings indicate that a typical smart contract comprises approximately 10 subcontracts, with over 80% of these originating from external sources, reflecting the significant reliance on third-party packages. For self-developed subcontracts, around 50% of the subcontracts have less than 10% unique functions, suggesting that code reuse at the level of functions is also common. For external subcontracts, though around 35% of the subcontracts are interfaces to provide templates for standards or protocols, an inconsistency in the use of subcontract types is also identified. Lastly, we extracted 61 frequently reused development patterns, offering valuable insights for secure and efficient smart contract development.

Publisher's Version
Scalable Program Clone Search through Spectral Analysis
Tristan Benoit, Jean-Yves Marion, and Sébastien Bardin
(LORIA, France; CNRS, France; Université de Lorraine, France; CEA, France; Université Paris-Saclay, France)
We consider the problem of program clone search, i.e., given a target program and a repository of known programs (all in executable format), the goal is to find the program in the repository most similar to the target program, with potential applications in terms of reverse engineering, program clustering, malware lineage and software theft detection. Recent years have witnessed a blooming in code similarity techniques, yet most of them focus on function-level similarity and function clone search, while we are interested in program-level similarity and program clone search. Actually, our study shows that prior similarity approaches are either too slow to handle large program repositories, or not precise enough, or not robust against slight variations introduced by compilers, source code versions or light obfuscations. We propose a novel spectral analysis method for program-level similarity and program clone search called Programs Spectral Similarity (PSS). In a nutshell, PSS one-time spectral feature extraction is tailored for large repositories, making it a perfect fit for program clone search. We have compared the different approaches with extensive benchmarks, showing that PSS reaches a sweet spot in terms of precision, speed and robustness.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable

Performance

A Highly Scalable, Hybrid, Cross-Platform Timing Analysis Framework Providing Accurate Differential Throughput Estimation via Instruction-Level Tracing
Min-Yih Hsu, Felicitas Hetzelt, David Gens, Michael Maitland, and Michael Franz
(University of California at Irvine, USA; SiFive, USA)
Differential throughput estimation, i.e., predicting the performance impact of software changes, is critical when developing applications that rely on accurate timing bounds, such as automotive, avionic, or industrial control systems. However, developers often lack access to the target hardware to perform on-device measurements, and hence rely on instruction throughput estimation tools to evaluate performance impacts.
State-of-the-art techniques broadly fall into two categories: dynamic and static. Dynamic approaches emulate program execution using cycle-accurate microarchitectural simulators resulting in high precision at the cost of long turnaround times and convoluted setups. Static approaches reduce overhead by predicting cycle counts outside of a concrete runtime environment. However, they are limited by the lack of dynamic runtime information and mostly focus on predictions over single basic blocks which requires developers to manually construct critical instruction sequences.
We present MCAD, a hybrid timing analysis framework that combines the advantages of dynamic and static approaches. Instead of relying on heavyweight cycle-accurate emulation, MCAD collects instruction traces along with dynamic runtime information from QEMU and streams them to a static throughput estimator. This allows developers to accurately estimate the performance impact of software changes for complete programs within minutes, reducing turnaround times by orders of magnitude compared to existing approaches with similar accuracy. Our evaluation shows that MCAD scales to real-world applications such as FFmpeg and Clang with millions of instructions, achieving < 3% geo. mean error compared to ground truth timings from hardware-performance counters on x86 and ARM machines.

Publisher's Version
Discovering Parallelisms in Python Programs
Siwei Wei, Guyang Song, Senlin Zhu, Ruoyi Ruan, Shihao Zhu, and Yan Cai
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Ant Group, China)
Parallelization is a promising way to improve the performance of Python programs. Unfortunately, developers may miss parallelization possibilities, because they usually do not concentrate on parallelization. Many approaches have been proposed to parallelize Python programs automatically; however, they are either domain-specific or require manual annotation. Thus, they cannot solve the problem well in general. In this paper, we propose PyPar, an effective tool aiming at discovering parallelization possibilities in real-world Python programs. PyPar doesn’t need manual annotation and is universally applicable. It first performs a data-dependence analysis to determine whether two pieces of code can run concurrently. The key is the use of a graph-theoretic approach. Next, it adopts a dynamic selection strategy to eliminate inefficient parallelisms. Finally, PyPar produces a parallelism report as well as a referential parallelized program, which is built by PyPar using one of the three parallelization methods (thread-based, process-based, and Ray-based). We have implemented a prototype of PyPar and evaluated it on six well-designed, widely used real-world Python packages: Scikit-Image, SciPy, librosa, trimesh, Scikit-learn, and seaborn. In total, 1,240 functions are tested, and PyPar found 127 parallelizable functions among them. Based on manual filtering, only 7 of them are false positives (i.e., a 94.5% precision). The remaining 120 are parallelizable (almost 10% among all functions under test), and most of them can be efficiently sped up, with an acceleration of up to 90% and an average of 44%. The acceleration in practice is close to theoretical estimation. The results show that even well-designed practical Python programs can be further parallelized for speeding up, and PyPar can bring effective and efficient parallelization on real-world Python programs.
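The data-dependence question above (can two pieces of code run concurrently?) can be sketched with a deliberately simplified check. This is not PyPar's graph-theoretic analysis, which the abstract does not detail; it applies Bernstein's conditions to variable names collected with the standard `ast` module and ignores aliasing, attribute and subscript writes, and side effects of calls. The helper names are hypothetical.

```python
import ast
from typing import Set, Tuple


def read_write_sets(source: str) -> Tuple[Set[str], Set[str]]:
    """Collect the names a code fragment reads and the names it writes."""
    reads: Set[str] = set()
    writes: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # ast.Store or ast.Del
                writes.add(node.id)
    return reads, writes


def may_run_concurrently(first: str, second: str) -> bool:
    """Bernstein's conditions: no flow, anti, or output dependence on names."""
    r1, w1 = read_write_sets(first)
    r2, w2 = read_write_sets(second)
    return not (w1 & r2 or r1 & w2 or w1 & w2)


independent = may_run_concurrently("a = f(x)", "b = g(y)")  # no shared names
dependent = may_run_concurrently("a = f(x)", "b = g(a)")    # second reads `a`
```

A check like this is conservative only with respect to plain variable names; a practical tool would also need to reason about shared objects and I/O before reporting a fragment pair as parallelizable.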

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
IoPV: On Inconsistent Option Performance Variations
Jinfu Chen, Zishuo Ding, Yiming Tang, Mohammed Sayagh, Heng Li, Bram Adams, and Weiyi Shang
(Wuhan University, China; University of Waterloo, Canada; Rochester Institute of Technology, USA; ÉTS, Canada; Polytechnique Montréal, Canada; Queen’s University, Canada)
Maintaining good performance is an essential task when evolving a software system. Performance regression issues are among the dominant problems that large software systems face. In addition, these large systems tend to be highly configurable, which allows users to change the behaviour of these systems by simply altering the values of certain configuration options. However, such flexibility comes with a cost. Such software systems suffer throughout their evolution from what we refer to as “Inconsistent Option Performance Variation” (IoPV). An IoPV indicates, for a given commit, that the performance regression or improvement of different values of the same configuration option is inconsistent compared to the prior commit. For instance, a new change might not suffer from any performance regression under the default configuration (i.e., when all the options are set to their default values), while altering one option’s value manifests a regression, which we refer to as a hidden regression as it is not manifested under the default configuration. Similarly, when developers improve the performance of their systems, performance regression might be manifested under a subset of the existing configurations. Unfortunately, such hidden regressions are harmful as they can reach the production environment unnoticed. In this paper, we first quantify how prevalent (in)consistent performance regression or improvement is among the values of an option. In particular, we study over 803 Hadoop and 502 Cassandra commits, for which we execute a total of 4,902 and 4,197 tests, respectively, amounting to 12,536 machine hours of testing. We observe that IoPV is a common problem that is difficult to manually predict. 69% and 93% of the Hadoop and Cassandra commits have at least one configuration that hides a performance regression. Worse, most of the commits have different options or tests leading to IoPV and hiding performance regressions.
Therefore, we propose a prediction model that identifies whether a given combination of commit, test, and option (CTO) manifests an IoPV. Our evaluation for different models shows that random forest is the best performing classifier, with a median AUC of 0.91 and 0.82 for Hadoop and Cassandra, respectively. Our paper defines and provides scientific evidence about the IoPV problem and its prevalence, which can be explored by future work. In addition, we provide an initial machine learning model for predicting IoPV.
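The IoPV definition above can be illustrated with a small sketch that compares, for one option, the performance of each value before and after a commit. The 5% tolerance, the assumption that larger measurements mean slower execution, and the function names are all hypothetical; the paper's actual statistical procedure is not described in the abstract.

```python
from typing import Dict


def change_direction(before: float, after: float, tolerance: float = 0.05) -> str:
    """Classify a measurement change, assuming larger values mean slower runs."""
    relative = (after - before) / before
    if relative > tolerance:
        return "regression"
    if relative < -tolerance:
        return "improvement"
    return "unchanged"


def is_iopv(
    before: Dict[str, float],
    after: Dict[str, float],
    tolerance: float = 0.05,
) -> bool:
    """True if the values of one option do not all change in the same direction."""
    directions = {
        change_direction(before[value], after[value], tolerance)
        for value in before
    }
    return len(directions) > 1


# Hypothetical test durations (seconds) for one option, keyed by option value.
prior_commit = {"off (default)": 100.0, "on": 120.0}
new_commit = {"off (default)": 101.0, "on": 150.0}
hidden_regression = is_iopv(prior_commit, new_commit)
```

In this invented example the default value is unaffected while the non-default value regresses by 25%, which matches the "hidden regression" scenario: a test run under the default configuration alone would not reveal it.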

Publisher's Version
Predicting Software Performance with Divide-and-Learn
Jingzhi Gong and Tao Chen
(University of Electronic Science and Technology of China, China; Loughborough University, UK; University of Birmingham, UK)
Predicting the performance of highly configurable software systems is the foundation for performance testing and quality assurance. To that end, recent work has been relying on machine/deep learning to model software performance. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: the influence of configuration options (features) and the distribution of data samples are highly sparse.
In this paper, we propose an approach based on the concept of “divide-and-learn”, dubbed DaL. The basic idea is that, to handle sample sparsity, we divide the samples from the configuration landscape into distant divisions, for each of which we build a regularized Deep Neural Network as the local model to deal with the feature sparsity. A newly given configuration would then be assigned to the right model of division for the final prediction.
Experiment results from eight real-world systems and five sets of training data reveal that, compared with the state-of-the-art approaches, DaL performs no worse than the best counterpart on 33 out of 40 cases (within which 26 cases are significantly better) with up to 1.94× improvement on accuracy; requires fewer samples to reach the same/better accuracy; and incurs acceptable training overhead. Practically, DaL also considerably improves different global models when using them as the underlying local models, which further strengthens its flexibility. To promote open science, all the data, code, and supplementary figures of this work can be accessed at our repository: https://github.com/ideas-labo/DaL.

Publisher's Version Info

Machine Learning III

Benchmarking Robustness of AI-Enabled Multi-sensor Fusion Systems: Challenges and Opportunities
Xinyu Gao, Zhijie Wang, Yang Feng, Lei Ma, Zhenyu Chen, and Baowen Xu
(Nanjing University, China; University of Alberta, Canada; University of Tokyo, Japan)
Multi-Sensor Fusion (MSF) based perception systems have been the foundation in supporting many industrial applications and domains, such as self-driving cars, robotic arms, and unmanned aerial vehicles. Over the past few years, the fast progress in data-driven artificial intelligence (AI) has brought a fast-increasing trend to empower MSF systems by deep learning techniques to further improve performance, especially on intelligent systems and their perception systems. Although quite a few AI-enabled MSF perception systems and techniques have been proposed, up to the present, limited benchmarks that focus on MSF perception are publicly available. Given that many intelligent systems such as self-driving cars are operated in safety-critical contexts where perception systems play an important role, there comes an urgent need for a more in-depth understanding of the performance and reliability of these MSF systems.
To bridge this gap, we initiate an early step in this direction and construct a public benchmark of AI-enabled MSF-based perception systems including three commonly adopted tasks (i.e., object detection, object tracking, and depth completion). Based on this, to comprehensively understand MSF systems’ robustness and reliability, we design 14 common and realistic corruption patterns to synthesize large-scale corrupted datasets. We further perform a systematic evaluation of these systems through our large-scale evaluation and identify the following key findings: (1) existing AI-enabled MSF systems are not robust enough against corrupted sensor signals; (2) small synchronization and calibration errors can lead to a crash of AI-enabled MSF systems; (3) existing AI-enabled MSF systems are usually tightly-coupled in which bugs/errors from an individual sensor could result in a system crash; (4) the robustness of MSF systems can be enhanced by improving fusion mechanisms. Our results reveal the vulnerability of the current AI-enabled MSF perception systems, calling for researchers and practitioners to take robustness and reliability into account when designing AI-enabled MSF. Our benchmark, code, and detailed evaluation results are publicly available at https://sites.google.com/view/ai-msf-benchmark.

Publisher's Version Published Artifact Info Artifacts Available
Automated Testing and Improvement of Named Entity Recognition Systems
Boxi Yu, Yiyan Hu, Qiuyang Mang, Wenhan Hu, and Pinjia He
(Chinese University of Hong Kong, China)
Named entity recognition (NER) systems have seen rapid progress in recent years due to the development of deep neural networks. These systems are widely used in various natural language processing applications, such as information extraction, question answering, and sentiment analysis. However, the complexity and intractability of deep neural networks can make NER systems unreliable in certain circumstances, resulting in incorrect predictions. For example, NER systems may misidentify female names as chemicals or fail to recognize the names of minority groups, leading to user dissatisfaction. To tackle this problem, we introduce TIN, a novel, widely applicable approach for automatically testing and repairing various NER systems. The key idea for automated testing is that the NER predictions of the same named entities under similar contexts should be identical. The core idea for automated repairing is that similar named entities should have the same NER prediction under the same context. We use TIN to test two SOTA NER models and two commercial NER APIs, i.e., Azure NER and AWS NER. We manually verify 784 of the suspicious issues reported by TIN and find that 702 are erroneous issues, leading to high precision (85.0%-93.4%) across four categories of NER errors: omission, over-labeling, incorrect category, and range error. For automated repairing, TIN achieves a high error reduction rate (26.8%-50.6%) over the four systems under test, which successfully repairs 1,056 out of the 1,877 reported NER errors.
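The testing idea stated above (the same named entity under similar contexts should receive identical predictions) can be illustrated with a minimal sketch. This is not TIN's implementation: `toy_ner`, `find_suspicious`, and the sentence templates are hypothetical names introduced only for illustration, and the tagger is deliberately buggy.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def toy_ner(sentence: str) -> Dict[str, str]:
    """Hypothetical, deliberately buggy tagger: maps entity text to a label."""
    labels: Dict[str, str] = {}
    if "Alice" in sentence:
        # Seeded bug: the label flips when an unrelated verb appears.
        labels["Alice"] = "CHEMICAL" if "mixed" in sentence else "PERSON"
    return labels


def find_suspicious(
    ner: Callable[[str], Dict[str, str]], entity: str, templates: List[str]
) -> List[Tuple[str, str, str, str]]:
    """Report sentence pairs in which the same entity gets different labels."""
    sentences = [t.format(entity=entity) for t in templates]
    # "O" marks an entity that the tagger did not recognize at all.
    predictions = [ner(s).get(entity, "O") for s in sentences]
    return [
        (sentences[i], predictions[i], sentences[j], predictions[j])
        for i, j in combinations(range(len(sentences)), 2)
        if predictions[i] != predictions[j]
    ]


# Three similar contexts around the same entity (assumed, hand-written).
templates = [
    "{entity} prepared the samples.",
    "{entity} mixed the samples.",
    "{entity} labeled the samples.",
]
issues = find_suspicious(toy_ner, "Alice", templates)
for issue in issues:
    print(issue)
```

Each reported pair is only a suspicious issue; as in the abstract, deciding whether it is a real NER error still requires verification.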

Publisher's Version
The EarlyBIRD Catches the Bug: On Exploiting Early Layers of Encoder Models for More Efficient Code Classification
Anastasiia Grishina, Max Hort, and Leon Moonen
(Simula Research Laboratory, Norway; BI Norwegian Business School, Norway)
The use of modern Natural Language Processing (NLP) techniques has been shown to be beneficial for software engineering tasks, such as vulnerability detection and type inference. However, training deep NLP models requires significant computational resources. This paper explores techniques that aim at achieving the best usage of resources and available information in these models.
We propose a generic approach, EarlyBIRD, to build composite representations of code from the early layers of a pre-trained transformer model. We empirically investigate the viability of this approach on the CodeBERT model by comparing the performance of 12 strategies for creating composite representations with the standard practice of only using the last encoder layer.
Our evaluation on four datasets shows that several early layer combinations yield better performance on defect detection, and some combinations improve multi-class classification. More specifically, we obtain a +2 average improvement of detection accuracy on Devign with only 3 out of 12 layers of CodeBERT and a 3.3x speed-up of fine-tuning. These findings show that early layers can be used to obtain better results using the same resources, as well as to reduce resource usage during fine-tuning and inference.
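As a rough illustration of building a composite representation from early layers, the sketch below takes the first-token vector of each of the first few layers and combines them with an element-wise maximum. The function name `composite_cls`, the choice of max pooling, and the toy hidden states are assumptions made for illustration; the abstract does not say which of the 12 strategies this corresponds to, if any.

```python
from typing import List

Vector = List[float]


def composite_cls(hidden_states: List[List[Vector]], num_layers: int) -> Vector:
    """Element-wise max over the first-token vectors of the first `num_layers` layers.

    `hidden_states[l][t]` is the vector of token `t` at encoder layer `l`.
    """
    if not 1 <= num_layers <= len(hidden_states):
        raise ValueError("num_layers must be between 1 and the number of layers")
    first_token_vectors = [layer[0] for layer in hidden_states[:num_layers]]
    return [max(values) for values in zip(*first_token_vectors)]


# Toy encoder output: 3 layers, 2 tokens, hidden size 2 (made-up numbers).
hidden_states = [
    [[0.1, 0.9], [0.5, 0.5]],
    [[0.4, 0.2], [0.3, 0.3]],
    [[0.8, 0.0], [0.1, 0.1]],
]
early = composite_cls(hidden_states, 2)
print(early)
```

Using only the first `num_layers` layers means the later layers need not be evaluated at all, which is where a saving in fine-tuning and inference cost would come from.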

Publisher's Version Published Artifact Artifacts Available
Deep Learning Based Feature Envy Detection Boosted by Real-World Examples
Bo Liu, Hui Liu, Guangjie Li, Nan Niu, Zimao Xu, Yifan Wang, Yunni Xia, Yuxia Zhang, and Yanjie Jiang
(Beijing Institute of Technology, China; National Innovation Institute of Defense Technology, China; University of Cincinnati, USA; Huawei Cloud, China; Chongqing University, China; Peking University, China)
Feature envy is one of the well-recognized code smells that should be removed by software refactoring. A major challenge in feature envy detection is that traditional approaches are less accurate whereas deep learning-based approaches are suffering from the lack of high-quality large-scale training data. Although existing refactoring detection tools could be employed to discover real-world feature envy examples, the noise (i.e., false positives) within the resulting data could significantly influence the quality of the training data as well as the performance of the models trained on the data. To this end, in this paper, we propose a sequence of heuristic rules and a decision tree-based classifier to filter out false positives reported by state-of-the-art refactoring detection tools. The data after filtering serve as the positive items in the requested training data. From the same subject projects, we randomly select methods that are different from positive items as negative items. With the real-world examples (both positive and negative examples), we design and train a deep learning-based binary model to predict whether a given method should be moved to a potential target class. Different from existing models, it leverages additional features, i.e., coupling between methods and classes (CBMC) and the message passing coupling between methods and classes (MCMC) that have not yet been exploited by existing approaches. Our evaluation results on real-world open-source projects suggest that the proposed approach substantially outperforms the state of the art in feature envy detection, improving precision and recall by 38.5% and 20.8%, respectively.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional

Security I

Comparison and Evaluation on Static Application Security Testing (SAST) Tools for Java
Kaixuan Li, Sen Chen, Lingling Fan, Ruitao Feng, Han Liu, Chengwei Liu, Yang Liu, and Yixiang Chen
(East China Normal University, China; Tianjin University, China; Nankai University, China; UNSW, Australia; Nanyang Technological University, Singapore)
Static application security testing (SAST) plays a significant role in the software development life cycle (SDLC). However, it is challenging to comprehensively evaluate the effectiveness of SAST tools to determine which is the better one for detecting vulnerabilities. In this paper, based on well-defined criteria, we first selected seven free or open-source SAST tools from 161 existing tools for further evaluation. Using synthetic and newly constructed real-world benchmarks, we evaluated and compared these SAST tools from different and comprehensive perspectives such as effectiveness, consistency, and performance. While SAST tools perform well on synthetic benchmarks, our results indicate that only 12.7% of real-world vulnerabilities can be detected by the selected tools. Even combining the detection capability of all tools, most vulnerabilities (70.9%) remain undetected, especially those beyond resource control and insufficiently neutralized input/output vulnerabilities. Although the tools have already built the corresponding detection rules and integrated them into their capabilities, the detection results still do not meet expectations. The findings of our comprehensive study provide guidance on tool development, improvement, evaluation, and selection for developers, researchers, and potential users.

Publisher's Version
Input-Driven Dynamic Program Debloating for Code-Reuse Attack Mitigation
Xiaoke Wang, Tao Hui, Lei Zhao, and Yueqiang Cheng
(Wuhan University, China; NIO, USA)
Modern software is bloated, especially for libraries. The unnecessary code not only brings severe vulnerabilities, but also helps attackers construct exploits. To mitigate the damage of bloated libraries, researchers have proposed several debloating techniques to remove or restrict the invocation of unused code in a library. However, existing approaches either statically keep code for all expected inputs, which leaves unused code for each concrete input, or rely on runtime context to dynamically determine the necessary code, which could be manipulated by attackers.
In this paper, we propose Picup, a practical approach that dynamically customizes libraries for each input. Based on the observation that the behavior of a program mainly depends on the given input, we design Picup to predict the necessary library functions immediately after we get the input, which erases the unused code before attackers can affect the decision-making data. To achieve an effective prediction, we adopt a convolutional neural network (CNN) with attention mechanism to extract key bytes from the input and map them to library functions. We evaluate Picup on real-world benchmarks and popular applications. The results show that we can predict the necessary library functions with 97.56% accuracy, and reduce the code size by 87.55% on average with low overheads. These results indicate that Picup is a practical solution for secure and effective library debloating.

Publisher's Version
TransRacer: Function Dependence-Guided Transaction Race Detection for Smart Contracts
Chenyang Ma, Wei Song, and Jeff Huang
(Nanjing University of Science and Technology, China; Texas A&M University, USA)
Smart contracts are programs that define rules for transactions running on blockchains. Since any qualified transaction sequence within the same block can be orchestrated by the blockchain miner, unexpected results may occur due to data races between transactions (called transaction races). Surprisingly, transaction races in smart contracts have not been fully investigated. To address this, we propose TransRacer, an automated approach and open-source tool that employs symbolic execution to detect transaction races in smart contracts. TransRacer analyzes function dependencies to identify transaction races hidden in specific contract states. It also generates witness transactions that can trigger such races. The experimental results on 50 real-world smart contracts show the effectiveness and efficiency of TransRacer: it detects 426 races in 255.9 minutes, including 149 race bugs leading to inconsistent states.
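To make the notion of a transaction race concrete, the sketch below checks whether two transactions leave a contract in different states depending on their order. It executes both orders concretely on a toy state; TransRacer itself uses symbolic execution and function-dependence analysis, which this sketch does not attempt. The names `deposit`, `withdraw_all_if_funded`, and `is_transaction_race` are hypothetical.

```python
import copy
from typing import Callable, Dict

State = Dict[str, int]
Transaction = Callable[[State], None]


def deposit(state: State) -> None:
    """Toy transaction: add 10 units to the contract balance."""
    state["balance"] += 10


def withdraw_all_if_funded(state: State) -> None:
    """Toy transaction: empty the balance, but only if at least 10 units exist."""
    if state["balance"] >= 10:
        state["balance"] = 0


def is_transaction_race(initial: State, tx_a: Transaction, tx_b: Transaction) -> bool:
    """True if running the two transactions in either order yields different states."""
    a_then_b = copy.deepcopy(initial)
    tx_a(a_then_b)
    tx_b(a_then_b)
    b_then_a = copy.deepcopy(initial)
    tx_b(b_then_a)
    tx_a(b_then_a)
    return a_then_b != b_then_a


initial_state = {"balance": 0}
print(is_transaction_race(initial_state, deposit, withdraw_all_if_funded))
print(is_transaction_race(initial_state, deposit, deposit))
```

The first pair races because the withdrawal only takes effect when it runs after the deposit; the second pair commutes. Whether a race exists depends on the starting state, which is why races can stay hidden in specific contract states.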

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Software Composition Analysis for Vulnerability Detection: An Empirical Study on Java Projects
Lida Zhao, Sen Chen, Zhengzi Xu, Chengwei Liu, Lyuye Zhang, Jiahui Wu, Jun Sun, and Yang Liu
(Singapore Management University, Singapore; Tianjin University, China; Nanyang Technological University, Singapore)
Software composition analysis (SCA) tools are proposed to detect potential vulnerabilities introduced by open-source software (OSS) imported as third-party libraries (TPL). With the increasing complexity of software functionality, SCA tools may encounter various scenarios during the dependency resolution process, such as diverse formats of artifacts, diverse dependency imports, and diverse dependency specifications. However, a comprehensive evaluation of SCA tools for Java that takes into account the above scenarios is still lacking. This could lead to a confined interpretation of comparisons and improper use of tools, and could hinder further improvements of the tools. To fill this gap, we proposed an Evaluation Model which consists of Scan Modes, Scan Methods, and SCA Scope for Maven (SSM), for comprehensive assessments of the dependency resolving capabilities and effectiveness of SCA tools. Based on the Evaluation Model, we first qualitatively examined 6 SCA tools’ capabilities. Next, the accuracy of dependency and vulnerability is quantitatively evaluated with a large-scale dataset (21,130 Maven modules with 73,499 unique dependencies) under two Scan Modes (i.e., build scan and pre-build scan). The results show that most tools do not fully support SSM, which leads to compromised accuracy. For dependency detection, the average F1-score is 0.890 and 0.692 for build and pre-build respectively, and for vulnerability accuracy, the average F1-score is 0.475. However, proper support for SSM reduces dependency detection false positives by 34.24% and false negatives by 6.91%. This further leads to a reduction of 18.28% in false positives and 8.72% in false negatives in vulnerability reports.
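For readers unfamiliar with how an F1-score for dependency detection is obtained, the sketch below compares a tool's reported dependencies against a ground-truth set. The Maven coordinates and the function name `dependency_f1` are made up for illustration and do not come from the study's dataset.

```python
from typing import Set, Tuple


def dependency_f1(detected: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a set of detected dependencies."""
    true_positives = len(detected & truth)
    false_positives = len(detected - truth)
    false_negatives = len(truth - detected)
    precision = true_positives / (true_positives + false_positives) if detected else 0.0
    recall = true_positives / (true_positives + false_negatives) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and tool output, as group:artifact:version strings.
truth = {
    "org.example:lib-a:1.0",
    "org.example:lib-b:2.1",
    "org.example:lib-c:0.9",
    "org.example:lib-d:3.0",
}
detected = {"org.example:lib-a:1.0", "org.example:lib-b:2.1", "org.example:lib-e:1.4"}

precision, recall, f1 = dependency_f1(detected, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the one spurious dependency is a false positive and the two missed ones are false negatives, the two quantities the abstract reports as being reduced by proper SSM support.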

Publisher's Version

Fault Diagnosis and Root Cause Analysis II

DeepDebugger: An Interactive Time-Travelling Debugging Approach for Deep Classifiers
Xianglin Yang, Yun Lin, Yifan Zhang, Linpeng Huang, Jin Song Dong, and Hong Mei
(Shanghai Jiao Tong University, China; National University of Singapore, Singapore)
A deep classifier is usually trained to (i) learn the numeric representation vector of samples and (ii) classify sample representations with learned classification boundaries. Time-travelling visualization, as an explainable AI technique, is designed to transform the model training dynamics into an animation of canvas with colorful dots and territories. Despite that the training dynamics of the high-level concepts such as sample representations and classification boundaries are now observable, the model developers can still be overwhelmed by tens of thousands of moving dots across hundreds of training epochs (i.e., frames in the animation), which makes them miss important training events.
In this work, we make the first attempt to develop the model time-travelling visualizers to the model time-travelling debuggers, for its practical use in model debugging tasks. Specifically, given an animation of model training dynamics of sample representation and classification landscape, we propose DeepDebugger solution to recommend the samples of user interest in a human-in-the-loop manner. On one hand, DeepDebugger monitors the training dynamics of samples and recommends suspicious samples based on their abnormality. On the other hand, our recommendation is interactive and fault-resilient for the model developers to explore the training process. By learning users’ feedback, DeepDebugger refines its recommendation to fit their intention. Our extensive experiments on applying DeepDebugger on the known time-travelling visualizers show that DeepDebugger can (1) detect the majority of the abnormal movement of the training samples on canvas; (2) significantly boost the recommendation performance of samples of interest (5-10X more accurate than the baselines) with the runtime overhead of 0.015s per feedback; (3) be resilient under the 3%, 5%, 10% mistaken user feedback. Our user study of the tool shows that the interactive recommendation of DeepDebugger can help the participants accomplish the debugging tasks by saving 18.1% completion time and boosting the performance by 20.3%.

Publisher's Version
Mining Resource-Operation Knowledge to Support Resource Leak Detection
Chong Wang, Yiling Lou, Xin Peng, Jianan Liu, and Baihan Zou
(Fudan University, China)
Resource leaks, which are caused by acquired resources not being released, often result in performance degradation and system crashes. Resource leak detection relies on two essential components: identifying potential Resource Acquisition and Release (RAR) API pairs, and subsequently analyzing code to uncover instances where the corresponding release API call is absent after an acquisition API call. Yet, existing techniques confine themselves to an incomplete pair pool, either pre-defined manually or mined from project-specific code corpus, thus limiting coverage across libraries/APIs and potentially overlooking latent resource leaks.
In this work, we propose to represent resource-operation knowledge as abstract resource acquisition/release operation pairs (Abs-RAR pairs for short), and present a novel approach called MiROK to mine such Abs-RAR pairs to construct a better RAR pair pool. Given a large code corpus, MiROK first mines Abs-RAR pairs with rule-based pair expansion and learning-based pair identification strategies, and then instantiates these Abs-RAR pairs into concrete RAR pairs. We implement MiROK and apply it to mine RAR pairs from a large code corpus of 1,454,224 Java methods and 20,000 Maven libraries. We then perform an extensive evaluation to investigate the mining effectiveness of MiROK and the practical usage of its mined RAR pairs for supporting resource leak detection. Our results show that MiROK mines 1,313 new Abs-RAR pairs and instantiates them into 6,314 RAR pairs with a high precision (i.e., 93.3%). In addition, by feeding our mined RAR pairs, existing approaches detect more resource leak defects in both online code examples and open-source projects.
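The role of a RAR pair pool in leak detection can be sketched as follows: given a linear trace of API calls, report every acquisition whose paired release never appears. This is a simplified illustration under stated assumptions, not MiROK or any existing detector; the trace format, the function `find_leaks`, and the two pairs are chosen by hand rather than mined.

```python
from typing import Dict, List, Tuple

Call = Tuple[str, str]  # (API name, resource identifier)


def find_leaks(calls: List[Call], rar_pairs: Dict[str, str]) -> List[int]:
    """Return indices of acquisition calls that are never followed by their release.

    `rar_pairs` maps an acquisition API to its release API.
    """
    release_apis = set(rar_pairs.values())
    open_resources: Dict[Tuple[str, str], int] = {}
    for index, (api, resource) in enumerate(calls):
        if api in rar_pairs:
            # Remember which release call would close this resource.
            open_resources[(rar_pairs[api], resource)] = index
        elif api in release_apis:
            open_resources.pop((api, resource), None)
    return sorted(open_resources.values())


# Hand-picked pairs for illustration only.
rar_pairs = {
    "FileInputStream.<init>": "FileInputStream.close",
    "Lock.lock": "Lock.unlock",
}
trace = [
    ("FileInputStream.<init>", "f1"),
    ("Lock.lock", "l1"),
    ("FileInputStream.close", "f1"),
]
leaks = find_leaks(trace, rar_pairs)
print(leaks)
```

A pair missing from `rar_pairs` makes the corresponding leak invisible to this check, which is the coverage limitation that a larger pair pool is meant to address.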

Publisher's Version
TransMap: Pinpointing Mistakes in Neural Code Translation
Bo Wang, Ruishi Li, Mingkai Li, and Prateek Saxena
(National University of Singapore, Singapore)
Automated code translation between programming languages can greatly reduce the human effort needed in learning new languages or in migrating code. Recent neural machine translation models, such as Codex, have been shown to be effective on many code generation tasks including translation. However, code produced by neural translators often has semantic mistakes. These mistakes are difficult to eliminate from the neural translator itself because the translator is a black box, which is difficult to interpret or control compared to rule-based transpilers. We propose the first automated approach to pinpoint semantic mistakes in code obtained after neural code translation. Our techniques are implemented in a prototype tool called TransMap which translates Python to JavaScript, both of which are popular scripting languages. On our created micro-benchmarks of Python programs with 648 semantic mistakes in total, TransMap accurately pinpoints the correct location for a fix for 87.96%, often highlighting 1-2 lines for the user to inspect per mistake. We report on our experience in translating 5 Python libraries with up to 1k lines of code with TransMap. Our preliminary user study suggests that TransMap can reduce the time for fixing semantic mistakes by around 70% compared to using a standard IDE with debuggers.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable
Dynamic Prediction of Delays in Software Projects using Delay Patterns and Bayesian Modeling
Elvan Kula, Eric Greuter, Arie van Deursen, and Georgios Gousios
(Delft University of Technology, Netherlands; ING, Netherlands)
Modern agile software projects are subject to constant change, making it essential to re-assess overall delay risk throughout the project life cycle. Existing effort estimation models are static and not able to incorporate changes occurring during project execution. In this paper, we propose a dynamic model for continuously predicting overall delay using delay patterns and Bayesian modeling. The model incorporates the context of the project phase and learns from changes in team performance over time. We apply the approach to real-world data from 4,040 epics and 270 teams at ING. An empirical evaluation of our approach and comparison to the state-of-the-art demonstrate significant improvements in predictive accuracy. The dynamic model consistently outperforms static approaches and the state-of-the-art, even during early project phases.

Publisher's Version
Commit-Level, Neural Vulnerability Detection and Assessment
Yi Li, Aashish Yadavally, Jiaxing Zhang, Shaohua Wang, and Tien N. Nguyen
(New Jersey Institute of Technology, USA; University of Texas at Dallas, USA)
Software Vulnerabilities (SVs) are security flaws that are exploitable in cyber-attacks. Delay in the detection and assessment of SVs might cause serious consequences due to the unknown impacts on the attacked systems. The state-of-the-art approaches have been proposed to work directly on the committed code changes for early detection. However, none of them could provide both commit-level vulnerability detection and assessment at once. Moreover, the assessment approaches still suffer low accuracy due to limited representations for code changes and surrounding contexts.
We propose a Context-aware, Graph-based, Commit-level Vulnerability Detection and Assessment Model, VDA, that evaluates a code change, detects any vulnerability and provides the CVSS assessment grades. VDA has two key novel components. First, we design a novel context-aware, graph-based, representation learning model to learn the contextualized embeddings for the code changes that integrate program dependencies and the surrounding contexts of code changes, facilitating the automated vulnerability detection and assessment. Second, VDA considers the mutual impact of learning to detect vulnerability and learning to assess each vulnerability assessment type. To do so, it leverages multi-task learning among the vulnerability detection and vulnerability assessment tasks, improving all the tasks at the same time. Our empirical evaluation shows that on a C vulnerability dataset, VDA achieves relative improvements of 25.5% and 26.9% over the baselines in vulnerability assessment in terms of F-score and MCC, respectively. On a Java dataset, it achieves relative improvements of 31% and 33.3% over the baselines in F-score and MCC, respectively. VDA also improves vulnerability detection over the baselines by a relative 13.4–322% in F-score.

Publisher's Version

Fuzzing

Enhancing Coverage-Guided Fuzzing via Phantom Program
Mingyuan Wu, Kunqiu Chen, Qi Luo, Jiahong Xiang, Ji Qi, Junjie Chen, Heming Cui, and Yuqun Zhang
(Southern University of Science and Technology, China; University of Hong Kong, China; Tianjin University, China)
For coverage-guided fuzzers, many of their adopted seeds are usually underused by exploring limited program states since essentially all their executions have to abide by rigorous program dependencies while only limited seeds are capable of accessing dependencies. Moreover, even when iteratively executing such limited seeds, the fuzzers have to repeatedly access the covered program states before uncovering new states. Such facts indicate that exploration power on program states of seeds has not been sufficiently leveraged by the existing coverage-guided fuzzing strategies. To tackle these issues, we propose a coverage-guided fuzzer, namely MirageFuzz, to mitigate the program dependencies when executing seeds for enhancing their exploration power on program states. Specifically, MirageFuzz first creates a “phantom” program of the target program by reducing its program dependencies corresponding to conditional statements while retaining their original semantics. Accordingly, MirageFuzz performs dual fuzzing, i.e., the source fuzzing to fuzz the original program and the phantom fuzzing to fuzz the phantom program simultaneously. Then, MirageFuzz applies the taint-based mutation mechanism to generate a new seed by updating the target conditional statement of a given seed from the source fuzzing with the corresponding condition value derived by the phantom fuzzing. To evaluate the effectiveness of MirageFuzz, we build a benchmark suite with 18 projects commonly adopted by recent fuzzing papers, and select seven open-source fuzzers as baselines for performance comparison with MirageFuzz. The experiment results suggest that MirageFuzz outperforms our baseline fuzzers by 13.42% to 77.96% on average. Furthermore, MirageFuzz exposes 29 previously unknown bugs, of which 4 have been confirmed and 3 have been fixed by the corresponding developers.

Publisher's Version
Co-dependence Aware Fuzzing for Dataflow-Based Big Data Analytics
Ahmad Humayun, Miryung Kim, and Muhammad Ali Gulzar
(Virginia Tech, USA; University of California at Los Angeles, USA)
Data-intensive scalable computing has become popular due to the increasing demands of analyzing big data. For example, Apache Spark and Hadoop allow developers to write dataflow-based applications with user-defined functions to process data with custom logic. Testing such applications is difficult. (1) These applications often take multiple datasets as input. (2) Unlike in SQL, there is no explicit schema for these datasets and each unstructured (or semi-structured) dataset is segmented and parsed at runtime. (3) Dataflow operators (e.g., join) create implicit co-dependence constraints between the fields of multiple datasets. An efficient and effective testing technique must analyze co-dependence among different regions of multiple datasets at the level of rows and columns and orchestrate input mutations jointly on co-dependent regions.
We propose DepFuzz to increase the effectiveness and efficiency of fuzz testing dataflow-based big data applications. The key insight behind DepFuzz is twofold. It keeps track of which code segments operate on which datasets, which rows, and which columns. By analyzing the use of dataflow operators (e.g., join and groupByKey) in tandem with the semantics of UDFs, DepFuzz generates test data that subsequently reach hard-to-reach regions of the application code. In real-world big data applications, DepFuzz finds 3.4× more faults, achieving 29% more statement coverage in half the time of Jazzer, a state-of-the-art commercial fuzzer for Java bytecode. It outperforms prior DISC testing by exposing deeper semantic faults beyond simpler input formatting errors, especially when multiple datasets have complex interactions through dataflow operators.

Publisher's Version Published Artifact Artifacts Available
SJFuzz: Seed and Mutator Scheduling for JVM Fuzzing
Mingyuan Wu, Yicheng Ouyang, Minghai Lu, Junjie Chen, Yingquan Zhao, Heming Cui, Guowei Yang, and Yuqun Zhang
(Southern University of Science and Technology, China; University of Hong Kong, China; Tianjin University, China; University of Queensland, Australia)
While the Java Virtual Machine (JVM) plays a vital role in ensuring correct executions of Java applications, testing JVMs via generating and running class files on them can be rather challenging. The existing techniques, e.g., ClassFuzz and Classming, attempt to leverage the power of fuzzing and differential testing to cope with JVM intricacies by exposing discrepant execution results among different JVMs, i.e., inter-JVM discrepancies, for testing analytics. However, their adopted fuzzers are insufficiently guided since they include no well-designed seed and mutator scheduling mechanisms, leading to inefficient differential testing. To address such issues, in this paper, we propose SJFuzz, the first JVM fuzzing framework with seed and mutator scheduling mechanisms for automated JVM differential testing. Overall, SJFuzz aims to mutate class files via control flow mutators to facilitate the exposure of inter-JVM discrepancies. To this end, SJFuzz schedules seeds (class files) for mutations based on the discrepancy and diversity guidance. SJFuzz also schedules mutators for diversifying class file generation. To evaluate SJFuzz, we conduct an extensive study on multiple representative real-world JVMs, and the experimental results show that SJFuzz significantly outperforms the state-of-the-art mutation-based and generation-based JVM fuzzers in terms of the inter-JVM discrepancy exposure and the class file diversity. Moreover, SJFuzz successfully reported 46 potential JVM issues, and 20 of them have been confirmed as bugs and 16 have been fixed by the JVM developers.

Publisher's Version
Metamong: Detecting Render-Update Bugs in Web Browsers through Fuzzing
Suhwan Song and Byoungyoung Lee
(Seoul National University, South Korea)
A render-update bug arises when a web browser produces an erroneous rendering output due to incorrect rendering updates. Such render-update bugs seriously harm the usability and reliability of web browsers. However, we find that detecting render-update bugs is challenging because the render-update bug is a semantic bug - given a rendering result, it is difficult to determine if it is correct due to the complex rendering specification of DOM and CSS. Thus, unlike memory corruption bugs, the incorrect rendering output does not raise the violation or crash. In practice, render-update bug detection relies on the time-prohibitive manual analysis of domain experts to determine the bug.
This paper proposes Metamong, an automated framework to detect render-update bugs without false positive issues via differential fuzz testing. Metamong features two key components: (i) page mutator, and (ii) render-update oracle. The page mutator generates render-update operations, which change the content of the web page, to trigger a render-update bug. The render-update oracle exploits an HTML standard rule, so-called yielding, to produce the correct rendering result of a given web page. Combining these components, Metamong creates two HTML files where each constructs the same web page, but only one of them induces the render-update. It then uses differential testing to compare their rendering outputs to determine a bug. We implemented a prototype of Metamong, which performs differential fuzz testing on popular browsers, Chrome and Firefox. So far, Metamong has identified 19 new render-update bugs, 17 in Chrome and two in Firefox. All of those have been confirmed by each browser vendor and five are already fixed, demonstrating the practical effectiveness of Metamong in identifying render-update bugs.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Property-Based Fuzzing for Finding Data Manipulation Errors in Android Apps
Jingling Sun, Ting Su, Jiayi Jiang, Jue Wang, Geguang Pu, and Zhendong Su
(East China Normal University, China; Nanjing University, China; ETH Zurich, Switzerland)
Like many software applications, data manipulation functionalities (DMFs) are prevalent in Android apps, which perform the common CRUD operations (create, read, update, delete) to handle app-specific data. Thus, ensuring the correctness of these DMFs is fundamentally important for many core app functionalities. However, the bugs related to DMFs (named as data manipulation errors, DMEs), especially those non-crashing logic ones, are prevalent but difficult to find. To this end, inspired by property-based testing, we introduce a property-based fuzzing approach to effectively finding DMEs in Android apps. Our key idea is that, given some type of app data of interest, we randomly interleave its relevant DMFs and other possible events to explore diverse app states for thorough validation. Specifically, our approach characterizes DMFs in (data) model-based properties and leverages the consistency between the data model and the UI layouts as the handler to do property checking. The properties of DMFs are specified by humans according to specific app features. To support the application of our approach, we implemented an automated GUI testing tool, PBFDroid. We evaluated PBFDroid on 20 real-world Android apps, and successfully found 30 unique and previously unknown bugs in 18 apps. Out of the 30 bugs, 29 are DMEs (22 are non-crashing logic bugs, and 7 are crash ones). To date, 19 have been confirmed and 9 have already been fixed. Many of these bugs are non-trivial and lead to different types of app failures. Our further evaluation confirms that none of the 22 non-crashing DMEs can be found by the state-of-the-art techniques. In addition, a user study shows that the manual cost of specifying the DMF properties with the assistance of our tool is acceptable. Overall, given accurate DMF properties, our approach can automatically find DMEs without any false positives. We have made all the artifacts publicly available at: https://github.com/property-based-fuzzing/home.
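The model-based property check described above can be sketched on a toy example: events are applied both to the app and to a simple data model, and the property is that what the app shows stays consistent with the model. This is not PBFDroid. The app `ToyNotesApp` and its seeded bug are hypothetical, and for determinism the sketch enumerates short event sequences exhaustively instead of interleaving them randomly as the approach does.

```python
from itertools import product
from typing import List, Optional, Tuple

Event = Tuple[str, str]  # (action, note title)


class ToyNotesApp:
    """Hypothetical app under test with a seeded non-crashing logic bug."""

    def __init__(self) -> None:
        self._notes: List[str] = []

    def create(self, title: str) -> None:
        self._notes.append(title)

    def delete(self, title: str) -> None:
        if title in self._notes:
            self._notes.pop()  # Bug: removes the newest note, not the requested one.

    def visible_titles(self) -> List[str]:
        return list(self._notes)


def apply_to_model(model: List[str], event: Event) -> None:
    """Reference data model: the intended effect of each DMF."""
    action, title = event
    if action == "create":
        model.append(title)
    elif title in model:
        model.remove(title)


def find_counterexample(events: List[Event], max_len: int) -> Optional[List[Event]]:
    """Return the first event sequence after which the app and the model disagree."""
    for length in range(1, max_len + 1):
        for sequence in product(events, repeat=length):
            app, model = ToyNotesApp(), []
            for action, title in sequence:
                getattr(app, action)(title)
                apply_to_model(model, (action, title))
                if sorted(app.visible_titles()) != sorted(model):
                    return list(sequence)
    return None


events = [("create", "a"), ("create", "b"), ("delete", "a"), ("delete", "b")]
counterexample = find_counterexample(events, max_len=3)
print(counterexample)
```

The bug never crashes the app and only shows up once two notes exist, which illustrates why interleaving several DMFs matters for reaching the failing state.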

Publisher's Version
Leveraging Hardware Probes and Optimizations for Accelerating Fuzz Testing of Heterogeneous Applications
Jiyuan Wang, Qian Zhang, Hongbo Rong, Guoqing Harry Xu, and Miryung Kim
(University of California at Los Angeles, USA; University of California at Riverside, USA; Intel Labs, USA)
There is a growing interest in the computer architecture community to incorporate heterogeneity and specialization to improve performance. Developers can create heterogeneous applications that consist of both host code and kernel code, where compute-intensive kernels can be offloaded from CPU to hardware accelerators. Testing such applications on real heterogeneous architectures is extremely challenging as kernels are black boxes, providing no information about the kernels’ internal execution to diagnose issues such as silent hangs or unexpected results. Additionally, inputs for heterogeneous applications are often large matrices, leading to a vast search space for identifying bug-revealing inputs.
We propose a novel fuzz testing technique, HFuzz, to enable efficient testing on real heterogeneous architectures. HFuzz aims to increase both the observability of hardware kernels and testing efficiency through a three-pronged approach. First, HFuzz automatically generates test guidance by inserting device-side in-kernel hardware probes in addition to host-side software monitors. Second, it performs rapid input space exploration by offloading compute-intensive input mutations to hardware kernels. Third, HFuzz parallelizes fuzzing and enables fast on-chip memory access, by utilizing four FPGA-level optimizations including loop unrolling, shannonization, data preloading, and dynamic kernel sharing.
We evaluate HFuzz on seven open-source OneAPI subjects from Intel. HFuzz speeds up fuzz testing by 4.7x with HW-accelerated input space exploration. By incorporating HW probes in tandem with SW monitors, HFuzz finds 33 defects within 4 hours and reveals 25 unique, unexpected behavior symptoms that could not be found by SW-based monitoring alone. HFuzz is the first to design hardware optimizations to accelerate fuzz testing.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional
NaNofuzz: A Usable Tool for Automatic Test Generation
Matthew C. Davis, Sangheon Choi, Sam Estep, Brad A. Myers, and Joshua Sunshine
(Carnegie Mellon University, USA; Rose-Hulman Institute of Technology, USA)
In the United States alone, software testing labor is estimated to cost $48 billion USD per year. Despite widespread test execution automation and automation in other areas of software engineering, test suites continue to be created manually by software engineers. We have built a test generation tool, called NaNofuzz, that helps users find bugs in their code by suggesting tests where the output is likely indicative of a bug, e.g., that return NaN (not-a-number) values. NaNofuzz is an interactive tool embedded in a development environment to fit into the programmer's workflow. NaNofuzz tests a function with as little as one button press, analyses the program to determine inputs it should evaluate, executes the program on those inputs, and categorizes outputs to prioritize likely bugs. We conducted a randomized controlled trial with 28 professional software engineers using NaNofuzz as the intervention treatment and the popular manual testing tool, Jest, as the control treatment. Participants using NaNofuzz on average identified bugs more accurately (p < .05, by 30%), were more confident in their tests (p < .03, by 20%), and finished their tasks more quickly (p < .007, by 30%).
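The workflow in this abstract (choose inputs, execute, categorize outputs so that likely bugs surface first) can be sketched as follows. This is only an illustration under stated assumptions, not NaNofuzz itself: the names `fuzz_numeric` and `guarded_sqrt_product`, the uniform input range, and the three output categories are all hypothetical choices.

```python
import math
import random


def fuzz_numeric(func, arity, trials=1000, seed=0, low=-100.0, high=100.0):
    """Call func on random float arguments and bucket each outcome."""
    rng = random.Random(seed)
    report = {"nan": [], "exception": [], "ok": []}
    for _ in range(trials):
        args = tuple(rng.uniform(low, high) for _ in range(arity))
        try:
            result = func(*args)
        except Exception as error:
            report["exception"].append((args, repr(error)))
            continue
        if isinstance(result, float) and math.isnan(result):
            report["nan"].append(args)  # likely-bug bucket
        else:
            report["ok"].append((args, result))
    return report


def guarded_sqrt_product(a, b):
    """Hypothetical function under test: yields NaN for negative products."""
    product = a * b
    return math.sqrt(product) if product >= 0 else float("nan")
```

Running `fuzz_numeric(guarded_sqrt_product, 2)` and inspecting `report["nan"]` first mirrors the idea of prioritizing outputs that are likely indicative of a bug.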

Publisher's Version Published Artifact Artifacts Available
A Generative and Mutational Approach for Synthesizing Bug-Exposing Test Cases to Guide Compiler Fuzzing
Guixin Ye, Tianmin Hu, Zhanyong Tang, Zhenye Fan, Shin Hwei Tan, Bo Zhang, Wenxiang Qian, and Zheng Wang
(Northwest University, China; Concordia University, Canada; Tencent, China; University of Leeds, UK)
Random test case generation, or fuzzing, is a viable means for uncovering compiler bugs. Unfortunately, compiler fuzzing can be time-consuming and inefficient with purely randomly generated test cases due to the complexity of modern compilers. We present COMFUZZ, a focused compiler fuzzing framework. COMFUZZ aims to improve compiler fuzzing efficiency by focusing on testing components and language features that are likely to trigger compiler bugs. Our key insight is that human developers tend to make common and repeated errors across compiler implementations; hence, we can leverage the previously reported bug-exposing test cases of a programming language to test a new compiler implementation. To this end, COMFUZZ employs deep learning to learn a test program generator from open-source projects hosted on GitHub. With the machine-generated test programs in place, COMFUZZ then leverages a set of carefully designed mutation rules to improve the coverage and bug-exposing capabilities of the test cases. We evaluate COMFUZZ on 11 compilers for the JS and Java programming languages. Within 260 hours of automated testing runs, we discovered 33 unique bugs across nine compilers, of which 29 have been confirmed and 22, including an API documentation defect, have already been fixed by the developers. We also compared COMFUZZ to eight prior fuzzers on four evaluation metrics. In a 24-hour comparative test, COMFUZZ uncovers at least 1.5× more bugs than the state-of-the-art baselines.

Publisher's Version Published Artifact Artifacts Available

Formal Verification

State Merging with Quantifiers in Symbolic Execution
David Trabish, Noam Rinetzky, Sharon Shoham, and Vaibhav Sharma
(Tel Aviv University, Israel; University of Minnesota, USA)
We address the problem of constraint encoding explosion, which hinders the applicability of state merging in symbolic execution. Specifically, our goal is to reduce the number of disjunctions and if-then-else expressions introduced during state merging. The main idea is to dynamically partition the symbolic states into merging groups according to a similar uniform structure detected in their path constraints, which allows us to efficiently encode the merged path constraint and memory using quantifiers. To address the added complexity of solving quantified constraints, we propose a specialized solving procedure that reduces the solving time in many cases. Our evaluation shows that our approach can lead to significant performance gains.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Detecting Atomicity Violations in Interrupt-Driven Programs via Interruption Points Selecting and Delayed ISR-Triggering
Bin Yu, Cong Tian, Hengrui Xing, Zuchao Yang, Jie Su, Xu Lu, Jiyu Yang, Liang Zhao, Xiaofeng Li, and Zhenhua Duan
(Xidian University, China; Beijing Institute of Control Engineering, China)
Interrupt-driven programs have been widely used in safety-critical areas such as aerospace and embedded systems. However, uncertain interleaving execution of interrupt service routines (ISRs) usually causes concurrency bugs. Specifically, when one or more ISRs attempt to preempt a sequence of instructions that is expected to be atomic, a kind of concurrency bug, namely an atomicity violation, may occur, and it is challenging to find this kind of bug precisely and efficiently. In this paper, we propose a static approach for detecting atomicity violations in interrupt-driven programs. First, the program model is constructed with interruption points being selected to determine the possibly influenced ISRs. After that, reachability computation is conducted to build up a whole abstract reachability tree, and a delayed ISR-triggering strategy is employed to reduce the state space. Meanwhile, unserializable interleaving patterns are recognized to achieve the goal of atomicity violation detection. The approach has been implemented as a configurable tool named CPA4AV. Extensive experiments show that CPA4AV is much more precise than the related available tools, with little extra time overhead. In addition, CPA4AV can deal with more complex situations.

Publisher's Version Info
Engineering a Formally Verified Automated Bug Finder
Arthur Correnson and Dominic Steinhöfel
(CISPA Helmholtz Center for Information Security, Germany)
Symbolic execution is a program analysis technique executing programs with symbolic instead of concrete inputs. This principle allows for exploring many program paths at once. Despite its wide adoption, in particular for program testing, little effort has been dedicated to studying the semantic foundations of symbolic execution. Without these foundations, critical questions regarding the correctness of symbolic executors cannot be satisfyingly answered: Can a reported bug be reproduced, or is it a false positive (soundness)? Can we be sure to find all bugs if we let the testing tool run long enough (completeness)? This paper presents a systematic approach for engineering provably sound and complete symbolic execution-based bug finders by relating a programming language’s operational semantics with a symbolic semantics. In contrast to prior work on symbolic execution semantics, we address the correctness of critical implementation details of symbolic bug finders, including the search strategy and the role of constraint solvers to prune the search space. We showcase our approach by implementing WiSE, a prototype of a verified bug finder for an imperative language, in the Coq proof assistant and proving it sound and complete. We demonstrate that the design principles of WiSE survive outside the ecosystem of interactive proof assistants by (1) automatically extracting an OCaml implementation and (2) transforming WiSE to PyWiSE, a functionally equivalent Python version.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Speeding up SMT Solving via Compiler Optimization
Benjamin Mikek and Qirun Zhang
(Georgia Institute of Technology, USA)
SMT solvers are fundamental tools for reasoning about constraints in practical problems like symbolic execution and program synthesis. Faster SMT solving can improve the performance and precision of those analysis tools. Existing approaches typically speed up SMT solving by developing new heuristics inside particular solvers, which requires nontrivial engineering effort. This paper presents a new perspective on speeding up SMT solving. We propose SMT-LLVM Optimizing Translation (SLOT), a solver-agnostic pre-processing approach that utilizes existing compiler optimizations to simplify SMT problem instances. We implement SLOT for the two most application-critical SMT theories: bitvectors and floating-point numbers. Our extensive evaluation based on the standard SMT-LIB benchmarks shows that SLOT can substantially increase the number of solvable SMT formulas given fixed timeouts and achieve mean speedups of nearly 3× for large benchmarks.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable

Automated Repair II

Semantic Test Repair for Web Applications
Xiaofang Qi, Xiang Qian, and Yanhui Li
(Southeast University, China; Nanjing University, China)
Automation testing is widely used in the functional testing of web applications. However, during the evolution of web applications, such web test scripts tend to break. It is essential to repair such broken test scripts to make regression testing run successfully. As manual repairing is time-consuming and expensive, researchers focus on automatic repairing techniques. Empirical study shows that the web element locator is the leading cause of web test breakages. Most existing repair techniques utilize Document Object Model attributes or visual appearances of elements to find their location but neglect their semantic information.
This paper proposes a novel semantic repair technique called Semantic Test Repair (Semter) for web test repair. Our approach captures relevant semantic information from test executions on the application’s basic version and locates target elements by calculating semantic similarity between elements to repair tests. Our approach can also repair test workflow due to web page additions or deletions by a local exploration in the updated version. We evaluated the efficacy of our technique on six real-world web applications compared with three baselines. Experimental results show that Semter achieves an 84% average repair ratio within an acceptable time cost, significantly outperforming the state-of-the-art web test repair techniques.

Publisher's Version
A Large-Scale Empirical Review of Patch Correctness Checking Approaches
Jun Yang, Yuehan Wang, Yiling Lou, Ming Wen, and Lingming Zhang
(University of Illinois at Urbana-Champaign, USA; Fudan University, China; Huazhong University of Science and Technology, China)
Automated Program Repair (APR) techniques have drawn wide attention from both academia and industry. Meanwhile, one main limitation with the current state-of-the-art APR tools is that patches passing all the original tests are not necessarily the correct ones wanted by developers, i.e., the plausible patch problem. To date, various Patch-Correctness Checking (PCC) techniques have been proposed to address this important issue. However, they are only evaluated on very limited datasets as the APR tools used for generating such patches can only explore a small subset of the search space of possible patches, posing serious threats to external validity to existing PCC studies. In this paper, we construct an extensive PCC dataset, PraPatch (the largest manually labeled PCC dataset to our knowledge), to revisit all nine state-of-the-art PCC techniques. More specifically, our PCC dataset PraPatch includes 1,988 patches generated from the recent PraPR APR tool, which leverages highly-optimized bytecode-level patch executions and can exhaustively explore all possible plausible patches within its large predefined search space (including well-known fixing patterns from various prior APR tools). Our extensive study of representative PCC techniques on PraPatch has revealed various findings, including: 1) the assumption made by existing static PCC techniques that correct patches are more similar to buggy code than incorrect plausible patches no longer holds, 2) state-of-the-art learning-based techniques tend to suffer from the dataset overfitting problem, 3) while dynamic techniques overall retain their effectiveness on our new dataset, their performance drops substantially on patches with more complicated changes and 4) the very recent naturalness-based techniques can substantially outperform traditional static techniques and could be a promising direction for PCC. Based on our findings, we also provide various guidelines/suggestions for advancing PCC in the near future.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Program Repair Guided by Datalog-Defined Static Analysis
Yu Liu, Sergey Mechtaev, Pavle Subotić, and Abhik Roychoudhury
(National University of Singapore, Singapore; University College London, UK; Microsoft, Serbia)
Automated program repair relying on static analysis complements test-driven repair, since it does not require failing tests to repair a bug, and it avoids test-overfitting by considering program properties. Due to the rich variety and complexity of program analyses, existing static program repair techniques are tied to specific analysers, and thus repair only narrow classes of defects. To develop a general-purpose static program repair framework that targets a wide range of properties and programming languages, we propose to integrate program repair with Datalog-based analysis. Datalog solvers are programmable fixed point engines which can be used to encode many program analysis problems in a modular fashion. The program under analysis is encoded as Datalog facts, while the fixed point equations of the program analysis are expressed as recursive Datalog rules. In this context, we view repairing the program as modifying the corresponding Datalog facts. This is accomplished by a novel technique, symbolic execution of Datalog, that evaluates Datalog queries over a symbolic database of facts, instead of a concrete set of facts. The result of symbolic query evaluation allows us to infer what changes to a given set of Datalog facts repair the program so that it meets the desired analysis goals. We developed a symbolic executor for Datalog called Symlog, on top of which we built a repair tool, SymlogRepair. We show the versatility of our approach on several analysis problems: repairing null pointer exceptions in Java programs, repairing data leaks in Python notebooks, and repairing four types of security vulnerabilities in Solidity smart contracts.
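To make the "facts plus recursive rules" framing concrete, here is a minimal sketch of naive fixed-point evaluation for a reachability analysis, followed by a brute-force search for single-fact deletions that make an undesired conclusion underivable. The brute-force search is a deliberately simple stand-in and is not the symbolic execution of Datalog that the abstract describes; the function names and the example facts are hypothetical.

```python
def transitive_closure(edges):
    """Naive fixed point for the rules
         path(x, y) :- edge(x, y).
         path(x, z) :- path(x, y), edge(y, z).
    """
    path = set(edges)
    while True:
        derived = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        if derived <= path:  # nothing new: fixed point reached
            return path
        path |= derived


def candidate_repairs(edges, bad):
    """Brute force (not symbolic): single edge facts whose removal makes
    the undesired fact `bad` no longer derivable."""
    return [fact for fact in sorted(edges)
            if bad not in transitive_closure(edges - {fact})]
```

For the facts `{("source", "a"), ("a", "sink"), ("a", "b")}` and the undesired conclusion `("source", "sink")`, the search reports the two facts on the offending flow, illustrating the view of repair as a modification of the fact database.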

Publisher's Version
Baldur: Whole-Proof Generation and Repair with Large Language Models
Emily First, Markus N. Rabe, Talia Ringer, and Yuriy Brun
(University of Massachusetts, USA; Augment Computing, USA; University of Illinois at Urbana-Champaign, USA)
Formally verifying software is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language and code and fine-tuned on proofs, to generate whole proofs at once. We then demonstrate that a model fine-tuned to repair generated proofs further increases proving power. This paper: (1) Demonstrates that whole-proof generation using transformers is possible and is as effective as, but more efficient than, search-based techniques. (2) Demonstrates that giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair that further improves automated proof generation. (3) Establishes, together with prior work, a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs, empirically showing the effectiveness of whole-proof generation, repair, and added context. We also show that Baldur complements the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.

Publisher's Version
KG4CraSolver: Recommending Crash Solutions via Knowledge Graph
Xueying Du, Yiling Lou, Mingwei Liu, Xin Peng, and Tianyong Yang
(Fudan University, China)
Fixing crashes is challenging, and developers often discuss the crashes they encounter and refer to similar crashes and solutions on online Q&A forums (e.g., Stack Overflow). However, a crash often involves a very complex context, which includes different contextual elements, e.g., purposes, environments, code, and crash traces. Existing crash solution recommendation or general solution recommendation techniques only use an incomplete context or treat the entire context as pure text to search for relevant solutions for a given crash, resulting in inaccurate recommendation results. In this work, we propose a novel crash solution knowledge graph (KG) to summarize the complete crash context and its solution with a graph-structured representation. To construct the crash solution KG automatically, we propose to leverage prompt learning to construct the KG from SO threads with a small set of labeled data. Based on the constructed KG, we further propose a novel KG-based crash solution recommendation technique, KG4CraSolver, which precisely finds the relevant SO thread for an encountered crash by finely analyzing and matching the complete crash context based on the crash solution KG. The evaluation results show that the constructed KG is of high quality and KG4CraSolver outperforms baselines in terms of all metrics (e.g., 13.4%-113.4% MRR improvements). Moreover, we perform a user study and find that KG4CraSolver helps participants find crash solutions 34.4% faster and 63.3% more accurately.

Publisher's Version Info
Automated and Context-Aware Repair of Color-Related Accessibility Issues for Android Apps
Yuxin Zhang, Sen Chen, Lingling Fan, Chunyang Chen, and Xiaohong Li
(Tianjin University, China; Nankai University, China; Monash University, Australia)
Approximately 15% of the world's population is suffering from various disabilities or impairments. However, many mobile UX designers and developers disregard the significance of accessibility for those with disabilities when developing apps. It is unbelievable that one in seven people might not have the same level of access that other users have, which actually violates many legal and regulatory standards. In contrast, if apps are developed with accessibility in mind, the user experience improves drastically for all users and revenue is maximized. Thus, a large number of studies have been conducted and some effective tools for detecting accessibility issues have been proposed to mitigate such a severe problem. However, compared with detection, the repair work is clearly falling behind. This is especially true for color-related accessibility issues, which are among the top issues in apps and have a greatly negative impact on vision and user experience. Apps with such issues are difficult to use for people with low vision and the elderly. Unfortunately, this issue type cannot be directly fixed by existing repair techniques. To this end, we propose Iris, an automated and context-aware repair method to fix the color-related accessibility issues (i.e., the text contrast issues and the image contrast issues) for apps. By leveraging a novel context-aware technique that resolves the optimal colors and a vital phase of attribute-to-repair localization, Iris not only repairs the color contrast issues but also guarantees the consistency of the design style between the original UI page and the repaired UI page. Our experiments showed that Iris can achieve a 91.38% repair success rate with high effectiveness and efficiency. The usefulness of Iris has also been evaluated by a user study with a high satisfaction rate as well as developers' positive feedback.
9 of 40 submitted pull requests on GitHub repositories have been accepted and merged into the projects by app developers, and another 4 developers are actively discussing further repairs with us. Iris is publicly available to facilitate this new research direction.

Publisher's Version Info

Human Aspects III

A Case Study of Developer Bots: Motivations, Perceptions, and Challenges
Sumit Asthana, Hitesh Sajnani, Elena Voyloshnikova, Birendra Acharya, and Kim Herzig
(University of Michigan, USA; Trade Desk, USA; Microsoft, USA)
Continuous integration and deployment (CI/CD) is now a widely adopted development model in practice as it reduces the time from ideas to customers. This adoption has also revived the idea of "shifting left" during software development -- a practice intended to find and prevent defects early in the software delivery process. To assist with that, engineering systems integrate developer bots in the development workflow to improve developer productivity and help them identify issues early in the software delivery process.
In this paper, we present a case study of developer bots in Microsoft. We identify and analyze 23 developer bots that are deployed across 13,000 repositories and assist about 6,000 developers daily in their CI/CD software development workflows. We classify these bots across five major categories: Config Violation, Security, Data-privacy, Developer Productivity, and Code Quality. By conducting interviews and surveys with bot developers and bot users and by analyzing about half a million historical bot actions spanning over one and a half years, we present software workflows that motivate bot instrumentation, factors impacting their usefulness as perceived by bot users, and challenges associated with their use. Our findings echo existing issues with bots, such as noise, and illustrate new benefits (e.g., cross-team communication) and challenges (e.g., too many bots) for large software teams.

Publisher's Version Published Artifact Artifacts Available
“We Feel Like We’re Winging It:” A Study on Navigating Open-Source Dependency Abandonment
Courtney Miller, Christian Kästner, and Bogdan Vasilescu
(Carnegie Mellon University, USA)
While lots of research has explored how to prevent maintainers from abandoning the open-source projects that serve as our digital infrastructure, there are very few insights on addressing abandonment when it occurs. We argue open-source sustainability research must expand its focus beyond trying to keep particular projects alive, to also cover the sustainable use of open source by supporting users when they face potential or actual abandonment. We interviewed 33 developers who have experienced open-source dependency abandonment. Often, they used multiple strategies to cope with abandonment, for example, first reaching out to the community to find potential alternatives, then switching to a community-accepted alternative if one exists. We found many developers felt they had little to no support or guidance when facing abandonment, leaving them to figure out what to do through a trial-and-error process on their own. Abandonment introduces cost for otherwise seemingly free dependencies, but users can decide whether and how to prepare for abandonment through a number of different strategies, such as dependency monitoring, building abstraction layers, and community involvement. In many cases, community members can invest in resources that help others facing the same abandoned dependency, but often do not because of the many other competing demands on their time – a form of the volunteer’s dilemma. We discuss cost reduction strategies and ideas to overcome this volunteer’s dilemma. Our findings can be used directly by open-source users seeking resources on dealing with dependency abandonment, or by researchers to motivate future work supporting the sustainable use of open source.

Publisher's Version
How Practitioners Expect Code Completion?
Chaozheng Wang, Junhao Hu, Cuiyun Gao, Yu Jin, Tao Xie, Hailiang Huang, Zhenyu Lei, and Yuetang Deng
(Chinese University of Hong Kong, China; Peking University, China; Tencent, USA; Tencent, China)
Code completion has become a common practice for programmers during their daily programming activities. It automatically predicts the next tokens or statements that the programmers may use. Code completion aims to substantially save keystrokes and improve the programming efficiency for programmers. Although there exists substantial research on code completion, it is still unclear what practitioner expectations are on code completion and whether these expectations are met by the existing research. To address these questions, we perform a study by first interviewing 15 professionals and then surveying 599 practitioners from 18 IT companies about their expectations on code completion. We then compare the practitioner expectations with the existing research by conducting a literature review of papers on code completion published in major publication venues from 2012 to 2022. Based on the comparison, we highlight the directions desirable for researchers to invest efforts toward developing code completion techniques for meeting practitioner expectations.

Publisher's Version

Testing IV

Code Coverage Criteria for Asynchronous Programs
Mohammad Ganji, Saba Alimadadi, and Frank Tip
(Simon Fraser University, Canada; Northeastern University, USA)
Asynchronous software often exhibits complex and error-prone behaviors that should be tested thoroughly. Code coverage has been the most popular metric to assess test suite quality. However, traditional code coverage criteria do not adequately reflect completion, interactions, and error handling of asynchronous operations. This paper proposes novel test adequacy criteria for measuring: (i) completion of asynchronous operations in terms of both successful and exceptional execution, (ii) registration of reactions for handling both possible outcomes, and (iii) execution of said reactions through tests. We implement JScope, a tool for automatically measuring coverage according to these criteria in JavaScript applications, as an interactive plug-in for Visual Studio Code. An evaluation of JScope on 20 JavaScript applications shows that the proposed criteria can help improve assessment of test adequacy, complementing traditional criteria. According to our investigation of 15 real GitHub issues concerned with asynchrony, the new criteria can help reveal faulty asynchronous behaviors that are untested yet are deemed covered by traditional coverage criteria. We also report on a controlled experiment with 12 participants to investigate the usefulness of JScope in realistic settings, demonstrating its effectiveness in improving programmers’ ability to assess test adequacy and detect untested behavior of asynchronous code.
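Criterion (i), completion of an asynchronous operation in both its successful and its exceptional form, can be illustrated with a small tracker. The paper targets JavaScript; this sketch uses Python's `asyncio` as an analogue only, and the class `SettlementCoverage`, the labels, and the outcome names are hypothetical rather than part of JScope. Criteria (ii) and (iii) are not modeled here.

```python
import asyncio
from collections import defaultdict

BOTH_OUTCOMES = {"fulfilled", "rejected"}


class SettlementCoverage:
    """Records, per labeled operation, which outcomes the tests exercised."""

    def __init__(self):
        self.seen = defaultdict(set)

    async def track(self, label, awaitable):
        try:
            result = await awaitable
        except Exception:
            self.seen[label].add("rejected")
            raise  # keep the original behavior visible to the test
        self.seen[label].add("fulfilled")
        return result

    def uncovered(self):
        """Map each tracked operation to the outcomes never observed."""
        return {label: BOTH_OUTCOMES - outcomes
                for label, outcomes in self.seen.items()
                if outcomes != BOTH_OUTCOMES}


async def load(succeed):
    """Hypothetical asynchronous operation under test."""
    if not succeed:
        raise IOError("load failed")
    return 1
```

A test suite that only ever awaits `load(True)` would reach full statement coverage of its call sites while `uncovered()` still reports the missing exceptional completion, which is the kind of gap the proposed criteria are meant to expose.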

Publisher's Version Published Artifact Artifacts Available
API-Knowledge Aware Search-Based Software Testing: Where, What, and How
Xiaoxue Ren, Xinyuan Ye, Yun Lin, Zhenchang Xing, Shuqing Li, and Michael R. Lyu
(Zhejiang University, China; Australian National University, Australia; Shanghai Jiao Tong University, China; CSIRO’s Data61, Australia; Chinese University of Hong Kong, China)
Search-based software testing (SBST) has proved its effectiveness in generating test cases to achieve its defined test goals, such as branch and data-dependency coverage. However, pre-defined goals can hardly adapt to diverse projects, which limits how effectively they detect program faults. In this work, we propose KAT, a novel knowledge-aware SBST approach to generate on-demand assertions in the program under test (PUT) based on its used APIs. KAT constructs an API knowledge graph from the API documentation to derive the constraints that the client code needs to satisfy. Each constraint is instrumented into the PUT as a program branch, serving as a test goal to guide SBST to detect faults. We evaluate KAT against two baselines (i.e., EvoSuite and Catcher) with a close-world and an open-world experiment to detect API bugs. The close-world experiment shows that KAT outperforms the baselines in the F1-score (0.55 vs. 0.24 and 0.30) to detect API-related bugs. The open-world experiment shows that KAT can detect 59.64% and 9.05% more bugs than the baselines in practice.

Publisher's Version
EtherDiffer: Differential Testing on RPC Services of Ethereum Nodes
Shinhae Kim and Sungjae Hwang
(Affiliated Institute of ETRI, South Korea; Sungkyunkwan University, South Korea)
Blockchain is a distributed ledger that records transactions among users on top of a peer-to-peer network. Among all, Ethereum is the most popular general-purpose platform and its support of smart contracts led to a new form of applications called decentralized applications (DApps). A typical DApp has an off-chain frontend and on-chain backend architecture, and the frontend often needs interactions with the backend network, e.g., to acquire chain data or make transactions. Therefore, Ethereum nodes implement the official RPC specification and expose a uniform set of RPC methods to the frontend. However, the specification is not sufficient in two points: (1) lack of clarification for non-deterministic event handling, and (2) lack of specification for invalid arguments. To effectively disclose any deviations caused by the insufficiency, this paper introduces EtherDiffer that automatically performs differential testing on four major node implementations in terms of their RPC services. EtherDiffer first generates a non-deterministic chain by multi-concurrent transactions and propagation delay. Then, it applies our key techniques called property-based generation and type-preserving mutation to generate both semantically-valid and semantically-invalid-yet-executable test cases. EtherDiffer executes the test cases on target nodes and reports any deviations in error handling or return values. The evaluation showed the effectiveness of our test case generation techniques with the success ratios of 98.8% and 95.4%, respectively. Also, EtherDiffer detected 48 different classes of deviations including 11 implementation bugs such as crash and denial-of-service bugs. We reported 44 of the detected classes to the specification and node developers and received acknowledgements as well as bug patches. Lastly, it significantly outperformed the official node testing tool in every technical aspect. 
We believe that our research findings can contribute to a more stable DApp ecosystem by reducing the inconsistencies among nodes.
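The core loop of differential testing described above (run the same request on several implementations and report deviations in return values or error handling) can be sketched as follows. The two "nodes" are hypothetical Python functions, not real Ethereum clients, and no actual RPC method is modeled; the input `-1` stands in for a type-preserving mutation that keeps the argument an integer but makes it semantically invalid.

```python
def node_a_get_block(number):
    """Hypothetical implementation A: rejects invalid block numbers."""
    if not isinstance(number, int) or number < 0:
        raise ValueError("invalid block number")
    return {"number": number}


def node_b_get_block(number):
    """Hypothetical implementation B: silently returns None instead."""
    if number < 0:
        return None
    return {"number": number}


def observe(impl, arg):
    """Normalize a call into ('ok', value) or ('error', exception type)."""
    try:
        return ("ok", impl(arg))
    except Exception as error:
        return ("error", type(error).__name__)


def differential_test(impls, inputs):
    """Return (input, outcomes) pairs on which the implementations disagree."""
    deviations = []
    for arg in inputs:
        outcomes = {name: observe(impl, arg) for name, impl in impls.items()}
        if len({repr(outcome) for outcome in outcomes.values()}) > 1:
            deviations.append((arg, outcomes))
    return deviations
```

With `impls = {"a": node_a_get_block, "b": node_b_get_block}` and inputs `[0, 5, -1]`, only `-1` is reported: one implementation raises while the other returns a value, which is the kind of under-specified behavior for invalid arguments that the abstract points to.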

Publisher's Version Published Artifact Artifacts Available Artifacts Functional

Machine Learning IV

Dynamic Data Fault Localization for Deep Neural Networks
Yining Yin, Yang Feng, Shihao Weng, Zixi Liu, Yuan Yao, Yichi Zhang, Zhihong Zhao, and Zhenyu Chen
(Nanjing University, China)
Rich datasets have empowered various deep learning (DL) applications, leading to remarkable success in many fields. However, data faults hidden in the datasets could result in DL applications behaving unpredictably and even cause massive monetary and life losses. To alleviate this problem, in this paper, we propose a dynamic data fault localization approach, namely DFauLo, to locate the mislabeled and noisy data in the deep learning datasets. DFauLo is inspired by the conventional mutation-based code fault localization, but utilizes the differences between DNN mutants to amplify and identify the potential data faults. Specifically, it first generates multiple DNN model mutants of the original trained model. Then it extracts features from these mutants and maps them into a suspiciousness score indicating the probability of the given data being a data fault. Moreover, DFauLo is the first dynamic data fault localization technique, prioritizing the suspected data based on user feedback, and providing the generalizability to unseen data faults during training. To validate DFauLo, we extensively evaluate it on 26 cases with various fault types, data types, and model structures. We also evaluate DFauLo on three widely-used benchmark datasets. The results show that DFauLo outperforms the state-of-the-art techniques in almost all cases and locates hundreds of different types of real data faults in benchmark datasets.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional
Understanding the Bug Characteristics and Fix Strategies of Federated Learning Systems
Xiaohu Du, Xiao Chen, Jialun Cao, Ming Wen, Shing-Chi Cheung, and Hai Jin
(Huazhong University of Science and Technology, China; Hong Kong University of Science and Technology, China)
Federated learning (FL) is an emerging machine learning paradigm that aims to address the problem of isolated data islands. To preserve privacy, FL allows machine learning models and deep neural networks to be trained from decentralized data kept privately at individual devices. FL has been increasingly adopted in mission-critical fields such as finance and healthcare. However, bugs in FL systems are inevitable and may result in catastrophic consequences such as financial loss, inappropriate medical decisions, and violations of data privacy ordinances. While many recent studies were conducted to understand the bugs in machine learning systems, there is no existing study to characterize the bugs arising from the unique nature of FL systems. To fill the gap, we collected 395 real bugs from six popular FL frameworks (Tensorflow Federated, PySyft, FATE, Flower, PaddleFL, and Fedlearner) in GitHub and StackOverflow, and then manually analyzed their symptoms and impacts, prone stages, root causes, and fix strategies. Furthermore, we report a series of findings and actionable implications that can potentially facilitate the detection of FL bugs.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Learning Program Semantics for Vulnerability Detection via Vulnerability-Specific Inter-procedural Slicing
Bozhi Wu, Shangqing Liu, Yang Xiao, Zhiming Li, Jun Sun, and Shang-Wei Lin
(Singapore Management University, Singapore; Nanyang Technological University, Singapore; Chinese Academy of Sciences, China)
Learning-based approaches that learn code representations for software vulnerability detection have been proven to produce inspiring results. However, they still fail to capture complete and precise vulnerability semantics for code representations. To address the limitations, in this work, we propose a learning-based approach namely SnapVuln, which first utilizes multiple vulnerability-specific inter-procedural slicing algorithms to capture vulnerability semantics of various types and then employs a Gated Graph Neural Network (GGNN) with an attention mechanism to learn vulnerability semantics. We compare SnapVuln with state-of-the-art learning-based approaches on two public datasets, and confirm that SnapVuln outperforms them. We further perform an ablation study and demonstrate that the completeness and precision of vulnerability semantics captured by SnapVuln contribute to the performance improvement.

Publisher's Version
DeepRover: A Query-Efficient Blackbox Attack for Deep Neural Networks
Fuyuan Zhang, Xinwen Hu, Lei Ma, and Jianjun Zhao
(Kyushu University, Japan; Hunan Normal University, China; University of Tokyo, Japan; University of Alberta, Canada)
Deep neural networks (DNNs) achieved a significant performance breakthrough over the past decade and have been widely adopted in various industrial domains. However, a fundamental problem regarding DNN robustness is still not adequately addressed, which can potentially lead to many quality issues after deployment, e.g., safety, security, and reliability. An adversarial attack is one of the most commonly investigated techniques to penetrate a DNN by misleading the DNN’s decision through the generation of minor perturbations in the original inputs. More importantly, the adversarial attack is a crucial way to assess, estimate, and understand the robustness boundary of a DNN. Intuitively, a stronger adversarial attack can help obtain a tighter robustness boundary, allowing us to understand the potential worst-case scenario when a DNN is deployed. To push this further, in this paper, we propose DeepRover, a fuzzing-based blackbox attack for deep neural networks used for image classification. We show that DeepRover is more effective and query-efficient in generating adversarial examples than state-of-the-art blackbox attacks. Moreover, DeepRover can find adversarial examples at a finer-grained level than other approaches.

Publisher's Version

Program Analysis III

Practical Inference of Nullability Types
Nima Karimipour, Justin Pham, Lazaro Clapp, and Manu Sridharan
(University of California, Riverside, USA; Uber Technologies, USA)
NullPointerExceptions (NPEs), caused by dereferencing null, frequently cause crashes in Java programs. Pluggable type checking is highly effective in preventing Java NPEs. However, this approach is difficult to adopt for large, existing code bases, as it requires manually inserting a significant number of type qualifiers into the code. Hence, a tool to automatically infer these qualifiers could make adoption of type-based NPE prevention significantly easier. We present a novel and practical approach to automatic inference of nullability type qualifiers for Java. Our technique searches for a set of qualifiers that maximizes the amount of code that can be successfully type checked. The search uses the type checker as a black box oracle, easing compatibility with existing tools. However, this approach can be costly, as evaluating the impact of a qualifier requires re-running the checker. We present a technique for safely evaluating many qualifiers in a single checker run, dramatically reducing running times. We also describe extensions to make the approach practical in a real-world deployment. We implemented our approach in an open-source tool NullAwayAnnotator, designed to work with the NullAway type checker. We evaluated NullAwayAnnotator’s effectiveness on both open-source projects and commercial code. NullAwayAnnotator reduces the number of reported NullAway errors by 69.5% on average. Further, our optimizations enable NullAwayAnnotator to scale to large Java programs. NullAwayAnnotator has been highly effective in practice: in a production deployment, it has already been used to add NullAway checking to 160 production modules totaling over 1.3 million lines of Java code.
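
The search described in this abstract treats the type checker as a black box oracle. As a rough illustration only, the sketch below uses a simple greedy strategy: it keeps a candidate qualifier when re-running the checker reports fewer errors. This is an assumed simplification, not the algorithm implemented in NullAwayAnnotator (in particular, it re-runs the checker once per candidate, which is exactly the cost the paper's single-run technique avoids). The names `infer_qualifiers` and `toy_checker` and the qualifier labels are hypothetical.

```python
def infer_qualifiers(candidates, count_errors):
    """Greedily pick qualifiers that reduce the checker's error count.

    candidates: iterable of qualifier labels (hypothetical strings).
    count_errors: black-box oracle mapping a set of qualifiers to an error count.
    """
    chosen = set()
    best = count_errors(chosen)
    for qualifier in candidates:
        trial = chosen | {qualifier}
        errors = count_errors(trial)  # one checker run per candidate
        if errors < best:
            chosen, best = trial, errors
    return chosen, best


def toy_checker(annotated):
    """Stand-in for a real type checker run on a made-up program."""
    errors = 0
    if "field:cache" not in annotated:
        errors += 2  # null is assigned to the field in two places
    if "param:name" in annotated:
        errors += 1  # the parameter is dereferenced without a null check
    if "return:find" not in annotated:
        errors += 1  # the method returns null on one path
    return errors


qualifiers, remaining = infer_qualifiers(
    ["field:cache", "param:name", "return:find"], toy_checker
)
print(sorted(qualifiers), remaining)
```

Here the candidate `param:name` is rejected because annotating it introduces a new error, while the other two qualifiers each remove errors and are kept.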

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
LibKit: Detecting Third-Party Libraries in iOS Apps
Daniel Domínguez-Álvarez, Alejandro de la Cruz, Alessandra Gorla, and Juan Caballero
(IMDEA Software Institute, Spain; University of Verona, Italy)
We present LibKit, the first approach and tool for detecting the name and version of third-party libraries (TPLs) present in iOS apps. LibKit automatically builds fingerprints for 86K library versions available through the CocoaPods dependency manager and matches them on the decrypted app executables to identify the TPLs (name and version) an iOS app uses. LibKit supports apps written in Swift and Objective-C, detects statically and dynamically linked libraries, and addresses challenges such as partially included libraries and different compiler versions and configurations producing variants of the same library version. On a ground truth of 95 open-source apps, LibKit identifies libraries with a precision of 0.911 and a recall of 0.839. LibKit also significantly outperforms the state-of-the-art CRiOS tool for identifying TPL boundaries. When applied to 1,500 apps from the iTunes Store, LibKit detects 47,015 library versions, identifying popular apps that contain old library versions.

Publisher's Version Published Artifact Info Artifacts Available
FunProbe: Probing Functions from Binary Code through Probabilistic Analysis
Soomin Kim, Hyungseok Kim, and Sang Kil Cha
(KAIST, South Korea)
Current function identification techniques have been mostly focused on a specific set of binaries compiled for a specific CPU architecture. While recent deep-learning-based approaches theoretically can handle binaries from different architectures, they require significant computation resources for training and inference, making their use less practical. Furthermore, due to the lack of interpretability of such models, it is fundamentally difficult to gain insight from them. Hence, in this paper, we propose FunProbe, an efficient system for identifying functions from binaries using probabilistic inference. In particular, we identify 16 architecture-neutral hints for function identification, and devise an effective method to combine them in a probabilistic framework. We evaluate our tool on a large dataset consisting of 19,872 real-world binaries compiled for six major CPU architectures. The results are promising. FunProbe shows the best accuracy compared to five state-of-the-art tools we tested, while it takes only 6 seconds on average to analyze a single binary. Notably, FunProbe is 6× faster on average in identifying functions than XDA, a state-of-the-art deep-learning tool that leverages GPU in its inference phase.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
BigDataflow: A Distributed Interprocedural Dataflow Analysis Framework
Zewen Sun, Duanchen Xu, Yiyu Zhang, Yun Qi, Yueyang Wang, Zhiqiang Zuo, Zhaokang Wang, Yue Li, Xuandong Li, Qingda Lu, Wenwen Peng, and Shengjian Guo
(Nanjing University, China; Alibaba Group, USA; Alibaba Group, China; Baidu Research, USA)
Apart from forming the backbone of compiler optimization, static dataflow analysis has been widely applied in a vast variety of applications, such as bug detection, privacy analysis, program comprehension, etc. Despite its importance, performing interprocedural dataflow analysis on large-scale programs is well known to be challenging. In this paper, we propose a novel distributed analysis framework supporting the general interprocedural dataflow analysis. Inspired by large-scale graph processing, we devise a dedicated distributed worklist algorithm tailored for interprocedural dataflow analysis. We implement the algorithm and develop a distributed framework called BigDataflow running on a large-scale cluster. The experimental results validate the promising performance of BigDataflow: it can finish analyzing programs with millions of lines of code in minutes. Compared with the state-of-the-art, BigDataflow achieves much higher analysis efficiency.

Publisher's Version Published Artifact Artifacts Available

Empirical Studies II

Understanding the Topics and Challenges of GPU Programming by Classifying and Analyzing Stack Overflow Posts
Wenhua Yang, Chong Zhang, and Minxue Pan
(Nanjing University of Aeronautics and Astronautics, China; Nanjing University, China)
GPUs have cemented their position in computer systems, not restricted to graphics but also extensively used for general-purpose computing. With this comes a rapidly expanding population of developers using GPUs for programming. However, programming with GPUs is notoriously difficult due to their unique architecture and constant evolution. A large number of developers have encountered problems of one kind or another, and many of them have turned to Q&A sites for help. Unfortunately, there has been no prior work to comprehensively study the topics discussed and challenges encountered by developers in GPU programming. To fill this knowledge gap, we conduct a comprehensive study to understand the topics and challenges of GPU programming using Stack Overflow. We collect 25,269 relevant posts from Stack Overflow, propose a novel approach that combines automatic techniques and manual thematic analysis to extract topics, and build a taxonomy of topics with detailed discussions of the popularity, difficulty, and changing trends of these topics. In addition, we analyzed relevant posts through extensive manual efforts to understand the challenges of each topic and to summarize them for future research.

Publisher's Version
Software Architecture in Practice: Challenges and Opportunities
Zhiyuan Wan, Yun Zhang, Xin Xia, Yi Jiang, and David Lo
(Zhejiang University, China; Hangzhou City University, China; Huawei, China; Singapore Management University, Singapore)
Software architecture has been an active research field for nearly four decades, in which previous studies make significant progress such as creating methods and techniques and building tools to support software architecture practice. Despite past efforts, we have little understanding of how practitioners perform software architecture related activities, and what challenges they face. Through interviews with 32 practitioners from 21 organizations across three continents, we identified challenges that practitioners face in software architecture practice during software development and maintenance. We reported on common software architecture activities at software requirements, design, construction and testing, and maintenance stages, as well as corresponding challenges. Our study uncovers that most of these challenges center around management, documentation, tooling and process, and collects recommendations to address these challenges.

Publisher's Version

Models of Code and Documentation

On the Usage of Continual Learning for Out-of-Distribution Generalization in Pre-trained Language Models of Code
Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, and Houari Sahraoui
(Université de Montréal, Canada; Singapore Management University, Singapore)
Pre-trained language models (PLMs) have become a prevalent technique in deep learning for code, utilizing a two-stage pre-training and fine-tuning procedure to acquire general knowledge about code and specialize in a variety of downstream tasks. However, the dynamic nature of software codebases poses a challenge to the effectiveness and robustness of PLMs. In particular, real-world scenarios potentially lead to significant differences between the distribution of the pre-training and test data, i.e., distribution shift, resulting in a degradation of the PLM's performance on downstream tasks. In this paper, we stress the need for adapting PLMs of code to software data whose distribution changes over time, a crucial problem that has been overlooked in previous works. The motivation of this work is to consider the PLM in a non-stationary environment, where fine-tuning data evolves over time according to a software evolution scenario. Specifically, we design a scenario where the model needs to learn from a stream of programs containing new, unseen APIs over time. We study two widely used PLM architectures, i.e., a GPT2 decoder and a RoBERTa encoder, on two downstream tasks, API call and API usage prediction. We demonstrate that the most commonly used fine-tuning technique from prior work is not robust enough to handle the dynamic nature of APIs, leading to the loss of previously acquired knowledge, i.e., catastrophic forgetting. To address these issues, we implement five continual learning approaches, including replay-based and regularization-based methods. Our findings demonstrate that utilizing these straightforward methods effectively mitigates catastrophic forgetting in PLMs across both downstream tasks while achieving comparable or superior performance.

Publisher's Version Published Artifact Artifacts Available
Grace: Language Models Meet Code Edits
Priyanshu Gupta, Avishree Khare, Yasharth Bajpai, Saikat Chakraborty, Sumit Gulwani, Aditya Kanade, Arjun Radhakrishna, Gustavo Soares, and Ashish Tiwari
(Microsoft, India; University of Pennsylvania, USA; Microsoft Research, USA; Microsoft, USA; Microsoft Research, India)
Developers spend a significant amount of time editing code for a variety of reasons such as bug fixing or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research due to the diversity of code edits and the difficulty of capturing the developer intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) with the knowledge of relevant prior associated edits, which we call the Grace (Generation conditioned on Associated Code Edits) method. The generative capability of the LLMs helps address the diversity in code changes and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, Codex and CodeT5, in zero-shot and fine-tuning settings respectively. In our experiments with two datasets, Grace boosts the performance of the LLMs significantly, enabling them to generate 29% and 54% more correctly edited code in top-1 suggestions relative to the current state-of-the-art symbolic and neural approaches, respectively.

Publisher's Version Published Artifact Archive submitted (770 kB) Artifacts Available
Recommending Analogical APIs via Knowledge Graph Embedding
Mingwei Liu, Yanjun Yang, Yiling Lou, Xin Peng, Zhong Zhou, Xueying Du, and Tianyong Yang
(Fudan University, China)
Library migration, which replaces the current library with a different one to retain the same software behavior, is common in software evolution. An essential part of this is finding an analogous API for the desired functionality. However, due to the multitude of libraries/APIs, manually finding such an API is time-consuming and error-prone. Researchers created automated analogical API recommendation techniques, notably documentation-based methods. Despite potential, these methods have limitations, e.g., incomplete semantic understanding in documentation and scalability issues. In this study, we present KGE4AR, a novel documentation-based approach using knowledge graph (KG) embedding for recommending analogical APIs during library migration. KGE4AR introduces a unified API KG to comprehensively represent documentation knowledge, capturing high-level semantics. It further embeds this unified API KG into vectors for efficient, scalable similarity calculation. We assess KGE4AR with 35,773 Java libraries in two scenarios, with and without target libraries. KGE4AR notably outperforms state-of-the-art techniques (e.g., 47.1%-143.0% and 11.7%-80.6% MRR improvements), showcasing scalability with growing library counts.
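
The MRR (mean reciprocal rank) figures quoted in this abstract use a standard ranking metric: for each query, take the reciprocal of the rank at which the first correct item appears (0 if it never appears), then average over all queries. The sketch below shows the computation on made-up data; the API names in the example are purely illustrative and are not taken from the paper's evaluation.

```python
def mean_reciprocal_rank(ranked_lists, correct_items):
    """Average of 1/rank of the first correct item per query (0 if absent)."""
    total = 0.0
    for ranking, correct in zip(ranked_lists, correct_items):
        for position, item in enumerate(ranking, start=1):
            if item in correct:
                total += 1.0 / position
                break
    return total / len(ranked_lists)


# Three hypothetical queries, each with a ranked list of recommended APIs.
rankings = [
    ["Gson.toJson", "JSONObject.put", "ObjectMapper.readValue"],
    ["Logger.debug", "Logger.info", "Log.d"],
    ["List.add", "Set.add", "Map.put"],
]
# The analogical APIs considered correct for each query.
truth = [{"Gson.toJson"}, {"Logger.info"}, {"Queue.offer"}]

mrr = mean_reciprocal_rank(rankings, truth)
print(mrr)  # (1/1 + 1/2 + 0) / 3 = 0.5
```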

Publisher's Version Info
CCT5: A Code-Change-Oriented Pre-trained Model
Bo Lin, Shangwen Wang, Zhongxin Liu, Yepang Liu, Xin Xia, and Xiaoguang Mao
(National University of Defense Technology, China; Zhejiang University, China; Southern University of Science and Technology, China)
Software is constantly changing, requiring developers to perform several derived tasks in a timely manner, such as writing a description for the intention of the code change, or identifying the defect-prone code changes. Considering that the cost of dealing with these tasks can account for a large proportion (typically around 70 percent) of the total development expenditure, automating such processes will significantly lighten the burdens of developers. To achieve such a target, existing approaches mainly rely on training deep learning models from scratch or fine-tuning existing pre-trained models on such tasks, both of which have weaknesses. Specifically, the former uses comparatively small-scale labelled data for training, making it difficult to learn and exploit the domain knowledge of programming language hidden in the large-amount unlabelled code in the wild; the latter is hard to fully leverage the learned knowledge of the pre-trained model, as existing pre-trained models are designed to encode a single code snippet rather than a code change (the difference between two code snippets). We propose to pre-train a model specially designed for code changes to better support developers in software maintenance. To this end, we first collect a large-scale dataset containing 1.5M+ pairwise data of code changes and commit messages. Based on these data, we curate five different tasks for pre-training, which equip the model with diverse domain knowledge about code changes. We fine-tune the pre-trained model, CCT5, on three widely-studied tasks incurred by code changes and two tasks specific to the code review process. Results show that CCT5 outperforms both conventional deep learning approaches and existing pre-trained models on these tasks.

Publisher's Version Published Artifact Artifacts Available

Machine Learning V

LExecutor: Learning-Guided Execution
Beatriz Souza and Michael Pradel
(University of Stuttgart, Germany)
Executing code is essential for various program analysis tasks, e.g., to detect bugs that manifest through exceptions or to obtain execution traces for further dynamic analysis. However, executing an arbitrary piece of code is often difficult in practice, e.g., because of missing variable definitions, missing user inputs, and missing third-party dependencies. This paper presents LExecutor, a learning-guided approach for executing arbitrary code snippets in an underconstrained way. The key idea is to let a neural model predict missing values that otherwise would cause the program to get stuck, and to inject these values into the execution. For example, LExecutor injects likely values for otherwise undefined variables and likely return values of calls to otherwise missing functions. We evaluate the approach on Python code from popular open-source projects and on code snippets extracted from Stack Overflow. The neural model predicts realistic values with an accuracy between 79.5% and 98.2%, allowing LExecutor to closely mimic real executions. As a result, the approach successfully executes significantly more code than any available technique, such as simply executing the code as-is. For example, executing the open-source code snippets as-is covers only 4.1% of all lines, because the code crashes early on, whereas LExecutor achieves a coverage of 51.6%.
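
The key idea in this abstract, injecting a predicted value whenever the execution would otherwise get stuck on a missing definition, can be illustrated with a small sketch. It is not LExecutor's implementation: the neural model is replaced by a hypothetical name-based heuristic (`predict_value`), only undefined variables are handled, and the snippet is simply re-executed from the start after each injection, so a snippet with side effects would repeat them. All function names are assumptions made for this example.

```python
import re

_UNDEFINED_NAME = re.compile(r"name '(\w+)' is not defined")


def predict_value(name):
    """Hypothetical stand-in for a learned model: guess a value from the name."""
    if name.startswith(("is_", "has_")):
        return True
    if name.endswith(("s", "list", "items")):
        return []
    return 0


def run_underconstrained(source, max_injections=20):
    """Execute `source`, injecting guessed values for undefined variables."""
    injected = {}
    for _ in range(max_injections + 1):
        env = dict(injected)  # fresh environment seeded with the guesses so far
        try:
            exec(source, env)
            return env, injected
        except NameError as err:
            match = _UNDEFINED_NAME.search(str(err))
            if match is None or match.group(1) in injected:
                raise  # not a missing global we know how to repair
            injected[match.group(1)] = predict_value(match.group(1))
    raise RuntimeError("gave up: too many missing names")


snippet = "count = len(items) + offset\n"
env, injected = run_underconstrained(snippet)
print(injected, env["count"])
```

Executed as-is, the snippet fails on its only line because `items` is undefined. With injection, the two missing names receive guessed values and the line runs to completion.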

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Software Architecture Recovery with Information Fusion
Yiran Zhang, Zhengzi Xu, Chengwei Liu, Hongxu Chen, Jianwen Sun, Dong Qiu, and Yang Liu
(Nanyang Technological University, Singapore; Huawei Technologies, China)
Understanding the architecture is vital for effectively maintaining and managing large software systems. However, as software systems evolve over time, their architectures inevitably change. To keep up with the change, architects need to track the implementation-level changes and update the architectural documentation accordingly, which is time-consuming and error-prone. Therefore, many automatic architecture recovery techniques have been proposed to ease this process. Although efforts have been made to improve the accuracy of architecture recovery, existing solutions still suffer from two limitations. First, most of them only use one or two types of information for the recovery, ignoring the potential usefulness of other sources. Second, they tend to use the information in a coarse-grained manner, overlooking important details within it.
To address these limitations, we propose SARIF, a fully automated architecture recovery technique, which incorporates three types of comprehensive information, including dependencies, code text and folder structure. SARIF can recover architecture more accurately by thoroughly analyzing the details of each type of information and adaptively fusing them based on their relevance and quality. To evaluate SARIF, we collected six projects with published ground-truth architectures and three open-source projects labeled by our industrial collaborators. We compared SARIF with nine state-of-the-art techniques using three commonly-used architecture similarity metrics and two new metrics. The experimental results show that SARIF is 36.1% more accurate than the best of the previous techniques on average. By providing comprehensive architecture, SARIF can help users understand systems effectively and reduce the manual effort of obtaining ground-truth architectures.

Publisher's Version
Evaluating Transfer Learning for Simplifying GitHub READMEs
Haoyu Gao, Christoph Treude, and Mansooreh Zahedi
(University of Melbourne, Australia)
Software documentation captures detailed knowledge about a software product, e.g., code, technologies, and design. It plays an important role in the coordination of development teams and in conveying ideas to various stakeholders. However, software documentation can be hard to comprehend if it is written with jargon and complicated sentence structure. In this study, we explored the potential of text simplification techniques in the domain of software engineering to automatically simplify GitHub README files. We collected software-related pairs of GitHub README files consisting of 14,588 entries, aligned difficult sentences with their simplified counterparts, and trained a Transformer-based model to automatically simplify difficult versions. To mitigate the sparse and noisy nature of the software-related simplification dataset, we applied general text simplification knowledge to this field. Since many general-domain difficult-to-simple Wikipedia document pairs are already publicly available, we explored the potential of transfer learning by first training the model on the Wikipedia data and then fine-tuning it on the README data. Using automated BLEU scores and human evaluation, we compared the performance of different transfer learning schemes and the baseline models without transfer learning. The transfer learning model using the best checkpoint trained on a general topic corpus achieved the best performance, with a BLEU score of 34.68 and statistically significantly higher human annotation scores compared to the rest of the schemes and baselines. We conclude that transfer learning is a promising direction for circumventing the lack of data and the style drift problem in simplifying software README files, and that it achieves a better trade-off between simplification and preservation of meaning.

Publisher's Version
CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models
Zhensu Sun, Xiaoning Du, Fu Song, and Li Li
(Beihang University, China; Monash University, Australia; Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Automotive Software Innovation Center, China)
Code datasets are of immense value for training neural-network-based code completion models, where companies or organizations have made substantial investments to establish and process these datasets. Unfortunately, these datasets, either built for proprietary or public usage, face the high risk of unauthorized exploits, resulting from data leakages, license violations, etc. Even worse, the "black-box" nature of neural models sets a high barrier for outsiders to audit their training datasets, which further abets such unauthorized usage. Currently, watermarking methods have been proposed to prohibit inappropriate usage of image and natural language datasets. However, due to domain specificity, they are not directly applicable to code datasets, leaving the copyright protection of this emerging and important field of code data still exposed to threats. To fill this gap, we propose a method, named CodeMark, to embed user-defined imperceptible watermarks into code datasets to trace their usage in training neural code completion models. CodeMark is based on adaptive semantic-preserving transformations, which preserve the exact functionality of the code data and keep the changes covert against rule-breakers. We implement CodeMark in a toolkit and conduct an extensive evaluation of code completion models. CodeMark is validated to fulfill all desired properties of practical watermarks, including harmlessness to model accuracy, verifiability, robustness, and imperceptibility.

Publisher's Version

Security II

Mate! Are You Really Aware? An Explainability-Guided Testing Framework for Robustness of Malware Detectors
Ruoxi Sun, Minhui Xue, Gareth Tyson, Tian Dong, Shaofeng Li, Shuo Wang, Haojin Zhu, Seyit Camtepe, and Surya Nepal
(CSIRO’s Data61, Australia; Cybersecurity CRC, Australia; Hong Kong University of Science and Technology, China; Shanghai Jiao Tong University, China; Peng Cheng Laboratory, China)
Numerous open-source and commercial malware detectors are available. However, their efficacy is threatened by new adversarial attacks, whereby malware attempts to evade detection, e.g., by performing feature-space manipulation. In this work, we propose an explainability-guided and model-agnostic testing framework for robustness of malware detectors when confronted with adversarial attacks. The framework introduces the concept of Accrued Malicious Magnitude (AMM) to identify which malware features could be manipulated to maximize the likelihood of evading detection. We then use this framework to test several state-of-the-art malware detectors' ability to detect manipulated malware. We find that (i) commercial antivirus engines are vulnerable to AMM-guided test cases; (ii) the ability of a manipulated malware generated using one detector to evade detection by another detector (i.e., transferability) depends on the overlap of features with large AMM values between the different detectors; and (iii) AMM values effectively measure the fragility of features (i.e., capability of feature-space manipulation to flip the prediction results) and explain the robustness of malware detectors facing evasion attacks. Our findings shed light on the limitations of current malware detectors, as well as how they can be improved.

Publisher's Version Published Artifact Artifacts Available
Crystallizer: A Hybrid Path Analysis Framework to Aid in Uncovering Deserialization Vulnerabilities
Prashast Srivastava, Flavio Toffalini, Kostyantyn Vorobyov, François Gauthier, Antonio Bianchi, and Mathias Payer
(Purdue University, USA; EPFL, Switzerland; Oracle Labs, Australia)
Applications use serialization and deserialization to exchange data. Serialization allows developers to exchange messages or perform remote method invocation in distributed applications. However, the application logic itself is responsible for security. Adversaries may abuse bugs in the deserialization logic to forcibly invoke attacker-controlled methods by crafting malicious bytestreams (payloads). Crystallizer presents a novel hybrid framework to automatically uncover deserialization vulnerabilities by combining static and dynamic analyses. Our intuition is to first over-approximate possible payloads through static analysis (to constrain the search space). Then, we use dynamic analysis to instantiate concrete payloads as a proof-of-concept of a vulnerability (giving the analyst concrete examples of possible attacks). Our proof-of-concept focuses on Java deserialization as the imminent domain of such attacks. We evaluate our prototype on seven popular Java libraries against state-of-the-art frameworks for uncovering gadget chains. In contrast to existing tools, we uncovered 41 previously unknown exploitable chains. Furthermore, we show the real-world security impact of Crystallizer by using it to synthesize gadget chains to mount RCE and DoS attacks on three popular Java applications. We have responsibly disclosed all newly discovered vulnerabilities.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
ViaLin: Path-Aware Dynamic Taint Analysis for Android
Khaled Ahmed, Yingying Wang, Mieszko Lis, and Julia Rubin
(University of British Columbia, Canada)
Dynamic taint analysis, a program analysis technique that checks whether information flows between particular source and sink locations in the program, has numerous applications in security, program comprehension, and software testing. Specifically, in mobile software, taint analysis is often used to determine whether mobile apps contain stealthy behaviors that leak user-sensitive information to unauthorized third-party servers. While a number of dynamic taint analysis techniques for Android software have been recently proposed, none of them can report the complete information propagation path; they report only flow endpoints, i.e., sources and sinks of the detected information flows. This design optimizes for runtime performance and allows the techniques to run efficiently on a mobile device. Yet, it impedes the applicability and usefulness of the techniques: an analyst using the tool would need to manually identify information propagation paths, e.g., to determine whether information was properly handled before being released, which is a challenging task in large real-world applications.
In this paper, we address this problem by proposing a dynamic taint analysis technique that reports accurate taint propagation paths. We implement it in a tool, ViaLin, and evaluate it on a set of existing benchmark applications and on 16 large Android applications from the Google Play store. Our evaluation shows that ViaLin accurately detects taint flow paths while running on a mobile device with a reasonable time and memory overhead.
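As a rough illustration of what "path-aware" means here, the following toy sketch carries an ordered list of program locations along with each tainted value, so that a sink can report the whole propagation path instead of only the source and sink. It is not ViaLin's implementation; the `Tainted` wrapper, the helper functions, and all location names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tainted:
    """A value plus the ordered program locations its taint has passed through."""
    value: str
    path: tuple = ()


def source(value, location):
    # Mark a value as tainted at the location where it enters the program.
    return Tainted(value, (location,))


def propagate(data, location, transform=lambda v: v):
    # Apply an operation; tainted inputs keep their path, extended by this location.
    if isinstance(data, Tainted):
        return Tainted(transform(data.value), data.path + (location,))
    return transform(data)


def sink(data, location):
    # Return the full propagation path if tainted data reaches the sink, else None.
    if isinstance(data, Tainted):
        return list(data.path + (location,))
    return None


def demo():
    # Hypothetical flow: a device identifier is encoded, wrapped, and sent out.
    device_id = source("IMEI-123", "TelephonyManager.getDeviceId")
    encoded = propagate(device_id, "Utils.encode", lambda v: v[::-1])
    wrapped = propagate(encoded, "Request.build", lambda v: "id=" + v)
    return sink(wrapped, "HttpClient.post")
```

An endpoint-only analysis would report just the first and last entries of the list returned by `demo()`; the intermediate entries are what lets an analyst check how the data was handled on the way.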

Publisher's Version
Distinguishing Look-Alike Innocent and Vulnerable Code by Subtle Semantic Representation Learning and Explanation
Chao Ni, Xin Yin, Kaiwen Yang, Dehai Zhao, Zhenchang Xing, and Xin Xia
(Zhejiang University, China; Australian National University, Australia; CSIRO’s Data61, Australia; Huawei, China)
Though many deep learning (DL)-based vulnerability detection approaches have been proposed and have indeed achieved remarkable performance, they still have limitations in generalization as well as in practical usage. More precisely, existing DL-based approaches (1) perform poorly on prediction tasks among functions that are lexically similar but have contrary semantics; (2) provide no intuitive developer-oriented explanations of the detected results.
In this paper, we propose a novel approach named SVulD, a function-level Subtle semantic embedding for Vulnerability Detection along with intuitive explanations, to alleviate the above limitations. Specifically, SVulD first trains a model to learn distinguishing semantic representations of functions regardless of their lexical similarity. Then, for the detected vulnerable functions, SVulD provides natural language explanations (e.g., root cause) of results to help developers intuitively understand the vulnerabilities. To evaluate the effectiveness of SVulD, we conduct large-scale experiments on a widely used practical vulnerability dataset and compare it with four state-of-the-art (SOTA) approaches by considering five performance measures. The experimental results indicate that SVulD outperforms all SOTAs with a substantial improvement (i.e., 23.5%-68.0% in terms of F1-score, 15.9%-134.8% in terms of PR-AUC and 7.4%-64.4% in terms of Accuracy). Besides, we conduct a user-case study to evaluate the usefulness of SVulD for developers in understanding the vulnerable code, and the participants’ feedback demonstrates that SVulD is helpful for development practice.

Publisher's Version

Industry

Industry Papers

A Unified Framework for Mini-game Testing: Experience on WeChat
Chaozheng Wang, Haochuan Lu, Cuiyun Gao, Zongjie Li, Ting Xiong, and Yuetang Deng
(Chinese University of Hong Kong, China; Tencent, China; Hong Kong University of Science and Technology, China)
Mobile games play an increasingly important role in our daily life. The quality of mobile games can substantially affect the user experience and game revenue. Different from traditional mobile games, the mini-games provided by our partner, Tencent, are embedded in the mobile app WeChat, so users do not need to install specific game apps and can directly play the games in the app. Due to the convenient installation, WeChat has attracted large numbers of developers to design and publish on the mini-game platform in the app. To date, the platform has more than one hundred thousand published mini-games. Manually testing all the mini-games requires enormous effort and is impractical. Automated game testing methods exist; however, they are difficult to apply to mini-games for the following reasons: 1) Effective game testing heavily relies on prior knowledge about game operations and extraction of GUI widget trees. However, this knowledge is specific and not always applicable when testing a large number of mini-games with complex game engines (e.g., Unity). 2) The highly diverse GUI widget design of mini-games deviates significantly from that of mobile apps. This issue prevents the existing image-based GUI widget detection techniques from effectively detecting widgets in mini-games.
To address the aforementioned issues, we propose a unified framework for black-box mini-game testing named iExplorer. iExplorer involves a mixed GUI widget detection approach incorporating both deep learning-based object detection and edge aggregation-based segmentation for detecting GUI widgets in mini-games. A category-aware testing strategy is then proposed for testing mini-games, with different categories of widgets (e.g., sliding and clicking widgets) considered. iExplorer has been deployed for more than six months. In the past 30 days, iExplorer has tested mini-games at large scale (i.e., 76,000) and successfully found 22,144 real bugs.

Publisher's Version
Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection
Haotian Si, Changhua Pei, Zhihan Li, Yadong Zhao, Jingjing Li, Haiming Zhang, Zulong Diao, Jianhui Li, Gaogang Xie, and Dan Pei
(Computer Network Information Center at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Kuaishou Technology, China; Institute of Computing Technology at Chinese Academy of Sciences, China; Purple Mountain Laboratories, China; Tsinghua University, China)
Massive key performance indicators (KPIs) are monitored as multivariate time series data (MTS) to ensure the reliability of the software applications and service system. Accurately detecting the abnormality of MTS is very critical for subsequent fault elimination. The scarcity of anomalies and manual labeling has led to the development of various self-supervised MTS anomaly detection (AD) methods, which optimize an overall objective/loss encompassing all metrics' regression objectives/losses. However, our empirical study uncovers the prevalence of conflicts among metrics' regression objectives, causing MTS models to grapple with different losses. This critical aspect significantly impacts detection performance but has been overlooked in existing approaches. To address this problem, by mimicking the design of multi-gate mixture-of-experts (MMoE), we introduce CAD, a Conflict-aware multivariate KPI Anomaly Detection algorithm. CAD offers an exclusive structure for each metric to mitigate potential conflicts while fostering inter-metric promotions. Upon thorough investigation, we find that the poor performance of vanilla MMoE mainly comes from the input-output misalignment settings of MTS formulation and convergence issues arising from expansive tasks. To address these challenges, we propose a straightforward yet effective task-oriented metric selection and p&s (personalized and shared) gating mechanism, which establishes CAD as the first practicable multi-task learning (MTL) based MTS AD model. Evaluations on multiple public datasets reveal that CAD obtains an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods. Our code is accessible at https://github.com/dawnvince/MTS_CAD.

Publisher's Version Info
InferFix: End-to-End Program Repair with LLMs
Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy
(Microsoft, USA; University of California at Los Angeles, USA; Microsoft Research, China)
The software development life cycle is profoundly influenced by bugs; their introduction, identification, and eventual resolution account for a significant portion of software development cost. This has motivated software engineering researchers and practitioners to propose different approaches for automating the identification and repair of software defects.
Large Language Models (LLMs) have been adapted to the program repair task through few-shot demonstration learning and instruction prompting, treating this as an infilling task. However, these models have only focused on learning general bug-fixing patterns for uncategorized bugs mined from public repositories. In this paper, we propose InferFix: a transformer-based program repair framework paired with a state-of-the-art static analyzer to fix critical security and performance bugs. InferFix combines a Retriever, a transformer encoder model pretrained via a contrastive learning objective that searches for semantically equivalent bugs and corresponding fixes, and a Generator, an LLM (12 billion parameter Codex Cushman model) finetuned on supervised bug-fix data with prompts augmented by adding bug type annotations and semantically similar fixes retrieved from an external non-parametric memory.
To train and evaluate our approach, we curated a novel, metadata-rich dataset of bugs extracted by executing the Infer static analyzer on the change histories of thousands of Java and C# repositories. Our evaluation demonstrates that InferFix outperforms strong LLM baselines, with a top-1 accuracy of 65.6% for generating fixes in C# and 76.8% in Java. We discuss the deployment of InferFix alongside Infer at Microsoft, which offers an end-to-end solution for detection, classification, and localization of bugs, as well as fixing and validation of candidate patches, integrated in the continuous integration (CI) pipeline to automate the software development workflow.

Publisher's Version
Assess and Summarize: Improve Outage Understanding with Large Language Models
Pengxiang Jin, Shenglin Zhang, Minghua Ma, Haozhe Li, Yu Kang, Liqun Li, Yudong Liu, Bo Qiao, Chaoyun Zhang, Pu Zhao, Shilin He, Federica Sarro, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang
(Nankai University, China; Microsoft, China; Peking University, China; University College London, UK)
Cloud systems have become increasingly popular in recent years due to their flexibility and scalability. Each time cloud computing applications and services hosted on the cloud are affected by a cloud outage, users can experience slow response times, connection issues or total service disruption, resulting in a significant negative business impact. Outages usually comprise several concurring events/source causes, and therefore understanding the context of outages is a very challenging yet crucial first step toward mitigating and resolving outages. In current practice, on-call engineers with in-depth domain knowledge have to manually assess and summarize outages when they happen, which is time-consuming and labor-intensive. In this paper, we first present a large-scale empirical study investigating the way on-call engineers currently deal with cloud outages at Microsoft, and then present and empirically validate a novel approach (dubbed Oasis) to help the engineers in this task. Oasis is able to automatically assess the impact scope of outages as well as to produce human-readable summarization. Specifically, Oasis first assesses the impact scope of an outage by aggregating relevant incidents via multiple techniques. Then, it generates a human-readable summary by leveraging fine-tuned large language models like GPT-3.x. The impact assessment component of Oasis was introduced in Microsoft over three years ago, and it is now widely adopted, while the outage summarization component has been recently introduced. In this article, we present the results of an empirical evaluation we carried out on 18 real-world cloud systems as well as a human-based evaluation with outage owners. The results obtained show that Oasis can effectively and efficiently summarize outages, and led Microsoft to deploy its first prototype, which is currently under experimental adoption by some of the incident teams.

Publisher's Version
Understanding Hackers’ Work: An Empirical Study of Offensive Security Practitioners
Andreas Happe and Jürgen Cito
(TU Wien, Austria)
Offensive security-tests are commonly employed to pro-actively discover potential vulnerabilities. They are performed by specialists, also known as penetration-testers or white-hat hackers. The chronic lack of available white-hat hackers prevents sufficient security test coverage of software. Research into automation tries to alleviate this problem by improving the efficiency of security testing. To achieve this, researchers and tool builders need a solid understanding of how hackers work, their assumptions, and pain points.
In this paper, we present a first data-driven exploratory qualitative study of twelve security professionals, their work and problems occurring therein. We perform a thematic analysis to gain insights into the execution of security assignments, hackers' thought processes and encountered challenges. This analysis allows us to conclude with recommendations for researchers and tool builders, to increase the efficiency of their automation and identify novel areas for research.

Publisher's Version Published Artifact Artifacts Available
Towards Efficient Record and Replay: A Case Study in WeChat
Sidong Feng, Haochuan Lu, Ting Xiong, Yuetang Deng, and Chunyang Chen
(Monash University, Australia; Tencent, China)
WeChat, a widely-used messenger app boasting over 1 billion monthly active users, requires effective app quality assurance for its complex features. Record-and-replay tools are crucial in achieving this goal. Despite the extensive development of these tools, the impact of waiting time between replay events has been largely overlooked. On one hand, a long waiting time for executing replay events on fully-rendered GUIs slows down the process. On the other hand, a short waiting time can lead to events executing on partially-rendered GUIs, negatively affecting replay effectiveness. An optimal waiting time should strike a balance between effectiveness and efficiency. We introduce WeReplay, a lightweight image-based approach that dynamically adjusts inter-event time based on the GUI rendering state. Given the real-time streaming on the GUI, WeReplay employs a deep learning model to infer the rendering state and synchronize with the replaying tool, scheduling the next event when the GUI is fully rendered. Our evaluation shows that our model achieves 92.1% precision and 93.3% recall in discerning GUI rendering states in the WeChat app. Through assessing the performance in replaying 23 common WeChat usage scenarios, WeReplay successfully replays all scenarios on the same and different devices more efficiently than the state-of-the-practice baselines.
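The scheduling idea described above (dispatch the next event only once the GUI is judged fully rendered, rather than after a fixed delay) can be sketched as a small polling loop. This is a minimal sketch, not WeReplay itself: `capture_frame`, `is_fully_rendered`, and `dispatch` are hypothetical callables standing in for the screen stream, the deep learning model, and the replay tool, and the timeout and polling interval are arbitrary assumptions.

```python
import time


def wait_until_rendered(capture_frame, is_fully_rendered, timeout_s=5.0,
                        poll_s=0.05, clock=time.monotonic, sleep=time.sleep):
    """Poll frames until the classifier reports a fully rendered GUI.

    Returns True once a frame is classified as fully rendered, or False if
    the timeout expires first.
    """
    deadline = clock() + timeout_s
    while True:
        if is_fully_rendered(capture_frame()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def replay(events, dispatch, capture_frame, is_fully_rendered, **wait_kwargs):
    """Dispatch each recorded event only after the GUI is judged fully rendered.

    Events whose wait timed out are not dispatched; they are returned so the
    caller can decide how to handle them.
    """
    skipped = []
    for event in events:
        if wait_until_rendered(capture_frame, is_fully_rendered, **wait_kwargs):
            dispatch(event)
        else:
            skipped.append(event)
    return skipped
```

Compared with a fixed inter-event delay, the wait here is as short as the rendering allows and as long as it requires, which is the trade-off between efficiency and effectiveness that the abstract describes.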

Publisher's Version
Last Diff Analyzer: Multi-language Automated Approver for Behavior-Preserving Code Revisions
Yuxin Wang, Adam Welc, Lazaro Clapp, and Lingchao Chen
(Uber Technologies, USA; Mysten Labs, USA)
Code review is a crucial step in ensuring the quality and maintainability of software systems. However, this process can be time-consuming and resource-intensive, especially in large-scale projects where a significant number of code changes are submitted every day. Fortunately, not all code changes require human reviews, as some may only contain syntactic modifications that do not alter the behavior of the system, such as format changes, variable / function renamings, and constant extractions.
In this paper, we propose a multi-language automated code approver, Last Diff Analyzer, for Go and Java, which is able to detect whether a reviewable incremental unit of code change (diff) contains only changes that do not modify system behavior. It is built on top of a novel multi-language static analysis framework that unifies common features of multiple languages while keeping unique language constructs separate. This makes it easy to extend to other languages such as TypeScript, Kotlin, and Swift. Besides skipping unnecessary code reviews, Last Diff Analyzer could be further applied to skip certain resource-intensive end-to-end (E2E) tests for auto-approved diffs, significantly reducing resource usage. We have deployed the analyzer at scale within Uber, and data collected in production shows that approximately 15% of analyzed diffs are auto-approved weekly for code reviews. Furthermore, a 13.5% reduction in server node usage dedicated to E2E tests (measured by number of executed E2E tests) is observed as a result of skipping E2E tests, compared to the node usage if Last Diff Analyzer were not enabled.
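To make the notion of a behavior-preserving diff concrete, here is a deliberately narrow sketch that uses Python's standard `ast` module (not the Go/Java framework described above). It accepts only changes that leave the parsed syntax tree identical, i.e., whitespace, comment, and redundant-parenthesis changes. Renamings and constant extractions, which the abstract also mentions, are outside what this sketch handles, and the function name is hypothetical.

```python
import ast


def is_formatting_only_change(before: str, after: str) -> bool:
    """Return True if two versions of a Python source parse to the same AST.

    ast.dump omits line and column attributes by default, so layout,
    comments, and redundant parentheses do not affect the comparison.
    """
    try:
        old_tree = ast.parse(before)
        new_tree = ast.parse(after)
    except SyntaxError:
        # A version that does not parse is never auto-approved.
        return False
    return ast.dump(old_tree) == ast.dump(new_tree)
```

For example, reformatting `return a+b` to `return (a + b)` and adding a comment would pass this check, while changing `+` to `-` would not.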

Publisher's Version Published Artifact Artifacts Available
Dead Code Removal at Meta: Automatically Deleting Millions of Lines of Code and Petabytes of Deprecated Data
Will Shackleton, Katriel Cohn-Gordon, Peter C. Rigby, Rui Abreu, James Gill, Nachiappan Nagappan, Karim Nakad, Ioannis Papagiannis, Luke Petre, Giorgi Megreli, Patrick Riggs, and James Saindon
(Meta, USA; Concordia University, Canada)
Software constantly evolves in response to user needs: new features are built, deployed, mature and grow old, and eventually their usage drops enough to merit switching them off. In any large codebase, this feature lifecycle can naturally lead to retaining unnecessary code and data. Removing these respects users’ privacy expectations, as well as helping engineers to work efficiently. In prior software engineering research, we have found little evidence of code deprecation or dead-code removal at industrial scale. We describe Systematic Code and Asset Removal Framework (SCARF), a product deprecation system to assist engineers working in large codebases. SCARF identifies unused code and data assets and safely removes them. It operates fully automatically, including committing code and dropping database tables. It also gathers developer input where it cannot take automated actions, leading to further removals. Dead code removal increases the quality and consistency of large codebases, aids with knowledge management and improves reliability. SCARF has had an important impact at Meta. In the last year alone, it has removed petabytes of data across 12.8 million distinct assets, and deleted over 104 million lines of code.

Publisher's Version
Incrementalizing Production CodeQL Analyses
Tamás Szabó
(GitHub, Germany)
Instead of repeatedly re-analyzing from scratch, an incremental static analysis only analyzes a codebase once completely, and then it updates the previous results based on the code changes. While this sounds promising to achieve speed-ups, the reality is that sophisticated static analyses typically employ features that can ruin incremental performance, such as inter-procedurality or context-sensitivity. In this study, we set out to explore whether incrementalization can help to achieve speed-ups for production CodeQL analyses that provide automated feedback on pull requests on GitHub. We first empirically validate the idea by measuring the potential for reuse on real-world codebases, and then we create a prototype incremental solver for CodeQL that exploits incrementality. We report on experimental results showing that we can indeed achieve update times proportional to the size of the code change, and we also discuss the limitations of our prototype.

Publisher's Version
xASTNN: Improved Code Representations for Industrial Practice
Zhiwei Xu, Min Zhou, Xibin Zhao, Yang Chen, Xi Cheng, and Hongyu Zhang
(Tsinghua University, China; Fudan University, China; VMware, China; Chongqing University, China)
The application of deep learning techniques in software engineering is becoming increasingly popular. One key problem is developing high-quality and easy-to-use source code representations for code-related tasks. The research community has achieved impressive results in recent years. However, due to deployment difficulties and performance bottlenecks, these approaches are seldom applied in industry. In this paper, we present xASTNN, an eXtreme Abstract Syntax Tree (AST)-based Neural Network for source code representation, aiming to push this technique to industrial practice. The proposed xASTNN has three advantages. First, xASTNN is completely based on widely-used ASTs and does not require complicated data pre-processing, making it applicable to various programming languages and practical scenarios. Second, three closely-related designs are proposed to guarantee the effectiveness of xASTNN, including statement subtree sequence for code naturalness, gated recursive unit for syntactical information, and gated recurrent unit for sequential information. Third, a dynamic batching algorithm is introduced to significantly reduce the time complexity of xASTNN. Two code comprehension downstream tasks, code classification and code clone detection, are adopted for evaluation. The results demonstrate that our xASTNN can improve the state-of-the-art while being faster than the baselines.

Publisher's Version
From Point-wise to Group-wise: A Fast and Accurate Microservice Trace Anomaly Detection Approach
Zhe Xie, Changhua Pei, Wanxue Li, Huai Jiang, Liangfei Su, Jianhui Li, Gaogang Xie, and Dan Pei
(Tsinghua University, China; Computer Network Information Center at Chinese Academy of Sciences, China; eBay, China)
As Internet applications continue to scale up, microservice architecture has become increasingly popular due to its flexibility and logical structure. Anomaly detection in traces that record inter-microservice invocations is essential for diagnosing system failures. Deep learning-based approaches allow for accurate modeling of structural features (i.e., call paths) and latency features (i.e., call response time), which can determine the anomaly of a particular trace sample. However, the point-wise manner employed by these methods results in substantial system detection overhead and impracticality, given the massive volume of traces (billion-level). Furthermore, the point-wise approach lacks high-level information, as identical sub-structures across multiple traces may be encoded differently. In this paper, we introduce the first Group-wise Trace anomaly detection algorithm, named GTrace. This method categorizes the traces into distinct groups based on their shared sub-structure, such as the entire tree or sub-tree structure. A group-wise Variational AutoEncoder (VAE) is then employed to obtain structural representations. Moreover, the innovative "predicting latency with structure" learning paradigm facilitates the association between the grouped structure and the latency distribution within each group. With the group-wise design, representation caching and batched inference strategies can be implemented, which significantly reduce the burden of detection on the system. Our comprehensive evaluation reveals that GTrace outperforms state-of-the-art methods in both performance (2.64% to 195.45% improvement in AUC metrics and 2.31% to 40.92% improvement in best F-Score) and efficiency (21.9x to 28.2x speedup). We have deployed and assessed the proposed algorithm on eBay's microservices cluster, and our code is available at https://github.com/NetManAIOps/GTrace.git.

Publisher's Version
STEAM: Observability-Preserving Trace Sampling
Shilin He, Botao Feng, Liqun Li, Xu Zhang, Yu Kang, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang
(Microsoft Research, Beijing, China; Microsoft, Beijing, China; Microsoft, USA)
In distributed systems and microservice applications, tracing is a crucial observability signal employed for comprehending their internal states. To mitigate the overhead associated with distributed tracing, most tracing frameworks utilize a uniform sampling strategy, which retains only a subset of traces. However, this approach is insufficient for preserving system observability. This is primarily attributed to the long-tail distribution of traces in practice, which results in the omission or rarity of minority yet critical traces after sampling. In this study, we introduce an observability-preserving trace sampling method, denoted as STEAM, which aims to retain as much information as possible in the sampled traces. We employ Graph Neural Networks (GNN) for trace representation, while incorporating domain knowledge of trace comparison through logical clauses. Subsequently, we employ a scalable approach to sample traces, emphasizing mutually dissimilar traces. STEAM has been implemented on top of OpenTelemetry, comprising approximately 1.6K lines of Golang code and 2K lines of Python code. Evaluation on four benchmark microservice applications and a production system demonstrates the superior performance of our approach compared to baseline methods. Furthermore, STEAM is capable of processing 15,000 traces in approximately 4 seconds.
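The abstract does not spell out the sampling algorithm, so the following is only a stand-in for the idea of "emphasizing mutually dissimilar traces": greedy farthest-point sampling over trace embeddings. The embeddings are assumed to be given (in STEAM they would come from the GNN-based representation), and the Euclidean distance and function names are assumptions made for illustration.

```python
import math


def distance(a, b):
    # Euclidean distance between two embedding vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sample_dissimilar(embeddings, budget):
    """Greedy farthest-point sampling; returns indices of the selected traces.

    Starts from the first trace, then repeatedly adds the trace that is
    farthest from its nearest already-selected trace.
    """
    if budget <= 0 or not embeddings:
        return []
    chosen = [0]
    # Distance from every trace to its nearest selected trace so far.
    nearest = [distance(e, embeddings[0]) for e in embeddings]
    while len(chosen) < min(budget, len(embeddings)):
        candidates = [i for i in range(len(embeddings)) if i not in chosen]
        idx = max(candidates, key=lambda i: nearest[i])
        chosen.append(idx)
        nearest = [min(d, distance(e, embeddings[idx]))
                   for d, e in zip(nearest, embeddings)]
    return chosen
```

Under uniform sampling, a rare trace far from the bulk of the data is likely to be dropped; a selection rule of this kind picks it early because it is far from everything already selected, which is the long-tail concern raised above.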

Publisher's Version
TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems
Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Xiaomin Wu, Meng Zhang, Qingjun Chen, Xin Gao, Xuedong Gao, Hao Fan, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang
(Microsoft, China; Microsoft 365, China; Microsoft 365, USA)
Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. However, performing RCA on modern microservice systems can be challenging due to their large scale, as they usually comprise hundreds of components, leading to significant human effort. This paper proposes TraceDiag, an end-to-end RCA framework that addresses the challenges for large-scale microservice systems. It leverages reinforcement learning to learn a pruning policy for the service dependency graph to automatically eliminate redundant components, thereby significantly improving the RCA efficiency. The learned pruning policy is interpretable and fully adaptive to new RCA instances. With the pruned graph, a causal-based method can be executed with high accuracy and efficiency. The proposed TraceDiag framework is evaluated on real data traces collected from the Microsoft Exchange system, and demonstrates superior performance compared to state-of-the-art RCA approaches. Notably, TraceDiag has been integrated as a critical component in the Microsoft M365 Exchange, resulting in a significant improvement in the system's reliability and a considerable reduction in the human effort required for RCA.

Publisher's Version
Triggering Modes in Spectrum-Based Multi-location Fault Localization
Tung Dao, Na Meng, and ThanhVu Nguyen
(Cvent, USA; Virginia Tech, USA; George Mason University, USA)
Spectrum-based fault localization (SBFL) techniques can aid in debugging, but their practicality in industrial settings has been limited due to the large number of tests that need to be executed before applying SBFL. Previous research has explored different trigger modes for SBFL and found that applying it immediately after the first test failure is also effective. However, that study only considered single-location bugs, while multi-location bugs are prevalent in real-world scenarios and especially at our company Cvent, which is interested in integrating SBFL into its CI/CD workflow.
In this work, we investigate the effectiveness of SBFL on multi-location bugs and propose a framework called Instant Fault Localization for Multi-location Bugs (IFLM). We compare and evaluate four trigger modes of IFLM using open-source (Defects4J) and closed-source (Cvent) bug datasets.
Our study showed that it is not necessary to execute all test cases before applying SBFL. However, we also found that applying SBFL right after the first failed test is less effective than applying it after executing all tests for multi-location bugs, which is contrary to the single-location bug study. We also observe differences in performance between real and artificial bugs. Our contributions include the development of IFLM and Cvent bug datasets, analysis of SBFL effectiveness for multi-location bugs, and practical implications for integrating SBFL in industrial environments.
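For readers unfamiliar with SBFL, the sketch below computes suspiciousness scores with the Ochiai formula, score(e) = ef / sqrt((ef + nf) * (ef + ep)), where ef and ep are the numbers of failing and passing tests that cover element e and nf is the number of failing tests that do not. The abstract does not say which ranking formula IFLM uses, so Ochiai is chosen here only as a widely used example; the test and element names are made up. A "trigger mode" then corresponds to which subset of executed tests is passed in.

```python
import math


def ochiai_scores(coverage, outcomes):
    """Compute Ochiai suspiciousness per code element.

    coverage: test name -> set of code elements the test executed.
    outcomes: test name -> True if the test passed, False if it failed.
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    elements = set().union(*coverage.values()) if coverage else set()
    scores = {}
    for element in elements:
        failed_cov = sum(1 for t, cov in coverage.items()
                         if element in cov and not outcomes[t])
        passed_cov = sum(1 for t, cov in coverage.items()
                         if element in cov and outcomes[t])
        # ef + nf equals the total number of failing tests.
        denom = math.sqrt(total_failed * (failed_cov + passed_cov))
        scores[element] = failed_cov / denom if denom else 0.0
    return scores


def rank(coverage, outcomes):
    # Most suspicious first; ties are broken alphabetically for determinism.
    scores = ochiai_scores(coverage, outcomes)
    return sorted(scores, key=lambda e: (-scores[e], e))
```

Calling `rank` with only the tests run up to the first failure, versus with the full test suite, gives the two extremes of the trigger modes compared in the study.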

Publisher's Version Info
Appaction: Automatic GUI Interaction for Mobile Apps via Holistic Widget Perception
Yongxiang Hu, Jiazhen Gu, Shuqing Hu, Yu Zhang, Wenjie Tian, Shiyu Guo, Chaoyi Chen, and Yangfan Zhou
(Fudan University, China; Meituan, China; Shanghai Key Laboratory of Intelligent Information Processing, China)
In industrial practice, GUI (Graphical User Interface) testing of mobile apps still relies heavily on manual effort. Most of this effort goes into understanding the GUIs, so that testing scripts can be written accordingly. Quality assurance can therefore be very labor-intensive, especially for modern commercial mobile apps, where a single app may include numerous, diverse, and complex GUIs, e.g., those for placing orders of different commercial items. To reduce such human effort, we propose Appaction, a learning-based automatic GUI interaction approach we developed for Meituan, one of the largest E-commerce providers with over 600 million users. Appaction can automatically analyze the target GUI and understand what each input of the GUI is about, so that corresponding valid inputs can be entered accordingly. To this end, Appaction adopts a multi-modal model to learn from human experiences in perceiving a GUI. This allows it to infer corresponding valid input events that can properly interact with the GUI. In this way, the target app can be effectively exercised. We present our experiences in Meituan on applying Appaction to popular commercial apps. We demonstrate the effectiveness of Appaction in GUI analysis and show that it can perform correct interactions for numerous form pages.

Publisher's Version
MuRS: Mutant Ranking and Suppression using Identifier Templates
Zimin Chen, Małgorzata Salawa, Manushree Vijayvergiya, Goran Petrović, Marko Ivanković, and René Just
(KTH Royal Institute of Technology, Sweden; Google, Switzerland; University of Washington, USA)
Diff-based mutation testing is a mutation testing approach that only mutates lines affected by a code change under review. This approach scales independently of the code-base size and introduces test goals (mutants) that are directly relevant to an engineer’s goal such as fixing a bug, adding a new feature, or refactoring existing functionality. Google’s mutation testing service integrates diff-based mutation testing into the code review process and continuously gathers developer feedback on mutants surfaced during code review. To enhance the developer experience, the mutation testing service uses a number of manually-written rules that suppress not-useful mutants—mutants that have consistently received negative developer feedback. However, while effective, manually implementing suppression rules requires significant engineering time.
This paper proposes and evaluates MuRS, an automated approach that groups mutants by patterns in the source code under test and uses these patterns to rank and suppress future mutants based on historical developer feedback on mutants in the same group. To evaluate MuRS, we conducted an A/B testing study, comparing MuRS to the existing mutation testing service. Despite the strong baseline, which uses manually-written suppression rules, the results show a statistically significantly lower negative feedback ratio of 11.45% for MuRS versus 12.41% for the baseline. The results also show that MuRS is able to recover existing suppression rules implemented in the baseline. Finally, the results show that statement-deletion mutant groups received both the most positive and negative developer feedback, suggesting a need for additional context that can distinguish between useful and not-useful mutants in these groups. Overall, MuRS is able to recover existing suppression rules and automatically learn additional, finer-grained suppression rules from developer feedback.

Publisher's Version
Modeling the Centrality of Developer Output with Software Supply Chains
Audris Mockus, Peter C. Rigby, Rui Abreu, Parth Suresh, Yifen Chen, and Nachiappan Nagappan
(Meta, USA; University of Tennessee, USA; Concordia University, Canada)
Raw developer output, as measured by the number of changes a developer makes to the system, is a simplistic and potentially misleading measure of productivity, as new developers tend to work on peripheral parts of the system and experienced developers on more central parts. In this work, we use Software Supply Chain (SSC) networks and Katz centrality and PageRank on these networks to suggest a more nuanced measure of developer productivity. Our SSC is a network that represents the relationships between developers and artifacts that make up a system. We combine author-to-file, co-changing files, call hierarchies, and reporting structure into a single SSC and calculate the centrality of each node. The measures of centrality can be used to better understand variations in the impact of developer output at Meta. We start by partially replicating prior work and show that the raw number of developer commits plateaus over a project-specific period. However, the centrality of developer work grows for the entire period of study, but the growth slows after one year. This implies that while raw output might plateau, more experienced developers work on more central parts of the system. Finally, we investigate the incremental contribution of SSC attributes in modeling developer output. We find that local attributes such as the number of reports and the specific project do not explain much variation (R² = 5.8%). In contrast, adding Katz centrality or PageRank produces a model with an R² above 30%. SSCs and their centrality provide valuable insights into the centrality and importance of a developer’s work.
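To make the centrality measure concrete, here is a minimal Python sketch of Katz centrality computed by fixed-point iteration on a tiny directed graph. The node names, the edge direction (contributors and dependents pointing at the artifact they touch), and the `alpha` and `beta` values are assumptions for illustration; the paper's actual network construction is not reproduced here.

```python
def katz_centrality(edges, alpha=0.1, beta=1.0, iterations=100):
    """Iterate x_i = alpha * sum(x_j for every edge j -> i) + beta.

    Converges when alpha is smaller than the reciprocal of the largest
    eigenvalue of the adjacency matrix, which holds for this small example.
    """
    nodes = sorted({node for edge in edges for node in edge})
    score = {node: 0.0 for node in nodes}
    for _ in range(iterations):
        updated = {node: beta for node in nodes}
        for src, dst in edges:
            updated[dst] += alpha * score[src]
        score = updated
    return score


# Hypothetical toy network: two developers and three files.
edges = [
    ("dev_new", "util.py"),
    ("dev_senior", "core.py"),
    ("util.py", "core.py"),
    ("api.py", "core.py"),
]
scores = katz_centrality(edges)
```

In this toy graph `core.py` receives the highest score because it is reached both directly and through `util.py`, which is the sense in which work on it counts as more central.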

Publisher's Version
On-Premise AIOps Infrastructure for a Software Editor SME: An Experience Report
Anes Bendimerad, Youcef Remil, Romain Mathonat, and Mehdi Kaytoue
(Infologic, France; INSA Lyon, France; CNRS, France; LIRIS UMR5205, France)
Information Technology has become a critical component in various industries, leading to an increased focus on software maintenance and monitoring. With the complexities of modern software systems, traditional maintenance approaches have become insufficient. The concept of AIOps has emerged to enhance predictive maintenance using Big Data and Machine Learning capabilities. However, exploiting AIOps requires addressing several challenges related to the complexity of data and incident management. Commercial solutions exist, but they may not be suitable for certain companies due to high costs, data governance issues, and limitations in covering private software. This paper investigates the feasibility of implementing on-premise AIOps solutions by leveraging open-source tools. We introduce a comprehensive AIOps infrastructure that we have successfully deployed in our company, and we provide the rationale behind different choices that we made to build its various components. Particularly, we provide insights into our approach and criteria for selecting a data management system and we explain its integration. Our experience can be beneficial for companies seeking to internally manage their software maintenance processes with a modern AIOps approach.

Publisher's Version
C³: Code Clone-Based Identification of Duplicated Components
Yanming Yang, Ying Zou, Xing Hu, David Lo, Chao Ni, John Grundy, and Xin Xia
(Zhejiang University, China; Queen’s University, Canada; Singapore Management University, Singapore; Monash University, Australia; Huawei, China)
Reinventing the wheel is a detrimental programming practice in software development that frequently results in the introduction of duplicated components. This practice not only leads to increased maintenance and labor costs but also poses a higher risk of propagating bugs throughout the system. Despite numerous issues introduced by duplicated components in software, the identification of component-level clones remains a significant challenge that existing studies struggle to effectively tackle. Specifically, existing methods face two primary limitations that are challenging to overcome: 1) Measuring the similarity between different components presents a challenge due to the significant size differences among them; 2) Identifying functional clones is a complex task as determining the primary functionality of components proves to be difficult.
To overcome the aforementioned challenges, we present a novel approach named C3 (Component-level Code Clone detector) to effectively identify both textual and functional cloned components. In addition, to enhance the efficiency of eliminating cloned components, we develop an assessment method based on six component-level clone features, which assists developers in prioritizing the cloned components based on the refactoring necessity.
To validate the effectiveness of C3, we employ a large-scale industrial product developed by Huawei, a prominent global ICT company, as our dataset and apply C3 to this dataset to identify the cloned components. Our experimental results demonstrate that C3 is capable of accurately detecting cloned components, achieving impressive performance in terms of precision (0.93), recall (0.91), and F1-score (0.9). Besides, we conduct a comprehensive user study to further validate the effectiveness and practicality of our assessment method and the proposed clone features in assessing the refactoring necessity of different cloned components. Our study establishes solid alignment between assessment outcomes and participant responses, indicating the accurate prioritization of clone components with a high refactoring necessity through our method. This finding further confirms the usefulness of the six “golden features” in our assessment.

Publisher's Version
AdaptivePaste: Intelligent Copy-Paste in IDE
Xiaoyu Liu, Jinu Jang, Neel Sundaresan, Miltiadis Allamanis, and Alexey Svyatkovskiy
(Microsoft, USA; Google, UK)
In software development, it is common for programmers to copy-paste or port code snippets and then adapt them to their use case. This scenario motivates the code adaptation task – a variant of program repair which aims to adapt variable identifiers in a pasted snippet of code to the surrounding, preexisting context. However, no existing approach has been shown to effectively address this task. In this paper, we introduce AdaptivePaste, a learning-based approach to source code adaptation, based on transformers and a dedicated dataflow-aware deobfuscation pre-training task to learn meaningful representations of variable usage patterns. We demonstrate that AdaptivePaste can learn to adapt Python source code snippets with 67.8% exact match accuracy. We study the impact of confidence thresholds on the model predictions, showing the model precision can be further improved to 85.9% with our parallel-decoder transformer model in a selective code adaptation setting. To assess the practical use of AdaptivePaste we perform a user study among Python software developers on real-world copy-paste instances. The results show that AdaptivePaste reduces dwell time to nearly half the time it takes to port code manually, and helps to avoid bugs. In addition, we utilize the participant feedback to identify potential avenues for improvement.

Publisher's Version
Adapting Performance Analytic Techniques in a Real-World Database-Centric System: An Industrial Experience Report
Lizhi Liao, Heng Li, Weiyi Shang, Catalin Sporea, Andrei Toma, and Sarah Sajedi
(University of Waterloo, Canada; Polytechnique Montréal, Canada; ERA Environmental, Canada)
Database-centric architectures have been widely adopted in large-scale software systems in various domains to deal with the ever-increasing amount and complexity of data. Prior studies have proposed a wide range of performance analytic techniques aimed at assisting developers in pinpointing software performance inefficiencies and diagnosing performance issues. However, directly applying these existing techniques to large-scale database-centric systems can be challenging and may not perform well due to the unique nature of such systems. In particular, compared to typical database-based systems like online shopping systems, in database-centric systems, a majority of the business logic and calculations reside in the database instead of the application. As the calculations in the database typically use domain-specific languages such as SQL, the performance issues of such systems and their diagnosis may be significantly different from the systems dominated by traditional programming languages such as Java. In this paper, we share our experience of adapting performance analytic techniques in a large-scale database-centric system from our industrial collaborator. Our adapted performance analysis pays special attention to the database and the interactions between the database and the application with minimal reliance on expert knowledge and manual effort. Moreover, we document our encountered challenges and how they are addressed during the development and adoption of our solution in the industrial setting as well as the corresponding lessons learned. We also discuss the real-world performance issues detected by applying our analysis to the target database-centric system. We anticipate that our solution and the reported experience can be helpful for practitioners and researchers who would like to ensure and improve the performance of database-centric systems.

Publisher's Version
KDDT: Knowledge Distillation-Empowered Digital Twin for Anomaly Detection
Qinghua Xu, Shaukat Ali, Tao Yue, Zaimovic Nedim, and Inderjeet Singh
(Simula Research Laboratory, Norway; University of Oslo, Norway; Oslo Metropolitan University, Norway; Alstom Rail, Sweden)
Cyber-physical systems (CPSs), like train control and management systems (TCMS), are becoming ubiquitous in critical infrastructures. As safety-critical systems, ensuring their dependability during operation is crucial. Digital twins (DTs) have been increasingly studied for this purpose owing to their capability of runtime monitoring and warning, prediction and detection of anomalies, etc. However, constructing a DT for anomaly detection in TCMS necessitates sufficient training data and extracting both chronological and context features with high quality. Hence, in this paper, we propose a novel method named KDDT for TCMS anomaly detection. KDDT harnesses a language model (LM) and a long short-term memory (LSTM) network to extract contexts and chronological features, respectively. To enrich data volume, KDDT benefits from out-of-domain data with knowledge distillation (KD). We evaluated KDDT with two datasets from our industry partner Alstom and obtained the F1 scores of 0.931 and 0.915, respectively, demonstrating the effectiveness of KDDT. We also explored individual contributions of the DT model, LM, and KD to the overall performance of KDDT, via a comprehensive empirical study, and observed average F1 score improvements of 12.4%, 3%, and 6.05%, respectively.

Publisher's Version
AG3: Automated Game GUI Text Glitch Detection Based on Computer Vision
Xiaoyun Liang, Jiayi Qi, Yongqiang Gao, Chao Peng, and Ping Yang
(ByteDance, China)
With the advancement of device software and hardware performance, and the evolution of game engines, an increasing number of emerging high-quality games are captivating game players from all around the world who speak different languages. However, due to the vast fragmentation of the device and platform market, a well-tested game may still experience text glitches when installed on a new device with an unseen screen resolution and system version, which can significantly impact the user experience. In our testing pipeline, current testing techniques for identifying multilingual text glitches are laborious and inefficient. In this paper, we present AG3, which offers intelligent game traversal, precise visual text glitch detection, and integrated quality report generation capabilities. Our empirical evaluation and internal industrial deployment demonstrate that AG3 can detect various real-world multilingual text glitches with minimal human involvement.

Publisher's Version
Detection Is Better Than Cure: A Cloud Incidents Perspective
Vaibhav Ganatra, Anjaly Parayil, Supriyo Ghosh, Yu Kang, Minghua Ma, Chetan Bansal, Suman Nath, and Jonathan Mace
(Microsoft, India; Microsoft, China; Microsoft, USA)
Cloud providers use automated watchdogs or monitors to continuously observe service availability and to proactively report incidents when system performance degrades. Improper monitoring can lead to delays in the detection and mitigation of production incidents, which can be extremely expensive in terms of customer impacts and manual toil from engineering resources. Therefore, a systematic understanding of the pitfalls in current monitoring practices and how they can lead to production incidents is crucial for ensuring continuous reliability of cloud services.
In this work, we carefully study the production incidents from the past year at Microsoft to understand the monitoring gaps in a hyperscale cloud platform. We conduct an extensive empirical study to answer: (1) What are the major causes of failures in early detection of production incidents and what are the steps taken for mitigation, (2) What is the impact of failures in early detection, (3) How do we recommend best monitoring practices for different services, and (4) How can we leverage the insights from this study to enhance the reliability of the cloud services. This study provides a deeper understanding of existing monitoring gaps in cloud platforms, uncovers interesting insights, and provides guidance on best monitoring practices for ensuring continuous reliability.

Publisher's Version
PropProof: Free Model-Checking Harnesses from PBT
Yoshiki Takashima
(Carnegie Mellon University, USA)
Property-based testing (PBT) is often used by Rust developers to test functional correctness properties of their code. Since PBT uses randomized testing, its guarantees are limited: it can detect bugs but provides no formal guarantees of correctness. The Kani Rust Verifier uses the CProver verification framework to verify Rust code, given a specification in a Kani verification harness. However, developers must manually write Kani harnesses while avoiding model-checking-specific pitfalls like large memory usage or timeouts. We introduce PropProof, a library that automatically converts PBT harnesses into Kani harnesses which can be formally validated using Kani. We discuss the data-structure models we developed in order to optimize performance of these Kani verification harnesses. Using this library, we identified and fixed 2 issues in an AWS-developed protocol-buffer library with nearly 40 million downloads. PropProof is being used in that library’s CI. Our evaluation on 42 PBT harnesses from top-ranked open-source Rust libraries demonstrates that PropProof enables the use of Kani to verify complex, user-defined properties on existing code with minimal user intervention.

Publisher's Version
LightF3: A Lightweight Fully-Process Formal Framework for Automated Verifying Railway Interlocking Systems
Yibo Dong, Xiaoyu Zhang, Yicong Xu, Chang Cai, Yu Chen, Weikai Miao, Jianwen Li, and Geguang Pu
(East China Normal University, China; Shanghai Trusted Industrial Control Platform, China)
Interlocking has long played a crucial role in railway systems. Its functional correctness, particularly concerning safety, forms the foundation of the entire signaling system. To date, numerous efforts have been made to formally model and verify interlocking systems. However, two main problems persist in most prior work: (1) The formal description of the interlocking system heavily depends on reusing existing models, which often results in overgeneralization and failing to fully utilize the intrinsic characteristics of interlocking systems. (2) The verification techniques of current approaches may quickly become outdated, and there is no adaptable method to integrate state-of-the-art verification algorithms or tools.
To address the above issues, we present LightF3, a lightweight and fully-process formal framework for modeling and verifying railway interlocking systems. LightF3 provides RIS-FL, a formal language based on FQLTL (a variant of LTL) to model the system and its specifications. LightF3 transforms the RIS-FL model automatically to the aiger model, which is the mainstream input of state-of-the-art model checkers, and then invokes the most advanced checkers to complete the verification task. We evaluated LightF3 by testing five real station instances from our industrial partner, demonstrating its effectiveness as a new framework. Additionally, we analyzed the statistics of the verification results from different model-checking techniques, providing useful conclusions for both the railway interlocking and formal methods communities.

Publisher's Version Published Artifact Artifacts Available
BFSig: Leveraging File Significance in Bus Factor Estimation
Vahid Haratian, Mikhail Evtikhiev, Pouria Derakhshanfar, Eray Tüzün, and Vladimir Kovalenko
(Bilkent University, Turkiye; JetBrains Research, Cyprus; JetBrains Research, Netherlands)
Software projects experience the departure of developers due to various reasons. As developers are one of the main sources of knowledge in software projects, their absence will inevitably result in a certain degree of knowledge depletion. Bus Factor (BF) is a metric to evaluate how this knowledge loss can affect the project’s continuity. Conventionally, BF is calculated as the size of the smallest set of developers whose departure removes over half the project knowledge. Current state-of-the-art approaches measure developers’ knowledge by the number of authored files, utilizing version control system (VCS) information. However, numerous studies have shown that files in software projects have different significance. In this study, we explore how weighting files according to their significance affects the performance of two prevailing BF estimators. We derive significance scores by computing five well-known graph metrics from the project’s dependency graph: PageRank, In-/Out-/All-Degree, and Betweenness Centralities. Furthermore, we introduce BFSig, a prototype of our approach. Finally, we present a new dataset comprising reported BF scores collected by surveying software practitioners from five prominent GitHub repositories. Our results indicate that BFSig outperforms the baselines by up to an 18% reduction in terms of Normalized Mean Absolute Error (NMAE). Moreover, BFSig yields 18% fewer False Negatives in identifying potential risks associated with low BF. Besides, our respondents confirmed BFSig’s versatility by showing its ability to assess the BF of the project’s subfolders. In conclusion, we believe that, to estimate BF from authorship, software components of higher importance should be assigned heavier weight. Currently, BFSig exclusively explores the topological characteristics of these components. Nevertheless, considering attributes such as code complexity and bug proneness could potentially enhance the performance of BFSig.
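The effect of weighting files by significance can be sketched with a deliberately simplified bus factor estimator in Python. This is neither BFSig nor either of the baseline estimators: it assumes a single main author per file, and the file names, developer names, and weights are made up for illustration.

```python
def bus_factor(file_author, file_weight=None):
    """Smallest number of developers whose departure loses over half the knowledge.

    Simplifying assumption: each file has one main author, so a developer's
    knowledge is the summed weight of their files and removing the most
    knowledgeable developers first yields the smallest set.
    """
    if file_weight is None:
        file_weight = {f: 1.0 for f in file_author}  # unweighted: count files
    knowledge = {}
    for path, dev in file_author.items():
        knowledge[dev] = knowledge.get(dev, 0.0) + file_weight[path]
    total = sum(knowledge.values())
    lost, removed = 0.0, 0
    for amount in sorted(knowledge.values(), reverse=True):
        if lost > total / 2:
            break
        lost += amount
        removed += 1
    return removed


# Hypothetical project: five files, three developers.
file_author = {
    "core.py": "alice",
    "a.py": "bob",
    "b.py": "bob",
    "c.py": "carol",
    "d.py": "carol",
}
# Hypothetical significance scores, e.g. from a centrality metric.
significance = {"core.py": 6.0, "a.py": 1.0, "b.py": 1.0, "c.py": 1.0, "d.py": 1.0}

unweighted_bf = bus_factor(file_author)
weighted_bf = bus_factor(file_author, significance)
```

With equal weights the estimate is 2, but once `core.py` is weighted as highly significant the estimate drops to 1, exposing a risk that file counts alone would hide.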

Publisher's Version
Automated Test Generation for Medical Rules Web Services: A Case Study at the Cancer Registry of Norway
Christoph Laaber, Tao Yue, Shaukat Ali, Thomas Schwitalla, and Jan F. Nygård
(Simula Research Laboratory, Norway; Oslo Metropolitan University, Norway; Cancer Registry of Norway, Norway; UiT The Arctic University of Norway, Norway)
The Cancer Registry of Norway (CRN) collects, curates, and manages data related to cancer patients in Norway, supported by an interactive, human-in-the-loop, socio-technical decision support software system. Automated software testing of this software system is inevitable; however, currently, it is limited in CRN’s practice. To this end, we present an industrial case study to evaluate an AI-based system-level testing tool, i.e., EvoMaster, in terms of its effectiveness in testing CRN’s software system. In particular, we focus on GURI, CRN’s medical rule engine, which is a key component at the CRN. We test GURI with EvoMaster’s black-box and white-box tools and study their test effectiveness regarding code coverage, errors found, and domain-specific rule coverage. The results show that all EvoMaster tools achieve a similar code coverage; i.e., around 19% line, 13% branch, and 20% method; and find a similar number of errors; i.e., 1 in GURI’s code. Concerning domain-specific coverage, EvoMaster’s black-box tool is the most effective in generating tests that lead to applied rules; i.e., 100% of the aggregation rules and between 12.86% and 25.81% of the validation rules; and to diverse rule execution results; i.e., 86.84% to 89.95% of the aggregation rules and 0.93% to 1.72% of the validation rules pass, and 1.70% to 3.12% of the aggregation rules and 1.58% to 3.74% of the validation rules fail. We further observe that the results are consistent across 10 versions of the rules. Based on these results, we recommend using EvoMaster’s black-box tool to test GURI since it provides good results and advances the current state of practice at the CRN. Nonetheless, EvoMaster needs to be extended to employ domain-specific optimization objectives to improve test effectiveness further. Finally, we conclude with lessons learned and potential research directions, which we believe are applicable in a general context.

Publisher's Version
Test Case Generation for Drivability Requirements of an Automotive Cruise Controller: An Experience with an Industrial Simulator
Federico Formica, Nicholas Petrunti, Lucas Bruck, Vera Pantelic, Mark Lawford, and Claudio Menghi
(McMaster University, Canada; University of Bergamo, Italy)
Automotive software development requires engineers to test their systems to detect violations of both functional and drivability requirements. Functional requirements define the functionality of the automotive software. Drivability requirements refer to the driver's perception of the interactions with the vehicle; for example, they typically require limiting the acceleration and jerk perceived by the driver within given thresholds. While functional requirements are extensively considered by the research literature, drivability requirements garner less attention. This industrial paper describes our experience assessing the usefulness of an automated search-based software testing (SBST) framework in generating failure-revealing test cases for functional and drivability requirements. We report on our experience with the VI-CarRealTime simulator, an industrial virtual modeling and simulation environment widely used in the automotive domain. We designed a Cruise Control system in Simulink for a four-wheel vehicle, in an iterative fashion, by producing 21 model versions. We used the SBST framework for each version of the model to search for test cases that reveal requirement violations. Our results show that the SBST framework successfully identified a failure-revealing test case for 66.7% of our model versions, requiring, on average, 245.9s and 3.8 iterations. We present lessons learned, reflect on the generality of our results, and discuss how our results improve the state of practice.

Publisher's Version
Prioritizing Natural Language Test Cases Based on Highly-Used Game Features
Markos Viggiato, Dale Paas, and Cor-Paul Bezemer
(University of Alberta, Canada; Prodigy Education, Canada)
Software testing is still a manual activity in many industries, such as the gaming industry. But manually executing tests becomes impractical as the system grows and resources are restricted, mainly in a scenario with short release cycles. Test case prioritization is a commonly used technique to optimize the test execution. However, most prioritization approaches do not work for manual test cases as they require source code information or test execution history, which is often not available in a manual testing scenario. In this paper, we propose a prioritization approach for manual test cases written in natural language based on the tested application features (in particular, highly-used application features). Our approach consists of (1) identifying the tested features from natural language test cases (with zero-shot classification techniques) and (2) prioritizing test cases based on the features that they test. We leveraged the NSGA-II genetic algorithm for the multi-objective optimization of the test case ordering to maximize the coverage of highly-used features while minimizing the cumulative execution time. Our findings show that we can successfully identify the application features covered by test cases using an ensemble of pre-trained models with strong zero-shot capabilities (an F-score of 76.1%). Also, our prioritization approaches can find test case orderings that cover highly-used application features early in the test execution while keeping the time required to execute test cases short. QA engineers can use our approach to focus the test execution on test cases that cover features that are relevant to users.

Publisher's Version
EvoCLINICAL: Evolving Cyber-Cyber Digital Twin with Active Transfer Learning for Automated Cancer Registry System
Chengjie Lu, Qinghua Xu, Tao Yue, Shaukat Ali, Thomas Schwitalla, and Jan F. Nygård
(Simula Research Laboratory, Norway; University of Oslo, Norway; Oslo Metropolitan University, Norway; Cancer Registry of Norway, Norway; Arctic University of Norway, Norway)
The Cancer Registry of Norway (CRN) collects information on cancer patients by receiving cancer messages from different medical entities (e.g., medical labs, hospitals) in Norway. Such messages are validated by an automated cancer registry system: GURI. Its correct operation is crucial since it lays the foundation for cancer research and provides critical cancer-related statistics to its stakeholders. Constructing a cyber-cyber digital twin (CCDT) for GURI can facilitate various experiments and advanced analyses of the operational state of GURI without requiring intensive interactions with the real system. However, GURI constantly evolves due to novel medical diagnostics and treatment, technological advances, etc. Accordingly, CCDT should evolve as well to synchronize with GURI. A key challenge of achieving such synchronization is that evolving CCDT needs abundant data labelled by the new GURI. To tackle this challenge, we propose EvoCLINICAL, which considers the CCDT developed for the previous version of GURI as the pretrained model and fine-tunes it with the dataset labelled by querying a new GURI version. EvoCLINICAL employs a genetic algorithm to select an optimal subset of cancer messages from a candidate dataset and query GURI with it. We evaluate EvoCLINICAL on three evolution processes. The precision, recall, and F1 score are all greater than 91%, demonstrating the effectiveness of EvoCLINICAL. Furthermore, we replace the active learning part of EvoCLINICAL with random selection to study the contribution of active learning to the overall performance of EvoCLINICAL. Results show that employing active learning in EvoCLINICAL consistently increases its performance.

Publisher's Version
Compositional Taint Analysis for Enforcing Security Policies at Scale
Subarno Banerjee, Siwei Cui, Michael Emmi, Antonio Filieri, Liana Hadarean, Peixuan Li, Linghui Luo, Goran Piskachev, Nicolás Rosner, Aritra Sengupta, Omer Tripp, and Jingbo Wang
(Amazon Web Services, USA; Texas A&M University, USA; Amazon Web Services, Germany; University of Southern California, USA)
Automated static dataflow analysis is an effective technique for detecting security critical issues like sensitive data leak, and vulnerability to injection attacks. Ensuring high precision and recall requires an analysis that is context, field and object sensitive. However, it is challenging to attain high precision and recall and scale to large industrial code bases. Compositional style analyses in which individual software components are analyzed separately, independent from their usage contexts, compute reusable summaries of components. This is an essential feature when deploying such analyses in CI/CD at code-review time or when scanning deployed container images. In both these settings the majority of software components stay the same between subsequent scans. However, it is not obvious how to extend such analyses to check the kind of contextual taint specifications that arise in practice, while maintaining compositionality.
In this work we present contextual dataflow modeling, an extension to the compositional analysis that checks complex taint specifications and significantly increases recall and precision. Furthermore, we show how such high-fidelity analysis can scale in production using three key optimizations: (i) discarding intermediate results for previously-analyzed components, an optimization exploiting the compositional nature of our analysis; (ii) a scope-reduction analysis to reduce the scope of the taint analysis w.r.t. the taint specifications being checked; and (iii) caching of analysis models. We show a 9.85% reduction in false positive rate on a comprehensive test suite comprising the OWASP open-source benchmarks as well as internal real-world code samples. We measure the performance and scalability impact of each individual optimization using open source JVM packages from the Maven central repository and internal AWS service codebases. This combination of high precision, recall, performance, and scalability has allowed us to enforce security policies at scale both internally within Amazon as well as for external customers.

Publisher's Version
A Multidimensional Analysis of Bug Density in SAP HANA
Julian Reck, Thomas Bach, and Jan Stoess
(SAP, Germany; University of Applied Sciences Karlsruhe, Germany)
Researchers and practitioners have been studying correlations between software metrics and defects for decades. The typical approach is to postulate a hypothesis that a certain metric correlates with the number of defects. A statistical test then utilizes historical data to accept or reject the hypothesis. Although this methodology has been widely adopted, our own experience is that such correlations are often limited in their practical relevance, particularly for large industrial projects: Interpreting and arguing about them is challenging and cumbersome; the difference between correlation and causation might not be clear; and the practical impact of a correlation is often questioned due to misconceptions between a statistical conclusion and the impact on singular events. Instead of discussing correlations, we found that the analysis for binary testedness results in more fruitful discussions. Binary testedness, as proposed by prior work, utilizes a metric to divide the source code into two parts and verifies whether more (or fewer) defects appear in each part than expected. In our work, we leverage the binary testedness approach and analyze several software metrics for a large industrial project to illustrate the concept. We furthermore introduce dynamic thresholds as a novel and more practical approach for source code classification compared to the static binary classification of previous works. Our results show that some studied metrics have a significant correlation with bug distribution, but effect sizes differ by several orders of magnitude across metrics. Overall, our approach moves away from “metric X correlates with defects” to a more fruitful “source code with attribute X has more (or fewer) bugs than expected”, reframing the discussion from questioning statistics and methods towards an evidence-based root cause analysis.

Publisher's Version Archive submitted (420 kB)
Ownership in the Hands of Accountability at Brightsquid: A Case Study and a Developer Survey
Umme Ayman Koana, Francis Chew, Chris Carlson, and Maleknaz Nayebi
(York University, Canada; Brightsquid, Canada)
The COVID-19 pandemic has accelerated the adoption of digital health solutions. This has presented significant challenges for software development teams to swiftly adjust to market needs and demand. To address these challenges, product management teams have had to adapt their approach to software development, reshaping their processes to meet the demands of the pandemic. Brightsquid implemented a new task assignment process aimed at enhancing developer accountability toward the customer. To assess the impact of this change on code ownership, we conducted a code change analysis. Additionally, we surveyed 67 developers to investigate the relationship between accountability and ownership more broadly. The findings of our case study indicate that the revised assignment model not only increased the perceived sense of accountability within the production team but also improved code resilience against ownership changes. Moreover, the survey results revealed that a majority of the participating developers (67.5%) associated perceived accountability with artifact ownership.

Publisher's Version

Industry Short Papers

Rotten Green Tests in Google Test
Paul T. Robinson
(Sony Interactive Entertainment, USA)
Executable unit tests are a key component of many software engineering methodologies. A “green” test is one that is reported as passing (successfully testing some software feature). However, it is common for a test harness to assume that a test has passed when, in fact, it has merely not reported a failure. In this gap, where the “excluded middle” lives and thrives, we find the Rotten Green Test: A test that looks like it does something useful, but in fact does not.
Google Test, a popular open-source test framework for C++ software, has been enhanced to detect rotten green tests. This enhancement follows the lead of similar work done for the Pharo language, but in a framework more applicable in industry, and with no requirement for test modifications or an external tool. The enhanced Google Test has detected 183 rotten assertions in the LLVM project’s unit tests, and even found one in Google Test’s own internal test suite. The enhancement may report false positives from parameterized tests where assertions are conditioned on the parameter, and currently does not detect rotten assertions in helper methods.
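The notion of a rotten green test can be made concrete with a small sketch. It is written in Python rather than C++ and is not how the enhanced Google Test works internally: it simply lists the assert statements in a test's source, runs the test under a line tracer, and reports assertions that were never executed even though the test passed. The test function and its input are hypothetical.

```python
import ast
import sys

# Hypothetical test: the inner assertion never runs when `flags` is empty.
TEST_SOURCE = '''
def test_parse_flags(flags):
    for flag in flags:
        assert flag.startswith("--")
    assert isinstance(flags, list)
'''


def find_rotten_asserts(source, test_name, *args):
    """Return line numbers of assert statements that a passing test never executed."""
    tree = ast.parse(source)
    assert_lines = {node.lineno for node in ast.walk(tree)
                    if isinstance(node, ast.Assert)}
    namespace = {}
    exec(compile(tree, "<test>", "exec"), namespace)
    executed = set()

    def tracer(frame, event, arg):
        # Only follow frames that belong to the compiled test source.
        if frame.f_code.co_filename != "<test>":
            return None
        if event == "line":
            executed.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        namespace[test_name](*args)  # raises if the test fails
    finally:
        sys.settrace(None)
    return sorted(assert_lines - executed)


print(find_rotten_asserts(TEST_SOURCE, "test_parse_flags", []))
```

With an empty flag list the test is green, yet the assertion inside the loop is reported as never executed; with a non-empty list nothing is reported.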

Publisher's Version
Issue Report Validation in an Industrial Context
Ethem Utku Aktas, Ebru Cakmak, Mete Cihad Inan, and Cemal Yilmaz
(Softtech Research and Development, Turkiye; Microsoft EMEA, Turkiye; Sabanci University, Turkiye)
Effective issue triaging is crucial for software development teams to improve software quality, and thus customer satisfaction. Validating issue reports manually can be time-consuming, hindering the overall efficiency of the triaging process. This paper presents an approach to automating the validation of issue reports to accelerate the issue triaging process in an industrial set-up. We work on 1,200 randomly selected issue reports from the banking domain, written in Turkish, an agglutinative language, meaning that new words can be formed by linear concatenation of suffixes to express entire sentences. We manually label these reports for validity, and extract the relevant patterns indicating that they are invalid. Since the issue reports we work on are written in an agglutinative language, we use morphological analysis to extract the features. Using the proposed feature extractors, we utilize a machine learning based approach to predict the issue reports’ validity, achieving an F1-score of 0.77.

Publisher's Version
On the Dual Nature of Necessity in Use of Rust Unsafe Code
Yuchen Zhang, Ashish Kundu, Georgios Portokalidis, and Jun Xu
(Stevens Institute of Technology, USA; Cisco Research, USA; University of Utah, USA)
Rust offers both safety guarantees and high performance. Thus, it has gained significant popularity in the industry. To extend its capability as a systems programming language, Rust allows unsafe blocks, in which code gains low-level control but loses the safety guarantees. In principle, unsafe blocks should only be used when necessary. However, preliminary evidence shows a different situation. This paper aims to establish a deeper view of this matter and take steps toward improvement.
We first present a study on the use of unsafe Rust in practice. We manually inspected 5946 unsafe blocks from 140 popular libraries and applications, focusing on whether the use of unsafe code is necessary (precisely, whether they have safe alternatives). The study unveils hundreds of instances of unnecessary unsafe Rust code and provides a taxonomy together with detailed analyses. These results complement our understanding and offer insights for the community to make a change.
Following the study, we further summarize nine popular patterns of unnecessary unsafe blocks and design an IDE plugin to auto-suggest their safe alternatives. Applied to 140 buggy unsafe blocks from the RustSec Advisory Database, the plugin identifies and offers safe versions to remove the bug for 28.6% of all cases.

Publisher's Version
Analyzing Microservice Connectivity with Kubesonde
Jacopo Bufalino, Mario Di Francesco, and Tuomas Aura
(Aalto University, Finland; Eficode, Finland)
Modern cloud-based applications are composed of several microservices that interact over a network. They are complex distributed systems, to the point that developers may not even be aware of how microservices connect to each other and to the Internet. As a consequence, the security of these applications can be greatly compromised. This work explicitly targets this context by providing a methodology to assess microservice connectivity, a software tool that implements it, and findings from analyzing real cloud applications. Specifically, it introduces Kubesonde, a cloud-native software that instruments live applications running on a Kubernetes cluster to analyze microservice connectivity, with minimal impact on performance. An assessment of microservices in 200 popular cloud applications with Kubesonde revealed significant issues in terms of network isolation: more than 60% of them had discrepancies between their declared and actual connectivity, and none restricted outbound connections towards the Internet. Our analysis shows that Kubesonde offers valuable insights on the connectivity between microservices, beyond what is possible with existing tools.

Publisher's Version
Testing Real-World Healthcare IoT Application: Experiences and Lessons Learned
Hassan Sartaj, Shaukat Ali, Tao Yue, and Kjetil Moberg
(Simula Research Laboratory, Norway; Oslo Metropolitan University, Norway; Norwegian Health Authority, Norway)
Healthcare Internet of Things (IoT) applications require rigorous testing to ensure their dependability. Such applications are typically integrated with various third-party healthcare applications and medical devices through REST APIs. This integrated network of healthcare IoT applications leads to REST APIs with complicated and interdependent structures, thus creating a major challenge for automated system-level testing. We report an industrial evaluation of a state-of-the-art REST API testing approach (RESTest) on a real-world healthcare IoT application. We analyze the effectiveness of RESTest’s testing strategies regarding REST API failures, faults in the application, and REST API coverage, by experimenting with six REST APIs with 41 API endpoints of the healthcare IoT application. Results show that several failures are discovered in different REST APIs with ≈56% coverage using RESTest. Moreover, nine potential faults are identified. Using the evidence collected from the experiments, we provide our experiences and lessons learned.

Publisher's Version
Diffusion-Based Time Series Data Imputation for Cloud Failure Prediction at Microsoft 365
Fangkai Yang, Wenjie Yin, Lu Wang, Tianci Li, Pu Zhao, Bo Liu, Paul Wang, Bo Qiao, Yudong Liu, Mårten Björkman, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang
(Microsoft, China; KTH Royal Institute of Technology, Sweden; Microsoft, USA)
Ensuring reliability in large-scale cloud systems like Microsoft 365 is crucial. Cloud failures, such as disk and node failures, threaten service reliability, causing service interruptions and financial loss. Existing works focus on failure prediction and proactively taking action before failures happen. However, they suffer from poor data quality, such as missing data in model training and prediction, which limits performance. In this paper, we focus on enhancing data quality through data imputation with the proposed Diffusion+, a sample-efficient diffusion model that imputes the missing data efficiently, conditioned on the observed data. Experiments with industrial datasets and application practice show that our model contributes to improving the performance of downstream failure prediction.

Publisher's Version
The Most Agile Teams Are the Most Disciplined: On Scaling out Agile Development
Zheng Li and Austen Rainer
(Queen's University Belfast, UK)
As one of the next frontiers of software engineering, agile development at scale has attracted more and more research interest and effort. When following the existing autonomy-focused and goal-driven lessons and guidelines to scale agile development for a large astronomy project, however, we encountered surprising tech stack sprawl and spreading team coordination issues. By revisiting the unique features of our project (e.g., the data processing-intensive nature and the frequent team member changes), and by identifying a fractal pattern from various data processing logic and processes, we defined disciplined agile teams to clone the best practices of pioneer agile teams, and to work on similar system modules with similar user stories. Such a targeted strategy effectively relieved the tech stack sprawl and facilitated teamwork handover, at least for refactoring and growing the data processing modules in our project. Based on this emerging result and our reflections, we distinguish this targeted strategy as scaling out agile development from the existing agile scaling approaches that are generally in a scaling-up fashion. Considering the popularity of data processing-intensive projects, and also considering the pervasive fractal patterns in modern businesses and organisations, we claim that this targeted strategy still has broad application opportunities. Therefore, developing a well-defined methodology for scaling out agility, and combining both scaling up and scaling out agility, deserve attention and new research effort in the future.

Publisher's Version

Ideas, Visions, and Reflections

Contribution-Based Firing of Developers?
Vincenzo Orrei, Marco Raglianti, Csaba Nagy, and Michele Lanza
(USI Lugano, Switzerland)
There has been some recent clamor about the developer layoff and turnover policies enacted by high-profile corporate executives. Precisely defining the contributions in software development has always been a thorny issue, as it is difficult to establish a developer’s “performance” without resorting to guesswork, due to how software development works and how Git persists history. Taking inspiration from a seemingly informal notion, the pony factor, we present an approach to identify the key developers in a software project. We present an analysis of 1,011 GitHub repositories, providing fact-based reflections on development contributions.
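For readers unfamiliar with the pony factor, a minimal sketch follows. It assumes the commonly cited informal definition (the smallest number of contributors who together account for more than half of all contributions) and uses commit counts as the contribution measure; the paper's own approach may differ, and the author names and counts are hypothetical.

```python
def pony_factor(commits_by_author):
    """Smallest number of authors whose commits exceed half of all commits.

    Assumes the informal definition of the pony factor; returns 0 when
    there are no commits at all.
    """
    total = sum(commits_by_author.values())
    covered = 0
    ranked = sorted(commits_by_author.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        covered += count
        if covered * 2 > total:  # strictly more than half
            return rank
    return 0


# Hypothetical commit counts for a small repository.
history = {"alice": 50, "bob": 30, "carol": 15, "dave": 5}
print(pony_factor(history))
```

A low value signals that a project depends on very few people, which is one way to make "key developers" measurable from Git history alone.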

Publisher's Version
Keeping Mutation Test Suites Consistent and Relevant with Long-Standing Mutants
Milos Ojdanic, Mike Papadakis, and Mark Harman
(University of Luxembourg, Luxembourg; Meta Platforms, UK; University College London, UK)
Mutation testing has been demonstrated to be one of the most powerful fault-revealing tools in the tester's tool kit. Much previous work implicitly assumed it to be sufficient to re-compute mutant suites per release. Sadly, this makes mutation results inconsistent; mutant scores from each release cannot be directly compared, making it harder to measure test improvement. Furthermore, regular code change means that a mutant suite's relevance will naturally degrade over time. We measure this degradation in relevance for 143,500 mutants in 4 non-trivial systems, finding that 52% degrade, on average. We introduce a mutant brittleness measure and use it to audit software systems and their mutation suites. We also demonstrate how consistent-by-construction long-standing mutant suites can be identified with a 10x improvement in mutant relevance over an arbitrary test suite. Our results indicate that the research community should avoid the re-computation of mutant suites and focus, instead, on long-standing mutants, thereby improving the consistency and relevance of mutation testing.

Publisher's Version
Towards Top-Down Automated Development in Limited Scopes: A Neuro-Symbolic Framework from Expressibles to Executables
Jian Gu and Harald C. Gall
(Monash University, Australia; University of Zurich, Switzerland)
Deep code generation is a topic of deep learning for software engineering (DL4SE), which adopts neural models to generate code for the intended functions. Since end-to-end neural methods lack domain knowledge and software hierarchy awareness, they tend to perform poorly w.r.t. project-level tasks. To systematically explore the potential improvements of code generation, we let it participate in the whole top-down development from expressibles to executables, which is possible in limited scopes. In the process, it benefits from massive samples, features, and knowledge. As the foundation, we suggest building a taxonomy on code data, namely code taxonomy, leveraging the categorization of code information. Moreover, we introduce a three-layer semantic pyramid (SP) to associate text data and code data. It identifies the information of different abstraction levels, and thus introduces the domain knowledge on development and reveals the hierarchy of software. Furthermore, we propose a semantic pyramid framework (SPF) as the approach, focusing on software of high modularity and low complexity. SPF divides the code generation process into stages and reserves spots for potential interactions. In addition, we conceived preliminary applications in software development to confirm the neuro-symbolic framework.

Publisher's Version
Lessons from the Long Tail: Analysing Unsafe Dependency Updates across Software Ecosystems
Supatsara Wattanakriengkrai, Raula Gaikovina Kula, Christoph Treude, and Kenichi Matsumoto
(NAIST, Japan; University of Melbourne, Australia)
A risk in adopting third-party dependencies into an application is their potential to serve as a doorway for malicious code to be injected (most often unknowingly). While many initiatives from both industry and research communities focus on the most critical dependencies (i.e., those most depended upon within the ecosystem), little is known about whether the rest of the ecosystem suffers the same fate. Our vision is to promote and establish safer practices throughout the ecosystem. To motivate our vision, in this paper, we present preliminary data based on three representative samples from a population of 88,416 pull requests (PRs) and identify unsafe dependency updates (i.e., any pull request that risks being unsafe during runtime), which clearly shows that unsafe dependency updates are not limited to highly impactful libraries. To draw attention to the long tail, we propose a research agenda comprising six key research questions that further explore how to safeguard against these unsafe activities. This includes developing best practices to address unsafe dependency updates not only in top-tier libraries but throughout the entire ecosystem.

Publisher's Version
Getting pwn’d by AI: Penetration Testing with Large Language Models
Andreas Happe and Jürgen Cito
(TU Wien, Austria)
The field of software security testing, more specifically penetration testing, requires high levels of expertise and involves many manual testing and analysis steps. This paper explores the potential use of large language models, such as GPT-3.5, to augment penetration testers with AI sparring partners. We explore two distinct use cases: high-level task planning for security testing assignments and low-level vulnerability hunting within a vulnerable virtual machine. For the latter, we implemented a closed feedback loop between LLM-generated low-level actions and a vulnerable virtual machine (connected through SSH) and allowed the LLM to analyze the machine state for vulnerabilities and suggest concrete attack vectors, which were automatically executed within the virtual machine. We discuss promising initial results, detail avenues for improvement, and close by deliberating on the ethics of AI sparring partners.

Publisher's Version Info
Towards Feature-Based Analysis of the Machine Learning Development Lifecycle
Boyue Caroline Hu and Marsha Chechik
(University of Toronto, Canada)
The safety and trustworthiness of systems with components that are based on Machine Learning (ML) require an in-depth understanding and analysis of all stages in its Development Lifecycle (MLDL). High-level abstractions of desired functionalities, model behaviour, and data are called features, and they have been studied by different communities across all MLDL stages. In this paper, we propose to support Software Engineering analysis of the MLDL through features, calling it feature-based analysis of the MLDL. First, to achieve a shared understanding of features among different experts, we establish a taxonomy of existing feature definitions currently used in various MLDL stages. Through this taxonomy, we map features from different stages to each other, discover gaps and future research directions and identify areas of collaboration between Software Engineering and other MLDL experts.

Publisher's Version
Exploring Moral Principles Exhibited in OSS: A Case Study on GitHub Heated Issues
Ramtin Ehsani, Rezvaneh Rezapour, and Preetha Chatterjee
(Drexel University, USA)
To foster collaboration and inclusivity in Open Source Software (OSS) projects, it is crucial to understand and detect patterns of toxic language that may drive contributors away, especially those from underrepresented communities. Although machine learning-based toxicity detection tools trained on domain-specific data have shown promise, their design lacks an understanding of the unique nature and triggers of toxicity in OSS discussions, highlighting the need for further investigation. In this study, we employ Moral Foundations Theory to examine the relationship between moral principles and toxicity in OSS. Specifically, we analyze toxic communications in GitHub issue threads to identify and understand five types of moral principles exhibited in text, and explore their potential association with toxic behavior. Our preliminary findings suggest a possible link between moral principles and toxic comments in OSS communications, with each moral principle associated with at least one type of toxicity. The potential of MFT in toxicity detection warrants further investigation.

Publisher's Version
Towards Understanding Emotions in Informal Developer Interactions: A Gitter Chat Study
Amirali Sajadi, Kostadin Damevski, and Preetha Chatterjee
(Drexel University, USA; Virginia Commonwealth University, USA)
Emotions play a significant role in teamwork and collaborative activities like software development. While researchers have analyzed developer emotions in various software artifacts (e.g., issues, pull requests), few studies have focused on understanding the broad spectrum of emotions expressed in chats. As one of the most widely used means of communication, chats contain valuable information in the form of informal conversations, such as negative perspectives about adopting a tool. In this paper, we present a dataset of developer chat messages manually annotated with a wide range of emotion labels (and sub-labels), and analyze the type of information present in those messages. We also investigate the unique signals of emotions specific to chats and distinguish them from other forms of software communication. Our findings suggest that chats have fewer expressions of Approval and Fear but more expressions of Curiosity compared to GitHub comments. We also notice that Confusion is frequently observed when discussing programming-related information such as unexpected software behavior. Overall, our study highlights the potential of mining emotions in developer chats for supporting software maintenance and evolution tools.

Publisher's Version
Towards Strengthening Formal Specifications with Mutation Model Checking
Maxime Cordy, Sami Lazreg, Axel Legay, and Pierre Yves Schobbens
(University of Luxembourg, Luxembourg; Université Catholique de Louvain, Belgium; University of Namur, Belgium)
We propose mutation model checking as an approach to strengthen formal specifications used for model checking. Inspired by mutation testing, our approach concludes that specifications are not strong enough if they fail to detect faults in purposely mutated models. Our preliminary experiments on two case studies confirm the relevance of the problem: their specification can only detect 40% and 60% of randomly generated mutants. As a result, we propose a framework to strengthen the original specification, such that the original model satisfies the strengthened specification but the mutants do not.
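The core idea can be sketched on a toy example. The code below is an illustrative assumption rather than the authors' framework: models are small transition systems, a specification is a predicate over the reachable transitions, and a mutant counts as detected when it violates the specification. All state names, mutants, and specifications are hypothetical.

```python
def reachable_edges(model, initial):
    """Collect every transition reachable from the initial state."""
    seen, stack, edges = {initial}, [initial], set()
    while stack:
        state = stack.pop()
        for target in model.get(state, ()):
            edges.add((state, target))
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return edges


def weak_spec(edges):
    # Original specification: the error state is never entered.
    return all(target != "error" for _, target in edges)


def strong_spec(edges):
    # Strengthened specification: additionally, "done" is only entered from "busy".
    return weak_spec(edges) and all(
        source == "busy" for source, target in edges if target == "done")


def detection_rate(spec, mutants, initial="idle"):
    """Fraction of mutants that violate the specification."""
    killed = sum(1 for m in mutants if not spec(reachable_edges(m, initial)))
    return killed / len(mutants)


ORIGINAL = {"idle": {"busy"}, "busy": {"done"}, "done": {"idle"}}
MUTANTS = [
    {"idle": {"busy"}, "busy": {"done", "error"}, "done": {"idle"}},  # adds a fault
    {"idle": {"busy", "done"}, "busy": {"done"}, "done": {"idle"}},   # skips "busy"
]

print(detection_rate(weak_spec, MUTANTS), detection_rate(strong_spec, MUTANTS))
```

In this toy setting the weak specification misses the mutant that skips the "busy" state, while the strengthened one detects both mutants and is still satisfied by the original model.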

Publisher's Version
Assisting Static Analysis with Large Language Models: A ChatGPT Experiment
Haonan Li, Yu Hao, Yizhuo Zhai, and Zhiyun Qian
(University of California at Riverside, USA)
Recent advances in Large Language Models (LLMs), e.g., ChatGPT, have exhibited strong capabilities of comprehending and responding to questions across a variety of domains. Surprisingly, ChatGPT even possesses a strong understanding of program code. In this paper, we investigate where and how LLMs can assist static analysis by asking appropriate questions. In particular, we target a specific bug-finding tool, which produces many false positives from the static analysis. In our evaluation, we find that these false positives can be effectively pruned by asking carefully constructed questions about function-level behaviors or function summaries. Specifically, with a pilot study of 20 false positives, we can successfully prune 8 out of 20 based on GPT-3.5, whereas GPT-4 had a near-perfect result of 16 out of 20, where the four failed ones are not currently considered/supported by our questions, e.g., involving concurrency. Additionally, it also identified one false negative case (a missed bug). We find LLMs a promising tool that can enable a more effective and efficient program analysis.

Publisher's Version
Reflecting on the Use of the Policy-Process-Product Theory in Empirical Software Engineering
Kelechi G. Kalu, Taylor R. Schorlemmer, Sophie Chen, Kyle A. Robinson, Erik Kocinare, and James C. Davis
(Purdue University, USA; University of Michigan, USA)
The primary theory of software engineering is that an organization’s Policies and Processes influence the quality of its Products. We call this the PPP Theory. Although empirical software engineering research has grown common, it is unclear whether researchers are trying to evaluate the PPP Theory. To assess this, we analyzed half (33) of the empirical works published over the last two years in three prominent software engineering conferences. In this sample, 70% focus on policies/processes or products, not both. Only 33% provided measurements relating policy/process and products. We make four recommendations: (1) Use PPP Theory in study design; (2) Study feedback relationships; (3) Diversify the studied feed-forward relationships; and (4) Disentangle policy and process. Let us remember that research results are in the context of, and with respect to, the relationship between software products, processes, and policies.

Publisher's Version
A Vision on Intentions in Software Engineering
Jacob Krüger, Yi Li, Chenguang Zhu, Marsha Chechik, Thorsten Berger, and Julia Rubin
(Eindhoven University of Technology, Netherlands; Nanyang Technological University, Singapore; University of Texas at Austin, USA; University of Toronto, Canada; Ruhr University Bochum, Germany; Chalmers - University of Gothenburg, Sweden; University of British Columbia, Canada)
Intentions are fundamental in software engineering, but they are typically only implicitly considered through different abstractions, such as requirements, use cases, features, or issues. Specifically, software engineers develop and evolve (i.e., change) a software system based on such abstractions of a stakeholder’s intention—something a stakeholder wants the system to be able to do. Unfortunately, existing abstractions are (inherently) limited when it comes to representing stakeholder intentions and are mostly used for documenting only. So, whether a change in a system fulfills its underlying intention (and only this one) is an essential problem in practice that motivates many research areas (e.g., testing to ensure intended behavior, untangling intentions in commits). We argue that none of the existing abstractions is ideal for capturing intentions and controlling software evolution, which is why intentions are often vague and must be recovered, untangled, or understood in retrospect. In this paper, we reflect on the role of intentions (represented by changes) in software engineering and sketch how improving their management may support developers. Particularly, we argue that continuously managing and controlling intentions as well as their fulfillment has the potential to improve the reasoning about which stakeholder requests have been addressed, avoid misunderstandings, and prevent expensive retrospective analyses. To guide future research for achieving such benefits for researchers and practitioners, we discuss the relationships between different abstractions and intentions, and propose steps towards managing intentions.

Publisher's Version
Deeper Notions of Correctness in Image-Based DNNs: Lifting Properties from Pixel to Entities
Felipe Toledo, David Shriver, Sebastian Elbaum, and Matthew B. Dwyer
(University of Virginia, USA)
Deep Neural Networks (DNNs) that process images are being widely used for many safety-critical tasks, from autonomous vehicles to medical diagnosis. Currently, DNN correctness properties are defined at the pixel level over the entire input. Such properties are useful to expose system failures related to sensor noise or adversarial attacks, but they cannot capture features that are relevant to domain-specific entities and reflect richer types of behaviors. To overcome this limitation, we envision the specification of properties based on the entities that may be present in image input, capturing their semantics and how they change. Creating such properties today is difficult as it requires determining where the entities appear in images, defining how each entity can change, and writing a specification that is compatible with each particular V&V client. We introduce an initial framework structured around those challenges to assist in the generation of Domain-specific Entity-based properties automatically by leveraging object detection models to identify entities in images and creating properties based on entity features. Our feasibility study provides initial evidence that the new properties can uncover interesting system failures, for example that changes in skin color can modify the output of a gender classification network. We conclude by analyzing the framework's potential to implement the vision and by outlining directions for future work.

Publisher's Version

Demonstrations

LazyCow: A Lightweight Crowdsourced Testing Tool for Taming Android Fragmentation
Xiaoyu Sun, Xiao Chen, Yonghui Liu, John Grundy, and Li Li
(Australian National University, Australia; Monash University, Australia; Beihang University, China)
Android fragmentation refers to the increasing variety of Android devices and operating system versions. Their number makes it impossible to test an app on every supported device, resulting in many device compatibility issues and leading to poor user experiences. To mitigate this, a number of works that automatically detect compatibility issues have been proposed. However, current state-of-the-art techniques can only be used to detect specific types of compatibility issues (i.e., compatibility issues caused by API signature evolution), so many other essential categories of compatibility issues remain undetected. For instance, customised OS versions on real devices and semantic OS modifications could result in severe compatibility issues that are difficult to detect statically. In order to address this research gap and facilitate the prospect of taming Android fragmentation through crowdsourced efforts, we propose LazyCow, a novel, lightweight, crowdsourced testing tool. Our experimental results involving thousands of test cases on real Android devices demonstrate that LazyCow is effective at autonomously identifying and validating API-induced compatibility issues. The source code of both the client side and the server side is made publicly available in our artifact package. A demo video of our tool is available at https://www.youtube.com/watch?v=_xzWv_mo5xQ.

Publisher's Version
npm-follower: A Complete Dataset Tracking the NPM Ecosystem
Donald Pinckney, Federico Cassano, Arjun Guha, and Jonathan Bell
(Northeastern University, USA)
Software developers typically rely upon a large network of dependencies to build their applications. For instance, the NPM package repository contains over 3 million packages and serves tens of billions of downloads weekly. Understanding the structure and nature of packages, dependencies, and published code requires datasets that provide researchers with easy access to metadata and code of packages. However, prior work on NPM dataset construction typically has two limitations: 1) only metadata is scraped, and 2) packages or versions that are deleted from NPM cannot be scraped. Over 330,000 versions of packages were deleted from NPM between July 2022 and May 2023. This data is critical for researchers as it often pertains to important questions of security and malware. We present npm-follower, a dataset and crawling architecture which archives metadata and code of all packages and versions as they are published, and is thus able to retain data which is later deleted. The dataset currently includes over 35 million versions of packages, and grows at a rate of about 1 million versions per month. The dataset is designed to be easily used by researchers answering questions involving either metadata or program analysis. Both the code and dataset are available at: https://dependencies.science

Publisher's Version
Ad Hoc Syntax-Guided Program Reduction
Jia Le Tian, Mengxiao Zhang, Zhenyang Xu, Yongqiang Tian, Yiwen Dong, and Chengnian Sun
(University of Waterloo, Canada)
Program reduction is a widely adopted, indispensable technique for debugging language implementations such as compilers and interpreters. Given a program 𝑃 and a bug triggered by 𝑃, a program reducer can produce a minimized program 𝑃∗ that is derived from 𝑃 and still triggers the same bug. Perses is one of the state-of-the-art program reducers. It leverages the syntax of 𝑃 to guide the reduction process for efficiency and effectiveness. It is language-agnostic as its reduction algorithm is independent of any language-specific syntax. Conceptually, to support a new language, Perses only needs the context-free grammar 𝐺 of the language; in practice, however, this is not easy. One needs to first manually transform 𝐺 into a special grammar form PNF with a tool provided by Perses, second manually change the code base of Perses to integrate the new language, and lastly build a binary of Perses.
This paper presents our latest work to improve the usability of Perses by extending Perses to perform ad hoc program reduction for any new language as long as the language has a context-free grammar 𝐺. With this extended version (referred to as Persesadhoc), the difficulty of supporting new languages is significantly reduced: a user only needs to write a configuration file and execute one command to support a new language in Perses, compared to manually transforming the grammar format, modifying the code base, and re-building Perses.
Our case study demonstrates that with Persesadhoc, the Perses related infrastructure code required for supporting GLSL can be reduced from 190 lines of code to 20. Our extensive evaluations also show that Persesadhoc is as effective and efficient as Perses in reducing hoc programs, and only takes 10 seconds to support a new language, which is negligible compared to the manual effort required in Perses. A video demonstration of the tool can be found at https://youtu.be/trYwOT0mXhU.
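As an illustration of what a program reducer does in general, the following is a minimal sketch of a syntax-blind, chunk-removal reducer in the style of delta debugging. It is not Perses's syntax-guided algorithm and does not use a grammar; the names `reduce_tokens` and `still_triggers_bug` are hypothetical and chosen only for this example.

```python
from typing import Callable, List


def reduce_tokens(
    tokens: List[str],
    still_triggers_bug: Callable[[List[str]], bool],
) -> List[str]:
    """Greedily delete chunks of tokens while the bug predicate keeps holding.

    Hypothetical helper for illustration: a real reducer such as Perses
    works on the parse tree, whereas this sketch treats the program as a
    flat token list.
    """
    if not still_triggers_bug(tokens):
        raise ValueError("the original input must trigger the bug")

    chunk = max(len(tokens) // 2, 1)
    while chunk >= 1:
        removed_any = False
        i = 0
        while i < len(tokens):
            # Try dropping tokens[i : i + chunk].
            candidate = tokens[:i] + tokens[i + chunk:]
            if candidate and still_triggers_bug(candidate):
                tokens = candidate  # keep the smaller program
                removed_any = True
            else:
                i += chunk  # this chunk is needed, move on
        if not removed_any:
            chunk //= 2  # nothing removable at this size, refine
    return tokens


if __name__ == "__main__":
    program = "int a = 1 ; int b = 2 ; crash ( b ) ;".split()
    # Toy oracle: the "bug" needs both the call and its argument.
    oracle = lambda t: "crash" in t and "b" in t
    print(reduce_tokens(program, oracle))
```

The oracle here is a stand-in for running the compiler under test; in practice it would compile the candidate and check that the same failure still occurs.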

Publisher's Version Video
On Using Information Retrieval to Recommend Machine Learning Good Practices for Software Engineers
Laura Cabra-Acela, Anamaria Mojica-Hanke, Mario Linares-Vásquez, and Steffen Herbold
(University of Los Andes, Colombia; University of Passau, Germany)
Machine learning (ML) is nowadays widely used for different purposes and within several disciplines. From self-driving cars to automated medical diagnosis, machine learning models extensively support users’ daily activities, and software engineering tasks are no exception. Not embracing good ML practices may lead to pitfalls that hinder the performance of an ML system and potentially lead to unexpected results. Despite the existence of documentation and literature about ML best practices, many non-ML experts turn towards gray literature like blogs and Q&A systems when looking for help and guidance when implementing ML systems. To better aid users in distilling relevant knowledge from such sources, we propose a recommender system that recommends ML practices based on the user’s context. As a first step in creating a recommender system for machine learning practices, we implemented Idaka, a tool that provides two different approaches for retrieving/generating ML best practices: i) an information retrieval (IR) engine and ii) a large language model. The IR engine uses BM25 as the algorithm for retrieving the practices, and the large language model is, in our case, Alpaca. The platform has been designed to allow comparative studies of best practices retrieval tools. Idaka is publicly available at GitHub: https://bit.ly/idaka. Video: https://youtu.be/cEb-AhIPxnM
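To make the BM25 retrieval step mentioned above concrete, here is a minimal sketch of BM25 scoring over a tiny, made-up collection of practices. It is not Idaka's implementation: the function name `bm25_scores`, the sample documents, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions for illustration, and the IDF term uses a common non-negative variant.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str],
    docs: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Return one BM25 score per tokenized document for the given query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (the +1 keeps the log above zero).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


if __name__ == "__main__":
    practices = [
        "normalize input features before training".split(),
        "use version control for datasets".split(),
        "monitor model drift in production".split(),
    ]
    print(bm25_scores(["normalize", "features"], practices))
```

Documents that share no term with the query score exactly zero, so ranking by score surfaces the practice that mentions the queried concepts.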

Publisher's Version Published Artifact Video Info Artifacts Available
Helion: Enabling Natural Testing of Smart Homes
Prianka Mandal, Sunil Manandhar, Kaushal Kafle, Kevin Moran, Denys Poshyvanyk, and Adwait Nadkarni
(College of William and Mary, USA; IBM Research, USA; University of Central Florida, USA)
Prior work has developed numerous systems that test the security and safety of smart homes. For these systems to be applicable in practice, it is necessary to test them with realistic scenarios that represent the use of the smart home, i.e., home automation, in the wild. This demo paper presents the technical details and usage of Helion, a system that uses n-gram language modeling to learn the regularities in user-driven programs, i.e., routines developed for the smart home, and predicts natural scenarios of home automation, i.e., event sequences that reflect realistic home automation usage. We demonstrate the HelionHA platform, developed by integrating Helion with the popular Home Assistant smart home platform. HelionHA allows an end-to-end exploration of Helion’s scenarios by executing them as test cases with real and virtual smart home devices.
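The idea of learning regularities from routines with an n-gram model can be sketched with the simplest case, a bigram model over event sequences. This is only an illustration under stated assumptions, not Helion's model: the event names and the functions `train_bigram` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def train_bigram(sequences: List[List[str]]) -> Dict[str, Counter]:
    """Count how often each event is directly followed by each other event."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts: Dict[str, Counter], prev: str) -> Optional[str]:
    """Return the most frequent follower of `prev`, or None if unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical routines, each a sequence of home automation events.
    routines = [
        ["motion_detected", "light_on", "thermostat_heat"],
        ["motion_detected", "light_on"],
        ["motion_detected", "camera_on"],
    ]
    model = train_bigram(routines)
    print(predict_next(model, "motion_detected"))
```

Repeatedly feeding the predicted event back in as the next `prev` would yield a whole event sequence; a higher-order n-gram model conditions on more than one preceding event.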

Publisher's Version
A Language Model of Java Methods with Train/Test Deduplication
Chia-Yi Su, Aakash Bansal, Vijayanta Jain, Sepideh Ghanavati, and Collin McMillan
(University of Notre Dame, USA; University of Maine, USA)
This tool demonstration presents a research toolkit for a language model of Java source code. The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java. In contrast to many existing language models, we prioritize features for researchers including an open and easily-searchable training set, a held out test set with different levels of deduplication from the training set, infrastructure for deduplicating new examples, and an implementation platform suitable for execution on equipment accessible to a relatively modest budget. Our model is a GPT2-like architecture with 350m parameters. Our training set includes 52m Java methods (9b tokens) and 13m StackOverflow threads (10.5b tokens). To improve accessibility of research to more members of the community, we limit local resource requirements to GPUs with 16GB video memory. We provide a test set of held out Java methods that include descriptive comments, including the entire Java projects for those methods. We also provide deduplication tools using precomputed hash tables at various similarity thresholds to help researchers ensure that their own test examples are not in the training set. We make all our tools and data open source and available via Huggingface and Github.
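The hash-table-based check that a test example is not in the training set can be sketched as follows. This is a simplified illustration, not the toolkit's code: it only catches duplicates that are identical after comment and whitespace normalization (no similarity thresholds), the comment-stripping regexes ignore string literals, and all function names are hypothetical.

```python
import hashlib
import re
from typing import Iterable, Set


def normalize(method_source: str) -> str:
    """Drop comments and collapse whitespace so formatting does not matter."""
    no_block = re.sub(r"/\*.*?\*/", " ", method_source, flags=re.DOTALL)
    no_line = re.sub(r"//[^\n]*", " ", no_block)
    return " ".join(no_line.split())


def fingerprint(method_source: str) -> str:
    """Hash of the normalized method text."""
    return hashlib.sha256(normalize(method_source).encode("utf-8")).hexdigest()


def build_index(training_methods: Iterable[str]) -> Set[str]:
    """Precompute the fingerprints of all training methods."""
    return {fingerprint(m) for m in training_methods}


def is_duplicate(candidate: str, index: Set[str]) -> bool:
    """True if the candidate matches a training method after normalization."""
    return fingerprint(candidate) in index


if __name__ == "__main__":
    training = ["int add(int a, int b) {\n  // sum\n  return a + b;\n}"]
    index = build_index(training)
    print(is_duplicate("int add(int a, int b) { return a + b; }", index))
    print(is_duplicate("int sub(int a, int b) { return a - b; }", index))
```

Near-duplicate detection at a given similarity threshold would need a fuzzier fingerprint than an exact hash; that part is deliberately left out of this sketch.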

Publisher's Version Info
DENT: A Tool for Tagging Stack Overflow Posts with Deep Learning Energy Patterns
Shriram Shanbhag, Sridhar Chimalakonda, Vibhu Saujanya Sharma, and Vikrant Kaulgud
(IIT Tirupati, India; Accenture Labs, India)
Energy efficiency has become an important consideration in deep learning systems. However, it remains a largely under-emphasized aspect during development. Despite the emergence of energy-efficient deep learning patterns, their adoption remains a challenge due to limited awareness. To address this gap, we present DENT (Deep Learning Energy Pattern Tagger), a Chrome extension used to add "energy pattern tags" to deep learning related questions from Stack Overflow. The idea of DENT is to hint to developers about the possible energy-saving opportunities associated with a Stack Overflow post through energy pattern labels. We hope this will increase awareness about energy patterns in deep learning and improve their adoption. A preliminary evaluation of DENT achieved an average precision of 0.74, recall of 0.66, and an F1-score of 0.65 with an accuracy of 66%. The demonstration of the tool is available at https://youtu.be/S0Wf_w0xajw and the related artifacts are available at https://rishalab.github.io/DENT/

Publisher's Version
MASC: A Tool for Mutation-Based Evaluation of Static Crypto-API Misuse Detectors
Amit Seal Ami, Syed Yusuf Ahmed, Radowan Mahmud Redoy, Nathan Cooper, Kaushal Kafle, Kevin Moran, Denys Poshyvanyk, and Adwait Nadkarni
(College of William and Mary, USA; University of Dhaka, Bangladesh; University of Central Florida, USA)
While software engineers are optimistically adopting crypto-API misuse detectors (or crypto-detectors) in their software development cycles, this momentum must be accompanied by a rigorous understanding of crypto-detectors’ effectiveness at finding crypto-API misuses in practice. This demo paper presents the technical details and usage scenarios of our tool, namely Mutation Analysis for evaluating Static Crypto-API misuse detectors (MASC). We developed 12 generalizable, usage based mutation operators and three mutation scopes, namely Main Scope, Similarity Scope, and Exhaustive Scope, which can be used to expressively instantiate compilable variants of the crypto-API misuse cases. Using MASC, we evaluated nine major crypto-detectors, and discovered 19 unique, undocumented flaws. We designed MASC to be configurable and user-friendly; a user can configure the parameters to change the nature of generated mutations. Furthermore, MASC comes with both Command Line Interface and Web-based front-end, making it practical for users of different levels of expertise.

Publisher's Version
llvm2CryptoLine: Verifying Arithmetic in Cryptographic C Programs
Ruiling Chen, Jiaxiang Liu, Xiaomu Shi, Ming-Hsien Tsai, Bow-Yaw Wang, and Bo-Yin Yang
(Shenzhen University, China; Institute of Software at Chinese Academy of Sciences, China; National Institute of Cyber Security, Taiwan; Academia Sinica, Taiwan)
Correct implementations of cryptographic primitives are essential for modern security. These implementations often contain arithmetic operations involving non-linear computations that are infamously hard to verify. We present llvm2CryptoLine, an automated formal verification tool for arithmetic operations in cryptographic C programs. llvm2CryptoLine successfully verifies 51 arithmetic C programs from industrial cryptographic libraries OpenSSL, wolfSSL and NaCl. Most of the programs are verified fully automatically and efficiently. A screencast that showcases llvm2CryptoLine can be found at https://youtu.be/QXuSmja45VA. Source code is available at https://github.com/fmlab-iis/llvm2cryptoline.

Publisher's Version Video Info
P4b: A Translator from P4 Programs to Boogie
Chong Ye and Fei He
(Tsinghua University, China)
P4 is a mainstream language for Software Defined Network (SDN) data planes. P4 is designed to achieve target-independent, protocol-independent, and configurable SDN data planes. However, logic errors may occur in P4 programs, resulting in improper packet processing, which may cause serious network errors and information disclosure. In addition, P4 programs contain many branches and thus are more challenging to ensure correctness. Formal verification is a powerful technique to verify the correctness of P4 programs. Unfortunately, current P4 verification studies lack basic toolchains, and their intermediate languages are not expressive enough. We present P4b, an efficient translator from P4 programs to Boogie, a verification-oriented intermediate representation. We provide formal translation rules to ensure the correctness of the translation process. The translated results can be verified by the toolchain of Boogie. We conducted experiments on 170 P4 programs collected from GitHub, and the experimental results demonstrate that our translator is useful and practical. The screencast is available at https://youtu.be/8_rEj3QFQeM. The tool is available at https://github.com/Invincibleyc/P4B-Translator.

Publisher's Version Video Info
D2S2: Drag ’n’ Drop Mobile App Screen Search
Soumik Mohian, Tony Tang, Tuan Trinh, Don Dang, and Christoph Csallner
(University of Texas at Arlington, USA)
The lack of diverse UI element representations in publicly available datasets hinders the scalability of sketch-based interactive mobile search. This paper introduces D2S2, a novel approach that addresses this limitation via drag-and-drop mobile screen search, accommodating visual and text-based queries. D2S2 searches 58k Rico screens for relevant UI examples based on UI element attributes, including type, position, shape, and text. In an evaluation with 10 novice software developers, D2S2 successfully retrieves target screens within the top-20 search results in 15/19 attempts within a minute. The tool offers interactive and iterative search, updating its search results each time the user modifies the search query. Interested users can freely access D2S2 (http://pixeltoapp.com/D2S2), build on D2S2 or replicate results via D2S2’s open-source implementation (https://github.com/toni-tang/D2S2), or watch D2S2’s video demonstration (https://youtu.be/fdoYiw8lAn0).

Publisher's Version Video
CONAN: Statically Detecting Connectivity Issues in Android Applications
Alejandro Mazuera-Rozo, Camilo Escobar-Velásquez, Juan Espitia-Acero, Mario Linares-Vásquez, and Gabriele Bavota
(USI Lugano, Switzerland; University of Los Andes, Colombia)
Mobile apps are increasingly used in daily activities. Most apps require Internet connectivity to be fully exploited. Despite the fact that global access to the Internet has improved over the years, there are still complex connectivity scenarios, including situations with zero/unreliable connectivity. In such scenarios, improper handling of Eventual Connectivity Issues may cause bugs and crashes that worsen the user experience. Even though these issues have been studied in the literature, no automatic detection techniques are available. To address the mentioned gap, we have created the open source CONAN tool. CONAN can statically detect 16 types of Eventual Connectivity Issues within Android apps; it works at the source code level and alerts developers of any connectivity issue, highlighting them directly in the IDE or generating a report explaining the detected errors. In this paper, we present the technical aspects and a video of our tool, which are publicly available at https://tinyurl.com/CONAN-lint.

Publisher's Version Video Info

Student Research Competition

A Data Set of Extracted Rationale from Linux Kernel Commit Messages
Mouna Dhaouadi
(Université de Montréal, Canada)
Developers’ commit messages contain information about decisions taken and their rationale. Extracting this information is challenging since we lack a detailed understanding of how developers express these concepts. Our work-in-progress targets this challenge by producing a labelled data set of commit messages for a Linux Kernel component. We report preliminary analyses which suggest that larger commit messages and commits by more experienced developers tend towards having 40% of sentences containing rationale. This may indicate a guideline for developers to target.

Publisher's Version
Detecting Overfitting of Machine Learning Techniques for Automatic Vulnerability Detection
Niklas Risse
(MPI-SP, Germany)
Recent results of machine learning for automatic vulnerability detection have been very promising indeed: Given only the source code of a function f, models trained by machine learning techniques can decide if f contains a security flaw with up to 70% accuracy.
But how do we know that these results are general and not specific to the datasets? To study this question, researchers proposed to amplify the testing set by injecting semantic preserving changes and found that the model’s accuracy significantly drops. In other words, the model uses some unrelated features during classification. In order to increase the robustness of the model, researchers proposed to train on amplified training data, and indeed model accuracy increased to previous levels.
In this paper, we replicate and continue this investigation, and provide an actionable model benchmarking methodology to help researchers better evaluate advances in machine learning for vulnerability detection. Specifically, we propose a cross validation algorithm, where a semantic preserving transformation is applied during the amplification of either the training set or the testing set. Using 11 transformations and 3 ML techniques, we find that the improved robustness only applies to the specific transformations used during training data amplification. In other words, the robustified models still rely on unrelated features for predicting the vulnerabilities in the testing data.

Publisher's Version
Detection of Optimizations Missed by the Compiler
Yi Zhang
(Nanjing University, China)
With the increasing significance of compilers in software development, identifying optimization bugs and enhancing their performance has become a significant challenge. In recent years, many studies have targeted only specific analyses to identify missed optimizations (e.g., in data flow analyses). While there have been some general approaches, most rely on differential testing between different compilers, which makes it difficult to identify optimization bugs that are common to multiple compilers. This paper tackles these challenges by introducing an effective and general approach called MOD. MOD works by using a manually-curated mapping between optimizations, ensuring that code should be consistently optimized: if one optimization triggers but the other does not, that indicates a bug in either of the two optimizations. We implemented MOD as a tool to detect missed optimizations in the available expressions of the LLVM. Experimental results show that MOD can report 20 correct alerts in one hour of detection of random test programs.

Publisher's Version
Do All Software Projects Die When Not Maintained? Analyzing Developer Maintenance to Predict OSS Usage
Emily Nguyen
(University of California at Los Angeles, USA)
Past research suggests software should be continuously maintained in order to remain useful in our digital society. To determine whether these studies on software evolution are supported in modern-day software libraries, we conduct a natural experiment on 26,050 GitHub repositories, statistically modeling library usage based on their package-level downloads against different factors related to project maintenance.

Publisher's Version
Inferring Complexity Bounds from Recurrence Relations
Didier Ishimwe
(George Mason University, USA)
Determining program complexity bounds is a fundamental problem with a variety of applications in software development. In this paper we present a novel approach for computing the asymptotic complexity bounds of non-deterministic recursive programs by solving dynamically inferred recurrence relations. Recurrences are inferred from program execution traces and solved using the annihilator method and Master Theorem to obtain closed-form solutions representing the complexity bounds.
Our preliminary evaluation shows that this approach can learn correct bounds for popular classical recursive programs (e.g. O(n^2) for quicksort), achieving more precise bounds for exponential programs than previously reported (e.g. O(((1+√5)/2)^n) for fibonacci), and capturing a wide range of bounds including non-linear polynomial and non-polynomial, logarithmic, and exponential relations.
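As a small worked illustration of solving a recurrence with the Master Theorem, the sketch below handles only the simple divide-and-conquer form T(n) = a·T(n/b) + Θ(n^d). It is not the paper's solver: the function name `master_theorem` is hypothetical, and recurrences such as T(n) = T(n-1) + T(n-2) + 1 for fibonacci fall outside this form and are the kind of linear recurrence the annihilator method addresses.

```python
import math


def master_theorem(a: int, b: int, d: float) -> str:
    """Closed-form bound for T(n) = a*T(n/b) + Theta(n^d).

    Requires a >= 1, b > 1, d >= 0. Compares d with the critical
    exponent log_b(a) and returns the bound as a string.
    """
    if a < 1 or b <= 1 or d < 0:
        raise ValueError("need a >= 1, b > 1 and d >= 0")

    critical = math.log(a, b)  # log base b of a
    if math.isclose(d, critical):
        # Work is balanced across all levels of the recursion tree.
        return f"Theta(n^{d:g} log n)"
    if d > critical:
        # The work at the root dominates.
        return f"Theta(n^{d:g})"
    # The work at the leaves dominates.
    return f"Theta(n^{critical:g})"


if __name__ == "__main__":
    print(master_theorem(2, 2, 1))  # merge sort style recurrence
    print(master_theorem(3, 2, 1))  # Karatsuba style recurrence
    print(master_theorem(2, 2, 2))  # quadratic combine step
```

In a dynamic setting, the constants a, b, and d would first have to be inferred from execution traces; here they are simply passed in.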

Publisher's Version
LLM-Based Code Generation Method for Golang Compiler Testing
Qiuhan Gu
(Nanjing University, China)
Modern optimizing compilers are among the most complex software systems humans build. One way to identify subtle compiler bugs is fuzzing. Both the quantity and the quality of testcases are crucial to the performance of fuzzing. Traditional testcase-generation methods, such as Csmith and YARPGen, have been proven successful at discovering compiler bugs. However, such generated testcases have limited coverage and quantity. In this paper, we present a code generation method for compiler testing based on LLM to maximize the quality and quantity of the generated code. In particular, to avoid undefined behavior and syntax errors in generated testcases, we design a filter strategy to clean the source code, preparing a high-quality dataset for the model training. Besides, we present a seed schedule strategy to improve code generation. We apply the method to test the Golang compiler and the result shows that our pipeline outperforms previous methods both qualitatively and quantitatively. It produces testcases with an average coverage of 3.38%, in contrast to the testcases generated by GoFuzz, which have an average coverage of 0.44%. Moreover, among all the generated testcases, only 2.79% exhibited syntax errors, and none displayed undefined behavior.

Publisher's Version
Privacy-Centric Log Parsing for Timely, Proactive Personal Data Protection
Issam Sedki
(Concordia University, Canada)
This paper presents a privacy-centric approach to log parsing, addressing the growing need for privacy compliance in log management. We propose a novel log parser that focuses on data minimization, a key principle in privacy protection. By integrating privacy considerations into the log parsing process, our approach enables proactive and timely privacy compliance and mitigation of privacy breaches.

Publisher's Version
STraceBERT: Source Code Retrieval using Semantic Application Traces
Claudio Spiess
(University of California at Davis, USA)
Software reverse engineering is an essential task in software engineering and security, but it can be a challenging process, especially for adversarial artifacts. To address this challenge, we present STraceBERT, a novel approach that utilizes a Java dynamic analysis tool to record calls to core Java libraries, and pretrain a BERT-style model on the recorded application traces for effective method source code retrieval from a candidate set. Our experiments demonstrate the effectiveness of STraceBERT in retrieving the source code compared to existing approaches. Our proposed approach offers a promising solution to the problem of code retrieval in software reverse engineering and opens up new avenues for further research in this area.

Publisher's Version
The Call Graph Chronicles: Unleashing the Power Within
Masudul Hasan Masud Bhuiyan
(CISPA Helmholtz Center for Information Security, Germany)
Call graph generation is critical for program understanding and analysis, but achieving both accuracy and precision is challenging. Existing methods trade off one for the other, particularly in dynamic languages like JavaScript. This paper introduces "Graphia," an approach that combines structural and semantic information using a Graph Neural Network (GNN) to enhance call graph accuracy. Graphia’s two-step process employs an initial call graph as training data for the GNN, which then uncovers true call edges in new programs. Experimental results show Graphia significantly improves true positive rates in vulnerability detection, achieving up to 95%. This approach advances call graph accuracy by effectively incorporating code structure and context, particularly in complex dynamic language scenarios.

Publisher's Version
The State of Survival in OSS: The Impact of Diversity
Zixuan Feng
(Oregon State University, USA)
Maintaining and retaining contributors is crucial for Open Source (OSS) projects. However, there is often a high turnover among contributors (in some projects as high as 80%). The survivability of contributors is influenced by various factors, including their demographics. Research on contributors’ survivability must, therefore, consider diversity factors. This study longitudinally analyzed the impact of demographic attributes on survivability in the Flutter community through the lens of gender, region, and compensation. The preliminary analysis reveals that affiliated or Western contributors have a higher survival probability than volunteer or Non-Western contributors. However, no significant difference was found in the survival probability between men and women.

Publisher's Version
