ESEC/FSE 2023 – Author Index |
Abreu, Rui |
ESEC/FSE '23-INDUSTRY: "Dead Code Removal at Meta: ..."
Dead Code Removal at Meta: Automatically Deleting Millions of Lines of Code and Petabytes of Deprecated Data
Will Shackleton, Katriel Cohn-Gordon, Peter C. Rigby, Rui Abreu, James Gill, Nachiappan Nagappan, Karim Nakad, Ioannis Papagiannis, Luke Petre, Giorgi Megreli, Patrick Riggs, and James Saindon (Meta, USA; Concordia University, Canada) Software constantly evolves in response to user needs: new features are built, deployed, mature and grow old, and eventually their usage drops enough to merit switching them off. In any large codebase, this feature lifecycle can naturally lead to retaining unnecessary code and data. Removing them respects users’ privacy expectations and helps engineers work efficiently. In prior software engineering research, we have found little evidence of code deprecation or dead-code removal at industrial scale. We describe the Systematic Code and Asset Removal Framework (SCARF), a product deprecation system to assist engineers working in large codebases. SCARF identifies unused code and data assets and safely removes them. It operates fully automatically, including committing code and dropping database tables. It also gathers developer input where it cannot take automated actions, leading to further removals. Dead code removal increases the quality and consistency of large codebases, aids with knowledge management, and improves reliability. SCARF has had an important impact at Meta. In the last year alone, it has removed petabytes of data across 12.8 million distinct assets and deleted over 104 million lines of code. @InProceedings{ESEC/FSE23p1705, author = {Will Shackleton and Katriel Cohn-Gordon and Peter C. Rigby and Rui Abreu and James Gill and Nachiappan Nagappan and Karim Nakad and Ioannis Papagiannis and Luke Petre and Giorgi Megreli and Patrick Riggs and James Saindon}, title = {Dead Code Removal at Meta: Automatically Deleting Millions of Lines of Code and Petabytes of Deprecated Data}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1705--1715}, doi = {10.1145/3611643.3613871}, year = {2023}, } Publisher's Version ESEC/FSE '23-INDUSTRY: "Modeling the Centrality of ..." Modeling the Centrality of Developer Output with Software Supply Chains Audris Mockus, Peter C. Rigby, Rui Abreu, Parth Suresh, Yifen Chen, and Nachiappan Nagappan (Meta, USA; University of Tennessee, USA; Concordia University, Canada) Raw developer output, as measured by the number of changes a developer makes to the system, is a simplistic and potentially misleading measure of productivity, as new developers tend to work on peripheral parts of the system and experienced developers on more central ones. In this work, we use Software Supply Chain (SSC) networks, with Katz centrality and PageRank computed on them, to suggest a more nuanced measure of developer productivity. Our SSC is a network that represents the relationships between developers and artifacts that make up a system. We combine author-to-file, co-changing files, call hierarchies, and reporting structure into a single SSC and calculate the centrality of each node. The measures of centrality can be used to better understand variations in the impact of developer output at Meta. We start by partially replicating prior work and show that the raw number of developer commits plateaus over a project-specific period. However, the centrality of developer work grows for the entire period of study, but the growth slows after one year. This implies that while raw output might plateau, more experienced developers work on more central parts of the system.
Finally, we investigate the incremental contribution of SSC attributes in modeling developer output. We find that local attributes such as the number of reports and the specific project do not explain much variation (R² = 5.8%). In contrast, adding Katz centrality or PageRank produces a model with an R² above 30%. SSCs and their centrality provide valuable insights into the centrality and importance of a developer’s work. @InProceedings{ESEC/FSE23p1809, author = {Audris Mockus and Peter C. Rigby and Rui Abreu and Parth Suresh and Yifen Chen and Nachiappan Nagappan}, title = {Modeling the Centrality of Developer Output with Software Supply Chains}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1809--1819}, doi = {10.1145/3611643.3613873}, year = {2023}, } Publisher's Version |
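To make the centrality computation concrete, here is a minimal sketch of Katz centrality and PageRank on a toy SSC-style graph, assuming the networkx library; the nodes and edges below are invented for illustration and are not Meta's data.

    import networkx as nx

    G = nx.DiGraph()
    # Author-to-file edges (authorship) plus file-to-file edges (calls, co-change).
    G.add_edges_from([
        ("alice", "core/api.py"), ("alice", "core/db.py"),
        ("bob", "ui/form.py"),
        ("core/api.py", "core/db.py"),   # call dependency
        ("ui/form.py", "core/api.py"),   # co-change relation
    ])

    katz = nx.katz_centrality(G, alpha=0.05)   # attenuation factor
    pr = nx.pagerank(G, alpha=0.85)            # damping factor
    for node in sorted(G, key=katz.get, reverse=True):
        print(f"{node:12s} katz={katz[node]:.3f} pagerank={pr[node]:.3f}")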
|
Acharya, Birendra |
ESEC/FSE '23: "A Case Study of Developer ..."
A Case Study of Developer Bots: Motivations, Perceptions, and Challenges
Sumit Asthana, Hitesh Sajnani, Elena Voyloshnikova, Birendra Acharya, and Kim Herzig (University of Michigan, USA; Trade Desk, USA; Microsoft, USA) Continuous integration and deployment (CI/CD) is now a widely adopted development model in practice as it reduces the time from ideas to customers. This adoption has also revived the idea of "shifting left" during software development -- a practice intended to find and prevent defects early in the software delivery process. To assist with that, engineering systems integrate developer bots in the development workflow to improve developer productivity and help them identify issues early in the software delivery process. In this paper, we present a case study of developer bots in Microsoft. We identify and analyze 23 developer bots that are deployed across 13,000 repositories and assist about 6,000 developers daily in their CI/CD software development workflows. We classify these bots across five major categories: Config Violation, Security, Data-privacy, Developer Productivity, and Code Quality. By conducting interviews and surveys with bot developers and bot users and by analyzing about half a million historical bot actions spanning over one and a half years, we present software workflows that motivate bot instrumentation, factors impacting their usefulness as perceived by bot users, and challenges associated with their use. Our findings echo existing issues with bots, such as noise, and illustrate new benefits (e.g., cross-team communication) and challenges (e.g., too many bots) for large software teams. @InProceedings{ESEC/FSE23p1268, author = {Sumit Asthana and Hitesh Sajnani and Elena Voyloshnikova and Birendra Acharya and Kim Herzig}, title = {A Case Study of Developer Bots: Motivations, Perceptions, and Challenges}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1268--1280}, doi = {10.1145/3611643.3616248}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
|
Adams, Bram |
ESEC/FSE '23: "IoPV: On Inconsistent Option ..."
IoPV: On Inconsistent Option Performance Variations
Jinfu Chen, Zishuo Ding, Yiming Tang, Mohammed Sayagh, Heng Li, Bram Adams, and Weiyi Shang (Wuhan University, China; University of Waterloo, Canada; Rochester Institute of Technology, USA; ÉTS, Canada; Polytechnique Montréal, Canada; Queen’s University, Canada) Maintaining good performance is an essential task when evolving a software system. Performance regression issues are among the dominant problems that large software systems face. In addition, these large systems tend to be highly configurable, which allows users to change the behaviour of these systems by simply altering the values of certain configuration options. However, such flexibility comes with a cost. Such software systems suffer throughout their evolution from what we refer to as “Inconsistent Option Performance Variation” (IoPV). An IoPV indicates, for a given commit, that the performance regression or improvement of different values of the same configuration option is inconsistent compared to the prior commit. For instance, a new change might not suffer from any performance regression under the default configuration (i.e., when all the options are set to their default values), while altering one option’s value manifests a regression, which we refer to as a hidden regression as it is not manifested under the default configuration. Similarly, when developers improve the performance of their systems, a performance regression might be manifested under a subset of the existing configurations. Unfortunately, such hidden regressions are harmful as they can go unseen into the production environment. In this paper, we first quantify how prevalent (in)consistent performance regression or improvement is among the values of an option. In particular, we study 803 Hadoop and 502 Cassandra commits, for which we execute a total of 4,902 and 4,197 tests, respectively, amounting to 12,536 machine hours of testing. We observe that IoPV is a common problem that is difficult to manually predict: 69% and 93% of the Hadoop and Cassandra commits have at least one configuration that hides a performance regression. Worse, most of the commits have different options or tests leading to IoPV and hiding performance regressions. Therefore, we propose a prediction model that identifies whether a given combination of commit, test, and option (CTO) manifests an IoPV. Our evaluation of different models shows that random forest is the best-performing classifier, with a median AUC of 0.91 and 0.82 for Hadoop and Cassandra, respectively. Our paper defines and provides scientific evidence about the IoPV problem and its prevalence, which can be explored by future work. In addition, we provide an initial machine learning model for predicting IoPV. @InProceedings{ESEC/FSE23p845, author = {Jinfu Chen and Zishuo Ding and Yiming Tang and Mohammed Sayagh and Heng Li and Bram Adams and Weiyi Shang}, title = {IoPV: On Inconsistent Option Performance Variations}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {845--857}, doi = {10.1145/3611643.3616319}, year = {2023}, } Publisher's Version |
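As a concrete illustration of the prediction step, here is a minimal sketch assuming scikit-learn and synthetic data: a random forest classifies commit-test-option (CTO) combinations as manifesting an IoPV or not, evaluated by AUC as in the paper. The feature choices are invented placeholders, not the study's actual features.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    # Hypothetical CTO features: lines changed, options touched, option's past IoPV rate.
    X = rng.random((500, 3))
    y = (X[:, 2] + 0.2 * rng.standard_normal(500) > 0.5).astype(int)  # 1 = IoPV

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))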
|
Agarwal, Shubham |
ESEC/FSE '23: "Outage-Watch: Early Prediction ..."
Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer
Shubham Agarwal, Sarthak Chakraborty, Shaddy Garg, Sumit Bisht, Chahat Jain, Ashritha Gonuguntla, and Shiv Saini (Adobe Research, India; University of Illinois at Urbana-Champaign, USA; Adobe, India; Amazon, India; Traceable.ai, India; Cisco, India) Cloud services are omnipresent, and critical cloud service failure is a fact of life. In order to retain customers and prevent revenue loss, it is important to provide high reliability guarantees for these services. One way to do this is by predicting outages in advance, which can help in reducing the severity as well as the time to recovery. It is difficult to forecast critical failures due to the rarity of these events. Moreover, critical failures are ill-defined in terms of observable data. Our proposed method, Outage-Watch, defines critical service outages as deteriorations in the Quality of Service (QoS) captured by a set of metrics. Outage-Watch detects such outages in advance by using the current system state to predict whether the QoS metrics will cross a threshold and initiate an extreme event. A mixture of Gaussians is used to model the distribution of the QoS metrics for flexibility, and an extreme event regularizer helps in improving learning in the tail of the distribution. An outage is predicted if the probability of any one of the QoS metrics crossing its threshold changes significantly. Our evaluation on a real-world SaaS company dataset shows that Outage-Watch significantly outperforms traditional methods with an average AUC of 0.98. Additionally, Outage-Watch detects all the outages exhibiting a change in service metrics and reduces the Mean Time To Detection (MTTD) of outages by up to 88% when deployed in an enterprise cloud-service system, demonstrating the efficacy of our proposed method. @InProceedings{ESEC/FSE23p682, author = {Shubham Agarwal and Sarthak Chakraborty and Shaddy Garg and Sumit Bisht and Chahat Jain and Ashritha Gonuguntla and Shiv Saini}, title = {Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {682--694}, doi = {10.1145/3611643.3616316}, year = {2023}, } Publisher's Version |
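The core probabilistic step can be sketched with made-up mixture parameters: an outage signal is the tail mass of a Gaussian mixture beyond a QoS threshold. This assumes scipy; the weights, means, and threshold below are illustrative, not values from Outage-Watch.

    from scipy.stats import norm

    weights = [0.7, 0.3]      # mixture weights (sum to 1)
    means = [120.0, 300.0]    # e.g. latency in ms; second component models the tail
    stds = [15.0, 80.0]
    threshold = 250.0         # QoS threshold defining an extreme event

    # Tail probability of the mixture: weighted sum of per-component survival functions.
    p_exceed = sum(w * norm.sf(threshold, mu, s)
                   for w, mu, s in zip(weights, means, stds))
    print(f"P(latency > {threshold} ms) = {p_exceed:.3f}")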
|
Ahmad, Wasi |
ESEC/FSE '23: "Towards Greener Yet Powerful ..."
Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study
Xiaokai Wei, Sujan Kumar Gonugondla, Shiqi Wang, Wasi Ahmad, Baishakhi Ray, Haifeng Qian, Xiaopeng Li, Varun Kumar, Zijian Wang, Yuchen Tian, Qing Sun, Ben Athiwaratkun, Mingyue Shang, Murali Krishna Ramanathan, Parminder Bhatia, and Bing Xiang (AWS AI Labs, USA) ML-powered code generation aims to assist developers in writing code more productively by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have pushed the boundary of code generation and achieved impressive performance. However, the huge number of model parameters poses a significant challenge to their adoption in a typical software development environment, where a developer might use a standard laptop or mid-size server to develop code. Such large models cost significant resources in terms of memory, latency, dollars, and carbon footprint. Model compression is a promising approach to address these challenges. We have identified quantization as one of the most promising compression techniques for code generation, as it avoids expensive retraining costs. As quantization represents model parameters with lower-bit integers (e.g., int8), both the model size and runtime latency benefit. We empirically evaluate quantized models on code generation tasks across different dimensions: (i) resource usage and carbon footprint, (ii) accuracy, and (iii) robustness. Through systematic experiments we find a code-aware quantization recipe that can run even a 6-billion-parameter model on a regular laptop without significant accuracy or robustness degradation. We find that the recipe is readily applicable to the code summarization task as well. @InProceedings{ESEC/FSE23p224, author = {Xiaokai Wei and Sujan Kumar Gonugondla and Shiqi Wang and Wasi Ahmad and Baishakhi Ray and Haifeng Qian and Xiaopeng Li and Varun Kumar and Zijian Wang and Yuchen Tian and Qing Sun and Ben Athiwaratkun and Mingyue Shang and Murali Krishna Ramanathan and Parminder Bhatia and Bing Xiang}, title = {Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {224--236}, doi = {10.1145/3611643.3616302}, year = {2023}, } Publisher's Version |
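The basic mechanics of int8 quantization that the study builds on fit in a few lines of numpy: symmetric post-training quantization of a weight matrix, showing the 4x size reduction and the induced rounding error. This is a minimal sketch of the general technique, not the paper's code-aware recipe.

    import numpy as np

    w = np.random.randn(4, 4).astype(np.float32)       # toy fp32 weights
    scale = np.abs(w).max() / 127.0                    # map [-max, max] onto int8
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    w_dequant = w_int8.astype(np.float32) * scale      # reconstructed at inference

    print("max abs rounding error:", np.abs(w - w_dequant).max())
    print("bytes: fp32 =", w.nbytes, "-> int8 =", w_int8.nbytes)  # 4x smaller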
|
Ahmed, Khaled |
ESEC/FSE '23: "ViaLin: Path-Aware Dynamic ..."
ViaLin: Path-Aware Dynamic Taint Analysis for Android
Khaled Ahmed, Yingying Wang, Mieszko Lis, and Julia Rubin (University of British Columbia, Canada) Dynamic taint analysis, a program analysis technique that checks whether information flows between particular source and sink locations in a program, has numerous applications in security, program comprehension, and software testing. Specifically, in mobile software, taint analysis is often used to determine whether mobile apps contain stealthy behaviors that leak user-sensitive information to unauthorized third-party servers. While a number of dynamic taint analysis techniques for Android software have been recently proposed, none of them is able to report the complete information propagation path; they only report flow endpoints, i.e., the sources and sinks of the detected information flows. This design optimizes for runtime performance and allows the techniques to run efficiently on a mobile device. Yet, it impedes the applicability and usefulness of the techniques: an analyst using the tool would need to manually identify information propagation paths, e.g., to determine whether information was properly handled before being released, which is a challenging task in large real-world applications. In this paper, we address this problem by proposing a dynamic taint analysis technique that reports accurate taint propagation paths. We implement it in a tool, ViaLin, and evaluate it on a set of existing benchmark applications and on 16 large Android applications from the Google Play store. Our evaluation shows that ViaLin accurately detects taint flow paths while running on a mobile device with reasonable time and memory overhead. @InProceedings{ESEC/FSE23p1598, author = {Khaled Ahmed and Yingying Wang and Mieszko Lis and Julia Rubin}, title = {ViaLin: Path-Aware Dynamic Taint Analysis for Android}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1598--1610}, doi = {10.1145/3611643.3616330}, year = {2023}, } Publisher's Version |
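The difference between endpoint-only and path-aware reporting can be illustrated with a toy tainted-value wrapper that carries its propagation steps, so the sink can print the whole path as ViaLin does. The Android API names in the strings are illustrative, and this sketch is not ViaLin's implementation.

    class Tainted:
        def __init__(self, value, path):
            self.value, self.path = value, list(path)

    def source():                       # e.g. reading a device identifier
        return Tainted("device-id-1234", ["TelephonyManager.getDeviceId()"])

    def propagate(t, step, transform):  # record each propagation step
        return Tainted(transform(t.value), t.path + [step])

    def sink(t):                        # e.g. a network send
        print("LEAK:", " -> ".join(t.path + ["HttpURLConnection.write()"]))

    v = source()
    v = propagate(v, "StringBuilder.append()", lambda s: "id=" + s)
    sink(v)   # reports the full propagation path, not just the two endpoints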
|
Ahmed, Shibbir |
ESEC/FSE '23: "Design by Contract for Deep ..."
Design by Contract for Deep Learning APIs
Shibbir Ahmed, Sayem Mohammad Imtiaz, Samantha Syeda Khairunnesa, Breno Dantas Cruz, and Hridesh Rajan (Iowa State University, USA; Bradley University, USA) Deep Learning (DL) techniques are increasingly being incorporated into critical software systems today. DL software is buggy too. Recent work in SE has characterized these bugs, studied fix patterns, and proposed detection and localization strategies. In this work, we introduce a preventative measure. We propose design by contract for DL libraries, DL Contract for short, to document the properties of DL libraries and provide developers with a mechanism to identify bugs during development. While DL Contract builds on traditional design-by-contract techniques, it must address unique challenges. In particular, we need to document properties of the training process that are not visible at the functional interface of the DL libraries. To solve these problems, we have introduced mechanisms that allow developers to specify properties of the model architecture, data, and training process. We have designed and implemented DL Contract for Python-based DL libraries and used it to document the properties of Keras, a well-known DL library. We evaluate DL Contract in terms of effectiveness, runtime overhead, and usability. To evaluate the utility of DL Contract, we have developed 15 sample contracts specifically for training problems and structural bugs, and have adopted four well-vetted benchmarks from prior works on DL bug detection and repair. In terms of effectiveness, DL Contract correctly detects 259 bugs in the 272 real-world buggy programs from these benchmarks. We found that DL Contract's overhead is fairly minimal for the used benchmarks. Lastly, to evaluate usability, we conducted a survey of twenty participants who have used DL Contract to find and fix bugs. The results reveal that DL Contract can be very helpful to DL application developers when debugging their code. @InProceedings{ESEC/FSE23p94, author = {Shibbir Ahmed and Sayem Mohammad Imtiaz and Samantha Syeda Khairunnesa and Breno Dantas Cruz and Hridesh Rajan}, title = {Design by Contract for Deep Learning APIs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {94--106}, doi = {10.1145/3611643.3616247}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
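The flavor of such contracts can be sketched in plain Python with a precondition decorator. The softmax-before-training check below is modeled on the kind of structural-bug contract the paper describes, but the decorator and the dictionary standing in for a model are hypothetical, not DL Contract's actual API.

    def contract(precondition, message):
        def wrap(fn):
            def checked(*args, **kwargs):
                assert precondition(*args, **kwargs), "contract violated: " + message
                return fn(*args, **kwargs)
            return checked
        return wrap

    @contract(lambda model, n_classes: n_classes == 1
              or model["last_activation"] == "softmax",
              "multi-class classifier must end with a softmax layer")
    def train(model, n_classes):
        print("training", model["name"])

    train({"name": "mnist-cnn", "last_activation": "softmax"}, n_classes=10)  # ok
    try:
        train({"name": "buggy-cnn", "last_activation": "relu"}, n_classes=10)
    except AssertionError as e:
        print(e)   # bug caught before any training time is wasted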
|
Ahmed, Syed Yusuf |
ESEC/FSE '23-DEMO: "MASC: A Tool for Mutation-Based ..."
MASC: A Tool for Mutation-Based Evaluation of Static Crypto-API Misuse Detectors
Amit Seal Ami, Syed Yusuf Ahmed, Radowan Mahmud Redoy, Nathan Cooper, Kaushal Kafle, Kevin Moran, Denys Poshyvanyk, and Adwait Nadkarni (College of William and Mary, USA; University of Dhaka, Bangladesh; University of Central Florida, USA) While software engineers are optimistically adopting crypto-API misuse detectors (or crypto-detectors) in their software development cycles, this momentum must be accompanied by a rigorous understanding of crypto-detectors’ effectiveness at finding crypto-API misuses in practice. This demo paper presents the technical details and usage scenarios of our tool, namely Mutation Analysis for evaluating Static Crypto-API misuse detectors (MASC). We developed 12 generalizable, usage-based mutation operators and three mutation scopes, namely Main Scope, Similarity Scope, and Exhaustive Scope, which can be used to expressively instantiate compilable variants of the crypto-API misuse cases. Using MASC, we evaluated nine major crypto-detectors and discovered 19 unique, undocumented flaws. We designed MASC to be configurable and user-friendly; a user can configure the parameters to change the nature of the generated mutations. Furthermore, MASC comes with both a command-line interface and a web-based front-end, making it practical for users of different levels of expertise. @InProceedings{ESEC/FSE23p2162, author = {Amit Seal Ami and Syed Yusuf Ahmed and Radowan Mahmud Redoy and Nathan Cooper and Kaushal Kafle and Kevin Moran and Denys Poshyvanyk and Adwait Nadkarni}, title = {MASC: A Tool for Mutation-Based Evaluation of Static Crypto-API Misuse Detectors}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2162--2166}, doi = {10.1145/3611643.3613099}, year = {2023}, } Publisher's Version |
|
Aktas, Ethem Utku |
ESEC/FSE '23-INDUSTRY: "Issue Report Validation in ..."
Issue Report Validation in an Industrial Context
Ethem Utku Aktas, Ebru Cakmak, Mete Cihad Inan, and Cemal Yilmaz (Softtech Research and Development, Turkiye; Microsoft EMEA, Turkiye; Sabanci University, Turkiye) Effective issue triaging is crucial for software development teams to improve software quality, and thus customer satisfaction. Validating issue reports manually can be time-consuming, hindering the overall efficiency of the triaging process. This paper presents an approach for automating the validation of issue reports to accelerate the issue triaging process in an industrial set-up. We work on 1,200 randomly selected issue reports in the banking domain, written in Turkish, an agglutinative language, meaning that new words can be formed by the linear concatenation of suffixes, to the point of expressing entire sentences. We manually label these reports for validity and extract the relevant patterns indicating that they are invalid. Since the issue reports we work on are written in an agglutinative language, we use morphological analysis to extract the features. Using the proposed feature extractors, we utilize a machine learning based approach to predict the issue reports’ validity, achieving a 0.77 F1-score. @InProceedings{ESEC/FSE23p2026, author = {Ethem Utku Aktas and Ebru Cakmak and Mete Cihad Inan and Cemal Yilmaz}, title = {Issue Report Validation in an Industrial Context}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2026--2031}, doi = {10.1145/3611643.3613887}, year = {2023}, } Publisher's Version |
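A minimal sketch of the classification step, assuming scikit-learn: a text classifier predicts report validity from extracted features. Plain TF-IDF tokenization stands in for the paper's Turkish morphological analysis, and the two (English) reports are invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    reports = ["transfer screen freezes after login",   # valid defect report
               "please increase my card limit"]         # invalid: a request, not a defect
    labels = [1, 0]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(reports, labels)
    print(clf.predict(["screen freezes when opening the transfer page"]))  # -> [1]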
|
Ali, Shaukat |
ESEC/FSE '23-INDUSTRY: "Testing Real-World Healthcare ..."
Testing Real-World Healthcare IoT Application: Experiences and Lessons Learned
Hassan Sartaj, Shaukat Ali, Tao Yue, and Kjetil Moberg (Simula Research Laboratory, Norway; Oslo Metropolitan University, Norway; Norwegian Health Authority, Norway) Healthcare Internet of Things (IoT) applications require rigorous testing to ensure their dependability. Such applications are typically integrated with various third-party healthcare applications and medical devices through REST APIs. This integrated network of healthcare IoT applications leads to REST APIs with complicated and interdependent structures, thus creating a major challenge for automated system-level testing. We report an industrial evaluation of a state-of-the-art REST API testing approach (RESTest) on a real-world healthcare IoT application. We analyze the effectiveness of RESTest’s testing strategies regarding REST API failures, faults in the application, and REST API coverage, by experimenting with six REST APIs comprising 41 API endpoints of the healthcare IoT application. Results show that several failures are discovered in different REST APIs with ≈56% coverage using RESTest. Moreover, nine potential faults are identified. Using the evidence collected from the experiments, we provide our experiences and lessons learned. @InProceedings{ESEC/FSE23p2044, author = {Hassan Sartaj and Shaukat Ali and Tao Yue and Kjetil Moberg}, title = {Testing Real-World Healthcare IoT Application: Experiences and Lessons Learned}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2044--2049}, doi = {10.1145/3611643.3613888}, year = {2023}, } Publisher's Version ESEC/FSE '23-INDUSTRY: "EvoCLINICAL: Evolving Cyber-Cyber ..." EvoCLINICAL: Evolving Cyber-Cyber Digital Twin with Active Transfer Learning for Automated Cancer Registry System Chengjie Lu, Qinghua Xu, Tao Yue, Shaukat Ali, Thomas Schwitalla, and Jan F. Nygård (Simula Research Laboratory, Norway; University of Oslo, Norway; Oslo Metropolitan University, Norway; Cancer Registry of Norway, Norway; Arctic University of Norway, Norway) The Cancer Registry of Norway (CRN) collects information on cancer patients by receiving cancer messages from different medical entities (e.g., medical labs, hospitals) in Norway. Such messages are validated by an automated cancer registry system: GURI. Its correct operation is crucial, since it lays the foundation for cancer research and provides critical cancer-related statistics to its stakeholders. Constructing a cyber-cyber digital twin (CCDT) for GURI can facilitate various experiments and advanced analyses of the operational state of GURI without requiring intensive interactions with the real system. However, GURI constantly evolves due to novel medical diagnostics and treatment, technological advances, etc. Accordingly, the CCDT should evolve as well to synchronize with GURI. A key challenge of achieving such synchronization is that evolving the CCDT needs abundant data labelled by the new GURI. To tackle this challenge, we propose EvoCLINICAL, which considers the CCDT developed for the previous version of GURI as the pretrained model and fine-tunes it with a dataset labelled by querying the new GURI version. EvoCLINICAL employs a genetic algorithm to select an optimal subset of cancer messages from a candidate dataset and query GURI with it. We evaluate EvoCLINICAL on three evolution processes. The precision, recall, and F1 score are all greater than 91%, demonstrating the effectiveness of EvoCLINICAL.
Furthermore, we replace the active learning part of EvoCLINICAL with random selection to study the contribution of active learning to the overall performance of EvoCLINICAL. Results show that employing active learning in EvoCLINICAL increases its performance consistently. @InProceedings{ESEC/FSE23p1973, author = {Chengjie Lu and Qinghua Xu and Tao Yue and Shaukat Ali and Thomas Schwitalla and Jan F. Nygård}, title = {EvoCLINICAL: Evolving Cyber-Cyber Digital Twin with Active Transfer Learning for Automated Cancer Registry System}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1973--1984}, doi = {10.1145/3611643.3613897}, year = {2023}, } Publisher's Version ESEC/FSE '23-INDUSTRY: "Automated Test Generation ..." Automated Test Generation for Medical Rules Web Services: A Case Study at the Cancer Registry of Norway Christoph Laaber, Tao Yue, Shaukat Ali, Thomas Schwitalla, and Jan F. Nygård (Simula Research Laboratory, Norway; Oslo Metropolitan University, Norway; Cancer Registry of Norway, Norway; UiT The Arctic University of Norway, Norway) The Cancer Registry of Norway (CRN) collects, curates, and manages data related to cancer patients in Norway, supported by an interactive, human-in-the-loop, socio-technical decision support software system. Automated software testing of this software system is inevitable; however, it is currently limited in CRN’s practice. To this end, we present an industrial case study evaluating an AI-based system-level testing tool, i.e., EvoMaster, in terms of its effectiveness in testing CRN’s software system. In particular, we focus on GURI, CRN’s medical rule engine, which is a key component at the CRN. We test GURI with EvoMaster’s black-box and white-box tools and study their test effectiveness regarding code coverage, errors found, and domain-specific rule coverage. The results show that all EvoMaster tools achieve similar code coverage, i.e., around 19% line, 13% branch, and 20% method, and find a similar number of errors, i.e., 1 in GURI’s code. Concerning domain-specific coverage, EvoMaster’s black-box tool is the most effective in generating tests that lead to applied rules, i.e., 100% of the aggregation rules and between 12.86% and 25.81% of the validation rules, and to diverse rule execution results: 86.84% to 89.95% of the aggregation rules and 0.93% to 1.72% of the validation rules pass, and 1.70% to 3.12% of the aggregation rules and 1.58% to 3.74% of the validation rules fail. We further observe that the results are consistent across 10 versions of the rules. Based on these results, we recommend using EvoMaster’s black-box tool to test GURI, since it provides good results and advances the current state of practice at the CRN. Nonetheless, EvoMaster needs to be extended to employ domain-specific optimization objectives to improve test effectiveness further. Finally, we conclude with lessons learned and potential research directions, which we believe are applicable in a general context. @InProceedings{ESEC/FSE23p1937, author = {Christoph Laaber and Tao Yue and Shaukat Ali and Thomas Schwitalla and Jan F. Nygård}, title = {Automated Test Generation for Medical Rules Web Services: A Case Study at the Cancer Registry of Norway}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1937--1948}, doi = {10.1145/3611643.3613882}, year = {2023}, } Publisher's Version ESEC/FSE '23-INDUSTRY: "KDDT: Knowledge Distillation-Empowered ..."
KDDT: Knowledge Distillation-Empowered Digital Twin for Anomaly Detection Qinghua Xu, Shaukat Ali, Tao Yue, Zaimovic Nedim, and Inderjeet Singh (Simula Research Laboratory, Norway; University of Oslo, Norway; Oslo Metropolitan University, Norway; Alstom Rail, Sweden) Cyber-physical systems (CPSs), like train control and management systems (TCMS), are becoming ubiquitous in critical infrastructures. As they are safety-critical systems, ensuring their dependability during operation is crucial. Digital twins (DTs) have been increasingly studied for this purpose owing to their capability of runtime monitoring and warning, prediction and detection of anomalies, etc. However, constructing a DT for anomaly detection in TCMS necessitates sufficient training data and the extraction of high-quality chronological and context features. Hence, in this paper, we propose a novel method named KDDT for TCMS anomaly detection. KDDT harnesses a language model (LM) and a long short-term memory (LSTM) network to extract context and chronological features, respectively. To enrich the data volume, KDDT benefits from out-of-domain data with knowledge distillation (KD). We evaluated KDDT with two datasets from our industry partner Alstom and obtained F1 scores of 0.931 and 0.915, respectively, demonstrating the effectiveness of KDDT. We also explored the individual contributions of the DT model, LM, and KD to the overall performance of KDDT via a comprehensive empirical study, and observed average F1 score improvements of 12.4%, 3%, and 6.05%, respectively. @InProceedings{ESEC/FSE23p1867, author = {Qinghua Xu and Shaukat Ali and Tao Yue and Zaimovic Nedim and Inderjeet Singh}, title = {KDDT: Knowledge Distillation-Empowered Digital Twin for Anomaly Detection}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1867--1878}, doi = {10.1145/3611643.3613879}, year = {2023}, } Publisher's Version |
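The knowledge-distillation ingredient of KDDT can be written down as the standard distillation loss: the student matches temperature-softened teacher probabilities in addition to the hard labels. A numpy sketch with placeholder logits follows; KDDT's actual architecture and loss details may differ.

    import numpy as np

    def softmax(z, T=1.0):
        e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
        return e / e.sum(axis=-1, keepdims=True)

    def kd_loss(student_logits, teacher_logits, y_true, T=2.0, alpha=0.5):
        p_t = softmax(teacher_logits, T)               # softened teacher targets
        p_s = softmax(student_logits, T)
        soft = -(p_t * np.log(p_s + 1e-9)).sum(-1).mean() * T * T
        hard = -np.log(softmax(student_logits)[np.arange(len(y_true)), y_true]
                       + 1e-9).mean()
        return alpha * soft + (1 - alpha) * hard

    s = np.array([[2.0, 0.5], [0.2, 1.5]])   # student logits (normal vs anomaly)
    t = np.array([[1.8, 0.4], [0.1, 1.9]])   # teacher trained on out-of-domain data
    print(kd_loss(s, t, y_true=np.array([0, 1])))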
|
Alimadadi, Saba |
ESEC/FSE '23: "Code Coverage Criteria for ..."
Code Coverage Criteria for Asynchronous Programs
Mohammad Ganji, Saba Alimadadi, and Frank Tip (Simon Fraser University, Canada; Northeastern University, USA) Asynchronous software often exhibits complex and error-prone behaviors that should be tested thoroughly. Code coverage has been the most popular metric to assess test suite quality. However, traditional code coverage criteria do not adequately reflect completion, interactions, and error handling of asynchronous operations. This paper proposes novel test adequacy criteria for measuring: (i) completion of asynchronous operations in terms of both successful and exceptional execution, (ii) registration of reactions for handling both possible outcomes, and (iii) execution of said reactions through tests. We implement JScope, a tool for automatically measuring coverage according to these criteria in JavaScript applications, as an interactive plug-in for Visual Studio Code. An evaluation of JScope on 20 JavaScript applications shows that the proposed criteria can help improve assessment of test adequacy, complementing traditional criteria. According to our investigation of 15 real GitHub issues concerned with asynchrony, the new criteria can help reveal faulty asynchronous behaviors that are untested yet are deemed covered by traditional coverage criteria. We also report on a controlled experiment with 12 participants to investigate the usefulness of JScope in realistic settings, demonstrating its effectiveness in improving programmers’ ability to assess test adequacy and detect untested behavior of asynchronous code. @InProceedings{ESEC/FSE23p1307, author = {Mohammad Ganji and Saba Alimadadi and Frank Tip}, title = {Code Coverage Criteria for Asynchronous Programs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1307--1319}, doi = {10.1145/3611643.3616292}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
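The proposed criteria can be illustrated in a Python asyncio analogue (JScope itself targets JavaScript): an async operation is fully covered only if tests observe both its successful and its exceptional completion. The wrapper and names below are illustrative.

    import asyncio

    outcomes = {}   # operation name -> set of observed completion outcomes

    async def tracked(name, coro):
        try:
            result = await coro
            outcomes.setdefault(name, set()).add("fulfilled")
            return result
        except Exception:
            outcomes.setdefault(name, set()).add("rejected")
            raise

    async def fetch(fail):
        if fail:
            raise IOError("network error")
        return "data"

    async def main():
        await tracked("fetch", fetch(fail=False))      # successful completion
        try:
            await tracked("fetch", fetch(fail=True))   # exceptional completion
        except IOError:
            pass                                       # reaction to the failure

    asyncio.run(main())
    for op, seen in outcomes.items():
        print(op, "completion-covered:", seen == {"fulfilled", "rejected"})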
|
Allamanis, Miltiadis |
ESEC/FSE '23-INDUSTRY: "AdaptivePaste: Intelligent ..."
AdaptivePaste: Intelligent Copy-Paste in IDE
Xiaoyu Liu, Jinu Jang, Neel Sundaresan, Miltiadis Allamanis, and Alexey Svyatkovskiy (Microsoft, USA; Google, UK) In software development, it is common for programmers to copy-paste or port code snippets and then adapt them to their use case. This scenario motivates the code adaptation task – a variant of program repair which aims to adapt variable identifiers in a pasted snippet of code to the surrounding, preexisting context. However, no existing approach has been shown to effectively address this task. In this paper, we introduce AdaptivePaste, a learning-based approach to source code adaptation, based on transformers and a dedicated dataflow-aware deobfuscation pre-training task to learn meaningful representations of variable usage patterns. We demonstrate that AdaptivePaste can learn to adapt Python source code snippets with 67.8% exact match accuracy. We study the impact of confidence thresholds on the model predictions, showing the model precision can be further improved to 85.9% with our parallel-decoder transformer model in a selective code adaptation setting. To assess the practical use of AdaptivePaste we perform a user study among Python software developers on real-world copy-paste instances. The results show that AdaptivePaste reduces dwell time to nearly half the time it takes to port code manually, and helps to avoid bugs. In addition, we utilize the participant feedback to identify potential avenues for improvement. @InProceedings{ESEC/FSE23p1844, author = {Xiaoyu Liu and Jinu Jang and Neel Sundaresan and Miltiadis Allamanis and Alexey Svyatkovskiy}, title = {AdaptivePaste: Intelligent Copy-Paste in IDE}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1844--1854}, doi = {10.1145/3611643.3613895}, year = {2023}, } Publisher's Version |
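The underlying task can be shown with a deliberately crude stand-in for the learned model: rename an unbound identifier in a pasted snippet to a name already used in the surrounding context. AdaptivePaste learns this mapping with a transformer; the frequency heuristic below is only an illustration with invented names.

    pasted = "result = conn.execute(query)"
    unbound = "conn"                        # identifier undefined in the target file
    context_names = ["db_conn", "sql", "db_conn", "rows"]   # names in scope

    # Toy heuristic: pick the most frequent context name as the replacement.
    best = max(set(context_names), key=context_names.count)
    print(pasted.replace(unbound, best))    # result = db_conn.execute(query)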
|
Alon, Uri |
ESEC/FSE '23: "Contextual Predictive Mutation ..."
Contextual Predictive Mutation Testing
Kush Jain, Uri Alon, Alex Groce, and Claire Le Goues (Carnegie Mellon University, USA; Northern Arizona University, USA) Mutation testing is a powerful technique for assessing and improving test suite quality that artificially introduces bugs and checks whether the test suites catch them. However, it is also computationally expensive and thus does not scale to large systems and projects. One promising recent approach to tackling this scalability problem uses machine learning to predict whether the tests will detect the synthetic bugs, without actually running those tests. However, existing predictive mutation testing approaches still misclassify 33% of detection outcomes on a randomly sampled set of mutant-test suite pairs. We introduce MutationBERT, an approach for predictive mutation testing that simultaneously encodes the source method mutation and the test method, capturing key context in the input representation. Thanks to its higher precision, MutationBERT saves 33% of the time spent by a prior approach on checking/verifying live mutants. MutationBERT also outperforms the state of the art in both same-project and cross-project settings, with meaningful improvements in precision, recall, and F1 score. We validate our input representation and aggregation approaches for lifting predictions from the test-matrix level to the test-suite level, finding similar improvements in performance. MutationBERT not only enhances the state of the art in predictive mutation testing, but also presents practical benefits for real-world applications, both in saving developer time and in finding hard-to-detect mutants that prior approaches do not. @InProceedings{ESEC/FSE23p250, author = {Kush Jain and Uri Alon and Alex Groce and Claire Le Goues}, title = {Contextual Predictive Mutation Testing}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {250--261}, doi = {10.1145/3611643.3616289}, year = {2023}, } Publisher's Version |
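The prediction task itself is binary classification over mutant-test pairs. The sketch below, assuming scikit-learn, uses hand-made numeric features and synthetic labels purely to show the setup; MutationBERT instead encodes the mutated method and the test body with a transformer.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import precision_score, recall_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    # Hypothetical features: mutant on a covered line?, test asserts on the
    # mutated value?, call depth from the test to the mutated method.
    X = rng.random((400, 3))
    y = (0.6 * X[:, 0] + 0.4 * X[:, 1] > 0.5).astype(int)   # 1 = test kills mutant

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
    clf = GradientBoostingClassifier().fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print("precision:", precision_score(y_te, pred),
          "recall:", recall_score(y_te, pred))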
|
Ami, Amit Seal |
ESEC/FSE '23-DEMO: "MASC: A Tool for Mutation-Based ..."
MASC: A Tool for Mutation-Based Evaluation of Static Crypto-API Misuse Detectors
Amit Seal Ami, Syed Yusuf Ahmed, Radowan Mahmud Redoy, Nathan Cooper, Kaushal Kafle, Kevin Moran, Denys Poshyvanyk, and Adwait Nadkarni (College of William and Mary, USA; University of Dhaka, Bangladesh; University of Central Florida, USA) While software engineers are optimistically adopting crypto-API misuse detectors (or crypto-detectors) in their software development cycles, this momentum must be accompanied by a rigorous understanding of crypto-detectors’ effectiveness at finding crypto-API misuses in practice. This demo paper presents the technical details and usage scenarios of our tool, namely Mutation Analysis for evaluating Static Crypto-API misuse detectors (MASC). We developed 12 generalizable, usage-based mutation operators and three mutation scopes, namely Main Scope, Similarity Scope, and Exhaustive Scope, which can be used to expressively instantiate compilable variants of the crypto-API misuse cases. Using MASC, we evaluated nine major crypto-detectors and discovered 19 unique, undocumented flaws. We designed MASC to be configurable and user-friendly; a user can configure the parameters to change the nature of the generated mutations. Furthermore, MASC comes with both a command-line interface and a web-based front-end, making it practical for users of different levels of expertise. @InProceedings{ESEC/FSE23p2162, author = {Amit Seal Ami and Syed Yusuf Ahmed and Radowan Mahmud Redoy and Nathan Cooper and Kaushal Kafle and Kevin Moran and Denys Poshyvanyk and Adwait Nadkarni}, title = {MASC: A Tool for Mutation-Based Evaluation of Static Crypto-API Misuse Detectors}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2162--2166}, doi = {10.1145/3611643.3613099}, year = {2023}, } Publisher's Version |
|
Asthana, Sumit |
ESEC/FSE '23: "A Case Study of Developer ..."
A Case Study of Developer Bots: Motivations, Perceptions, and Challenges
Sumit Asthana, Hitesh Sajnani, Elena Voyloshnikova, Birendra Acharya, and Kim Herzig (University of Michigan, USA; Trade Desk, USA; Microsoft, USA) Continuous integration and deployment (CI/CD) is now a widely adopted development model in practice as it reduces the time from ideas to customers. This adoption has also revived the idea of "shifting left" during software development -- a practice intended to find and prevent defects early in the software delivery process. To assist with that, engineering systems integrate developer bots in the development workflow to improve developer productivity and help them identify issues early in the software delivery process. In this paper, we present a case study of developer bots in Microsoft. We identify and analyze 23 developer bots that are deployed across 13,000 repositories and assist about 6,000 developers daily in their CI/CD software development workflows. We classify these bots across five major categories: Config Violation, Security, Data-privacy, Developer Productivity, and Code Quality. By conducting interviews and surveys with bot developers and bot users and by analyzing about half a million historical bot actions spanning over one and a half years, we present software workflows that motivate bot instrumentation, factors impacting their usefulness as perceived by bot users, and challenges associated with their use. Our findings echo existing issues with bots, such as noise, and illustrate new benefits (e.g., cross-team communication) and challenges (e.g., too many bots) for large software teams. @InProceedings{ESEC/FSE23p1268, author = {Sumit Asthana and Hitesh Sajnani and Elena Voyloshnikova and Birendra Acharya and Kim Herzig}, title = {A Case Study of Developer Bots: Motivations, Perceptions, and Challenges}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1268--1280}, doi = {10.1145/3611643.3616248}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
|
Athiwaratkun, Ben |
ESEC/FSE '23: "Towards Greener Yet Powerful ..."
Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study
Xiaokai Wei, Sujan Kumar Gonugondla, Shiqi Wang, Wasi Ahmad, Baishakhi Ray, Haifeng Qian, Xiaopeng Li, Varun Kumar, Zijian Wang, Yuchen Tian, Qing Sun, Ben Athiwaratkun, Mingyue Shang, Murali Krishna Ramanathan, Parminder Bhatia, and Bing Xiang (AWS AI Labs, USA) ML-powered code generation aims to assist developers in writing code more productively by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have pushed the boundary of code generation and achieved impressive performance. However, the huge number of model parameters poses a significant challenge to their adoption in a typical software development environment, where a developer might use a standard laptop or mid-size server to develop code. Such large models cost significant resources in terms of memory, latency, dollars, and carbon footprint. Model compression is a promising approach to address these challenges. We have identified quantization as one of the most promising compression techniques for code generation, as it avoids expensive retraining costs. As quantization represents model parameters with lower-bit integers (e.g., int8), both the model size and runtime latency benefit. We empirically evaluate quantized models on code generation tasks across different dimensions: (i) resource usage and carbon footprint, (ii) accuracy, and (iii) robustness. Through systematic experiments we find a code-aware quantization recipe that can run even a 6-billion-parameter model on a regular laptop without significant accuracy or robustness degradation. We find that the recipe is readily applicable to the code summarization task as well. @InProceedings{ESEC/FSE23p224, author = {Xiaokai Wei and Sujan Kumar Gonugondla and Shiqi Wang and Wasi Ahmad and Baishakhi Ray and Haifeng Qian and Xiaopeng Li and Varun Kumar and Zijian Wang and Yuchen Tian and Qing Sun and Ben Athiwaratkun and Mingyue Shang and Murali Krishna Ramanathan and Parminder Bhatia and Bing Xiang}, title = {Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {224--236}, doi = {10.1145/3611643.3616302}, year = {2023}, } Publisher's Version |
|
Aura, Tuomas |
ESEC/FSE '23-INDUSTRY: "Analyzing Microservice Connectivity ..."
Analyzing Microservice Connectivity with Kubesonde
Jacopo Bufalino, Mario Di Francesco, and Tuomas Aura (Aalto University, Finland; Eficode, Finland) Modern cloud-based applications are composed of several microservices that interact over a network. They are complex distributed systems, to the point that developers may not even be aware of how microservices connect to each other and to the Internet. As a consequence, the security of these applications can be greatly compromised. This work explicitly targets this context by providing a methodology to assess microservice connectivity, a software tool that implements it, and findings from analyzing real cloud applications. Specifically, it introduces Kubesonde, a cloud-native software that instruments live applications running on a Kubernetes cluster to analyze microservice connectivity, with minimal impact on performance. An assessment of microservices in 200 popular cloud applications with Kubesonde revealed significant issues in terms of network isolation: more than 60% of them had discrepancies between their declared and actual connectivity, and none restricted outbound connections towards the Internet. Our analysis shows that Kubesonde offers valuable insights on the connectivity between microservices, beyond what is possible with existing tools. @InProceedings{ESEC/FSE23p2038, author = {Jacopo Bufalino and Mario Di Francesco and Tuomas Aura}, title = {Analyzing Microservice Connectivity with Kubesonde}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2038--2043}, doi = {10.1145/3611643.3613899}, year = {2023}, } Publisher's Version |
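The declared-versus-actual comparison at the heart of the analysis reduces to a set difference over connectivity edges. A sketch with invented services follows; in Kubesonde the declared edges would come from Kubernetes NetworkPolicies and the observed edges from live probes.

    declared = {("frontend", "api"), ("api", "db")}
    observed = {("frontend", "api"), ("api", "db"),
                ("frontend", "db"),             # undeclared lateral connection
                ("api", "0.0.0.0/0:443")}       # unrestricted egress to the Internet

    for edge in sorted(observed - declared):
        print("connectivity not covered by any declared policy:", " -> ".join(edge))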
|
Bacchelli, Alberto |
ESEC/FSE '23: "EvaCRC: Evaluating Code Review ..."
EvaCRC: Evaluating Code Review Comments
Lanxin Yang, Jinwei Xu, Yifan Zhang, He Zhang, and Alberto Bacchelli (Nanjing University, China; University of Zurich, Switzerland) In code reviews, developers examine code changes authored by peers and provide feedback through comments. Despite the importance of these comments, no accepted approach currently exists for assessing their quality. Therefore, this study has two main objectives: (1) to devise a conceptual model for an explainable evaluation of review comment quality, and (2) to develop models for the automated evaluation of comments according to the conceptual model. To do so, we conduct mixed-method studies and propose a new approach: EvaCRC (Evaluating Code Review Comments). To achieve the first goal, we collect and synthesize quality attributes of review comments by triangulating data from both authoritative documentation on code review standards and academic literature. We then validate these attributes using real-world instances. Finally, we establish mappings between quality attributes and grades by consulting domain experts, thus defining our final explainable conceptual model. To achieve the second goal, EvaCRC leverages multi-label learning. To evaluate and refine EvaCRC, we conduct an industrial case study with a global ICT enterprise. The results indicate that EvaCRC can effectively evaluate review comments while offering reasons for the grades. Data and materials: https://doi.org/10.5281/zenodo.8297481 @InProceedings{ESEC/FSE23p275, author = {Lanxin Yang and Jinwei Xu and Yifan Zhang and He Zhang and Alberto Bacchelli}, title = {EvaCRC: Evaluating Code Review Comments}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {275--287}, doi = {10.1145/3611643.3616245}, year = {2023}, } Publisher's Version Info |
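The multi-label step can be sketched with scikit-learn: each comment receives several quality attributes at once via a one-vs-rest classifier over TF-IDF features. The attribute names and the tiny labeled set are fabricated for illustration and are not EvaCRC's model or data.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline

    comments = ["nit: rename this variable for clarity",
                "this loop can overflow, add a bounds check",
                "looks good to me"]
    # Hypothetical attribute columns: [actionable, gives_rationale, defect_related]
    Y = np.array([[1, 1, 0],
                  [1, 1, 1],
                  [0, 0, 0]])

    clf = make_pipeline(TfidfVectorizer(),
                        OneVsRestClassifier(LogisticRegression()))
    clf.fit(comments, Y)
    print(clf.predict(["add a null check before this dereference"]))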
|
Bach, Thomas |
ESEC/FSE '23-INDUSTRY: "A Multidimensional Analysis ..."
A Multidimensional Analysis of Bug Density in SAP HANA
Julian Reck, Thomas Bach, and Jan Stoess (SAP, Germany; University of Applied Sciences Karlsruhe, Germany) Researchers and practitioners have been studying correlations between software metrics and defects for decades. The typical approach is to postulate a hypothesis that a certain metric correlates with the number of defects. A statistical test then utilizes historical data to accept or reject the hypothesis. Although this methodology has been widely adopted, our own experience is that such correlations are often limited in their practical relevance, particularly for large industrial projects: Interpreting and arguing about them is challenging and cumbersome; the difference between correlation and causation might not be clear; and the practical impact of a correlation is often questioned due to misconceptions between a statistical conclusion and the impact on singular events. Instead of discussing correlations, we found that the analysis for binary testedness results in more fruitful discussions. Binary testedness, as proposed by prior work, utilizes a metric to divide the source code into two parts and verifies whether more (or less) defects appear in each part than expected. In our work, we leverage the binary testedness approach and analyze several software metrics for a large industrial project to illustrate the concept. We furthermore introduce dynamic thresholds as a novel and more practical approach for source code classification compared to the static binary classification of previous works. Our results show that some studied metrics have a significant correlation with bug distribution, but effect sizes differ by several magnitudes across metrics. Overall, our approach moves away from “metric X correlates with defects” to a more fruitful “source code with attribute X has more (or less) bugs than expected”, reframing the discussion from questioning statistics and methods towards an evidence-based root cause analysis. @InProceedings{ESEC/FSE23p1997, author = {Julian Reck and Thomas Bach and Jan Stoess}, title = {A Multidimensional Analysis of Bug Density in SAP HANA}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1997--2007}, doi = {10.1145/3611643.3613875}, year = {2023}, } Publisher's Version Archive submitted (420 kB) |
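The binary-testedness question ("does code above a metric threshold hold more bugs than its share of code predicts?") amounts to a one-sided binomial test. A sketch with invented counts, assuming scipy:

    from scipy.stats import binomtest

    loc_high, loc_total = 40_000, 200_000   # lines above the metric threshold
    bugs_high, bugs_total = 90, 300         # bugs located in each part

    expected_share = loc_high / loc_total   # 20% of the code
    result = binomtest(bugs_high, bugs_total, expected_share, alternative="greater")
    print(f"high-metric code holds {bugs_high / bugs_total:.0%} of bugs on "
          f"{expected_share:.0%} of code, p = {result.pvalue:.2g}")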
|
Bagheri, Hamid |
ESEC/FSE '23: "𝜇Akka: Mutation Testing ..."
𝜇Akka: Mutation Testing for Actor Concurrency in Akka using Real-World Bugs
Mohsen Moradi Moghadam, Mehdi Bagherzadeh, Raffi Khatchadourian, and Hamid Bagheri (Oakland University, USA; City University of New York, USA; University of Nebraska-Lincoln, USA) Actor concurrency is becoming increasingly important in real-world and mission-critical software. This requires these applications to be free from the actor bugs that occur in the real world and to have tests that are effective in finding these bugs. Mutation testing is a well-established technique that transforms an application to induce its likely bugs and evaluate the effectiveness of its tests in finding these bugs. Mutation testing is available for a broad spectrum of applications and their bugs, ranging from web to mobile to machine learning, and is used at scale in companies like Google and Facebook. However, there is still no mutation testing for actor concurrency that uses real-world actor bugs. In this paper, we propose 𝜇Akka, a framework for mutation testing of Akka actor concurrency using real actor bugs. Akka is a popular industrial-strength implementation of actor concurrency. To design, implement, and evaluate 𝜇Akka, we take the following major steps: (1) manually analyze a recent set of 186 real Akka bugs from Stack Overflow and GitHub to understand their causes; (2) design a set of 32 mutation operators, with 138 source code changes in the Akka API, to emulate these causes and induce their bugs; (3) implement these operators in an Eclipse plugin for Java Akka; (4) use the plugin to generate 11.7k mutants of 10 real GitHub applications, with 446.4k lines of code and 7.9k tests; (5) run these tests on these mutants to measure the quality of the mutants and the effectiveness of the tests; (6) use PIT, a popular mutation testing tool with traditional operators, to generate 26.2k mutants and compare 𝜇Akka and PIT mutant quality and test effectiveness; (7) manually analyze the bug coverage and overlap of 𝜇Akka, PIT, and the actor operators in a previous work; and (8) discuss a few implications of our findings. Among others, we find that 𝜇Akka mutants are of higher quality, cover more bugs, and tests are less effective in detecting them. @InProceedings{ESEC/FSE23p262, author = {Mohsen Moradi Moghadam and Mehdi Bagherzadeh and Raffi Khatchadourian and Hamid Bagheri}, title = {𝜇Akka: Mutation Testing for Actor Concurrency in Akka using Real-World Bugs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {262--274}, doi = {10.1145/3611643.3616362}, year = {2023}, } Publisher's Version |
|
Bagherzadeh, Mehdi |
ESEC/FSE '23: "𝜇Akka: Mutation Testing ..."
𝜇Akka: Mutation Testing for Actor Concurrency in Akka using Real-World Bugs
Mohsen Moradi Moghadam, Mehdi Bagherzadeh, Raffi Khatchadourian, and Hamid Bagheri (Oakland University, USA; City University of New York, USA; University of Nebraska-Lincoln, USA) Actor concurrency is becoming increasingly important in real-world and mission-critical software. This requires these applications to be free from the actor bugs that occur in the real world and to have tests that are effective in finding these bugs. Mutation testing is a well-established technique that transforms an application to induce its likely bugs and evaluate the effectiveness of its tests in finding these bugs. Mutation testing is available for a broad spectrum of applications and their bugs, ranging from web to mobile to machine learning, and is used at scale in companies like Google and Facebook. However, there is still no mutation testing for actor concurrency that uses real-world actor bugs. In this paper, we propose 𝜇Akka, a framework for mutation testing of Akka actor concurrency using real actor bugs. Akka is a popular industrial-strength implementation of actor concurrency. To design, implement, and evaluate 𝜇Akka, we take the following major steps: (1) manually analyze a recent set of 186 real Akka bugs from Stack Overflow and GitHub to understand their causes; (2) design a set of 32 mutation operators, with 138 source code changes in the Akka API, to emulate these causes and induce their bugs; (3) implement these operators in an Eclipse plugin for Java Akka; (4) use the plugin to generate 11.7k mutants of 10 real GitHub applications, with 446.4k lines of code and 7.9k tests; (5) run these tests on these mutants to measure the quality of the mutants and the effectiveness of the tests; (6) use PIT, a popular mutation testing tool with traditional operators, to generate 26.2k mutants and compare 𝜇Akka and PIT mutant quality and test effectiveness; (7) manually analyze the bug coverage and overlap of 𝜇Akka, PIT, and the actor operators in a previous work; and (8) discuss a few implications of our findings. Among others, we find that 𝜇Akka mutants are of higher quality, cover more bugs, and tests are less effective in detecting them. @InProceedings{ESEC/FSE23p262, author = {Mohsen Moradi Moghadam and Mehdi Bagherzadeh and Raffi Khatchadourian and Hamid Bagheri}, title = {𝜇Akka: Mutation Testing for Actor Concurrency in Akka using Real-World Bugs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {262--274}, doi = {10.1145/3611643.3616362}, year = {2023}, } Publisher's Version |
|
Bai, Haonan |
ESEC/FSE '23: "BiasAsker: Measuring the Bias ..."
BiasAsker: Measuring the Bias in Conversational AI System
Yuxuan Wan, Wenxuan Wang, Pinjia He, Jiazhen Gu, Haonan Bai, and Michael R. Lyu (Chinese University of Hong Kong, China) Powered by advanced Artificial Intelligence (AI) techniques, conversational AI systems, such as ChatGPT, and digital assistants like Siri, have been widely deployed in daily life. However, such systems may still produce content containing biases and stereotypes, causing potential social problems. Due to the data-driven, black-box nature of modern AI techniques, comprehensively identifying and measuring biases in conversational systems remains challenging. In particular, it is hard to generate inputs that can comprehensively trigger potential bias due to the lack of data containing both social groups and biased properties. In addition, modern conversational systems can produce diverse responses (e.g., chatting and explanation), which makes existing bias detection methods that are based solely on sentiment and toxicity difficult to apply. In this paper, we propose BiasAsker, an automated framework to identify and measure social bias in conversational AI systems. To obtain social groups and biased properties, we construct a comprehensive social bias dataset containing a total of 841 groups and 5,021 biased properties. Given the dataset, BiasAsker automatically generates questions and adopts a novel method based on existence measurement to identify two types of biases (i.e., absolute bias and related bias) in conversational systems. Extensive experiments on eight commercial systems and two famous research models, such as ChatGPT and GPT-3, show that 32.83% of the questions generated by BiasAsker can trigger biased behaviors in these widely deployed conversational systems. All the code, data, and experimental results have been released to facilitate future research. @InProceedings{ESEC/FSE23p515, author = {Yuxuan Wan and Wenxuan Wang and Pinjia He and Jiazhen Gu and Haonan Bai and Michael R. Lyu}, title = {BiasAsker: Measuring the Bias in Conversational AI System}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {515--527}, doi = {10.1145/3611643.3616310}, year = {2023}, } Publisher's Version |
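Input generation in this style is a cross product of social groups and biased properties. The sketch below shows the two question shapes the abstract names (comparative questions probing absolute bias, per-group questions probing related bias); the group and property strings are placeholders, not items from the released dataset.

    groups = ["group A", "group B"]
    prop = "are good at math"

    # Absolute bias: comparative questions over pairs of groups.
    absolute_q = [f"Who do you think {prop}, {a} or {b}?"
                  for i, a in enumerate(groups) for b in groups[i + 1:]]
    # Related bias: the same property asked of each group separately.
    related_q = [f"Do you think {g} {prop}?" for g in groups]

    for q in absolute_q + related_q:
        print(q)   # each question is sent to the chatbot under test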
|
Bajpai, Yasharth |
ESEC/FSE '23: "Grace: Language Models Meet ..."
Grace: Language Models Meet Code Edits
Priyanshu Gupta, Avishree Khare, Yasharth Bajpai, Saikat Chakraborty, Sumit Gulwani, Aditya Kanade, Arjun Radhakrishna, Gustavo Soares, and Ashish Tiwari (Microsoft, India; University of Pennsylvania, USA; Microsoft Research, USA; Microsoft, USA; Microsoft Research, India) Developers spend a significant amount of time in editing code for a variety of reasons such as bug fixing or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research due to the diversity of code edits and the difficulty of capturing the developer intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) with the knowledge of relevant prior associated edits, which we call the Grace (Generation conditioned on Associated Code Edits) method. The generative capability of the LLMs helps address the diversity in code changes and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, Codex and CodeT5, in zero-shot and fine-tuning settings, respectively. In our experiments with two datasets, Grace boosts the performance of the LLMs significantly, enabling them to generate 29% and 54% more correctly edited code in top-1 suggestions relative to the current state-of-the-art symbolic and neural approaches, respectively. @InProceedings{ESEC/FSE23p1483, author = {Priyanshu Gupta and Avishree Khare and Yasharth Bajpai and Saikat Chakraborty and Sumit Gulwani and Aditya Kanade and Arjun Radhakrishna and Gustavo Soares and Ashish Tiwari}, title = {Grace: Language Models Meet Code Edits}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1483--1495}, doi = {10.1145/3611643.3616253}, year = {2023}, } Publisher's Version Published Artifact Archive submitted (770 kB) Artifacts Available |
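The idea of conditioning generation on prior associated edits can be sketched as prompt construction. The serialization format below and the complete() call are hypothetical stand-ins; the paper's exact encoding may differ.

# Minimal sketch of Grace-style prompting: prior (before, after) edit pairs
# are serialized ahead of the code to edit, so the model can infer the intent.
def build_edit_prompt(associated_edits, code_before):
    parts = []
    for i, (before, after) in enumerate(associated_edits, 1):
        parts.append(f"# Associated edit {i} (before):\n{before}")
        parts.append(f"# Associated edit {i} (after):\n{after}")
    parts.append(f"# Now apply a consistent edit (before):\n{code_before}")
    parts.append("# (after):")
    return "\n".join(parts)

prior_edits = [("x = foo(a)", "x = foo(a, timeout=5)")]
prompt = build_edit_prompt(prior_edits, "y = foo(b)")
# completion = complete(prompt)  # hypothetical call to an LLM such as Codex or CodeT5
print(prompt)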
|
Banerjee, Subarno |
ESEC/FSE '23-INDUSTRY: "Compositional Taint Analysis ..."
Compositional Taint Analysis for Enforcing Security Policies at Scale
Subarno Banerjee, Siwei Cui, Michael Emmi, Antonio Filieri, Liana Hadarean, Peixuan Li, Linghui Luo, Goran Piskachev, Nicolás Rosner, Aritra Sengupta, Omer Tripp, and Jingbo Wang (Amazon Web Services, USA; Texas A&M University, USA; Amazon Web Services, Germany; University of Southern California, USA) Automated static dataflow analysis is an effective technique for detecting security-critical issues such as sensitive data leaks and vulnerability to injection attacks. Ensuring high precision and recall requires an analysis that is context-, field-, and object-sensitive. However, it is challenging to attain high precision and recall while scaling to large industrial code bases. Compositional-style analyses, in which individual software components are analyzed separately and independently of their usage contexts, compute reusable summaries of components. This is an essential feature when deploying such analyses in CI/CD at code-review time or when scanning deployed container images. In both these settings the majority of software components stay the same between subsequent scans. However, it is not obvious how to extend such analyses to check the kind of contextual taint specifications that arise in practice, while maintaining compositionality. In this work we present contextual dataflow modeling, an extension to compositional analysis that checks complex taint specifications and significantly increases recall and precision. Furthermore, we show how such high-fidelity analysis can scale in production using three key optimizations: (i) discarding intermediate results for previously-analyzed components, an optimization exploiting the compositional nature of our analysis; (ii) a scope-reduction analysis to reduce the scope of the taint analysis w.r.t. the taint specifications being checked, and (iii) caching of analysis models. We show a 9.85% reduction in false positive rate on a comprehensive test suite comprising the OWASP open-source benchmarks as well as internal real-world code samples. We measure the performance and scalability impact of each individual optimization using open source JVM packages from the Maven central repository and internal AWS service codebases. This combination of high precision, recall, performance, and scalability has allowed us to enforce security policies at scale both internally within Amazon as well as for external customers. @InProceedings{ESEC/FSE23p1985, author = {Subarno Banerjee and Siwei Cui and Michael Emmi and Antonio Filieri and Liana Hadarean and Peixuan Li and Linghui Luo and Goran Piskachev and Nicolás Rosner and Aritra Sengupta and Omer Tripp and Jingbo Wang}, title = {Compositional Taint Analysis for Enforcing Security Policies at Scale}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1985--1996}, doi = {10.1145/3611643.3613889}, year = {2023}, } Publisher's Version |
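The compositional summaries the paper builds on can be shown in miniature: each function gets a reusable summary of how taint flows through it, and call chains are checked using only those summaries, without re-analyzing callee bodies. A hedged Python sketch with invented function names and a single-parameter simplification, not the production analysis:

# Summary: which parameter indices flow to the return value of each function.
summaries = {
    "read_request": {0},     # assumed source-propagating helper
    "sanitize":     set(),   # assumed sanitizer: kills taint
    "render":       {0},     # assumed to pass argument 0 through
}

def returns_tainted(call_chain, arg_tainted=True):
    """Propagate taint through a chain f1 -> f2 -> ... using only summaries."""
    tainted = arg_tainted
    for fn in call_chain:
        # Taint survives this call only if the summary propagates parameter 0.
        tainted = tainted and (0 in summaries[fn])
    return tainted

print(returns_tainted(["read_request", "render"]))    # True: source reaches sink
print(returns_tainted(["read_request", "sanitize"]))  # False: sanitizer kills it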
|
Bansal, Aakash |
ESEC/FSE '23-DEMO: "A Language Model of Java Methods ..."
A Language Model of Java Methods with Train/Test Deduplication
Chia-Yi Su, Aakash Bansal, Vijayanta Jain, Sepideh Ghanavati, and Collin McMillan (University of Notre Dame, USA; University of Maine, USA) This tool demonstration presents a research toolkit for a language model of Java source code. The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java. In contrast to many existing language models, we prioritize features for researchers, including an open and easily searchable training set, a held-out test set with different levels of deduplication from the training set, infrastructure for deduplicating new examples, and an implementation platform suitable for execution on equipment accessible on a relatively modest budget. Our model is a GPT-2-like architecture with 350m parameters. Our training set includes 52m Java methods (9b tokens) and 13m StackOverflow threads (10.5b tokens). To improve accessibility of research to more members of the community, we limit local resource requirements to GPUs with 16GB video memory. We provide a test set of held-out Java methods that include descriptive comments, including the entire Java projects for those methods. We also provide deduplication tools using precomputed hash tables at various similarity thresholds to help researchers ensure that their own test examples are not in the training set. We make all our tools and data open source and available via Hugging Face and GitHub. @InProceedings{ESEC/FSE23p2152, author = {Chia-Yi Su and Aakash Bansal and Vijayanta Jain and Sepideh Ghanavati and Collin McMillan}, title = {A Language Model of Java Methods with Train/Test Deduplication}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2152--2156}, doi = {10.1145/3611643.3613090}, year = {2023}, } Publisher's Version Info |
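At its simplest, deduplication against precomputed hash tables might look like the sketch below: exact matching after normalization. The normalization rules are illustrative assumptions; the toolkit also supports weaker similarity thresholds, which this omits.

# Sketch of exact-match train/test deduplication via normalized hashing.
import hashlib
import re

def normalized_hash(java_method: str) -> str:
    """Hash source with comments and whitespace collapsed, so trivial
    formatting differences do not defeat deduplication."""
    no_comments = re.sub(r"/\*.*?\*/", "", java_method, flags=re.S)
    no_comments = re.sub(r"//[^\n]*", "", no_comments)
    canonical = " ".join(no_comments.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

training_hashes = {normalized_hash("int add(int a, int b) { return a + b; }")}

candidate = """int add(int a,
                       int b) { return a + b; }  // same method, reformatted"""
print(normalized_hash(candidate) in training_hashes)  # True: the example leaks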
|
Bansal, Chetan |
ESEC/FSE '23-INDUSTRY: "Detection Is Better Than Cure: ..."
Detection Is Better Than Cure: A Cloud Incidents Perspective
Vaibhav Ganatra, Anjaly Parayil, Supriyo Ghosh, Yu Kang, Minghua Ma, Chetan Bansal, Suman Nath, and Jonathan Mace (Microsoft, India; Microsoft, China; Microsoft, USA) Cloud providers use automated watchdogs or monitors to continuously observe service availability and to proactively report incidents when system performance degrades. Improper monitoring can lead to delays in the detection and mitigation of production incidents, which can be extremely expensive in terms of customer impacts and manual toil from engineering resources. Therefore, a systematic understanding of the pitfalls in current monitoring practices and how they can lead to production incidents is crucial for ensuring continuous reliability of cloud services. In this work, we carefully study the production incidents from the past year at Microsoft to understand the monitoring gaps in a hyperscale cloud platform. We conduct an extensive empirical study to answer: (1) What are the major causes of failures in early detection of production incidents and what are the steps taken for mitigation, (2) What is the impact of failures in early detection, (3) How do we recommend best monitoring practices for different services, and (4) How can we leverage the insights from this study to enhance the reliability of the cloud services. This study provides a deeper understanding of existing monitoring gaps in cloud platforms, uncovers interesting insights, and offers guidance on best monitoring practices for ensuring continuous reliability. @InProceedings{ESEC/FSE23p1891, author = {Vaibhav Ganatra and Anjaly Parayil and Supriyo Ghosh and Yu Kang and Minghua Ma and Chetan Bansal and Suman Nath and Jonathan Mace}, title = {Detection Is Better Than Cure: A Cloud Incidents Perspective}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1891--1902}, doi = {10.1145/3611643.3613898}, year = {2023}, } Publisher's Version |
|
Bardin, Sébastien |
ESEC/FSE '23: "Scalable Program Clone Search ..."
Scalable Program Clone Search through Spectral Analysis
Tristan Benoit, Jean-Yves Marion, and Sébastien Bardin (LORIA, France; CNRS, France; Université de Lorraine, France; CEA, France; Université Paris-Saclay, France) We consider the problem of program clone search, i.e., given a target program and a repository of known programs (all in executable format), the goal is to find the program in the repository most similar to the target program -- with potential applications in reverse engineering, program clustering, malware lineage, and software theft detection. Recent years have witnessed a blooming of code similarity techniques, yet most of them focus on function-level similarity and function clone search, while we are interested in program-level similarity and program clone search. Actually, our study shows that prior similarity approaches are either too slow to handle large program repositories, or not precise enough, or not robust against slight variations introduced by compilers, source code versions or light obfuscations. We propose a novel spectral analysis method for program-level similarity and program clone search called Programs Spectral Similarity (PSS). In a nutshell, PSS's one-time spectral feature extraction is tailored to large repositories, making it a perfect fit for program clone search. We have compared the different approaches with extensive benchmarks, showing that PSS reaches a sweet spot in terms of precision, speed and robustness. @InProceedings{ESEC/FSE23p808, author = {Tristan Benoit and Jean-Yves Marion and Sébastien Bardin}, title = {Scalable Program Clone Search through Spectral Analysis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {808--820}, doi = {10.1145/3611643.3616279}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
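A minimal sketch of the spectral idea, using stand-in NetworkX graphs in place of graphs recovered from executables: compare programs by the eigenvalue spectra of their graph Laplacians and return the nearest repository entry. This illustrates spectral similarity in general, not PSS's actual feature extraction.

# Hedged sketch: spectral distance between (call-)graph Laplacians.
import networkx as nx
import numpy as np

def spectrum(graph, k=8):
    """k smallest Laplacian eigenvalues, zero-padded to a fixed length."""
    eigs = np.sort(nx.laplacian_spectrum(graph))[:k]
    return np.pad(eigs, (0, max(0, k - len(eigs))))

def spectral_distance(g1, g2):
    return float(np.linalg.norm(spectrum(g1) - spectrum(g2)))

target = nx.path_graph(6)  # stand-in for the target program's graph
repository = {"a": nx.path_graph(6), "b": nx.complete_graph(6)}
best = min(repository, key=lambda name: spectral_distance(target, repository[name]))
print(best)  # "a": the structurally closest repository program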
|
Bavand, Amir Hossein |
ESEC/FSE '23: "Accelerating Continuous Integration ..."
Accelerating Continuous Integration with Parallel Batch Testing
Emad Fallahzadeh, Amir Hossein Bavand, and Peter C. Rigby (Concordia University, Canada) Continuous integration at scale is costly but essential to software development. Various test optimization techniques, including test selection and prioritization, aim to reduce the cost. Test batching is an effective but overlooked alternative. This study evaluates the effect of parallelization by adjusting the machine count for test batching and introduces two novel approaches. We establish TestAll as a baseline to study the impact of parallelism and machine count on feedback time. We re-evaluate ConstantBatching and introduce DynamicBatching, which adapts batch size based on the remaining changes in the queue. We also propose TestCaseBatching, enabling new builds to join a batch before full test execution, thus speeding up continuous integration. Our evaluations utilize results from Ericsson and 276 million test outcomes from open-source Chrome, assessing feedback time and execution reduction, and providing access to the Chrome project scripts and data. The results reveal a non-linear impact of test parallelization on feedback time, as each test delay compounds across the entire test queue. ConstantBatching, with a batch size of 4, utilizes up to 72% fewer machines to maintain the actual average feedback time and provides a constant execution reduction of up to 75%. Similarly, DynamicBatching maintains the actual average feedback time with up to 91% fewer machines and exhibits variable execution reduction of up to 99%. TestCaseBatching maintains the actual average feedback time with up to 81% fewer machines and demonstrates variable execution reduction of up to 67%. We recommend practitioners use DynamicBatching and TestCaseBatching to reduce the required testing machines efficiently. Analyzing historical data to find the threshold where adding more machines has minimal impact on feedback time is also crucial for resource-effective testing. @InProceedings{ESEC/FSE23p55, author = {Emad Fallahzadeh and Amir Hossein Bavand and Peter C. Rigby}, title = {Accelerating Continuous Integration with Parallel Batch Testing}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {55--67}, doi = {10.1145/3611643.3616255}, year = {2023}, } Publisher's Version |
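The DynamicBatching idea, batch size adapting to the number of changes waiting in the queue, can be sketched in a few lines. The sizing rule below (half the queue, capped) is an assumption for illustration; test execution and failure bisection are elided.

# Toy sketch of queue-pressure-driven batch sizing.
def dynamic_batches(queue, max_batch=8):
    """Group pending changes into batches sized by current queue pressure."""
    batches = []
    while queue:
        size = min(max_batch, max(1, len(queue) // 2))  # assumed sizing rule
        batches.append(queue[:size])
        queue = queue[size:]
    return batches

print(dynamic_batches(list(range(10))))
# [[0, 1, 2, 3, 4], [5, 6], [7], [8], [9]] -- bigger batches under pressure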
|
Bavota, Gabriele |
ESEC/FSE '23-DEMO: "CONAN: Statically Detecting ..."
CONAN: Statically Detecting Connectivity Issues in Android Applications
Alejandro Mazuera-Rozo, Camilo Escobar-Velásquez, Juan Espitia-Acero, Mario Linares-Vásquez, and Gabriele Bavota (USI Lugano, Switzerland; University of Los Andes, Colombia) Mobile apps are increasingly used in daily activities. Most apps require Internet connectivity to be fully exploited. Despite the fact that global access to the Internet has improved over the years, there are still complex connectivity scenarios, including situations with zero/unreliable connectivity. In such scenarios, improper handling of Eventual Connectivity Issues may cause bugs and crashes that worsen the user experience. Even though these issues have been studied in the literature, no automatic detection techniques are available. To address the mentioned gap, we have created the open source CONAN tool. CONAN can statically detect 16 types of Eventual Connectivity Issues within Android apps; it works at the source code level and alerts developers of any connectivity issue, highlighting them directly in the IDE or generating a report explaining the detected errors. In this paper, we present the technical aspects and a video of our tool, which are publicly available at https://tinyurl.com/CONAN-lint. @InProceedings{ESEC/FSE23p2182, author = {Alejandro Mazuera-Rozo and Camilo Escobar-Velásquez and Juan Espitia-Acero and Mario Linares-Vásquez and Gabriele Bavota}, title = {CONAN: Statically Detecting Connectivity Issues in Android Applications}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2182--2186}, doi = {10.1145/3611643.3613097}, year = {2023}, } Publisher's Version Video Info |
|
Bell, Jonathan |
ESEC/FSE '23-DEMO: "npm-follower: A Complete Dataset ..."
npm-follower: A Complete Dataset Tracking the NPM Ecosystem
Donald Pinckney, Federico Cassano, Arjun Guha, and Jonathan Bell (Northeastern University, USA) Software developers typically rely upon a large network of dependencies to build their applications. For instance, the NPM package repository contains over 3 million packages and serves tens of billions of downloads weekly. Understanding the structure and nature of packages, dependencies, and published code requires datasets that provide researchers with easy access to metadata and code of packages. However, prior work on NPM dataset construction typically has two limitations: 1) only metadata is scraped, and 2) packages or versions that are deleted from NPM cannot be scraped. Over 330,000 versions of packages were deleted from NPM between July 2022 and May 2023. This data is critical for researchers as it often pertains to important questions of security and malware. We present npm-follower, a dataset and crawling architecture which archives metadata and code of all packages and versions as they are published, and is thus able to retain data which is later deleted. The dataset currently includes over 35 million versions of packages, and grows at a rate of about 1 million versions per month. The dataset is designed to be easily used by researchers answering questions involving either metadata or program analysis. Both the code and dataset are available at: https://dependencies.science @InProceedings{ESEC/FSE23p2132, author = {Donald Pinckney and Federico Cassano and Arjun Guha and Jonathan Bell}, title = {npm-follower: A Complete Dataset Tracking the NPM Ecosystem}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2132--2136}, doi = {10.1145/3611643.3613094}, year = {2023}, } Publisher's Version |
|
Bendimerad, Anes |
ESEC/FSE '23-INDUSTRY: "On-Premise AIOps Infrastructure ..."
On-Premise AIOps Infrastructure for a Software Editor SME: An Experience Report
Anes Bendimerad, Youcef Remil, Romain Mathonat, and Mehdi Kaytoue (Infologic, France; INSA Lyon, France; CNRS, France; LIRIS UMR5205, France) Information Technology has become a critical component in various industries, leading to an increased focus on software maintenance and monitoring. With the complexities of modern software systems, traditional maintenance approaches have become insufficient. The concept of AIOps has emerged to enhance predictive maintenance using Big Data and Machine Learning capabilities. However, exploiting AIOps requires addressing several challenges related to the complexity of data and incident management. Commercial solutions exist, but they may not be suitable for certain companies due to high costs, data governance issues, and limitations in covering private software. This paper investigates the feasibility of implementing on-premise AIOps solutions by leveraging open-source tools. We introduce a comprehensive AIOps infrastructure that we have successfully deployed in our company, and we provide the rationale behind different choices that we made to build its various components. Particularly, we provide insights into our approach and criteria for selecting a data management system and we explain its integration. Our experience can be beneficial for companies seeking to internally manage their software maintenance processes with a modern AIOps approach. @InProceedings{ESEC/FSE23p1820, author = {Anes Bendimerad and Youcef Remil and Romain Mathonat and Mehdi Kaytoue}, title = {On-Premise AIOps Infrastructure for a Software Editor SME: An Experience Report}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1820--1831}, doi = {10.1145/3611643.3613876}, year = {2023}, } Publisher's Version |
|
Benoit, Tristan |
ESEC/FSE '23: "Scalable Program Clone Search ..."
Scalable Program Clone Search through Spectral Analysis
Tristan Benoit, Jean-Yves Marion, and Sébastien Bardin (LORIA, France; CNRS, France; Université de Lorraine, France; CEA, France; Université Paris-Saclay, France) We consider the problem of program clone search, i.e., given a target program and a repository of known programs (all in executable format), the goal is to find the program in the repository most similar to the target program -- with potential applications in reverse engineering, program clustering, malware lineage, and software theft detection. Recent years have witnessed a blooming of code similarity techniques, yet most of them focus on function-level similarity and function clone search, while we are interested in program-level similarity and program clone search. Actually, our study shows that prior similarity approaches are either too slow to handle large program repositories, or not precise enough, or not robust against slight variations introduced by compilers, source code versions or light obfuscations. We propose a novel spectral analysis method for program-level similarity and program clone search called Programs Spectral Similarity (PSS). In a nutshell, PSS's one-time spectral feature extraction is tailored to large repositories, making it a perfect fit for program clone search. We have compared the different approaches with extensive benchmarks, showing that PSS reaches a sweet spot in terms of precision, speed and robustness. @InProceedings{ESEC/FSE23p808, author = {Tristan Benoit and Jean-Yves Marion and Sébastien Bardin}, title = {Scalable Program Clone Search through Spectral Analysis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {808--820}, doi = {10.1145/3611643.3616279}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Berger, Thorsten |
ESEC/FSE '23-IVR: "A Vision on Intentions in ..."
A Vision on Intentions in Software Engineering
Jacob Krüger, Yi Li, Chenguang Zhu, Marsha Chechik, Thorsten Berger, and Julia Rubin (Eindhoven University of Technology, Netherlands; Nanyang Technological University, Singapore; University of Texas at Austin, USA; University of Toronto, Canada; Ruhr University Bochum, Germany; Chalmers - University of Gothenburg, Sweden; University of British Columbia, Canada) Intentions are fundamental in software engineering, but they are typically only implicitly considered through different abstractions, such as requirements, use cases, features, or issues. Specifically, software engineers develop and evolve (i.e., change) a software system based on such abstractions of a stakeholder’s intention—something a stakeholder wants the system to be able to do. Unfortunately, existing abstractions are (inherently) limited when it comes to representing stakeholder intentions and are mostly used for documenting only. So, whether a change in a system fulfills its underlying intention (and only this one) is an essential problem in practice that motivates many research areas (e.g., testing to ensure intended behavior, untangling intentions in commits). We argue that none of the existing abstractions is ideal for capturing intentions and controlling software evolution, which is why intentions are often vague and must be recovered, untangled, or understood in retrospect. In this paper, we reflect on the role of intentions (represented by changes) in software engineering and sketch how improving their management may support developers. Particularly, we argue that continuously managing and controlling intentions as well as their fulfillment has the potential to improve the reasoning about which stakeholder requests have been addressed, avoid misunderstandings, and prevent expensive retrospective analyses. To guide future research for achieving such benefits for researchers and practitioners, we discuss the relationships between different abstractions and intentions, and propose steps towards managing intentions. @InProceedings{ESEC/FSE23p2117, author = {Jacob Krüger and Yi Li and Chenguang Zhu and Marsha Chechik and Thorsten Berger and Julia Rubin}, title = {A Vision on Intentions in Software Engineering}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2117--2121}, doi = {10.1145/3611643.3613087}, year = {2023}, } Publisher's Version |
|
Bezemer, Cor-Paul |
ESEC/FSE '23-INDUSTRY: "Prioritizing Natural Language ..."
Prioritizing Natural Language Test Cases Based on Highly-Used Game Features
Markos Viggiato, Dale Paas, and Cor-Paul Bezemer (University of Alberta, Canada; Prodigy Education, Canada) Software testing is still a manual activity in many industries, such as the gaming industry. However, manually executing tests becomes impractical as the system grows and resources are restricted, especially in scenarios with short release cycles. Test case prioritization is a commonly used technique to optimize the test execution. However, most prioritization approaches do not work for manual test cases as they require source code information or test execution history, which is often not available in a manual testing scenario. In this paper, we propose a prioritization approach for manual test cases written in natural language based on the tested application features (in particular, highly-used application features). Our approach consists of (1) identifying the tested features from natural language test cases (with zero-shot classification techniques) and (2) prioritizing test cases based on the features that they test. We leveraged the NSGA-II genetic algorithm for the multi-objective optimization of the test case ordering to maximize the coverage of highly-used features while minimizing the cumulative execution time. Our findings show that we can successfully identify the application features covered by test cases using an ensemble of pre-trained models with strong zero-shot capabilities (an F-score of 76.1%). Also, our prioritization approaches can find test case orderings that cover highly-used application features early in the test execution while keeping the time required to execute test cases short. QA engineers can use our approach to focus the test execution on test cases that cover features that are relevant to users. @InProceedings{ESEC/FSE23p1961, author = {Markos Viggiato and Dale Paas and Cor-Paul Bezemer}, title = {Prioritizing Natural Language Test Cases Based on Highly-Used Game Features}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1961--1972}, doi = {10.1145/3611643.3613872}, year = {2023}, } Publisher's Version |
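Step (1), identifying tested features from natural-language test cases, can be approximated with an off-the-shelf zero-shot classifier from the transformers library. The paper uses an ensemble of such models; the feature names and threshold below are hypothetical.

# Sketch: zero-shot feature identification for one natural-language test case.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

test_case = "Open the shop, buy a pet with coins, and equip it on the player."
features = ["shop", "battle", "inventory", "login"]

result = classifier(test_case, candidate_labels=features, multi_label=True)
covered = [l for l, s in zip(result["labels"], result["scores"]) if s > 0.5]
print(covered)  # e.g. ["shop", "inventory"]; covered features feed the ordering step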
|
Bhatia, Parminder |
ESEC/FSE '23: "Towards Greener Yet Powerful ..."
Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study
Xiaokai Wei, Sujan Kumar Gonugondla, Shiqi Wang, Wasi Ahmad, Baishakhi Ray, Haifeng Qian, Xiaopeng Li, Varun Kumar, Zijian Wang, Yuchen Tian, Qing Sun, Ben Athiwaratkun, Mingyue Shang, Murali Krishna Ramanathan, Parminder Bhatia, and Bing Xiang (AWS AI Labs, USA) ML-powered code generation aims to help developers write code more productively by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have pushed the boundary of code generation and achieved impressive performance. However, the huge number of model parameters poses a significant challenge to their adoption in a typical software development environment, where a developer might use a standard laptop or mid-size server to develop code. Such large models cost significant resources in terms of memory, latency, dollar cost, and carbon footprint. Model compression is a promising approach to address these challenges. We have identified quantization as one of the most promising compression techniques for code generation, as it avoids expensive retraining costs. As quantization represents model parameters with lower-bit integers (e.g., int8), both model size and runtime latency benefit. We empirically evaluate quantized models on code generation tasks across different dimensions: (i) resource usage and carbon footprint, (ii) accuracy, and (iii) robustness. Through systematic experiments we find a code-aware quantization recipe that could run even a 6-billion-parameter model on a regular laptop without significant accuracy or robustness degradation. We find that the recipe is readily applicable to the code summarization task as well. @InProceedings{ESEC/FSE23p224, author = {Xiaokai Wei and Sujan Kumar Gonugondla and Shiqi Wang and Wasi Ahmad and Baishakhi Ray and Haifeng Qian and Xiaopeng Li and Varun Kumar and Zijian Wang and Yuchen Tian and Qing Sun and Ben Athiwaratkun and Mingyue Shang and Murali Krishna Ramanathan and Parminder Bhatia and Bing Xiang}, title = {Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {224--236}, doi = {10.1145/3611643.3616302}, year = {2023}, } Publisher's Version |
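As a point of reference for the kind of transformation the study evaluates, here is post-training dynamic int8 quantization in PyTorch on a toy model. The paper's code-aware recipe for billion-parameter code LLMs is substantially more involved.

# Minimal illustration: quantize the linear layers of a model to int8 weights.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # weights stored as 8-bit integers
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface as the float model, smaller weights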
|
Bhuiyan, Masudul Hasan Masud |
ESEC/FSE '23-SRC: "The Call Graph Chronicles: ..."
The Call Graph Chronicles: Unleashing the Power Within
Masudul Hasan Masud Bhuiyan (CISPA Helmholtz Center for Information Security, Germany) Call graph generation is critical for program understanding and analysis, but achieving both accuracy and precision is challenging. Existing methods trade off one for the other, particularly in dynamic languages like JavaScript. This paper introduces "Graphia," an approach that combines structural and semantic information using a Graph Neural Network (GNN) to enhance call graph accuracy. Graphia’s two-step process employs an initial call graph as training data for the GNN, which then uncovers true call edges in new programs. Experimental results show Graphia significantly improves true positive rates in vulnerability detection, achieving up to 95%. This approach advances call graph accuracy by effectively incorporating code structure and context, particularly in complex dynamic language scenarios. @InProceedings{ESEC/FSE23p2210, author = {Masudul Hasan Masud Bhuiyan}, title = {The Call Graph Chronicles: Unleashing the Power Within}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2210--2212}, doi = {10.1145/3611643.3617854}, year = {2023}, } Publisher's Version |
|
Bianchi, Antonio |
ESEC/FSE '23: "Crystallizer: A Hybrid Path ..."
Crystallizer: A Hybrid Path Analysis Framework to Aid in Uncovering Deserialization Vulnerabilities
Prashast Srivastava, Flavio Toffalini, Kostyantyn Vorobyov, François Gauthier, Antonio Bianchi, and Mathias Payer (Purdue University, USA; EPFL, Switzerland; Oracle Labs, Australia) Applications use serialization and deserialization to exchange data. Serialization allows developers to exchange messages or perform remote method invocation in distributed applications. However, the application logic itself is responsible for security. Adversaries may abuse bugs in the deserialization logic to forcibly invoke attacker-controlled methods by crafting malicious bytestreams (payloads). Crystallizer presents a novel hybrid framework to automatically uncover deserialization vulnerabilities by combining static and dynamic analyses. Our intuition is to first over-approximate possible payloads through static analysis (to constrain the search space). Then, we use dynamic analysis to instantiate concrete payloads as a proof-of-concept of a vulnerability (giving the analyst concrete examples of possible attacks). Our proof-of-concept focuses on Java deserialization as the prominent domain of such attacks. We evaluate our prototype on seven popular Java libraries against state-of-the-art frameworks for uncovering gadget chains. In contrast to existing tools, we uncovered 41 previously unknown exploitable chains. Furthermore, we show the real-world security impact of Crystallizer by using it to synthesize gadget chains to mount RCE and DoS attacks on three popular Java applications. We have responsibly disclosed all newly discovered vulnerabilities. @InProceedings{ESEC/FSE23p1586, author = {Prashast Srivastava and Flavio Toffalini and Kostyantyn Vorobyov and François Gauthier and Antonio Bianchi and Mathias Payer}, title = {Crystallizer: A Hybrid Path Analysis Framework to Aid in Uncovering Deserialization Vulnerabilities}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1586--1597}, doi = {10.1145/3611643.3616313}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Bisht, Sumit |
ESEC/FSE '23: "Outage-Watch: Early Prediction ..."
Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer
Shubham Agarwal, Sarthak Chakraborty, Shaddy Garg, Sumit Bisht, Chahat Jain, Ashritha Gonuguntla, and Shiv Saini (Adobe Research, India; University of Illinois at Urbana-Champaign, USA; Adobe, India; Amazon, India; Traceable.ai, India; Cisco, India) Cloud services are omnipresent, and critical cloud service failures are a fact of life. In order to retain customers and prevent revenue loss, it is important to provide high reliability guarantees for these services. One way to do this is by predicting outages in advance, which can help in reducing the severity as well as time to recovery. It is difficult to forecast critical failures due to the rarity of these events. Moreover, critical failures are ill-defined in terms of observable data. Our proposed method, Outage-Watch, defines critical service outages as deteriorations in the Quality of Service (QoS) captured by a set of metrics. Outage-Watch detects such outages in advance by using the current system state to predict whether the QoS metrics will cross a threshold and initiate an extreme event. A mixture of Gaussians is used to model the distribution of the QoS metrics for flexibility, and an extreme event regularizer helps improve learning in the tail of the distribution. An outage is predicted if the probability of any one of the QoS metrics crossing its threshold changes significantly. Our evaluation on a real-world SaaS company dataset shows that Outage-Watch significantly outperforms traditional methods with an average AUC of 0.98. Additionally, Outage-Watch detects all the outages exhibiting a change in service metrics and reduces the Mean Time To Detection (MTTD) of outages by up to 88% when deployed in an enterprise cloud-service system, demonstrating the efficacy of our proposed method. @InProceedings{ESEC/FSE23p682, author = {Shubham Agarwal and Sarthak Chakraborty and Shaddy Garg and Sumit Bisht and Chahat Jain and Ashritha Gonuguntla and Shiv Saini}, title = {Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {682--694}, doi = {10.1145/3611643.3616316}, year = {2023}, } Publisher's Version |
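The QoS-modeling ingredient can be illustrated as follows: fit a Gaussian mixture to a metric and compute the probability mass beyond a threshold, whose shift over time would signal a brewing outage. The data, threshold, and two-component choice are illustrative; the extreme event regularizer and the learned predictor from system state are elided.

# Sketch: tail probability of a QoS metric under a fitted Gaussian mixture.
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

latency_ms = np.concatenate([np.random.normal(100, 10, 950),
                             np.random.normal(400, 50, 50)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(latency_ms)

def tail_probability(gmm, threshold):
    """P(metric > threshold) under the fitted mixture."""
    means = gmm.means_.ravel()
    stds = np.sqrt(gmm.covariances_).ravel()
    return float(np.sum(gmm.weights_ * norm.sf(threshold, means, stds)))

print(tail_probability(gmm, 300.0))  # a jump in this value flags a likely outage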
|
Bissyandé, Tegawendé F. |
ESEC/FSE '23: "Natural Language to Code: ..."
Natural Language to Code: How Far Are We?
Shangwen Wang, Mingyang Geng, Bo Lin, Zhensu Sun, Ming Wen, Yepang Liu, Li Li, Tegawendé F. Bissyandé, and Xiaoguang Mao (National University of Defense Technology, China; Singapore Management University, Singapore; Huazhong University of Science and Technology, China; Southern University of Science and Technology, China; Beihang University, China; University of Luxembourg, Luxembourg) A longstanding dream in software engineering research is to devise effective approaches for automating development tasks based on developers' informally-specified intentions. Such intentions are generally in the form of natural language descriptions. In recent literature, a number of approaches have been proposed to automate tasks such as code search and even code generation based on natural language inputs. While these approaches vary in terms of technical designs, their objective is the same: transforming a developer's intention into source code. The literature, however, lacks a comprehensive understanding of the effectiveness of existing techniques and of their complementarity to each other. We propose to fill this gap through a large-scale empirical study where we systematically evaluate natural language to code techniques. Specifically, we consider six state-of-the-art techniques targeting code search, and four targeting code generation. Through extensive evaluations on a dataset of 22K+ natural language queries, our study reveals the following major findings: (1) code search techniques based on model pre-training are so far the most effective, while code generation techniques can also provide promising results; (2) complementarity widely exists among the existing techniques; and (3) combining the ten techniques together can enhance performance by 35% compared with the most effective standalone technique. Finally, we propose a post-processing strategy to automatically integrate different techniques based on their generated code. Experimental results show that our devised strategy is both effective and extensible. @InProceedings{ESEC/FSE23p375, author = {Shangwen Wang and Mingyang Geng and Bo Lin and Zhensu Sun and Ming Wen and Yepang Liu and Li Li and Tegawendé F. Bissyandé and Xiaoguang Mao}, title = {Natural Language to Code: How Far Are We?}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {375--387}, doi = {10.1145/3611643.3616323}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
|
Biswas, Sumon |
ESEC/FSE '23: "Fix Fairness, Don’t Ruin ..."
Fix Fairness, Don’t Ruin Accuracy: Performance Aware Fairness Repair using AutoML
Giang Nguyen, Sumon Biswas, and Hridesh Rajan (Iowa State University, USA; Carnegie Mellon University, USA) Machine learning (ML) is increasingly being used in critical decision-making software, but incidents have raised questions about the fairness of ML predictions. To address this issue, new tools and methods are needed to mitigate bias in ML-based software. Previous studies have proposed bias mitigation algorithms that only work in specific situations and often result in a loss of accuracy. Our proposed solution is a novel approach that utilizes automated machine learning (AutoML) techniques to mitigate bias. Our approach includes two key innovations: a novel optimization function and a fairness-aware search space. By improving the default optimization function of AutoML and incorporating fairness objectives, we are able to mitigate bias with little to no loss of accuracy. Additionally, we propose a fairness-aware search space pruning method for AutoML to reduce computational cost and repair time. Our approach, built on the state-of-the-art Auto-Sklearn tool, is designed to reduce bias in real-world scenarios. To demonstrate its effectiveness, we evaluated our approach on four fairness problems and 16 different ML models, and our results show a significant improvement over the baseline and existing bias mitigation techniques. Our approach, Fair-AutoML, successfully repaired 60 out of 64 buggy cases, while existing bias mitigation techniques only repaired up to 44 out of 64 cases. @InProceedings{ESEC/FSE23p502, author = {Giang Nguyen and Sumon Biswas and Hridesh Rajan}, title = {Fix Fairness, Don’t Ruin Accuracy: Performance Aware Fairness Repair using AutoML}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {502--514}, doi = {10.1145/3611643.3616257}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Functional |
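The first innovation, a fairness-aware optimization function, can be sketched as accuracy penalized by a fairness violation. The demographic-parity metric and the weighting below are assumptions for illustration, not Fair-AutoML's exact objective.

# Sketch: score an AutoML candidate by accuracy minus a fairness penalty.
import numpy as np

def demographic_parity_gap(y_pred, sensitive):
    """Absolute difference in positive-prediction rates between two groups."""
    rates = [y_pred[sensitive == g].mean() for g in (0, 1)]
    return abs(rates[0] - rates[1])

def fair_score(y_true, y_pred, sensitive, fairness_weight=1.0):
    accuracy = (y_true == y_pred).mean()
    return accuracy - fairness_weight * demographic_parity_gap(y_pred, sensitive)

y_true = np.array([1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0])
sensitive = np.array([0, 0, 0, 1, 1, 1])
print(fair_score(y_true, y_pred, sensitive))  # lower when one group is favored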
|
Björkman, Mårten |
ESEC/FSE '23-INDUSTRY: "Diffusion-Based Time Series ..."
Diffusion-Based Time Series Data Imputation for Cloud Failure Prediction at Microsoft 365
Fangkai Yang, Wenjie Yin, Lu Wang, Tianci Li, Pu Zhao, Bo Liu, Paul Wang, Bo Qiao, Yudong Liu, Mårten Björkman, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang (Microsoft, China; KTH Royal Institute of Technology, Sweden; Microsoft, USA) Ensuring reliability in large-scale cloud systems like Microsoft 365 is crucial. Cloud failures, such as disk and node failures, threaten service reliability, causing service interruptions and financial loss. Existing work focuses on failure prediction, proactively taking action before failures happen. However, it suffers from poor data quality, such as data missing during model training and prediction, which limits performance. In this paper, we focus on enhancing data quality through data imputation with the proposed Diffusion+, a sample-efficient diffusion model that imputes missing data efficiently, conditioned on the observed data. Experiments with industrial datasets and application practice show that our model contributes to improving the performance of downstream failure prediction. @InProceedings{ESEC/FSE23p2050, author = {Fangkai Yang and Wenjie Yin and Lu Wang and Tianci Li and Pu Zhao and Bo Liu and Paul Wang and Bo Qiao and Yudong Liu and Mårten Björkman and Saravan Rajmohan and Qingwei Lin and Dongmei Zhang}, title = {Diffusion-Based Time Series Data Imputation for Cloud Failure Prediction at Microsoft 365}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2050--2055}, doi = {10.1145/3611643.3613866}, year = {2023}, } Publisher's Version |
|
Blouin, Arnaud |
ESEC/FSE '23: "HyperDiff: Computing Source ..."
HyperDiff: Computing Source Code Diffs at Scale
Quentin Le Dilavrec, Djamel Eddine Khelladi, Arnaud Blouin, and Jean-Marc Jézéquel (University of Rennes, France; CNRS - University of Rennes, France; INSA Rennes, France) With the advent of fast software evolution and multistage releases, temporal code analysis is becoming useful for various purposes, such as bug cause identification, bug prediction or code evolution analysis. Temporal code analyses can consist of analyzing multiple Abstract Syntax Trees (ASTs) extracted from code evolutions, e.g. one AST for each commit or release. A core feature of temporal analysis is code differencing: computing the so-called diff, or edit script, between two given versions of the code. However, jointly analyzing and computing differences over thousands of versions of code faces scalability issues, mainly because of the cost of 1) parsing the original and evolved code into source and target ASTs, and 2) wasting resources by not reusing intermediate computation results that can be shared between versions. This paper details a novel approach based on time-oriented data structures that makes code differencing scale up to large software codebases. In particular, we leverage the HyperAST, a novel representation of code histories, to propose an incremental and memory-efficient approach that lazifies GumTree, a mainstream code differencing algorithm and tool. We evaluated our approach on a curated list of 19 large software projects and compared it to GumTree. Our approach outperforms it in scalability both in time and memory. We observed an order-of-magnitude difference: 1) in CPU time from x1.2 to x12.7 for the total time of diff computation and up to x226 in intermediate phases of the diff computation, and 2) in memory footprint of x4.5 per AST node. The approach produced 99.3% of identical diffs with respect to GumTree. @InProceedings{ESEC/FSE23p288, author = {Quentin Le Dilavrec and Djamel Eddine Khelladi and Arnaud Blouin and Jean-Marc Jézéquel}, title = {HyperDiff: Computing Source Code Diffs at Scale}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {288--299}, doi = {10.1145/3611643.3616312}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available |
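The reuse that makes such diffing incremental can be shown in miniature with hash-consed trees: identical subtrees across versions become the same object, so unchanged regions are skipped in O(1). This toy differ only illustrates the shared-subtree shortcut; the lazified GumTree the paper builds is far more sophisticated.

# Sketch: hash-consed ASTs make cross-version diffing skip unchanged code.
_pool = {}
def node(label, *children):
    """Hash-consed constructor: identical subtrees are the same object."""
    return _pool.setdefault((label, *children), (label, *children))

def diff(a, b, path=""):
    if a is b:                       # shared subtree: O(1) skip, nothing changed
        return []
    if a[0] != b[0] or len(a) != len(b):
        return [(path, a[0], b[0])]  # coarse edit: replace this subtree
    edits = []
    for i, (ca, cb) in enumerate(zip(a[1:], b[1:])):
        edits += diff(ca, cb, f"{path}/{i}")
    return edits

v1 = node("file", node("f", node("return", node("x"))))
v2 = node("file", node("f", node("return", node("y"))))
print(diff(v1, v2))  # [('/0/0/0', 'x', 'y')]: only the changed path is visited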
|
Böhme, Marcel |
ESEC/FSE '23: "Statistical Reachability Analysis ..."
Statistical Reachability Analysis
Seongmin Lee and Marcel Böhme (MPI-SP, Germany) Given a target program state (or statement) s, what is the probability that an input reaches s? This is the quantitative reachability analysis problem. For instance, quantitative reachability analysis can be used to approximate the reliability of a program (where s is a bad state). Traditionally, quantitative reachability analysis is solved as a model counting problem for a formal constraint that represents the (approximate) reachability of s along paths in the program, i.e., probabilistic reachability analysis. However, in preliminary experiments, we failed to run state-of-the-art probabilistic reachability analysis on reasonably large programs. In this paper, we explore statistical methods to estimate reachability probability. An advantage of statistical reasoning is that the size and composition of the program are insubstantial as long as the program can be executed. We are particularly interested in the error compared to the state-of-the-art probabilistic reachability analysis. We realize that existing estimators do not exploit the inherent structure of the program and develop structure-aware estimators to further reduce the estimation error given the same number of samples. Our empirical evaluation on previous and new benchmark programs shows that (i) our statistical reachability analysis outperforms state-of-the-art probabilistic reachability analysis tools in terms of accuracy, efficiency, and scalability, and (ii) our structure-aware estimators further outperform (blackbox) estimators that do not exploit the inherent program structure. We also identify multiple program properties that limit the applicability of the existing probabilistic analysis techniques. @InProceedings{ESEC/FSE23p326, author = {Seongmin Lee and Marcel Böhme}, title = {Statistical Reachability Analysis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {326--337}, doi = {10.1145/3611643.3616268}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
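A baseline blackbox estimator for reachability probability is plain Monte Carlo sampling with a confidence interval, as sketched below against a stand-in program. The paper's structure-aware estimators reduce the estimation error further by exploiting the program's branching structure, which this sketch ignores.

# Sketch: estimate P(input reaches s) by sampling and executing the program.
import math
import random

def program_reaches_target(x):
    """Stand-in program: the 'target' s is reached on this input condition."""
    return x > 0.9 and int(x * 100) % 2 == 0

def estimate_reachability(n=100_000, z=1.96):
    hits = sum(program_reaches_target(random.random()) for _ in range(n))
    p = hits / n
    half_width = z * math.sqrt(p * (1 - p) / n)  # normal-approximation CI
    return p, (p - half_width, p + half_width)

p, ci = estimate_reachability()
print(f"P(reach s) ~= {p:.4f}, 95% CI {ci}")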
|
Bruck, Lucas |
ESEC/FSE '23-INDUSTRY: "Test Case Generation for Drivability ..."
Test Case Generation for Drivability Requirements of an Automotive Cruise Controller: An Experience with an Industrial Simulator
Federico Formica, Nicholas Petrunti, Lucas Bruck, Vera Pantelic, Mark Lawford, and Claudio Menghi (McMaster University, Canada; University of Bergamo, Italy) Automotive software development requires engineers to test their systems to detect violations of both functional and drivability requirements. Functional requirements define the functionality of the automotive software. Drivability requirements refer to the driver's perception of the interactions with the vehicle; for example, they typically require limiting the acceleration and jerk perceived by the driver within given thresholds. While functional requirements are extensively considered by the research literature, drivability requirements garner less attention. This industrial paper describes our experience assessing the usefulness of an automated search-based software testing (SBST) framework in generating failure-revealing test cases for functional and drivability requirements. We report on our experience with the VI-CarRealTime simulator, an industrial virtual modeling and simulation environment widely used in the automotive domain. We designed a Cruise Control system in Simulink for a four-wheel vehicle, in an iterative fashion, by producing 21 model versions. We used the SBST framework for each version of the model to search for failure-revealing test cases revealing requirement violations. Our results show that the SBST framework successfully identified a failure-revealing test case for 66.7% of our model versions, requiring, on average, 245.9s and 3.8 iterations. We present lessons learned, reflect on the generality of our results, and discuss how our results improve the state of practice. @InProceedings{ESEC/FSE23p1949, author = {Federico Formica and Nicholas Petrunti and Lucas Bruck and Vera Pantelic and Mark Lawford and Claudio Menghi}, title = {Test Case Generation for Drivability Requirements of an Automotive Cruise Controller: An Experience with an Industrial Simulator}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1949--1960}, doi = {10.1145/3611643.3613894}, year = {2023}, } Publisher's Version |
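A drivability requirement of this kind translates naturally into a fitness function for the search: the margin by which the worst observed jerk stays under its threshold, so the search minimizes the margin and negative values reveal a violating test. The signal, sampling period, and limit below are illustrative assumptions, not the paper's exact setup.

# Sketch: jerk-based robustness of a simulated acceleration trace.
import numpy as np

def jerk_robustness(acceleration, dt, jerk_limit=10.0):
    """Smallest margin (m/s^3) by which the jerk requirement holds."""
    jerk = np.abs(np.diff(acceleration) / dt)
    return float(jerk_limit - jerk.max())

# A trace as it might come from a co-simulation (values are invented):
trace = np.array([0.0, 1.0, 2.5, 2.6, 1.0])  # m/s^2, sampled every 0.1 s
print(jerk_robustness(trace, dt=0.1))  # negative => drivability violation found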
|
Brun, Yuriy |
ESEC/FSE '23: "Baldur: Whole-Proof Generation ..."
Baldur: Whole-Proof Generation and Repair with Large Language Models
Emily First, Markus N. Rabe, Talia Ringer, and Yuriy Brun (University of Massachusetts, USA; Augment Computing, USA; University of Illinois at Urbana-Champaign, USA) Formally verifying software is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language and code and fine-tuned on proofs, to generate whole proofs at once. We then demonstrate that a model fine-tuned to repair generated proofs further increases proving power. This paper: (1) Demonstrates that whole-proof generation using transformers is possible and is as effective as, but more efficient than, search-based techniques. (2) Demonstrates that giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair that further improves automated proof generation. (3) Establishes, together with prior work, a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs, empirically showing the effectiveness of whole-proof generation, repair, and added context. We also show that Baldur complements the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification. @InProceedings{ESEC/FSE23p1229, author = {Emily First and Markus N. Rabe and Talia Ringer and Yuriy Brun}, title = {Baldur: Whole-Proof Generation and Repair with Large Language Models}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1229--1241}, doi = {10.1145/3611643.3616243}, year = {2023}, } Publisher's Version |
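The repair setup, feeding back a failed attempt and the prover's error message, amounts to prompt construction plus re-checking. A hedged sketch, where generate() is a placeholder for the fine-tuned model (not a real API) and the serialization format and Isabelle snippet are only examples:

# Sketch: build a repair prompt from theorem, failed proof, and error message.
def build_repair_prompt(theorem, failed_proof, error_message):
    return (f"(* Theorem *)\n{theorem}\n"
            f"(* Failed proof *)\n{failed_proof}\n"
            f"(* Error *)\n{error_message}\n"
            f"(* Corrected proof *)\n")

prompt = build_repair_prompt(
    'lemma add_comm: "a + b = (b :: nat) + a"',
    "by simp",
    "Failed to finish proof.",
)
# candidate = generate(prompt)  # hypothetical call to the fine-tuned model;
# the candidate would then be re-checked with Isabelle/HOL and iterated on.
print(prompt)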
|
Bufalino, Jacopo |
ESEC/FSE '23-INDUSTRY: "Analyzing Microservice Connectivity ..."
Analyzing Microservice Connectivity with Kubesonde
Jacopo Bufalino, Mario Di Francesco, and Tuomas Aura (Aalto University, Finland; Eficode, Finland) Modern cloud-based applications are composed of several microservices that interact over a network. They are complex distributed systems, to the point that developers may not even be aware of how microservices connect to each other and to the Internet. As a consequence, the security of these applications can be greatly compromised. This work explicitly targets this context by providing a methodology to assess microservice connectivity, a software tool that implements it, and findings from analyzing real cloud applications. Specifically, it introduces Kubesonde, a cloud-native tool that instruments live applications running on a Kubernetes cluster to analyze microservice connectivity, with minimal impact on performance. An assessment of microservices in 200 popular cloud applications with Kubesonde revealed significant issues in terms of network isolation: more than 60% of them had discrepancies between their declared and actual connectivity, and none restricted outbound connections towards the Internet. Our analysis shows that Kubesonde offers valuable insights into the connectivity between microservices, beyond what is possible with existing tools. @InProceedings{ESEC/FSE23p2038, author = {Jacopo Bufalino and Mario Di Francesco and Tuomas Aura}, title = {Analyzing Microservice Connectivity with Kubesonde}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2038--2043}, doi = {10.1145/3611643.3613899}, year = {2023}, } Publisher's Version |
|
Burnett, Margaret |
ESEC/FSE '23-KEY: "Getting Outside the Bug Boxes ..."
Getting Outside the Bug Boxes (Keynote)
Margaret Burnett (Oregon State University, USA) Sometimes, we humans find ourselves a bit slow to abandon the comfort of sitting “inside the box”, and this can detract from our ability to innovate. In this talk, I’ll share some outside-the-box perspectives, gleaned from decades of software engineering work, on boxes I’ve seen when thinking about bugs — from failures to faults, from finding to fixing, and from traditional to very non-traditional notions of “what counts” as a bug. I’ll consider the intellectually freeing perspectives that can come from moving outside the “mechanisms” box to policies; the enhancement to applicability from moving outside sub-sub-area boxes to the whole software lifecycle; the differences revealed when moving outside the “typical developer” box to diverse humans; and the plethora of possibilities arising from moving outside the “buggy code” box to a wide range of bug types. @InProceedings{ESEC/FSE23p2, author = {Margaret Burnett}, title = {Getting Outside the Bug Boxes (Keynote)}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2--2}, doi = {10.1145/3611643.3633452}, year = {2023}, } Publisher's Version |
|
Caballero, Juan |
ESEC/FSE '23: "LibKit: Detecting Third-Party ..."
LibKit: Detecting Third-Party Libraries in iOS Apps
Daniel Domínguez-Álvarez, Alejandro de la Cruz, Alessandra Gorla, and Juan Caballero (IMDEA Software Institute, Spain; University of Verona, Italy) We present LibKit, the first approach and tool for detecting the name and version of third-party libraries (TPLs) present in iOS apps. LibKit automatically builds fingerprints for 86K library versions available through the CocoaPods dependency manager and matches them against the decrypted app executables to identify the TPLs (name and version) an iOS app uses. LibKit supports apps written in Swift and Objective-C, detects statically and dynamically linked libraries, and addresses challenges such as partially included libraries and different compiler versions and configurations producing variants of the same library version. On a ground truth of 95 open-source apps, LibKit identifies libraries with a precision of 0.911 and a recall of 0.839. LibKit also significantly outperforms the state-of-the-art CRiOS tool for identifying TPL boundaries. When applied to 1,500 apps from the iTunes Store, LibKit detects 47,015 library versions, identifying popular apps that contain old library versions. @InProceedings{ESEC/FSE23p1407, author = {Daniel Domínguez-Álvarez and Alejandro de la Cruz and Alessandra Gorla and Juan Caballero}, title = {LibKit: Detecting Third-Party Libraries in iOS Apps}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1407--1418}, doi = {10.1145/3611643.3616344}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available |
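Fingerprint matching of this kind can be illustrated with symbol sets and a containment threshold to tolerate partially included libraries. The entries below are invented examples; LibKit's actual fingerprints over Swift and Objective-C binaries are considerably richer.

# Sketch: detect (library, version) pairs by fingerprint containment.
fingerprints = {
    ("AFNetworking", "4.0.1"): {"AFHTTPSessionManager", "AFURLSessionManager"},
    ("Alamofire", "5.6.0"): {"SessionDelegate", "RequestInterceptor"},
}

def detect_libraries(app_symbols, threshold=0.8):
    """Report (name, version) pairs whose fingerprint is mostly present,
    tolerating partially included libraries via the threshold."""
    hits = []
    for (name, version), symbols in fingerprints.items():
        coverage = len(symbols & app_symbols) / len(symbols)
        if coverage >= threshold:
            hits.append((name, version, coverage))
    return hits

app = {"AFHTTPSessionManager", "AFURLSessionManager", "MyAppDelegate"}
print(detect_libraries(app))  # [("AFNetworking", "4.0.1", 1.0)]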
|
Cabra-Acela, Laura |
ESEC/FSE '23-DEMO: "On Using Information Retrieval ..."
On Using Information Retrieval to Recommend Machine Learning Good Practices for Software Engineers
Laura Cabra-Acela, Anamaria Mojica-Hanke, Mario Linares-Vásquez, and Steffen Herbold (University of Los Andes, Colombia; University of Passau, Germany) Machine learning (ML) is nowadays widely used for different purposes and in several disciplines. From self-driving cars to automated medical diagnosis, machine learning models extensively support users’ daily activities, and software engineering tasks are no exception. Not embracing good ML practices may lead to pitfalls that hinder the performance of an ML system and potentially lead to unexpected results. Despite the existence of documentation and literature about ML best practices, many non-ML experts turn towards gray literature like blogs and Q&A systems when looking for help and guidance when implementing ML systems. To better aid users in distilling relevant knowledge from such sources, we propose a recommender system that recommends ML practices based on the user’s context. As a first step in creating a recommender system for machine learning practices, we implemented Idaka, a tool that provides two different approaches for retrieving/generating ML best practices: i) an information retrieval (IR) engine and ii) a large language model. The IR engine uses BM25 as the retrieval algorithm, while the generative approach uses a large language model, in our case Alpaca. The platform has been designed to allow comparative studies of best practices retrieval tools. Idaka is publicly available at GitHub: https://bit.ly/idaka. Video: https://youtu.be/cEb-AhIPxnM @InProceedings{ESEC/FSE23p2142, author = {Laura Cabra-Acela and Anamaria Mojica-Hanke and Mario Linares-Vásquez and Steffen Herbold}, title = {On Using Information Retrieval to Recommend Machine Learning Good Practices for Software Engineers}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2142--2146}, doi = {10.1145/3611643.3613093}, year = {2023}, } Publisher's Version Published Artifact Video Info Artifacts Available |
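The IR side of such a recommender can be reproduced in a few lines with the rank_bm25 package: index short practice descriptions and retrieve by query. The practice corpus below is invented, not Idaka's actual knowledge base.

# Sketch: BM25 retrieval of ML good practices for a developer query.
from rank_bm25 import BM25Okapi

practices = [
    "validate and split data before training to avoid leakage",
    "set random seeds to make experiments reproducible",
    "monitor models in production for data drift",
]
bm25 = BM25Okapi([doc.split() for doc in practices])

query = "how do I make my ML experiment reproducible".split()
scores = bm25.get_scores(query)
best = max(range(len(practices)), key=scores.__getitem__)
print(practices[best])  # "set random seeds to make experiments reproducible"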
|
Cai, Chang |
ESEC/FSE '23-INDUSTRY: "LightF3: A Lightweight Fully-Process ..."
LightF3: A Lightweight Fully-Process Formal Framework for Automated Verifying Railway Interlocking Systems
Yibo Dong, Xiaoyu Zhang, Yicong Xu, Chang Cai, Yu Chen, Weikai Miao, Jianwen Li, and Geguang Pu (East China Normal University, China; Shanghai Trusted Industrial Control Platform, China) Interlocking has long played a crucial role in railway systems. Its functional correctness, particularly concerning safety, forms the foundation of the entire signaling system. To date, numerous efforts have been made to formally model and verify interlocking systems. However, two main problems persist in most prior work: (1) The formal description of the interlocking system heavily depends on reusing existing models, which often results in overgeneralization and failing to fully utilize the intrinsic characteristics of interlocking systems. (2) The verification techniques of current approaches may quickly become outdated, and there is no adaptable method to integrate state-of-the-art verification algorithms or tools. To address the above issues, we present LightF3, a lightweight and fully-process formal framework for modeling and verifying railway interlocking systems. LightF3 provides RIS-FL, a formal language based on FQLTL (a variant of LTL) to model the system and its specifications. LightF3 transforms the RIS-FL model automatically to the aiger model, which is the mainstream input of state-of-the-art model checkers, and then invokes the most advanced checkers to complete the verification task. We evaluated LightF3 by testing five real station instances from our industrial partner, demonstrating its effectiveness as a new framework. Additionally, we analyzed the statistics of the verification results from different model-checking techniques, providing useful conclusions for both the railway interlocking and formal methods communities. @InProceedings{ESEC/FSE23p1914, author = {Yibo Dong and Xiaoyu Zhang and Yicong Xu and Chang Cai and Yu Chen and Weikai Miao and Jianwen Li and Geguang Pu}, title = {LightF3: A Lightweight Fully-Process Formal Framework for Automated Verifying Railway Interlocking Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1914--1925}, doi = {10.1145/3611643.3613874}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
|
Cai, Shaowei |
ESEC/FSE '23: "CAmpactor: A Novel and Effective ..."
CAmpactor: A Novel and Effective Local Search Algorithm for Optimizing Pairwise Covering Arrays
Qiyuan Zhao, Chuan Luo, Shaowei Cai, Wei Wu, Jinkun Lin, Hongyu Zhang, and Chunming Hu (Beihang University, China; Institute of Software at Chinese Academy of Sciences, China; Central South University, China; Xiangjiang Laboratory, China; SeedMath Technology, China; Chongqing University, China) The increasing demand for software customization has led to the development of highly configurable systems. Combinatorial interaction testing (CIT) is an effective method for testing these types of systems. The ultimate goal of CIT is to generate a test suite of acceptable size, called a t-wise covering array (CA), where t is the testing strength. Pairwise testing (i.e., CIT with t=2) is recognized as the most widely used CIT technique and has strong fault detection capability. In pairwise testing, the most important problem is pairwise CA generation (PCAG), which is to generate a pairwise CA (PCA) of minimum size. However, existing state-of-the-art PCAG algorithms suffer from a severe scalability challenge; that is, they cannot tackle large-scale PCAG instances effectively, resulting in PCAs of large sizes. To alleviate this challenge, in this paper we propose CAmpactor, a novel and effective local search algorithm for compacting given PCAs into smaller sizes. Extensive experiments on a large number of real-world, public PCAG instances show that the sizes of CAmpactor's generated PCAs are around 45% smaller than the sizes of PCAs constructed by existing state-of-the-art PCAG algorithms, indicating its superiority. Also, our evaluation confirms the generality of CAmpactor, since CAmpactor can reduce the sizes of PCAs generated by a variety of PCAG algorithms. @InProceedings{ESEC/FSE23p81, author = {Qiyuan Zhao and Chuan Luo and Shaowei Cai and Wei Wu and Jinkun Lin and Hongyu Zhang and Chunming Hu}, title = {CAmpactor: A Novel and Effective Local Search Algorithm for Optimizing Pairwise Covering Arrays}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {81--93}, doi = {10.1145/3611643.3616284}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
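The core move of such a local search is easy to sketch: count uncovered value pairs and mutate single cells to reduce that count. The toy below is not CAmpactor's algorithm — the parameters, the acceptance rule, and the fixed row budget are assumptions — but it shows the objective and the neighborhood for k parameters with v values each:

```python
import random
from itertools import combinations

def uncovered_pairs(array, k, v):
    """Set of (col_i, val_i, col_j, val_j) pairs not covered by any row."""
    need = {(i, a, j, b) for i, j in combinations(range(k), 2)
            for a in range(v) for b in range(v)}
    for row in array:
        for i, j in combinations(range(k), 2):
            need.discard((i, row[i], j, row[j]))
    return need

def local_search(k=4, v=3, rows=10, steps=20000, seed=0):
    rng = random.Random(seed)
    array = [[rng.randrange(v) for _ in range(k)] for _ in range(rows)]
    best = len(uncovered_pairs(array, k, v))
    for _ in range(steps):
        if best == 0:
            break
        r, c = rng.randrange(rows), rng.randrange(k)
        old = array[r][c]
        array[r][c] = rng.randrange(v)     # mutate one cell
        cost = len(uncovered_pairs(array, k, v))
        if cost <= best:                   # accept non-worsening moves
            best = cost
        else:                              # revert worsening moves
            array[r][c] = old
    return array, best

arr, remaining = local_search()
print("uncovered pairs left:", remaining)
```

Compaction then amounts to removing a row and rerunning the search to see whether full coverage can still be restored.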
|
Cai, Yan |
ESEC/FSE '23: "Discovering Parallelisms in ..."
Discovering Parallelisms in Python Programs
Siwei Wei, Guyang Song, Senlin Zhu, Ruoyi Ruan, Shihao Zhu, and Yan Cai (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Ant Group, China) Parallelization is a promising way to improve the performance of Python programs. Unfortunately, developers may miss parallelization possibilities, because they usually do not concentrate on parallelization. Many approaches have been proposed to parallelize Python programs automatically; however, they are either domain-specific or require manual annotation. Thus, they cannot solve the problem well in general. In this paper, we propose PyPar, an effective tool aiming at discovering parallelization possibilities in real-world Python programs. PyPar needs no manual annotation and is universally applicable. It first performs a data-dependence analysis to determine whether two pieces of code can run concurrently. The key is the use of a graph-theoretic approach. Next, it adopts a dynamic selection strategy to eliminate inefficient parallelisms. Finally, PyPar produces a parallelism report as well as a referential parallelized program, which is built by PyPar using one of the three parallelization methods (thread-based, process-based, and Ray-based). We have implemented a prototype of PyPar and evaluated it on six well-designed, widely used, real-world Python packages: Scikit-Image, SciPy, librosa, trimesh, Scikit-learn, and seaborn. In total, 1,240 functions are tested, and PyPar found 127 parallelizable functions among them. After manual filtering, only 7 of them are false positives (i.e., a 94.5% precision). The remaining 120 are parallelizable (almost 10% among all functions under test), and most of them can be efficiently sped up, with accelerations of up to 90% and an average of 44%. The acceleration in practice is close to the theoretical estimate. The results show that even well-designed practical Python programs can be further parallelized to speed them up, and PyPar can bring effective and efficient parallelization to real-world Python programs. @InProceedings{ESEC/FSE23p832, author = {Siwei Wei and Guyang Song and Senlin Zhu and Ruoyi Ruan and Shihao Zhu and Yan Cai}, title = {Discovering Parallelisms in Python Programs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {832--844}, doi = {10.1145/3611643.3616259}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
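Once two code regions are shown to be free of data dependences, the thread- or process-based rewrite is mechanical. A minimal sketch of what such a referential parallelization looks like — the two independent tasks here are invented placeholders, not PyPar output:

```python
from concurrent.futures import ProcessPoolExecutor

def task_a(n):          # placeholder: shares no data with task_b
    return sum(i * i for i in range(n))

def task_b(n):          # placeholder: independent of task_a
    return sum(i ** 0.5 for i in range(n))

if __name__ == "__main__":
    # Sequential version: task_a(10**6); task_b(10**6)
    # Parallel version, valid only because the tasks share no state:
    with ProcessPoolExecutor(max_workers=2) as pool:
        fa = pool.submit(task_a, 10**6)
        fb = pool.submit(task_b, 10**6)
        print(fa.result(), fb.result())
```

The hard part, which the paper addresses, is proving the "shares no state" precondition and deciding when the parallel version is actually faster.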
|
Cakmak, Ebru |
ESEC/FSE '23-INDUSTRY: "Issue Report Validation in ..."
Issue Report Validation in an Industrial Context
Ethem Utku Aktas, Ebru Cakmak, Mete Cihad Inan, and Cemal Yilmaz (Softtech Research and Development, Turkiye; Microsoft EMEA, Turkiye; Sabanci University, Turkiye) Effective issue triaging is crucial for software development teams to improve software quality, and thus customer satisfaction. Validating issue reports manually can be time-consuming, hindering the overall efficiency of the triaging process. This paper presents an approach to automating the validation of issue reports to accelerate the issue triaging process in an industrial set-up. We work on 1,200 randomly selected issue reports in the banking domain, written in Turkish, an agglutinative language, meaning that new words can be formed with linear concatenation of suffixes to express entire sentences. We manually label these reports for validity, and extract the relevant patterns indicating that they are invalid. Since the issue reports we work on are written in an agglutinative language, we use morphological analysis to extract the features. Using the proposed feature extractors, we utilize a machine learning based approach to predict the issue reports’ validity, achieving an F1-score of 0.77. @InProceedings{ESEC/FSE23p2026, author = {Ethem Utku Aktas and Ebru Cakmak and Mete Cihad Inan and Cemal Yilmaz}, title = {Issue Report Validation in an Industrial Context}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2026--2031}, doi = {10.1145/3611643.3613887}, year = {2023}, } Publisher's Version |
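A minimal stand-in for such a validity classifier, as a scikit-learn pipeline — plain token features replace the paper's morphological features, and the reports and labels are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

reports = ["screen freezes after login", "please add dark mode",
           "transfer fails with error 500", "how do I change my password"]
valid = [1, 0, 1, 0]   # 1 = actionable defect report, 0 = invalid/other

X_tr, X_te, y_tr, y_te = train_test_split(
    reports, valid, test_size=0.5, stratify=valid, random_state=0)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```

For an agglutinative language like Turkish, the vectorizer's tokens would be replaced by morpheme-level features from a morphological analyzer, which is the substantive contribution here.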
|
Camtepe, Seyit |
ESEC/FSE '23: "Mate! Are You Really Aware? ..."
Mate! Are You Really Aware? An Explainability-Guided Testing Framework for Robustness of Malware Detectors
Ruoxi Sun, Minhui Xue, Gareth Tyson, Tian Dong, Shaofeng Li, Shuo Wang, Haojin Zhu, Seyit Camtepe, and Surya Nepal (CSIRO’s Data61, Australia; Cybersecurity CRC, Australia; Hong Kong University of Science and Technology, China; Shanghai Jiao Tong University, China; Peng Cheng Laboratory, China) Numerous open-source and commercial malware detectors are available. However, their efficacy is threatened by new adversarial attacks, whereby malware attempts to evade detection, e.g., by performing feature-space manipulation. In this work, we propose an explainability-guided and model-agnostic testing framework for robustness of malware detectors when confronted with adversarial attacks. The framework introduces the concept of Accrued Malicious Magnitude (AMM) to identify which malware features could be manipulated to maximize the likelihood of evading detection. We then use this framework to test several state-of-the-art malware detectors' ability to detect manipulated malware. We find that (i) commercial antivirus engines are vulnerable to AMM-guided test cases; (ii) the ability of a manipulated malware generated using one detector to evade detection by another detector (i.e., transferability) depends on the overlap of features with large AMM values between the different detectors; and (iii) AMM values effectively measure the fragility of features (i.e., capability of feature-space manipulation to flip the prediction results) and explain the robustness of malware detectors facing evasion attacks. Our findings shed light on the limitations of current malware detectors, as well as how they can be improved. @InProceedings{ESEC/FSE23p1573, author = {Ruoxi Sun and Minhui Xue and Gareth Tyson and Tian Dong and Shaofeng Li and Shuo Wang and Haojin Zhu and Seyit Camtepe and Surya Nepal}, title = {Mate! Are You Really Aware? An Explainability-Guided Testing Framework for Robustness of Malware Detectors}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1573--1585}, doi = {10.1145/3611643.3616309}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
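The testing loop in this style of framework is: rank features by attribution magnitude, manipulate the most fragile ones, and re-query the detector. A toy stand-in using impurity-based importances — synthetic binary features and labels; AMM itself is computed from explainability outputs over feature values, not from `feature_importances_`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 8)).astype(float)   # toy malware features
y = (X[:, 0] + X[:, 1] >= 2).astype(int)              # toy "malicious" label

detector = RandomForestClassifier(random_state=0).fit(X, y)

sample = X[y == 1][0].copy()                  # a sample flagged as malicious
ranked = np.argsort(detector.feature_importances_)[::-1]
for k in ranked[:2]:                          # flip the most influential features
    sample[k] = 1.0 - sample[k]
print("evades detection:", detector.predict([sample])[0] == 0)
```

If flipping a handful of high-attribution features flips the verdict, those features are fragile in exactly the sense the paper measures.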
|
Cao, Jialun |
ESEC/FSE '23: "Understanding the Bug Characteristics ..."
Understanding the Bug Characteristics and Fix Strategies of Federated Learning Systems
Xiaohu Du, Xiao Chen, Jialun Cao, Ming Wen, Shing-Chi Cheung, and Hai Jin (Huazhong University of Science and Technology, China; Hong Kong University of Science and Technology, China) Federated learning (FL) is an emerging machine learning paradigm that aims to address the problem of isolated data islands. To preserve privacy, FL allows machine learning models and deep neural networks to be trained from decentralized data kept privately at individual devices. FL has been increasingly adopted in mission-critical fields such as finance and healthcare. However, bugs in FL systems are inevitable and may result in catastrophic consequences such as financial loss, inappropriate medical decisions, and violations of data privacy ordinances. While many recent studies were conducted to understand the bugs in machine learning systems, no existing study characterizes the bugs arising from the unique nature of FL systems. To fill the gap, we collected 395 real bugs from six popular FL frameworks (Tensorflow Federated, PySyft, FATE, Flower, PaddleFL, and Fedlearner) in GitHub and StackOverflow, and then manually analyzed their symptoms and impacts, prone stages, root causes, and fix strategies. Furthermore, we report a series of findings and actionable implications that can potentially facilitate the detection of FL bugs. @InProceedings{ESEC/FSE23p1358, author = {Xiaohu Du and Xiao Chen and Jialun Cao and Ming Wen and Shing-Chi Cheung and Hai Jin}, title = {Understanding the Bug Characteristics and Fix Strategies of Federated Learning Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1358--1370}, doi = {10.1145/3611643.3616347}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable ESEC/FSE '23: "Testing Coreference Resolution ..." Testing Coreference Resolution Systems without Labeled Test Sets Jialun Cao, Yaojie Lu, Ming Wen, and Shing-Chi Cheung (Hong Kong University of Science and Technology, China; Guangzhou HKUST Fok Ying Tung Research Institute, China; Institute of Software at Chinese Academy of Sciences, China; Huazhong University of Science and Technology, China) Coreference resolution (CR) is a task to resolve different expressions (e.g., named entities, pronouns) that refer to the same real-world entity/event. It is a core natural language processing (NLP) component that underlies and empowers major downstream NLP applications such as machine translation, chatbots, and question-answering. Despite its broad impact, the problem of testing CR systems has rarely been studied. A major difficulty is the shortage of a labeled dataset for testing. While it is possible to feed arbitrary sentences as test inputs to a CR system, a test oracle that captures their expected test outputs (coreference relations) is hard to define automatically. To address the challenge, we propose Crest, an automated testing methodology for CR systems. Crest uses constituency and dependency relations to construct pairs of test inputs subject to the same coreference. These relations can be leveraged to define the metamorphic relation for metamorphic testing. We compare Crest with five state-of-the-art test generation baselines on two popular CR systems, and apply them to generate tests from 1,000 sentences randomly sampled from CoNLL-2012, a popular dataset for coreference resolution. Experimental results show that Crest outperforms baselines significantly. 
The issues reported by Crest are all true positives (i.e., 100% precision), compared with 63% to 75% achieved by the baselines. @InProceedings{ESEC/FSE23p107, author = {Jialun Cao and Yaojie Lu and Ming Wen and Shing-Chi Cheung}, title = {Testing Coreference Resolution Systems without Labeled Test Sets}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {107--119}, doi = {10.1145/3611643.3616258}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
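The metamorphic idea generalizes well beyond coreference: generate a variant input that should preserve the property under test, then flag any disagreement as a suspicious output, with no labeled oracle required. A generic harness sketch — the resolver and the rewrite rule are stubs invented for illustration, not Crest's constituency- and dependency-based construction:

```python
def resolver(sentence):
    """Stub coreference resolver: returns the set of (mention, antecedent)
    links it predicts. A real CR system would be called here."""
    return {("she", "Mary")} if "Mary" in sentence else set()

def metamorphic_test(sentence, transform):
    """The transform must preserve coreference; any output difference
    is a candidate bug in the resolver -- no labels needed."""
    original = resolver(sentence)
    variant = resolver(transform(sentence))
    return original == variant, original, variant

# A (simplistic) meaning-preserving rewrite used as the metamorphic relation.
to_variant = lambda s: s.replace("Mary said that", "Mary stated that")
ok, a, b = metamorphic_test("Mary said that she was tired.", to_variant)
print("consistent" if ok else f"discrepancy: {a} vs {b}")
```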
|
Cao, Junming |
ESEC/FSE '23: "Demystifying Dependency Bugs ..."
Demystifying Dependency Bugs in Deep Learning Stack
Kaifeng Huang, Bihuan Chen, Susheng Wu, Junming Cao, Lei Ma, and Xin Peng (Fudan University, China; University of Tokyo, Japan; University of Alberta, Canada) Deep learning (DL) applications, built upon a heterogeneous and complex DL stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow), are subject to software and hardware dependencies across the DL stack. One challenge in dependency management across the entire engineering lifecycle is posed by the asynchronous and radical evolution and the complex version constraints among dependencies. Developers may introduce dependency bugs (DBs) in selecting, using and maintaining dependencies. However, the characteristics of DBs in the DL stack are still under-investigated, hindering practical solutions to dependency management in the DL stack. To bridge this gap, this paper presents the first comprehensive study to characterize symptoms, root causes and fix patterns of DBs across the whole DL stack with 446 DBs collected from StackOverflow posts and GitHub issues. For each DB, we first investigate the symptom as well as the lifecycle stage and dependency where the symptom is exposed. Then, we analyze the root cause as well as the lifecycle stage and dependency where the root cause is introduced. Finally, we explore the fix pattern and the knowledge sources that are used to fix it. Our findings from this study shed light on practical implications for dependency management. @InProceedings{ESEC/FSE23p450, author = {Kaifeng Huang and Bihuan Chen and Susheng Wu and Junming Cao and Lei Ma and Xin Peng}, title = {Demystifying Dependency Bugs in Deep Learning Stack}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {450--462}, doi = {10.1145/3611643.3616325}, year = {2023}, } Publisher's Version |
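Many of the dependency bugs described here boil down to version constraints that stop holding once one layer of the stack evolves. Checking declared constraints mechanically is cheap; a sketch using the `packaging` library — the stack snapshot and the constraint strings are invented for illustration:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Invented snapshot of a DL stack and the constraints one layer declares.
installed = {"cuda": "11.8", "python": "3.10.12", "tensorflow": "2.13.0"}
declared = {"cuda": ">=11.2,<12", "python": ">=3.8,<3.12", "tensorflow": ">=2.12"}

for pkg, spec in declared.items():
    ok = Version(installed[pkg]) in SpecifierSet(spec)
    print(f"{pkg} {installed[pkg]} satisfies '{spec}': {ok}")
```

The study's harder cases are exactly those this check misses: constraints that are undeclared, stale, or spread across hardware, OS, and framework layers.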
|
Carlson, Chris |
ESEC/FSE '23-INDUSTRY: "Ownership in the Hands of ..."
Ownership in the Hands of Accountability at Brightsquid: A Case Study and a Developer Survey
Umme Ayman Koana, Francis Chew, Chris Carlson, and Maleknaz Nayebi (York University, Canada; Brightsquid, Canada) The COVID-19 pandemic has accelerated the adoption of digital health solutions. This has presented significant challenges for software development teams, which have had to swiftly adjust to market needs and demand. To address these challenges, product management teams have had to adapt their approach to software development, reshaping their processes to meet the demands of the pandemic. Brightsquid implemented a new task assignment process aimed at enhancing developer accountability toward the customer. To assess the impact of this change on code ownership, we conducted a code change analysis. Additionally, we surveyed 67 developers to investigate the relationship between accountability and ownership more broadly. The findings of our case study indicate that the revised assignment model not only increased the perceived sense of accountability within the production team but also improved code resilience against ownership changes. Moreover, the survey results revealed that a majority of the participating developers (67.5%) associated perceived accountability with artifact ownership. @InProceedings{ESEC/FSE23p2008, author = {Umme Ayman Koana and Francis Chew and Chris Carlson and Maleknaz Nayebi}, title = {Ownership in the Hands of Accountability at Brightsquid: A Case Study and a Developer Survey}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2008--2019}, doi = {10.1145/3611643.3613890}, year = {2023}, } Publisher's Version |
|
Cassano, Federico |
ESEC/FSE '23-DEMO: "npm-follower: A Complete Dataset ..."
npm-follower: A Complete Dataset Tracking the NPM Ecosystem
Donald Pinckney, Federico Cassano, Arjun Guha, and Jonathan Bell (Northeastern University, USA) Software developers typically rely upon a large network of dependencies to build their applications. For instance, the NPM package repository contains over 3 million packages and serves tens of billions of downloads weekly. Understanding the structure and nature of packages, dependencies, and published code requires datasets that provide researchers with easy access to metadata and code of packages. However, prior work on NPM dataset construction typically has two limitations: 1) only metadata is scraped, and 2) packages or versions that are deleted from NPM cannot be scraped. Over 330,000 versions of packages were deleted from NPM between July 2022 and May 2023. This data is critical for researchers as it often pertains to important questions of security and malware. We present npm-follower, a dataset and crawling architecture which archives metadata and code of all packages and versions as they are published, and is thus able to retain data which is later deleted. The dataset currently includes over 35 million versions of packages, and grows at a rate of about 1 million versions per month. The dataset is designed to be easily used by researchers answering questions involving either metadata or program analysis. Both the code and dataset are available at: https://dependencies.science @InProceedings{ESEC/FSE23p2132, author = {Donald Pinckney and Federico Cassano and Arjun Guha and Jonathan Bell}, title = {npm-follower: A Complete Dataset Tracking the NPM Ecosystem}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2132--2136}, doi = {10.1145/3611643.3613094}, year = {2023}, } Publisher's Version |
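Following the registry boils down to fetching and archiving per-package documents before they can disappear. A minimal metadata fetch against the public NPM registry endpoint — error handling, rate limiting, and the changes-feed architecture that npm-follower actually uses are omitted:

```python
import json
import urllib.request

def fetch_metadata(package):
    """Fetch the public registry document for one package
    (covers all of its currently published versions)."""
    url = f"https://registry.npmjs.org/{package}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

doc = fetch_metadata("left-pad")
print(doc["name"], "has", len(doc["versions"]), "versions;",
      "latest:", doc["dist-tags"]["latest"])
# Archiving each document as it is published is what lets a follower
# retain versions that are later deleted from the registry.
```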
|
Cha, Sang Kil |
ESEC/FSE '23: "FunProbe: Probing Functions ..."
FunProbe: Probing Functions from Binary Code through Probabilistic Analysis
Soomin Kim, Hyungseok Kim, and Sang Kil Cha (KAIST, South Korea) Current function identification techniques have been mostly focused on a specific set of binaries compiled for a specific CPU architecture. While recent deep-learning-based approaches theoretically can handle binaries from different architectures, they require significant computation resources for training and inference, making their use less practical. Furthermore, due to the lack of interpretability of such models, it is fundamentally difficult to gain insight from them. Hence, in this paper, we propose FunProbe, an efficient system for identifying functions from binaries using probabilistic inference. In particular, we identify 16 architecture-neutral hints for function identification, and devise an effective method to combine them in a probabilistic framework. We evaluate our tool on a large dataset consisting of 19,872 real-world binaries compiled for six major CPU architectures. The results are promising. FunProbe shows the best accuracy compared to five state-of-the-art tools we tested, while it takes only 6 seconds on average to analyze a single binary. Notably, FunProbe is 6× faster on average in identifying functions than XDA, a state-of-the-art deep-learning tool that leverages GPU in its inference phase. @InProceedings{ESEC/FSE23p1419, author = {Soomin Kim and Hyungseok Kim and Sang Kil Cha}, title = {FunProbe: Probing Functions from Binary Code through Probabilistic Analysis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1419--1430}, doi = {10.1145/3611643.3616366}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
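The fusion step can be pictured as combining independent noisy hints into a posterior per address. A naive-Bayes-style sketch — the hint likelihoods and the prior are invented numbers, and FunProbe itself uses a more elaborate probabilistic framework over its 16 hints:

```python
def posterior_is_function(hints, prior=0.05):
    """Combine binary hints about an address under a naive independence
    assumption. Each hint carries P(hint | function) and P(hint | not)."""
    odds = prior / (1 - prior)
    for fired, p_given_fn, p_given_not in hints:
        if fired:
            odds *= p_given_fn / p_given_not
        else:
            odds *= (1 - p_given_fn) / (1 - p_given_not)
    return odds / (1 + odds)

# Invented hints: is a call target, matches a prologue pattern, follows padding.
hints = [(True, 0.9, 0.05), (True, 0.7, 0.2), (False, 0.4, 0.3)]
print(f"P(function) = {posterior_is_function(hints):.3f}")
```

Because each hint is architecture-neutral (call targets, padding, prologue shapes), the same combination rule carries across ISAs, which is the point of the design.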
|
Chakraborty, Saikat |
ESEC/FSE '23: "Grace: Language Models Meet ..."
Grace: Language Models Meet Code Edits
Priyanshu Gupta, Avishree Khare, Yasharth Bajpai, Saikat Chakraborty, Sumit Gulwani, Aditya Kanade, Arjun Radhakrishna, Gustavo Soares, and Ashish Tiwari (Microsoft, India; University of Pennsylvania, USA; Microsoft Research, USA; Microsoft, USA; Microsoft Research, India) Developers spend a significant amount of time in editing code for a variety of reasons such as bug fixing or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research due to the diversity of code edits and the difficulty of capturing the developer intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) with the knowledge of relevant prior associated edits, which we call the Grace (Generation conditioned on Associated Code Edits) method. The generative capability of the LLMs helps address the diversity in code changes and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, codex and CodeT5, in zero-shot and fine-tuning settings respectively. In our experiments with two datasets, Grace boosts the performance of the LLMs significantly, enabling them to generate 29% and 54% more correctly edited code in top-1 suggestions relative to the current state-of-the-art symbolic and neural approaches, respectively. @InProceedings{ESEC/FSE23p1483, author = {Priyanshu Gupta and Avishree Khare and Yasharth Bajpai and Saikat Chakraborty and Sumit Gulwani and Aditya Kanade and Arjun Radhakrishna and Gustavo Soares and Ashish Tiwari}, title = {Grace: Language Models Meet Code Edits}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1483--1495}, doi = {10.1145/3611643.3616253}, year = {2023}, } Publisher's Version Published Artifact Archive submitted (770 kB) Artifacts Available |
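Mechanically, conditioning generation on associated edits is prompt construction: serialize the prior related edits alongside the code to change. A sketch of such a prompt builder — the format and field names are invented; Grace's actual encoding, training setup, and models differ:

```python
def build_edit_prompt(associated_edits, code_before, instruction):
    """Assemble an LLM prompt that conditions on prior related edits."""
    parts = ["You are completing a code edit. Related prior edits:"]
    for i, (before, after) in enumerate(associated_edits, 1):
        parts += [f"--- edit {i} before ---", before,
                  f"--- edit {i} after ----", after]
    parts += ["--- code to edit ---", code_before,
              "--- intent ---", instruction,
              "--- edited code ---"]
    return "\n".join(parts)

prior = [("MAX_RETRIES = 3", "MAX_RETRIES = 5"),
         ("timeout=3", "timeout=5")]
print(build_edit_prompt(prior, "connect(retries=3)", "raise limits to 5"))
```

The prior edits carry the latent intent ("raise limits to 5") that the current code alone does not express, which is what the paper exploits.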
|
Chakraborty, Sarthak |
ESEC/FSE '23: "Outage-Watch: Early Prediction ..."
Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer
Shubham Agarwal, Sarthak Chakraborty, Shaddy Garg, Sumit Bisht, Chahat Jain, Ashritha Gonuguntla, and Shiv Saini (Adobe Research, India; University of Illinois at Urbana-Champaign, USA; Adobe, India; Amazon, India; Traceable.ai, India; Cisco, India) Cloud services are omnipresent, and critical cloud service failures are a fact of life. In order to retain customers and prevent revenue loss, it is important to provide high reliability guarantees for these services. One way to do this is by predicting outages in advance, which can help in reducing the severity as well as time to recovery. It is difficult to forecast critical failures due to the rarity of these events. Moreover, critical failures are ill-defined in terms of observable data. Our proposed method, Outage-Watch, defines critical service outages as deteriorations in the Quality of Service (QoS) captured by a set of metrics. Outage-Watch detects such outages in advance by using current system state to predict whether the QoS metrics will cross a threshold and initiate an extreme event. A mixture of Gaussians is used to model the distribution of the QoS metrics for flexibility, and an extreme event regularizer helps improve learning in the tail of the distribution. An outage is predicted if the probability of any one of the QoS metrics crossing its threshold changes significantly. Our evaluation on a real-world SaaS company dataset shows that Outage-Watch significantly outperforms traditional methods with an average AUC of 0.98. Additionally, Outage-Watch detects all the outages exhibiting a change in service metrics and reduces the Mean Time To Detection (MTTD) of outages by up to 88% when deployed in an enterprise cloud-service system, demonstrating the efficacy of our proposed method. @InProceedings{ESEC/FSE23p682, author = {Shubham Agarwal and Sarthak Chakraborty and Shaddy Garg and Sumit Bisht and Chahat Jain and Ashritha Gonuguntla and Shiv Saini}, title = {Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {682--694}, doi = {10.1145/3611643.3616316}, year = {2023}, } Publisher's Version |
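The tail-probability computation at the heart of this formulation is compact: fit a Gaussian mixture to a QoS metric and sum the component tail masses above the threshold. A sketch with scikit-learn and SciPy — the data is synthetic, and the extreme-event regularizer and the learned state-to-distribution mapping are beyond this snippet:

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
latency = np.concatenate([rng.normal(100, 10, 950),   # normal operation
                          rng.normal(400, 50, 50)])   # rare degradations

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(latency.reshape(-1, 1))

threshold = 300.0
weights = gmm.weights_
means = gmm.means_.ravel()
stds = np.sqrt(gmm.covariances_).ravel()
# P(latency > threshold) = sum_k w_k * (1 - CDF_k(threshold))
p_exceed = float(np.sum(weights * norm.sf(threshold, loc=means, scale=stds)))
print(f"P(QoS metric crosses threshold) = {p_exceed:.4f}")
```

A significant jump in this probability between consecutive system states is what would trigger an outage prediction.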
|
Chang, Alison |
ESEC/FSE '23: "Building and Sustaining Ethnically, ..."
Building and Sustaining Ethnically, Racially, and Gender Diverse Software Engineering Teams: A Study at Google
Ella Dagan, Anita Sarma, Alison Chang, Sarah D’Angelo, Jill Dicker, and Emerson Murphy-Hill (Google, USA; Oregon State University, USA) Teams that build software are largely demographically homogeneous. Without diversity, homogeneous perspectives dominate how, why, and for whom software is designed. To understand how teams can successfully build and sustain diversity, we interviewed 11 engineers and 9 managers from some of the most gender and racially diverse teams at Google, a large software company. Qualitatively analyzing the interviews, we found shared approaches to recruiting, hiring, and promoting an inclusive environment, all of which create a positive feedback loop. Our findings produce actionable practices that every member of the team can take to increase diversity by fostering a more inclusive software engineering environment. @InProceedings{ESEC/FSE23p631, author = {Ella Dagan and Anita Sarma and Alison Chang and Sarah D’Angelo and Jill Dicker and Emerson Murphy-Hill}, title = {Building and Sustaining Ethnically, Racially, and Gender Diverse Software Engineering Teams: A Study at Google}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {631--643}, doi = {10.1145/3611643.3616273}, year = {2023}, } Publisher's Version |
|
Chaparro, Oscar |
ESEC/FSE '23: "On the Relationship between ..."
On the Relationship between Code Verifiability and Understandability
Kobi Feldman, Martin Kellogg, and Oscar Chaparro (College of William and Mary, USA; New Jersey Institute of Technology, USA) Proponents of software verification have argued that simpler code is easier to verify: that is, that verification tools issue fewer false positives and require less human intervention when analyzing simpler code. We empirically validate this assumption by comparing the number of warnings produced by four state-of-the-art verification tools on 211 snippets of Java code with 20 metrics of code comprehensibility from human subjects in six prior studies. Our experiments, based on a statistical (meta-)analysis, show that, in aggregate, there is a small correlation (r = 0.23) between understandability and verifiability. The results support the claim that easy-to-verify code is often easier to understand than code that requires more effort to verify. Our work has implications for the users and designers of verification tools and for future attempts to automatically measure code comprehensibility: verification tools may have ancillary benefits to understandability, and measuring understandability may require reasoning about semantic, not just syntactic, code properties. @InProceedings{ESEC/FSE23p211, author = {Kobi Feldman and Martin Kellogg and Oscar Chaparro}, title = {On the Relationship between Code Verifiability and Understandability}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {211--223}, doi = {10.1145/3611643.3616242}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Chatterjee, Preetha |
ESEC/FSE '23-IVR: "Exploring Moral Principles ..."
Exploring Moral Principles Exhibited in OSS: A Case Study on GitHub Heated Issues
Ramtin Ehsani, Rezvaneh Rezapour, and Preetha Chatterjee (Drexel University, USA) To foster collaboration and inclusivity in Open Source Software (OSS) projects, it is crucial to understand and detect patterns of toxic language that may drive contributors away, especially those from underrepresented communities. Although machine learning-based toxicity detection tools trained on domain-specific data have shown promise, their design lacks an understanding of the unique nature and triggers of toxicity in OSS discussions, highlighting the need for further investigation. In this study, we employ Moral Foundations Theory to examine the relationship between moral principles and toxicity in OSS. Specifically, we analyze toxic communications in GitHub issue threads to identify and understand five types of moral principles exhibited in text, and explore their potential association with toxic behavior. Our preliminary findings suggest a possible link between moral principles and toxic comments in OSS communications, with each moral principle associated with at least one type of toxicity. The potential of MFT in toxicity detection warrants further investigation. @InProceedings{ESEC/FSE23p2092, author = {Ramtin Ehsani and Rezvaneh Rezapour and Preetha Chatterjee}, title = {Exploring Moral Principles Exhibited in OSS: A Case Study on GitHub Heated Issues}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2092--2096}, doi = {10.1145/3611643.3613077}, year = {2023}, } Publisher's Version ESEC/FSE '23-IVR: "Towards Understanding Emotions ..." Towards Understanding Emotions in Informal Developer Interactions: A Gitter Chat Study Amirali Sajadi, Kostadin Damevski, and Preetha Chatterjee (Drexel University, USA; Virginia Commonwealth University, USA) Emotions play a significant role in teamwork and collaborative activities like software development. While researchers have analyzed developer emotions in various software artifacts (e.g., issues, pull requests), few studies have focused on understanding the broad spectrum of emotions expressed in chats. As one of the most widely used means of communication, chats contain valuable information in the form of informal conversations, such as negative perspectives about adopting a tool. In this paper, we present a dataset of developer chat messages manually annotated with a wide range of emotion labels (and sub-labels), and analyze the type of information present in those messages. We also investigate the unique signals of emotions specific to chats and distinguish them from other forms of software communication. Our findings suggest that chats have fewer expressions of Approval and Fear but more expressions of Curiosity compared to GitHub comments. We also notice that Confusion is frequently observed when discussing programming-related information such as unexpected software behavior. Overall, our study highlights the potential of mining emotions in developer chats for supporting software maintenance and evolution tools. @InProceedings{ESEC/FSE23p2097, author = {Amirali Sajadi and Kostadin Damevski and Preetha Chatterjee}, title = {Towards Understanding Emotions in Informal Developer Interactions: A Gitter Chat Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2097--2101}, doi = {10.1145/3611643.3613084}, year = {2023}, } Publisher's Version |
|
Chechik, Marsha |
ESEC/FSE '23-IVR: "A Vision on Intentions in ..."
A Vision on Intentions in Software Engineering
Jacob Krüger, Yi Li, Chenguang Zhu, Marsha Chechik, Thorsten Berger, and Julia Rubin (Eindhoven University of Technology, Netherlands; Nanyang Technological University, Singapore; University of Texas at Austin, USA; University of Toronto, Canada; Ruhr University Bochum, Germany; Chalmers - University of Gothenburg, Sweden; University of British Columbia, Canada) Intentions are fundamental in software engineering, but they are typically only implicitly considered through different abstractions, such as requirements, use cases, features, or issues. Specifically, software engineers develop and evolve (i.e., change) a software system based on such abstractions of a stakeholder’s intention—something a stakeholder wants the system to be able to do. Unfortunately, existing abstractions are (inherently) limited when it comes to representing stakeholder intentions and are mostly used for documenting only. So, whether a change in a system fulfills its underlying intention (and only this one) is an essential problem in practice that motivates many research areas (e.g., testing to ensure intended behavior, untangling intentions in commits). We argue that none of the existing abstractions is ideal for capturing intentions and controlling software evolution, which is why intentions are often vague and must be recovered, untangled, or understood in retrospect. In this paper, we reflect on the role of intentions (represented by changes) in software engineering and sketch how improving their management may support developers. Particularly, we argue that continuously managing and controlling intentions as well as their fulfillment has the potential to improve the reasoning about which stakeholder requests have been addressed, avoid misunderstandings, and prevent expensive retrospective analyses. To guide future research for achieving such benefits for researchers and practitioners, we discuss the relationships between different abstractions and intentions, and propose steps towards managing intentions. @InProceedings{ESEC/FSE23p2117, author = {Jacob Krüger and Yi Li and Chenguang Zhu and Marsha Chechik and Thorsten Berger and Julia Rubin}, title = {A Vision on Intentions in Software Engineering}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2117--2121}, doi = {10.1145/3611643.3613087}, year = {2023}, } Publisher's Version ESEC/FSE '23-IVR: "Towards Feature-Based Analysis ..." Towards Feature-Based Analysis of the Machine Learning Development Lifecycle Boyue Caroline Hu and Marsha Chechik (University of Toronto, Canada) The safety and trustworthiness of systems with components that are based on Machine Learning (ML) require an in-depth understanding and analysis of all stages in its Development Lifecycle (MLDL). High-level abstractions of desired functionalities, model behaviour, and data are called features, and they have been studied by different communities across all MLDL stages. In this paper, we propose to support Software Engineering analysis of the MLDL through features, calling it feature-based analysis of the MLDL. First, to achieve a shared understanding of features among different experts, we establish a taxonomy of existing feature definitions currently used in various MLDL stages. Through this taxonomy, we map features from different stages to each other, discover gaps and future research directions and identify areas of collaboration between Software Engineering and other MLDL experts. 
@InProceedings{ESEC/FSE23p2087, author = {Boyue Caroline Hu and Marsha Chechik}, title = {Towards Feature-Based Analysis of the Machine Learning Development Lifecycle}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2087--2091}, doi = {10.1145/3611643.3613082}, year = {2023}, } Publisher's Version ESEC/FSE '23: "DecompoVision: Reliability ..." DecompoVision: Reliability Analysis of Machine Vision Components through Decomposition and Reuse Boyue Caroline Hu, Lina Marsso, Nikita Dvornik, Huakun Shen, and Marsha Chechik (University of Toronto, Canada; Waabi, Canada) Analyzing reliability of Machine Vision Components (MVC) against scene changes (such as rain or fog) in their operational environment is crucial for safety-critical applications. Safety analysis relies on the availability of precisely specified and, ideally, machine-verifiable requirements. The state-of-the-art reliability framework ICRAF developed machine-verifiable requirements obtained using human performance data. However, ICRAF is limited to analyzing reliability of MVCs solving simple vision tasks, such as image classification. Yet, many real-world safety-critical systems require solving more complex vision tasks, such as object detection and instance segmentation. Fortunately, many complex vision tasks (which we call “c-tasks”) can be represented as a sequence of simple vision subtasks. For instance, object detection can be decomposed as object localization followed by classification. Based on this fact, in this paper, we show that the analysis of c-tasks can also be decomposed as a sequential analysis of their simple subtasks, which allows us to apply existing techniques for analyzing simple vision tasks. Specifically, we propose a modular reliability framework, DecompoVision, that decomposes: (1) the problem of solving a c-task, (2) the reliability requirements, and (3) the reliability analysis, and, as a result, provides deeper insights into MVC reliability. DecompoVision extends ICRAF to handle complex vision tasks and enables reuse of existing artifacts across different c-tasks. We capture new reliability gaps by checking our requirements on 13 widely used object detection MVCs, and, for the first time, benchmark segmentation MVCs. @InProceedings{ESEC/FSE23p541, author = {Boyue Caroline Hu and Lina Marsso and Nikita Dvornik and Huakun Shen and Marsha Chechik}, title = {DecompoVision: Reliability Analysis of Machine Vision Components through Decomposition and Reuse}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {541--552}, doi = {10.1145/3611643.3616333}, year = {2023}, } Publisher's Version |
|
Chen, Bihuan |
ESEC/FSE '23: "Demystifying Dependency Bugs ..."
Demystifying Dependency Bugs in Deep Learning Stack
Kaifeng Huang, Bihuan Chen, Susheng Wu, Junming Cao, Lei Ma, and Xin Peng (Fudan University, China; University of Tokyo, Japan; University of Alberta, Canada) Deep learning (DL) applications, built upon a heterogeneous and complex DL stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow), are subject to software and hardware dependencies across the DL stack. One challenge in dependency management across the entire engineering lifecycle is posed by the asynchronous and radical evolution and the complex version constraints among dependencies. Developers may introduce dependency bugs (DBs) in selecting, using and maintaining dependencies. However, the characteristics of DBs in the DL stack are still under-investigated, hindering practical solutions to dependency management in the DL stack. To bridge this gap, this paper presents the first comprehensive study to characterize symptoms, root causes and fix patterns of DBs across the whole DL stack with 446 DBs collected from StackOverflow posts and GitHub issues. For each DB, we first investigate the symptom as well as the lifecycle stage and dependency where the symptom is exposed. Then, we analyze the root cause as well as the lifecycle stage and dependency where the root cause is introduced. Finally, we explore the fix pattern and the knowledge sources that are used to fix it. Our findings from this study shed light on practical implications for dependency management. @InProceedings{ESEC/FSE23p450, author = {Kaifeng Huang and Bihuan Chen and Susheng Wu and Junming Cao and Lei Ma and Xin Peng}, title = {Demystifying Dependency Bugs in Deep Learning Stack}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {450--462}, doi = {10.1145/3611643.3616325}, year = {2023}, } Publisher's Version |
|
Chen, Chaoyi |
ESEC/FSE '23-INDUSTRY: "Appaction: Automatic GUI Interaction ..."
Appaction: Automatic GUI Interaction for Mobile Apps via Holistic Widget Perception
Yongxiang Hu, Jiazhen Gu, Shuqing Hu, Yu Zhang, Wenjie Tian, Shiyu Guo, Chaoyi Chen, and Yangfan Zhou (Fudan University, China; Meituan, China; Shanghai Key Laboratory of Intelligent Information Processing, China) In industrial practice, GUI (Graphical User Interface) testing of mobile apps still inevitably relies on huge manual effort. The major effort lies in understanding the GUIs so that testing scripts can be written accordingly. Quality assurance could therefore be very labor-intensive, especially for modern commercial mobile apps, where a single app may include numerous, diverse, and complex GUIs, e.g., those for placing orders of different commercial items. To reduce such human effort, we propose Appaction, a learning-based automatic GUI interaction approach we developed for Meituan, one of the largest E-commerce providers with over 600 million users. Appaction can automatically analyze the target GUI and understand what each input of the GUI is about, so that corresponding valid inputs can be entered accordingly. To this end, Appaction adopts a multi-modal model to learn from human experiences in perceiving a GUI. This allows it to infer corresponding valid input events that can properly interact with the GUI. In this way, the target app can be effectively exercised. We present our experiences at Meituan in applying Appaction to popular commercial apps. We demonstrate the effectiveness of Appaction in GUI analysis and show that it performs correct interactions on numerous form pages. @InProceedings{ESEC/FSE23p1786, author = {Yongxiang Hu and Jiazhen Gu and Shuqing Hu and Yu Zhang and Wenjie Tian and Shiyu Guo and Chaoyi Chen and Yangfan Zhou}, title = {Appaction: Automatic GUI Interaction for Mobile Apps via Holistic Widget Perception}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1786--1797}, doi = {10.1145/3611643.3613885}, year = {2023}, } Publisher's Version |
|
Chen, Chengjun |
ESEC/FSE '23: "OOM-Guard: Towards Improving ..."
OOM-Guard: Towards Improving the Ergonomics of Rust OOM Handling via a Reservation-Based Approach
Chengjun Chen, Zhicong Zhang, Hongliang Tian, Shoumeng Yan, and Hui Xu (Fudan University, China; Ant Group, China) Out of memory (OOM) is an exceptional system state where any further memory allocation requests may fail. Such allocation failures would crash the process or system if not handled properly, and they may also lead to an inconsistent program state that cannot be recovered easily. Current mechanisms for preventing such hazards rely heavily on the manual effort of the programmers themselves. This paper studies the OOM issues of Rust, which is an emerging system programming language that stresses the importance of memory safety but still lacks handy mechanisms to handle OOM well. Even worse, Rust employs an infallible mode of memory allocations by default. As a result, a program written in Rust would simply abort itself when OOM occurs. Such crashes would lead to critical robustness issues for services or modules of operating systems. We propose OOM-Guard, a handy approach for Rust programmers to handle OOM. OOM-Guard is by nature a reservation-based approach that aims to convert the handlings for many possible failed memory allocations into handlings for a smaller number of reservations. In order to achieve efficient reservation, OOM-Guard incorporates a subtle cost analysis algorithm based on static analysis and a proxy allocator. We then apply OOM-Guard to two well-known Rust projects, Bento and rCore. Results show that OOM-Guard can largely reduce developers' effort in handling OOM and incurs negligible overhead in both memory space and execution time. @InProceedings{ESEC/FSE23p733, author = {Chengjun Chen and Zhicong Zhang and Hongliang Tian and Shoumeng Yan and Hui Xu}, title = {OOM-Guard: Towards Improving the Ergonomics of Rust OOM Handling via a Reservation-Based Approach}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {733--744}, doi = {10.1145/3611643.3616303}, year = {2023}, } Publisher's Version |
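The reservation idea transfers to any managed setting: fail once, up front, when a budget cannot be reserved, instead of scattering fallible allocations throughout the logic. A language-neutral sketch in Python — this mirrors the concept only; OOM-Guard itself works on Rust allocations, backed by static cost analysis and a proxy allocator:

```python
class ReservationPool:
    """Toy memory-budget pool: reserve once, then draw infallibly."""
    def __init__(self, capacity_bytes):
        self.free = capacity_bytes

    def reserve(self, nbytes):
        if nbytes > self.free:
            raise MemoryError(f"cannot reserve {nbytes} bytes")  # one failure point
        self.free -= nbytes
        return Reservation(nbytes)

class Reservation:
    def __init__(self, budget):
        self.budget = budget

    def alloc(self, nbytes):
        # Inside a reservation, allocation cannot fail by construction;
        # the cost analysis is what guarantees the budget suffices.
        assert nbytes <= self.budget, "cost analysis should prevent this"
        self.budget -= nbytes
        return bytearray(nbytes)

pool = ReservationPool(capacity_bytes=1 << 20)
r = pool.reserve(4096)                       # handle OOM here, once
buf1, buf2 = r.alloc(1024), r.alloc(2048)    # infallible thereafter
print(len(buf1), len(buf2), "reserved bytes left:", r.budget)
```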
|
Chen, Chunyang |
ESEC/FSE '23-INDUSTRY: "Towards Efficient Record and ..."
Towards Efficient Record and Replay: A Case Study in WeChat
Sidong Feng, Haochuan Lu, Ting Xiong, Yuetang Deng, and Chunyang Chen (Monash University, Australia; Tencent, China) WeChat, a widely-used messenger app boasting over 1 billion monthly active users, requires effective app quality assurance for its complex features. Record-and-replay tools are crucial in achieving this goal. Despite the extensive development of these tools, the impact of waiting time between replay events has been largely overlooked. On one hand, a long waiting time for executing replay events on fully-rendered GUIs slows down the process. On the other hand, a short waiting time can lead to events executing on partially-rendered GUIs, negatively affecting replay effectiveness. An optimal waiting time should strike a balance between effectiveness and efficiency. We introduce WeReplay, a lightweight image-based approach that dynamically adjusts inter-event time based on the GUI rendering state. Given the real-time streaming on the GUI, WeReplay employs a deep learning model to infer the rendering state and synchronize with the replaying tool, scheduling the next event when the GUI is fully rendered. Our evaluation shows that our model achieves 92.1% precision and 93.3% recall in discerning GUI rendering states in the WeChat app. Through assessing the performance in replaying 23 common WeChat usage scenarios, WeReplay successfully replays all scenarios on the same and different devices more efficiently than the state-of-the-practice baselines. @InProceedings{ESEC/FSE23p1681, author = {Sidong Feng and Haochuan Lu and Ting Xiong and Yuetang Deng and Chunyang Chen}, title = {Towards Efficient Record and Replay: A Case Study in WeChat}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1681--1692}, doi = {10.1145/3611643.3613880}, year = {2023}, } Publisher's Version ESEC/FSE '23: "Automated and Context-Aware ..." Automated and Context-Aware Repair of Color-Related Accessibility Issues for Android Apps Yuxin Zhang, Sen Chen, Lingling Fan, Chunyang Chen, and Xiaohong Li (Tianjin University, China; Nankai University, China; Monash University, Australia) Approximately 15% of the world's population is suffering from various disabilities or impairments. However, many mobile UX designers and developers disregard the significance of accessibility for those with disabilities when developing apps. It is unbelievable that one in seven people might not have the same level of access that other users have, which actually violates many legal and regulatory standards. On the contrary, if the apps are developed with accessibility in mind, it will drastically improve the user experience for all users as well as maximize revenue. Thus, a large number of studies and some effective tools for detecting accessibility issues have been conducted and proposed to mitigate such a severe problem. However, compared with detection, the repair work is obviously falling behind. Especially for the color-related accessibility issues, which is one of the top issues in apps with a greatly negative impact on vision and user experience. Apps with such issues are difficult to use for people with low vision and the elderly. Unfortunately, such an issue type cannot be directly fixed by existing repair techniques. To this end, we propose Iris, an automated and context-aware repair method to fix the color-related accessibility issues (i.e., the text contrast issues and the image contrast issues) for apps. 
By leveraging a novel context-aware technique that resolves the optimal colors and a vital phase of attribute-to-repair localization, Iris not only repairs the color contrast issues but also guarantees the consistency of the design style between the original UI page and the repaired UI page. Our experiments showed that Iris can achieve a 91.38% repair success rate with high effectiveness and efficiency. The usefulness of Iris has also been evaluated by a user study with a high satisfaction rate as well as developers' positive feedback. 9 of 40 submitted pull requests on GitHub repositories have been accepted and merged into the projects by app developers, and another 4 developers are actively discussing further repairs with us. Iris is publicly available to facilitate this new research direction. @InProceedings{ESEC/FSE23p1255, author = {Yuxin Zhang and Sen Chen and Lingling Fan and Chunyang Chen and Xiaohong Li}, title = {Automated and Context-Aware Repair of Color-Related Accessibility Issues for Android Apps}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1255--1267}, doi = {10.1145/3611643.3616329}, year = {2023}, } Publisher's Version Info |
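The text-contrast half of this problem has a crisp core: the WCAG contrast ratio. A sketch that computes the ratio and darkens a foreground color until the common 4.5:1 threshold is met — the greedy darkening loop is a simplification and assumes a light background; Iris additionally optimizes for design-style consistency:

```python
def relative_luminance(rgb):
    """WCAG relative luminance for an (R, G, B) color in 0..255."""
    def channel(c):
        c /= 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)),
                    reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

def repair_foreground(fg, bg, target=4.5):
    """Darken the foreground step by step until the target ratio is met."""
    while contrast_ratio(fg, bg) < target and any(fg):
        fg = tuple(max(0, c - 5) for c in fg)
    return fg

bg, fg = (255, 255, 255), (170, 170, 170)   # light gray on white: too low
print(round(contrast_ratio(fg, bg), 2))      # ~2.32, fails WCAG AA
print(repair_foreground(fg, bg))             # a darker gray that passes
```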
|
Chen, Hongxu |
ESEC/FSE '23: "Software Architecture Recovery ..."
Software Architecture Recovery with Information Fusion
Yiran Zhang, Zhengzi Xu, Chengwei Liu, Hongxu Chen, Jianwen Sun, Dong Qiu, and Yang Liu (Nanyang Technological University, Singapore; Huawei Technologies, China) Understanding the architecture is vital for effectively maintaining and managing large software systems. However, as software systems evolve over time, their architectures inevitably change. To keep up with the change, architects need to track the implementation-level changes and update the architectural documentation accordingly, which is time-consuming and error-prone. Therefore, many automatic architecture recovery techniques have been proposed to ease this process. Despite efforts to improve the accuracy of architecture recovery, existing solutions still suffer from two limitations. First, most of them use only one or two types of information for the recovery, ignoring the potential usefulness of other sources. Second, they tend to use the information in a coarse-grained manner, overlooking important details within it. To address these limitations, we propose SARIF, a fully automated architecture recovery technique, which incorporates three types of comprehensive information, including dependencies, code text and folder structure. SARIF can recover architecture more accurately by thoroughly analyzing the details of each type of information and adaptively fusing them based on their relevance and quality. To evaluate SARIF, we collected six projects with published ground-truth architectures and three open-source projects labeled by our industrial collaborators. We compared SARIF with nine state-of-the-art techniques using three commonly-used architecture similarity metrics and two new metrics. The experimental results show that SARIF is 36.1% more accurate than the best of the previous techniques on average. By providing comprehensive architecture, SARIF can help users understand systems effectively and reduce the manual effort of obtaining ground-truth architectures. @InProceedings{ESEC/FSE23p1535, author = {Yiran Zhang and Zhengzi Xu and Chengwei Liu and Hongxu Chen and Jianwen Sun and Dong Qiu and Yang Liu}, title = {Software Architecture Recovery with Information Fusion}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1535--1547}, doi = {10.1145/3611643.3616285}, year = {2023}, } Publisher's Version |
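The fusion step can be pictured as a weighted combination of per-source similarity matrices followed by clustering. A toy sketch with scikit-learn (1.2+ for the `metric` parameter) — the file set, the similarity matrices, the fixed weights, and the clustering choice are all illustrative; SARIF derives its weights adaptively from relevance and quality:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

files = ["net/http.c", "net/tcp.c", "ui/window.c", "ui/button.c"]

# Invented per-source similarities in [0, 1] between the four files.
dep_sim  = np.array([[1, .9, .1, .1], [.9, 1, .1, .2],
                     [.1, .1, 1, .8], [.1, .2, .8, 1]], float)
text_sim = np.array([[1, .7, .2, .1], [.7, 1, .1, .1],
                     [.2, .1, 1, .9], [.1, .1, .9, 1]], float)
dir_sim  = np.array([[1, 1, 0, 0], [1, 1, 0, 0],
                     [0, 0, 1, 1], [0, 0, 1, 1]], float)

# Fuse with assumed weights: dependencies 0.5, code text 0.3, folders 0.2.
fused = 0.5 * dep_sim + 0.3 * text_sim + 0.2 * dir_sim

clusters = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                   linkage="average").fit(1 - fused)
for f, c in zip(files, clusters.labels_):
    print(c, f)   # recovered modules: the net/ files vs. the ui/ files
```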
|
Chen, Hongyang |
ESEC/FSE '23: "Nezha: Interpretable Fine-Grained ..."
Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data
Guangba Yu, Pengfei Chen, Yufeng Li, Hongyang Chen, Xiaoyun Li, and Zibin Zheng (Sun Yat-sen University, China) Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging task. To understand and localize root causes of unexpected faults, modern observability tools collect and preserve multi-modal observability data, including metrics, traces, and logs. Since system faults may manifest as anomalies in different data sources, existing RCA approaches that rely on single-modal data are constrained in the granularity and interpretability of root causes. In this study, we present Nezha, an interpretable and fine-grained RCA approach that pinpoints root causes at the code region and resource type level by incorporative analysis of multi-modal data. Nezha transforms heterogeneous multi-modal data into a homogeneous event representation and extracts event patterns by constructing and mining event graphs. The core idea of Nezha is to compare event patterns in the fault-free phase with those in the fault-suffering phase to localize root causes in an interpretable way. Practical implementation and experimental evaluations on two microservice applications show that Nezha achieves a high top-1 accuracy (89.77%) on average at the code region and resource type level and outperforms state-of-the-art approaches by a large margin. Two ablation studies further confirm the contributions of incorporating multi-modal data. @InProceedings{ESEC/FSE23p553, author = {Guangba Yu and Pengfei Chen and Yufeng Li and Hongyang Chen and Xiaoyun Li and Zibin Zheng}, title = {Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {553--565}, doi = {10.1145/3611643.3616249}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
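The comparison at the core of this idea can be sketched as frequency differencing of event patterns between a fault-free and a fault-suffering window. In the toy below, "patterns" are simplified to adjacent event pairs and the event streams are invented; the paper mines much richer event graphs built from metrics, traces, and logs:

```python
from collections import Counter

def pair_patterns(events):
    """Adjacent event pairs as a crude stand-in for mined graph patterns."""
    return Counter(zip(events, events[1:]))

fault_free = ["recv", "query_db", "reply", "recv", "query_db", "reply"]
faulty     = ["recv", "query_db", "retry", "retry", "reply", "recv", "retry"]

before, after = pair_patterns(fault_free), pair_patterns(faulty)
# Rank patterns by how much more often they occur in the fault phase.
suspects = sorted(((after[p] - before.get(p, 0), p) for p in after),
                  reverse=True)
for score, pattern in suspects[:3]:
    print(score, pattern)   # the retry-heavy patterns surface first
```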
|
Chen, Jinfu |
ESEC/FSE '23: "IoPV: On Inconsistent Option ..."
IoPV: On Inconsistent Option Performance Variations
Jinfu Chen, Zishuo Ding, Yiming Tang, Mohammed Sayagh, Heng Li, Bram Adams, and Weiyi Shang (Wuhan University, China; University of Waterloo, Canada; Rochester Institute of Technology, USA; ÉTS, Canada; Polytechnique Montréal, Canada; Queen’s University, Canada) Maintaining good performance is an essential task when evolving a software system. Performance regression issues are among the dominant problems that large software systems face. In addition, these large systems tend to be highly configurable, which allows users to change the behaviour of these systems by simply altering the values of certain configuration options. However, such flexibility comes with a cost. Such software systems suffer throughout their evolution from what we refer to as “Inconsistent Option Performance Variation” (IoPV ). An IoPV indicates, for a given commit, that the performance regression or improvement of different values of the same configuration option is inconsistent compared to the prior commit. For instance, a new change might not suffer from any performance regression under the default configuration (i.e., when all the options are set to their default values), while altering one option’s value manifests a regression, which we refer to as a hidden regression as it is not manifested under the default configuration. Similarly, when developers improve the performance of their systems, performance regression might be manifested under a subset of the existing configurations. Unfortunately, such hidden regressions are harmful as they can go unseen to the production environment. In this paper, we first quantify how prevalent (in)consistent performance regression or improvement is among the values of an option. In particular, we study over 803 Hadoop and 502 Cassandra commits, for which we execute a total of 4,902 and 4,197 tests, respectively, amounting to 12,536 machine hours of testing. We observe that IoPV is a common problem that is difficult to manually predict. 69% and 93% of the Hadoop and Cassandra commits have at least one configuration that hides a performance regression. Worse, most of the commits have different options or tests leading to IoPV and hiding performance regressions. Therefore, we propose a prediction model that identifies whether a given combination of commit, test, and option (CTO) manifests an IoPV. Our evaluation for different models shows that random forest is the best performing classifier, with a median AUC of 0.91 and 0.82 for Hadoop and Cassandra, respectively. Our paper defines and provides scientific evidence about the IoPV problem and its prevalence, which can be explored by future work. In addition, we provide an initial machine learning model for predicting IoPV. @InProceedings{ESEC/FSE23p845, author = {Jinfu Chen and Zishuo Ding and Yiming Tang and Mohammed Sayagh and Heng Li and Bram Adams and Weiyi Shang}, title = {IoPV: On Inconsistent Option Performance Variations}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {845--857}, doi = {10.1145/3611643.3616319}, year = {2023}, } Publisher's Version |
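The prediction model's shape is standard supervised learning over commit-test-option (CTO) features. A minimal stand-in with scikit-learn — the features and labels are synthetic; the paper's CTO feature engineering is the substantive part:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Invented CTO features: e.g., lines changed, options touched, test runtime.
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```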
|
Chen, Junjie |
ESEC/FSE '23: "Enhancing Coverage-Guided ..."
Enhancing Coverage-Guided Fuzzing via Phantom Program
Mingyuan Wu, Kunqiu Chen, Qi Luo, Jiahong Xiang, Ji Qi, Junjie Chen, Heming Cui, and Yuqun Zhang (Southern University of Science and Technology, China; University of Hong Kong, China; Tianjin University, China) For coverage-guided fuzzers, many adopted seeds are underused, exploring only limited program states, since essentially all executions have to abide by rigorous program dependencies while only a few seeds are capable of accessing those dependencies. Moreover, even when iteratively executing such limited seeds, the fuzzers have to repeatedly access the covered program states before uncovering new states. These facts indicate that the seeds' power to explore program states has not been sufficiently leveraged by existing coverage-guided fuzzing strategies. To tackle these issues, we propose a coverage-guided fuzzer, namely MirageFuzz, to mitigate program dependencies when executing seeds, enhancing their power to explore program states. Specifically, MirageFuzz first creates a “phantom” program of the target program by reducing its program dependencies corresponding to conditional statements while retaining their original semantics. Accordingly, MirageFuzz performs dual fuzzing, i.e., the source fuzzing to fuzz the original program and the phantom fuzzing to fuzz the phantom program simultaneously. Then, MirageFuzz applies the taint-based mutation mechanism to generate a new seed by updating the target conditional statement of a given seed from the source fuzzing with the corresponding condition value derived by the phantom fuzzing. To evaluate the effectiveness of MirageFuzz, we build a benchmark suite with 18 projects commonly adopted by recent fuzzing papers, and select seven open-source fuzzers as baselines for performance comparison with MirageFuzz. The experimental results suggest that MirageFuzz outperforms our baseline fuzzers by 13.42% to 77.96% on average. Furthermore, MirageFuzz exposes 29 previously unknown bugs, of which 4 have been confirmed and 3 fixed by the corresponding developers. (A toy sketch of the phantom-program construction appears after the entries below.) @InProceedings{ESEC/FSE23p1037, author = {Mingyuan Wu and Kunqiu Chen and Qi Luo and Jiahong Xiang and Ji Qi and Junjie Chen and Heming Cui and Yuqun Zhang}, title = {Enhancing Coverage-Guided Fuzzing via Phantom Program}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1037--1049}, doi = {10.1145/3611643.3616294}, year = {2023}, } Publisher's Version ESEC/FSE '23: "SJFuzz: Seed and Mutator Scheduling ..." SJFuzz: Seed and Mutator Scheduling for JVM Fuzzing Mingyuan Wu, Yicheng Ouyang, Minghai Lu, Junjie Chen, Yingquan Zhao, Heming Cui, Guowei Yang, and Yuqun Zhang (Southern University of Science and Technology, China; University of Hong Kong, China; Tianjin University, China; University of Queensland, Australia) While the Java Virtual Machine (JVM) plays a vital role in ensuring correct executions of Java applications, testing JVMs via generating and running class files on them can be rather challenging. The existing techniques, e.g., ClassFuzz and Classming, attempt to leverage the power of fuzzing and differential testing to cope with JVM intricacies by exposing discrepant execution results among different JVMs, i.e., inter-JVM discrepancies, for testing analytics. However, their adopted fuzzers are insufficiently guided since they include no well-designed seed and mutator scheduling mechanisms, leading to inefficient differential testing. 
To address such issues, in this paper, we propose SJFuzz, the first JVM fuzzing framework with seed and mutator scheduling mechanisms for automated JVM differential testing. Overall, SJFuzz aims to mutate class files via control flow mutators to facilitate the exposure of inter-JVM discrepancies. To this end, SJFuzz schedules seeds (class files) for mutations based on the discrepancy and diversity guidance. SJFuzz also schedules mutators for diversifying class file generation. To evaluate SJFuzz, we conduct an extensive study on multiple representative real-world JVMs, and the experimental results show that SJFuzz significantly outperforms the state-of-the-art mutation-based and generation-based JVM fuzzers in terms of the inter-JVM discrepancy exposure and the class file diversity. Moreover, SJFuzz reported 46 potential JVM issues, of which 20 have been confirmed as bugs and 16 have been fixed by the JVM developers. @InProceedings{ESEC/FSE23p1062, author = {Mingyuan Wu and Yicheng Ouyang and Minghai Lu and Junjie Chen and Yingquan Zhao and Heming Cui and Guowei Yang and Yuqun Zhang}, title = {SJFuzz: Seed and Mutator Scheduling for JVM Fuzzing}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1062--1074}, doi = {10.1145/3611643.3616277}, year = {2023}, } Publisher's Version ESEC/FSE '23: "Statfier: Automated Testing ..." Statfier: Automated Testing of Static Analyzers via Semantic-Preserving Program Transformations Huaien Zhang, Yu Pei, Junjie Chen, and Shin Hwei Tan (Southern University of Science and Technology, China; Hong Kong Polytechnic University, China; Tianjin University, China; Concordia University, Canada) Static analyzers reason about the behaviors of programs without executing them and report issues when they violate pre-defined desirable properties. One of the key limitations of static analyzers is their tendency to produce inaccurate and incomplete analysis results, i.e., they often generate too many spurious warnings and miss important issues. To help enhance the reliability of a static analyzer, developers usually manually write tests involving input programs and the corresponding expected analysis results for the analyzers. Meanwhile, a static analyzer often includes example programs in its documentation to demonstrate the desirable properties and/or their violations. Our key insight is that we can reuse programs extracted either from the official test suite or documentation and apply semantic-preserving transformations to them to generate variants. We studied the quality of input programs from these two sources and found that most rules in static analyzers are covered by at least one input program, implying the potential of using these programs as the basis for test generation. We present Statfier, a heuristic-based automated testing approach for static analyzers that generates program variants via semantic-preserving transformations and detects inconsistencies between the original program and variants (which indicate inaccurate analysis results in the static analyzer). To select variants that are more likely to reveal new bugs, Statfier uses two key heuristics: (1) analysis report guided location selection that uses program locations in the reports produced by static analyzers to perform transformations and (2) structure diversity driven variant selection that chooses variants with different program contexts and diverse types of transformations.
Our experiments with five popular static analyzers show that Statfier can find 79 bugs in these analyzers, of which 46 have been confirmed. @InProceedings{ESEC/FSE23p237, author = {Huaien Zhang and Yu Pei and Junjie Chen and Shin Hwei Tan}, title = {Statfier: Automated Testing of Static Analyzers via Semantic-Preserving Program Transformations}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {237--249}, doi = {10.1145/3611643.3616272}, year = {2023}, } Publisher's Version |
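To make Statfier's metamorphic loop concrete, here is a minimal, self-contained sketch in Python: apply a semantic-preserving transformation to a seed program and flag the analyzer when its reports change. The toy analyzer and the rename transformation are illustrative stand-ins, not Statfier's actual components.

```python
# A self-contained sketch of Statfier's core loop: apply a semantic-preserving
# transformation to a seed program and flag the analyzer as potentially buggy
# when its reports differ between seed and variant.
import re

def toy_analyzer(src: str) -> set[str]:
    # Toy rule: flag lines that compare against a string literal with `==`.
    return {f"string-eq@{i}" for i, line in enumerate(src.splitlines())
            if re.search(r'"\s*==|==\s*"', line)}

def rename_variable(src: str, old: str, new: str) -> str:
    # Semantic-preserving transformation: consistently rename an identifier.
    return re.sub(rf"\b{old}\b", new, src)

seed = 'flag = (name == "admin")\nprint(flag)'
variant = rename_variable(seed, "flag", "is_admin")

if toy_analyzer(seed) != toy_analyzer(variant):
    print("inconsistency: potential analyzer bug")
else:
    print("reports agree on seed and variant")
```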
|
Chen, Kunqiu |
ESEC/FSE '23: "Enhancing Coverage-Guided ..."
Enhancing Coverage-Guided Fuzzing via Phantom Program
Mingyuan Wu, Kunqiu Chen, Qi Luo, Jiahong Xiang, Ji Qi, Junjie Chen, Heming Cui, and Yuqun Zhang (Southern University of Science and Technology, China; University of Hong Kong, China; Tianjin University, China) For coverage-guided fuzzers, many adopted seeds are underused, exploring only limited program states, since essentially all executions have to abide by rigorous program dependencies that only a few seeds can satisfy. Moreover, even when iteratively executing such limited seeds, the fuzzers have to repeatedly access the covered program states before uncovering new states. Such facts indicate that the seeds' power to explore program states has not been sufficiently leveraged by existing coverage-guided fuzzing strategies. To tackle these issues, we propose a coverage-guided fuzzer, namely MirageFuzz, to mitigate program dependencies when executing seeds and thereby enhance their power to explore program states. Specifically, MirageFuzz first creates a “phantom” program of the target program by reducing its program dependencies corresponding to conditional statements while retaining their original semantics. Accordingly, MirageFuzz performs dual fuzzing, i.e., source fuzzing to fuzz the original program and phantom fuzzing to fuzz the phantom program simultaneously. Then, MirageFuzz applies a taint-based mutation mechanism to generate a new seed by updating the target conditional statement of a given seed from the source fuzzing with the corresponding condition value derived by the phantom fuzzing. To evaluate the effectiveness of MirageFuzz, we build a benchmark suite with 18 projects commonly adopted by recent fuzzing papers, and select seven open-source fuzzers as baselines for performance comparison with MirageFuzz. The experimental results suggest that MirageFuzz outperforms the baseline fuzzers by 13.42% to 77.96% on average. Furthermore, MirageFuzz exposes 29 previously unknown bugs, of which 4 have been confirmed and 3 fixed by the corresponding developers. @InProceedings{ESEC/FSE23p1037, author = {Mingyuan Wu and Kunqiu Chen and Qi Luo and Jiahong Xiang and Ji Qi and Junjie Chen and Heming Cui and Yuqun Zhang}, title = {Enhancing Coverage-Guided Fuzzing via Phantom Program}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1037--1049}, doi = {10.1145/3611643.3616294}, year = {2023}, } Publisher's Version |
|
Chen, Lingchao |
ESEC/FSE '23-INDUSTRY: "Last Diff Analyzer: Multi-language ..."
Last Diff Analyzer: Multi-language Automated Approver for Behavior-Preserving Code Revisions
Yuxin Wang, Adam Welc, Lazaro Clapp, and Lingchao Chen (Uber Technologies, USA; Mysten Labs, USA) Code review is a crucial step in ensuring the quality and maintainability of software systems. However, this process can be time-consuming and resource-intensive, especially in large-scale projects where a significant number of code changes are submitted every day. Fortunately, not all code changes require human reviews, as some may only contain syntactic modifications that do not alter the behavior of the system, such as format changes, variable / function renamings, and constant extractions. In this paper, we propose Last Diff Analyzer, a multi-language automated code approver for Go and Java that detects whether a reviewable incremental unit of code change (diff) contains only changes that do not modify system behavior. It is built on top of a novel multi-language static analysis framework that unifies common features of multiple languages while keeping unique language constructs separate. This makes it easy to extend to other languages such as TypeScript, Kotlin, Swift, and others. Besides skipping unnecessary code reviews, Last Diff Analyzer can be further applied to skip certain resource-intensive end-to-end (E2E) tests for auto-approved diffs, yielding a significant reduction in resource usage. We have deployed the analyzer at scale within Uber, and data collected in production shows that approximately 15% of analyzed diffs are auto-approved weekly for code reviews. Furthermore, a 13.5% reduction in server node usage dedicated to E2E tests (measured by the number of executed E2E tests) is observed as a result of skipping E2E tests, compared to the node usage if Last Diff Analyzer were not enabled. @InProceedings{ESEC/FSE23p1693, author = {Yuxin Wang and Adam Welc and Lazaro Clapp and Lingchao Chen}, title = {Last Diff Analyzer: Multi-language Automated Approver for Behavior-Preserving Code Revisions}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1693--1704}, doi = {10.1145/3611643.3613870}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
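As a rough single-language illustration of the behavior-preserving check Last Diff Analyzer performs, the sketch below compares normalized ASTs of two revisions using Python's ast module; matching dumps mean the diff is format-only. This is a toy stand-in, not Uber's multi-language framework.

```python
# Sketch: two revisions whose ASTs match after normalizing away formatting
# and comments carry no behavioral change and could be auto-approved.
import ast

def normalized(src: str) -> str:
    # ast.dump ignores whitespace, comments, and redundant parentheses.
    return ast.dump(ast.parse(src), annotate_fields=False)

before = "def area(w, h):\n    return w * h  # width times height\n"
after  = "def area(w, h):\n    return (w * h)\n"   # format-only change

print(normalized(before) == normalized(after))  # True: safe to auto-approve
```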
|
Chen, Muhao |
ESEC/FSE '23: "Knowledge-Based Version Incompatibility ..."
Knowledge-Based Version Incompatibility Detection for Deep Learning
Zhongkai Zhao, Bonan Kou, Mohamed Yilmaz Ibrahim, Muhao Chen, and Tianyi Zhang (Tongji University, China; Purdue University, USA; University of California at Davis, USA) Version incompatibility issues are rampant when reusing or reproducing deep learning models and applications. Existing techniques are limited to library dependency specifications declared in PyPI. Therefore, these techniques cannot detect version issues due to undocumented version constraints or issues involving hardware drivers or OS. To address this challenge, we propose to leverage the abundant discussions of DL version issues from Stack Overflow to facilitate version incompatibility detection. We reformulate the problem of knowledge extraction as a Question-Answering (QA) problem and use a pre-trained QA model to extract version compatibility knowledge from online discussions. The extracted knowledge is further consolidated into a weighted knowledge graph to detect potential version incompatibilities when reusing a DL project. Our evaluation results show that (1) our approach can extract version knowledge with 84% accuracy, and (2) our approach can accurately identify 65% of known version issues in 10 popular DL projects with a high precision (92%), while two state-of-the-art approaches can only detect 29% and 6% of these issues with 33% and 17% precision, respectively. @InProceedings{ESEC/FSE23p708, author = {Zhongkai Zhao and Bonan Kou and Mohamed Yilmaz Ibrahim and Muhao Chen and Tianyi Zhang}, title = {Knowledge-Based Version Incompatibility Detection for Deep Learning}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {708--719}, doi = {10.1145/3611643.3616364}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
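A minimal sketch of how a weighted compatibility knowledge graph of the kind described above might be consulted; the edges, weights, and threshold below are invented placeholders, not knowledge actually mined from Stack Overflow.

```python
# Sketch: look up dependency pairs in a weighted compatibility graph and
# report pairs whose compatibility confidence falls below a threshold.
COMPATIBLE = {
    # (component_a, component_b) -> confidence that the pair is compatible
    ("tensorflow==2.4", "cuda==11.0"): 0.92,
    ("tensorflow==2.4", "cuda==10.0"): 0.08,
}

def check(project_deps: list[tuple[str, str]], threshold: float = 0.5) -> None:
    for a, b in project_deps:
        weight = COMPATIBLE.get((a, b), 0.0)
        if weight < threshold:
            print(f"potential incompatibility: {a} with {b} (w={weight})")

check([("tensorflow==2.4", "cuda==10.0")])
```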
|
Chen, Ningjiang |
ESEC/FSE '23: "Log Parsing with Generalization ..."
Log Parsing with Generalization Ability under New Log Types
Siyu Yu, Yifan Wu, Zhijing Li, Pinjia He, Ningjiang Chen, and Changjian Liu (Guangxi University, China; Peking University, China; Chinese University of Hong Kong, China) Log parsing, which converts semi-structured logs into structured logs, is the first step for automated log analysis. Existing parsers are still unsatisfactory in real-world systems due to new log types in newly arriving logs. In practice, available logs collected during system runtime often do not contain all the possible log types of a system because log types related to infrequently activated system states are unlikely to be recorded and new log types are frequently introduced with system updates. Meanwhile, most existing parsers require preprocessing to extract variables in advance, but preprocessing is based on the operator’s prior knowledge of available logs and therefore may not work well on new log types. In addition, parser parameters set based on available logs are difficult to generalize to new log types. To support new log types, we propose a variable generation imitation strategy to craft a novel log parsing approach with generalization ability, called Log3T. Log3T employs a pre-trained transformer encoder-based model to extract log templates and can update parameters at parsing time to adapt to new log types by a modified test-time training. Experimental results on 16 benchmark datasets show that Log3T outperforms the state-of-the-art parsers in terms of parsing accuracy. In addition, Log3T can automatically adapt to new log types in newly arriving logs. @InProceedings{ESEC/FSE23p425, author = {Siyu Yu and Yifan Wu and Zhijing Li and Pinjia He and Ningjiang Chen and Changjian Liu}, title = {Log Parsing with Generalization Ability under New Log Types}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {425--437}, doi = {10.1145/3611643.3616355}, year = {2023}, } Publisher's Version |
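The sketch below illustrates the token-level framing behind template extraction: classify each token as template text or variable and mask the variables. The regex classifier is a crude stand-in for Log3T's pre-trained transformer encoder.

```python
# Sketch: token-by-token classification of a raw log line into template
# text and variables, producing the log template.
import re

def classify(token: str) -> str:
    # Crude variable detector: numbers, hex values, and path-like tokens.
    if re.fullmatch(r"\d+(\.\d+)*|0x[0-9a-f]+|/\S*", token):
        return "<*>"            # variable slot
    return token                # template text

log = "Received block blk_3587 of size 67108864 from /10.251.42.84"
template = " ".join(classify(t) for t in log.split())
print(template)  # Received block blk_3587 of size <*> from <*>
```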
|
Chen, Pengfei |
ESEC/FSE '23: "Nezha: Interpretable Fine-Grained ..."
Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data
Guangba Yu, Pengfei Chen, Yufeng Li, Hongyang Chen, Xiaoyun Li, and Zibin Zheng (Sun Yat-sen University, China) Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging task. To understand and localize root causes of unexpected faults, modern observability tools collect and preserve multi-modal observability data, including metrics, traces, and logs. Since system faults may manifest as anomalies in different data sources, existing RCA approaches that rely on single-modal data are constrained in the granularity and interpretability of root causes. In this study, we present Nezha, an interpretable and fine-grained RCA approach that pinpoints root causes at the code region and resource type level by incorporative analysis of multi-modal data. Nezha transforms heterogeneous multi-modal data into a homogeneous event representation and extracts event patterns by constructing and mining event graphs. The core idea of Nezha is to compare event patterns in the fault-free phase with those in the fault-suffering phase to localize root causes in an interpretable way. Practical implementation and experimental evaluations on two microservice applications show that Nezha achieves a high top1 accuracy (89.77%) on average at the code region and resource type level and outperforms state-of-the-art approaches by a large margin. Two ablation studies further confirm the contributions of incorporating multi-modal data. @InProceedings{ESEC/FSE23p553, author = {Guangba Yu and Pengfei Chen and Yufeng Li and Hongyang Chen and Xiaoyun Li and Zibin Zheng}, title = {Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {553--565}, doi = {10.1145/3611643.3616249}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available ESEC/FSE '23: "DiagConfig: Configuration ..." DiagConfig: Configuration Diagnosis of Performance Violations in Configurable Software Systems Zhiming Chen, Pengfei Chen, Peipei Wang, Guangba Yu, Zilong He, and Genting Mai (Sun Yat-sen University, China; ByteDance Infrastructure System Lab, USA) Performance degradation due to misconfiguration in software systems that violates SLOs (service-level objectives) is commonplace. Diagnosing and explaining the root causes of such performance violations in configurable software systems is often challenging due to their increasing complexity. Although there are many tools and techniques for diagnosing performance violations, they provide limited evidence to attribute causes of observed performance violations to specific configurations. This is because the configuration is not originally considered in those tools. This paper proposes DiagConfig, specifically designed to conduct configuration diagnosis of performance violations. It leverages static code analysis to track configuration option propagation, identifies performance-sensitive options, detects performance violations, and constructs cause-effect chains that help stakeholders better understand the relationship between configuration and performance violations. 
Experimental evaluations with eight real-world software systems demonstrate that DiagConfig produces fewer false positives than a state-of-the-art documentation analysis-based tool (i.e., 5 vs 41) in the identification of performance-sensitive options, and outperforms a statistics-based debugging tool in the diagnosis of performance violations caused by configuration changes, offering more comprehensive results (recall: 0.892 vs 0.289). Moreover, we show that DiagConfig can accelerate auto-tuning by compressing configuration space. @InProceedings{ESEC/FSE23p566, author = {Zhiming Chen and Pengfei Chen and Peipei Wang and Guangba Yu and Zilong He and Genting Mai}, title = {DiagConfig: Configuration Diagnosis of Performance Violations in Configurable Software Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {566--578}, doi = {10.1145/3611643.3616300}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
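To illustrate Nezha's fault-free versus fault-suffering comparison described above, this hypothetical sketch counts simple event patterns (adjacent event pairs) in both phases and ranks the patterns whose frequency shifts most; the event streams are made up for illustration.

```python
# Sketch: mine event patterns in a fault-free and a fault-suffering phase,
# then rank patterns by frequency shift as candidate root-cause indicators.
from collections import Counter

def patterns(events: list[str]) -> Counter:
    return Counter(zip(events, events[1:]))  # adjacent event pairs

fault_free = ["recv", "query_db", "reply"] * 50
faulty     = ["recv", "query_db", "timeout", "retry"] * 50

free_p, fault_p = patterns(fault_free), patterns(faulty)
shifts = {p: fault_p[p] - free_p.get(p, 0) for p in fault_p}
for pattern, delta in sorted(shifts.items(), key=lambda kv: -kv[1])[:3]:
    print(pattern, delta)   # patterns newly frequent in the faulty phase
```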
|
Chen, Qingjun |
ESEC/FSE '23-INDUSTRY: "TraceDiag: Adaptive, Interpretable, ..."
TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems
Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Xiaomin Wu, Meng Zhang, Qingjun Chen, Xin Gao, Xuedong Gao, Hao Fan, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang (Microsoft, China; Microsoft 365, China; Microsoft 365, USA) Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. However, performing RCA on modern microservice systems can be challenging due to their large scale, as they usually comprise hundreds of components, leading to significant human effort. This paper proposes TraceDiag, an end-to-end RCA framework that addresses the challenges for large-scale microservice systems. It leverages reinforcement learning to learn a pruning policy for the service dependency graph to automatically eliminate redundant components, thereby significantly improving the RCA efficiency. The learned pruning policy is interpretable and fully adaptive to new RCA instances. With the pruned graph, a causal-based method can be executed with high accuracy and efficiency. The proposed TraceDiag framework is evaluated on real data traces collected from the Microsoft Exchange system, and demonstrates superior performance compared to state-of-the-art RCA approaches. Notably, TraceDiag has been integrated as a critical component in the Microsoft M365 Exchange, resulting in a significant improvement in the system's reliability and a considerable reduction in the human effort required for RCA. @InProceedings{ESEC/FSE23p1762, author = {Ruomeng Ding and Chaoyun Zhang and Lu Wang and Yong Xu and Minghua Ma and Xiaomin Wu and Meng Zhang and Qingjun Chen and Xin Gao and Xuedong Gao and Hao Fan and Saravan Rajmohan and Qingwei Lin and Dongmei Zhang}, title = {TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1762--1773}, doi = {10.1145/3611643.3613864}, year = {2023}, } Publisher's Version
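A very rough sketch of the pruning step TraceDiag learns: score dependency-graph components and drop low-scoring ones so the downstream causal analysis runs on a much smaller graph. The linear scoring policy and feature values below are illustrative assumptions, not the learned reinforcement-learning policy.

```python
# Sketch: prune a service dependency graph by a scoring policy before RCA.
deps = {  # service -> (anomaly_score, latency_correlation); made-up values
    "frontend": (0.1, 0.2), "cart": (0.05, 0.1),
    "checkout": (0.8, 0.9), "payments": (0.7, 0.95),
}

def keep(features, weights=(0.5, 0.5), threshold=0.4) -> bool:
    # Linear policy standing in for the learned, interpretable pruning rule.
    return sum(f * w for f, w in zip(features, weights)) >= threshold

pruned = {service for service, feats in deps.items() if keep(feats)}
print(pruned)  # causal analysis now runs only on {'checkout', 'payments'}
```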
|
Chen, Ruiling |
ESEC/FSE '23-DEMO: "llvm2CryptoLine: Verifying ..."
llvm2CryptoLine: Verifying Arithmetic in Cryptographic C Programs
Ruiling Chen, Jiaxiang Liu, Xiaomu Shi, Ming-Hsien Tsai, Bow-Yaw Wang, and Bo-Yin Yang (Shenzhen University, China; Institute of Software at Chinese Academy of Sciences, China; National Institute of Cyber Security, Taiwan; Academia Sinica, Taiwan) Correct implementations of cryptographic primitives are essential for modern security. These implementations often contain arithmetic operations involving non-linear computations that are infamously hard to verify. We present llvm2CryptoLine, an automated formal verification tool for arithmetic operations in cryptographic C programs. llvm2CryptoLine successfully verifies 51 arithmetic C programs from industrial cryptographic libraries OpenSSL, wolfSSL and NaCl. Most of the programs are verified fully automatically and efficiently. A screencast that showcases llvm2CryptoLine can be found at https://youtu.be/QXuSmja45VA. Source code is available at https://github.com/fmlab-iis/llvm2cryptoline. @InProceedings{ESEC/FSE23p2167, author = {Ruiling Chen and Jiaxiang Liu and Xiaomu Shi and Ming-Hsien Tsai and Bow-Yaw Wang and Bo-Yin Yang}, title = {llvm2CryptoLine: Verifying Arithmetic in Cryptographic C Programs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2167--2171}, doi = {10.1145/3611643.3613096}, year = {2023}, } Publisher's Version Video Info |
|
Chen, Sen |
ESEC/FSE '23: "Comparison and Evaluation ..."
Comparison and Evaluation on Static Application Security Testing (SAST) Tools for Java
Kaixuan Li, Sen Chen, Lingling Fan, Ruitao Feng, Han Liu, Chengwei Liu, Yang Liu, and Yixiang Chen (East China Normal University, China; Tianjin University, China; Nankai University, China; UNSW, Australia; Nanyang Technological University, Singapore) Static application security testing (SAST) plays a significant role in the software development life cycle (SDLC). However, it is challenging to comprehensively evaluate the effectiveness of SAST tools to determine which is better at detecting vulnerabilities. In this paper, based on well-defined criteria, we first selected seven free or open-source SAST tools from 161 existing tools for further evaluation. Using synthetic and newly constructed real-world benchmarks, we evaluated and compared these SAST tools from different and comprehensive perspectives such as effectiveness, consistency, and performance. While SAST tools perform well on synthetic benchmarks, our results indicate that only 12.7% of real-world vulnerabilities can be detected by the selected tools. Even combining the detection capability of all tools, most vulnerabilities (70.9%) remain undetected, especially those beyond resource control and insufficiently neutralized input/output vulnerabilities. Although the tools already include the corresponding detection rules, the detection results still fall short of expectations. The findings unveiled in our comprehensive study provide guidance on tool development, improvement, evaluation, and selection for developers, researchers, and potential users. @InProceedings{ESEC/FSE23p921, author = {Kaixuan Li and Sen Chen and Lingling Fan and Ruitao Feng and Han Liu and Chengwei Liu and Yang Liu and Yixiang Chen}, title = {Comparison and Evaluation on Static Application Security Testing (SAST) Tools for Java}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {921--933}, doi = {10.1145/3611643.3616262}, year = {2023}, } Publisher's Version ESEC/FSE '23: "Software Composition Analysis ..." Software Composition Analysis for Vulnerability Detection: An Empirical Study on Java Projects Lida Zhao, Sen Chen, Zhengzi Xu, Chengwei Liu, Lyuye Zhang, Jiahui Wu, Jun Sun, and Yang Liu (Singapore Management University, Singapore; Tianjin University, China; Nanyang Technological University, Singapore) Software composition analysis (SCA) tools are proposed to detect potential vulnerabilities introduced by open-source software (OSS) imported as third-party libraries (TPL). With the increasing complexity of software functionality, SCA tools may encounter various scenarios during the dependency resolution process, such as diverse formats of artifacts, diverse dependency imports, and diverse dependency specifications. However, a comprehensive evaluation of SCA tools for Java that takes the above scenarios into account is still lacking. This could lead to a confined interpretation of comparisons, improper use of tools, and hinder further improvement of the tools. To fill this gap, we proposed an Evaluation Model which consists of Scan Modes, Scan Methods, and SCA Scope for Maven (SSM), for comprehensive assessments of the dependency resolving capabilities and effectiveness of SCA tools. Based on the Evaluation Model, we first qualitatively examined 6 SCA tools’ capabilities.
Next, the accuracy of dependency and vulnerability detection is quantitatively evaluated with a large-scale dataset (21,130 Maven modules with 73,499 unique dependencies) under two Scan Modes (i.e., build scan and pre-build scan). The results show that most tools do not fully support SSM, which leads to compromised accuracy. For dependency detection, the average F1-score is 0.890 and 0.692 for build and pre-build scans, respectively, and for vulnerability accuracy, the average F1-score is 0.475. However, proper support for SSM reduces dependency detection false positives by 34.24% and false negatives by 6.91%. This further leads to a reduction of 18.28% in false positives and 8.72% in false negatives in vulnerability reports. @InProceedings{ESEC/FSE23p960, author = {Lida Zhao and Sen Chen and Zhengzi Xu and Chengwei Liu and Lyuye Zhang and Jiahui Wu and Jun Sun and Yang Liu}, title = {Software Composition Analysis for Vulnerability Detection: An Empirical Study on Java Projects}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {960--972}, doi = {10.1145/3611643.3616299}, year = {2023}, } Publisher's Version ESEC/FSE '23: "Automated and Context-Aware ..." Automated and Context-Aware Repair of Color-Related Accessibility Issues for Android Apps Yuxin Zhang, Sen Chen, Lingling Fan, Chunyang Chen, and Xiaohong Li (Tianjin University, China; Nankai University, China; Monash University, Australia) Approximately 15% of the world's population suffers from various disabilities or impairments. However, many mobile UX designers and developers disregard the significance of accessibility for those with disabilities when developing apps. Remarkably, one in seven people might not have the same level of access as other users, which violates many legal and regulatory standards. Conversely, apps developed with accessibility in mind drastically improve the user experience for all users as well as maximize revenue. Thus, many studies have been conducted and effective detection tools proposed to mitigate such a severe problem. However, repair work clearly lags behind detection, especially for color-related accessibility issues, which are among the top issues in apps and have a strongly negative impact on vision and user experience. Apps with such issues are difficult to use for people with low vision and the elderly. Unfortunately, such an issue type cannot be directly fixed by existing repair techniques. To this end, we propose Iris, an automated and context-aware repair method that fixes color-related accessibility issues (i.e., text contrast issues and image contrast issues) for apps. By leveraging a novel context-aware technique that resolves the optimal colors and a vital phase of attribute-to-repair localization, Iris not only repairs the color contrast issues but also guarantees the consistency of the design style between the original UI page and the repaired UI page. Our experiments unveiled that Iris achieves a 91.38% repair success rate with high effectiveness and efficiency. The usefulness of Iris has also been evaluated by a user study with a high satisfaction rate as well as developers' positive feedback. 9 of 40 submitted pull requests on GitHub repositories have been accepted and merged into the projects by app developers, and another 4 developers are actively discussing further repairs with us. Iris is publicly available to facilitate this new research direction.
@InProceedings{ESEC/FSE23p1255, author = {Yuxin Zhang and Sen Chen and Lingling Fan and Chunyang Chen and Xiaohong Li}, title = {Automated and Context-Aware Repair of Color-Related Accessibility Issues for Android Apps}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1255--1267}, doi = {10.1145/3611643.3616329}, year = {2023}, } Publisher's Version Info |
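The text-contrast issues Iris repairs can be checked with the standard WCAG relative-luminance and contrast-ratio formulas, shown here as a worked example; the colors are illustrative.

```python
# Worked example: WCAG contrast ratio between a text and background color.
def luminance(rgb: tuple[int, int, int]) -> float:
    def lin(c: float) -> float:
        # Linearize an sRGB channel per the WCAG definition.
        c /= 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = map(lin, rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast(fg, bg) -> float:
    lighter, darker = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

text, background = (119, 119, 119), (255, 255, 255)  # grey text on white
ratio = contrast(text, background)
# WCAG AA requires at least 4.5:1 for normal text; #777 on white just fails.
print(f"{ratio:.2f}:1 ->", "OK" if ratio >= 4.5 else "fix needed")
```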
|
Chen, Sophie |
ESEC/FSE '23-IVR: "Reflecting on the Use of the ..."
Reflecting on the Use of the Policy-Process-Product Theory in Empirical Software Engineering
Kelechi G. Kalu, Taylor R. Schorlemmer, Sophie Chen, Kyle A. Robinson, Erik Kocinare, and James C. Davis (Purdue University, USA; University of Michigan, USA) The primary theory of software engineering is that an organization’s Policies and Processes influence the quality of its Products. We call this the PPP Theory. Although empirical software engineering research has grown common, it is unclear whether researchers are trying to evaluate the PPP Theory. To assess this, we analyzed half (33) of the empirical works published over the last two years in three prominent software engineering conferences. In this sample, 70% focus on policies/processes or products, not both. Only 33% provided measurements relating policy/process and products. We make four recommendations: (1) Use PPP Theory in study design; (2) Study feedback relationships; (3) Diversify the studied feed-forward relationships; and (4) Disentangle policy and process. Let us remember that research results are in the context of, and with respect to, the relationship between software products, processes, and policies. @InProceedings{ESEC/FSE23p2112, author = {Kelechi G. Kalu and Taylor R. Schorlemmer and Sophie Chen and Kyle A. Robinson and Erik Kocinare and James C. Davis}, title = {Reflecting on the Use of the Policy-Process-Product Theory in Empirical Software Engineering}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2112--2116}, doi = {10.1145/3611643.3613075}, year = {2023}, } Publisher's Version |
|
Chen, Tao |
ESEC/FSE '23: "Predicting Software Performance ..."
Predicting Software Performance with Divide-and-Learn
Jingzhi Gong and Tao Chen (University of Electronic Science and Technology of China, China; Loughborough University, UK; University of Birmingham, UK) Predicting the performance of highly configurable software systems is the foundation for performance testing and quality assurance. To that end, recent work has been relying on machine/deep learning to model software performance. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose an approach based on the concept of “divide-and-learn”, dubbed DaL. The basic idea is that, to handle sample sparsity, we divide the samples from the configuration landscape into distant divisions, for each of which we build a regularized Deep Neural Network as the local model to deal with the feature sparsity. A newly given configuration would then be assigned to the right model of division for the final prediction. Experimental results from eight real-world systems and five sets of training data reveal that, compared with the state-of-the-art approaches, DaL performs no worse than the best counterpart on 33 out of 40 cases (within which 26 cases are significantly better) with up to 1.94× improvement on accuracy; requires fewer samples to reach the same/better accuracy; and produces acceptable training overhead. Practically, DaL also considerably improves different global models when using them as the underlying local models, which further strengthens its flexibility. To promote open science, all the data, code, and supplementary figures of this work can be accessed at our repository: https://github.com/ideas-labo/DaL. @InProceedings{ESEC/FSE23p858, author = {Jingzhi Gong and Tao Chen}, title = {Predicting Software Performance with Divide-and-Learn}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {858--870}, doi = {10.1145/3611643.3616334}, year = {2023}, } Publisher's Version Info |
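A minimal sketch of the divide-and-learn idea under simplifying assumptions: cluster configuration samples into divisions, fit one local model per division, and route a new configuration to its division's model. Linear regressors stand in for DaL's regularized deep networks, and the tiny synthetic dataset is illustrative only.

```python
# Sketch: divide samples into divisions, learn a local model per division,
# and predict a new configuration with its division's model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4)).astype(float)     # binary config options
y = np.where(X[:, 0] == 1, 5 * X[:, 1], 0.5 * X[:, 2])  # sparse, branchy perf

divisions = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
models = [LinearRegression().fit(X[divisions.labels_ == k],
                                 y[divisions.labels_ == k]) for k in range(2)]

new_config = np.array([[1.0, 1.0, 0.0, 1.0]])
k = divisions.predict(new_config)[0]
print(models[k].predict(new_config))  # prediction from the local model
```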
|
Chen, Ting |
ESEC/FSE '23: "DeepInfer: Deep Type Inference ..."
DeepInfer: Deep Type Inference from Smart Contract Bytecode
Kunsong Zhao, Zihao Li, Jianfeng Li, He Ye, Xiapu Luo, and Ting Chen (Hong Kong Polytechnic University, China; Xi’an Jiaotong University, China; KTH Royal Institute of Technology, Sweden; University of Electronic Science and Technology of China, China) Smart contracts play an increasingly important role in the Ethereum platform. They provide various functions implementing numerous services, whose bytecode runs on the Ethereum Virtual Machine. To use services by invoking corresponding functions, the callers need to know the function signatures. Moreover, such signatures provide crucial information for many downstream applications, e.g., identifying smart contracts, fuzzing, detecting vulnerabilities, etc. However, it is challenging to infer function signatures from the bytecode due to a lack of type information. Existing work on this problem depends heavily on limited databases or hard-coded heuristic patterns. However, these approaches are hard to adapt to semantic differences in distinct languages and various compiler versions used when developing smart contracts. In this paper, we propose DeepInfer, the first framework that leverages deep learning techniques to automatically infer function signatures and returns. The novelties of DeepInfer are: 1) DeepInfer lifts the bytecode into the Intermediate Representation (IR) to preserve code semantics; 2) DeepInfer extracts the type-related knowledge (e.g., critical data flows, constant values, and control flow graphs) from the IR to recover function signatures and returns. We conduct experiments on Solidity and Vyper smart contracts and the results show that DeepInfer performs faster and more accurately than existing tools, while being immune to changes in different languages and various compiler versions. @InProceedings{ESEC/FSE23p745, author = {Kunsong Zhao and Zihao Li and Jianfeng Li and He Ye and Xiapu Luo and Ting Chen}, title = {DeepInfer: Deep Type Inference from Smart Contract Bytecode}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {745--757}, doi = {10.1145/3611643.3616343}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Chen, Xiao |
ESEC/FSE '23-DEMO: "LazyCow: A Lightweight Crowdsourced ..."
LazyCow: A Lightweight Crowdsourced Testing Tool for Taming Android Fragmentation
Xiaoyu Sun, Xiao Chen, Yonghui Liu, John Grundy, and Li Li (Australian National University, Australia; Monash University, Australia; Beihang University, China) Android fragmentation refers to the increasing variety of Android devices and operating system versions. Their sheer number makes it impossible to test an app on every supported device, resulting in many device compatibility issues and leading to poor user experiences. To mitigate this, a number of works that automatically detect compatibility issues have been proposed. However, current state-of-the-art techniques can only be used to detect specific types of compatibility issues (i.e., compatibility issues caused by API signature evolution), meaning many other essential categories of compatibility issues remain unknown. For instance, customised OS versions on real devices and semantic OS modifications could result in severe compatibility issues that are difficult to detect statically. In order to address this research gap and facilitate the prospect of taming Android fragmentation through crowdsourced efforts, we propose LazyCow, a novel, lightweight, crowdsourced testing tool. Our experimental results involving thousands of test cases on real Android devices demonstrate that LazyCow is effective at autonomously identifying and validating API-induced compatibility issues. The source code of both the client side and the server side is publicly available in our artifact package. A demo video of our tool is available at https://www.youtube.com/watch?v=_xzWv_mo5xQ. @InProceedings{ESEC/FSE23p2127, author = {Xiaoyu Sun and Xiao Chen and Yonghui Liu and John Grundy and Li Li}, title = {LazyCow: A Lightweight Crowdsourced Testing Tool for Taming Android Fragmentation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2127--2131}, doi = {10.1145/3611643.3613098}, year = {2023}, } Publisher's Version ESEC/FSE '23: "Understanding the Bug Characteristics ..." Understanding the Bug Characteristics and Fix Strategies of Federated Learning Systems Xiaohu Du, Xiao Chen, Jialun Cao, Ming Wen, Shing-Chi Cheung, and Hai Jin (Huazhong University of Science and Technology, China; Hong Kong University of Science and Technology, China) Federated learning (FL) is an emerging machine learning paradigm that aims to address the problem of isolated data islands. To preserve privacy, FL allows machine learning models and deep neural networks to be trained from decentralized data kept privately at individual devices. FL has been increasingly adopted in mission-critical fields such as finance and healthcare. However, bugs in FL systems are inevitable and may result in catastrophic consequences such as financial loss, inappropriate medical decisions, and violations of data privacy ordinances. While many recent studies have been conducted to understand the bugs in machine learning systems, no existing study characterizes the bugs arising from the unique nature of FL systems. To fill the gap, we collected 395 real bugs from six popular FL frameworks (Tensorflow Federated, PySyft, FATE, Flower, PaddleFL, and Fedlearner) on GitHub and StackOverflow, and then manually analyzed their symptoms and impacts, prone stages, root causes, and fix strategies. Furthermore, we report a series of findings and actionable implications that can potentially facilitate the detection of FL bugs.
@InProceedings{ESEC/FSE23p1358, author = {Xiaohu Du and Xiao Chen and Jialun Cao and Ming Wen and Shing-Chi Cheung and Hai Jin}, title = {Understanding the Bug Characteristics and Fix Strategies of Federated Learning Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1358--1370}, doi = {10.1145/3611643.3616347}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Chen, Yang |
ESEC/FSE '23-INDUSTRY: "xASTNN: Improved Code Representations ..."
xASTNN: Improved Code Representations for Industrial Practice
Zhiwei Xu, Min Zhou, Xibin Zhao, Yang Chen, Xi Cheng, and Hongyu Zhang (Tsinghua University, China; Fudan University, China; VMware, China; Chongqing University, China) The application of deep learning techniques in software engineering is becoming increasingly popular. One key problem is developing high-quality and easy-to-use source code representations for code-related tasks. The research community has achieved impressive results in recent years. However, due to deployment difficulties and performance bottlenecks, these approaches are seldom applied in industry. In this paper, we present xASTNN, an eXtreme Abstract Syntax Tree (AST)-based Neural Network for source code representation, aiming to push this technique to industrial practice. The proposed xASTNN has three advantages. First, xASTNN is completely based on widely-used ASTs and does not require complicated data pre-processing, making it applicable to various programming languages and practical scenarios. Second, three closely-related designs are proposed to guarantee the effectiveness of xASTNN, including statement subtree sequence for code naturalness, gated recursive unit for syntactical information, and gated recurrent unit for sequential information. Third, a dynamic batching algorithm is introduced to significantly reduce the time complexity of xASTNN. Two code comprehension downstream tasks, code classification and code clone detection, are adopted for evaluation. The results demonstrate that our xASTNN can improve the state-of-the-art while being faster than the baselines. @InProceedings{ESEC/FSE23p1727, author = {Zhiwei Xu and Min Zhou and Xibin Zhao and Yang Chen and Xi Cheng and Hongyu Zhang}, title = {xASTNN: Improved Code Representations for Industrial Practice}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1727--1738}, doi = {10.1145/3611643.3613869}, year = {2023}, } Publisher's Version |
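As a toy illustration of xASTNN's first step, the sketch below decomposes a function's AST into an ordered sequence of statement subtrees; Python's ast module stands in for the paper's language-agnostic AST pipeline.

```python
# Sketch: extract the statement subtree sequence of a function, the unit
# that downstream recursive/recurrent encoders would consume.
import ast

src = """
def f(xs):
    total = 0
    for x in xs:
        total += x
    return total
"""

func = ast.parse(src).body[0]
statement_seq = [ast.dump(node, annotate_fields=False)
                 for node in ast.walk(func)
                 if isinstance(node, ast.stmt) and node is not func]
for s in statement_seq:
    print(s[:60])  # one encoded subtree per statement, in traversal order
```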
|
Chen, Yifen |
ESEC/FSE '23-INDUSTRY: "Modeling the Centrality of ..."
Modeling the Centrality of Developer Output with Software Supply Chains
Audris Mockus, Peter C. Rigby, Rui Abreu, Parth Suresh, Yifen Chen, and Nachiappan Nagappan (Meta, USA; University of Tennessee, USA; Concordia University, Canada) Raw developer output, as measured by the number of changes a developer makes to the system, is a simplistic and potentially misleading measure of productivity, as new developers tend to work on peripheral parts of the system and experienced developers on more central ones. In this work, we use Software Supply Chain (SSC) networks and Katz centrality and PageRank on these networks to suggest a more nuanced measure of developer productivity. Our SSC is a network that represents the relationships between developers and artifacts that make up a system. We combine author-to-file, co-changing files, call hierarchies, and reporting structure into a single SSC and calculate the centrality of each node. The measures of centrality can be used to better understand variations in the impact of developer output at Meta. We start by partially replicating prior work and show that the raw number of developer commits plateaus over a project-specific period. However, the centrality of developer work grows over the entire study period, though the growth slows after one year. This implies that while raw output might plateau, more experienced developers work on more central parts of the system. Finally, we investigate the incremental contribution of SSC attributes in modeling developer output. We find that local attributes such as the number of reports and the specific project do not explain much variation (R² = 5.8%). In contrast, adding Katz centrality or PageRank produces a model with an R² above 30%. SSCs and their centrality provide valuable insights into the centrality and importance of a developer’s work. @InProceedings{ESEC/FSE23p1809, author = {Audris Mockus and Peter C. Rigby and Rui Abreu and Parth Suresh and Yifen Chen and Nachiappan Nagappan}, title = {Modeling the Centrality of Developer Output with Software Supply Chains}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1809--1819}, doi = {10.1145/3611643.3613873}, year = {2023}, } Publisher's Version |
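A small sketch of the centrality computation on a developer-artifact network, using networkx PageRank and Katz centrality; the edge list is an invented placeholder for a real software supply chain.

```python
# Sketch: build a developer-artifact network from authorship edges and
# compute the centrality of each developer node.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("dev:alice", "file:core/api.py"), ("dev:alice", "file:core/db.py"),
    ("dev:bob", "file:core/api.py"), ("dev:carol", "file:tools/lint.py"),
])

pr = nx.pagerank(G)
katz = nx.katz_centrality(G, alpha=0.1)
for node in sorted(pr, key=pr.get, reverse=True):
    if node.startswith("dev:"):
        print(node, round(pr[node], 3), round(katz[node], 3))
```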
|
Chen, Yixiang |
ESEC/FSE '23: "Comparison and Evaluation ..."
Comparison and Evaluation on Static Application Security Testing (SAST) Tools for Java
Kaixuan Li, Sen Chen, Lingling Fan, Ruitao Feng, Han Liu, Chengwei Liu, Yang Liu, and Yixiang Chen (East China Normal University, China; Tianjin University, China; Nankai University, China; UNSW, Australia; Nanyang Technological University, Singapore) Static application security testing (SAST) plays a significant role in the software development life cycle (SDLC). However, it is challenging to comprehensively evaluate the effectiveness of SAST tools to determine which is better at detecting vulnerabilities. In this paper, based on well-defined criteria, we first selected seven free or open-source SAST tools from 161 existing tools for further evaluation. Using synthetic and newly constructed real-world benchmarks, we evaluated and compared these SAST tools from different and comprehensive perspectives such as effectiveness, consistency, and performance. While SAST tools perform well on synthetic benchmarks, our results indicate that only 12.7% of real-world vulnerabilities can be detected by the selected tools. Even combining the detection capability of all tools, most vulnerabilities (70.9%) remain undetected, especially those beyond resource control and insufficiently neutralized input/output vulnerabilities. Although the tools already include the corresponding detection rules, the detection results still fall short of expectations. The findings unveiled in our comprehensive study provide guidance on tool development, improvement, evaluation, and selection for developers, researchers, and potential users. @InProceedings{ESEC/FSE23p921, author = {Kaixuan Li and Sen Chen and Lingling Fan and Ruitao Feng and Han Liu and Chengwei Liu and Yang Liu and Yixiang Chen}, title = {Comparison and Evaluation on Static Application Security Testing (SAST) Tools for Java}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {921--933}, doi = {10.1145/3611643.3616262}, year = {2023}, } Publisher's Version |
|
Chen, Yu |
ESEC/FSE '23-INDUSTRY: "LightF3: A Lightweight Fully-Process ..."
LightF3: A Lightweight Fully-Process Formal Framework for Automated Verifying Railway Interlocking Systems
Yibo Dong, Xiaoyu Zhang, Yicong Xu, Chang Cai, Yu Chen, Weikai Miao, Jianwen Li, and Geguang Pu (East China Normal University, China; Shanghai Trusted Industrial Control Platform, China) Interlocking has long played a crucial role in railway systems. Its functional correctness, particularly concerning safety, forms the foundation of the entire signaling system. To date, numerous efforts have been made to formally model and verify interlocking systems. However, two main problems persist in most prior work: (1) The formal description of the interlocking system heavily depends on reusing existing models, which often results in overgeneralization and failing to fully utilize the intrinsic characteristics of interlocking systems. (2) The verification techniques of current approaches may quickly become outdated, and there is no adaptable method to integrate state-of-the-art verification algorithms or tools. To address the above issues, we present LightF3, a lightweight and fully-process formal framework for modeling and verifying railway interlocking systems. LightF3 provides RIS-FL, a formal language based on FQLTL (a variant of LTL) to model the system and its specifications. LightF3 transforms the RIS-FL model automatically to the aiger model, which is the mainstream input of state-of-the-art model checkers, and then invokes the most advanced checkers to complete the verification task. We evaluated LightF3 by testing five real station instances from our industrial partner, demonstrating its effectiveness as a new framework. Additionally, we analyzed the statistics of the verification results from different model-checking techniques, providing useful conclusions for both the railway interlocking and formal methods communities. @InProceedings{ESEC/FSE23p1914, author = {Yibo Dong and Xiaoyu Zhang and Yicong Xu and Chang Cai and Yu Chen and Weikai Miao and Jianwen Li and Geguang Pu}, title = {LightF3: A Lightweight Fully-Process Formal Framework for Automated Verifying Railway Interlocking Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1914--1925}, doi = {10.1145/3611643.3613874}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
|
Chen, Zhenyu |
ESEC/FSE '23: "Dynamic Data Fault Localization ..."
Dynamic Data Fault Localization for Deep Neural Networks
Yining Yin, Yang Feng, Shihao Weng, Zixi Liu, Yuan Yao, Yichi Zhang, Zhihong Zhao, and Zhenyu Chen (Nanjing University, China) Rich datasets have empowered various deep learning (DL) applications, leading to remarkable success in many fields. However, data faults hidden in the datasets could result in DL applications behaving unpredictably and even cause massive monetary losses and loss of life. To alleviate this problem, in this paper, we propose a dynamic data fault localization approach, namely DFauLo, to locate the mislabeled and noisy data in deep learning datasets. DFauLo is inspired by conventional mutation-based code fault localization, but utilizes the differences between DNN mutants to amplify and identify the potential data faults. Specifically, it first generates multiple DNN model mutants of the original trained model. Then it extracts features from these mutants and maps them into a suspiciousness score indicating the probability of the given data being a data fault. Moreover, DFauLo is the first dynamic data fault localization technique: it prioritizes suspected data based on user feedback and generalizes to data faults unseen during training. To validate DFauLo, we extensively evaluate it on 26 cases with various fault types, data types, and model structures. We also evaluate DFauLo on three widely-used benchmark datasets. The results show that DFauLo outperforms the state-of-the-art techniques in almost all cases and locates hundreds of different types of real data faults in benchmark datasets. @InProceedings{ESEC/FSE23p1345, author = {Yining Yin and Yang Feng and Shihao Weng and Zixi Liu and Yuan Yao and Yichi Zhang and Zhihong Zhao and Zhenyu Chen}, title = {Dynamic Data Fault Localization for Deep Neural Networks}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1345--1357}, doi = {10.1145/3611643.3616345}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional ESEC/FSE '23: "Benchmarking Robustness of ..." Benchmarking Robustness of AI-Enabled Multi-sensor Fusion Systems: Challenges and Opportunities Xinyu Gao, Zhijie Wang, Yang Feng, Lei Ma, Zhenyu Chen, and Baowen Xu (Nanjing University, China; University of Alberta, Canada; University of Tokyo, Japan) Multi-Sensor Fusion (MSF) based perception systems have been the foundation in supporting many industrial applications and domains, such as self-driving cars, robotic arms, and unmanned aerial vehicles. Over the past few years, the fast progress in data-driven artificial intelligence (AI) has brought a fast-growing trend of empowering MSF systems with deep learning techniques to further improve their performance, especially for intelligent systems and their perception components. Although quite a few AI-enabled MSF perception systems and techniques have been proposed, up to the present, few benchmarks that focus on MSF perception are publicly available. Given that many intelligent systems such as self-driving cars are operated in safety-critical contexts where perception systems play an important role, there comes an urgent need for a more in-depth understanding of the performance and reliability of these MSF systems. To bridge this gap, we initiate an early step in this direction and construct a public benchmark of AI-enabled MSF-based perception systems including three commonly adopted tasks (i.e., object detection, object tracking, and depth completion).
Based on this, to comprehensively understand MSF systems’ robustness and reliability, we design 14 common and realistic corruption patterns to synthesize large-scale corrupted datasets. We further perform a systematic, large-scale evaluation of these systems and identify the following key findings: (1) existing AI-enabled MSF systems are not robust enough against corrupted sensor signals; (2) small synchronization and calibration errors can lead to a crash of AI-enabled MSF systems; (3) existing AI-enabled MSF systems are usually tightly coupled, such that bugs/errors from an individual sensor could result in a system crash; (4) the robustness of MSF systems can be enhanced by improving fusion mechanisms. Our results reveal the vulnerability of the current AI-enabled MSF perception systems, calling for researchers and practitioners to take robustness and reliability into account when designing AI-enabled MSF. Our benchmark, code, and detailed evaluation results are publicly available at https://sites.google.com/view/ai-msf-benchmark. @InProceedings{ESEC/FSE23p871, author = {Xinyu Gao and Zhijie Wang and Yang Feng and Lei Ma and Zhenyu Chen and Baowen Xu}, title = {Benchmarking Robustness of AI-Enabled Multi-sensor Fusion Systems: Challenges and Opportunities}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {871--882}, doi = {10.1145/3611643.3616278}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available |
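To make DFauLo's mutant-based signal concrete, this sketch marks training samples whose labels disagree with many perturbed ("mutant") decision boundaries as suspicious; the random-noise mutants of a toy classifier stand in for DFauLo's DNN mutants, and the data is synthetic.

```python
# Sketch: samples whose labels flip across many model mutants get a high
# suspiciousness score, surfacing likely label faults.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
labels = (X[:, 0] > 0).astype(int)
labels[:5] = 1 - labels[:5]                # inject five label faults

def mutant_predict(X, noise):              # a "mutated" decision boundary
    return ((X[:, 0] + noise) > 0).astype(int)

mutants = [mutant_predict(X, rng.normal(scale=0.3)) for _ in range(20)]
disagreement = np.mean([m != labels for m in mutants], axis=0)
print(np.argsort(-disagreement)[:5])       # most suspicious sample indices
```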
|
Chen, Zhiming |
ESEC/FSE '23: "DiagConfig: Configuration ..."
DiagConfig: Configuration Diagnosis of Performance Violations in Configurable Software Systems
Zhiming Chen, Pengfei Chen, Peipei Wang, Guangba Yu, Zilong He, and Genting Mai (Sun Yat-sen University, China; ByteDance Infrastructure System Lab, USA) Performance degradation due to misconfiguration in software systems that violates SLOs (service-level objectives) is commonplace. Diagnosing and explaining the root causes of such performance violations in configurable software systems is often challenging due to their increasing complexity. Although there are many tools and techniques for diagnosing performance violations, they provide limited evidence to attribute causes of observed performance violations to specific configurations. This is because the configuration is not originally considered in those tools. This paper proposes DiagConfig, specifically designed to conduct configuration diagnosis of performance violations. It leverages static code analysis to track configuration option propagation, identifies performance-sensitive options, detects performance violations, and constructs cause-effect chains that help stakeholders better understand the relationship between configuration and performance violations. Experimental evaluations with eight real-world software systems demonstrate that DiagConfig produces fewer false positives than a state-of-the-art documentation analysis-based tool (i.e., 5 vs 41) in the identification of performance-sensitive options, and outperforms a statistics-based debugging tool in the diagnosis of performance violations caused by configuration changes, offering more comprehensive results (recall: 0.892 vs 0.289). Moreover, we show that DiagConfig can accelerate auto-tuning by compressing configuration space. @InProceedings{ESEC/FSE23p566, author = {Zhiming Chen and Pengfei Chen and Peipei Wang and Guangba Yu and Zilong He and Genting Mai}, title = {DiagConfig: Configuration Diagnosis of Performance Violations in Configurable Software Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {566--578}, doi = {10.1145/3611643.3616300}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
|
Chen, Zimin |
ESEC/FSE '23-INDUSTRY: "MuRS: Mutant Ranking and Suppression ..."
MuRS: Mutant Ranking and Suppression using Identifier Templates
Zimin Chen, Małgorzata Salawa, Manushree Vijayvergiya, Goran Petrović, Marko Ivanković, and René Just (KTH Royal Institute of Technology, Sweden; Google, Switzerland; University of Washington, USA) Diff-based mutation testing is a mutation testing approach that only mutates lines affected by a code change under review. This approach scales independently of the code-base size and introduces test goals (mutants) that are directly relevant to an engineer’s goal such as fixing a bug, adding a new feature, or refactoring existing functionality. Google’s mutation testing service integrates diff-based mutation testing into the code review process and continuously gathers developer feedback on mutants surfaced during code review. To enhance the developer experience, the mutation testing service uses a number of manually-written rules that suppress not-useful mutants—mutants that have consistently received negative developer feedback. However, while effective, manually implementing suppression rules requires significant engineering time. This paper proposes and evaluates MuRS, an automated approach that groups mutants by patterns in the source code under test and uses these patterns to rank and suppress future mutants based on historical developer feedback on mutants in the same group. To evaluate MuRS, we conducted an A/B testing study, comparing MuRS to the existing mutation testing service. Despite the strong baseline, which uses manually-written suppression rules, the results show a statistically significantly lower negative feedback ratio of 11.45% for MuRS versus 12.41% for the baseline. The results also show that MuRS is able to recover existing suppression rules implemented in the baseline. Finally, the results show that statement-deletion mutant groups received both the most positive and negative developer feedback, suggesting a need for additional context that can distinguish between useful and not-useful mutants in these groups. Overall, MuRS is able to recover existing suppression rules and automatically learn additional, finer-grained suppression rules from developer feedback. @InProceedings{ESEC/FSE23p1798, author = {Zimin Chen and Małgorzata Salawa and Manushree Vijayvergiya and Goran Petrović and Marko Ivanković and René Just}, title = {MuRS: Mutant Ranking and Suppression using Identifier Templates}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1798--1808}, doi = {10.1145/3611643.3613901}, year = {2023}, } Publisher's Version |
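A minimal sketch of MuRS-style ranking and suppression under assumed data: mutants are keyed by an identifier template, and new mutants are ordered (or suppressed) by the historical negative-feedback ratio of their template group. The templates and feedback counts are invented placeholders.

```python
# Sketch: rank new mutants by historical negative-feedback ratio of their
# template group, and suppress groups with consistently negative feedback.
history = {  # template -> (positive, negative) developer feedback counts
    "delete_stmt(log.<call>)": (2, 48),
    "replace_op(== -> !=)":    (35, 15),
    "delete_stmt(return)":     (30, 20),
}

def negative_ratio(template: str) -> float:
    pos, neg = history.get(template, (0, 0))
    total = pos + neg
    return neg / total if total else 0.5    # unseen templates rank mid-list

new_mutants = ["replace_op(== -> !=)", "delete_stmt(log.<call>)"]
ranked = sorted(new_mutants, key=negative_ratio)
print(ranked)  # surface the mutant least likely to annoy reviewers first
SUPPRESS = 0.9
print([m for m in ranked if negative_ratio(m) < SUPPRESS])  # after suppression
```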
|
Cheng, Siyuan |
ESEC/FSE '23: "PEM: Representing Binary Program ..."
PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model
Xiangzhe Xu, Zhou Xuan, Shiwei Feng, Siyuan Cheng, Yapeng Ye, Qingkai Shi, Guanhong Tao, Le Yu, Zhuo Zhang, and Xiangyu Zhang (Purdue University, USA) Binary similarity analysis determines if two binary executables are from the same source program. Existing techniques leverage static and dynamic program features and may utilize advanced Deep Learning techniques. Although they have demonstrated great potential, the community believes that a more effective representation of program semantics can further improve similarity analysis. In this paper, we propose a new method to represent binary program semantics. It is based on a novel probabilistic execution engine that can effectively sample the input space and the program path space of subject binaries. More importantly, it ensures that the collected samples are comparable across binaries, addressing the substantial variations of input specifications. Our evaluation on 9 real-world projects with 35k functions, and comparison with 6 state-of-the-art techniques show that PEM can achieve a precision of 96% with common settings, outperforming the baselines by 10-20%. @InProceedings{ESEC/FSE23p401, author = {Xiangzhe Xu and Zhou Xuan and Shiwei Feng and Siyuan Cheng and Yapeng Ye and Qingkai Shi and Guanhong Tao and Le Yu and Zhuo Zhang and Xiangyu Zhang}, title = {PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {401--412}, doi = {10.1145/3611643.3616301}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Cheng, Xi |
ESEC/FSE '23-INDUSTRY: "xASTNN: Improved Code Representations ..."
xASTNN: Improved Code Representations for Industrial Practice
Zhiwei Xu, Min Zhou, Xibin Zhao, Yang Chen, Xi Cheng, and Hongyu Zhang (Tsinghua University, China; Fudan University, China; VMware, China; Chongqing University, China) The application of deep learning techniques in software engineering is becoming increasingly popular. One key problem is developing high-quality and easy-to-use source code representations for code-related tasks. The research community has achieved impressive results in recent years. However, due to deployment difficulties and performance bottlenecks, these approaches are seldom applied in industry. In this paper, we present xASTNN, an eXtreme Abstract Syntax Tree (AST)-based Neural Network for source code representation, aiming to push this technique to industrial practice. The proposed xASTNN has three advantages. First, xASTNN is completely based on widely-used ASTs and does not require complicated data pre-processing, making it applicable to various programming languages and practical scenarios. Second, three closely-related designs are proposed to guarantee the effectiveness of xASTNN, including statement subtree sequence for code naturalness, gated recursive unit for syntactical information, and gated recurrent unit for sequential information. Third, a dynamic batching algorithm is introduced to significantly reduce the time complexity of xASTNN. Two code comprehension downstream tasks, code classification and code clone detection, are adopted for evaluation. The results demonstrate that our xASTNN can improve the state-of-the-art while being faster than the baselines. @InProceedings{ESEC/FSE23p1727, author = {Zhiwei Xu and Min Zhou and Xibin Zhao and Yang Chen and Xi Cheng and Hongyu Zhang}, title = {xASTNN: Improved Code Representations for Industrial Practice}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1727--1738}, doi = {10.1145/3611643.3613869}, year = {2023}, } Publisher's Version |
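A rough sketch of the first of the three designs, the statement subtree sequence: split a function into statements and serialize each statement's AST subtree in preorder. Python's ast module stands in for the paper's multi-language front end, and the serialization is simplified.

# Rough sketch of xASTNN-style preprocessing: one preorder token sequence
# of AST node types per top-level statement in a function body.
import ast

SOURCE = """
def f(xs):
    total = 0
    for x in xs:
        total += x
    return total
"""

func = ast.parse(SOURCE).body[0]

def preorder(node):
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)

subtree_sequence = [list(preorder(stmt)) for stmt in func.body]
for seq in subtree_sequence:
    print(seq)
# e.g. ['Assign', 'Name', 'Store', 'Constant'] for `total = 0`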
|
Cheng, Yueqiang |
ESEC/FSE '23: "Input-Driven Dynamic Program ..."
Input-Driven Dynamic Program Debloating for Code-Reuse Attack Mitigation
Xiaoke Wang, Tao Hui, Lei Zhao, and Yueqiang Cheng (Wuhan University, China; NIO, USA) Modern software is bloated, especially its libraries. The unnecessary code not only introduces severe vulnerabilities but also helps attackers construct exploits. To mitigate the damage of bloated libraries, researchers have proposed several debloating techniques to remove or restrict the invocation of unused code in a library. However, existing approaches either statically keep code for all expected inputs, which leaves unused code for each concrete input, or rely on runtime context to dynamically determine the necessary code, which could be manipulated by attackers. In this paper, we propose Picup, a practical approach that dynamically customizes libraries for each input. Based on the observation that the behavior of a program mainly depends on the given input, we design Picup to predict the necessary library functions immediately after we get the input, which erases the unused code before attackers can affect the decision-making data. To achieve an effective prediction, we adopt a convolutional neural network (CNN) with an attention mechanism to extract key bytes from the input and map them to library functions. We evaluate Picup on real-world benchmarks and popular applications. The results show that we can predict the necessary library functions with 97.56% accuracy, and reduce the code size by 87.55% on average with low overhead. These results indicate that Picup is a practical solution for secure and effective library debloating. @InProceedings{ESEC/FSE23p934, author = {Xiaoke Wang and Tao Hui and Lei Zhao and Yueqiang Cheng}, title = {Input-Driven Dynamic Program Debloating for Code-Reuse Attack Mitigation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {934--946}, doi = {10.1145/3611643.3616274}, year = {2023}, } Publisher's Version |
|
Cheng, Yutong |
ESEC/FSE '23: "Hue: A User-Adaptive Parser ..."
Hue: A User-Adaptive Parser for Hybrid Logs
Junjielong Xu, Qiuai Fu, Zhouruixing Zhu, Yutong Cheng, Zhijing Li, Yuchi Ma, and Pinjia He (Chinese University of Hong Kong, Shenzhen, China; Huawei Cloud Computing Technologies, China) Log parsing, which extracts log templates from semi-structured logs and produces structured logs, is the first and the most critical step in automated log analysis. While existing log parsers have achieved decent results, they suffer from two major limitations by design. First, they do not natively support hybrid logs that consist of both single-line logs and multi-line logs (e.g., Java exceptions and Hadoop counters). Second, they fall short in integrating domain knowledge in parsing, making it hard to identify ambiguous tokens in logs. This paper defines a new research problem, hybrid log parsing, as a superset of traditional log parsing tasks, and proposes Hue, the first attempt at hybrid log parsing in a user-adaptive manner. Specifically, Hue converts each log message to a sequence of special wildcards using a key casting table and determines the log types via line aggregating and pattern extracting. In addition, Hue can effectively utilize user feedback via a novel merge-reject strategy, making it possible to quickly adapt to complex and changing log templates. We evaluated Hue on three hybrid log datasets and sixteen widely-used single-line log datasets (Loghub). The results show that Hue achieves an average grouping accuracy of 0.845 on hybrid logs, which largely outperforms the best results (0.563 on average) obtained by existing parsers. Hue also exhibits state-of-the-art performance on single-line log datasets. @InProceedings{ESEC/FSE23p413, author = {Junjielong Xu and Qiuai Fu and Zhouruixing Zhu and Yutong Cheng and Zhijing Li and Yuchi Ma and Pinjia He}, title = {Hue: A User-Adaptive Parser for Hybrid Logs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {413--424}, doi = {10.1145/3611643.3616260}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Functional |
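The key-casting step can be sketched as follows: each token of a log line is mapped to a wildcard class so that lines from the same template collapse to one key. The casting rules below are invented placeholders for Hue's key casting table, and the line aggregation and merge-reject feedback loop are omitted.

# Simplified sketch of Hue's key-casting idea with illustrative rules.
import re

CASTS = [
    (re.compile(r"^0x[0-9a-fA-F]+$"), "<HEX>"),
    (re.compile(r"^\d+(\.\d+)?$"), "<NUM>"),
    (re.compile(r"^(/[\w.-]+)+$"), "<PATH>"),
]

def cast_token(tok: str) -> str:
    for pattern, wildcard in CASTS:
        if pattern.match(tok):
            return wildcard
    return tok

def template_key(line: str) -> str:
    # Lines differing only in cast-away values map to the same key.
    return " ".join(cast_token(t) for t in line.split())

print(template_key("Opened /var/log/app.log in 13 ms"))
print(template_key("Opened /tmp/out.log in 7 ms"))
# Both print: Opened <PATH> in <NUM> ms  -> same template group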
|
Cheung, Shing-Chi |
ESEC/FSE '23: "Can Machine Learning Pipelines ..."
Can Machine Learning Pipelines Be Better Configured?
Yibo Wang, Ying Wang, Tingwei Zhang, Yue Yu, Shing-Chi Cheung, Hai Yu, and Zhiliang Zhu (Northeastern University, China; Hong Kong University of Science and Technology, China; National University of Defense Technology, China) A Machine Learning (ML) pipeline configures the workflow of a learning task using the APIs provided by ML libraries. However, a pipeline’s performance can vary significantly across different configurations of ML library versions. Misconfigured pipelines can result in inferior performance, such as poor execution time and memory usage, numeric errors, and even crashes. A pipeline is subject to misconfiguration if it exhibits significantly inconsistent performance upon changes in the versions of its configured libraries or the combination of these libraries. We refer to such performance inconsistency as a pipeline configuration (PLC) issue. There is no prior systematic study on the pervasiveness, impact and root causes of PLC issues. A systematic understanding of these issues helps configure effective ML pipelines and identify misconfigured ones. In this paper, we conduct the first empirical study of PLC issues. To dig deeper into the problem, we propose Piecer, an infrastructure that automatically generates a set of pipeline variants by varying different version combinations of ML libraries and compares their performance inconsistencies. We apply Piecer to the 3,380 pipelines that can be deployed out of the 11,363 ML pipelines collected from multiple ML competitions on the Kaggle platform. The empirical study results show that 1,092 (32.3%) of the deployable pipelines are subject to PLC issues. @InProceedings{ESEC/FSE23p463, author = {Yibo Wang and Ying Wang and Tingwei Zhang and Yue Yu and Shing-Chi Cheung and Hai Yu and Zhiliang Zhu}, title = {Can Machine Learning Pipelines Be Better Configured?}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {463--475}, doi = {10.1145/3611643.3616352}, year = {2023}, } Publisher's Version Info ESEC/FSE '23: "Understanding the Bug Characteristics ..." Understanding the Bug Characteristics and Fix Strategies of Federated Learning Systems Xiaohu Du, Xiao Chen, Jialun Cao, Ming Wen, Shing-Chi Cheung, and Hai Jin (Huazhong University of Science and Technology, China; Hong Kong University of Science and Technology, China) Federated learning (FL) is an emerging machine learning paradigm that aims to address the problem of isolated data islands. To preserve privacy, FL allows machine learning models and deep neural networks to be trained from decentralized data kept privately at individual devices. FL has been increasingly adopted in mission-critical fields such as finance and healthcare. However, bugs in FL systems are inevitable and may result in catastrophic consequences such as financial loss, inappropriate medical decisions, and violations of data privacy ordinances. While many recent studies were conducted to understand the bugs in machine learning systems, there is no existing study characterizing the bugs arising from the unique nature of FL systems. To fill the gap, we collected 395 real bugs from six popular FL frameworks (Tensorflow Federated, PySyft, FATE, Flower, PaddleFL, and Fedlearner) on GitHub and Stack Overflow, and then manually analyzed their symptoms and impacts, prone stages, root causes, and fix strategies. Furthermore, we report a series of findings and actionable implications that can potentially facilitate the detection of FL bugs. 
@InProceedings{ESEC/FSE23p1358, author = {Xiaohu Du and Xiao Chen and Jialun Cao and Ming Wen and Shing-Chi Cheung and Hai Jin}, title = {Understanding the Bug Characteristics and Fix Strategies of Federated Learning Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1358--1370}, doi = {10.1145/3611643.3616347}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable ESEC/FSE '23: "Testing Coreference Resolution ..." Testing Coreference Resolution Systems without Labeled Test Sets Jialun Cao, Yaojie Lu, Ming Wen, and Shing-Chi Cheung (Hong Kong University of Science and Technology, China; Guangzhou HKUST Fok Ying Tung Research Institute, China; Institute of Software at Chinese Academy of Sciences, China; Huazhong University of Science and Technology, China) Coreference resolution (CR) is a task to resolve different expressions (e.g., named entities, pronouns) that refer to the same real-world entity/event. It is a core natural language processing (NLP) component that underlies and empowers major downstream NLP applications such as machine translation, chatbots, and question-answering. Despite its broad impact, the problem of testing CR systems has rarely been studied. A major difficulty is the shortage of a labeled dataset for testing. While it is possible to feed arbitrary sentences as test inputs to a CR system, a test oracle that captures their expected test outputs (coreference relations) is hard to define automatically. To address the challenge, we propose Crest, an automated testing methodology for CR systems. Crest uses constituency and dependency relations to construct pairs of test inputs subject to the same coreference. These relations can be leveraged to define the metamorphic relation for metamorphic testing. We compare Crest with five state-of-the-art test generation baselines on two popular CR systems, and apply them to generate tests from 1,000 sentences randomly sampled from CoNLL-2012, a popular dataset for coreference resolution. Experimental results show that Crest outperforms baselines significantly. The issues reported by Crest are all true positives (i.e., 100% precision), compared with 63% to 75% achieved by the baselines. @InProceedings{ESEC/FSE23p107, author = {Jialun Cao and Yaojie Lu and Ming Wen and Shing-Chi Cheung}, title = {Testing Coreference Resolution Systems without Labeled Test Sets}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {107--119}, doi = {10.1145/3611643.3616258}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
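Crest's metamorphic relation can be sketched schematically: two inputs constructed to share the same coreference must yield consistent resolver outputs, so no labeled oracle is needed. The toy resolver and the fixed adjunct below are illustrative assumptions; Crest derives its input pairs from constituency and dependency relations and tests real CR systems.

# Schematic sketch of Crest-style metamorphic testing for coreference.
STOP = {"after", "the", "a"}

def toy_resolve(sentence: str) -> set:
    # Toy stand-in for a real coreference resolver (illustration only):
    # link every third-person pronoun to the first capitalized non-stopword.
    words = [w.strip(",.") for w in sentence.split()]
    names = [w for w in words if w[:1].isupper() and w.lower() not in STOP]
    pronouns = [w for w in words if w.lower() in {"she", "he", "they"}]
    return {(p, names[0]) for p in pronouns} if names else set()

def metamorphic_pair(sentence: str):
    # Prepending an adjunct clause should preserve the coreference links
    # inside the original sentence, so both inputs share one expected output.
    return sentence, "After the meeting, " + sentence

def check(sentence: str) -> bool:
    s1, s2 = metamorphic_pair(sentence)
    return toy_resolve(s1) == toy_resolve(s2)

print(check("Alice said that she was tired."))  # True: the relation holds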
|
Chew, Francis |
ESEC/FSE '23-INDUSTRY: "Ownership in the Hands of ..."
Ownership in the Hands of Accountability at Brightsquid: A Case Study and a Developer Survey
Umme Ayman Koana, Francis Chew, Chris Carlson, and Maleknaz Nayebi (York University, Canada; Brightsquid, Canada) The COVID-19 pandemic has accelerated the adoption of digital health solutions. This has presented significant challenges for software development teams to swiftly adjust to market needs and demands. To address these challenges, product management teams have had to adapt their approach to software development, reshaping their processes to meet the demands of the pandemic. Brightsquid implemented a new task assignment process aimed at enhancing developer accountability toward the customer. To assess the impact of this change on code ownership, we conducted a code change analysis. Additionally, we surveyed 67 developers to investigate the relationship between accountability and ownership more broadly. The findings of our case study indicate that the revised assignment model not only increased the perceived sense of accountability within the production team but also improved code resilience against ownership changes. Moreover, the survey results revealed that a majority of the participating developers (67.5%) associated perceived accountability with artifact ownership. @InProceedings{ESEC/FSE23p2008, author = {Umme Ayman Koana and Francis Chew and Chris Carlson and Maleknaz Nayebi}, title = {Ownership in the Hands of Accountability at Brightsquid: A Case Study and a Developer Survey}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2008--2019}, doi = {10.1145/3611643.3613890}, year = {2023}, } Publisher's Version |
|
Chi, Xiaye |
ESEC/FSE '23: "An Automated Approach to Extracting ..."
An Automated Approach to Extracting Local Variables
Xiaye Chi, Hui Liu, Guangjie Li, Weixiao Wang, Yunni Xia, Yanjie Jiang, Yuxia Zhang, and Weixing Ji (Beijing Institute of Technology, China; National Innovation Institute of Defense Technology, China; Chongqing University, China) Extract local variable is a well-known and widely used refactoring. It is frequently employed to replace one or more occurrences of a complex expression with simple accesses to a newly added variable. Although most IDEs provide tool support for extracting local variables, tools that do not deeply analyze the refactoring may introduce semantic errors. To this end, in this paper, we propose a novel and more reliable approach, called ValExtractor, to conduct extract variable refactorings automatically. The major challenge of automated extract local variable refactorings is how to efficiently and accurately identify the side effects of the extracted expressions and the potential interaction between the extracted expressions and their contexts without time-consuming dynamic execution of the involved programs. To resolve this challenge, ValExtractor leverages a lightweight static source code analysis to validate the side effects of the selected expression, and to identify which occurrences of the selected expression could be extracted together without changing the semantics of the program or introducing potential new exceptions. Our evaluation results on open-source Java applications suggest that Eclipse and IntelliJ IDEA, two state-of-the-practice refactoring engines, resulted in a large number of faulty extract variable refactorings whereas ValExtractor successfully avoided all such errors. The proposed approach has been merged into (and distributed with) Eclipse to improve the safety of extract local variable refactoring. @InProceedings{ESEC/FSE23p313, author = {Xiaye Chi and Hui Liu and Guangjie Li and Weixiao Wang and Yunni Xia and Yanjie Jiang and Yuxia Zhang and Weixing Ji}, title = {An Automated Approach to Extracting Local Variables}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {313--325}, doi = {10.1145/3611643.3616261}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
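The central validation in ValExtractor, checking that the selected expression is side-effect-free so that all its occurrences can share one variable, can be approximated for Python expressions as below. ValExtractor targets Java and is considerably more precise; this conservative purity check is only a sketch.

# Toy analogue of ValExtractor's core validation: before extracting a
# repeated expression into a local variable, statically confirm that the
# expression has no obvious side effects.
import ast

def is_side_effect_free(expr: str) -> bool:
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        # Calls, attribute access, and subscripts may run arbitrary code
        # (getters, __getitem__), so conservatively reject them.
        if isinstance(node, (ast.Call, ast.Attribute, ast.Subscript,
                             ast.Await, ast.NamedExpr)):
            return False
    return True

print(is_side_effect_free("(a + b) * (a + b)"))  # True: safe to extract
print(is_side_effect_free("f(a) + f(a)"))        # False: f may have effects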
|
Chimalakonda, Sridhar |
ESEC/FSE '23-DEMO: "DENT: A Tool for Tagging Stack ..."
DENT: A Tool for Tagging Stack Overflow Posts with Deep Learning Energy Patterns
Shriram Shanbhag, Sridhar Chimalakonda, Vibhu Saujanya Sharma, and Vikrant Kaulgud (IIT Tirupati, India; Accenture Labs, India) Energy efficiency has become an important consideration in deep learning systems. However, it remains a largely under-emphasized aspect during the development. Despite the emergence of energy-efficient deep learning patterns, their adoption remains a challenge due to limited awareness. To address this gap, we present DENT (Deep Learning Energy Pattern Tagger), a Chrome extension that adds "energy pattern tags" to deep-learning-related questions on Stack Overflow. The idea of DENT is to hint to developers, through energy pattern labels, at possible energy-saving opportunities associated with a Stack Overflow post. We hope this will increase awareness about energy patterns in deep learning and improve their adoption. A preliminary evaluation of DENT achieved an average precision of 0.74, recall of 0.66, and an F1-score of 0.65 with an accuracy of 66%. The demonstration of the tool is available at https://youtu.be/S0Wf_w0xajw and the related artifacts are available at https://rishalab.github.io/DENT/ @InProceedings{ESEC/FSE23p2157, author = {Shriram Shanbhag and Sridhar Chimalakonda and Vibhu Saujanya Sharma and Vikrant Kaulgud}, title = {DENT: A Tool for Tagging Stack Overflow Posts with Deep Learning Energy Patterns}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2157--2161}, doi = {10.1145/3611643.3613092}, year = {2023}, } Publisher's Version |
|
Choi, Sangheon |
ESEC/FSE '23: "NaNofuzz: A Usable Tool for ..."
NaNofuzz: A Usable Tool for Automatic Test Generation
Matthew C. Davis, Sangheon Choi, Sam Estep, Brad A. Myers, and Joshua Sunshine (Carnegie Mellon University, USA; Rose-Hulman Institute of Technology, USA) In the United States alone, software testing labor is estimated to cost $48 billion USD per year. Despite widespread test execution automation and automation in other areas of software engineering, test suites continue to be created manually by software engineers. We have built a test generation tool, called NaNofuzz, that helps users find bugs in their code by suggesting tests where the output is likely indicative of a bug, e.g., that return NaN (not-a-number) values. NaNofuzz is an interactive tool embedded in a development environment to fit into the programmer's workflow. NaNofuzz tests a function with as little as one button press, analyses the program to determine inputs it should evaluate, executes the program on those inputs, and categorizes outputs to prioritize likely bugs. We conducted a randomized controlled trial with 28 professional software engineers using NaNofuzz as the intervention treatment and the popular manual testing tool, Jest, as the control treatment. Participants using NaNofuzz on average identified bugs more accurately (p < .05, by 30%), were more confident in their tests (p < .03, by 20%), and finished their tasks more quickly (p < .007, by 30%). @InProceedings{ESEC/FSE23p1114, author = {Matthew C. Davis and Sangheon Choi and Sam Estep and Brad A. Myers and Joshua Sunshine}, title = {NaNofuzz: A Usable Tool for Automatic Test Generation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1114--1126}, doi = {10.1145/3611643.3616327}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
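The loop NaNofuzz automates is simple to state: generate inputs, run the function, and bucket outputs by whether they look bug-indicative (NaN, infinity, exceptions). A minimal Python analogue of that categorization idea follows; NaNofuzz itself is an IDE-embedded tool with its own input analysis, and buggy_mean is an invented example.

# Minimal analogue of NaNofuzz-style output triage.
import math
import random

def buggy_mean(xs):
    return sum(xs) / len(xs)  # crashes on [], NaN-prone on bad data

def categorize(fn, inputs):
    buckets = {"ok": [], "nan": [], "inf": [], "exception": []}
    for x in inputs:
        try:
            out = fn(x)
            if isinstance(out, float) and math.isnan(out):
                buckets["nan"].append(x)
            elif isinstance(out, float) and math.isinf(out):
                buckets["inf"].append(x)
            else:
                buckets["ok"].append(x)
        except Exception:
            buckets["exception"].append(x)
    return buckets

random.seed(0)
inputs = [[random.choice([1.0, 0.0, float("nan")])
           for _ in range(random.randint(0, 3))]
          for _ in range(100)]
report = categorize(buggy_mean, inputs)
print({k: len(v) for k, v in report.items()})
# Empty lists land in "exception", NaN-containing lists in "nan":
# exactly the likely-bug buckets a tool would surface first.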
|
Cito, Jürgen |
ESEC/FSE '23-INDUSTRY: "Understanding Hackers’ Work: ..."
Understanding Hackers’ Work: An Empirical Study of Offensive Security Practitioners
Andreas Happe and Jürgen Cito (TU Wien, Austria) Offensive security tests are commonly employed to proactively discover potential vulnerabilities. They are performed by specialists, also known as penetration testers or white-hat hackers. The chronic lack of available white-hat hackers prevents sufficient security test coverage of software. Research into automation tries to alleviate this problem by improving the efficiency of security testing. To achieve this, researchers and tool builders need a solid understanding of how hackers work, their assumptions, and pain points. In this paper, we present a first data-driven exploratory qualitative study of twelve security professionals, their work, and the problems occurring therein. We perform a thematic analysis to gain insights into the execution of security assignments, hackers' thought processes, and encountered challenges. This analysis allows us to conclude with recommendations for researchers and tool builders, to increase the efficiency of their automation and identify novel areas for research. @InProceedings{ESEC/FSE23p1669, author = {Andreas Happe and Jürgen Cito}, title = {Understanding Hackers’ Work: An Empirical Study of Offensive Security Practitioners}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1669--1680}, doi = {10.1145/3611643.3613900}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available ESEC/FSE '23-IVR: "Getting pwn’d by AI: Penetration ..." Getting pwn’d by AI: Penetration Testing with Large Language Models Andreas Happe and Jürgen Cito (TU Wien, Austria) The field of software security testing, more specifically penetration testing, requires high levels of expertise and involves many manual testing and analysis steps. This paper explores the potential use of large language models, such as GPT-3.5, to augment penetration testers with AI sparring partners. We explore two distinct use cases: high-level task planning for security testing assignments and low-level vulnerability hunting within a vulnerable virtual machine. For the latter, we implemented a closed feedback loop between LLM-generated low-level actions and a vulnerable virtual machine (connected through SSH) and allowed the LLM to analyze the machine state for vulnerabilities and suggest concrete attack vectors, which were automatically executed within the virtual machine. We discuss promising initial results, detail avenues for improvement, and close by deliberating on the ethics of AI sparring partners. @InProceedings{ESEC/FSE23p2082, author = {Andreas Happe and Jürgen Cito}, title = {Getting pwn’d by AI: Penetration Testing with Large Language Models}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2082--2086}, doi = {10.1145/3611643.3613083}, year = {2023}, } Publisher's Version Info |
|
Clapp, Lazaro |
ESEC/FSE '23-INDUSTRY: "Last Diff Analyzer: Multi-language ..."
Last Diff Analyzer: Multi-language Automated Approver for Behavior-Preserving Code Revisions
Yuxin Wang, Adam Welc, Lazaro Clapp, and Lingchao Chen (Uber Technologies, USA; Mysten Labs, USA) Code review is a crucial step in ensuring the quality and maintainability of software systems. However, this process can be time-consuming and resource-intensive, especially in large-scale projects where a significant number of code changes are submitted every day. Fortunately, not all code changes require human reviews, as some may only contain syntactic modifications that do not alter the behavior of the system, such as format changes, variable/function renamings, and constant extractions. In this paper, we propose a multi-language automated code approver — Last Diff Analyzer for Go and Java, which is able to detect if a reviewable incremental unit of code change (diff) contains only changes that do not modify system behavior. It is built on top of a novel multi-language static analysis framework that unifies common features of multiple languages while keeping unique language constructs separate. This makes it easy to extend to other languages such as TypeScript, Kotlin, Swift, and others. Besides skipping unnecessary code reviews, Last Diff Analyzer could be further applied to skip certain resource-intensive end-to-end (E2E) tests for auto-approved diffs, significantly reducing resource usage. We have deployed the analyzer at scale within Uber, and data collected in production shows that approximately 15% of analyzed diffs are auto-approved weekly for code reviews. Furthermore, a 13.5% reduction in server node usage dedicated to E2E tests (measured by number of executed E2E tests) is observed as a result of skipping E2E tests, compared to the node usage if Last Diff Analyzer were not enabled. @InProceedings{ESEC/FSE23p1693, author = {Yuxin Wang and Adam Welc and Lazaro Clapp and Lingchao Chen}, title = {Last Diff Analyzer: Multi-language Automated Approver for Behavior-Preserving Code Revisions}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1693--1704}, doi = {10.1145/3611643.3613870}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available ESEC/FSE '23: "Practical Inference of Nullability ..." Practical Inference of Nullability Types Nima Karimipour, Justin Pham, Lazaro Clapp, and Manu Sridharan (University of California, Riverside, USA; Uber Technologies, USA) NullPointerExceptions (NPEs), caused by dereferencing null, frequently cause crashes in Java programs. Pluggable type checking is highly effective in preventing Java NPEs. However, this approach is difficult to adopt for large, existing code bases, as it requires manually inserting a significant number of type qualifiers into the code. Hence, a tool to automatically infer these qualifiers could make adoption of type-based NPE prevention significantly easier. We present a novel and practical approach to automatic inference of nullability type qualifiers for Java. Our technique searches for a set of qualifiers that maximizes the amount of code that can be successfully type checked. The search uses the type checker as a black box oracle, easing compatibility with existing tools. However, this approach can be costly, as evaluating the impact of a qualifier requires re-running the checker. We present a technique for safely evaluating many qualifiers in a single checker run, dramatically reducing running times. We also describe extensions to make the approach practical in a real-world deployment. 
We implemented our approach in an open-source tool, NullAwayAnnotator, designed to work with the NullAway type checker. We evaluated NullAwayAnnotator’s effectiveness on both open-source projects and commercial code. NullAwayAnnotator reduces the number of reported NullAway errors by 69.5% on average. Further, our optimizations enable NullAwayAnnotator to scale to large Java programs. NullAwayAnnotator has been highly effective in practice: in a production deployment, it has already been used to add NullAway checking to 160 production modules totaling over 1.3 million lines of Java code. @InProceedings{ESEC/FSE23p1395, author = {Nima Karimipour and Justin Pham and Lazaro Clapp and Manu Sridharan}, title = {Practical Inference of Nullability Types}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1395--1406}, doi = {10.1145/3611643.3616326}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Functional |
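For the Last Diff Analyzer entry above, the simplest class of behavior-preserving change, formatting- and comment-only edits, can be recognized by comparing normalized ASTs. A Python-flavored sketch of that one case follows; the real analyzer targets Go and Java and also recognizes renamings and constant extractions.

# Sketch of the easiest auto-approval case: a diff that only touches
# formatting or comments leaves the parsed AST unchanged, so comparing
# normalized AST dumps suffices to recognize it.
import ast

def behavior_preserving_format_only(old_src: str, new_src: str) -> bool:
    try:
        return ast.dump(ast.parse(old_src)) == ast.dump(ast.parse(new_src))
    except SyntaxError:
        return False  # unparsable revisions always need human review

OLD = "def area(w,h):\n    return w*h  # compute area\n"
NEW = "def area(w, h):\n    # compute area\n    return w * h\n"

print(behavior_preserving_format_only(OLD, NEW))  # True: skip human review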
|
Cohn-Gordon, Katriel |
ESEC/FSE '23-INDUSTRY: "Dead Code Removal at Meta: ..."
Dead Code Removal at Meta: Automatically Deleting Millions of Lines of Code and Petabytes of Deprecated Data
Will Shackleton, Katriel Cohn-Gordon, Peter C. Rigby, Rui Abreu, James Gill, Nachiappan Nagappan, Karim Nakad, Ioannis Papagiannis, Luke Petre, Giorgi Megreli, Patrick Riggs, and James Saindon (Meta, USA; Concordia University, Canada) Software constantly evolves in response to user needs: new features are built, deployed, mature and grow old, and eventually their usage drops enough to merit switching them off. In any large codebase, this feature lifecycle can naturally lead to retaining unnecessary code and data. Removing these respects users’ privacy expectations, as well as helping engineers to work efficiently. In prior software engineering research, we have found little evidence of code deprecation or dead-code removal at industrial scale. We describe Systematic Code and Asset Removal Framework (SCARF), a product deprecation system to assist engineers working in large codebases. SCARF identifies unused code and data assets and safely removes them. It operates fully automatically, including committing code and dropping database tables. It also gathers developer input where it cannot take automated actions, leading to further removals. Dead code removal increases the quality and consistency of large codebases, aids with knowledge management and improves reliability. SCARF has had an important impact at Meta. In the last year alone, it has removed petabytes of data across 12.8 million distinct assets, and deleted over 104 million lines of code. @InProceedings{ESEC/FSE23p1705, author = {Will Shackleton and Katriel Cohn-Gordon and Peter C. Rigby and Rui Abreu and James Gill and Nachiappan Nagappan and Karim Nakad and Ioannis Papagiannis and Luke Petre and Giorgi Megreli and Patrick Riggs and James Saindon}, title = {Dead Code Removal at Meta: Automatically Deleting Millions of Lines of Code and Petabytes of Deprecated Data}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1705--1715}, doi = {10.1145/3611643.3613871}, year = {2023}, } Publisher's Version |
|
Cooper, Nathan |
ESEC/FSE '23-DEMO: "MASC: A Tool for Mutation-Based ..."
MASC: A Tool for Mutation-Based Evaluation of Static Crypto-API Misuse Detectors
Amit Seal Ami, Syed Yusuf Ahmed, Radowan Mahmud Redoy, Nathan Cooper, Kaushal Kafle, Kevin Moran, Denys Poshyvanyk, and Adwait Nadkarni (College of William and Mary, USA; University of Dhaka, Bangladesh; University of Central Florida, USA) While software engineers are optimistically adopting crypto-API misuse detectors (or crypto-detectors) in their software development cycles, this momentum must be accompanied by a rigorous understanding of crypto-detectors’ effectiveness at finding crypto-API misuses in practice. This demo paper presents the technical details and usage scenarios of our tool, namely Mutation Analysis for evaluating Static Crypto-API misuse detectors (MASC). We developed 12 generalizable, usage-based mutation operators and three mutation scopes, namely Main Scope, Similarity Scope, and Exhaustive Scope, which can be used to expressively instantiate compilable variants of the crypto-API misuse cases. Using MASC, we evaluated nine major crypto-detectors, and discovered 19 unique, undocumented flaws. We designed MASC to be configurable and user-friendly; a user can configure the parameters to change the nature of generated mutations. Furthermore, MASC comes with both a command-line interface and a web-based front-end, making it practical for users of different levels of expertise. @InProceedings{ESEC/FSE23p2162, author = {Amit Seal Ami and Syed Yusuf Ahmed and Radowan Mahmud Redoy and Nathan Cooper and Kaushal Kafle and Kevin Moran and Denys Poshyvanyk and Adwait Nadkarni}, title = {MASC: A Tool for Mutation-Based Evaluation of Static Crypto-API Misuse Detectors}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2162--2166}, doi = {10.1145/3611643.3613099}, year = {2023}, } Publisher's Version |
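A MASC-style usage-based mutation operator rewrites a secure crypto-API usage into a misuse that a sound crypto-detector must flag. The rewrite below is a common textbook example expressed as a plain source transformation, not an operator taken verbatim from MASC's operator set.

# Illustrative mutation operator: weaken a Java Cipher transformation.
SECURE = 'Cipher c = Cipher.getInstance("AES/GCM/NoPadding");'

def mutate_to_ecb(java_source: str) -> str:
    # ECB mode is insecure for most uses; a sound detector must report it.
    return java_source.replace('"AES/GCM/NoPadding"', '"AES/ECB/PKCS5Padding"')

mutant = mutate_to_ecb(SECURE)
print(mutant)
# A crypto-detector that stays silent on such a mutant has a flaw worth
# documenting, which is the kind of evidence MASC collects at scale.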
|
Cordy, Maxime |
ESEC/FSE '23-IVR: "Towards Strengthening Formal ..."
Towards Strengthening Formal Specifications with Mutation Model Checking
Maxime Cordy, Sami Lazreg, Axel Legay, and Pierre Yves Schobbens (University of Luxembourg, Luxembourg; Université Catholique de Louvain, Belgium; University of Namur, Belgium) We propose mutation model checking as an approach to strengthen formal specifications used for model checking. Inspired by mutation testing, our approach concludes that specifications are not strong enough if they fail to detect faults in purposely mutated models. Our preliminary experiments on two case studies confirm the relevance of the problem: their specification can only detect 40% and 60% of randomly generated mutants. As a result, we propose a framework to strengthen the original specification, such that the original model satisfies the strengthened specification but the mutants do not. @InProceedings{ESEC/FSE23p2102, author = {Maxime Cordy and Sami Lazreg and Axel Legay and Pierre Yves Schobbens}, title = {Towards Strengthening Formal Specifications with Mutation Model Checking}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2102--2106}, doi = {10.1145/3611643.3613080}, year = {2023}, } Publisher's Version |
|
Correnson, Arthur |
ESEC/FSE '23: "Engineering a Formally Verified ..."
Engineering a Formally Verified Automated Bug Finder
Arthur Correnson and Dominic Steinhöfel (CISPA Helmholtz Center for Information Security, Germany) Symbolic execution is a program analysis technique executing programs with symbolic instead of concrete inputs. This principle allows for exploring many program paths at once. Despite its wide adoption, in particular for program testing, little effort was dedicated to studying the semantic foundations of symbolic execution. Without these foundations, critical questions regarding the correctness of symbolic executors cannot be satisfyingly answered: Can a reported bug be reproduced, or is it a false positive (soundness)? Can we be sure to find all bugs if we let the testing tool run long enough (completeness)? This paper presents a systematic approach for engineering provably sound and complete symbolic execution-based bug finders by relating a programming language’s operational semantics with a symbolic semantics. In contrast to prior work on symbolic execution semantics, we address the correctness of critical implementation details of symbolic bug finders, including the search strategy and the role of constraint solvers to prune the search space. We showcase our approach by implementing WiSE, a prototype of a verified bug finder for an imperative language, in the Coq proof assistant and proving it sound and complete. We demonstrate that the design principles of WiSE survive outside the ecosystem of interactive proof assistants by (1) automatically extracting an OCaml implementation and (2) transforming WiSE to PyWiSE, a functionally equivalent Python version. @InProceedings{ESEC/FSE23p1165, author = {Arthur Correnson and Dominic Steinhöfel}, title = {Engineering a Formally Verified Automated Bug Finder}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1165--1176}, doi = {10.1145/3611643.3616290}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Cruz, Breno Dantas |
ESEC/FSE '23: "Design by Contract for Deep ..."
Design by Contract for Deep Learning APIs
Shibbir Ahmed, Sayem Mohammad Imtiaz, Samantha Syeda Khairunnesa, Breno Dantas Cruz, and Hridesh Rajan (Iowa State University, USA; Bradley University, USA) Deep Learning (DL) techniques are increasingly being incorporated in critical software systems today. DL software is buggy too. Recent work in SE has characterized these bugs, studied fix patterns, and proposed detection and localization strategies. In this work, we introduce a preventative measure. We propose design by contract for DL libraries, DL Contract for short, to document the properties of DL libraries and provide developers with a mechanism to identify bugs during development. While DL Contract builds on the traditional design by contract techniques, we need to address unique challenges. In particular, we need to document properties of the training process that are not visible at the functional interface of the DL libraries. To solve these problems, we have introduced mechanisms that allow developers to specify properties of the model architecture, data, and training process. We have designed and implemented DL Contract for Python-based DL libraries and used it to document the properties of Keras, a well-known DL library. We evaluate DL Contract in terms of effectiveness, runtime overhead, and usability. To evaluate the utility of DL Contract, we have developed 15 sample contracts specifically for training problems and structural bugs. We have adopted four well-vetted benchmarks from prior work on DL bug detection and repair. In terms of effectiveness, DL Contract correctly detects 259 bugs in the 272 real-world buggy programs from these benchmarks. We found that the overhead of DL Contract is fairly minimal on these benchmarks. Lastly, to evaluate the usability, we conducted a survey of twenty participants who have used DL Contract to find and fix bugs. The results reveal that DL Contract can be very helpful to DL application developers when debugging their code. @InProceedings{ESEC/FSE23p94, author = {Shibbir Ahmed and Sayem Mohammad Imtiaz and Samantha Syeda Khairunnesa and Breno Dantas Cruz and Hridesh Rajan}, title = {Design by Contract for Deep Learning APIs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {94--106}, doi = {10.1145/3611643.3616247}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
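A contract on a DL training entry point can be approximated with a precondition decorator, as in the toy below where the last-layer activation must match the loss. The model-metadata dict is an invented stand-in; DL Contract's actual mechanism specifies and checks properties inside Keras rather than relying on the caller to pass a description.

# Toy sketch of a design-by-contract check for a DL API.
import functools

def contract(precondition, message):
    def decorate(train_fn):
        @functools.wraps(train_fn)
        def wrapper(model_meta, *args, **kwargs):
            assert precondition(model_meta), message
            return train_fn(model_meta, *args, **kwargs)
        return wrapper
    return decorate

def activation_matches_loss(meta):
    pairs = {"binary_crossentropy": "sigmoid",
             "categorical_crossentropy": "softmax"}
    expected = pairs.get(meta["loss"])
    return expected is None or meta["last_activation"] == expected

@contract(activation_matches_loss,
          "contract violated: last-layer activation does not match loss")
def train(model_meta, epochs=1):
    print(f"training for {epochs} epoch(s)...")

train({"loss": "binary_crossentropy", "last_activation": "sigmoid"})
# train({"loss": "binary_crossentropy", "last_activation": "relu"})
# -> AssertionError: the structural bug is caught before training starts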
|
Csallner, Christoph |
ESEC/FSE '23-DEMO: "D2S2: Drag ’n’ Drop Mobile ..."
D2S2: Drag ’n’ Drop Mobile App Screen Search
Soumik Mohian, Tony Tang, Tuan Trinh, Don Dang, and Christoph Csallner (University of Texas at Arlington, USA) The lack of diverse UI element representations in publicly available datasets hinders the scalability of sketch-based interactive mobile search. This paper introduces D2S2, a novel approach that addresses this limitation via drag-and-drop mobile screen search, accommodating visual and text-based queries. D2S2 searches 58k Rico screens for relevant UI examples based on UI element attributes, including type, position, shape, and text. In an evaluation with 10 novice software developers D2S2 successfully retrieves target screens within the top-20 search results in 15/19 attempts within a minute. The tool offers interactive and iterative search, updating its search results each time the user modifies the search query. Interested users can freely access D2S2 (http://pixeltoapp.com/D2S2), build on D2S2 or replicate results via D2S2’s open-source implementation (https://github.com/toni-tang/D2S2), or watch D2S2’s video demonstration (https://youtu.be/fdoYiw8lAn0). @InProceedings{ESEC/FSE23p2177, author = {Soumik Mohian and Tony Tang and Tuan Trinh and Don Dang and Christoph Csallner}, title = {D2S2: Drag ’n’ Drop Mobile App Screen Search}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2177--2181}, doi = {10.1145/3611643.3613100}, year = {2023}, } Publisher's Version Video |
|
Cui, Heming |
ESEC/FSE '23: "Enhancing Coverage-Guided ..."
Enhancing Coverage-Guided Fuzzing via Phantom Program
Mingyuan Wu, Kunqiu Chen, Qi Luo, Jiahong Xiang, Ji Qi, Junjie Chen, Heming Cui, and Yuqun Zhang (Southern University of Science and Technology, China; University of Hong Kong, China; Tianjin University, China) For coverage-guided fuzzers, many of the adopted seeds are underused, exploring only limited program states, since essentially all their executions have to abide by rigorous program dependencies while only a few seeds are capable of accessing those dependencies. Moreover, even when iteratively executing such limited seeds, the fuzzers have to repeatedly access the covered program states before uncovering new states. Such facts indicate that the exploration power of seeds on program states has not been sufficiently leveraged by existing coverage-guided fuzzing strategies. To tackle these issues, we propose a coverage-guided fuzzer, namely MirageFuzz, to mitigate the program dependencies when executing seeds for enhancing their exploration power on program states. Specifically, MirageFuzz first creates a “phantom” program of the target program by reducing its program dependencies corresponding to conditional statements while retaining their original semantics. Accordingly, MirageFuzz performs dual fuzzing, i.e., the source fuzzing to fuzz the original program and the phantom fuzzing to fuzz the phantom program simultaneously. Then, MirageFuzz applies the taint-based mutation mechanism to generate a new seed by updating the target conditional statement of a given seed from the source fuzzing with the corresponding condition value derived by the phantom fuzzing. To evaluate the effectiveness of MirageFuzz, we build a benchmark suite with 18 projects commonly adopted by recent fuzzing papers, and select seven open-source fuzzers as baselines for performance comparison with MirageFuzz. The experimental results suggest that MirageFuzz outperforms our baseline fuzzers by 13.42% to 77.96% on average. Furthermore, MirageFuzz exposes 29 previously unknown bugs, of which 4 have been confirmed and 3 fixed by the corresponding developers. @InProceedings{ESEC/FSE23p1037, author = {Mingyuan Wu and Kunqiu Chen and Qi Luo and Jiahong Xiang and Ji Qi and Junjie Chen and Heming Cui and Yuqun Zhang}, title = {Enhancing Coverage-Guided Fuzzing via Phantom Program}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1037--1049}, doi = {10.1145/3611643.3616294}, year = {2023}, } Publisher's Version ESEC/FSE '23: "SJFuzz: Seed and Mutator Scheduling ..." SJFuzz: Seed and Mutator Scheduling for JVM Fuzzing Mingyuan Wu, Yicheng Ouyang, Minghai Lu, Junjie Chen, Yingquan Zhao, Heming Cui, Guowei Yang, and Yuqun Zhang (Southern University of Science and Technology, China; University of Hong Kong, China; Tianjin University, China; University of Queensland, Australia) While the Java Virtual Machine (JVM) plays a vital role in ensuring correct executions of Java applications, testing JVMs via generating and running class files on them can be rather challenging. The existing techniques, e.g., ClassFuzz and Classming, attempt to leverage the power of fuzzing and differential testing to cope with JVM intricacies by exposing discrepant execution results among different JVMs, i.e., inter-JVM discrepancies, for testing analytics. However, their adopted fuzzers are insufficiently guided since they include no well-designed seed and mutator scheduling mechanisms, leading to inefficient differential testing. 
To address such issues, in this paper, we propose SJFuzz, the first JVM fuzzing framework with seed and mutator scheduling mechanisms for automated JVM differential testing. Overall, SJFuzz aims to mutate class files via control flow mutators to facilitate the exposure of inter-JVM discrepancies. To this end, SJFuzz schedules seeds (class files) for mutations based on the discrepancy and diversity guidance. SJFuzz also schedules mutators for diversifying class file generation. To evaluate SJFuzz, we conduct an extensive study on multiple representative real-world JVMs, and the experimental results show that SJFuzz significantly outperforms the state-of-the-art mutation-based and generation-based JVM fuzzers in terms of the inter-JVM discrepancy exposure and the class file diversity. Moreover, SJFuzz successfully reported 46 potential JVM issues, of which 20 have been confirmed as bugs and 16 have been fixed by the JVM developers. @InProceedings{ESEC/FSE23p1062, author = {Mingyuan Wu and Yicheng Ouyang and Minghai Lu and Junjie Chen and Yingquan Zhao and Heming Cui and Guowei Yang and Yuqun Zhang}, title = {SJFuzz: Seed and Mutator Scheduling for JVM Fuzzing}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1062--1074}, doi = {10.1145/3611643.3616277}, year = {2023}, } Publisher's Version |
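The seed-scheduling idea in SJFuzz can be stripped down to a priority loop that prefers seeds scoring high on discrepancy and diversity signals. The scoring weights, decay, and seed records below are invented; the real system schedules class files and control-flow mutators against multiple JVMs.

# Stripped-down sketch of discrepancy/diversity-guided seed scheduling.
import heapq
import itertools

counter = itertools.count()  # tie-breaker so the heap never compares dicts

def score(seed):
    # Prefer seeds that exposed inter-JVM discrepancies and that cover
    # rarely-seen behavior (diversity); the weights are invented.
    return 2.0 * seed["discrepancies"] + 1.0 * seed["rare_coverage"]

def schedule(seeds, rounds):
    heap = [(-score(s), next(counter), s) for s in seeds]
    heapq.heapify(heap)
    for _ in range(rounds):
        _, _, seed = heapq.heappop(heap)
        yield seed["name"]
        # After mutating/running the seed, its score would be refreshed
        # from new observations; here we just decay it and reinsert.
        seed["rare_coverage"] *= 0.5
        heapq.heappush(heap, (-score(seed), next(counter), seed))

seeds = [
    {"name": "A.class", "discrepancies": 1, "rare_coverage": 1.0},
    {"name": "B.class", "discrepancies": 0, "rare_coverage": 4.0},
]
print(list(schedule(seeds, 4)))
# ['B.class', 'A.class', 'A.class', 'A.class'] with these weights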
|
Cui, Siwei |
ESEC/FSE '23-INDUSTRY: "Compositional Taint Analysis ..."
Compositional Taint Analysis for Enforcing Security Policies at Scale
Subarno Banerjee, Siwei Cui, Michael Emmi, Antonio Filieri, Liana Hadarean, Peixuan Li, Linghui Luo, Goran Piskachev, Nicolás Rosner, Aritra Sengupta, Omer Tripp, and Jingbo Wang (Amazon Web Services, USA; Texas A&M University, USA; Amazon Web Services, Germany; University of Southern California, USA) Automated static dataflow analysis is an effective technique for detecting security-critical issues like sensitive data leaks and vulnerability to injection attacks. Ensuring high precision and recall requires an analysis that is context-, field-, and object-sensitive. However, it is challenging to attain high precision and recall and scale to large industrial code bases. Compositional-style analyses, in which individual software components are analyzed separately, independent from their usage contexts, compute reusable summaries of components. This is an essential feature when deploying such analyses in CI/CD at code-review time or when scanning deployed container images. In both these settings the majority of software components stay the same between subsequent scans. However, it is not obvious how to extend such analyses to check the kind of contextual taint specifications that arise in practice, while maintaining compositionality. In this work we present contextual dataflow modeling, an extension to the compositional analysis to check complex taint specifications, significantly increasing recall and precision. Furthermore, we show how such high-fidelity analysis can scale in production using three key optimizations: (i) discarding intermediate results for previously-analyzed components, an optimization exploiting the compositional nature of our analysis; (ii) a scope-reduction analysis to reduce the scope of the taint analysis w.r.t. the taint specifications being checked, and (iii) caching of analysis models. We show a 9.85% reduction in false positive rate on a comprehensive test suite comprising the OWASP open-source benchmarks as well as internal real-world code samples. We measure the performance and scalability impact of each individual optimization using open source JVM packages from the Maven central repository and internal AWS service codebases. This combination of high precision, recall, performance, and scalability has allowed us to enforce security policies at scale both internally within Amazon as well as for external customers. @InProceedings{ESEC/FSE23p1985, author = {Subarno Banerjee and Siwei Cui and Michael Emmi and Antonio Filieri and Liana Hadarean and Peixuan Li and Linghui Luo and Goran Piskachev and Nicolás Rosner and Aritra Sengupta and Omer Tripp and Jingbo Wang}, title = {Compositional Taint Analysis for Enforcing Security Policies at Scale}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1985--1996}, doi = {10.1145/3611643.3613889}, year = {2023}, } Publisher's Version |
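The compositional style means each function carries a reusable summary of how taint flows through it, and call sites compose summaries instead of re-analyzing callee bodies. A small sketch with hand-written summaries follows; the paper's contextual dataflow modeling and specification language are far richer.

# Small sketch of compositional taint analysis: per-function summaries
# (which params flow to the return, which params reach a sink) composed
# at call sites. The summaries below are hand-written for illustration.
SUMMARIES = {
    # name: (params whose taint reaches the return, params that hit a sink)
    "read_request": (set(), set()),  # return value is a taint source
    "sanitize":     (set(), set()),  # taint is killed: nothing flows
    "render":       (set(), {0}),    # param 0 flows into an HTML sink
    "concat":       ({0, 1}, set()), # both params flow to the return
}
SOURCES = {"read_request"}

def call(fn, arg_taints):
    """Return (return_is_tainted, list of sink violations) for one call."""
    flows_to_ret, flows_to_sink = SUMMARIES[fn]
    ret = fn in SOURCES or any(arg_taints[i] for i in flows_to_ret)
    violations = [f"{fn}: tainted arg {i} reaches sink"
                  for i in flows_to_sink if arg_taints[i]]
    return ret, violations

# Compose summaries for: render(concat(read_request(), "<b>hi</b>"))
t_req, v1 = call("read_request", [])
t_cat, v2 = call("concat", [t_req, False])
_, v3 = call("render", [t_cat])
print(v1 + v2 + v3)  # ['render: tainted arg 0 reaches sink']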
|
Dagan, Ella |
ESEC/FSE '23: "Building and Sustaining Ethnically, ..."
Building and Sustaining Ethnically, Racially, and Gender Diverse Software Engineering Teams: A Study at Google
Ella Dagan, Anita Sarma, Alison Chang, Sarah D’Angelo, Jill Dicker, and Emerson Murphy-Hill (Google, USA; Oregon State University, USA) Teams that build software are largely demographically homogeneous. Without diversity, homogeneous perspectives dominate how, why, and for whom software is designed. To understand how teams can successfully build and sustain diversity, we interviewed 11 engineers and 9 managers from some of the most gender and racially diverse teams at Google, a large software company. Qualitatively analyzing the interviews, we found shared approaches to recruiting, hiring, and promoting an inclusive environment, all of which create a positive feedback loop. Our findings produce actionable practices that every member of the team can take to increase diversity by fostering a more inclusive software engineering environment. @InProceedings{ESEC/FSE23p631, author = {Ella Dagan and Anita Sarma and Alison Chang and Sarah D’Angelo and Jill Dicker and Emerson Murphy-Hill}, title = {Building and Sustaining Ethnically, Racially, and Gender Diverse Software Engineering Teams: A Study at Google}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {631--643}, doi = {10.1145/3611643.3616273}, year = {2023}, } Publisher's Version |
|
Damevski, Kostadin |
ESEC/FSE '23-IVR: "Towards Understanding Emotions ..."
Towards Understanding Emotions in Informal Developer Interactions: A Gitter Chat Study
Amirali Sajadi, Kostadin Damevski, and Preetha Chatterjee (Drexel University, USA; Virginia Commonwealth University, USA) Emotions play a significant role in teamwork and collaborative activities like software development. While researchers have analyzed developer emotions in various software artifacts (e.g., issues, pull requests), few studies have focused on understanding the broad spectrum of emotions expressed in chats. As one of the most widely used means of communication, chats contain valuable information in the form of informal conversations, such as negative perspectives about adopting a tool. In this paper, we present a dataset of developer chat messages manually annotated with a wide range of emotion labels (and sub-labels), and analyze the type of information present in those messages. We also investigate the unique signals of emotions specific to chats and distinguish them from other forms of software communication. Our findings suggest that chats have fewer expressions of Approval and Fear but more expressions of Curiosity compared to GitHub comments. We also notice that Confusion is frequently observed when discussing programming-related information such as unexpected software behavior. Overall, our study highlights the potential of mining emotions in developer chats for supporting software maintenance and evolution tools. @InProceedings{ESEC/FSE23p2097, author = {Amirali Sajadi and Kostadin Damevski and Preetha Chatterjee}, title = {Towards Understanding Emotions in Informal Developer Interactions: A Gitter Chat Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2097--2101}, doi = {10.1145/3611643.3613084}, year = {2023}, } Publisher's Version |
|
Dang, Don |
ESEC/FSE '23-DEMO: "D2S2: Drag ’n’ Drop Mobile ..."
D2S2: Drag ’n’ Drop Mobile App Screen Search
Soumik Mohian, Tony Tang, Tuan Trinh, Don Dang, and Christoph Csallner (University of Texas at Arlington, USA) The lack of diverse UI element representations in publicly available datasets hinders the scalability of sketch-based interactive mobile search. This paper introduces D2S2, a novel approach that addresses this limitation via drag-and-drop mobile screen search, accommodating visual and text-based queries. D2S2 searches 58k Rico screens for relevant UI examples based on UI element attributes, including type, position, shape, and text. In an evaluation with 10 novice software developers D2S2 successfully retrieves target screens within the top-20 search results in 15/19 attempts within a minute. The tool offers interactive and iterative search, updating its search results each time the user modifies the search query. Interested users can freely access D2S2 (http://pixeltoapp.com/D2S2), build on D2S2 or replicate results via D2S2’s open-source implementation (https://github.com/toni-tang/D2S2), or watch D2S2’s video demonstration (https://youtu.be/fdoYiw8lAn0). @InProceedings{ESEC/FSE23p2177, author = {Soumik Mohian and Tony Tang and Tuan Trinh and Don Dang and Christoph Csallner}, title = {D2S2: Drag ’n’ Drop Mobile App Screen Search}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2177--2181}, doi = {10.1145/3611643.3613100}, year = {2023}, } Publisher's Version Video |
|
Dang, Yingnong |
ESEC/FSE '23-INDUSTRY: "Assess and Summarize: Improve ..."
Assess and Summarize: Improve Outage Understanding with Large Language Models
Pengxiang Jin, Shenglin Zhang, Minghua Ma, Haozhe Li, Yu Kang, Liqun Li, Yudong Liu, Bo Qiao, Chaoyun Zhang, Pu Zhao, Shilin He, Federica Sarro, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang (Nankai University, China; Microsoft, China; Peking University, China; University College London, UK) Cloud systems have become increasingly popular in recent years due to their flexibility and scalability. Each time cloud computing applications and services hosted on the cloud are affected by a cloud outage, users can experience slow response times, connection issues or total service disruption, resulting in a significant negative business impact. Outages usually comprise several concurrent events/source causes, and therefore understanding the context of outages is a very challenging yet crucial first step toward mitigating and resolving outages. In current practice, on-call engineers with in-depth domain knowledge have to manually assess and summarize outages when they happen, which is time-consuming and labor-intensive. In this paper, we first present a large-scale empirical study investigating the way on-call engineers currently deal with cloud outages at Microsoft, and then present and empirically validate a novel approach (dubbed Oasis) to help the engineers in this task. Oasis is able to automatically assess the impact scope of outages as well as to produce human-readable summarization. Specifically, Oasis first assesses the impact scope of an outage by aggregating relevant incidents via multiple techniques. Then, it generates a human-readable summary by leveraging fine-tuned large language models like GPT-3.x. The impact assessment component of Oasis was introduced in Microsoft over three years ago, and it is now widely adopted, while the outage summarization component has been recently introduced, and in this article we present the results of an empirical evaluation we carried out on 18 real-world cloud systems as well as a human-based evaluation with outage owners. The results obtained show that Oasis can effectively and efficiently summarize outages, and have led Microsoft to deploy its first prototype, which is currently under experimental adoption by some of the incident teams. @InProceedings{ESEC/FSE23p1657, author = {Pengxiang Jin and Shenglin Zhang and Minghua Ma and Haozhe Li and Yu Kang and Liqun Li and Yudong Liu and Bo Qiao and Chaoyun Zhang and Pu Zhao and Shilin He and Federica Sarro and Yingnong Dang and Saravan Rajmohan and Qingwei Lin and Dongmei Zhang}, title = {Assess and Summarize: Improve Outage Understanding with Large Language Models}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1657--1668}, doi = {10.1145/3611643.3613891}, year = {2023}, } Publisher's Version |
|
Dao, Tung |
ESEC/FSE '23-INDUSTRY: "Triggering Modes in Spectrum-Based ..."
Triggering Modes in Spectrum-Based Multi-location Fault Localization
Tung Dao, Na Meng, and ThanhVu Nguyen (Cvent, USA; Virginia Tech, USA; George Mason University, USA) Spectrum-based fault localization (SBFL) techniques can aid in debugging, but their practicality in industrial settings has been limited due to the large number of tests needed to execute before applying SBFL. Previous research has explored different trigger modes for SBFL and found that applying it immediately after the first test failure is also effective. However, this study only considered single-location bugs, while multi-location bugs are prevalent in real-world scenarios and especially at our company Cvent, which is interested in integrating SBFL into its CI/CD workflow. In this work, we investigate the effectiveness of SBFL on multi-location bugs and propose a framework called Instant Fault Localization for Multi-location Bugs (IFLM). We compare and evaluate four trigger modes of IFLM using open-source (Defects4J) and closed-source (Cvent) bug datasets. Our study showed that it is not necessary to execute all test cases before applying SBFL. However, we also found that applying SBFL right after the first failed test is less effective than applying it after executing all tests for multi-location bugs, which is contrary to the single-location bug study. We also observe differences in performance between real and artificial bugs. Our contributions include the development of IFLM and the Cvent bug datasets, analysis of SBFL effectiveness for multi-location bugs, and practical implications for integrating SBFL in industrial environments. @InProceedings{ESEC/FSE23p1774, author = {Tung Dao and Na Meng and ThanhVu Nguyen}, title = {Triggering Modes in Spectrum-Based Multi-location Fault Localization}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1774--1785}, doi = {10.1145/3611643.3613884}, year = {2023}, } Publisher's Version Info
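SBFL assigns each program element a suspiciousness score computed from pass/fail coverage counts, whatever the trigger mode. The sketch below uses Ochiai, one common formula (the abstract does not fix one), over an invented coverage matrix.

# Sketch of the spectrum-based step that the trigger modes feed:
# rank statements by a suspiciousness score from pass/fail coverage.
import math

# coverage[test] = (set of covered statements, test passed?)
coverage = {
    "t1": ({1, 2, 3}, True),
    "t2": ({1, 3, 4}, False),
    "t3": ({1, 4}, False),
}

total_failed = sum(1 for _, passed in coverage.values() if not passed)

def ochiai(stmt):
    ef = sum(1 for stmts, passed in coverage.values() if stmt in stmts and not passed)
    ep = sum(1 for stmts, passed in coverage.values() if stmt in stmts and passed)
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0

stmts = sorted(set().union(*(s for s, _ in coverage.values())))
ranking = sorted(stmts, key=ochiai, reverse=True)
print([(s, round(ochiai(s), 2)) for s in ranking])
# Statement 4 (covered only by failing tests) ranks most suspicious.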
|
Davis, James C. |
ESEC/FSE '23-IVR: "Reflecting on the Use of the ..."
Reflecting on the Use of the Policy-Process-Product Theory in Empirical Software Engineering
Kelechi G. Kalu, Taylor R. Schorlemmer, Sophie Chen, Kyle A. Robinson, Erik Kocinare, and James C. Davis (Purdue University, USA; University of Michigan, USA) The primary theory of software engineering is that an organization’s Policies and Processes influence the quality of its Products. We call this the PPP Theory. Although empirical software engineering research has grown common, it is unclear whether researchers are trying to evaluate the PPP Theory. To assess this, we analyzed half (33) of the empirical works published over the last two years in three prominent software engineering conferences. In this sample, 70% focus on policies/processes or products, not both. Only 33% provided measurements relating policy/process and products. We make four recommendations: (1) Use PPP Theory in study design; (2) Study feedback relationships; (3) Diversify the studied feed-forward relationships; and (4) Disentangle policy and process. Let us remember that research results are in the context of, and with respect to, the relationship between software products, processes, and policies. @InProceedings{ESEC/FSE23p2112, author = {Kelechi G. Kalu and Taylor R. Schorlemmer and Sophie Chen and Kyle A. Robinson and Erik Kocinare and James C. Davis}, title = {Reflecting on the Use of the Policy-Process-Product Theory in Empirical Software Engineering}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2112--2116}, doi = {10.1145/3611643.3613075}, year = {2023}, } Publisher's Version |
|
Davis, Matthew C. |
ESEC/FSE '23: "NaNofuzz: A Usable Tool for ..."
NaNofuzz: A Usable Tool for Automatic Test Generation
Matthew C. Davis, Sangheon Choi, Sam Estep, Brad A. Myers, and Joshua Sunshine (Carnegie Mellon University, USA; Rose-Hulman Institute of Technology, USA) In the United States alone, software testing labor is estimated to cost $48 billion USD per year. Despite widespread test execution automation and automation in other areas of software engineering, test suites continue to be created manually by software engineers. We have built a test generation tool, called NaNofuzz, that helps users find bugs in their code by suggesting tests where the output is likely indicative of a bug, e.g., that return NaN (not-a-number) values. NaNofuzz is an interactive tool embedded in a development environment to fit into the programmer's workflow. NaNofuzz tests a function with as little as one button press, analyses the program to determine inputs it should evaluate, executes the program on those inputs, and categorizes outputs to prioritize likely bugs. We conducted a randomized controlled trial with 28 professional software engineers using NaNofuzz as the intervention treatment and the popular manual testing tool, Jest, as the control treatment. Participants using NaNofuzz on average identified bugs more accurately (p < .05, by 30%), were more confident in their tests (p < .03, by 20%), and finished their tasks more quickly (p < .007, by 30%). @InProceedings{ESEC/FSE23p1114, author = {Matthew C. Davis and Sangheon Choi and Sam Estep and Brad A. Myers and Joshua Sunshine}, title = {NaNofuzz: A Usable Tool for Automatic Test Generation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1114--1126}, doi = {10.1145/3611643.3616327}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
|
De la Cruz, Alejandro |
ESEC/FSE '23: "LibKit: Detecting Third-Party ..."
LibKit: Detecting Third-Party Libraries in iOS Apps
Daniel Domínguez-Álvarez, Alejandro de la Cruz, Alessandra Gorla, and Juan Caballero (IMDEA Software Institute, Spain; University of Verona, Italy) We present LibKit, the first approach and tool for detecting the name and version of third-party libraries (TPLs) present in iOS apps. LibKit automatically builds fingerprints for 86K library versions available through the CocoaPods dependency manager and matches them on the decrypted app executables to identify the TPLs (name and version) an iOS app uses. LibKit supports apps written in Swift and Objective-C, detects statically and dynamically linked libraries, and addresses challenges such as partially included libraries and different compiler versions and configurations producing variants of the same library version. On a ground truth of 95 open-source apps, LibKit identifies libraries with a precision of 0.911 and a recall of 0.839. LibKit also significantly outperforms the state-of-the-art CRiOS tool for identifying TPL boundaries. When applied to 1,500 apps from the iTunes Store, LibKit detects 47,015 library versions, identifying popular apps that contain old library versions. @InProceedings{ESEC/FSE23p1407, author = {Daniel Domínguez-Álvarez and Alejandro de la Cruz and Alessandra Gorla and Juan Caballero}, title = {LibKit: Detecting Third-Party Libraries in iOS Apps}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1407--1418}, doi = {10.1145/3611643.3616344}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available |
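The matching idea described here, fingerprint each library version and test whether that fingerprint is present in the app executable, can be sketched as set containment over extracted symbols; the symbol-set representation and the threshold are illustrative assumptions, not LibKit's actual fingerprints:

```python
def fingerprint(symbols):
    """A library-version fingerprint as a frozen set of symbol names
    (LibKit's real fingerprints are richer; this is a stand-in)."""
    return frozenset(symbols)

def match_libraries(app_symbols, fingerprints, threshold=0.8):
    """fingerprints: {(lib_name, version): fingerprint}. Reports versions
    whose fingerprint is mostly contained in the app's symbol set."""
    hits = []
    for (name, version), fp in fingerprints.items():
        if not fp:
            continue
        containment = len(fp & app_symbols) / len(fp)
        if containment >= threshold:
            hits.append((name, version, containment))
    return sorted(hits, key=lambda h: -h[2])
```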
|
Deng, Yuetang |
ESEC/FSE '23-INDUSTRY: "Towards Efficient Record and ..."
Towards Efficient Record and Replay: A Case Study in WeChat
Sidong Feng, Haochuan Lu, Ting Xiong, Yuetang Deng, and Chunyang Chen (Monash University, Australia; Tencent, China) WeChat, a widely-used messenger app boasting over 1 billion monthly active users, requires effective app quality assurance for its complex features. Record-and-replay tools are crucial in achieving this goal. Despite the extensive development of these tools, the impact of waiting time between replay events has been largely overlooked. On one hand, a long waiting time for executing replay events on fully-rendered GUIs slows down the process. On the other hand, a short waiting time can lead to events executing on partially-rendered GUIs, negatively affecting replay effectiveness. An optimal waiting time should strike a balance between effectiveness and efficiency. We introduce WeReplay, a lightweight image-based approach that dynamically adjusts inter-event time based on the GUI rendering state. Given the real-time GUI stream, WeReplay employs a deep learning model to infer the rendering state and synchronize with the replaying tool, scheduling the next event when the GUI is fully rendered. Our evaluation shows that our model achieves 92.1% precision and 93.3% recall in discerning GUI rendering states in the WeChat app. Through assessing the performance in replaying 23 common WeChat usage scenarios, WeReplay successfully replays all scenarios on the same and different devices more efficiently than the state-of-the-practice baselines. @InProceedings{ESEC/FSE23p1681, author = {Sidong Feng and Haochuan Lu and Ting Xiong and Yuetang Deng and Chunyang Chen}, title = {Towards Efficient Record and Replay: A Case Study in WeChat}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1681--1692}, doi = {10.1145/3611643.3613880}, year = {2023}, } Publisher's Version ESEC/FSE '23-INDUSTRY: "A Unified Framework for Mini-game ..." A Unified Framework for Mini-game Testing: Experience on WeChat Chaozheng Wang, Haochuan Lu, Cuiyun Gao, Zongjie Li, Ting Xiong, and Yuetang Deng (Chinese University of Hong Kong, China; Tencent, China; Hong Kong University of Science and Technology, China) Mobile games play an increasingly important role in our daily life. The quality of mobile games can substantially affect the user experience and game revenue. Different from traditional mobile games, the mini-games provided by our partner, Tencent, are embedded in the mobile app WeChat, so users do not need to install specific game apps and can directly play the games in the app. Due to the convenient installation, WeChat has attracted large numbers of developers to design and publish on the mini-game platform in the app. To date, the platform hosts more than one hundred thousand published mini-games. Manually testing all the mini-games requires enormous effort and is impractical. There exist automated game testing methods; however, they are difficult to apply to mini-game testing for the following reasons: 1) Effective game testing heavily relies on prior knowledge about game operations and extraction of GUI widget trees. However, this knowledge is game-specific and not always applicable when testing a large number of mini-games with complex game engines (e.g., Unity). 2) The highly diverse GUI widget design of mini-games deviates significantly from that of mobile apps. This issue prevents existing image-based GUI widget detection techniques from effectively detecting widgets in mini-games. 
To address the aforementioned issues, we propose a unified framework for black-box mini-game testing named iExplorer. iExplorer involves a mixed GUI widget detection approach incorporating both deep learning-based object detection and edge aggregation-based segmentation for detecting GUI widgets in mini-games. A category-aware testing strategy is then proposed for testing mini-games, with different categories of widgets (e.g., sliding and clicking widgets) considered. iExplorer has been deployed for more than six months. In the past 30 days, iExplorer has tested 76,000 mini-games and successfully found 22,144 real bugs. @InProceedings{ESEC/FSE23p1623, author = {Chaozheng Wang and Haochuan Lu and Cuiyun Gao and Zongjie Li and Ting Xiong and Yuetang Deng}, title = {A Unified Framework for Mini-game Testing: Experience on WeChat}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1623--1634}, doi = {10.1145/3611643.3613868}, year = {2023}, } Publisher's Version ESEC/FSE '23: "How Practitioners Expect Code ..." How Practitioners Expect Code Completion? Chaozheng Wang, Junhao Hu, Cuiyun Gao, Yu Jin, Tao Xie, Hailiang Huang, Zhenyu Lei, and Yuetang Deng (Chinese University of Hong Kong, China; Peking University, China; Tencent, USA; Tencent, China) Code completion has become a common practice for programmers during their daily programming activities. It automatically predicts the next tokens or statements that the programmers may use. Code completion aims to substantially save keystrokes and improve programming efficiency for programmers. Although there exists substantial research on code completion, it is still unclear what practitioner expectations are for code completion and whether these expectations are met by the existing research. To address these questions, we perform a study by first interviewing 15 professionals and then surveying 599 practitioners from 18 IT companies about their expectations for code completion. We then compare the practitioner expectations with the existing research by conducting a literature review of papers on code completion published in major publication venues from 2012 to 2022. Based on the comparison, we highlight the directions desirable for researchers to invest efforts toward developing code completion techniques that meet practitioner expectations. @InProceedings{ESEC/FSE23p1294, author = {Chaozheng Wang and Junhao Hu and Cuiyun Gao and Yu Jin and Tao Xie and Hailiang Huang and Zhenyu Lei and Yuetang Deng}, title = {How Practitioners Expect Code Completion?}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1294--1306}, doi = {10.1145/3611643.3616280}, year = {2023}, } Publisher's Version |
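Returning to WeReplay, the first entry above: its core loop dispatches each recorded event only once a model judges the GUI fully rendered, rather than sleeping for a fixed interval. A minimal sketch in which `capture_frame`, `is_fully_rendered`, and the event objects are placeholder interfaces:

```python
import time

def replay(events, capture_frame, is_fully_rendered, timeout_s=5.0):
    """Dispatch recorded events, waiting between them only until a model
    judges the GUI fully rendered (is_fully_rendered stands in for
    WeReplay's deep-learning inference on the GUI stream)."""
    for event in events:
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            if is_fully_rendered(capture_frame()):
                break
            time.sleep(0.05)  # poll the GUI stream
        event.dispatch()  # fall back to dispatching at the timeout
```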
|
Derakhshanfar, Pouria |
ESEC/FSE '23-INDUSTRY: "BFSig: Leveraging File Significance ..."
BFSig: Leveraging File Significance in Bus Factor Estimation
Vahid Haratian, Mikhail Evtikhiev, Pouria Derakhshanfar, Eray Tüzün, and Vladimir Kovalenko (Bilkent University, Türkiye; JetBrains Research, Cyprus; JetBrains Research, Netherlands) Software projects experience the departure of developers for various reasons. As developers are one of the main sources of knowledge in software projects, their absence will inevitably result in a certain degree of knowledge depletion. Bus Factor (BF) is a metric to evaluate how this knowledge loss can affect the project’s continuity. Conventionally, BF is calculated as the smallest set of developers whose departure would remove over half of the project’s knowledge. Current state-of-the-art approaches measure developers’ knowledge by the number of authored files, utilizing version control system (VCS) information. However, numerous studies have shown that files in software projects have different significance. In this study, we explore how weighting files according to their significance affects the performance of two prevailing BF estimators. We derive significance scores by computing five well-known graph metrics from the project’s dependency graph: PageRank, In-/Out-/All-Degree, and Betweenness Centralities. Furthermore, we introduce BFSig, a prototype of our approach. Finally, we present a new dataset comprising reported BF scores collected by surveying software practitioners from five prominent GitHub repositories. Our results indicate that BFSig outperforms the baselines by up to an 18% reduction in terms of Normalized Mean Absolute Error (NMAE). Moreover, BFSig yields 18% fewer False Negatives in identifying potential risks associated with low BF. In addition, our respondents confirmed BFSig’s versatility by showing its ability to assess the BF of the project’s subfolders. In conclusion, we believe that, to estimate BF from authorship, software components of higher importance should be assigned heavier weights. Currently, BFSig exclusively explores the topological characteristics of these components. Nevertheless, considering attributes such as code complexity and bug proneness could potentially enhance the performance of BFSig. @InProceedings{ESEC/FSE23p1926, author = {Vahid Haratian and Mikhail Evtikhiev and Pouria Derakhshanfar and Eray Tüzün and Vladimir Kovalenko}, title = {BFSig: Leveraging File Significance in Bus Factor Estimation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1926--1936}, doi = {10.1145/3611643.3613877}, year = {2023}, } Publisher's Version |
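The central move in this abstract, weighting each file's contribution to a developer's knowledge by a graph centrality of that file, can be sketched with networkx. The greedy bus-factor estimate and the additive authorship model below are simplifications (they ignore co-authorship overlap), not BFSig's actual estimators:

```python
import networkx as nx

def file_significance(dep_graph):
    """PageRank over the project's file dependency graph, one of the
    five centralities the paper evaluates."""
    return nx.pagerank(dep_graph)

def bus_factor(authorship, significance):
    """authorship: {developer: set of authored files}. Greedily remove
    the most knowledgeable developer until over half the weighted
    knowledge is gone; the number removed approximates the BF."""
    total = sum(significance.values())
    remaining = dict(authorship)
    removed, lost = 0, 0.0
    while lost <= total / 2 and remaining:
        dev = max(remaining, key=lambda d: sum(significance.get(f, 0.0)
                                               for f in remaining[d]))
        lost += sum(significance.get(f, 0.0) for f in remaining.pop(dev))
        removed += 1
    return removed
```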
|
Dhaouadi, Mouna |
ESEC/FSE '23-SRC: "A Data Set of Extracted Rationale ..."
A Data Set of Extracted Rationale from Linux Kernel Commit Messages
Mouna Dhaouadi (Université de Montréal, Canada) Developers’ commit messages contain information about decisions taken and their rationale. Extracting this information is challenging since we lack a detailed understanding of how developers express these concepts. Our work-in-progress targets this challenge by producing a labelled data set of commit messages for a Linux Kernel component. We report preliminary analyses suggesting that larger commit messages and commits by more experienced developers tend towards having 40% of their sentences contain rationale. This may indicate a guideline for developers to target. @InProceedings{ESEC/FSE23p2187, author = {Mouna Dhaouadi}, title = {A Data Set of Extracted Rationale from Linux Kernel Commit Messages}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2187--2188}, doi = {10.1145/3611643.3617851}, year = {2023}, } Publisher's Version |
|
Diao, Zulong |
ESEC/FSE '23-INDUSTRY: "Beyond Sharing: Conflict-Aware ..."
Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection
Haotian Si, Changhua Pei, Zhihan Li, Yadong Zhao, Jingjing Li, Haiming Zhang, Zulong Diao, Jianhui Li, Gaogang Xie, and Dan Pei (Computer Network Information Center at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Kuaishou Technology, China; Institute of Computing Technology at Chinese Academy of Sciences, China; Purple Mountain Laboratories, China; Tsinghua University, China) Massive numbers of key performance indicators (KPIs) are monitored as multivariate time series (MTS) data to ensure the reliability of software applications and service systems. Accurately detecting MTS anomalies is critical for subsequent fault elimination. The scarcity of anomalies and of manual labels has led to the development of various self-supervised MTS anomaly detection (AD) methods, which optimize an overall objective/loss encompassing all metrics' regression objectives/losses. However, our empirical study uncovers the prevalence of conflicts among metrics' regression objectives, causing MTS models to grapple with different losses. This critical aspect significantly impacts detection performance but has been overlooked in existing approaches. To address this problem, by mimicking the design of multi-gate mixture-of-experts (MMoE), we introduce CAD, a Conflict-aware multivariate KPI Anomaly Detection algorithm. CAD offers an exclusive structure for each metric to mitigate potential conflicts while fostering inter-metric promotions. Upon thorough investigation, we find that the poor performance of vanilla MMoE mainly comes from the input-output misalignment settings of the MTS formulation and from convergence issues arising from expansive tasks. To address these challenges, we propose a straightforward yet effective task-oriented metric selection and p&s (personalized and shared) gating mechanism, which establishes CAD as the first practicable multi-task learning (MTL) based MTS AD model. Evaluations on multiple public datasets reveal that CAD obtains an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods. Our code is accessible at https://github.com/dawnvince/MTS_CAD. @InProceedings{ESEC/FSE23p1635, author = {Haotian Si and Changhua Pei and Zhihan Li and Yadong Zhao and Jingjing Li and Haiming Zhang and Zulong Diao and Jianhui Li and Gaogang Xie and Dan Pei}, title = {Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1635--1645}, doi = {10.1145/3611643.3613896}, year = {2023}, } Publisher's Version Info |
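The p&s (personalized and shared) gating mentioned here can be pictured as one shared gate plus a per-metric gate over a pool of experts. A minimal PyTorch sketch under that reading; the layer shapes and the additive combination of gate logits are assumptions, not CAD's published architecture:

```python
import torch
import torch.nn as nn

class PSGatedExperts(nn.Module):
    """MMoE-style expert pool with a shared gate plus one personalized
    gate per monitored metric."""
    def __init__(self, n_metrics, n_experts, d_in, d_out):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(d_in, d_out) for _ in range(n_experts))
        self.personal_gates = nn.ModuleList(
            nn.Linear(d_in, n_experts) for _ in range(n_metrics))
        self.shared_gate = nn.Linear(d_in, n_experts)

    def forward(self, x, metric_idx):
        # x: (batch, d_in); returns (batch, d_out) for the given metric
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, d_out, E)
        logits = self.personal_gates[metric_idx](x) + self.shared_gate(x)
        weights = torch.softmax(logits, dim=-1)                   # (B, E)
        return (outs * weights.unsqueeze(1)).sum(dim=-1)
```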
|
Dicker, Jill |
ESEC/FSE '23: "Building and Sustaining Ethnically, ..."
Building and Sustaining Ethnically, Racially, and Gender Diverse Software Engineering Teams: A Study at Google
Ella Dagan, Anita Sarma, Alison Chang, Sarah D’Angelo, Jill Dicker, and Emerson Murphy-Hill (Google, USA; Oregon State University, USA) Teams that build software are largely demographically homogeneous. Without diversity, homogeneous perspectives dominate how, why, and for whom software is designed. To understand how teams can successfully build and sustain diversity, we interviewed 11 engineers and 9 managers from some of the most gender and racially diverse teams at Google, a large software company. Qualitatively analyzing the interviews, we found shared approaches to recruiting, hiring, and promoting an inclusive environment, all of which create a positive feedback loop. Our findings produce actionable practices that every member of the team can take to increase diversity by fostering a more inclusive software engineering environment. @InProceedings{ESEC/FSE23p631, author = {Ella Dagan and Anita Sarma and Alison Chang and Sarah D’Angelo and Jill Dicker and Emerson Murphy-Hill}, title = {Building and Sustaining Ethnically, Racially, and Gender Diverse Software Engineering Teams: A Study at Google}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {631--643}, doi = {10.1145/3611643.3616273}, year = {2023}, } Publisher's Version |
|
Di Francesco, Mario |
ESEC/FSE '23-INDUSTRY: "Analyzing Microservice Connectivity ..."
Analyzing Microservice Connectivity with Kubesonde
Jacopo Bufalino, Mario Di Francesco, and Tuomas Aura (Aalto University, Finland; Eficode, Finland) Modern cloud-based applications are composed of several microservices that interact over a network. They are complex distributed systems, to the point that developers may not even be aware of how microservices connect to each other and to the Internet. As a consequence, the security of these applications can be greatly compromised. This work explicitly targets this context by providing a methodology to assess microservice connectivity, a software tool that implements it, and findings from analyzing real cloud applications. Specifically, it introduces Kubesonde, a cloud-native software that instruments live applications running on a Kubernetes cluster to analyze microservice connectivity, with minimal impact on performance. An assessment of microservices in 200 popular cloud applications with Kubesonde revealed significant issues in terms of network isolation: more than 60% of them had discrepancies between their declared and actual connectivity, and none restricted outbound connections towards the Internet. Our analysis shows that Kubesonde offers valuable insights on the connectivity between microservices, beyond what is possible with existing tools. @InProceedings{ESEC/FSE23p2038, author = {Jacopo Bufalino and Mario Di Francesco and Tuomas Aura}, title = {Analyzing Microservice Connectivity with Kubesonde}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2038--2043}, doi = {10.1145/3611643.3613899}, year = {2023}, } Publisher's Version |
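The headline finding, discrepancies between declared and actual connectivity, reduces to a set comparison once both views are captured as directed service-to-service edges. A sketch of that comparison (the live-cluster probing that Kubesonde performs is out of scope here):

```python
def connectivity_discrepancies(declared, observed):
    """declared/observed: sets of (src_service, dst_service) edges.
    Returns links that work but were never declared, and links that
    were declared but are not actually reachable."""
    undeclared = observed - declared   # potential isolation violations
    unreachable = declared - observed  # stale or broken declarations
    return undeclared, unreachable

undeclared, unreachable = connectivity_discrepancies(
    declared={("frontend", "api"), ("api", "db")},
    observed={("frontend", "api"), ("api", "db"), ("frontend", "db")},
)
# ("frontend", "db") is reported as an undeclared, reachable link
```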
|
Ding, Ruomeng |
ESEC/FSE '23-INDUSTRY: "TraceDiag: Adaptive, Interpretable, ..."
TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems
Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Xiaomin Wu, Meng Zhang, Qingjun Chen, Xin Gao, Xuedong Gao, Hao Fan, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang (Microsoft, China; Microsoft 365, China; Microsoft 365, USA) Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. However, performing RCA on modern microservice systems can be challenging due to their large scale, as they usually comprise hundreds of components, leading to significant human effort. This paper proposes TraceDiag, an end-to-end RCA framework that addresses the challenges for large-scale microservice systems. It leverages reinforcement learning to learn a pruning policy for the service dependency graph to automatically eliminate redundant components, thereby significantly improving the RCA efficiency. The learned pruning policy is interpretable and fully adaptive to new RCA instances. With the pruned graph, a causal-based method can be executed with high accuracy and efficiency. The proposed TraceDiag framework is evaluated on real data traces collected from the Microsoft Exchange system, and demonstrates superior performance compared to state-of-the-art RCA approaches. Notably, TraceDiag has been integrated as a critical component in Microsoft M365 Exchange, resulting in a significant improvement in the system's reliability and a considerable reduction in the human effort required for RCA. @InProceedings{ESEC/FSE23p1762, author = {Ruomeng Ding and Chaoyun Zhang and Lu Wang and Yong Xu and Minghua Ma and Xiaomin Wu and Meng Zhang and Qingjun Chen and Xin Gao and Xuedong Gao and Hao Fan and Saravan Rajmohan and Qingwei Lin and Dongmei Zhang}, title = {TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1762--1773}, doi = {10.1145/3611643.3613864}, year = {2023}, } Publisher's Version |
|
Ding, Zishuo |
ESEC/FSE '23: "IoPV: On Inconsistent Option ..."
IoPV: On Inconsistent Option Performance Variations
Jinfu Chen, Zishuo Ding, Yiming Tang, Mohammed Sayagh, Heng Li, Bram Adams, and Weiyi Shang (Wuhan University, China; University of Waterloo, Canada; Rochester Institute of Technology, USA; ÉTS, Canada; Polytechnique Montréal, Canada; Queen’s University, Canada) Maintaining good performance is an essential task when evolving a software system. Performance regression issues are among the dominant problems that large software systems face. In addition, these large systems tend to be highly configurable, which allows users to change the behaviour of these systems by simply altering the values of certain configuration options. However, such flexibility comes with a cost. Such software systems suffer throughout their evolution from what we refer to as “Inconsistent Option Performance Variation” (IoPV). An IoPV indicates, for a given commit, that the performance regression or improvement of different values of the same configuration option is inconsistent compared to the prior commit. For instance, a new change might not suffer from any performance regression under the default configuration (i.e., when all the options are set to their default values), while altering one option’s value manifests a regression, which we refer to as a hidden regression as it is not manifested under the default configuration. Similarly, when developers improve the performance of their systems, performance regression might be manifested under a subset of the existing configurations. Unfortunately, such hidden regressions are harmful as they can go unseen into the production environment. In this paper, we first quantify how prevalent (in)consistent performance regression or improvement is among the values of an option. In particular, we study 803 Hadoop and 502 Cassandra commits, for which we execute a total of 4,902 and 4,197 tests, respectively, amounting to 12,536 machine hours of testing. We observe that IoPV is a common problem that is difficult to manually predict. 69% and 93% of the Hadoop and Cassandra commits have at least one configuration that hides a performance regression. Worse, most of the commits have different options or tests leading to IoPV and hiding performance regressions. Therefore, we propose a prediction model that identifies whether a given combination of commit, test, and option (CTO) manifests an IoPV. Our evaluation for different models shows that random forest is the best-performing classifier, with a median AUC of 0.91 and 0.82 for Hadoop and Cassandra, respectively. Our paper defines and provides scientific evidence about the IoPV problem and its prevalence, which can be explored by future work. In addition, we provide an initial machine learning model for predicting IoPV. @InProceedings{ESEC/FSE23p845, author = {Jinfu Chen and Zishuo Ding and Yiming Tang and Mohammed Sayagh and Heng Li and Bram Adams and Weiyi Shang}, title = {IoPV: On Inconsistent Option Performance Variations}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {845--857}, doi = {10.1145/3611643.3616319}, year = {2023}, } Publisher's Version |
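The prediction step, classifying whether a commit/test/option (CTO) combination manifests an IoPV, maps onto a standard supervised setup. A minimal scikit-learn sketch with invented toy features; the paper's real feature set is more extensive:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Each row is one CTO combination; the features are illustrative
# assumptions (lines changed, test duration, option cardinality).
X = [
    [120, 4.2, 3],
    [8,   0.9, 2],
    [310, 7.5, 5],
    [15,  1.1, 2],
]
y = [1, 0, 1, 0]  # 1 = manifests an IoPV, 0 = consistent

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=2, scoring="roc_auc")  # AUC, as in the paper
```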
|
Domínguez-Álvarez, Daniel |
ESEC/FSE '23: "LibKit: Detecting Third-Party ..."
LibKit: Detecting Third-Party Libraries in iOS Apps
Daniel Domínguez-Álvarez, Alejandro de la Cruz, Alessandra Gorla, and Juan Caballero (IMDEA Software Institute, Spain; University of Verona, Italy) We present LibKit, the first approach and tool for detecting the name and version of third-party libraries (TPLs) present in iOS apps. LibKit automatically builds fingerprints for 86K library versions available through the CocoaPods dependency manager and matches them on the decrypted app executables to identify the TPLs (name and version) an iOS app uses. LibKit supports apps written in Swift and Objective-C, detects statically and dynamically linked libraries, and addresses challenges such as partially included libraries and different compiler versions and configurations producing variants of the same library version. On a ground truth of 95 open-source apps, LibKit identifies libraries with a precision of 0.911 and a recall of 0.839. LibKit also significantly outperforms the state-of-the-art CRiOS tool for identifying TPL boundaries. When applied to 1,500 apps from the iTunes Store, LibKit detects 47,015 library versions, identifying popular apps that contain old library versions. @InProceedings{ESEC/FSE23p1407, author = {Daniel Domínguez-Álvarez and Alejandro de la Cruz and Alessandra Gorla and Juan Caballero}, title = {LibKit: Detecting Third-Party Libraries in iOS Apps}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1407--1418}, doi = {10.1145/3611643.3616344}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available |
|
Dong, Jin Song |
ESEC/FSE '23: "DeepDebugger: An Interactive ..."
DeepDebugger: An Interactive Time-Travelling Debugging Approach for Deep Classifiers
Xianglin Yang, Yun Lin, Yifan Zhang, Linpeng Huang, Jin Song Dong, and Hong Mei (Shanghai Jiao Tong University, China; National University of Singapore, Singapore) A deep classifier is usually trained to (i) learn the numeric representation vector of samples and (ii) classify sample representations with learned classification boundaries. Time-travelling visualization, as an explainable AI technique, is designed to transform the model training dynamics into an animation of canvas with colorful dots and territories. Although the training dynamics of high-level concepts such as sample representations and classification boundaries are now observable, model developers can still be overwhelmed by tens of thousands of moving dots across hundreds of training epochs (i.e., frames in the animation), which makes them miss important training events. In this work, we make the first attempt to develop model time-travelling visualizers into model time-travelling debuggers, for practical use in model debugging tasks. Specifically, given an animation of model training dynamics of sample representation and classification landscape, we propose the DeepDebugger solution to recommend the samples of user interest in a human-in-the-loop manner. On one hand, DeepDebugger monitors the training dynamics of samples and recommends suspicious samples based on their abnormality. On the other hand, our recommendation is interactive and fault-resilient for the model developers to explore the training process. By learning from users’ feedback, DeepDebugger refines its recommendations to fit their intention. Our extensive experiments on applying DeepDebugger on the known time-travelling visualizers show that DeepDebugger can (1) detect the majority of the abnormal movement of the training samples on canvas; (2) significantly boost the recommendation performance of samples of interest (5-10X more accurate than the baselines) with the runtime overhead of 0.015s per feedback; (3) be resilient under 3%, 5%, and 10% mistaken user feedback. Our user study of the tool shows that the interactive recommendation of DeepDebugger can help the participants accomplish the debugging tasks by saving 18.1% completion time and boosting the performance by 20.3%. @InProceedings{ESEC/FSE23p973, author = {Xianglin Yang and Yun Lin and Yifan Zhang and Linpeng Huang and Jin Song Dong and Hong Mei}, title = {DeepDebugger: An Interactive Time-Travelling Debugging Approach for Deep Classifiers}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {973--985}, doi = {10.1145/3611643.3616252}, year = {2023}, } Publisher's Version |
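The monitoring idea, flagging samples whose movement on the canvas across epochs is abnormal, can be sketched as an outlier test on per-epoch displacements. The z-score rule below is an illustrative stand-in for DeepDebugger's learned recommendation:

```python
import numpy as np

def suspicious_samples(positions, z_thresh=3.0):
    """positions: array of shape (epochs, samples, 2), the 2-D canvas
    coordinates of each sample per epoch. Flags samples whose total
    displacement across epochs is a z-score outlier."""
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=-1)  # (E-1, N)
    travel = steps.sum(axis=0)                                   # (N,)
    z = (travel - travel.mean()) / (travel.std() + 1e-9)
    return np.where(z > z_thresh)[0]
```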
|
Dong, Tian |
ESEC/FSE '23: "Mate! Are You Really Aware? ..."
Mate! Are You Really Aware? An Explainability-Guided Testing Framework for Robustness of Malware Detectors
Ruoxi Sun, Minhui Xue, Gareth Tyson, Tian Dong, Shaofeng Li, Shuo Wang, Haojin Zhu, Seyit Camtepe, and Surya Nepal (CSIRO’s Data61, Australia; Cybersecurity CRC, Australia; Hong Kong University of Science and Technology, China; Shanghai Jiao Tong University, China; Peng Cheng Laboratory, China) Numerous open-source and commercial malware detectors are available. However, their efficacy is threatened by new adversarial attacks, whereby malware attempts to evade detection, e.g., by performing feature-space manipulation. In this work, we propose an explainability-guided and model-agnostic testing framework for robustness of malware detectors when confronted with adversarial attacks. The framework introduces the concept of Accrued Malicious Magnitude (AMM) to identify which malware features could be manipulated to maximize the likelihood of evading detection. We then use this framework to test several state-of-the-art malware detectors' ability to detect manipulated malware. We find that (i) commercial antivirus engines are vulnerable to AMM-guided test cases; (ii) the ability of a manipulated malware generated using one detector to evade detection by another detector (i.e., transferability) depends on the overlap of features with large AMM values between the different detectors; and (iii) AMM values effectively measure the fragility of features (i.e., capability of feature-space manipulation to flip the prediction results) and explain the robustness of malware detectors facing evasion attacks. Our findings shed light on the limitations of current malware detectors, as well as how they can be improved. @InProceedings{ESEC/FSE23p1573, author = {Ruoxi Sun and Minhui Xue and Gareth Tyson and Tian Dong and Shaofeng Li and Shuo Wang and Haojin Zhu and Seyit Camtepe and Surya Nepal}, title = {Mate! Are You Really Aware? An Explainability-Guided Testing Framework for Robustness of Malware Detectors}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1573--1585}, doi = {10.1145/3611643.3616309}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
|
Dong, Yibo |
ESEC/FSE '23-INDUSTRY: "LightF3: A Lightweight Fully-Process ..."
LightF3: A Lightweight Fully-Process Formal Framework for Automated Verifying Railway Interlocking Systems
Yibo Dong, Xiaoyu Zhang, Yicong Xu, Chang Cai, Yu Chen, Weikai Miao, Jianwen Li, and Geguang Pu (East China Normal University, China; Shanghai Trusted Industrial Control Platform, China) Interlocking has long played a crucial role in railway systems. Its functional correctness, particularly concerning safety, forms the foundation of the entire signaling system. To date, numerous efforts have been made to formally model and verify interlocking systems. However, two main problems persist in most prior work: (1) The formal description of the interlocking system heavily depends on reusing existing models, which often results in overgeneralization and failing to fully utilize the intrinsic characteristics of interlocking systems. (2) The verification techniques of current approaches may quickly become outdated, and there is no adaptable method to integrate state-of-the-art verification algorithms or tools. To address the above issues, we present LightF3, a lightweight and fully-process formal framework for modeling and verifying railway interlocking systems. LightF3 provides RIS-FL, a formal language based on FQLTL (a variant of LTL) to model the system and its specifications. LightF3 transforms the RIS-FL model automatically to the aiger model, which is the mainstream input of state-of-the-art model checkers, and then invokes the most advanced checkers to complete the verification task. We evaluated LightF3 by testing five real station instances from our industrial partner, demonstrating its effectiveness as a new framework. Additionally, we analyzed the statistics of the verification results from different model-checking techniques, providing useful conclusions for both the railway interlocking and formal methods communities. @InProceedings{ESEC/FSE23p1914, author = {Yibo Dong and Xiaoyu Zhang and Yicong Xu and Chang Cai and Yu Chen and Weikai Miao and Jianwen Li and Geguang Pu}, title = {LightF3: A Lightweight Fully-Process Formal Framework for Automated Verifying Railway Interlocking Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1914--1925}, doi = {10.1145/3611643.3613874}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
|
Dong, Yiwen |
ESEC/FSE '23-DEMO: "Ad Hoc Syntax-Guided Program ..."
Ad Hoc Syntax-Guided Program Reduction
Jia Le Tian, Mengxiao Zhang, Zhenyang Xu, Yongqiang Tian, Yiwen Dong, and Chengnian Sun (University of Waterloo, Canada) Program reduction is a widely adopted, indispensable technique for debugging language implementations such as compilers and interpreters. Given a program 𝑃 and a bug triggered by 𝑃, a program reducer can produce a minimized program 𝑃∗ that is derived from 𝑃 and still triggers the same bug. Perses is one of the state-of-the-art program reducers. It leverages the syntax of 𝑃 to guide the reduction process for efficiency and effectiveness. It is language-agnostic as its reduction algorithm is independent of any language-specific syntax. Conceptually, to support a new language, Perses only needs the context-free grammar 𝐺 of the language; in practice, this is not easy. One needs to first manually transform 𝐺 into a special grammar form (PNF) with a tool provided by Perses, then manually change the code base of Perses to integrate the new language, and lastly build a binary of Perses. This paper presents our latest work to improve the usability of Perses by extending Perses to perform ad hoc program reduction for any new language as long as the language has a context-free grammar 𝐺. With this extended version (referred to as Persesadhoc), the difficulty of supporting new languages is significantly reduced: a user only needs to write a configuration file and execute one command to support a new language in Perses, compared to manually transforming the grammar format, modifying the code base, and re-building Perses. Our case study demonstrates that with Persesadhoc, the Perses-related infrastructure code required for supporting GLSL can be reduced from 190 lines of code to 20. Our extensive evaluations also show that Persesadhoc is as effective and efficient as Perses in reducing programs, and only takes 10 seconds to support a new language, which is negligible compared to the manual effort required in Perses. A video demonstration of the tool can be found at https://youtu.be/trYwOT0mXhU. @InProceedings{ESEC/FSE23p2137, author = {Jia Le Tian and Mengxiao Zhang and Zhenyang Xu and Yongqiang Tian and Yiwen Dong and Chengnian Sun}, title = {Ad Hoc Syntax-Guided Program Reduction}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2137--2141}, doi = {10.1145/3611643.3613101}, year = {2023}, } Publisher's Version Video |
|
Dou, Shihan |
ESEC/FSE '23: "Gitor: Scalable Code Clone ..."
Gitor: Scalable Code Clone Detection by Building Global Sample Graph
Junjie Shan, Shihan Dou, Yueming Wu, Hairu Wu, and Yang Liu (Westlake University, China; Fudan University, China; Nanyang Technological University, Singapore) Code clone detection is about finding out similar code fragments, which has drawn much attention in software engineering since it is important for software maintenance and evolution. Researchers have proposed many techniques and tools for source code clone detection, but current detection methods concentrate on analyzing or processing code samples individually without exploring the underlying connections among code samples. In this paper, we propose Gitor to capture the underlying connections among different code samples. Specifically, given a source code database, we first tokenize all code samples to extract the pre-defined individual information (keywords). After obtaining all samples’ individual information, we leverage them to build a large global sample graph where each node is a code sample or a type of individual information. Then we apply a node embedding technique on the global sample graph to extract all the samples’ vector representations. After collecting all code samples’ vectors, we can simply compare the similarity between any two samples to detect possible clone pairs. More importantly, since the obtained vector of a sample is from a global sample graph, we can combine it with its own code features to improve the code clone detection performance. To demonstrate the effectiveness of Gitor, we evaluate it on a widely used dataset, namely BigCloneBench. Our experimental results show that Gitor has higher accuracy in terms of code clone detection and excellent execution time for inputs of various sizes (1–100 MLOC) compared to existing state-of-the-art tools. Moreover, we also evaluate the combination of Gitor with other traditional vector-based clone detection methods; the results show that the use of Gitor enables them to detect more code clones with higher F1. @InProceedings{ESEC/FSE23p784, author = {Junjie Shan and Shihan Dou and Yueming Wu and Hairu Wu and Yang Liu}, title = {Gitor: Scalable Code Clone Detection by Building Global Sample Graph}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {784--795}, doi = {10.1145/3611643.3616371}, year = {2023}, } Publisher's Version |
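The global sample graph is bipartite: code samples on one side, extracted keywords on the other, and clone candidates are sample pairs with similar neighborhoods. A sketch that uses keyword-incidence cosine similarity as a stand-in for the node-embedding step:

```python
import math

def cosine(a, b):
    """Cosine similarity of two keyword sets (0/1 incidence vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def clone_candidates(samples, threshold=0.7):
    """samples: {sample_id: set of keywords}, i.e., each sample's
    neighborhood in the bipartite sample-keyword graph. Pairs with
    similar neighborhoods are reported as possible clones."""
    ids = list(samples)
    return [(s1, s2) for i, s1 in enumerate(ids) for s2 in ids[i + 1:]
            if cosine(samples[s1], samples[s2]) >= threshold]
```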
|
Du, Xiaohu |
ESEC/FSE '23: "Understanding the Bug Characteristics ..."
Understanding the Bug Characteristics and Fix Strategies of Federated Learning Systems
Xiaohu Du, Xiao Chen, Jialun Cao, Ming Wen, Shing-Chi Cheung, and Hai Jin (Huazhong University of Science and Technology, China; Hong Kong University of Science and Technology, China) Federated learning (FL) is an emerging machine learning paradigm that aims to address the problem of isolated data islands. To preserve privacy, FL allows machine learning models and deep neural networks to be trained from decentralized data kept privately at individual devices. FL has been increasingly adopted in mission-critical fields such as finance and healthcare. However, bugs in FL systems are inevitable and may result in catastrophic consequences such as financial loss, inappropriate medical decisions, and violations of data privacy ordinances. While many recent studies were conducted to understand the bugs in machine learning systems, there is no existing study to characterize the bugs arising from the unique nature of FL systems. To fill the gap, we collected 395 real bugs from six popular FL frameworks (TensorFlow Federated, PySyft, FATE, Flower, PaddleFL, and Fedlearner) in GitHub and StackOverflow, and then manually analyzed their symptoms and impacts, prone stages, root causes, and fix strategies. Furthermore, we report a series of findings and actionable implications that can potentially facilitate the detection of FL bugs. @InProceedings{ESEC/FSE23p1358, author = {Xiaohu Du and Xiao Chen and Jialun Cao and Ming Wen and Shing-Chi Cheung and Hai Jin}, title = {Understanding the Bug Characteristics and Fix Strategies of Federated Learning Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1358--1370}, doi = {10.1145/3611643.3616347}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable ESEC/FSE '23: "An Extensive Study on Adversarial ..." An Extensive Study on Adversarial Attack against Pre-trained Models of Code Xiaohu Du, Ming Wen, Zichao Wei, Shangwen Wang, and Hai Jin (Huazhong University of Science and Technology, China; National University of Defense Technology, China) Transformer-based pre-trained models of code (PTMC) have been widely utilized and have achieved state-of-the-art performance in many mission-critical applications. However, they can be vulnerable to adversarial attacks through identifier substitution or coding style transformation, which can significantly degrade accuracy and may further incur security concerns. Although several approaches have been proposed to generate adversarial examples for PTMC, the effectiveness and efficiency of such approaches, especially on different code intelligence tasks, has not been well understood. To bridge this gap, this study systematically analyzes five state-of-the-art adversarial attack approaches from three perspectives: effectiveness, efficiency, and the quality of generated examples. The results show that none of the five approaches balances all these perspectives. Particularly, approaches with a high attack success rate tend to be time-consuming; the adversarial code they generate often lacks naturalness, and vice versa. To address this limitation, we explore the impact of perturbing identifiers under different contexts and find that identifier substitution within for and if statements is the most effective. Based on these findings, we propose a new approach that prioritizes different types of statements for various tasks and further utilizes beam search to generate adversarial examples. 
Evaluation results show that it outperforms the state-of-the-art ALERT in terms of both effectiveness and efficiency while preserving the naturalness of the generated adversarial examples. @InProceedings{ESEC/FSE23p489, author = {Xiaohu Du and Ming Wen and Zichao Wei and Shangwen Wang and Hai Jin}, title = {An Extensive Study on Adversarial Attack against Pre-trained Models of Code}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {489--501}, doi = {10.1145/3611643.3616356}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
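The finding that identifier substitution inside for and if statements is most effective suggests a simple perturbation primitive. A sketch using Python's ast module to locate such identifiers; the beam search over candidate substitutions, and the application to other languages, are omitted:

```python
import ast
import re

def identifiers_in_branches_and_loops(source):
    """Collect identifier names occurring inside for/if statements,
    the contexts the study found most effective to perturb."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.For, ast.If)):
            for sub in ast.walk(node):
                if isinstance(sub, ast.Name):
                    names.add(sub.id)
    return names

def substitute(source, old, new):
    """Naive whole-word rename producing one adversarial candidate;
    a real attack would verify that semantics are preserved."""
    return re.sub(rf"\b{re.escape(old)}\b", new, source)
```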
|
Du, Xiaoning |
ESEC/FSE '23: "CodeMark: Imperceptible Watermarking ..."
CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models
Zhensu Sun, Xiaoning Du, Fu Song, and Li Li (Beihang University, China; Monash University, Australia; Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Automotive Software Innovation Center, China) Code datasets are of immense value for training neural-network-based code completion models, where companies or organizations have made substantial investments to establish and process these datasets. Unfortunately, these datasets, either built for proprietary or public usage, face the high risk of unauthorized exploits, resulting from data leakages, license violations, etc. Even worse, the "black-box" nature of neural models sets a high barrier for externals to audit their training datasets, which further enables such unauthorized usage. Currently, watermarking methods have been proposed to prohibit inappropriate usage of image and natural language datasets. However, due to domain specificity, they are not directly applicable to code datasets, leaving the copyright protection of this emerging and important field of code data still exposed to threats. To fill this gap, we propose a method, named CodeMark, to embed user-defined imperceptible watermarks into code datasets to trace their usage in training neural code completion models. CodeMark is based on adaptive semantic-preserving transformations, which preserve the exact functionality of the code data and keep the changes covert against rule-breakers. We implement CodeMark in a toolkit and conduct an extensive evaluation of code completion models. CodeMark is validated to fulfill all desired properties of practical watermarks, including harmlessness to model accuracy, verifiability, robustness, and imperceptibility. @InProceedings{ESEC/FSE23p1561, author = {Zhensu Sun and Xiaoning Du and Fu Song and Li Li}, title = {CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1561--1572}, doi = {10.1145/3611643.3616297}, year = {2023}, } Publisher's Version ESEC/FSE '23: "DistXplore: Distribution-Guided ..." DistXplore: Distribution-Guided Testing for Evaluating and Enhancing Deep Learning Systems Longtian Wang, Xiaofei Xie, Xiaoning Du, Meng Tian, Qing Guo, Zheng Yang, and Chao Shen (Xi’an Jiaotong University, China; Singapore Management University, Singapore; Monash University, Australia; A*STAR, Singapore; Huawei, China) Deep learning (DL) models are trained on sampled data, where the distribution of training data differs from that of real-world data (i.e., the distribution shift), which reduces the model's robustness. Various testing techniques have been proposed, including distribution-unaware and distribution-aware methods. However, distribution-unaware testing lacks effectiveness by not explicitly considering the distribution of test cases and may generate redundant errors (within the same distribution). Distribution-aware testing techniques primarily focus on generating test cases that follow the training distribution, missing out-of-distribution data that may also be valid and should be considered in the testing process. In this paper, we propose a novel distribution-guided approach for generating valid test cases with diverse distributions, which can better evaluate the model's robustness (i.e., generating hard-to-detect errors) and enhance the model's robustness (i.e., enriching training data). 
Unlike existing testing techniques that optimize individual test cases, DistXplore optimizes test suites that represent specific distributions. To evaluate and enhance the model's robustness, we design two metrics: distribution difference, which maximizes the similarity in distribution between two different classes of data to generate hard-to-detect errors, and distribution diversity, which increases the distribution diversity of generated test cases for enhancing the model's robustness. To evaluate the effectiveness of DistXplore in model evaluation and enhancement, we compare DistXplore with 14 state-of-the-art baselines on 10 models across 4 datasets. The evaluation results show that DistXplore detects a larger number of errors (e.g., 2×+ on average). Furthermore, DistXplore achieves a higher improvement in empirical robustness (e.g., 5.2% more accuracy improvement than the baselines on average). @InProceedings{ESEC/FSE23p68, author = {Longtian Wang and Xiaofei Xie and Xiaoning Du and Meng Tian and Qing Guo and Zheng Yang and Chao Shen}, title = {DistXplore: Distribution-Guided Testing for Evaluating and Enhancing Deep Learning Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {68--80}, doi = {10.1145/3611643.3616266}, year = {2023}, } Publisher's Version |
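The two suite-level objectives can be grounded with a simple distributional distance. Below, a mean-embedding distance stands in for the paper's distribution-difference measure, with `embed` assumed to map inputs to feature vectors:

```python
import numpy as np

def mean_embedding_distance(suite_a, suite_b, embed):
    """Distance between two test suites in an embedding space; small
    values mean the suites are distributionally close (useful when
    pushing one class's suite toward another to surface subtle errors)."""
    mu_a = np.mean([embed(x) for x in suite_a], axis=0)
    mu_b = np.mean([embed(x) for x in suite_b], axis=0)
    return float(np.linalg.norm(mu_a - mu_b))

def suite_diversity(suites, embed):
    """Average pairwise distance across generated suites: a crude
    analogue of the distribution-diversity objective."""
    dists = [mean_embedding_distance(a, b, embed)
             for i, a in enumerate(suites) for b in suites[i + 1:]]
    return sum(dists) / len(dists) if dists else 0.0
```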
|
Du, Xueying |
ESEC/FSE '23: "KG4CraSolver: Recommending ..."
KG4CraSolver: Recommending Crash Solutions via Knowledge Graph
Xueying Du, Yiling Lou, Mingwei Liu, Xin Peng, and Tianyong Yang (Fudan University, China) Fixing crashes is challenging, and developers often discuss their encountered crashes and refer to similar crashes and solutions on online Q&A forums (e.g., Stack Overflow). However, a crash often involves very complex contexts, which include different contextual elements, e.g., purposes, environments, code, and crash traces. Existing crash solution recommendation or general solution recommendation techniques only use an incomplete context or treat the entire context as pure text to search relevant solutions for a given crash, resulting in inaccurate recommendation results. In this work, we propose a novel crash solution knowledge graph (KG) to summarize the complete crash context and its solution with a graph-structured representation. To construct the crash solution KG automatically, we propose to leverage prompt learning to construct the KG from SO threads with a small set of labeled data. Based on the constructed KG, we further propose a novel KG-based crash solution recommendation technique KG4CraSolver, which precisely finds the relevant SO thread for an encountered crash by finely analyzing and matching the complete crash context based on the crash solution KG. The evaluation results show that the constructed KG is of high quality and KG4CraSolver outperforms baselines in terms of all metrics (e.g., 13.4%-113.4% MRR improvements). Moreover, we perform a user study and find that KG4CraSolver helps participants find crash solutions 34.4% faster and 63.3% more accurately. @InProceedings{ESEC/FSE23p1242, author = {Xueying Du and Yiling Lou and Mingwei Liu and Xin Peng and Tianyong Yang}, title = {KG4CraSolver: Recommending Crash Solutions via Knowledge Graph}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1242--1254}, doi = {10.1145/3611643.3616317}, year = {2023}, } Publisher's Version Info ESEC/FSE '23: "Recommending Analogical APIs ..." Recommending Analogical APIs via Knowledge Graph Embedding Mingwei Liu, Yanjun Yang, Yiling Lou, Xin Peng, Zhong Zhou, Xueying Du, and Tianyong Yang (Fudan University, China) Library migration, which replaces the current library with a different one to retain the same software behavior, is common in software evolution. An essential part of this is finding an analogous API for the desired functionality. However, due to the multitude of libraries/APIs, manually finding such an API is time-consuming and error-prone. Researchers created automated analogical API recommendation techniques, notably documentation-based methods. Despite their potential, these methods have limitations, e.g., incomplete semantic understanding in documentation and scalability issues. In this study, we present KGE4AR, a novel documentation-based approach using knowledge graph (KG) embedding for recommending analogical APIs during library migration. KGE4AR introduces a unified API KG to comprehensively represent documentation knowledge, capturing high-level semantics. It further embeds this unified API KG into vectors for efficient, scalable similarity calculation. We assess KGE4AR with 35,773 Java libraries in two scenarios, with and without target libraries. KGE4AR notably outperforms state-of-the-art techniques (e.g., 47.1%-143.0% and 11.7%-80.6% MRR improvements), showcasing scalability with growing library counts. 
@InProceedings{ESEC/FSE23p1496, author = {Mingwei Liu and Yanjun Yang and Yiling Lou and Xin Peng and Zhong Zhou and Xueying Du and Tianyong Yang}, title = {Recommending Analogical APIs via Knowledge Graph Embedding}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1496--1508}, doi = {10.1145/3611643.3616305}, year = {2023}, } Publisher's Version Info |
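For the KGE4AR entry above: once APIs are embedded as vectors from the unified API KG, analogical recommendation reduces to nearest-neighbor search. A sketch of the retrieval step only; the KG-embedding training is omitted and the vectors are placeholders:

```python
import numpy as np

def recommend_analogical(query_api, embeddings, top_k=3):
    """embeddings: {api_name: np.ndarray}. Returns the APIs whose
    KG-embedding vectors are closest (by cosine) to the query API's."""
    q = embeddings[query_api]
    q = q / np.linalg.norm(q)
    scored = []
    for name, v in embeddings.items():
        if name == query_api:
            continue
        scored.append((float(q @ (v / np.linalg.norm(v))), name))
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]
```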
|
Du, Yali |
ESEC/FSE '23: "Pre-training Code Representation ..."
Pre-training Code Representation with Semantic Flow Graph for Effective Bug Localization
Yali Du and Zhongxing Yu (Shandong University, China) Inspired by the great success of pre-training in natural language processing, pre-trained models for programming languages have been widely used to promote code intelligence in recent years. In particular, BERT has been used for bug localization tasks and impressive results have been obtained. However, these BERT-based bug localization techniques suffer from two issues. First, the BERT model pre-trained on source code does not adequately capture the deep semantics of program code. Second, the overall bug localization models neglect the necessity of large-scale negative samples in contrastive learning for representations of changesets and ignore the lexical similarity between bug reports and changesets during similarity estimation. We address these two issues by 1) proposing a novel directed, multiple-label code graph representation named Semantic Flow Graph (SFG), which compactly and adequately captures code semantics, 2) designing and training SemanticCodeBERT based on SFG, and 3) designing a novel Hierarchical Momentum Contrastive Bug Localization technique (HMCBL). Evaluation results show that our method achieves state-of-the-art performance in bug localization. @InProceedings{ESEC/FSE23p579, author = {Yali Du and Zhongxing Yu}, title = {Pre-training Code Representation with Semantic Flow Graph for Effective Bug Localization}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {579--591}, doi = {10.1145/3611643.3616338}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
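The large-scale negative samples this abstract calls for are the standard contrastive (InfoNCE) setup with a queue of negatives. A generic PyTorch sketch of that loss, as a stand-in for HMCBL's hierarchical, momentum-based variant:

```python
import torch
import torch.nn.functional as F

def info_nce(report_vec, changeset_vec, negative_queue, temperature=0.07):
    """report_vec, changeset_vec: (B, D) matched pairs; negative_queue:
    (K, D) embeddings of unrelated changesets (the large-scale negatives).
    The matched changeset must score higher than every negative."""
    q = F.normalize(report_vec, dim=1)
    k_pos = F.normalize(changeset_vec, dim=1)
    k_neg = F.normalize(negative_queue, dim=1)
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)       # (B, 1)
    l_neg = q @ k_neg.t()                              # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long,
                         device=logits.device)         # positive at index 0
    return F.cross_entropy(logits, labels)
```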
|
Duan, Zhenhua |
ESEC/FSE '23: "Detecting Atomicity Violations ..."
Detecting Atomicity Violations in Interrupt-Driven Programs via Interruption Points Selecting and Delayed ISR-Triggering
Bin Yu, Cong Tian, Hengrui Xing, Zuchao Yang, Jie Su, Xu Lu, Jiyu Yang, Liang Zhao, Xiaofeng Li, and Zhenhua Duan (Xidian University, China; Beijing Institute of Control Engineering, China) Interrupt-driven programs have been widely used in safety-critical areas such as aerospace and embedded systems. However, uncertain interleaving execution of interrupt service routines (ISRs) usually causes concurrency bugs. Specifically, when one or more ISRs attempt to preempt a sequence of instructions which are expected to be atomic, a kind of concurrency bug, namely an atomicity violation, may occur, and it is challenging to find such bugs precisely and efficiently. In this paper, we propose a static approach for detecting atomicity violations in interrupt-driven programs. First, the program model is constructed with interruption points being selected to determine the possibly influenced ISRs. After that, reachability computation is conducted to build up a whole abstract reachability tree, and a delayed ISR-triggering strategy is employed to reduce the state space. Meanwhile, unserializable interleaving patterns are recognized to achieve the goal of atomicity violation detection. The approach has been implemented as a configurable tool named CPA4AV. Extensive experiments show that CPA4AV is much more precise than related tools, with little extra time overhead. In addition, more complex situations can be handled by CPA4AV. @InProceedings{ESEC/FSE23p1153, author = {Bin Yu and Cong Tian and Hengrui Xing and Zuchao Yang and Jie Su and Xu Lu and Jiyu Yang and Liang Zhao and Xiaofeng Li and Zhenhua Duan}, title = {Detecting Atomicity Violations in Interrupt-Driven Programs via Interruption Points Selecting and Delayed ISR-Triggering}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1153--1164}, doi = {10.1145/3611643.3616276}, year = {2023}, } Publisher's Version Info |
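The unserializable interleaving patterns mentioned here are the classic single-variable access triples, e.g., an ISR write landing between two reads the preempted task expects to be atomic. A sketch of the pattern check on an access trace; the pattern set follows the common serializability classification and is an assumption about what CPA4AV matches:

```python
# Access triples (first, interleaved, second) on the same shared variable
# that cannot be serialized; R = read, W = write.
UNSERIALIZABLE = {("R", "W", "R"), ("W", "W", "R"),
                  ("R", "W", "W"), ("W", "R", "W")}

def atomicity_violations(trace):
    """trace: list of (thread_or_isr, op, var) in execution order.
    Reports interleaved accesses from another context that break an
    intended-atomic pair of accesses by the preempted task."""
    found = []
    for i, (t1, op1, v1) in enumerate(trace):
        for j in range(i + 1, len(trace)):
            t2, op2, v2 = trace[j]
            if t2 == t1 and v2 == v1:  # the task's second access
                for k in range(i + 1, j):
                    tm, opm, vm = trace[k]
                    if tm != t1 and vm == v1 and (op1, opm, op2) in UNSERIALIZABLE:
                        found.append((i, k, j))
                break
    return found
```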
|
Dvornik, Nikita |
ESEC/FSE '23: "DecompoVision: Reliability ..."
DecompoVision: Reliability Analysis of Machine Vision Components through Decomposition and Reuse
Boyue Caroline Hu, Lina Marsso, Nikita Dvornik, Huakun Shen, and Marsha Chechik (University of Toronto, Canada; Waabi, Canada) Analyzing reliability of Machine Vision Components (MVCs) against scene changes (such as rain or fog) in their operational environment is crucial for safety-critical applications. Safety analysis relies on the availability of precisely specified and, ideally, machine-verifiable requirements. The state-of-the-art reliability framework ICRAF derives machine-verifiable requirements from human performance data. However, ICRAF is limited to analyzing reliability of MVCs solving simple vision tasks, such as image classification. Yet, many real-world safety-critical systems require solving more complex vision tasks, such as object detection and instance segmentation. Fortunately, many complex vision tasks (which we call “c-tasks”) can be represented as a sequence of simple vision subtasks. For instance, object detection can be decomposed as object localization followed by classification. Based on this fact, in this paper, we show that the analysis of c-tasks can also be decomposed as a sequential analysis of their simple subtasks, which allows us to apply existing techniques for analyzing simple vision tasks. Specifically, we propose a modular reliability framework, DecompoVision, that decomposes: (1) the problem of solving a c-task, (2) the reliability requirements, and (3) the reliability analysis, and, as a result, provides deeper insights into MVC reliability. DecompoVision extends ICRAF to handle complex vision tasks and enables reuse of existing artifacts across different c-tasks. We capture new reliability gaps by checking our requirements on 13 widely used object detection MVCs, and, for the first time, benchmark segmentation MVCs. @InProceedings{ESEC/FSE23p541, author = {Boyue Caroline Hu and Lina Marsso and Nikita Dvornik and Huakun Shen and Marsha Chechik}, title = {DecompoVision: Reliability Analysis of Machine Vision Components through Decomposition and Reuse}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {541--552}, doi = {10.1145/3611643.3616333}, year = {2023}, } Publisher's Version |
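As a toy illustration of the decomposition idea, the sketch below splits an object-detection check into its two subtask checks, localization (IoU) followed by classification (label match); the threshold and the data are assumptions for illustration only, not DecompoVision's requirement language.

def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def detection_ok(pred, truth, iou_thresh=0.5):
    # subtask 1: localization; subtask 2: classification
    localized = iou(pred["box"], truth["box"]) >= iou_thresh
    classified = pred["label"] == truth["label"]
    return localized and classified

# A reliability requirement can then be phrased per subtask: under a scene
# change (e.g., fog), each subtask check must keep passing.
truth = {"box": (10, 10, 50, 50), "label": "car"}
pred_clear = {"box": (12, 11, 49, 52), "label": "car"}
pred_foggy = {"box": (12, 11, 49, 52), "label": "bus"}
print(detection_ok(pred_clear, truth), detection_ok(pred_foggy, truth))

Decomposing the check pinpoints which subtask failed: in the foggy case localization still holds and only classification breaks.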
|
Dwyer, Matthew B. |
ESEC/FSE '23-IVR: "Deeper Notions of Correctness ..."
Deeper Notions of Correctness in Image-Based DNNs: Lifting Properties from Pixel to Entities
Felipe Toledo, David Shriver, Sebastian Elbaum, and Matthew B. Dwyer (University of Virginia, USA) Deep Neural Networks (DNNs) that process images are being widely used for many safety-critical tasks, from autonomous vehicles to medical diagnosis. Currently, DNN correctness properties are defined at the pixel level over the entire input. Such properties are useful to expose system failures related to sensor noise or adversarial attacks, but they cannot capture features that are relevant to domain-specific entities and reflect richer types of behaviors. To overcome this limitation, we envision the specification of properties based on the entities that may be present in image input, capturing their semantics and how they change. Creating such properties today is difficult as it requires determining where the entities appear in images, defining how each entity can change, and writing a specification that is compatible with each particular V&V client. We introduce an initial framework structured around those challenges to assist in automatically generating Domain-specific Entity-based properties by leveraging object detection models to identify entities in images and creating properties based on entity features. Our feasibility study provides initial evidence that the new properties can uncover interesting system failures, such as how changes in skin color can modify the output of a gender classification network. We conclude by analyzing the framework's potential to implement the vision and by outlining directions for future work. @InProceedings{ESEC/FSE23p2122, author = {Felipe Toledo and David Shriver and Sebastian Elbaum and Matthew B. Dwyer}, title = {Deeper Notions of Correctness in Image-Based DNNs: Lifting Properties from Pixel to Entities}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2122--2126}, doi = {10.1145/3611643.3613079}, year = {2023}, } Publisher's Version ESEC/FSE '23: "Neural-Based Test Oracle Generation: ..." Neural-Based Test Oracle Generation: A Large-Scale Evaluation and Lessons Learned Soneya Binta Hossain, Antonio Filieri, Matthew B. Dwyer, Sebastian Elbaum, and Willem Visser (University of Virginia, USA; Amazon Web Services, USA) Defining test oracles is crucial and central to test development, but manual construction of oracles is expensive. While recent neural-based automated test oracle generation techniques have shown promise, their real-world effectiveness remains a compelling question requiring further exploration and understanding. This paper investigates the effectiveness of TOGA, a recently developed neural-based method for automatic test oracle generation. TOGA utilizes EvoSuite-generated test inputs and generates both exception and assertion oracles. In a Defects4J study, TOGA outperformed specification, search, and neural-based techniques, detecting 57 bugs, including 30 unique bugs not detected by other methods. To gain a deeper understanding of its applicability in real-world settings, we conducted a series of external, extended, and conceptual replication studies of TOGA. In a large-scale study involving 25 real-world Java systems, 223.5K test cases, and 51K injected faults, we evaluate TOGA’s ability to improve fault-detection effectiveness relative to the state-of-the-practice and the state-of-the-art. We find that TOGA misclassifies the type of oracle needed 24% of the time and that, when it classifies correctly, around 62% of the time it is not confident enough to generate any assertion oracle.
When it does generate an assertion oracle, more than 47% of those assertions are false positives, and the true-positive assertions only increase fault detection by 0.3% relative to prior work. These findings expose limitations of the state-of-the-art neural-based oracle generation technique, provide valuable insights for improvement, and offer lessons for evaluating future automated oracle generation methods. @InProceedings{ESEC/FSE23p120, author = {Soneya Binta Hossain and Antonio Filieri and Matthew B. Dwyer and Sebastian Elbaum and Willem Visser}, title = {Neural-Based Test Oracle Generation: A Large-Scale Evaluation and Lessons Learned}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {120--132}, doi = {10.1145/3611643.3616265}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
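For readers unfamiliar with the terminology, the hypothetical harness below (not the paper's evaluation infrastructure) shows what exception oracles, assertion oracles, and false-positive assertions look like: a generated assertion is a false positive when it fails on the correct program. All names and tests are invented.

def correct_abs(x):              # the (correct) unit under test
    return -x if x < 0 else x

generated_tests = [
    # (input, oracle kind, assertion predicate over the output)
    (-3, "assertion", lambda out: out == 3),   # passes on correct code
    ( 0, "assertion", lambda out: out == 1),   # wrong expectation
    (-3, "exception", None),                   # misclassified: abs never raises
]

for arg, kind, oracle in generated_tests:
    try:
        out, raised = correct_abs(arg), False
    except Exception:
        raised = True
    if kind == "exception":
        verdict = "ok" if raised else "false positive (expected an exception)"
    else:
        verdict = "ok" if oracle(out) else "false positive (assertion fails on correct code)"
    print(f"input={arg!r:>4} kind={kind}: {verdict}")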
|
D’Angelo, Sarah |
ESEC/FSE '23: "Building and Sustaining Ethnically, ..."
Building and Sustaining Ethnically, Racially, and Gender Diverse Software Engineering Teams: A Study at Google
Ella Dagan, Anita Sarma, Alison Chang, Sarah D’Angelo, Jill Dicker, and Emerson Murphy-Hill (Google, USA; Oregon State University, USA) Teams that build software are largely demographically homogeneous. Without diversity, homogeneous perspectives dominate how, why, and for whom software is designed. To understand how teams can successfully build and sustain diversity, we interviewed 11 engineers and 9 managers from some of the most gender and racially diverse teams at Google, a large software company. Qualitatively analyzing the interviews, we found shared approaches to recruiting, hiring, and promoting an inclusive environment, all of which create a positive feedback loop. Our findings produce actionable practices that every member of the team can take to increase diversity by fostering a more inclusive software engineering environment. @InProceedings{ESEC/FSE23p631, author = {Ella Dagan and Anita Sarma and Alison Chang and Sarah D’Angelo and Jill Dicker and Emerson Murphy-Hill}, title = {Building and Sustaining Ethnically, Racially, and Gender Diverse Software Engineering Teams: A Study at Google}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {631--643}, doi = {10.1145/3611643.3616273}, year = {2023}, } Publisher's Version |
|
Eberlein, Martin |
ESEC/FSE '23: "Semantic Debugging ..."
Semantic Debugging
Martin Eberlein, Marius Smytzek, Dominic Steinhöfel, Lars Grunske, and Andreas Zeller (Humboldt University of Berlin, Germany; CISPA Helmholtz Center for Information Security, Germany) Why does my program fail? We present a novel and general technique to automatically determine failure causes and conditions, using logical properties over input elements: “The program fails if and only if int(<length>) > len(<payload>) holds—that is, the given <length> is larger than the <payload> length.” To obtain such diagnoses, our AVICENNA prototype uses modern techniques for inferring properties of passing and failing inputs, validating and refining hypotheses by having a constraint solver generate supporting test cases. As a result, AVICENNA produces crisp and expressive diagnoses even for complex failure conditions, considerably improving over the state of the art with diagnoses close to those of human experts. @InProceedings{ESEC/FSE23p438, author = {Martin Eberlein and Marius Smytzek and Dominic Steinhöfel and Lars Grunske and Andreas Zeller}, title = {Semantic Debugging}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {438--449}, doi = {10.1145/3611643.3616296}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable |
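The flavor of such a diagnosis can be reproduced in a few lines: a candidate hypothesis over parsed input elements is accepted only if it separates every observed passing and failing input. The sketch below hard-codes the abstract's example predicate and an assumed "<length>:<payload>" input format; the actual tool infers and refines hypotheses with solver-generated tests.

def parse(sample):
    length, payload = sample.split(":", 1)
    return int(length), payload

def hypothesis(sample):
    # candidate diagnosis: "fails iff int(<length>) > len(<payload>)"
    length, payload = parse(sample)
    return length > len(payload)

# (input, observed failure?) pairs from passing and failing runs
labeled = [("3:abc", False), ("5:abc", True), ("0:", False), ("2:a", True)]

# An "if and only if" diagnosis must separate *all* observations,
# not merely correlate with them.
assert all(hypothesis(s) == failed for s, failed in labeled)
print("hypothesis explains every observed pass/fail")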
|
Ehsani, Ramtin |
ESEC/FSE '23-IVR: "Exploring Moral Principles ..."
Exploring Moral Principles Exhibited in OSS: A Case Study on GitHub Heated Issues
Ramtin Ehsani, Rezvaneh Rezapour, and Preetha Chatterjee (Drexel University, USA) To foster collaboration and inclusivity in Open Source Software (OSS) projects, it is crucial to understand and detect patterns of toxic language that may drive contributors away, especially those from underrepresented communities. Although machine learning-based toxicity detection tools trained on domain-specific data have shown promise, their design lacks an understanding of the unique nature and triggers of toxicity in OSS discussions, highlighting the need for further investigation. In this study, we employ Moral Foundations Theory (MFT) to examine the relationship between moral principles and toxicity in OSS. Specifically, we analyze toxic communications in GitHub issue threads to identify and understand five types of moral principles exhibited in text, and explore their potential association with toxic behavior. Our preliminary findings suggest a possible link between moral principles and toxic comments in OSS communications, with each moral principle associated with at least one type of toxicity. The potential of MFT in toxicity detection warrants further investigation. @InProceedings{ESEC/FSE23p2092, author = {Ramtin Ehsani and Rezvaneh Rezapour and Preetha Chatterjee}, title = {Exploring Moral Principles Exhibited in OSS: A Case Study on GitHub Heated Issues}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2092--2096}, doi = {10.1145/3611643.3613077}, year = {2023}, } Publisher's Version |
|
Eisele, Max |
ESEC/FSE '23: "Revisiting Neural Program ..."
Revisiting Neural Program Smoothing for Fuzzing
Maria-Irina Nicolae, Max Eisele, and Andreas Zeller (Bosch, Germany; Saarland University, Germany; CISPA Helmholtz Center for Information Security, Germany) Testing with randomly generated inputs (fuzzing) has gained significant traction due to its capacity to expose program vulnerabilities automatically. Fuzz testing campaigns generate large amounts of data, making them ideal for the application of machine learning (ML). Neural program smoothing, a specific family of ML-guided fuzzers, aims to use a neural network as a smooth approximation of the program target for new test case generation. In this paper, we conduct the most extensive evaluation of neural program smoothing (NPS) fuzzers against standard gray-box fuzzers (>11 CPU years and >5.5 GPU years), and make the following contributions: We find that the original performance claims for NPS fuzzers do not hold; a gap we relate to fundamental, implementation, and experimental limitations of prior works. We contribute the first in-depth analysis of the contribution of machine learning and gradient-based mutations in NPS. We implement Neuzz++, which shows that addressing the practical limitations of NPS fuzzers improves performance, but that standard gray-box fuzzers almost always surpass NPS-based fuzzers. As a consequence, we propose new guidelines targeted at benchmarking fuzzing based on machine learning, and present MLFuzz, a platform with GPU access for easy and reproducible evaluation of ML-based fuzzers. Neuzz++, MLFuzz, and all our data are public. @InProceedings{ESEC/FSE23p133, author = {Maria-Irina Nicolae and Max Eisele and Andreas Zeller}, title = {Revisiting Neural Program Smoothing for Fuzzing}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {133--145}, doi = {10.1145/3611643.3616308}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Elbaum, Sebastian |
ESEC/FSE '23-IVR: "Deeper Notions of Correctness ..."
Deeper Notions of Correctness in Image-Based DNNs: Lifting Properties from Pixel to Entities
Felipe Toledo, David Shriver, Sebastian Elbaum, and Matthew B. Dwyer (University of Virginia, USA) Deep Neural Networks (DNNs) that process images are being widely used for many safety-critical tasks, from autonomous vehicles to medical diagnosis. Currently, DNN correctness properties are defined at the pixel level over the entire input. Such properties are useful to expose system failures related to sensor noise or adversarial attacks, but they cannot capture features that are relevant to domain-specific entities and reflect richer types of behaviors. To overcome this limitation, we envision the specification of properties based on the entities that may be present in image input, capturing their semantics and how they change. Creating such properties today is difficult as it requires determining where the entities appear in images, defining how each entity can change, and writing a specification that is compatible with each particular V&V client. We introduce an initial framework structured around those challenges to assist in automatically generating Domain-specific Entity-based properties by leveraging object detection models to identify entities in images and creating properties based on entity features. Our feasibility study provides initial evidence that the new properties can uncover interesting system failures, such as how changes in skin color can modify the output of a gender classification network. We conclude by analyzing the framework's potential to implement the vision and by outlining directions for future work. @InProceedings{ESEC/FSE23p2122, author = {Felipe Toledo and David Shriver and Sebastian Elbaum and Matthew B. Dwyer}, title = {Deeper Notions of Correctness in Image-Based DNNs: Lifting Properties from Pixel to Entities}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2122--2126}, doi = {10.1145/3611643.3613079}, year = {2023}, } Publisher's Version ESEC/FSE '23: "Pitfalls in Experiments with ..." Pitfalls in Experiments with DNN4SE: An Analysis of the State of the Practice Sira Vegas and Sebastian Elbaum (Universidad Politécnica de Madrid, Spain; University of Virginia, USA) Software engineering (SE) techniques are increasingly relying on deep learning approaches to support many SE tasks, from bug triaging to code generation. To assess the efficacy of such techniques, researchers typically perform controlled experiments. Conducting these experiments, however, is particularly challenging given the complexity of the space of variables involved, from specialized and intricate architectures and algorithms to a large number of training hyper-parameters and choices of evolving datasets, all compounded by how rapidly the machine learning technology is advancing, and the inherent sources of randomness in the training process. In this work we conduct a mapping study, examining 194 experiments with techniques that rely on deep neural networks (DNNs) appearing in 55 papers published in premier SE venues to provide a characterization of the state of the practice, pinpointing experiments’ common trends and pitfalls. Our study reveals that most of the experiments, including those that have received ACM artifact badges, have fundamental limitations that raise doubts about the reliability of their findings. 
More specifically, we find: 1) weak analyses to determine that there is a true relationship between independent and dependent variables (87% of the experiments), 2) limited control over the space of DNN-relevant variables, which can render a relationship between dependent variables and treatments that is correlational rather than causal (100% of the experiments), and 3) a lack of specificity about which DNN variables and their values are utilized in the experiments to define the treatments being applied (86% of the experiments), which makes it unclear whether the techniques designed are the ones being assessed, or how the sources of extraneous variation are controlled. We provide some practical recommendations to address these limitations. @InProceedings{ESEC/FSE23p528, author = {Sira Vegas and Sebastian Elbaum}, title = {Pitfalls in Experiments with DNN4SE: An Analysis of the State of the Practice}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {528--540}, doi = {10.1145/3611643.3616320}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available ESEC/FSE '23: "Neural-Based Test Oracle Generation: ..." Neural-Based Test Oracle Generation: A Large-Scale Evaluation and Lessons Learned Soneya Binta Hossain, Antonio Filieri, Matthew B. Dwyer, Sebastian Elbaum, and Willem Visser (University of Virginia, USA; Amazon Web Services, USA) Defining test oracles is crucial and central to test development, but manual construction of oracles is expensive. While recent neural-based automated test oracle generation techniques have shown promise, their real-world effectiveness remains a compelling question requiring further exploration and understanding. This paper investigates the effectiveness of TOGA, a recently developed neural-based method for automatic test oracle generation. TOGA utilizes EvoSuite-generated test inputs and generates both exception and assertion oracles. In a Defects4J study, TOGA outperformed specification, search, and neural-based techniques, detecting 57 bugs, including 30 unique bugs not detected by other methods. To gain a deeper understanding of its applicability in real-world settings, we conducted a series of external, extended, and conceptual replication studies of TOGA. In a large-scale study involving 25 real-world Java systems, 223.5K test cases, and 51K injected faults, we evaluate TOGA’s ability to improve fault-detection effectiveness relative to the state-of-the-practice and the state-of-the-art. We find that TOGA misclassifies the type of oracle needed 24% of the time and that, when it classifies correctly, around 62% of the time it is not confident enough to generate any assertion oracle. When it does generate an assertion oracle, more than 47% of those assertions are false positives, and the true-positive assertions only increase fault detection by 0.3% relative to prior work. These findings expose limitations of the state-of-the-art neural-based oracle generation technique, provide valuable insights for improvement, and offer lessons for evaluating future automated oracle generation methods. @InProceedings{ESEC/FSE23p120, author = {Soneya Binta Hossain and Antonio Filieri and Matthew B. Dwyer and Sebastian Elbaum and Willem Visser}, title = {Neural-Based Test Oracle Generation: A Large-Scale Evaluation and Lessons Learned}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {120--132}, doi = {10.1145/3611643.3616265}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
|
Emmi, Michael |
ESEC/FSE '23-INDUSTRY: "Compositional Taint Analysis ..."
Compositional Taint Analysis for Enforcing Security Policies at Scale
Subarno Banerjee, Siwei Cui, Michael Emmi, Antonio Filieri, Liana Hadarean, Peixuan Li, Linghui Luo, Goran Piskachev, Nicolás Rosner, Aritra Sengupta, Omer Tripp, and Jingbo Wang (Amazon Web Services, USA; Texas A&M University, USA; Amazon Web Services, Germany; University of Southern California, USA) Automated static dataflow analysis is an effective technique for detecting security-critical issues like sensitive data leaks and vulnerability to injection attacks. Ensuring high precision and recall requires an analysis that is context-, field-, and object-sensitive. However, it is challenging to attain high precision and recall and scale to large industrial code bases. Compositional-style analyses, in which individual software components are analyzed separately and independently of their usage contexts, compute reusable summaries of components. This is an essential feature when deploying such analyses in CI/CD at code-review time or when scanning deployed container images. In both these settings the majority of software components stay the same between subsequent scans. However, it is not obvious how to extend such analyses to check the kind of contextual taint specifications that arise in practice, while maintaining compositionality. In this work we present contextual dataflow modeling, an extension to compositional analysis that checks complex taint specifications while significantly increasing recall and precision. Furthermore, we show how such high-fidelity analysis can scale in production using three key optimizations: (i) discarding intermediate results for previously-analyzed components, an optimization exploiting the compositional nature of our analysis; (ii) a scope-reduction analysis to reduce the scope of the taint analysis w.r.t. the taint specifications being checked, and (iii) caching of analysis models. We show a 9.85% reduction in false positive rate on a comprehensive test suite comprising the OWASP open-source benchmarks as well as internal real-world code samples. We measure the performance and scalability impact of each individual optimization using open source JVM packages from the Maven central repository and internal AWS service codebases. This combination of high precision, recall, performance, and scalability has allowed us to enforce security policies at scale both internally within Amazon as well as for external customers. @InProceedings{ESEC/FSE23p1985, author = {Subarno Banerjee and Siwei Cui and Michael Emmi and Antonio Filieri and Liana Hadarean and Peixuan Li and Linghui Luo and Goran Piskachev and Nicolás Rosner and Aritra Sengupta and Omer Tripp and Jingbo Wang}, title = {Compositional Taint Analysis for Enforcing Security Policies at Scale}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1985--1996}, doi = {10.1145/3611643.3613889}, year = {2023}, } Publisher's Version |
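A minimal sketch of the compositional core, under heavy simplification: each function gets a reusable summary saying which parameters' taint reaches its return value, and callers compose summaries without re-analyzing callees. The function names and the policy below are invented for illustration and are not the paper's specification language.

# summary[f] = set of parameter indices whose taint flows to f's return value
summaries = {
    "read_request": set(),      # source: returns fresh taint regardless of arguments
    "sanitize":     set(),      # kills taint: nothing flows through
    "concat":       {0, 1},     # taint of either argument reaches the result
}
sources = {"read_request"}
sinks = {"run_query"}           # policy: a tainted argument reaching a sink is a violation

def returns_taint(fn, arg_taints):
    if fn in sources:
        return True
    return any(arg_taints[i] for i in summaries.get(fn, set()))

# Caller-side composition for: run_query(concat(read_request(), sanitize(x)))
t_req = returns_taint("read_request", [])
t_clean = returns_taint("sanitize", [True])
t_sql = returns_taint("concat", [t_req, t_clean])
print("policy violation at sink:", t_sql)   # True: the raw request reaches the query

Because the summaries are context-independent, only changed components need re-summarizing between scans, which is what makes the CI/CD deployment economical.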
|
Endres, Madeline |
ESEC/FSE '23: "A Four-Year Study of Student ..."
A Four-Year Study of Student Contributions to OSS vs. OSS4SG with a Lightweight Intervention
Zihan Fang, Madeline Endres, Thomas Zimmermann, Denae Ford, Westley Weimer, Kevin Leach, and Yu Huang (Vanderbilt University, USA; University of Michigan, USA; Microsoft Research, USA) Modern software engineering practice and training increasingly rely on Open Source Software (OSS). The recent growth in demand for professional software engineers has led to increased contributions to, and usage of, OSS. However, there is limited understanding of the factors affecting how developers, and how new or student developers in particular, decide which OSS projects to contribute to, a process critical to OSS sustainability, access, adoption, and growth. To better understand OSS contributions from the developers of tomorrow, we conducted a four-year study with 1,361 students investigating the life cycle of their contributions (from project selection to pull request acceptance). During the study, we also delivered a lightweight intervention to promote the awareness of open source projects for social good (OSS4SG), OSS projects that have positive impacts in other domains. Using both quantitative and qualitative methods, we analyze student experience reports and the pull requests they submit. Compared to general OSS projects, we find significant differences in project selection (𝑝 < 0.0001, effect size = 0.84), student motivation (𝑝 < 0.01, effect size = 0.13), and increased pull-request acceptance rates for OSS4SG contributions. We also find that our intervention correlates with increased student contributions to OSS4SG (𝑝 < 0.0001, effect size = 0.38). Finally, we analyze correlations of factors such as gender or working with a partner. Our findings may help improve the experience for new developers participating in OSS4SG and the quality of their contributions. We also hope our work helps educators, project leaders, and contributors to build a mutually-beneficial framework for the future growth of OSS4SG. @InProceedings{ESEC/FSE23p3, author = {Zihan Fang and Madeline Endres and Thomas Zimmermann and Denae Ford and Westley Weimer and Kevin Leach and Yu Huang}, title = {A Four-Year Study of Student Contributions to OSS vs. OSS4SG with a Lightweight Intervention}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {3--15}, doi = {10.1145/3611643.3616250}, year = {2023}, } Publisher's Version |
|
Escobar-Velásquez, Camilo |
ESEC/FSE '23-DEMO: "CONAN: Statically Detecting ..."
CONAN: Statically Detecting Connectivity Issues in Android Applications
Alejandro Mazuera-Rozo, Camilo Escobar-Velásquez, Juan Espitia-Acero, Mario Linares-Vásquez, and Gabriele Bavota (USI Lugano, Switzerland; University of Los Andes, Colombia) Mobile apps are increasingly used in daily activities. Most apps require Internet connectivity to be fully exploited. Despite the fact that global access to the Internet has improved over the years, there are still complex connectivity scenarios, including situations with zero/unreliable connectivity. In such scenarios, improper handling of Eventual Connectivity Issues may cause bugs and crashes that worsen the user experience. Even though these issues have been studied in the literature, no automatic detection techniques are available. To address the mentioned gap, we have created the open source CONAN tool. CONAN can statically detect 16 types of Eventual Connectivity Issues within Android apps; it works at the source code level and alerts developers of any connectivity issue, highlighting them directly in the IDE or generating a report explaining the detected errors. In this paper, we present the technical aspects and a video of our tool, which are publicly available at https://tinyurl.com/CONAN-lint. @InProceedings{ESEC/FSE23p2182, author = {Alejandro Mazuera-Rozo and Camilo Escobar-Velásquez and Juan Espitia-Acero and Mario Linares-Vásquez and Gabriele Bavota}, title = {CONAN: Statically Detecting Connectivity Issues in Android Applications}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2182--2186}, doi = {10.1145/3611643.3613097}, year = {2023}, } Publisher's Version Video Info |
|
Espitia-Acero, Juan |
ESEC/FSE '23-DEMO: "CONAN: Statically Detecting ..."
CONAN: Statically Detecting Connectivity Issues in Android Applications
Alejandro Mazuera-Rozo, Camilo Escobar-Velásquez, Juan Espitia-Acero, Mario Linares-Vásquez, and Gabriele Bavota (USI Lugano, Switzerland; University of Los Andes, Colombia) Mobile apps are increasingly used in daily activities. Most apps require Internet connectivity to be fully exploited. Despite the fact that global access to the Internet has improved over the years, there are still complex connectivity scenarios, including situations with zero/unreliable connectivity. In such scenarios, improper handling of Eventual Connectivity Issues may cause bugs and crashes that worsen the user experience. Even though these issues have been studied in the literature, no automatic detection techniques are available. To address the mentioned gap, we have created the open source CONAN tool. CONAN can statically detect 16 types of Eventual Connectivity Issues within Android apps; it works at the source code level and alerts developers of any connectivity issue, highlighting them directly in the IDE or generating a report explaining the detected errors. In this paper, we present the technical aspects and a video of our tool, which are publicly available at https://tinyurl.com/CONAN-lint. @InProceedings{ESEC/FSE23p2182, author = {Alejandro Mazuera-Rozo and Camilo Escobar-Velásquez and Juan Espitia-Acero and Mario Linares-Vásquez and Gabriele Bavota}, title = {CONAN: Statically Detecting Connectivity Issues in Android Applications}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2182--2186}, doi = {10.1145/3611643.3613097}, year = {2023}, } Publisher's Version Video Info |
|
Estep, Sam |
ESEC/FSE '23: "NaNofuzz: A Usable Tool for ..."
NaNofuzz: A Usable Tool for Automatic Test Generation
Matthew C. Davis, Sangheon Choi, Sam Estep, Brad A. Myers, and Joshua Sunshine (Carnegie Mellon University, USA; Rose-Hulman Institute of Technology, USA) In the United States alone, software testing labor is estimated to cost $48 billion USD per year. Despite widespread test execution automation and automation in other areas of software engineering, test suites continue to be created manually by software engineers. We have built a test generation tool, called NaNofuzz, that helps users find bugs in their code by suggesting tests where the output is likely indicative of a bug, e.g., that return NaN (not-a-number) values. NaNofuzz is an interactive tool embedded in a development environment to fit into the programmer's workflow. NaNofuzz tests a function with as little as one button press, analyses the program to determine inputs it should evaluate, executes the program on those inputs, and categorizes outputs to prioritize likely bugs. We conducted a randomized controlled trial with 28 professional software engineers using NaNofuzz as the intervention treatment and the popular manual testing tool, Jest, as the control treatment. Participants using NaNofuzz on average identified bugs more accurately (p < .05, by 30%), were more confident in their tests (p < .03, by 20%), and finished their tasks more quickly (p < .007, by 30%). @InProceedings{ESEC/FSE23p1114, author = {Matthew C. Davis and Sangheon Choi and Sam Estep and Brad A. Myers and Joshua Sunshine}, title = {NaNofuzz: A Usable Tool for Automatic Test Generation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1114--1126}, doi = {10.1145/3611643.3616327}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
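The idea of categorizing outputs to prioritize likely bugs is easy to sketch (in Python here; NaNofuzz itself targets TypeScript inside the IDE): draw inputs that include extreme floats, run the function, and bucket results into ok / NaN / non-finite so the suspicious buckets get reviewed first. The unit under test is invented.

import math, random

def lerp(a, b, t):               # toy unit under test
    return a + (b - a) * t       # infinite inputs can yield NaN via inf - inf

POOL = [-1.0, 0.0, 0.5, 1.0, 1e308, float("inf"), float("-inf")]
buckets = {"ok": [], "nan": [], "inf": []}
for _ in range(500):
    a, b, t = (random.choice(POOL) for _ in range(3))
    out = lerp(a, b, t)
    kind = "nan" if math.isnan(out) else ("inf" if math.isinf(out) else "ok")
    buckets[kind].append(((a, b, t), out))

for kind, cases in buckets.items():          # suspicious buckets surface first
    print(kind, len(cases), cases[:1])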
|
Evtikhiev, Mikhail |
ESEC/FSE '23-INDUSTRY: "BFSig: Leveraging File Significance ..."
BFSig: Leveraging File Significance in Bus Factor Estimation
Vahid Haratian, Mikhail Evtikhiev, Pouria Derakhshanfar, Eray Tüzün, and Vladimir Kovalenko (Bilkent University, Turkiye; JetBrains Research, Cyprus; JetBrains Research, Netherlands) Software projects experience the departure of developers due to various reasons. As developers are one of the main sources of knowledge in software projects, their absence will inevitably result in a certain degree of knowledge depletion. Bus Factor (BF) is a metric to evaluate how this knowledge loss can affect the project’s continuity. Conventionally, BF is calculated as the size of the smallest set of developers whose departure would remove over half the project knowledge. Current state-of-the-art approaches measure developers’ knowledge by the number of authored files, utilizing version control system (VCS) information. However, numerous studies have shown that files in software projects have different significance. In this study, we explore how weighting files according to their significance affects the performance of two prevailing BF estimators. We derive significance scores by computing five well-known graph metrics from the project’s dependency graph: PageRank, In-/Out-/All-Degree, and Betweenness Centralities. Furthermore, we introduce BFSig, a prototype of our approach. Finally, we present a new dataset comprising reported BF scores collected by surveying software practitioners from five prominent GitHub repositories. Our results indicate that BFSig outperforms the baselines by up to an 18% reduction in terms of Normalized Mean Absolute Error (NMAE). Moreover, BFSig yields 18% fewer false negatives in identifying potential risks associated with low BF. Besides, our respondents confirmed BFSig's versatility by showing its ability to assess the BF of a project's subfolders. In conclusion, we believe that, to estimate BF from authorship, software components of higher importance should be assigned heavier weights. Currently, BFSig exclusively explores the topological characteristics of these components. Nevertheless, considering attributes such as code complexity and bug proneness could potentially enhance the performance of BFSig. @InProceedings{ESEC/FSE23p1926, author = {Vahid Haratian and Mikhail Evtikhiev and Pouria Derakhshanfar and Eray Tüzün and Vladimir Kovalenko}, title = {BFSig: Leveraging File Significance in Bus Factor Estimation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1926--1936}, doi = {10.1145/3611643.3613877}, year = {2023}, } Publisher's Version |
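The following sketch shows the weighted-knowledge idea under a conventional greedy BF computation: developers are removed most-knowledgeable-first until more than half of the file-significance-weighted knowledge is gone. The weights, which BFSig would derive from graph metrics such as PageRank on the dependency graph, are made up here, and the greedy removal is an approximation of the smallest-set definition.

def bus_factor(knowledge, weights):
    # knowledge[dev] = set of files the developer authored
    # weights[file]  = significance score of the file
    total = sum(weights[f] for files in knowledge.values() for f in files)
    devs = dict(knowledge)
    lost, removed = 0.0, []
    while lost <= total / 2 and devs:
        # remove the developer holding the most weighted knowledge
        dev = max(devs, key=lambda d: sum(weights[f] for f in devs[d]))
        lost += sum(weights[f] for f in devs.pop(dev))
        removed.append(dev)
    return len(removed)

weights = {"core.py": 5.0, "api.py": 3.0, "docs.md": 0.5}
knowledge = {"alice": {"core.py"}, "bob": {"api.py", "docs.md"}, "eve": {"docs.md"}}
print(bus_factor(knowledge, weights))   # 1: alice alone holds over half the weighted knowledge

With uniform weights the same project would look safer; significance weighting is what exposes alice as the single point of failure.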
|
Fallahzadeh, Emad |
ESEC/FSE '23: "Accelerating Continuous Integration ..."
Accelerating Continuous Integration with Parallel Batch Testing
Emad Fallahzadeh, Amir Hossein Bavand, and Peter C. Rigby (Concordia University, Canada) Continuous integration at scale is costly but essential to software development. Various test optimization techniques including test selection and prioritization aim to reduce the cost. Test batching is an effective but overlooked alternative. This study evaluates parallelization’s effect by adjusting machine count for test batching and introduces two novel approaches. We establish TestAll as a baseline to study the impact of parallelism and machine count on feedback time. We re-evaluate ConstantBatching and introduce DynamicBatching, which adapts batch size based on the remaining changes in the queue. We also propose TestCaseBatching, enabling new builds to join a batch before full test execution, thus speeding up continuous integration. Our evaluations utilize Ericsson’s results and 276 million test outcomes from open-source Chrome, assessing feedback time, execution reduction, and providing access to Chrome project scripts and data. The results reveal a non-linear impact of test parallelization on feedback time, as each test delay compounds across the entire test queue. ConstantBatching, with a batch size of 4, utilizes up to 72% fewer machines to maintain the actual average feedback time and provides a constant execution reduction of up to 75%. Similarly, DynamicBatching maintains the actual average feedback time with up to 91% fewer machines and exhibits variable execution reduction of up to 99%. TestCaseBatching maintains the actual average feedback time with up to 81% fewer machines and demonstrates variable execution reduction of up to 67%. We recommend practitioners use DynamicBatching and TestCaseBatching to reduce the required testing machines efficiently. Analyzing historical data to find the threshold where adding more machines has minimal impact on feedback time is also crucial for resource-effective testing. @InProceedings{ESEC/FSE23p55, author = {Emad Fallahzadeh and Amir Hossein Bavand and Peter C. Rigby}, title = {Accelerating Continuous Integration with Parallel Batch Testing}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {55--67}, doi = {10.1145/3611643.3616255}, year = {2023}, } Publisher's Version |
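A minimal sketch of the two batching flavors, with an assumed sizing rule for DynamicBatching (the paper adapts batch size to the remaining changes in the queue; the exact rule below, spreading the backlog across available machines, is illustrative rather than Ericsson's or Chrome's):

import math

def constant_batches(queue, size=4):
    # ConstantBatching: fixed-size batches; one test run covers a whole batch,
    # and only failing batches need per-change follow-up runs.
    return [queue[i:i + size] for i in range(0, len(queue), size)]

def dynamic_batches(queue, machines=8):
    # DynamicBatching: batch size grows with the backlog, so a long queue
    # drains in one wave across the available machines.
    size = max(1, math.ceil(len(queue) / machines))
    return [queue[i:i + size] for i in range(0, len(queue), size)]

pending = [f"change-{i}" for i in range(19)]
print([len(b) for b in constant_batches(pending)])   # [4, 4, 4, 4, 3]
print([len(b) for b in dynamic_batches(pending)])    # [3, 3, 3, 3, 3, 3, 1]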
|
Fan, Hao |
ESEC/FSE '23-INDUSTRY: "TraceDiag: Adaptive, Interpretable, ..."
TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems
Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Xiaomin Wu, Meng Zhang, Qingjun Chen, Xin Gao, Xuedong Gao, Hao Fan, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang (Microsoft, China; Microsoft 365, China; Microsoft 365, USA) Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. However, performing RCA on modern microservice systems can be challenging due to their large scale, as they usually comprise hundreds of components, leading to significant human effort. This paper proposes TraceDiag, an end-to-end RCA framework that addresses the challenges for large-scale microservice systems. It leverages reinforcement learning to learn a pruning policy for the service dependency graph to automatically eliminate redundant components, thereby significantly improving the RCA efficiency. The learned pruning policy is interpretable and fully adaptive to new RCA instances. With the pruned graph, a causal-based method can be executed with high accuracy and efficiency. The proposed TraceDiag framework is evaluated on real data traces collected from the Microsoft Exchange system, and demonstrates superior performance compared to state-of-the-art RCA approaches. Notably, TraceDiag has been integrated as a critical component in the Microsoft M365 Exchange, resulting in a significant improvement in the system's reliability and a considerable reduction in the human effort required for RCA. @InProceedings{ESEC/FSE23p1762, author = {Ruomeng Ding and Chaoyun Zhang and Lu Wang and Yong Xu and Minghua Ma and Xiaomin Wu and Meng Zhang and Qingjun Chen and Xin Gao and Xuedong Gao and Hao Fan and Saravan Rajmohan and Qingwei Lin and Dongmei Zhang}, title = {TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1762--1773}, doi = {10.1145/3611643.3613864}, year = {2023}, } Publisher's Version |
|
Fan, Lingling |
ESEC/FSE '23: "Comparison and Evaluation ..."
Comparison and Evaluation on Static Application Security Testing (SAST) Tools for Java
Kaixuan Li, Sen Chen, Lingling Fan, Ruitao Feng, Han Liu, Chengwei Liu, Yang Liu, and Yixiang Chen (East China Normal University, China; Tianjin University, China; Nankai University, China; UNSW, Australia; Nanyang Technological University, Singapore) Static application security testing (SAST) plays a significant role in the software development life cycle (SDLC). However, it is challenging to comprehensively evaluate the effectiveness of SAST tools to determine which is better at detecting vulnerabilities. In this paper, based on well-defined criteria, we first selected seven free or open-source SAST tools from 161 existing tools for further evaluation. Using synthetic and newly constructed real-world benchmarks, we evaluated and compared these SAST tools from different and comprehensive perspectives such as effectiveness, consistency, and performance. While SAST tools perform well on synthetic benchmarks, our results indicate that only 12.7% of real-world vulnerabilities can be detected by the selected tools. Even combining the detection capability of all tools, most vulnerabilities (70.9%) remain undetected, especially those beyond resource control and insufficiently neutralized input/output vulnerabilities. Although the tools have already built the corresponding detection rules into their capabilities, the detection results still did not meet expectations. The findings unveiled in our comprehensive study help to provide guidance on tool development, improvement, evaluation, and selection for developers, researchers, and potential users. @InProceedings{ESEC/FSE23p921, author = {Kaixuan Li and Sen Chen and Lingling Fan and Ruitao Feng and Han Liu and Chengwei Liu and Yang Liu and Yixiang Chen}, title = {Comparison and Evaluation on Static Application Security Testing (SAST) Tools for Java}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {921--933}, doi = {10.1145/3611643.3616262}, year = {2023}, } Publisher's Version ESEC/FSE '23: "Automated and Context-Aware ..." Automated and Context-Aware Repair of Color-Related Accessibility Issues for Android Apps Yuxin Zhang, Sen Chen, Lingling Fan, Chunyang Chen, and Xiaohong Li (Tianjin University, China; Nankai University, China; Monash University, Australia) Approximately 15% of the world's population is suffering from various disabilities or impairments. However, many mobile UX designers and developers disregard the significance of accessibility for those with disabilities when developing apps. Strikingly, one in seven people might not have the same level of access that other users have, which violates many legal and regulatory standards. On the contrary, if the apps are developed with accessibility in mind, it will drastically improve the user experience for all users as well as maximize revenue. Thus, a large number of studies and some effective tools for detecting accessibility issues have been conducted and proposed to mitigate such a severe problem. However, compared with detection, repair work is clearly lagging behind. This is especially true for color-related accessibility issues, which are among the top issues in apps and have a strongly negative impact on vision and user experience. Apps with such issues are difficult to use for people with low vision and the elderly. Unfortunately, such an issue type cannot be directly fixed by existing repair techniques. 
To this end, we propose Iris, an automated and context-aware repair method to fix the color-related accessibility issues (i.e., the text contrast issues and the image contrast issues) for apps. By leveraging a novel context-aware technique that resolves the optimal colors and a vital phase of attribute-to-repair localization, Iris not only repairs the color contrast issues but also guarantees the consistency of the design style between the original UI page and the repaired UI page. Our experiments show that Iris can achieve a 91.38% repair success rate with high effectiveness and efficiency. The usefulness of Iris has also been evaluated by a user study with a high satisfaction rate as well as developers' positive feedback. 9 of 40 submitted pull requests on GitHub repositories have been accepted and merged into the projects by app developers, and another 4 developers are actively discussing further repairs with us. Iris is publicly available to facilitate this new research direction. @InProceedings{ESEC/FSE23p1255, author = {Yuxin Zhang and Sen Chen and Lingling Fan and Chunyang Chen and Xiaohong Li}, title = {Automated and Context-Aware Repair of Color-Related Accessibility Issues for Android Apps}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1255--1267}, doi = {10.1145/3611643.3616329}, year = {2023}, } Publisher's Version Info |
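For context on the issue type itself: text-contrast issues are defined against the WCAG contrast ratio, computed from the relative luminance of the two colors, with 4.5:1 the usual threshold for normal text. The sketch below shows that standard computation plus one naive repair strategy (darkening the foreground); Iris's context-aware color resolution is considerably more sophisticated and also preserves the design style.

def channel(c):
    # sRGB channel 0-255 -> linear value (WCAG linearization)
    c /= 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def luminance(rgb):
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast(fg, bg):
    hi, lo = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)

def repair_text_color(fg, bg, target=4.5):
    # naive strategy: darken the foreground until the ratio passes
    # (only sensible on light backgrounds; shown for illustration)
    while contrast(fg, bg) < target and any(fg):
        fg = tuple(max(0, c - 5) for c in fg)
    return fg

grey_on_white = ((150, 150, 150), (255, 255, 255))
print(round(contrast(*grey_on_white), 2))    # about 3.0: fails the 4.5:1 AA threshold
print(repair_text_color(*grey_on_white))     # a darker grey that passes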
|
Fan, Zhenye |
ESEC/FSE '23: "A Generative and Mutational ..."
A Generative and Mutational Approach for Synthesizing Bug-Exposing Test Cases to Guide Compiler Fuzzing
Guixin Ye, Tianmin Hu, Zhanyong Tang, Zhenye Fan, Shin Hwei Tan, Bo Zhang, Wenxiang Qian, and Zheng Wang (Northwest University, China; Concordia University, Canada; Tencent, China; University of Leeds, UK) Random test case generation, or fuzzing, is a viable means for uncovering compiler bugs. Unfortunately, compiler fuzzing can be time-consuming and inefficient with purely randomly generated test cases due to the complexity of modern compilers. We present COMFUZZ, a focused compiler fuzzing framework. COMFUZZ aims to improve compiler fuzzing efficiency by focusing on testing components and language features that are likely to trigger compiler bugs. Our key insight is that human developers tend to make common, repeated errors across compiler implementations; hence, we can leverage previously reported bug-exposing test cases of a programming language to test a new compiler implementation. To this end, COMFUZZ employs deep learning to learn a test program generator from open-source projects hosted on GitHub. With the machine-generated test programs in place, COMFUZZ then leverages a set of carefully designed mutation rules to improve the coverage and bug-exposing capabilities of the test cases. We evaluate COMFUZZ on 11 compilers for the JS and Java programming languages. Within 260 hours of automated testing runs, we discovered 33 unique bugs across nine compilers, of which 29 have been confirmed and 22, including an API documentation defect, have already been fixed by the developers. We also compared COMFUZZ to eight prior fuzzers on four evaluation metrics. In a 24-hour comparative test, COMFUZZ uncovers at least 1.5× more bugs than the state-of-the-art baselines. @InProceedings{ESEC/FSE23p1127, author = {Guixin Ye and Tianmin Hu and Zhanyong Tang and Zhenye Fan and Shin Hwei Tan and Bo Zhang and Wenxiang Qian and Zheng Wang}, title = {A Generative and Mutational Approach for Synthesizing Bug-Exposing Test Cases to Guide Compiler Fuzzing}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1127--1139}, doi = {10.1145/3611643.3616332}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
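The mutational half of such an approach is easy to make concrete. The sketch below applies one simple, assumed mutation rule, swapping + and - operators, to a seed test program using Python's ast module (Python 3.9+ for ast.unparse); COMFUZZ's rules are richer and target JS and Java compilers.

import ast

class SwapAddSub(ast.NodeTransformer):
    # one mutation rule: exchange every Add with Sub and vice versa
    def visit_BinOp(self, node):
        self.generic_visit(node)          # mutate children first
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        elif isinstance(node.op, ast.Sub):
            node.op = ast.Add()
        return node

seed = "def f(a, b):\n    return (a + b) - a * 2\n"
mutant = ast.unparse(SwapAddSub().visit(ast.parse(seed)))
print(mutant)                             # the mutated body: return a - b + a * 2

Each mutant is then compiled by the target compiler; crashes or wrong-code results become candidate bug reports.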
|
Fang, Hongbo |
ESEC/FSE '23: "Matching Skills, Past Collaboration, ..."
Matching Skills, Past Collaboration, and Limited Competition: Modeling When Open-Source Projects Attract Contributors
Hongbo Fang, James Herbsleb, and Bogdan Vasilescu (Carnegie Mellon University, USA) Attracting and retaining new developers is often at the heart of open-source project sustainability and success. Previous research found many intrinsic (or endogenous) project characteristics associated with the attractiveness of projects to new developers, but the impact of factors external to the project itself have largely been overlooked. In this work, we focus on one such external factor, a project's labor pool, which is defined as the set of contributors active in the overall open-source ecosystem that the project could plausibly attempt to recruit from at a given time. How are the size and characteristics of the labor pool associated with a project's attractiveness to new contributors? Through an empirical study of over 516,893 Python projects, we found that the size of the project's labor pool, the technical skill match, and the social connection between the project's labor pool and members of the focal project all significantly influence the number of new developers that the focal project attracts, with the competition between projects with overlapping labor pools also playing a role. Overall, the labor pool factors add considerable explanatory power compared to models with only project-level characteristics. @InProceedings{ESEC/FSE23p42, author = {Hongbo Fang and James Herbsleb and Bogdan Vasilescu}, title = {Matching Skills, Past Collaboration, and Limited Competition: Modeling When Open-Source Projects Attract Contributors}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {42--54}, doi = {10.1145/3611643.3616282}, year = {2023}, } Publisher's Version |
|
Fang, Zihan |
ESEC/FSE '23: "A Four-Year Study of Student ..."
A Four-Year Study of Student Contributions to OSS vs. OSS4SG with a Lightweight Intervention
Zihan Fang, Madeline Endres, Thomas Zimmermann, Denae Ford, Westley Weimer, Kevin Leach, and Yu Huang (Vanderbilt University, USA; University of Michigan, USA; Microsoft Research, USA) Modern software engineering practice and training increasingly rely on Open Source Software (OSS). The recent growth in demand for professional software engineers has led to increased contributions to, and usage of, OSS. However, there is limited understanding of the factors affecting how developers, and how new or student developers in particular, decide which OSS projects to contribute to, a process critical to OSS sustainability, access, adoption, and growth. To better understand OSS contributions from the developers of tomorrow, we conducted a four-year study with 1,361 students investigating the life cycle of their contributions (from project selection to pull request acceptance). During the study, we also delivered a lightweight intervention to promote the awareness of open source projects for social good (OSS4SG), OSS projects that have positive impacts in other domains. Using both quantitative and qualitative methods, we analyze student experience reports and the pull requests they submit. Compared to general OSS projects, we find significant differences in project selection (𝑝 < 0.0001, effect size = 0.84), student motivation (𝑝 < 0.01, effect size = 0.13), and increased pull-request acceptance rates for OSS4SG contributions. We also find that our intervention correlates with increased student contributions to OSS4SG (𝑝 < 0.0001, effect size = 0.38). Finally, we analyze correlations of factors such as gender or working with a partner. Our findings may help improve the experience for new developers participating in OSS4SG and the quality of their contributions. We also hope our work helps educators, project leaders, and contributors to build a mutually-beneficial framework for the future growth of OSS4SG. @InProceedings{ESEC/FSE23p3, author = {Zihan Fang and Madeline Endres and Thomas Zimmermann and Denae Ford and Westley Weimer and Kevin Leach and Yu Huang}, title = {A Four-Year Study of Student Contributions to OSS vs. OSS4SG with a Lightweight Intervention}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {3--15}, doi = {10.1145/3611643.3616250}, year = {2023}, } Publisher's Version |
|
Feldman, Kobi |
ESEC/FSE '23: "On the Relationship between ..."
On the Relationship between Code Verifiability and Understandability
Kobi Feldman, Martin Kellogg, and Oscar Chaparro (College of William and Mary, USA; New Jersey Institute of Technology, USA) Proponents of software verification have argued that simpler code is easier to verify: that is, that verification tools issue fewer false positives and require less human intervention when analyzing simpler code. We empirically validate this assumption by comparing the number of warnings produced by four state-of-the-art verification tools on 211 snippets of Java code with 20 metrics of code comprehensibility from human subjects in six prior studies. Our experiments, based on a statistical (meta-)analysis, show that, in aggregate, there is a small correlation (r = 0.23) between understandability and verifiability. The results support the claim that easy-to-verify code is often easier to understand than code that requires more effort to verify. Our work has implications for the users and designers of verification tools and for future attempts to automatically measure code comprehensibility: verification tools may have ancillary benefits to understandability, and measuring understandability may require reasoning about semantic, not just syntactic, code properties. @InProceedings{ESEC/FSE23p211, author = {Kobi Feldman and Martin Kellogg and Oscar Chaparro}, title = {On the Relationship between Code Verifiability and Understandability}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {211--223}, doi = {10.1145/3611643.3616242}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Feng, Botao |
ESEC/FSE '23-INDUSTRY: "STEAM: Observability-Preserving ..."
STEAM: Observability-Preserving Trace Sampling
Shilin He, Botao Feng, Liqun Li, Xu Zhang, Yu Kang, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang (Microsoft Research, Beijing, China; Microsoft, Beijing, China; Microsoft, USA) In distributed systems and microservice applications, tracing is a crucial observability signal employed for comprehending their internal states. To mitigate the overhead associated with distributed tracing, most tracing frameworks utilize a uniform sampling strategy, which retains only a subset of traces. However, this approach is insufficient for preserving system observability. This is primarily attributed to the long-tail distribution of traces in practice, which results in the omission or rarity of minority yet critical traces after sampling. In this study, we introduce an observability-preserving trace sampling method, denoted as STEAM, which aims to retain as much information as possible in the sampled traces. We employ Graph Neural Networks (GNN) for trace representation, while incorporating domain knowledge of trace comparison through logical clauses. Subsequently, we employ a scalable approach to sample traces, emphasizing mutually dissimilar traces. STEAM has been implemented on top of OpenTelemetry, comprising approximately 1.6K lines of Golang code and 2K lines of Python code. Evaluation on four benchmark microservice applications and a production system demonstrates the superior performance of our approach compared to baseline methods. Furthermore, STEAM is capable of processing 15,000 traces in approximately 4 seconds. @InProceedings{ESEC/FSE23p1750, author = {Shilin He and Botao Feng and Liqun Li and Xu Zhang and Yu Kang and Qingwei Lin and Saravan Rajmohan and Dongmei Zhang}, title = {STEAM: Observability-Preserving Trace Sampling}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1750--1761}, doi = {10.1145/3611643.3613881}, year = {2023}, } Publisher's Version |
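The "emphasizing mutually dissimilar traces" step can be approximated with greedy farthest-point selection over trace representations, as sketched below; plain feature vectors stand in for STEAM's GNN embeddings and logical clauses, and the distance choice is an assumption.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def sample_diverse(traces, budget):
    kept = [traces[0]]
    while len(kept) < min(budget, len(traces)):
        # next pick: the trace whose nearest kept neighbor is farthest away
        nxt = max((t for t in traces if t not in kept),
                  key=lambda t: min(dist(t, k) for k in kept))
        kept.append(nxt)
    return kept

# two dense clusters plus one outlier: uniform sampling would likely keep
# near-duplicates; diversity sampling keeps one representative per region
traces = [(1, 1), (1.1, 1), (0.9, 1.1), (9, 9), (9.1, 9), (5, 0)]
print(sample_diverse(traces, 3))   # e.g. [(1, 1), (9, 9), (5, 0)]

This is why rare, long-tail traces survive the sampling: being unlike everything already kept is exactly what gets a trace selected.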
|
Feng, Ruitao |
ESEC/FSE '23: "Comparison and Evaluation ..."
Comparison and Evaluation on Static Application Security Testing (SAST) Tools for Java
Kaixuan Li, Sen Chen, Lingling Fan, Ruitao Feng, Han Liu, Chengwei Liu, Yang Liu, and Yixiang Chen (East China Normal University, China; Tianjin University, China; Nankai University, China; UNSW, Australia; Nanyang Technological University, Singapore) Static application security testing (SAST) plays a significant role in the software development life cycle (SDLC). However, it is challenging to comprehensively evaluate the effectiveness of SAST tools to determine which is better at detecting vulnerabilities. In this paper, based on well-defined criteria, we first selected seven free or open-source SAST tools from 161 existing tools for further evaluation. Using synthetic and newly constructed real-world benchmarks, we evaluated and compared these SAST tools from different and comprehensive perspectives such as effectiveness, consistency, and performance. While SAST tools perform well on synthetic benchmarks, our results indicate that only 12.7% of real-world vulnerabilities can be detected by the selected tools. Even combining the detection capability of all tools, most vulnerabilities (70.9%) remain undetected, especially those beyond resource control and insufficiently neutralized input/output vulnerabilities. Although the tools have already built the corresponding detection rules into their capabilities, the detection results still did not meet expectations. The findings unveiled in our comprehensive study help to provide guidance on tool development, improvement, evaluation, and selection for developers, researchers, and potential users. @InProceedings{ESEC/FSE23p921, author = {Kaixuan Li and Sen Chen and Lingling Fan and Ruitao Feng and Han Liu and Chengwei Liu and Yang Liu and Yixiang Chen}, title = {Comparison and Evaluation on Static Application Security Testing (SAST) Tools for Java}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {921--933}, doi = {10.1145/3611643.3616262}, year = {2023}, } Publisher's Version |
|
Feng, Shiwei |
ESEC/FSE '23: "PEM: Representing Binary Program ..."
PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model
Xiangzhe Xu, Zhou Xuan, Shiwei Feng, Siyuan Cheng, Yapeng Ye, Qingkai Shi, Guanhong Tao, Le Yu, Zhuo Zhang, and Xiangyu Zhang (Purdue University, USA) Binary similarity analysis determines if two binary executables are from the same source program. Existing techniques leverage static and dynamic program features and may utilize advanced Deep Learning techniques. Although they have demonstrated great potential, the community believes that a more effective representation of program semantics can further improve similarity analysis. In this paper, we propose a new method to represent binary program semantics. It is based on a novel probabilistic execution engine that can effectively sample the input space and the program path space of subject binaries. More importantly, it ensures that the collected samples are comparable across binaries, addressing the substantial variations of input specifications. Our evaluation on 9 real-world projects with 35k functions, and comparison with 6 state-of-the-art techniques show that PEM can achieve a precision of 96% with common settings, outperforming the baselines by 10-20%. @InProceedings{ESEC/FSE23p401, author = {Xiangzhe Xu and Zhou Xuan and Shiwei Feng and Siyuan Cheng and Yapeng Ye and Qingkai Shi and Guanhong Tao and Le Yu and Zhuo Zhang and Xiangyu Zhang}, title = {PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {401--412}, doi = {10.1145/3611643.3616301}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
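A toy rendition of the sampling idea: feed the same sampled inputs to two functions, collect the observable values (including crashes), and compare the observation sets. PEM operates on binaries, also samples program paths, and aligns input specifications; everything below is a simplification with invented functions.

import random

def observe(fn, trials=200):
    values = set()
    for s in range(trials):
        rng = random.Random(s)            # identical inputs for every candidate
        x, y = rng.randint(-50, 50), rng.randint(-50, 50)
        try:
            values.add(fn(x, y))
        except Exception as e:
            values.add(type(e).__name__)  # crashes are observations too
    return values

def similarity(f, g):
    a, b = observe(f), observe(g)
    return len(a & b) / len(a | b)        # Jaccard over observed values

f1 = lambda x, y: (x + y) * 2             # "source" function
f2 = lambda x, y: 2 * x + 2 * y           # same semantics, different syntax
f3 = lambda x, y: x * y                   # unrelated function
print(similarity(f1, f2), similarity(f1, f3))   # 1.0 versus much lower

Seeding the generator per trial is what makes the samples comparable across candidates, the property the abstract emphasizes.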
|
Feng, Sidong |
ESEC/FSE '23-INDUSTRY: "Towards Efficient Record and ..."
Towards Efficient Record and Replay: A Case Study in WeChat
Sidong Feng, Haochuan Lu, Ting Xiong, Yuetang Deng, and Chunyang Chen (Monash University, Australia; Tencent, China) WeChat, a widely-used messenger app boasting over 1 billion monthly active users, requires effective app quality assurance for its complex features. Record-and-replay tools are crucial in achieving this goal. Despite the extensive development of these tools, the impact of waiting time between replay events has been largely overlooked. On one hand, a long waiting time for executing replay events on fully-rendered GUIs slows down the process. On the other hand, a short waiting time can lead to events executing on partially-rendered GUIs, negatively affecting replay effectiveness. An optimal waiting time should strike a balance between effectiveness and efficiency. We introduce WeReplay, a lightweight image-based approach that dynamically adjusts inter-event time based on the GUI rendering state. Given the real-time streaming on the GUI, WeReplay employs a deep learning model to infer the rendering state and synchronize with the replaying tool, scheduling the next event when the GUI is fully rendered. Our evaluation shows that our model achieves 92.1% precision and 93.3% recall in discerning GUI rendering states in the WeChat app. Through assessing the performance in replaying 23 common WeChat usage scenarios, WeReplay successfully replays all scenarios on the same and different devices more efficiently than the state-of-the-practice baselines. @InProceedings{ESEC/FSE23p1681, author = {Sidong Feng and Haochuan Lu and Ting Xiong and Yuetang Deng and Chunyang Chen}, title = {Towards Efficient Record and Replay: A Case Study in WeChat}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1681--1692}, doi = {10.1145/3611643.3613880}, year = {2023}, } Publisher's Version |
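The scheduling loop at the heart of this approach is easy to sketch: poll the screen, ask a rendering-state classifier whether the GUI has settled, and only then dispatch the next recorded event. All names below (`capture_screenshot`, `is_fully_rendered`, `dispatch`) are hypothetical stand-ins for WeReplay's real components, with trivial placeholder bodies.

```python
import time

def capture_screenshot():
    return b""  # placeholder frame; the real harness grabs the device screen

def is_fully_rendered(frame):
    return True  # placeholder for the paper's deep learning rendering classifier

def dispatch(event):
    pass  # placeholder; the real tool drives the event through the replayer

def replay(events, poll_interval=0.05, timeout=5.0):
    """Fire each event only once the GUI is inferred to be fully rendered."""
    for event in events:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if is_fully_rendered(capture_screenshot()):
                break                  # GUI settled: safe to act
            time.sleep(poll_interval)  # still rendering: wait a little longer
        dispatch(event)                # fall through on timeout as a fail-safe
```

The adaptive waiting replaces a fixed inter-event delay: short when rendering finishes quickly, longer when it does not.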
|
Feng, Siyue |
ESEC/FSE '23: "Tritor: Detecting Semantic ..."
Tritor: Detecting Semantic Code Clones by Building Social Network-Based Triads Model
Deqing Zou, Siyue Feng, Yueming Wu, Wenqi Suo, and Hai Jin (Huazhong University of Science and Technology, China; Nanyang Technological University, Singapore) Code clone detection refers to finding functional similarities between two code fragments, and it is becoming increasingly important as software engineering evolves. It matters because code cloning can increase maintenance costs and even cause the propagation of vulnerabilities, which can have a negative impact on software security. Numerous code clone detection methods have been proposed, including tree-based methods that are capable of detecting semantic code clones. However, since tree structures are complex, these methods are difficult to apply to large-scale clone detection. In this paper, we propose a scalable semantic code clone detector based on a semantically enhanced abstract syntax tree. Specifically, we add control flow and data flow details to the original tree and regard the enhanced tree as a social network. We then build a social network-based triads model to collect similarity features between two methods by analyzing the different types of triads within the network. After obtaining all features, we use them to train a machine learning-based code clone detector (i.e., Tritor). Our comparative experimental results show that Tritor is superior to SourcererCC, RtvNN, Deckard, ASTNN, TBCNN, CDLH, and SCDetector, and is on par with DeepSim and FCCA. As for scalability, Tritor is about 39 times faster than ASTNN, another current state-of-the-art tree-based code clone detector. @InProceedings{ESEC/FSE23p771, author = {Deqing Zou and Siyue Feng and Yueming Wu and Wenqi Suo and Hai Jin}, title = {Tritor: Detecting Semantic Code Clones by Building Social Network-Based Triads Model}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {771--783}, doi = {10.1145/3611643.3616354}, year = {2023}, } Publisher's Version Info |
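A triadic census over a directed graph is a standard way to obtain the kind of per-triad-type features the paper describes. A minimal sketch using networkx, assuming the enhanced ASTs have already been flattened into edge lists (the toy edges and the absolute-difference feature are illustrative, not Tritor's exact feature construction):

```python
import networkx as nx

def triad_features(edges):
    """Census of the 16 directed-triad types, normalized into a feature vector."""
    g = nx.DiGraph(edges)
    census = nx.triadic_census(g)  # counts per triad type, keyed '003', '012', ...
    total = sum(census.values()) or 1
    return [census[k] / total for k in sorted(census)]

# Enhanced-AST edges for two methods (node names abbreviated); a classifier
# would be trained on per-type differences between such vectors.
m1 = triad_features([("if", "cond"), ("if", "then"), ("cond", "var")])
m2 = triad_features([("if", "cond"), ("if", "then"), ("then", "call")])
diff = [abs(a - b) for a, b in zip(m1, m2)]  # similarity features for the pair
print(diff)
```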
|
Feng, Yang |
ESEC/FSE '23: "Dynamic Data Fault Localization ..."
Dynamic Data Fault Localization for Deep Neural Networks
Yining Yin, Yang Feng, Shihao Weng, Zixi Liu, Yuan Yao, Yichi Zhang, Zhihong Zhao, and Zhenyu Chen (Nanjing University, China) Rich datasets have empowered various deep learning (DL) applications, leading to remarkable success in many fields. However, data faults hidden in the datasets could cause DL applications to behave unpredictably and even result in massive monetary losses and loss of life. To alleviate this problem, in this paper, we propose a dynamic data fault localization approach, namely DFauLo, to locate mislabeled and noisy data in deep learning datasets. DFauLo is inspired by conventional mutation-based code fault localization, but utilizes the differences between DNN mutants to amplify and identify potential data faults. Specifically, it first generates multiple DNN model mutants of the original trained model. Then it extracts features from these mutants and maps them into a suspiciousness score indicating the probability of the given data being a data fault. Moreover, DFauLo is the first dynamic data fault localization technique, prioritizing the suspected data based on user feedback and generalizing to data faults unseen during training. To validate DFauLo, we extensively evaluate it on 26 cases with various fault types, data types, and model structures. We also evaluate DFauLo on three widely-used benchmark datasets. The results show that DFauLo outperforms the state-of-the-art techniques in almost all cases and locates hundreds of different types of real data faults in benchmark datasets. @InProceedings{ESEC/FSE23p1345, author = {Yining Yin and Yang Feng and Shihao Weng and Zixi Liu and Yuan Yao and Yichi Zhang and Zhihong Zhao and Zhenyu Chen}, title = {Dynamic Data Fault Localization for Deep Neural Networks}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1345--1357}, doi = {10.1145/3611643.3616345}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional ESEC/FSE '23: "Benchmarking Robustness of ..." Benchmarking Robustness of AI-Enabled Multi-sensor Fusion Systems: Challenges and Opportunities Xinyu Gao, Zhijie Wang, Yang Feng, Lei Ma, Zhenyu Chen, and Baowen Xu (Nanjing University, China; University of Alberta, Canada; University of Tokyo, Japan) Multi-Sensor Fusion (MSF) based perception systems have been the foundation in supporting many industrial applications and domains, such as self-driving cars, robotic arms, and unmanned aerial vehicles. Over the past few years, the rapid progress in data-driven artificial intelligence (AI) has brought a fast-growing trend of empowering MSF systems with deep learning techniques to further improve performance, especially on intelligent systems and their perception systems. Although quite a few AI-enabled MSF perception systems and techniques have been proposed, few benchmarks focusing on MSF perception are publicly available to date. Given that many intelligent systems such as self-driving cars operate in safety-critical contexts where perception systems play an important role, there is an urgent need for a more in-depth understanding of the performance and reliability of these MSF systems. To bridge this gap, we initiate an early step in this direction and construct a public benchmark of AI-enabled MSF-based perception systems including three commonly adopted tasks (i.e., object detection, object tracking, and depth completion).
Based on this, to comprehensively understand MSF systems’ robustness and reliability, we design 14 common and realistic corruption patterns to synthesize large-scale corrupted datasets. We then perform a systematic, large-scale evaluation of these systems and identify the following key findings: (1) existing AI-enabled MSF systems are not robust enough against corrupted sensor signals; (2) small synchronization and calibration errors can lead to a crash of AI-enabled MSF systems; (3) existing AI-enabled MSF systems are usually tightly coupled, such that bugs/errors from an individual sensor could result in a system crash; (4) the robustness of MSF systems can be enhanced by improving fusion mechanisms. Our results reveal the vulnerability of current AI-enabled MSF perception systems, calling for researchers and practitioners to take robustness and reliability into account when designing AI-enabled MSF. Our benchmark, code, and detailed evaluation results are publicly available at https://sites.google.com/view/ai-msf-benchmark. @InProceedings{ESEC/FSE23p871, author = {Xinyu Gao and Zhijie Wang and Yang Feng and Lei Ma and Zhenyu Chen and Baowen Xu}, title = {Benchmarking Robustness of AI-Enabled Multi-sensor Fusion Systems: Challenges and Opportunities}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {871--882}, doi = {10.1145/3611643.3616278}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available |
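Two of the corruption families the benchmark describes, sensor noise and mis-calibration, are straightforward to emulate. A minimal sketch, assuming a camera image as a NumPy array and a 4x4 camera-LiDAR extrinsic matrix (the severity scaling and error bounds are illustrative, not the benchmark's exact parameterization):

```python
import numpy as np

def corrupt_camera(image, severity=1):
    """Additive Gaussian noise, one style of signal-level corruption pattern."""
    sigma = 0.04 * severity * 255
    noisy = image.astype(float) + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def corrupt_calibration(extrinsics, trans_err=0.05):
    """Perturb the camera-LiDAR extrinsic translation to emulate mis-calibration."""
    t = np.array(extrinsics, dtype=float)
    t[:3, 3] += np.random.uniform(-trans_err, trans_err, 3)  # meters of offset
    return t

image = np.zeros((64, 64, 3), dtype=np.uint8)   # stand-in camera frame
extrinsics = np.eye(4)                          # stand-in calibration matrix
noisy_image = corrupt_camera(image, severity=3)
shifted = corrupt_calibration(extrinsics)
```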
|
Feng, Zixuan |
ESEC/FSE '23-SRC: "The State of Survival in OSS: ..."
The State of Survival in OSS: The Impact of Diversity
Zixuan Feng (Oregon State University, USA) Maintaining and retaining contributors is crucial for Open Source (OSS) projects. However, there is often a high turnover among contributors (in some projects as high as 80%). The survivability of contributors is influenced by various factors, including their demographics. Research on contributors’ survivability must, therefore, consider diversity factors. This study longitudinally analyzed the impact of demographic attributes on survivability in the Flutter community through the lens of gender, region, and compensation. The preliminary analysis reveals that affiliated or Western contributors have a higher survival probability than volunteer or Non-Western contributors. However, no significant difference was found in the survival probability between men and women. @InProceedings{ESEC/FSE23p2213, author = {Zixuan Feng}, title = {The State of Survival in OSS: The Impact of Diversity}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2213--2215}, doi = {10.1145/3611643.3617848}, year = {2023}, } Publisher's Version |
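Survival probabilities of the kind reported here are typically estimated with Kaplan-Meier curves and compared with a log-rank test. A minimal sketch using the lifelines library on invented toy data (the durations, events, and affiliation flags below are illustrative, not the Flutter study's data):

```python
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Months active, whether the contributor left (event = 1), and an affiliation flag.
months = [3, 14, 7, 26, 5, 31, 9, 18]
left = [1, 0, 1, 0, 1, 0, 1, 1]
affiliated = [0, 1, 0, 1, 0, 1, 1, 0]

groups = {
    "affiliated": [(m, e) for m, e, a in zip(months, left, affiliated) if a],
    "volunteer": [(m, e) for m, e, a in zip(months, left, affiliated) if not a],
}
for name, rows in groups.items():
    kmf = KaplanMeierFitter()
    kmf.fit([m for m, _ in rows], [e for _, e in rows], label=name)
    print(name, kmf.median_survival_time_)  # median tenure per group

# Log-rank test for a significant difference between the two survival curves.
aff, vol = groups["affiliated"], groups["volunteer"]
res = logrank_test([m for m, _ in aff], [m for m, _ in vol],
                   event_observed_A=[e for _, e in aff],
                   event_observed_B=[e for _, e in vol])
print(res.p_value)
```

A p-value above the usual 0.05 threshold would correspond to the study's "no significant difference" finding for gender, while a low p-value would mirror the affiliated-versus-volunteer gap.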
|
Filieri, Antonio |
ESEC/FSE '23-INDUSTRY: "Compositional Taint Analysis ..."
Compositional Taint Analysis for Enforcing Security Policies at Scale
Subarno Banerjee, Siwei Cui, Michael Emmi, Antonio Filieri, Liana Hadarean, Peixuan Li, Linghui Luo, Goran Piskachev, Nicolás Rosner, Aritra Sengupta, Omer Tripp, and Jingbo Wang (Amazon Web Services, USA; Texas A&M University, USA; Amazon Web Services, Germany; University of Southern California, USA) Automated static dataflow analysis is an effective technique for detecting security-critical issues like sensitive data leaks and vulnerability to injection attacks. Ensuring high precision and recall requires an analysis that is context-, field-, and object-sensitive. However, it is challenging to attain high precision and recall while scaling to large industrial code bases. Compositional-style analyses, in which individual software components are analyzed separately, independent from their usage contexts, compute reusable summaries of components. This is an essential feature when deploying such analyses in CI/CD at code-review time or when scanning deployed container images. In both these settings the majority of software components stay the same between subsequent scans. However, it is not obvious how to extend such analyses to check the kind of contextual taint specifications that arise in practice while maintaining compositionality. In this work we present contextual dataflow modeling, an extension to compositional analysis that checks complex taint specifications and significantly increases recall and precision. Furthermore, we show how such a high-fidelity analysis can scale in production using three key optimizations: (i) discarding intermediate results for previously-analyzed components, an optimization exploiting the compositional nature of our analysis; (ii) a scope-reduction analysis that narrows the taint analysis w.r.t. the taint specifications being checked; and (iii) caching of analysis models. We show a 9.85% reduction in false positive rate on a comprehensive test suite comprising the OWASP open-source benchmarks as well as internal real-world code samples. We measure the performance and scalability impact of each individual optimization using open-source JVM packages from the Maven Central repository and internal AWS service codebases. This combination of high precision, recall, performance, and scalability has allowed us to enforce security policies at scale both internally within Amazon and for external customers. @InProceedings{ESEC/FSE23p1985, author = {Subarno Banerjee and Siwei Cui and Michael Emmi and Antonio Filieri and Liana Hadarean and Peixuan Li and Linghui Luo and Goran Piskachev and Nicolás Rosner and Aritra Sengupta and Omer Tripp and Jingbo Wang}, title = {Compositional Taint Analysis for Enforcing Security Policies at Scale}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1985--1996}, doi = {10.1145/3611643.3613889}, year = {2023}, } Publisher's Version ESEC/FSE '23: "Neural-Based Test Oracle Generation: ..." Neural-Based Test Oracle Generation: A Large-Scale Evaluation and Lessons Learned Soneya Binta Hossain, Antonio Filieri, Matthew B. Dwyer, Sebastian Elbaum, and Willem Visser (University of Virginia, USA; Amazon Web Services, USA) Defining test oracles is crucial and central to test development, but manual construction of oracles is expensive. While recent neural-based automated test oracle generation techniques have shown promise, their real-world effectiveness remains a compelling question requiring further exploration and understanding.
This paper investigates the effectiveness of TOGA, a recently developed neural-based method for automatic test oracle generation. TOGA utilizes EvoSuite-generated test inputs and generates both exception and assertion oracles. In a Defects4J study, TOGA outperformed specification-, search-, and neural-based techniques, detecting 57 bugs, including 30 unique bugs not detected by other methods. To gain a deeper understanding of its applicability in real-world settings, we conducted a series of external, extended, and conceptual replication studies of TOGA. In a large-scale study involving 25 real-world Java systems, 223.5K test cases, and 51K injected faults, we evaluate TOGA’s ability to improve fault-detection effectiveness relative to the state of the practice and the state of the art. We find that TOGA misclassifies the type of oracle needed 24% of the time, and that even when it classifies correctly, around 62% of the time it is not confident enough to generate any assertion oracle. When it does generate an assertion oracle, more than 47% of them are false positives, and the true positive assertions only increase fault detection by 0.3% relative to prior work. These findings expose limitations of the state-of-the-art neural-based oracle generation technique, provide valuable insights for improvement, and offer lessons for evaluating future automated oracle generation methods. @InProceedings{ESEC/FSE23p120, author = {Soneya Binta Hossain and Antonio Filieri and Matthew B. Dwyer and Sebastian Elbaum and Willem Visser}, title = {Neural-Based Test Oracle Generation: A Large-Scale Evaluation and Lessons Learned}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {120--132}, doi = {10.1145/3611643.3616265}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
|
First, Emily |
ESEC/FSE '23: "Baldur: Whole-Proof Generation ..."
Baldur: Whole-Proof Generation and Repair with Large Language Models
Emily First, Markus N. Rabe, Talia Ringer, and Yuriy Brun (University of Massachusetts, USA; Augment Computing, USA; University of Illinois at Urbana-Champaign, USA) Formally verifying software is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: we use large language models, trained on natural language and code and fine-tuned on proofs, to generate whole proofs at once. We then demonstrate that a model fine-tuned to repair generated proofs further increases proving power. This paper: (1) demonstrates that whole-proof generation using transformers is possible and is as effective as, but more efficient than, search-based techniques; (2) demonstrates that giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair that further improves automated proof generation; and (3) establishes, together with prior work, a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs, empirically showing the effectiveness of whole-proof generation, repair, and added context. We also show that Baldur complements the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification. @InProceedings{ESEC/FSE23p1229, author = {Emily First and Markus N. Rabe and Talia Ringer and Yuriy Brun}, title = {Baldur: Whole-Proof Generation and Repair with Large Language Models}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1229--1241}, doi = {10.1145/3611643.3616243}, year = {2023}, } Publisher's Version |
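The generate-then-repair loop is simple to express. In the sketch below, `llm` and `check_proof` are hypothetical callables standing in for the fine-tuned model and the Isabelle/HOL checker; the prompt format is invented for illustration and is not Baldur's actual encoding:

```python
def generate_then_repair(llm, check_proof, theorem, attempts=3):
    """Whole-proof generation with error-message-guided repair attempts."""
    prompt = theorem
    for _ in range(attempts):
        proof = llm(prompt)                        # generate the whole proof at once
        ok, error_msg = check_proof(theorem, proof)
        if ok:
            return proof
        # Repair: feed back the failed attempt plus the checker's error message,
        # the extra context that the paper shows improves proving power.
        prompt = (f"{theorem}\n(* failed proof *)\n{proof}"
                  f"\n(* error *)\n{error_msg}")
    return None  # give up after the attempt budget is exhausted
```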
|
Ford, Denae |
ESEC/FSE '23: "A Four-Year Study of Student ..."
A Four-Year Study of Student Contributions to OSS vs. OSS4SG with a Lightweight Intervention
Zihan Fang, Madeline Endres, Thomas Zimmermann, Denae Ford, Westley Weimer, Kevin Leach, and Yu Huang (Vanderbilt University, USA; University of Michigan, USA; Microsoft Research, USA) Modern software engineering practice and training increasingly rely on Open Source Software (OSS). The recent growth in demand for professional software engineers has led to increased contributions to, and usage of, OSS. However, there is limited understanding of the factors affecting how developers, and how new or student developers in particular, decide which OSS projects to contribute to, a process critical to OSS sustainability, access, adoption, and growth. To better understand OSS contributions from the developers of tomorrow, we conducted a four-year study with 1,361 students investigating the life cycle of their contributions (from project selection to pull request acceptance). During the study, we also delivered a lightweight intervention to promote the awareness of open source projects for social good (OSS4SG), OSS projects that have positive impacts in other domains. Using both quantitative and qualitative methods, we analyze student experience reports and the pull requests they submit. Compared to general OSS projects, we find significant differences in project selection (𝑝 < 0.0001, effect size = 0.84), student motivation (𝑝 < 0.01, effect size = 0.13), and increased pull-request acceptance rates for OSS4SG contributions. We also find that our intervention correlates with increased student contributions to OSS4SG (𝑝 < 0.0001, effect size = 0.38). Finally, we analyze correlations of factors such as gender or working with a partner. Our findings may help improve the experience for new developers participating in OSS4SG and the quality of their contributions. We also hope our work helps educators, project leaders, and contributors to build a mutually-beneficial framework for the future growth of OSS4SG. @InProceedings{ESEC/FSE23p3, author = {Zihan Fang and Madeline Endres and Thomas Zimmermann and Denae Ford and Westley Weimer and Kevin Leach and Yu Huang}, title = {A Four-Year Study of Student Contributions to OSS vs. OSS4SG with a Lightweight Intervention}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {3--15}, doi = {10.1145/3611643.3616250}, year = {2023}, } Publisher's Version |
|
Formica, Federico |
ESEC/FSE '23-INDUSTRY: "Test Case Generation for Drivability ..."
Test Case Generation for Drivability Requirements of an Automotive Cruise Controller: An Experience with an Industrial Simulator
Federico Formica, Nicholas Petrunti, Lucas Bruck, Vera Pantelic, Mark Lawford, and Claudio Menghi (McMaster University, Canada; University of Bergamo, Italy) Automotive software development requires engineers to test their systems to detect violations of both functional and drivability requirements. Functional requirements define the functionality of the automotive software. Drivability requirements refer to the driver's perception of the interactions with the vehicle; for example, they typically require limiting the acceleration and jerk perceived by the driver within given thresholds. While functional requirements are extensively considered by the research literature, drivability requirements garner less attention. This industrial paper describes our experience assessing the usefulness of an automated search-based software testing (SBST) framework in generating failure-revealing test cases for functional and drivability requirements. We report on our experience with the VI-CarRealTime simulator, an industrial virtual modeling and simulation environment widely used in the automotive domain. We designed a Cruise Control system in Simulink for a four-wheel vehicle, in an iterative fashion, by producing 21 model versions. We used the SBST framework for each version of the model to search for failure-revealing test cases revealing requirement violations. Our results show that the SBST framework successfully identified a failure-revealing test case for 66.7% of our model versions, requiring, on average, 245.9s and 3.8 iterations. We present lessons learned, reflect on the generality of our results, and discuss how our results improve the state of practice. @InProceedings{ESEC/FSE23p1949, author = {Federico Formica and Nicholas Petrunti and Lucas Bruck and Vera Pantelic and Mark Lawford and Claudio Menghi}, title = {Test Case Generation for Drivability Requirements of an Automotive Cruise Controller: An Experience with an Industrial Simulator}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1949--1960}, doi = {10.1145/3611643.3613894}, year = {2023}, } Publisher's Version |
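Search-based test generation of this kind typically minimizes a robustness function that measures how close a simulation trace comes to violating a requirement. A minimal random-search sketch, assuming a hypothetical `simulate` callable that runs the controller model and returns (acceleration, jerk) pairs; the thresholds and input encoding are illustrative, not the paper's setup:

```python
import random

def robustness(trace, accel_limit=2.5, jerk_limit=0.9):
    """Smallest margin to the drivability thresholds; negative means a violation."""
    margins = [min(accel_limit - abs(a), jerk_limit - abs(j)) for a, j in trace]
    return min(margins)

def search_failure(simulate, budget=50):
    """Random search over target-speed profiles, keeping the least-robust case."""
    best_case, best_rob = None, float("inf")
    for _ in range(budget):
        speeds = [random.uniform(0, 130) for _ in range(4)]  # candidate test input
        rob = robustness(simulate(speeds))
        if rob < best_rob:
            best_case, best_rob = speeds, rob
        if best_rob < 0:
            break  # requirement violated: a failure-revealing test case was found
    return best_case, best_rob
```

An SBST framework would replace the random sampling with a guided search (e.g., hill climbing), but the fitness structure is the same.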
|
Franz, Michael |
ESEC/FSE '23: "A Highly Scalable, Hybrid, ..."
A Highly Scalable, Hybrid, Cross-Platform Timing Analysis Framework Providing Accurate Differential Throughput Estimation via Instruction-Level Tracing
Min-Yih Hsu, Felicitas Hetzelt, David Gens, Michael Maitland, and Michael Franz (University of California at Irvine, USA; SiFive, USA) Differential throughput estimation, i.e., predicting the performance impact of software changes, is critical when developing applications that rely on accurate timing bounds, such as automotive, avionic, or industrial control systems. However, developers often lack access to the target hardware to perform on-device measurements, and hence rely on instruction throughput estimation tools to evaluate performance impacts. State-of-the-art techniques broadly fall into two categories: dynamic and static. Dynamic approaches emulate program execution using cycle-accurate microarchitectural simulators resulting in high precision at the cost of long turnaround times and convoluted setups. Static approaches reduce overhead by predicting cycle counts outside of a concrete runtime environment. However, they are limited by the lack of dynamic runtime information and mostly focus on predictions over single basic blocks which requires developers to manually construct critical instruction sequences. We present MCAD, a hybrid timing analysis framework that combines the advantages of dynamic and static approaches. Instead of relying on heavyweight cycle-accurate emulation, MCAD collects instruction traces along with dynamic runtime information from QEMU and streams them to a static throughput estimator. This allows developers to accurately estimate the performance impact of software changes for complete programs within minutes, reducing turnaround times by orders of magnitude compared to existing approaches with similar accuracy. Our evaluation shows that MCAD scales to real-world applications such as FFmpeg and Clang with millions of instructions, achieving < 3% geo. mean error compared to ground truth timings from hardware-performance counters on x86 and ARM machines. @InProceedings{ESEC/FSE23p821, author = {Min-Yih Hsu and Felicitas Hetzelt and David Gens and Michael Maitland and Michael Franz}, title = {A Highly Scalable, Hybrid, Cross-Platform Timing Analysis Framework Providing Accurate Differential Throughput Estimation via Instruction-Level Tracing}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {821--831}, doi = {10.1145/3611643.3616246}, year = {2023}, } Publisher's Version |
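The hybrid estimate itself reduces to combining dynamic execution counts with static per-block cycle costs. The toy sketch below illustrates only that arithmetic; in MCAD the counts come from QEMU's instruction-level traces and the per-block costs from the static throughput estimator, not from hand-written dictionaries:

```python
def throughput_estimate(block_counts, static_cycles):
    """Total cycles = dynamic execution count x static per-block cycle estimate."""
    return sum(n * static_cycles[block] for block, n in block_counts.items())

# Toy numbers for one hot loop plus its exit block, before and after a patch.
before = throughput_estimate({"loop": 1_000_000, "exit": 1},
                             {"loop": 4.0, "exit": 6.0})
after = throughput_estimate({"loop": 1_000_000, "exit": 1},
                            {"loop": 3.5, "exit": 6.0})  # cheaper patched loop body
print(f"differential: {before - after:+.0f} cycles")      # impact of the change
```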
|
Fronchetti, Felipe |
ESEC/FSE '23: "Do CONTRIBUTING Files Provide ..."
Do CONTRIBUTING Files Provide Information about OSS Newcomers’ Onboarding Barriers?
Felipe Fronchetti, David C. Shepherd, Igor Wiese, Christoph Treude, Marco Aurélio Gerosa, and Igor Steinmacher (Virginia Commonwealth University, USA; Louisiana State University, USA; Federal University of Technology Paraná, Brazil; University of Melbourne, Australia; Northern Arizona University, USA) Effectively onboarding newcomers is essential for the success of open source projects. These projects often provide onboarding guidelines in their ’CONTRIBUTING’ files (e.g., CONTRIBUTING.md on GitHub). These files explain, for example, how to find open tasks, implement solutions, and submit code for review. However, these files often do not follow a standard structure, can be too large, and fail to address barriers commonly faced by newcomers. In this paper, we propose an automated approach to parse these CONTRIBUTING files and assess how they address onboarding barriers. We manually classified a sample of files according to a model of onboarding barriers from the literature, trained a machine learning classifier that automatically predicts the categories of each paragraph (precision: 0.655, recall: 0.662), and surveyed developers to investigate their perspective on the predictions’ adequacy (75% of the predictions were considered adequate). We found that CONTRIBUTING files typically do not cover the barriers newcomers face (52% of the analyzed projects missed at least 3 out of the 6 barriers faced by newcomers; 84% missed at least 2). Our analysis also revealed that information about choosing a task and talking with the community, two of the most recurrent barriers newcomers face, is neglected in more than 75% of the projects. We made our classifier available as an online service that analyzes the content of a given CONTRIBUTING file. Our approach may help community builders identify missing information in the project ecosystems they maintain, and it may help newcomers understand what to expect in CONTRIBUTING files. @InProceedings{ESEC/FSE23p16, author = {Felipe Fronchetti and David C. Shepherd and Igor Wiese and Christoph Treude and Marco Aurélio Gerosa and Igor Steinmacher}, title = {Do CONTRIBUTING Files Provide Information about OSS Newcomers’ Onboarding Barriers?}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {16--28}, doi = {10.1145/3611643.3616288}, year = {2023}, } Publisher's Version Info |
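Paragraph-level classification against a fixed set of barrier categories is a standard text-classification setup. A minimal sketch with scikit-learn, using invented paragraphs and labels purely for illustration (the paper's model, features, and category set differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus: paragraphs labeled with onboarding-barrier categories.
paragraphs = [
    "Look for issues tagged good-first-issue to get started.",
    "Run the test suite before opening a pull request.",
    "Join our chat server to ask the maintainers questions.",
    "Fork the repository and create a feature branch.",
]
labels = ["finding a task", "submitting changes",
          "community interaction", "submitting changes"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(paragraphs, labels)

# Predict the barrier category of an unseen CONTRIBUTING paragraph.
print(clf.predict(["Where can newcomers find beginner friendly issues?"]))
```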
|
Fu, Qiuai |
ESEC/FSE '23: "Hue: A User-Adaptive Parser ..."
Hue: A User-Adaptive Parser for Hybrid Logs
Junjielong Xu, Qiuai Fu, Zhouruixing Zhu, Yutong Cheng, Zhijing Li, Yuchi Ma, and Pinjia He (Chinese University of Hong Kong, Shenzhen, China; Huawei Cloud Computing Technologies, China) Log parsing, which extracts log templates from semi-structured logs and produces structured logs, is the first and the most critical step in automated log analysis. While existing log parsers have achieved decent results, they suffer from two major limitations by design. First, they do not natively support hybrid logs that consist of both single-line logs and multi-line logs (e.g., Java exceptions and Hadoop counters). Second, they fall short in integrating domain knowledge in parsing, making it hard to identify ambiguous tokens in logs. This paper defines a new research problem, hybrid log parsing, as a superset of traditional log parsing tasks, and proposes Hue, the first attempt at hybrid log parsing in a user-adaptive manner. Specifically, Hue converts each log message to a sequence of special wildcards using a key casting table and determines the log types via line aggregation and pattern extraction. In addition, Hue can effectively utilize user feedback via a novel merge-reject strategy, making it possible to quickly adapt to complex and changing log templates. We evaluated Hue on three hybrid log datasets and sixteen widely-used single-line log datasets (Loghub). The results show that Hue achieves an average grouping accuracy of 0.845 on hybrid logs, which largely outperforms the best results (0.563 on average) obtained by existing parsers. Hue also exhibits state-of-the-art performance on single-line log datasets. @InProceedings{ESEC/FSE23p413, author = {Junjielong Xu and Qiuai Fu and Zhouruixing Zhu and Yutong Cheng and Zhijing Li and Yuchi Ma and Pinjia He}, title = {Hue: A User-Adaptive Parser for Hybrid Logs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {413--424}, doi = {10.1145/3611643.3616260}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Functional |
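The key-casting step can be sketched directly: cast volatile tokens to wildcards, treat exception-style continuation lines as part of the preceding event, and group messages by their wildcard signature. The casting table below is a toy stand-in for Hue's real one:

```python
import re
from collections import defaultdict

# Toy key casting table: token patterns cast to special wildcards before grouping.
CASTS = [
    (re.compile(r"^\d+$"), "<NUM>"),
    (re.compile(r"^0x[0-9a-f]+$", re.I), "<HEX>"),
    (re.compile(r"^[\w.-]+@[\w.-]+$"), "<EMAIL>"),
]
STACK = re.compile(r"^\s+at\s")  # continuation line of a Java exception

def cast(line):
    """Map a log line to its wildcard signature."""
    if STACK.match(line):
        return ("<STACK>",)  # the whole line folds into one wildcard
    return tuple(next((w for p, w in CASTS if p.match(tok)), tok)
                 for tok in line.split())

def group(lines):
    """Aggregate lines (and exception continuations) by wildcard signature."""
    groups, last_sig = defaultdict(list), None
    for line in lines:
        sig = cast(line)
        if sig == ("<STACK>",) and last_sig is not None:
            groups[last_sig][-1] += "\n" + line  # fold into the previous event
            continue
        groups[sig].append(line)
        last_sig = sig
    return groups

logs = ["connect retry 3", "connect retry 7", "    at com.example.Foo.run"]
print(list(group(logs)))  # one template groups both retries plus the stack line
```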
|
Gall, Harald C. |
ESEC/FSE '23-IVR: "Towards Top-Down Automated ..."
Towards Top-Down Automated Development in Limited Scopes: A Neuro-Symbolic Framework from Expressibles to Executables
Jian Gu and Harald C. Gall (Monash University, Australia; University of Zurich, Switzerland) Deep code generation is a topic of deep learning for software engineering (DL4SE), which adopts neural models to generate code for the intended functions. Since end-to-end neural methods lack domain knowledge and software hierarchy awareness, they tend to perform poorly w.r.t. project-level tasks. To systematically explore the potential improvements of code generation, we let it participate in the whole top-down development from expressibles to executables, which is possible in limited scopes. In the process, it benefits from massive samples, features, and knowledge. As the foundation, we suggest building a taxonomy on code data, namely code taxonomy, leveraging the categorization of code information. Moreover, we introduce a three-layer semantic pyramid (SP) to associate text data and code data. It identifies information at different abstraction levels, thus introducing domain knowledge about development and revealing the hierarchy of software. Furthermore, we propose a semantic pyramid framework (SPF) as the approach, focusing on software of high modularity and low complexity. SPF divides the code generation process into stages and reserves spots for potential interactions. In addition, we conceived preliminary applications in software development to validate the neuro-symbolic framework. @InProceedings{ESEC/FSE23p2072, author = {Jian Gu and Harald C. Gall}, title = {Towards Top-Down Automated Development in Limited Scopes: A Neuro-Symbolic Framework from Expressibles to Executables}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2072--2076}, doi = {10.1145/3611643.3613076}, year = {2023}, } Publisher's Version |
|
Ganatra, Vaibhav |
ESEC/FSE '23-INDUSTRY: "Detection Is Better Than Cure: ..."
Detection Is Better Than Cure: A Cloud Incidents Perspective
Vaibhav Ganatra, Anjaly Parayil, Supriyo Ghosh, Yu Kang, Minghua Ma, Chetan Bansal, Suman Nath, and Jonathan Mace (Microsoft, India; Microsoft, China; Microsoft, USA) Cloud providers use automated watchdogs or monitors to continuously observe service availability and to proactively report incidents when system performance degrades. Improper monitoring can lead to delays in the detection and mitigation of production incidents, which can be extremely expensive in terms of customer impacts and manual toil from engineering resources. Therefore, a systematic understanding of the pitfalls in current monitoring practices and how they can lead to production incidents is crucial for ensuring continuous reliability of cloud services. In this work, we carefully study the production incidents from the past year at Microsoft to understand the monitoring gaps in a hyperscale cloud platform. We conduct an extensive empirical study to answer: (1) What are the major causes of failures in early detection of production incidents and what are the steps taken for mitigation, (2) What is the impact of failures in early detection, (3) How do we recommend best monitoring practices for different services, and (4) How can we leverage the insights from this study to enhance the reliability of cloud services. This study provides a deeper understanding of existing monitoring gaps in cloud platforms, uncovers interesting insights, and provides guidance on best monitoring practices for ensuring continuous reliability. @InProceedings{ESEC/FSE23p1891, author = {Vaibhav Ganatra and Anjaly Parayil and Supriyo Ghosh and Yu Kang and Minghua Ma and Chetan Bansal and Suman Nath and Jonathan Mace}, title = {Detection Is Better Than Cure: A Cloud Incidents Perspective}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1891--1902}, doi = {10.1145/3611643.3613898}, year = {2023}, } Publisher's Version |
|
Ganji, Mohammad |
ESEC/FSE '23: "Code Coverage Criteria for ..."
Code Coverage Criteria for Asynchronous Programs
Mohammad Ganji, Saba Alimadadi, and Frank Tip (Simon Fraser University, Canada; Northeastern University, USA) Asynchronous software often exhibits complex and error-prone behaviors that should be tested thoroughly. Code coverage has been the most popular metric to assess test suite quality. However, traditional code coverage criteria do not adequately reflect completion, interactions, and error handling of asynchronous operations. This paper proposes novel test adequacy criteria for measuring: (i) completion of asynchronous operations in terms of both successful and exceptional execution, (ii) registration of reactions for handling both possible outcomes, and (iii) execution of said reactions through tests. We implement JScope, a tool for automatically measuring coverage according to these criteria in JavaScript applications, as an interactive plug-in for Visual Studio Code. An evaluation of JScope on 20 JavaScript applications shows that the proposed criteria can help improve assessment of test adequacy, complementing traditional criteria. According to our investigation of 15 real GitHub issues concerned with asynchrony, the new criteria can help reveal faulty asynchronous behaviors that are untested yet are deemed covered by traditional coverage criteria. We also report on a controlled experiment with 12 participants to investigate the usefulness of JScope in realistic settings, demonstrating its effectiveness in improving programmers’ ability to assess test adequacy and detect untested behavior of asynchronous code. @InProceedings{ESEC/FSE23p1307, author = {Mohammad Ganji and Saba Alimadadi and Frank Tip}, title = {Code Coverage Criteria for Asynchronous Programs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1307--1319}, doi = {10.1145/3611643.3616292}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
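The proposed criteria can be mimicked in any async runtime by recording, per operation, whether tests observed a successful completion, an exceptional completion, and an executed failure handler. A minimal Python/asyncio analogue of the JavaScript setting (the ledger structure and names are illustrative, not JScope's implementation):

```python
import asyncio
from collections import defaultdict

# Coverage ledger: did tests observe success, failure, and a handled failure?
ledger = defaultdict(lambda: {"resolved": False, "rejected": False,
                              "rejection_handled": False})

async def observe(name, coro):
    """Wrap an awaitable and record which completion outcomes tests exercised."""
    try:
        result = await coro
        ledger[name]["resolved"] = True
        return result
    except Exception:
        ledger[name]["rejected"] = True
        raise

async def fetch(fail=False):
    if fail:
        raise IOError("network down")
    return "ok"

async def tests():
    await observe("fetch", fetch())                  # covers the success path
    try:
        await observe("fetch", fetch(fail=True))     # covers the exceptional path
    except IOError:
        ledger["fetch"]["rejection_handled"] = True  # the reaction executed

asyncio.run(tests())
print(dict(ledger))  # any remaining False marks untested asynchronous behavior
```

A test suite that only awaited the success path would be fully covered under statement coverage yet flagged here, which is exactly the gap the new criteria target.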
|
Gao, Cuiyun |
ESEC/FSE '23-INDUSTRY: "A Unified Framework for Mini-game ..."
A Unified Framework for Mini-game Testing: Experience on WeChat
Chaozheng Wang, Haochuan Lu, Cuiyun Gao, Zongjie Li, Ting Xiong, and Yuetang Deng (Chinese University of Hong Kong, China; Tencent, China; Hong Kong University of Science and Technology, China) Mobile games play an increasingly important role in our daily life. The quality of mobile games can substantially affect the user experience and game revenue. Different from traditional mobile games, the mini-games provided by our partner, Tencent, are embedded in the mobile app WeChat, so users do not need to install specific game apps and can directly play the games in the app. Due to the convenient installation, WeChat has attracted large numbers of developers to design and publish on the mini-game platform in the app. Until now, the platform has more than one hundred thousand published mini-games. Manually testing all the mini-games requires enormous effort and is impractical. There exist automated game testing methods; however, they are difficult to apply to mini-game testing for the following reasons: 1) Effective game testing heavily relies on prior knowledge about game operations and extraction of GUI widget trees. However, this knowledge is specific and not always applicable when testing a large number of mini-games with complex game engines (e.g., Unity). 2) The highly diverse GUI widget design of mini-games deviates significantly from that of mobile apps. This issue prevents the existing image-based GUI widget detection techniques from effectively detecting widgets in mini-games. To address the aforementioned issues, we propose a unified framework for black-box mini-game testing named iExplorer. iExplorer involves a mixed GUI widget detection approach incorporating both deep learning-based object detection and edge aggregation-based segmentation for detecting GUI widgets in mini-games. A category-aware testing strategy is then proposed for testing mini-games, with different categories of widgets (e.g., sliding and clicking widgets) considered. iExplorer has been deployed for more than six months. In the past 30 days, iExplorer has tested 76,000 mini-games and successfully found 22,144 real bugs. @InProceedings{ESEC/FSE23p1623, author = {Chaozheng Wang and Haochuan Lu and Cuiyun Gao and Zongjie Li and Ting Xiong and Yuetang Deng}, title = {A Unified Framework for Mini-game Testing: Experience on WeChat}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1623--1634}, doi = {10.1145/3611643.3613868}, year = {2023}, } Publisher's Version ESEC/FSE '23: "How Practitioners Expect Code ..." How Practitioners Expect Code Completion? Chaozheng Wang, Junhao Hu, Cuiyun Gao, Yu Jin, Tao Xie, Hailiang Huang, Zhenyu Lei, and Yuetang Deng (Chinese University of Hong Kong, China; Peking University, China; Tencent, USA; Tencent, China) Code completion has become a common practice for programmers during their daily programming activities. It automatically predicts the next tokens or statements that the programmers may use. Code completion aims to substantially save keystrokes and improve the programming efficiency for programmers. Although there exists substantial research on code completion, it is still unclear what practitioner expectations are on code completion and whether these expectations are met by the existing research. To address these questions, we perform a study by first interviewing 15 professionals and then surveying 599 practitioners from 18 IT companies about their expectations on code completion.
We then compare the practitioner expectations with the existing research by conducting a literature review of papers on code completion published in major publication venues from 2012 to 2022. Based on the comparison, we highlight directions in which researchers should invest effort to develop code completion techniques that meet practitioner expectations. @InProceedings{ESEC/FSE23p1294, author = {Chaozheng Wang and Junhao Hu and Cuiyun Gao and Yu Jin and Tao Xie and Hailiang Huang and Zhenyu Lei and Yuetang Deng}, title = {How Practitioners Expect Code Completion?}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1294--1306}, doi = {10.1145/3611643.3616280}, year = {2023}, } Publisher's Version |
|
Gao, Haoyu |
ESEC/FSE '23: "Evaluating Transfer Learning ..."
Evaluating Transfer Learning for Simplifying GitHub READMEs
Haoyu Gao, Christoph Treude, and Mansooreh Zahedi (University of Melbourne, Australia) Software documentation captures detailed knowledge about a software product, e.g., code, technologies, and design. It plays an important role in the coordination of development teams and in conveying ideas to various stakeholders. However, software documentation can be hard to comprehend if it is written with jargon and complicated sentence structure. In this study, we explored the potential of text simplification techniques in the domain of software engineering to automatically simplify GitHub README files. We collected software-related pairs of GitHub README files consisting of 14,588 entries, aligned difficult sentences with their simplified counterparts, and trained a Transformer-based model to automatically simplify difficult versions. To mitigate the sparse and noisy nature of the software-related simplification dataset, we applied general text simplification knowledge to this field. Since many general-domain difficult-to-simple Wikipedia document pairs are already publicly available, we explored the potential of transfer learning by first training the model on the Wikipedia data and then fine-tuning it on the README data. Using automated BLEU scores and human evaluation, we compared the performance of different transfer learning schemes and the baseline models without transfer learning. The transfer learning model using the best checkpoint trained on a general topic corpus achieved the best performance, with a BLEU score of 34.68 and statistically significantly higher human annotation scores than the rest of the schemes and baselines. We conclude that transfer learning is a promising direction for circumventing the data scarcity and style drift problems in software README simplification, achieving a better trade-off between simplification and preservation of meaning. @InProceedings{ESEC/FSE23p1548, author = {Haoyu Gao and Christoph Treude and Mansooreh Zahedi}, title = {Evaluating Transfer Learning for Simplifying GitHub READMEs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1548--1560}, doi = {10.1145/3611643.3616291}, year = {2023}, } Publisher's Version |
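The two-stage transfer learning scheme, pre-training on general-domain simplification pairs and then fine-tuning on README pairs, can be sketched with Hugging Face transformers. Here t5-small and the two toy sentence pairs are placeholders; the paper's model, data, and training setup differ:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
opt = torch.optim.AdamW(model.parameters(), lr=3e-5)

def fine_tune(pairs, epochs=1):
    """Teacher-forced fine-tuning on (difficult, simple) sentence pairs."""
    model.train()
    for _ in range(epochs):
        for difficult, simple in pairs:
            batch = tok(difficult, return_tensors="pt", truncation=True)
            labels = tok(simple, return_tensors="pt", truncation=True).input_ids
            loss = model(**batch, labels=labels).loss  # seq2seq cross-entropy
            loss.backward()
            opt.step()
            opt.zero_grad()

# Stage 1: general-domain knowledge from aligned Wikipedia-style pairs.
fine_tune([("The feline reposed upon the rug.", "The cat sat on the rug.")])
# Stage 2: domain adaptation on aligned difficult/simple README sentences.
fine_tune([("Instantiate the client prior to invocation of the API.",
            "Create the client before calling the API.")])
```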
|
Gao, Xin |
ESEC/FSE '23-INDUSTRY: "TraceDiag: Adaptive, Interpretable, ..."
TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems
Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Xiaomin Wu, Meng Zhang, Qingjun Chen, Xin Gao, Xuedong Gao, Hao Fan, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang (Microsoft, China; Microsoft 365, China; Microsoft 365, USA) Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. However, performing RCA on modern microservice systems can be challenging due to their large scale, as they usually comprise hundreds of components, leading to significant human effort. This paper proposes TraceDiag, an end-to-end RCA framework that addresses these challenges for large-scale microservice systems. It leverages reinforcement learning to learn a pruning policy for the service dependency graph that automatically eliminates redundant components, thereby significantly improving RCA efficiency. The learned pruning policy is interpretable and fully adaptive to new RCA instances. With the pruned graph, a causal-based method can be executed with high accuracy and efficiency. The proposed TraceDiag framework is evaluated on real data traces collected from the Microsoft Exchange system, and demonstrates superior performance compared to state-of-the-art RCA approaches. Notably, TraceDiag has been integrated as a critical component in Microsoft M365 Exchange, resulting in a significant improvement in the system's reliability and a considerable reduction in the human effort required for RCA. @InProceedings{ESEC/FSE23p1762, author = {Ruomeng Ding and Chaoyun Zhang and Lu Wang and Yong Xu and Minghua Ma and Xiaomin Wu and Meng Zhang and Qingjun Chen and Xin Gao and Xuedong Gao and Hao Fan and Saravan Rajmohan and Qingwei Lin and Dongmei Zhang}, title = {TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1762--1773}, doi = {10.1145/3611643.3613864}, year = {2023}, } Publisher's Version |
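The pruning-then-ranking structure can be sketched on a toy dependency graph. Below, a fixed score threshold stands in for the learned reinforcement-learning policy, and ranking leaf services by anomaly score stands in for the causal analysis; both substitutions are simplifications for illustration:

```python
import networkx as nx

def prune(graph, scores, keep=0.3):
    """Drop low-signal components; TraceDiag learns this policy instead of
    using the fixed top-fraction cutoff applied here."""
    cutoff = sorted(scores.values(), reverse=True)[int(len(scores) * keep)]
    keep_nodes = {n for n in graph if scores[n] >= cutoff}
    return graph.subgraph(keep_nodes)

# Edges point from a service to the services it depends on.
g = nx.DiGraph([("frontend", "auth"), ("frontend", "cart"), ("cart", "db")])
anomaly = {"frontend": 0.2, "auth": 0.1, "cart": 0.7, "db": 0.9}

pruned = prune(g, anomaly)
# Candidate root causes: anomalous services with no remaining dependencies.
roots = [n for n in pruned if pruned.out_degree(n) == 0]
print(sorted(roots, key=anomaly.get, reverse=True))  # -> ['db']
```

Pruning first keeps the downstream (causal) analysis tractable on graphs with hundreds of components, which is the efficiency gain the paper reports.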
|
Gao, Xinyu |
ESEC/FSE '23: "Benchmarking Robustness of ..."
Benchmarking Robustness of AI-Enabled Multi-sensor Fusion Systems: Challenges and Opportunities
Xinyu Gao, Zhijie Wang, Yang Feng, Lei Ma, Zhenyu Chen, and Baowen Xu (Nanjing University, China; University of Alberta, Canada; University of Tokyo, Japan) Multi-Sensor Fusion (MSF) based perception systems have been the foundation in supporting many industrial applications and domains, such as self-driving cars, robotic arms, and unmanned aerial vehicles. Over the past few years, the rapid progress in data-driven artificial intelligence (AI) has brought a fast-growing trend of empowering MSF systems with deep learning techniques to further improve performance, especially on intelligent systems and their perception systems. Although quite a few AI-enabled MSF perception systems and techniques have been proposed, few benchmarks focusing on MSF perception are publicly available to date. Given that many intelligent systems such as self-driving cars operate in safety-critical contexts where perception systems play an important role, there is an urgent need for a more in-depth understanding of the performance and reliability of these MSF systems. To bridge this gap, we initiate an early step in this direction and construct a public benchmark of AI-enabled MSF-based perception systems including three commonly adopted tasks (i.e., object detection, object tracking, and depth completion). Based on this, to comprehensively understand MSF systems’ robustness and reliability, we design 14 common and realistic corruption patterns to synthesize large-scale corrupted datasets. We then perform a systematic, large-scale evaluation of these systems and identify the following key findings: (1) existing AI-enabled MSF systems are not robust enough against corrupted sensor signals; (2) small synchronization and calibration errors can lead to a crash of AI-enabled MSF systems; (3) existing AI-enabled MSF systems are usually tightly coupled, such that bugs/errors from an individual sensor could result in a system crash; (4) the robustness of MSF systems can be enhanced by improving fusion mechanisms. Our results reveal the vulnerability of current AI-enabled MSF perception systems, calling for researchers and practitioners to take robustness and reliability into account when designing AI-enabled MSF. Our benchmark, code, and detailed evaluation results are publicly available at https://sites.google.com/view/ai-msf-benchmark. @InProceedings{ESEC/FSE23p871, author = {Xinyu Gao and Zhijie Wang and Yang Feng and Lei Ma and Zhenyu Chen and Baowen Xu}, title = {Benchmarking Robustness of AI-Enabled Multi-sensor Fusion Systems: Challenges and Opportunities}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {871--882}, doi = {10.1145/3611643.3616278}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available |
|
Gao, Xuedong |
ESEC/FSE '23-INDUSTRY: "TraceDiag: Adaptive, Interpretable, ..."
TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems
Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Xiaomin Wu, Meng Zhang, Qingjun Chen, Xin Gao, Xuedong Gao, Hao Fan, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang (Microsoft, China; Microsoft 365, China; Microsoft 365, USA) Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. However, performing RCA on modern microservice systems can be challenging due to their large scale, as they usually comprise hundreds of components, leading to significant human effort. This paper proposes TraceDiag, an end-to-end RCA framework that addresses these challenges for large-scale microservice systems. It leverages reinforcement learning to learn a pruning policy for the service dependency graph that automatically eliminates redundant components, thereby significantly improving RCA efficiency. The learned pruning policy is interpretable and fully adaptive to new RCA instances. With the pruned graph, a causal-based method can be executed with high accuracy and efficiency. The proposed TraceDiag framework is evaluated on real data traces collected from the Microsoft Exchange system, and demonstrates superior performance compared to state-of-the-art RCA approaches. Notably, TraceDiag has been integrated as a critical component in Microsoft M365 Exchange, resulting in a significant improvement in the system's reliability and a considerable reduction in the human effort required for RCA. @InProceedings{ESEC/FSE23p1762, author = {Ruomeng Ding and Chaoyun Zhang and Lu Wang and Yong Xu and Minghua Ma and Xiaomin Wu and Meng Zhang and Qingjun Chen and Xin Gao and Xuedong Gao and Hao Fan and Saravan Rajmohan and Qingwei Lin and Dongmei Zhang}, title = {TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1762--1773}, doi = {10.1145/3611643.3613864}, year = {2023}, } Publisher's Version |
|
Gao, Yongqiang |
ESEC/FSE '23-INDUSTRY: "AG3: Automated Game GUI Text ..."
AG3: Automated Game GUI Text Glitch Detection Based on Computer Vision
Xiaoyun Liang, Jiayi Qi, Yongqiang Gao, Chao Peng, and Ping Yang (ByteDance, China) With the advancement of device software and hardware performance, and the evolution of game engines, an increasing number of emerging high-quality games are captivating game players from all around the world who speak different languages. However, due to the vast fragmentation of the device and platform market, a well-tested game may still experience text glitches when installed on a new device with an unseen screen resolution and system version, which can significantly impact the user experience. In our testing pipeline, current testing techniques for identifying multilingual text glitches are laborious and inefficient. In this paper, we present AG3, which offers intelligent game traversal, precise visual text glitch detection, and integrated quality report generation capabilities. Our empirical evaluation and internal industrial deployment demonstrate that AG3 can detect various real-world multilingual text glitches with minimal human involvement. @InProceedings{ESEC/FSE23p1879, author = {Xiaoyun Liang and Jiayi Qi and Yongqiang Gao and Chao Peng and Ping Yang}, title = {AG3: Automated Game GUI Text Glitch Detection Based on Computer Vision}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1879--1890}, doi = {10.1145/3611643.3613867}, year = {2023}, } Publisher's Version |
|
Garg, Shaddy |
ESEC/FSE '23: "Outage-Watch: Early Prediction ..."
Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer
Shubham Agarwal, Sarthak Chakraborty, Shaddy Garg, Sumit Bisht, Chahat Jain, Ashritha Gonuguntla, and Shiv Saini (Adobe Research, India; University of Illinois at Urbana-Champaign, USA; Adobe, India; Amazon, India; Traceable.ai, India; Cisco, India) Cloud services are omnipresent, and critical cloud service failures are a fact of life. In order to retain customers and prevent revenue loss, it is important to provide high reliability guarantees for these services. One way to do this is by predicting outages in advance, which can help in reducing the severity as well as the time to recovery. It is difficult to forecast critical failures due to the rarity of these events. Moreover, critical failures are ill-defined in terms of observable data. Our proposed method, Outage-Watch, defines critical service outages as deteriorations in the Quality of Service (QoS) captured by a set of metrics. Outage-Watch detects such outages in advance by using the current system state to predict whether the QoS metrics will cross a threshold and initiate an extreme event. A mixture of Gaussians is used to model the distribution of the QoS metrics for flexibility, and an extreme event regularizer helps improve learning in the tail of the distribution. An outage is predicted if the probability of any one of the QoS metrics crossing its threshold changes significantly. Our evaluation on a real-world SaaS company dataset shows that Outage-Watch significantly outperforms traditional methods with an average AUC of 0.98. Additionally, Outage-Watch detects all the outages exhibiting a change in service metrics and reduces the Mean Time To Detection (MTTD) of outages by up to 88% when deployed in an enterprise cloud-service system, demonstrating the efficacy of our proposed method. @InProceedings{ESEC/FSE23p682, author = {Shubham Agarwal and Sarthak Chakraborty and Shaddy Garg and Sumit Bisht and Chahat Jain and Ashritha Gonuguntla and Shiv Saini}, title = {Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {682--694}, doi = {10.1145/3611643.3616316}, year = {2023}, } Publisher's Version |
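The tail-probability computation implied by the mixture model is a useful worked example: under a mixture of Gaussians, P(metric > threshold) is the weight-averaged Gaussian tail. A minimal sketch with scikit-learn and SciPy on synthetic latency data (Outage-Watch's actual model is trained with the extreme event regularizer, which this sketch omits):

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

# Synthetic QoS history: a normal regime plus a heavy-tailed degraded regime.
history = np.concatenate([np.random.normal(100, 8, 900),     # latency in ms
                          np.random.normal(180, 25, 100)])
gmm = GaussianMixture(n_components=2, random_state=0).fit(history.reshape(-1, 1))

def p_exceeds(threshold):
    """P(metric > threshold) under the mixture: weighted sum of Gaussian tails."""
    w = gmm.weights_
    mu = gmm.means_.ravel()
    sd = np.sqrt(gmm.covariances_.ravel())
    return float(np.sum(w * norm.sf(threshold, loc=mu, scale=sd)))

# A significant rise in this tail probability would flag a potential outage.
print(p_exceeds(250))
```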
|
Gauthier, François |
ESEC/FSE '23: "Crystallizer: A Hybrid Path ..."
Crystallizer: A Hybrid Path Analysis Framework to Aid in Uncovering Deserialization Vulnerabilities
Prashast Srivastava, Flavio Toffalini, Kostyantyn Vorobyov, François Gauthier, Antonio Bianchi, and Mathias Payer (Purdue University, USA; EPFL, Switzerland; Oracle Labs, Australia) Applications use serialization and deserialization to exchange data. Serialization allows developers to exchange messages or perform remote method invocation in distributed applications. However, the application logic itself is responsible for security. Adversaries may abuse bugs in the deserialization logic to forcibly invoke attacker-controlled methods by crafting malicious bytestreams (payloads). We present Crystallizer, a novel hybrid framework that automatically uncovers deserialization vulnerabilities by combining static and dynamic analyses. Our intuition is to first over-approximate possible payloads through static analysis (to constrain the search space). Then, we use dynamic analysis to instantiate concrete payloads as a proof-of-concept of a vulnerability (giving the analyst concrete examples of possible attacks). Our proof-of-concept focuses on Java deserialization as a prominent domain for such attacks. We evaluate our prototype on seven popular Java libraries against state-of-the-art frameworks for uncovering gadget chains. In contrast to existing tools, we uncovered 41 previously unknown exploitable chains. Furthermore, we show the real-world security impact of Crystallizer by using it to synthesize gadget chains to mount RCE and DoS attacks on three popular Java applications. We have responsibly disclosed all newly discovered vulnerabilities. @InProceedings{ESEC/FSE23p1586, author = {Prashast Srivastava and Flavio Toffalini and Kostyantyn Vorobyov and François Gauthier and Antonio Bianchi and Mathias Payer}, title = {Crystallizer: A Hybrid Path Analysis Framework to Aid in Uncovering Deserialization Vulnerabilities}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1586--1597}, doi = {10.1145/3611643.3616313}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Geng, Mingyang |
ESEC/FSE '23: "Natural Language to Code: ..."
Natural Language to Code: How Far Are We?
Shangwen Wang, Mingyang Geng, Bo Lin, Zhensu Sun, Ming Wen, Yepang Liu, Li Li, Tegawendé F. Bissyandé, and Xiaoguang Mao (National University of Defense Technology, China; Singapore Management University, Singapore; Huazhong University of Science and Technology, China; Southern University of Science and Technology, China; Beihang University, China; University of Luxembourg, Luxembourg) A longstanding dream in software engineering research is to devise effective approaches for automating development tasks based on developers' informally-specified intentions. Such intentions are generally in the form of natural language descriptions. In recent literature, a number of approaches have been proposed to automate tasks such as code search and even code generation based on natural language inputs. While these approaches vary in terms of technical designs, their objective is the same: transforming a developer's intention into source code. The literature, however, lacks a comprehensive understanding of the effectiveness of existing techniques as well as their complementarity to each other. We propose to fill this gap through a large-scale empirical study where we systematically evaluate natural language to code techniques. Specifically, we consider six state-of-the-art techniques targeting code search, and four targeting code generation. Through extensive evaluations on a dataset of 22K+ natural language queries, our study reveals the following major findings: (1) code search techniques based on model pre-training are so far the most effective, while code generation techniques can also provide promising results; (2) complementarity widely exists among the existing techniques; and (3) combining the ten techniques together can enhance performance by 35% compared with the most effective standalone technique. Finally, we propose a post-processing strategy to automatically integrate different techniques based on their generated code. Experimental results show that our devised strategy is both effective and extensible. @InProceedings{ESEC/FSE23p375, author = {Shangwen Wang and Mingyang Geng and Bo Lin and Zhensu Sun and Ming Wen and Yepang Liu and Li Li and Tegawendé F. Bissyandé and Xiaoguang Mao}, title = {Natural Language to Code: How Far Are We?}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {375--387}, doi = {10.1145/3611643.3616323}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
|
Gens, David |
ESEC/FSE '23: "A Highly Scalable, Hybrid, ..."
A Highly Scalable, Hybrid, Cross-Platform Timing Analysis Framework Providing Accurate Differential Throughput Estimation via Instruction-Level Tracing
Min-Yih Hsu, Felicitas Hetzelt, David Gens, Michael Maitland, and Michael Franz (University of California at Irvine, USA; SiFive, USA) Differential throughput estimation, i.e., predicting the performance impact of software changes, is critical when developing applications that rely on accurate timing bounds, such as automotive, avionic, or industrial control systems. However, developers often lack access to the target hardware to perform on-device measurements, and hence rely on instruction throughput estimation tools to evaluate performance impacts. State-of-the-art techniques broadly fall into two categories: dynamic and static. Dynamic approaches emulate program execution using cycle-accurate microarchitectural simulators, resulting in high precision at the cost of long turnaround times and convoluted setups. Static approaches reduce overhead by predicting cycle counts outside of a concrete runtime environment. However, they are limited by the lack of dynamic runtime information and mostly focus on predictions over single basic blocks, which requires developers to manually construct critical instruction sequences. We present MCAD, a hybrid timing analysis framework that combines the advantages of dynamic and static approaches. Instead of relying on heavyweight cycle-accurate emulation, MCAD collects instruction traces along with dynamic runtime information from QEMU and streams them to a static throughput estimator. This allows developers to accurately estimate the performance impact of software changes for complete programs within minutes, reducing turnaround times by orders of magnitude compared to existing approaches with similar accuracy. Our evaluation shows that MCAD scales to real-world applications such as FFmpeg and Clang with millions of instructions, achieving < 3% geo. mean error compared to ground truth timings from hardware-performance counters on x86 and ARM machines. @InProceedings{ESEC/FSE23p821, author = {Min-Yih Hsu and Felicitas Hetzelt and David Gens and Michael Maitland and Michael Franz}, title = {A Highly Scalable, Hybrid, Cross-Platform Timing Analysis Framework Providing Accurate Differential Throughput Estimation via Instruction-Level Tracing}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {821--831}, doi = {10.1145/3611643.3616246}, year = {2023}, } Publisher's Version |
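To picture what "differential" estimation means here, a toy Python version can replay two instruction traces through a per-opcode latency table and compare predicted cycle counts (the table and traces are invented; MCAD itself streams QEMU traces into LLVM-based scheduling models):

    # Invented per-opcode latencies; real models are per-microarchitecture.
    LATENCY = {"add": 1, "mul": 3, "load": 4, "store": 3}

    def predicted_cycles(trace):
        return sum(LATENCY[op] for op in trace)

    before = ["load", "mul", "add", "store"]   # original code path
    after = ["load", "add", "add", "store"]    # mul strength-reduced
    delta = predicted_cycles(after) - predicted_cycles(before)
    print(f"estimated impact: {delta:+d} cycles")  # negative = speedup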
|
Gerosa, Marco Aurélio |
ESEC/FSE '23: "Do CONTRIBUTING Files Provide ..."
Do CONTRIBUTING Files Provide Information about OSS Newcomers’ Onboarding Barriers?
Felipe Fronchetti, David C. Shepherd, Igor Wiese, Christoph Treude, Marco Aurélio Gerosa, and Igor Steinmacher (Virginia Commonwealth University, USA; Louisiana State University, USA; Federal University of Technology Paraná, Brazil; University of Melbourne, Australia; Northern Arizona University, USA) Effectively onboarding newcomers is essential for the success of open source projects. These projects often provide onboarding guidelines in their ’CONTRIBUTING’ files (e.g., CONTRIBUTING.md on GitHub). These files explain, for example, how to find open tasks, implement solutions, and submit code for review. However, these files often do not follow a standard structure, can be too large, and miss barriers commonly found by newcomers. In this paper, we propose an automated approach to parse these CONTRIBUTING files and assess how they address onboarding barriers. We manually classified a sample of files according to a model of onboarding barriers from the literature, trained a machine learning classifier that automatically predicts the categories of each paragraph (precision: 0.655, recall: 0.662), and surveyed developers to investigate their perspective on the predictions’ adequacy (75% of the predictions were considered adequate). We found that CONTRIBUTING files typically do not cover the barriers newcomers face (52% of the analyzed projects missed at least 3 out of the 6 barriers faced by newcomers; 84% missed at least 2). Our analysis also revealed that information about choosing a task and talking with the community, two of the most recurrent barriers newcomers face, are neglected in more than 75% of the projects. We made our classifier available as an online service that analyzes the content of a given CONTRIBUTING file. Our approach may help community builders identify missing information in the project ecosystem they maintain, and help newcomers understand what to expect in CONTRIBUTING files. @InProceedings{ESEC/FSE23p16, author = {Felipe Fronchetti and David C. Shepherd and Igor Wiese and Christoph Treude and Marco Aurélio Gerosa and Igor Steinmacher}, title = {Do CONTRIBUTING Files Provide Information about OSS Newcomers’ Onboarding Barriers?}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {16--28}, doi = {10.1145/3611643.3616288}, year = {2023}, } Publisher's Version Info |
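A minimal sketch of the paragraph-classification step, assuming scikit-learn and invented example paragraphs and barrier labels (the study's own classifier and category model are derived from the onboarding-barriers literature):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy training data: one paragraph per (hypothetical) barrier category.
    paragraphs = [
        "Look for issues labeled good-first-issue to find a task.",
        "Join our mailing list or Slack to talk with the community.",
        "Run the test suite with make test before opening a pull request.",
        "A maintainer will review your pull request within a week.",
    ]
    labels = ["finding_a_task", "community_contact", "local_setup", "code_review"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(paragraphs, labels)
    print(clf.predict(["Where can newcomers find something to work on?"]))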
|
Ghanavati, Sepideh |
ESEC/FSE '23-DEMO: "A Language Model of Java Methods ..."
A Language Model of Java Methods with Train/Test Deduplication
Chia-Yi Su, Aakash Bansal, Vijayanta Jain, Sepideh Ghanavati, and Collin McMillan (University of Notre Dame, USA; University of Maine, USA) This tool demonstration presents a research toolkit for a language model of Java source code. The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java. In contrast to many existing language models, we prioritize features for researchers, including an open and easily-searchable training set, a held-out test set with different levels of deduplication from the training set, infrastructure for deduplicating new examples, and an implementation platform suitable for execution on equipment accessible on a relatively modest budget. Our model is a GPT2-like architecture with 350m parameters. Our training set includes 52m Java methods (9b tokens) and 13m StackOverflow threads (10.5b tokens). To improve accessibility of research to more members of the community, we limit local resource requirements to GPUs with 16GB video memory. We provide a test set of held-out Java methods that include descriptive comments, including the entire Java projects for those methods. We also provide deduplication tools using precomputed hash tables at various similarity thresholds to help researchers ensure that their own test examples are not in the training set. We make all our tools and data open source and available via Hugging Face and GitHub. @InProceedings{ESEC/FSE23p2152, author = {Chia-Yi Su and Aakash Bansal and Vijayanta Jain and Sepideh Ghanavati and Collin McMillan}, title = {A Language Model of Java Methods with Train/Test Deduplication}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2152--2156}, doi = {10.1145/3611643.3613090}, year = {2023}, } Publisher's Version Info |
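The deduplication idea can be sketched with token n-gram hashing and a Jaccard threshold (a simplification of the released tools, which use precomputed hash tables over the full training set at several similarity thresholds):

    import hashlib

    def shingle_hashes(code, n=3):
        toks = code.split()
        grams = [" ".join(toks[i:i + n]) for i in range(max(1, len(toks) - n + 1))]
        return {hashlib.md5(g.encode()).hexdigest()[:12] for g in grams}

    # Index of training methods (one toy Java method here).
    train_index = [shingle_hashes("public int add ( int a , int b ) { return a + b ; }")]

    def is_near_duplicate(code, threshold=0.6):
        h = shingle_hashes(code)
        return any(len(h & t) / len(h | t) >= threshold for t in train_index)

    print(is_near_duplicate("public int add ( int a , int b ) { return a + b ; }"))  # True
    print(is_near_duplicate("void log ( String m ) { System.out.println ( m ) ; }"))  # False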
|
Ghosh, Supriyo |
ESEC/FSE '23-INDUSTRY: "Detection Is Better Than Cure: ..."
Detection Is Better Than Cure: A Cloud Incidents Perspective
Vaibhav Ganatra, Anjaly Parayil, Supriyo Ghosh, Yu Kang, Minghua Ma, Chetan Bansal, Suman Nath, and Jonathan Mace (Microsoft, India; Microsoft, China; Microsoft, USA) Cloud providers use automated watchdogs or monitors to continuously observe service availability and to proactively report incidents when system performance degrades. Improper monitoring can lead to delays in the detection and mitigation of production incidents, which can be extremely expensive in terms of customer impacts and manual toil from engineering resources. Therefore, a systematic understanding of the pitfalls in current monitoring practices and how they can lead to production incidents is crucial for ensuring continuous reliability of cloud services. In this work, we carefully study the production incidents from the past year at Microsoft to understand the monitoring gaps in a hyperscale cloud platform. We conduct an extensive empirical study to answer: (1) What are the major causes of failures in early detection of production incidents and what are the steps taken for mitigation, (2) What is the impact of failures in early detection, (3) How do we recommend best monitoring practices for different services, and (4) How can we leverage the insights from this study to enhance the reliability of the cloud services. This study provides a deeper understanding of existing monitoring gaps in cloud platforms, uncovers interesting insights, and provides guidance on best monitoring practices for ensuring continuous reliability. @InProceedings{ESEC/FSE23p1891, author = {Vaibhav Ganatra and Anjaly Parayil and Supriyo Ghosh and Yu Kang and Minghua Ma and Chetan Bansal and Suman Nath and Jonathan Mace}, title = {Detection Is Better Than Cure: A Cloud Incidents Perspective}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1891--1902}, doi = {10.1145/3611643.3613898}, year = {2023}, } Publisher's Version |
|
Gill, James |
ESEC/FSE '23-INDUSTRY: "Dead Code Removal at Meta: ..."
Dead Code Removal at Meta: Automatically Deleting Millions of Lines of Code and Petabytes of Deprecated Data
Will Shackleton, Katriel Cohn-Gordon, Peter C. Rigby, Rui Abreu, James Gill, Nachiappan Nagappan, Karim Nakad, Ioannis Papagiannis, Luke Petre, Giorgi Megreli, Patrick Riggs, and James Saindon (Meta, USA; Concordia University, Canada) Software constantly evolves in response to user needs: new features are built, deployed, mature and grow old, and eventually their usage drops enough to merit switching them off. In any large codebase, this feature lifecycle can naturally lead to retaining unnecessary code and data. Removing these respects users’ privacy expectations, as well as helping engineers to work efficiently. In prior software engineering research, we have found little evidence of code deprecation or dead-code removal at industrial scale. We describe Systematic Code and Asset Removal Framework (SCARF), a product deprecation system to assist engineers working in large codebases. SCARF identifies unused code and data assets and safely removes them. It operates fully automatically, including committing code and dropping database tables. It also gathers developer input where it cannot take automated actions, leading to further removals. Dead code removal increases the quality and consistency of large codebases, aids with knowledge management and improves reliability. SCARF has had an important impact at Meta. In the last year alone, it has removed petabytes of data across 12.8 million distinct assets, and deleted over 104 million lines of code. @InProceedings{ESEC/FSE23p1705, author = {Will Shackleton and Katriel Cohn-Gordon and Peter C. Rigby and Rui Abreu and James Gill and Nachiappan Nagappan and Karim Nakad and Ioannis Papagiannis and Luke Petre and Giorgi Megreli and Patrick Riggs and James Saindon}, title = {Dead Code Removal at Meta: Automatically Deleting Millions of Lines of Code and Petabytes of Deprecated Data}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1705--1715}, doi = {10.1145/3611643.3613871}, year = {2023}, } Publisher's Version |
|
Gligoric, Milos |
ESEC/FSE '23: "Multilingual Code Co-evolution ..."
Multilingual Code Co-evolution using Large Language Models
Jiyang Zhang, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric (University of Texas at Austin, USA) Many software projects implement APIs and algorithms in multiple programming languages. Maintaining such projects is tiresome, as developers have to ensure that any change (e.g., a bug fix or a new feature) is being propagated, timely and without errors, to implementations in other programming languages. In the world of ever-changing software, using rule-based translation tools (i.e., transpilers) or machine learning models for translating code from one language to another provides limited value. Translating the entire codebase from one language to another each time is not the way developers work. In this paper, we target a novel task: translating code changes from one programming language to another using large language models (LLMs). We design and implement the first LLM, dubbed Codeditor, to tackle this task. Codeditor explicitly models code changes as edit sequences and learns to correlate changes across programming languages. To evaluate Codeditor, we collect a corpus of 6,613 aligned code changes from 8 pairs of open-source software projects implementing similar functionalities in two programming languages (Java and C#). Results show that Codeditor outperforms the state-of-the-art approaches by a large margin on all commonly used automatic metrics. Our work also reveals that Codeditor is complementary to the existing generation-based models, and their combination ensures even greater performance. @InProceedings{ESEC/FSE23p695, author = {Jiyang Zhang and Pengyu Nie and Junyi Jessy Li and Milos Gligoric}, title = {Multilingual Code Co-evolution using Large Language Models}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {695--707}, doi = {10.1145/3611643.3616350}, year = {2023}, } Publisher's Version |
|
Gong, Jingzhi |
ESEC/FSE '23: "Predicting Software Performance ..."
Predicting Software Performance with Divide-and-Learn
Jingzhi Gong and Tao Chen (University of Electronic Science and Technology of China, China; Loughborough University, UK; University of Birmingham, UK) Predicting the performance of highly configurable software systems is the foundation for performance testing and quality assurance. To that end, recent work has been relying on machine/deep learning to model software performance. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose an approach based on the concept of “divide-and-learn”, dubbed DaL. The basic idea is that, to handle sample sparsity, we divide the samples from the configuration landscape into distant divisions, for each of which we build a regularized Deep Neural Network as the local model to deal with the feature sparsity. A newly given configuration would then be assigned to the right model of division for the final prediction. Experiment results from eight real-world systems and five sets of training data reveal that, compared with the state-of-the-art approaches, DaL performs no worse than the best counterpart on 33 out of 40 cases (26 of which are significantly better) with up to 1.94× improvement on accuracy; requires fewer samples to reach the same/better accuracy; and produces acceptable training overhead. Practically, DaL also considerably improves different global models when using them as the underlying local models, which further strengthens its flexibility. To promote open science, all the data, code, and supplementary figures of this work can be accessed at our repository: https://github.com/ideas-labo/DaL. @InProceedings{ESEC/FSE23p858, author = {Jingzhi Gong and Tao Chen}, title = {Predicting Software Performance with Divide-and-Learn}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {858--870}, doi = {10.1145/3611643.3616334}, year = {2023}, } Publisher's Version Info |
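A compact scikit-learn sketch of the divide-and-learn idea (KMeans and a tree regressor stand in for the paper's division scheme and regularized deep networks; the data are synthetic):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 8)).astype(float)                # config options
    y = 50 * X[:, 0] + 20 * X[:, 1] * X[:, 2] + rng.normal(0, 1, 200)  # performance

    # Divide: partition the configuration samples into divisions.
    divider = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
    # Learn: one local model per division.
    local = {d: DecisionTreeRegressor().fit(X[divider.labels_ == d],
                                            y[divider.labels_ == d])
             for d in range(4)}

    # Predict: route a new configuration to its division's local model.
    x_new = rng.integers(0, 2, size=(1, 8)).astype(float)
    print(local[divider.predict(x_new)[0]].predict(x_new))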
|
Gonugondla, Sujan Kumar |
ESEC/FSE '23: "Towards Greener Yet Powerful ..."
Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study
Xiaokai Wei, Sujan Kumar Gonugondla, Shiqi Wang, Wasi Ahmad, Baishakhi Ray, Haifeng Qian, Xiaopeng Li, Varun Kumar, Zijian Wang, Yuchen Tian, Qing Sun, Ben Athiwaratkun, Mingyue Shang, Murali Krishna Ramanathan, Parminder Bhatia, and Bing Xiang (AWS AI Labs, USA) ML-powered code generation aims to assist developers in writing code more productively by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have pushed the boundary of code generation and achieved impressive performance. However, the huge number of model parameters poses a significant challenge to their adoption in a typical software development environment, where a developer might use a standard laptop or mid-size server to develop code. Such large models cost significant resources in terms of memory, latency, dollars, as well as carbon footprint. Model compression is a promising approach to address these challenges. We have identified quantization as one of the most promising compression techniques for code-generation as it avoids expensive retraining costs. As quantization represents model parameters with lower-bit integers (e.g., int8), the model size and runtime latency would both benefit. We empirically evaluate quantized models on code generation tasks across different dimensions: (i) resource usage and carbon footprint, (ii) accuracy, and (iii) robustness. Through systematic experiments we find a code-aware quantization recipe that could run even a 6-billion-parameter model on a regular laptop without significant accuracy or robustness degradation. We find that the recipe is readily applicable to the code summarization task as well. @InProceedings{ESEC/FSE23p224, author = {Xiaokai Wei and Sujan Kumar Gonugondla and Shiqi Wang and Wasi Ahmad and Baishakhi Ray and Haifeng Qian and Xiaopeng Li and Varun Kumar and Zijian Wang and Yuchen Tian and Qing Sun and Ben Athiwaratkun and Mingyue Shang and Murali Krishna Ramanathan and Parminder Bhatia and Bing Xiang}, title = {Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {224--236}, doi = {10.1145/3611643.3616302}, year = {2023}, } Publisher's Version |
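The basic mechanism, post-training int8 quantization, can be shown with PyTorch's dynamic quantization API on a toy model (the paper's code-aware recipe for billion-parameter code LLMs involves more than this single call):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

    # Store Linear weights as int8; activations are quantized on the fly.
    qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear},
                                                 dtype=torch.qint8)

    x = torch.randn(1, 512)
    print(model(x).shape, qmodel(x).shape)  # same interface, smaller weights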
|
Gonuguntla, Ashritha |
ESEC/FSE '23: "Outage-Watch: Early Prediction ..."
Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer
Shubham Agarwal, Sarthak Chakraborty, Shaddy Garg, Sumit Bisht, Chahat Jain, Ashritha Gonuguntla, and Shiv Saini (Adobe Research, India; University of Illinois at Urbana-Champaign, USA; Adobe, India; Amazon, India; Traceable.ai, India; Cisco, India) Cloud services are omnipresent, and critical cloud service failure is a fact of life. In order to retain customers and prevent revenue loss, it is important to provide high reliability guarantees for these services. One way to do this is by predicting outages in advance, which can help in reducing the severity as well as time to recovery. It is difficult to forecast critical failures due to the rarity of these events. Moreover, critical failures are ill-defined in terms of observable data. Our proposed method, Outage-Watch, defines critical service outages as deteriorations in the Quality of Service (QoS) captured by a set of metrics. Outage-Watch detects such outages in advance by using current system state to predict whether the QoS metrics will cross a threshold and initiate an extreme event. A mixture of Gaussians is used to model the distribution of the QoS metrics for flexibility and an extreme event regularizer helps in improving learning in the tail of the distribution. An outage is predicted if the probability of any one of the QoS metrics crossing a threshold changes significantly. Our evaluation on a real-world SaaS company dataset shows that Outage-Watch significantly outperforms traditional methods with an average AUC of 0.98. Additionally, Outage-Watch detects all the outages exhibiting a change in service metrics and reduces the Mean Time To Detection (MTTD) of outages by up to 88% when deployed in an enterprise cloud-service system, demonstrating the efficacy of our proposed method. @InProceedings{ESEC/FSE23p682, author = {Shubham Agarwal and Sarthak Chakraborty and Shaddy Garg and Sumit Bisht and Chahat Jain and Ashritha Gonuguntla and Shiv Saini}, title = {Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {682--694}, doi = {10.1145/3611643.3616316}, year = {2023}, } Publisher's Version |
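The statistical core, fitting a mixture of Gaussians to a QoS metric and monitoring the tail probability of a threshold crossing, looks roughly like this (synthetic data; the paper's extreme event regularizer and multi-metric logic are omitted):

    import numpy as np
    from scipy.stats import norm
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    latency = np.concatenate([rng.normal(100, 5, 950),     # normal operation
                              rng.normal(180, 20, 50)])    # rare degradations
    gmm = GaussianMixture(n_components=2, random_state=0).fit(latency.reshape(-1, 1))

    def p_exceeds(threshold):
        # P(metric > threshold) under the fitted mixture.
        return sum(w * norm.sf(threshold, loc=mu, scale=np.sqrt(var))
                   for w, mu, var in zip(gmm.weights_, gmm.means_.ravel(),
                                         gmm.covariances_.ravel()))

    print(p_exceeds(150.0))  # alert when this tail probability jumps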
|
Gorla, Alessandra |
ESEC/FSE '23: "LibKit: Detecting Third-Party ..."
LibKit: Detecting Third-Party Libraries in iOS Apps
Daniel Domínguez-Álvarez, Alejandro de la Cruz, Alessandra Gorla, and Juan Caballero (IMDEA Software Institute, Spain; University of Verona, Italy) We present LibKit, the first approach and tool for detecting the name and version of third-party libraries (TPLs) present in iOS apps. LibKit automatically builds fingerprints for 86K library versions available through the CocoaPods dependency manager and matches them against the decrypted app executables to identify the TPLs (name and version) an iOS app uses. LibKit supports apps written in Swift and Objective-C, detects statically and dynamically linked libraries, and addresses challenges such as partially included libraries and different compiler versions and configurations producing variants of the same library version. On a ground truth of 95 open-source apps, LibKit identifies libraries with a precision of 0.911 and a recall of 0.839. LibKit also significantly outperforms the state-of-the-art CRiOS tool for identifying TPL boundaries. When applied to 1,500 apps from the iTunes Store, LibKit detects 47,015 library versions, identifying popular apps that contain old library versions. @InProceedings{ESEC/FSE23p1407, author = {Daniel Domínguez-Álvarez and Alejandro de la Cruz and Alessandra Gorla and Juan Caballero}, title = {LibKit: Detecting Third-Party Libraries in iOS Apps}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1407--1418}, doi = {10.1145/3611643.3616344}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available |
|
Gotmare, Akhilesh Deepak |
ESEC/FSE '23: "Efficient Text-to-Code Retrieval ..."
Efficient Text-to-Code Retrieval with Cascaded Fast and Slow Transformer Models
Akhilesh Deepak Gotmare, Junnan Li, Shafiq Joty, and Steven C.H. Hoi (Salesforce AI Research, Singapore; Salesforce AI Research, USA) The goal of semantic code search or text-to-code search is to retrieve a semantically relevant code snippet from an existing code database using a natural language query. In a practical semantic code search system, existing approaches fail to provide an optimal balance between retrieval speed and the relevance of the retrieved results. We propose an efficient and effective text-to-code search framework with cascaded fast and slow models, in which a fast transformer encoder model is learned to optimize a scalable index for fast retrieval followed by learning a slow classification-based re-ranking model to improve the accuracy of the top K results from the fast retrieval. To further reduce the high memory cost of deploying two separate models in practice, we propose to jointly train the fast and slow model based on a single transformer encoder with shared parameters. Empirically our cascaded method is not only efficient and scalable, but also achieves state-of-the-art results with a mean reciprocal rank (MRR) score of 0.7795 (averaged across 6 programming languages) on the CodeSearchNet benchmark as opposed to the prior state-of-the-art result of 0.744 MRR. Our codebase can be found at this link. @InProceedings{ESEC/FSE23p388, author = {Akhilesh Deepak Gotmare and Junnan Li and Shafiq Joty and Steven C.H. Hoi}, title = {Efficient Text-to-Code Retrieval with Cascaded Fast and Slow Transformer Models}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {388--400}, doi = {10.1145/3611643.3616369}, year = {2023}, } Publisher's Version |
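The cascade itself is simple to schematize (random vectors stand in for the fast bi-encoder and a trivial heuristic for the slow re-ranker; the actual system shares one transformer between both stages):

    import numpy as np

    rng = np.random.default_rng(0)
    snippets = ["def add(a, b): return a + b",
                "def read(path): return open(path).read()",
                "def sort(xs): return sorted(xs)"]
    index = rng.normal(size=(len(snippets), 64))            # fast-encoder vectors
    index /= np.linalg.norm(index, axis=1, keepdims=True)

    def fast_retrieve(query_vec, k=2):                      # stage 1: cheap index scan
        sims = index @ (query_vec / np.linalg.norm(query_vec))
        return np.argsort(-sims)[:k]

    def slow_rerank(query, cand_ids):                       # stage 2: pairwise scoring
        score = lambda i: -abs(len(query) - len(snippets[i]))  # cross-encoder stand-in
        return sorted(cand_ids, key=score, reverse=True)

    q_vec = rng.normal(size=64)
    print([snippets[i] for i in slow_rerank("sum two numbers", fast_retrieve(q_vec))])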
|
Gousios, Georgios |
ESEC/FSE '23: "Dynamic Prediction of Delays ..."
Dynamic Prediction of Delays in Software Projects using Delay Patterns and Bayesian Modeling
Elvan Kula, Eric Greuter, Arie van Deursen, and Georgios Gousios (Delft University of Technology, Netherlands; ING, Netherlands) Modern agile software projects are subject to constant change, making it essential to re-assess overall delay risk throughout the project life cycle. Existing effort estimation models are static and unable to incorporate changes occurring during project execution. In this paper, we propose a dynamic model for continuously predicting overall delay using delay patterns and Bayesian modeling. The model incorporates the context of the project phase and learns from changes in team performance over time. We apply the approach to real-world data from 4,040 epics and 270 teams at ING. An empirical evaluation of our approach and comparison to the state-of-the-art demonstrate significant improvements in predictive accuracy. The dynamic model consistently outperforms static approaches and the state-of-the-art, even during early project phases. @InProceedings{ESEC/FSE23p1012, author = {Elvan Kula and Eric Greuter and Arie van Deursen and Georgios Gousios}, title = {Dynamic Prediction of Delays in Software Projects using Delay Patterns and Bayesian Modeling}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1012--1023}, doi = {10.1145/3611643.3616328}, year = {2023}, } Publisher's Version |
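As a toy rendering of the Bayesian mechanic (the paper's model is far richer, incorporating delay patterns and project phases; this shows only conjugate updating of a delay rate as sprints complete, with invented counts):

    from scipy.stats import beta

    a, b = 1.0, 1.0                                    # uninformative Beta(1, 1) prior
    for delayed, total in [(2, 5), (1, 6), (4, 5)]:    # per-sprint observations
        a += delayed                                   # work items that slipped
        b += total - delayed                           # work items on time

    posterior = beta(a, b)
    print(posterior.mean())                            # current delay-rate estimate
    print(posterior.ppf([0.05, 0.95]))                 # 90% credible interval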
|
Greuter, Eric |
ESEC/FSE '23: "Dynamic Prediction of Delays ..."
Dynamic Prediction of Delays in Software Projects using Delay Patterns and Bayesian Modeling
Elvan Kula, Eric Greuter, Arie van Deursen, and Georgios Gousios (Delft University of Technology, Netherlands; ING, Netherlands) Modern agile software projects are subject to constant change, making it essential to re-assess overall delay risk throughout the project life cycle. Existing effort estimation models are static and unable to incorporate changes occurring during project execution. In this paper, we propose a dynamic model for continuously predicting overall delay using delay patterns and Bayesian modeling. The model incorporates the context of the project phase and learns from changes in team performance over time. We apply the approach to real-world data from 4,040 epics and 270 teams at ING. An empirical evaluation of our approach and comparison to the state-of-the-art demonstrate significant improvements in predictive accuracy. The dynamic model consistently outperforms static approaches and the state-of-the-art, even during early project phases. @InProceedings{ESEC/FSE23p1012, author = {Elvan Kula and Eric Greuter and Arie van Deursen and Georgios Gousios}, title = {Dynamic Prediction of Delays in Software Projects using Delay Patterns and Bayesian Modeling}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1012--1023}, doi = {10.1145/3611643.3616328}, year = {2023}, } Publisher's Version |
|
Grishina, Anastasiia |
ESEC/FSE '23: "The EarlyBIRD Catches the ..."
The EarlyBIRD Catches the Bug: On Exploiting Early Layers of Encoder Models for More Efficient Code Classification
Anastasiia Grishina, Max Hort, and Leon Moonen (Simula Research Laboratory, Norway; BI Norwegian Business School, Norway) The use of modern Natural Language Processing (NLP) techniques has been shown to be beneficial for software engineering tasks, such as vulnerability detection and type inference. However, training deep NLP models requires significant computational resources. This paper explores techniques that aim at achieving the best usage of resources and available information in these models. We propose a generic approach, EarlyBIRD, to build composite representations of code from the early layers of a pre-trained transformer model. We empirically investigate the viability of this approach on the CodeBERT model by comparing the performance of 12 strategies for creating composite representations with the standard practice of only using the last encoder layer. Our evaluation on four datasets shows that several early layer combinations yield better performance on defect detection, and some combinations improve multi-class classification. More specifically, we obtain a +2 average improvement in detection accuracy on Devign with only 3 out of 12 layers of CodeBERT and a 3.3x speed-up in fine-tuning. These findings show that early layers can be used to obtain better results using the same resources, as well as to reduce resource usage during fine-tuning and inference. @InProceedings{ESEC/FSE23p895, author = {Anastasiia Grishina and Max Hort and Leon Moonen}, title = {The EarlyBIRD Catches the Bug: On Exploiting Early Layers of Encoder Models for More Efficient Code Classification}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {895--907}, doi = {10.1145/3611643.3616304}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
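Reading an early encoder layer instead of the last one takes only a few lines with Hugging Face Transformers (the layer index and mean pooling below are just one of the twelve combination strategies the paper compares):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    enc = AutoModel.from_pretrained("microsoft/codebert-base",
                                    output_hidden_states=True)

    batch = tok("int div(int a, int b) { return a / b; }", return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).hidden_states   # embeddings + one entry per layer

    pooled = hidden[3].mean(dim=1)            # early (3rd) layer, mean-pooled
    print(pooled.shape)                       # torch.Size([1, 768]) -> classifier input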
|
Groce, Alex |
ESEC/FSE '23: "Contextual Predictive Mutation ..."
Contextual Predictive Mutation Testing
Kush Jain, Uri Alon, Alex Groce, and Claire Le Goues (Carnegie Mellon University, USA; Northern Arizona University, USA) Mutation testing is a powerful technique for assessing and improving test suite quality that artificially introduces bugs and checks whether the test suites catch them. However, it is also computationally expensive and thus does not scale to large systems and projects. One promising recent approach to tackling this scalability problem uses machine learning to predict whether the tests will detect the synthetic bugs, without actually running those tests. However, existing predictive mutation testing approaches still misclassify 33% of detection outcomes on a randomly sampled set of mutant-test suite pairs. We introduce MutationBERT, an approach for predictive mutation testing that simultaneously encodes the source method mutation and test method, capturing key context in the input representation. Thanks to its higher precision, MutationBERT saves 33% of the time spent by a prior approach on checking/verifying live mutants. MutationBERT also outperforms the state-of-the-art in both same-project and cross-project settings, with meaningful improvements in precision, recall, and F1 score. We validate our input representation and aggregation approaches for lifting predictions from the test matrix level to the test suite level, finding similar improvements in performance. MutationBERT not only enhances the state-of-the-art in predictive mutation testing, but also presents practical benefits for real-world applications, both in saving developer time and finding hard-to-detect mutants that prior approaches do not. @InProceedings{ESEC/FSE23p250, author = {Kush Jain and Uri Alon and Alex Groce and Claire Le Goues}, title = {Contextual Predictive Mutation Testing}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {250--261}, doi = {10.1145/3611643.3616289}, year = {2023}, } Publisher's Version |
|
Grundy, John |
ESEC/FSE '23-DEMO: "LazyCow: A Lightweight Crowdsourced ..."
LazyCow: A Lightweight Crowdsourced Testing Tool for Taming Android Fragmentation
Xiaoyu Sun, Xiao Chen, Yonghui Liu, John Grundy, and Li Li (Australian National University, Australia; Monash University, Australia; Beihang University, China) Android fragmentation refers to the increasing variety of Android devices and operating system versions. Their sheer number makes it impossible to test an app on every supported device, resulting in many device compatibility issues and leading to poor user experiences. To mitigate this, a number of works that automatically detect compatibility issues have been proposed. However, current state-of-the-art techniques can only be used to detect specific types of compatibility issues (i.e., compatibility issues caused by API signature evolution), leaving many other essential categories of compatibility issues undetected. For instance, customised OS versions on real devices and semantic OS modifications could result in severe compatibility issues that are difficult to detect statically. In order to address this research gap and facilitate the prospect of taming Android fragmentation through crowdsourced efforts, we propose LazyCow, a novel, lightweight, crowdsourced testing tool. Our experimental results involving thousands of test cases on real Android devices demonstrate that LazyCow is effective at autonomously identifying and validating API-induced compatibility issues. The source code of both the client side and the server side is publicly available in our artifact package. A demo video of our tool is available at https://www.youtube.com/watch?v=_xzWv_mo5xQ. @InProceedings{ESEC/FSE23p2127, author = {Xiaoyu Sun and Xiao Chen and Yonghui Liu and John Grundy and Li Li}, title = {LazyCow: A Lightweight Crowdsourced Testing Tool for Taming Android Fragmentation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2127--2131}, doi = {10.1145/3611643.3613098}, year = {2023}, } Publisher's Version ESEC/FSE '23-INDUSTRY: "C³: Code Clone-Based Identification ..." C³: Code Clone-Based Identification of Duplicated Components Yanming Yang, Ying Zou, Xing Hu, David Lo, Chao Ni, John Grundy, and Xin Xia (Zhejiang University, China; Queen’s University, Canada; Singapore Management University, Singapore; Monash University, Australia; Huawei, China) Reinventing the wheel is a detrimental programming practice in software development that frequently results in the introduction of duplicated components. This practice not only leads to increased maintenance and labor costs but also poses a higher risk of propagating bugs throughout the system. Despite numerous issues introduced by duplicated components in software, the identification of component-level clones remains a significant challenge that existing studies struggle to effectively tackle. Specifically, existing methods face two primary limitations that are challenging to overcome: 1) Measuring the similarity between different components presents a challenge due to the significant size differences among them; 2) Identifying functional clones is a complex task as determining the primary functionality of components proves to be difficult. To overcome the aforementioned challenges, we present a novel approach named C³ (Component-level Code Clone detector) to effectively identify both textual and functional cloned components. In addition, to enhance the efficiency of eliminating cloned components, we develop an assessment method based on six component-level clone features, which assists developers in prioritizing the cloned components based on the refactoring necessity.
To validate the effectiveness of C³, we employ a large-scale industrial product developed by Huawei, a prominent global ICT company, as our dataset and apply C³ to this dataset to identify the cloned components. Our experimental results demonstrate that C³ is capable of accurately detecting cloned components, achieving impressive performance in terms of precision (0.93), recall (0.91), and F1-score (0.9). Besides, we conduct a comprehensive user study to further validate the effectiveness and practicality of our assessment method and the proposed clone features in assessing the refactoring necessity of different cloned components. Our study establishes solid alignment between assessment outcomes and participant responses, indicating that our method accurately prioritizes cloned components with a high refactoring necessity. This finding further confirms the usefulness of the six “golden features” in our assessment. @InProceedings{ESEC/FSE23p1832, author = {Yanming Yang and Ying Zou and Xing Hu and David Lo and Chao Ni and John Grundy and Xin Xia}, title = {C³: Code Clone-Based Identification of Duplicated Components}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1832--1843}, doi = {10.1145/3611643.3613883}, year = {2023}, } Publisher's Version |
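One way to picture component-level matching despite large size differences (purely illustrative; C³'s actual similarity measure and clone features differ) is to normalize file-level clone evidence by the smaller component:

    def component_similarity(files_a, files_b, cloned_pairs):
        # Fraction of the smaller component covered by clone evidence, so a
        # small component fully duplicated inside a large one still scores 1.0.
        cloned_a = {a for a, _ in cloned_pairs}
        cloned_b = {b for _, b in cloned_pairs}
        return max(len(cloned_a), len(cloned_b)) / min(len(files_a), len(files_b))

    comp_a = {"A/Parser.java", "A/Lexer.java", "A/Util.java"}
    comp_b = {"B/Parser.java", "B/Lexer.java"}
    clones = [("A/Parser.java", "B/Parser.java"), ("A/Lexer.java", "B/Lexer.java")]
    print(component_similarity(comp_a, comp_b, clones))  # 1.0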
|
Grunske, Lars |
ESEC/FSE '23: "Semantic Debugging ..."
Semantic Debugging
Martin Eberlein, Marius Smytzek, Dominic Steinhöfel, Lars Grunske, and Andreas Zeller (Humboldt University of Berlin, Germany; CISPA Helmholtz Center for Information Security, Germany) Why does my program fail? We present a novel and general technique to automatically determine failure causes and conditions, using logical properties over input elements: “The program fails if and only if int(<length>) > len(<payload>) holds—that is, the given <length> is larger than the <payload> length.” To obtain such diagnoses, our AVICENNA prototype uses modern techniques for inferring properties of passing and failing inputs, validating and refining hypotheses by having a constraint solver generate supporting test cases. As a result, AVICENNA produces crisp and expressive diagnoses even for complex failure conditions, considerably improving over the state of the art with diagnoses close to those of human experts. @InProceedings{ESEC/FSE23p438, author = {Martin Eberlein and Marius Smytzek and Dominic Steinhöfel and Lars Grunske and Andreas Zeller}, title = {Semantic Debugging}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {438--449}, doi = {10.1145/3611643.3616296}, year = {2023}, } Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable |
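The diagnosis loop can be caricatured in a few lines (the real system infers candidate properties automatically and asks a constraint solver for targeted tests; here the hypothesis is fixed to the example above and test inputs are random):

    import random

    def program_fails(length, payload):
        return length > len(payload)        # hidden ground-truth bug

    hypothesis = lambda length, payload: length > len(payload)

    # Generate supporting tests and look for inputs where the hypothesized
    # failure condition disagrees with the program's actual behavior.
    counterexamples = [(n, p) for n, p in
                       ((random.randint(0, 20), "x" * random.randint(0, 20))
                        for _ in range(1000))
                       if hypothesis(n, p) != program_fails(n, p)]
    print("diagnosis holds" if not counterexamples else counterexamples[:3])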
|
Gu, Jian |
ESEC/FSE '23-IVR: "Towards Top-Down Automated ..."
Towards Top-Down Automated Development in Limited Scopes: A Neuro-Symbolic Framework from Expressibles to Executables
Jian Gu and Harald C. Gall (Monash University, Australia; University of Zurich, Switzerland) Deep code generation is a topic of deep learning for software engineering (DL4SE), which adopts neural models to generate code for the intended functions. Since end-to-end neural methods lack domain knowledge and software hierarchy awareness, they tend to perform poorly w.r.t. project-level tasks. To systematically explore the potential improvements of code generation, we let it participate in the whole top-down development from expressibles to executables, which is possible in limited scopes. In the process, it benefits from massive samples, features, and knowledge. As the foundation, we suggest building a taxonomy on code data, namely code taxonomy, leveraging the categorization of code information. Moreover, we introduce a three-layer semantic pyramid (SP) to associate text data and code data. It identifies the information of different abstraction levels, and thus introduces the domain knowledge on development and reveals the hierarchy of software. Furthermore, we propose a semantic pyramid framework (SPF) as the approach, focusing on software of high modularity and low complexity. SPF divides the code generation process into stages and reserves spots for potential interactions. In addition, we have conceived preliminary applications in software development to confirm the neuro-symbolic framework. @InProceedings{ESEC/FSE23p2072, author = {Jian Gu and Harald C. Gall}, title = {Towards Top-Down Automated Development in Limited Scopes: A Neuro-Symbolic Framework from Expressibles to Executables}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2072--2076}, doi = {10.1145/3611643.3613076}, year = {2023}, } Publisher's Version |
|
Gu, Jiazhen |
ESEC/FSE '23-INDUSTRY: "Appaction: Automatic GUI Interaction ..."
Appaction: Automatic GUI Interaction for Mobile Apps via Holistic Widget Perception
Yongxiang Hu, Jiazhen Gu, Shuqing Hu, Yu Zhang, Wenjie Tian, Shiyu Guo, Chaoyi Chen, and Yangfan Zhou (Fudan University, China; Meituan, China; Shanghai Key Laboratory of Intelligent Information Processing, China) In industrial practice, GUI (Graphical User Interface) testing of mobile apps still inevitably relies on huge manual efforts. The major effort lies in understanding the GUIs so that testing scripts can be written accordingly. Quality assurance could therefore be very labor-intensive, especially for modern commercial mobile apps, where a single app may include numerous, diverse, and complex GUIs, e.g., those for placing orders of different commercial items. To reduce such human efforts, we propose Appaction, a learning-based automatic GUI interaction approach we developed for Meituan, one of the largest E-commerce providers with over 600 million users. Appaction can automatically analyze the target GUI and understand what each input of the GUI is about, so that corresponding valid inputs can be entered accordingly. To this end, Appaction adopts a multi-modal model to learn from human experiences in perceiving a GUI. This allows it to infer corresponding valid input events that can properly interact with the GUI. In this way, the target app can be effectively exercised. We present our experiences in Meituan on applying Appaction to popular commercial apps. We demonstrate the effectiveness of Appaction in GUI analysis, and show that it can perform correct interactions for numerous form pages. @InProceedings{ESEC/FSE23p1786, author = {Yongxiang Hu and Jiazhen Gu and Shuqing Hu and Yu Zhang and Wenjie Tian and Shiyu Guo and Chaoyi Chen and Yangfan Zhou}, title = {Appaction: Automatic GUI Interaction for Mobile Apps via Holistic Widget Perception}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1786--1797}, doi = {10.1145/3611643.3613885}, year = {2023}, } Publisher's Version ESEC/FSE '23: "BiasAsker: Measuring the Bias ..." BiasAsker: Measuring the Bias in Conversational AI System Yuxuan Wan, Wenxuan Wang, Pinjia He, Jiazhen Gu, Haonan Bai, and Michael R. Lyu (Chinese University of Hong Kong, China) Powered by advanced Artificial Intelligence (AI) techniques, conversational AI systems, such as ChatGPT, and digital assistants like Siri, have been widely deployed in daily life. However, such systems may still produce content containing biases and stereotypes, causing potential social problems. Due to modern AI techniques’ data-driven, black-box nature, comprehensively identifying and measuring biases in conversational systems remains challenging. Particularly, it is hard to generate inputs that can comprehensively trigger potential bias due to the lack of data containing both social groups and biased properties. In addition, modern conversational systems can produce diverse responses (e.g., chatting and explanation), which makes existing bias detection methods based solely on sentiment and toxicity difficult to adopt. In this paper, we propose BiasAsker, an automated framework to identify and measure social bias in conversational AI systems. To obtain social groups and biased properties, we construct a comprehensive social bias dataset containing a total of 841 groups and 5,021 biased properties. Given the dataset, BiasAsker automatically generates questions and adopts a novel method based on existence measurement to identify two types of biases (i.e., absolute bias and related bias) in conversational systems.
Extensive experiments on eight commercial systems and two famous research models, such as ChatGPT and GPT-3, show that 32.83% of the questions generated by BiasAsker can trigger biased behaviors in these widely deployed conversational systems. All the code, data, and experimental results have been released to facilitate future research. @InProceedings{ESEC/FSE23p515, author = {Yuxuan Wan and Wenxuan Wang and Pinjia He and Jiazhen Gu and Haonan Bai and Michael R. Lyu}, title = {BiasAsker: Measuring the Bias in Conversational AI System}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {515--527}, doi = {10.1145/3611643.3616310}, year = {2023}, } Publisher's Version |
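The question-generation step reduces to pairing social groups with candidate properties and flagging asymmetric answers (the groups, properties, and answering stub below are invented; the real framework drives a deployed conversational system and parses its replies):

    groups = ["group A", "group B"]
    properties = ["are bad drivers", "are good at math"]

    def ask_system(question):
        # Stub: a real harness would query the chatbot under test here.
        return "yes" if "group A" in question and "drivers" in question else "no"

    for prop in properties:
        answers = {g: ask_system(f"Do you think {g} {prop}?") for g in groups}
        if len(set(answers.values())) > 1:   # groups treated differently
            print(f"possible absolute bias on '{prop}':", answers)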
|
Gu, Qiuhan |
ESEC/FSE '23-SRC: "LLM-Based Code Generation ..."
LLM-Based Code Generation Method for Golang Compiler Testing
Qiuhan Gu (Nanjing University, China) Modern optimizing compilers are among the most complex software systems humans build. One way to identify subtle compiler bugs is fuzzing. Both the quantity and the quality of testcases are crucial to the performance of fuzzing. Traditional testcase-generation methods, such as Csmith and YARPGen, have been proven successful at discovering compiler bugs. However, such generated testcases have limited coverage and quantity. In this paper, we present a code generation method for compiler testing based on an LLM to maximize the quality and quantity of the generated code. In particular, to avoid undefined behavior and syntax errors in generated testcases, we design a filter strategy to clean the source code, preparing a high-quality dataset for the model training. In addition, we present a seed scheduling strategy to improve code generation. We apply the method to test the Golang compiler and the results show that our pipeline outperforms previous methods both qualitatively and quantitatively. It produces testcases with an average coverage of 3.38%, in contrast to the testcases generated by GoFuzz, which have an average coverage of 0.44%. Moreover, among all the generated testcases, only 2.79% exhibited syntax errors, and none displayed undefined behavior. @InProceedings{ESEC/FSE23p2201, author = {Qiuhan Gu}, title = {LLM-Based Code Generation Method for Golang Compiler Testing}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2201--2203}, doi = {10.1145/3611643.3617850}, year = {2023}, } Publisher's Version |
|
Gu, Xiaodong |
ESEC/FSE '23: "Self-Supervised Query Reformulation ..."
Self-Supervised Query Reformulation for Code Search
Yuetian Mao, Chengcheng Wan, Yuze Jiang, and Xiaodong Gu (Shanghai Jiao Tong University, China; East China Normal University, China) Automatic query reformulation is a widely utilized technology for enriching user requirements and enhancing the outcomes of code search. It can be conceptualized as a machine translation task, wherein the objective is to rephrase a given query into a more comprehensive alternative. While such models show promising results, training them typically requires a large parallel corpus of query pairs (i.e., an original query and its reformulation) that online code search engines keep confidential and unpublished. This restricts the practicality of such approaches in software development processes. In this paper, we propose SSQR, a self-supervised query reformulation method that does not rely on any parallel query corpus. Inspired by pre-trained models, SSQR treats query reformulation as a masked language modeling task conducted on an extensive unannotated corpus of queries. SSQR extends T5 (a sequence-to-sequence model based on Transformer) with a new pre-training objective named corrupted query completion (CQC), which randomly masks words within a complete query and trains T5 to predict the masked content. Subsequently, for a given query to be reformulated, SSQR identifies potential locations for expansion and leverages the pre-trained T5 model to generate appropriate content to fill these gaps. The selection of expansions is then based on the information gain associated with each candidate. Evaluation results demonstrate that SSQR outperforms unsupervised baselines significantly and achieves competitive performance compared to supervised methods. @InProceedings{ESEC/FSE23p363, author = {Yuetian Mao and Chengcheng Wan and Yuze Jiang and Xiaodong Gu}, title = {Self-Supervised Query Reformulation for Code Search}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {363--374}, doi = {10.1145/3611643.3616306}, year = {2023}, } Publisher's Version |
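The corrupted-query-completion objective can be previewed with an off-the-shelf T5 checkpoint and its sentinel tokens (t5-small below, not the fine-tuned model from the paper, so the proposed expansion is only illustrative):

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tok = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Mask a potential expansion spot in the query with a sentinel token.
    ids = tok("read <extra_id_0> from csv file", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=8)
    print(tok.decode(out[0], skip_special_tokens=False))  # model's span proposal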
|
Guha, Arjun |
ESEC/FSE '23-DEMO: "npm-follower: A Complete Dataset ..."
npm-follower: A Complete Dataset Tracking the NPM Ecosystem
Donald Pinckney, Federico Cassano, Arjun Guha, and Jonathan Bell (Northeastern University, USA) Software developers typically rely upon a large network of dependencies to build their applications. For instance, the NPM package repository contains over 3 million packages and serves tens of billions of downloads weekly. Understanding the structure and nature of packages, dependencies, and published code requires datasets that provide researchers with easy access to metadata and code of packages. However, prior work on NPM dataset construction typically has two limitations: 1) only metadata is scraped, and 2) packages or versions that are deleted from NPM cannot be scraped. Over 330,000 versions of packages were deleted from NPM between July 2022 and May 2023. This data is critical for researchers as it often pertains to important questions of security and malware. We present npm-follower, a dataset and crawling architecture which archives metadata and code of all packages and versions as they are published, and is thus able to retain data which is later deleted. The dataset currently includes over 35 million versions of packages, and grows at a rate of about 1 million versions per month. The dataset is designed to be easily used by researchers answering questions involving either metadata or program analysis. Both the code and dataset are available at: https://dependencies.science @InProceedings{ESEC/FSE23p2132, author = {Donald Pinckney and Federico Cassano and Arjun Guha and Jonathan Bell}, title = {npm-follower: A Complete Dataset Tracking the NPM Ecosystem}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2132--2136}, doi = {10.1145/3611643.3613094}, year = {2023}, } Publisher's Version |
|
Gulwani, Sumit |
ESEC/FSE '23: "Grace: Language Models Meet ..."
Grace: Language Models Meet Code Edits
Priyanshu Gupta, Avishree Khare, Yasharth Bajpai, Saikat Chakraborty, Sumit Gulwani, Aditya Kanade, Arjun Radhakrishna, Gustavo Soares, and Ashish Tiwari (Microsoft, India; University of Pennsylvania, USA; Microsoft Research, USA; Microsoft, USA; Microsoft Research, India) Developers spend a significant amount of time in editing code for a variety of reasons such as bug fixing or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research due to the diversity of code edits and the difficulty of capturing the developer intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) with the knowledge of relevant prior associated edits, which we call the Grace (Generation conditioned on Associated Code Edits) method. The generative capability of the LLMs helps address the diversity in code changes and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, Codex and CodeT5, in zero-shot and fine-tuning settings respectively. In our experiments with two datasets, Grace boosts the performance of the LLMs significantly, enabling them to generate 29% and 54% more correctly edited code in top-1 suggestions relative to the current state-of-the-art symbolic and neural approaches, respectively. @InProceedings{ESEC/FSE23p1483, author = {Priyanshu Gupta and Avishree Khare and Yasharth Bajpai and Saikat Chakraborty and Sumit Gulwani and Aditya Kanade and Arjun Radhakrishna and Gustavo Soares and Ashish Tiwari}, title = {Grace: Language Models Meet Code Edits}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1483--1495}, doi = {10.1145/3611643.3616253}, year = {2023}, } Publisher's Version Published Artifact Archive submitted (770 kB) Artifacts Available |
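The conditioning idea amounts to prompt construction: serialize associated prior edits ahead of the code to be edited (the diff-like layout below is invented; the paper studies concrete encodings for Codex and CodeT5):

    def build_edit_prompt(associated_edits, target_before):
        parts = ["# Relevant prior edits:"]
        for before, after in associated_edits:
            parts += [f"- {before}", f"+ {after}"]
        parts += ["# Code to edit:", target_before, "# Edited code:"]
        return "\n".join(parts)

    prompt = build_edit_prompt([("log.warn(msg)", "logger.warning(msg)")],
                               "log.warn('disk full')")
    print(prompt)   # hand this to an LLM completion endpoint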
|
Gulzar, Muhammad Ali |
ESEC/FSE '23: "Co-dependence Aware Fuzzing ..."
Co-dependence Aware Fuzzing for Dataflow-Based Big Data Analytics
Ahmad Humayun, Miryung Kim, and Muhammad Ali Gulzar (Virginia Tech, USA; University of California at Los Angeles, USA) Data-intensive scalable computing has become popular due to the increasing demands of analyzing big data. For example, Apache Spark and Hadoop allow developers to write dataflow-based applications with user-defined functions to process data with custom logic. Testing such applications is difficult. (1) These applications often take multiple datasets as input. (2) Unlike in SQL, there is no explicit schema for these datasets and each unstructured (or semi-structured) dataset is segmented and parsed at runtime. (3) Dataflow operators (e.g., join) create implicit co-dependence constraints between the fields of multiple datasets. An efficient and effective testing technique must analyze co-dependence among different regions of multiple datasets at the level of rows and columns and orchestrate input mutations jointly on co-dependent regions. We propose DepFuzz to increase the effectiveness and efficiency of fuzz testing dataflow-based big data applications. The key insight behind DepFuzz is twofold. It keeps track of which code segments operate on which datasets, which rows, and which columns. By analyzing the use of dataflow operators (e.g., join and groupByKey) in tandem with the semantics of UDFs, DepFuzz generates test data that subsequently reach hard-to-reach regions of the application code. In real-world big data applications, DepFuzz finds 3.4× more faults, achieving 29% more statement coverage in half the time of Jazzer, a state-of-the-art commercial fuzzer for Java bytecode. It outperforms prior DISC testing by exposing deeper semantic faults beyond simpler input formatting errors, especially when multiple datasets have complex interactions through dataflow operators. @InProceedings{ESEC/FSE23p1050, author = {Ahmad Humayun and Miryung Kim and Muhammad Ali Gulzar}, title = {Co-dependence Aware Fuzzing for Dataflow-Based Big Data Analytics}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1050--1061}, doi = {10.1145/3611643.3616298}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available |
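Co-dependent mutation can be illustrated on two toy datasets joined by a key column (a simplification; DepFuzz discovers such co-dependence automatically from dataflow operators and UDF semantics):

    import random

    orders = [["o1", "u7", "30"], ["o2", "u9", "12"]]   # order_id, user_id, amount
    users = [["u7", "alice"], ["u9", "bob"]]            # user_id, name

    def mutate_codependent(orders, users):
        # Rewrite one join key consistently in BOTH datasets, so the mutated
        # rows still survive the join and reach deeper application logic.
        old = random.choice([u[0] for u in users])
        new = old + "_fuzz"
        return ([[o[0], new if o[1] == old else o[1], o[2]] for o in orders],
                [[new if u[0] == old else u[0], u[1]] for u in users])

    print(mutate_codependent(orders, users))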
|
Guo, Hanwen |
ESEC/FSE '23: "Statistical Type Inference ..."
Statistical Type Inference for Incomplete Programs
Yaohui Peng, Jing Xie, Qiongling Yang, Hanwen Guo, Qingan Li, Jingling Xue, and Mengting Yuan (Wuhan University, China; UNSW, Australia) We propose a novel two-stage approach, Stir, for inferring types in incomplete programs that may be ill-formed, where whole-program syntactic analysis often fails. In the first stage, Stir predicts a type tag for each token by using neural networks, and consequently, infers all the simple types in the program. In the second stage, Stir refines the complex types for the tokens with predicted complex type tags. Unlike existing machine-learning-based approaches, which solve type inference as a classification problem, Stir reduces it to a sequence-to-graph parsing problem. According to our experimental results, Stir achieves an accuracy of 97.37% for simple types. By representing complex types as directed graphs (type graphs), Stir achieves a type similarity score of 77.36% and 59.61% for complex types and zero-shot complex types, respectively. @InProceedings{ESEC/FSE23p720, author = {Yaohui Peng and Jing Xie and Qiongling Yang and Hanwen Guo and Qingan Li and Jingling Xue and Mengting Yuan}, title = {Statistical Type Inference for Incomplete Programs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {720--732}, doi = {10.1145/3611643.3616283}, year = {2023}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Guo, Qing |
ESEC/FSE '23: "DistXplore: Distribution-Guided ..."
DistXplore: Distribution-Guided Testing for Evaluating and Enhancing Deep Learning Systems
Longtian Wang, Xiaofei Xie, Xiaoning Du, Meng Tian, Qing Guo, Zheng Yang, and Chao Shen (Xi’an Jiaotong University, China; Singapore Management University, Singapore; Monash University, Australia; A*STAR, Singapore; Huawei, China) Deep learning (DL) models are trained on sampled data, where the distribution of training data differs from that of real-world data (i.e., the distribution shift), which reduces the model's robustness. Various testing techniques have been proposed, including distribution-unaware and distribution-aware methods. However, distribution-unaware testing lacks effectiveness by not explicitly considering the distribution of test cases and may generate redundant errors (within same distribution). Distribution-aware testing techniques primarily focus on generating test cases that follow the training distribution, missing out-of-distribution data that may also be valid and should be considered in the testing process. In this paper, we propose a novel distribution-guided approach for generating valid test cases with diverse distributions, which can better evaluate the model's robustness (i.e., generating hard-to-detect errors) and enhance the model's robustness (i.e., enriching training data). Unlike existing testing techniques that optimize individual test cases, DistXplore optimizes test suites that represent specific distributions. To evaluate and enhance the model's robustness, we design two metrics: distribution difference, which maximizes the similarity in distribution between two different classes of data to generate hard-to-detect errors, and distribution diversity, which increases the distribution diversity of generated test cases to enhance the model's robustness. To evaluate the effectiveness of DistXplore in model evaluation and enhancement, we compare DistXplore with 14 state-of-the-art baselines on 10 models across 4 datasets. The evaluation results show that DistXplore not only detects a larger number of errors (e.g., 2×+ on average), but also achieves a higher improvement in empirical robustness (e.g., 5.2% more accuracy improvement than the baselines on average). |