ESEC/FSE 2022 – Author Index |
Abdalkareem, Rabe |
ESEC/FSE '22: "23 Shades of Self-Admitted ..."
23 Shades of Self-Admitted Technical Debt: An Empirical Study on Machine Learning Software
David OBrien, Sumon Biswas, Sayem Imtiaz, Rabe Abdalkareem, Emad Shihab, and Hridesh Rajan (Iowa State University, USA; Carleton University, Canada; Concordia University, Canada) In software development, the term “technical debt” (TD) is used to characterize short-term solutions and workarounds implemented in source code which may incur a long-term cost. Technical debt has a variety of forms and can thus affect multiple qualities of software including but not limited to its legibility, performance, and structure. In this paper, we have conducted a comprehensive study on the technical debts in machine learning (ML) based software. TD can appear differently in ML software by infecting the data that ML models are trained on, thus affecting the functional behavior of ML systems. The growing inclusion of ML components in modern software systems has introduced a new set of TDs. Does ML software have similar TDs to traditional software? If not, what are the new types of ML specific TDs? In which ML pipeline stages do these debts appear? Do these debts differ in ML tools and applications and when they get removed? Currently, we do not know the state of the ML TDs in the wild. To address these questions, we mined 68,820 self-admitted technical debts (SATD) from all the revisions of a curated dataset consisting of 2,641 popular ML repositories from GitHub, along with their introduction and removal. By applying an open-coding scheme and following upon prior works, we provide a comprehensive taxonomy of ML SATDs. Our study analyzes ML SATD type organizations, their frequencies within stages of ML software, the differences between ML SATDs in applications and tools, and quantifies the removal of ML SATDs. The findings discovered suggest implications for ML developers and researchers to create maintainable ML systems.
@InProceedings{ESEC/FSE22p734, author = {David OBrien and Sumon Biswas and Sayem Imtiaz and Rabe Abdalkareem and Emad Shihab and Hridesh Rajan}, title = {23 Shades of Self-Admitted Technical Debt: An Empirical Study on Machine Learning Software}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {734--746}, doi = {10.1145/3540250.3549088}, year = {2022}, } Publisher's Version Artifacts Functional |
|
Abdul Basit, Hamid |
ESEC/FSE '22-DEMO: "Context-Aware Code Recommendation ..."
Context-Aware Code Recommendation in Intellij IDEA
Shamsa Abid, Hamid Abdul Basit, and Shafay Shamail (Singapore Management University, Singapore; Prince Sultan University, Saudi Arabia; Lahore University of Management Sciences, Pakistan) Developers spend a lot of time online, searching for code to help them implement their desired features. While code recommenders help improve developers’ productivity, there is currently no support for context-aware code recommendation for opportunistic code reuse on-the-go. Typical code recommendation systems provide recommendations against a search query, whereas a code recommender that supports opportunistic reuse can recommend related code snippets that represent features that the developer may want to implement next. In this paper, we present a novel Context-aware Feature-driven API usage-based Code Recommender (CA-FACER) tool, which is an Intellij IDEA plugin that leverages a developer’s development context to recommend related code snippets. We consider the methods having API usages in a developer’s active project as part of the development context. Our approach uses contextual data from a developer’s active project to find similar projects and recommends code from popular features of those projects. The popular features are identified as frequently occurring API usage based Method Clone Classes. From our experimental evaluation on 120 Android Java projects from GitHub, we observe a 46% improvement of precision using our proposed context-aware approach over a baseline system. Our technique recommends related code examples with an average precision (P@5) of 94% and 83% and a success rate of 90% and 95% for initial and evolved development stages respectively. A video demonstration of our tool is available at https://youtu.be/UjuM8WRc318. 
@InProceedings{ESEC/FSE22p1647, author = {Shamsa Abid and Hamid Abdul Basit and Shafay Shamail}, title = {Context-Aware Code Recommendation in Intellij IDEA}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1647--1651}, doi = {10.1145/3540250.3558937}, year = {2022}, } Publisher's Version Video |
|
Abid, Shamsa |
ESEC/FSE '22-DEMO: "Context-Aware Code Recommendation ..."
Context-Aware Code Recommendation in Intellij IDEA
Shamsa Abid, Hamid Abdul Basit, and Shafay Shamail (Singapore Management University, Singapore; Prince Sultan University, Saudi Arabia; Lahore University of Management Sciences, Pakistan) Developers spend a lot of time online, searching for code to help them implement their desired features. While code recommenders help improve developers’ productivity, there is currently no support for context-aware code recommendation for opportunistic code reuse on-the-go. Typical code recommendation systems provide recommendations against a search query, whereas a code recommender that supports opportunistic reuse can recommend related code snippets that represent features that the developer may want to implement next. In this paper, we present a novel Context-aware Feature-driven API usage-based Code Recommender (CA-FACER) tool, which is an Intellij IDEA plugin that leverages a developer’s development context to recommend related code snippets. We consider the methods having API usages in a developer’s active project as part of the development context. Our approach uses contextual data from a developer’s active project to find similar projects and recommends code from popular features of those projects. The popular features are identified as frequently occurring API usage based Method Clone Classes. From our experimental evaluation on 120 Android Java projects from GitHub, we observe a 46% improvement of precision using our proposed context-aware approach over a baseline system. Our technique recommends related code examples with an average precision (P@5) of 94% and 83% and a success rate of 90% and 95% for initial and evolved development stages respectively. A video demonstration of our tool is available at https://youtu.be/UjuM8WRc318. 
@InProceedings{ESEC/FSE22p1647, author = {Shamsa Abid and Hamid Abdul Basit and Shafay Shamail}, title = {Context-Aware Code Recommendation in Intellij IDEA}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1647--1651}, doi = {10.1145/3540250.3558937}, year = {2022}, } Publisher's Version Video |
|
Abreu, Rui |
ESEC/FSE '22-IND: "Leveraging Test Plan Quality ..."
Leveraging Test Plan Quality to Improve Code Review Efficacy
Lawrence Chen, Rui Abreu, Tobi Akomolede, Peter C. Rigby, Satish Chandra, and Nachiappan Nagappan (Meta Platforms, USA; Concordia University, Canada) In modern code reviews, many artifacts play roles in knowledge-sharing and documentation: summaries, test plans, and comments, etc. Improving developer tools and facilitating better code reviews require an understanding of the quality of pull requests and their artifacts. This is difficult to measure, however, because they are often free-form natural language and unstructured text data. In this paper, we focus on measuring the quality of test plans at Meta. Test plans are used as a communication mechanism between the author of a pull request and its reviewers, serving as walkthroughs to help confirm that the changed code is behaving as expected. We collected developer opinions on over 650 test plans from more than 500 Meta developers, then introduced a transformer-based model to leverage the success of natural language processing (NLP) techniques in the code review domain. In our study, we show that the learned model is able to capture the sentiment of developers and reflect a correlation of test plan quality with review engagement and reversions: compared to a decision tree model, our proposed transformer-based model achieves a 7% higher F1-score. Finally, we present a case study of how such a metric may be useful in experiments to inform improvements in developer tools and experiences.
@InProceedings{ESEC/FSE22p1320, author = {Lawrence Chen and Rui Abreu and Tobi Akomolede and Peter C. Rigby and Satish Chandra and Nachiappan Nagappan}, title = {Leveraging Test Plan Quality to Improve Code Review Efficacy}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1320--1330}, doi = {10.1145/3540250.3558952}, year = {2022}, } Publisher's Version |
|
Abualhaija, Sallam |
ESEC/FSE '22-DEMO: "WikiDoMiner: Wikipedia Domain-Specific ..."
WikiDoMiner: Wikipedia Domain-Specific Miner
Saad Ezzini, Sallam Abualhaija, and Mehrdad Sabetzadeh (University of Luxembourg, Luxembourg; University of Ottawa, Canada) We introduce WikiDoMiner – a tool for automatically generating domain-specific corpora by crawling Wikipedia. WikiDoMiner helps requirements engineers create an external knowledge resource that is specific to the underlying domain of a given requirements specification (RS). Being able to build such a resource is important since domain-specific datasets are scarce. WikiDoMiner generates a corpus by first extracting a set of domain-specific keywords from a given RS, and then querying Wikipedia for these keywords. The output of WikiDoMiner is a set of Wikipedia articles relevant to the domain of the input RS. Mining Wikipedia for domain-specific knowledge can be beneficial for multiple requirements engineering tasks, e.g., ambiguity handling, requirements classification, and question answering. WikiDoMiner is publicly available on Zenodo under an open-source license (https://doi.org/10.5281/zenodo.6672682).
@InProceedings{ESEC/FSE22p1706, author = {Saad Ezzini and Sallam Abualhaija and Mehrdad Sabetzadeh}, title = {WikiDoMiner: Wikipedia Domain-Specific Miner}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1706--1710}, doi = {10.1145/3540250.3558916}, year = {2022}, } Publisher's Version
ESEC/FSE '22-DEMO: "COREQQA: A COmpliance REQuirements ..."
COREQQA: A COmpliance REQuirements Understanding using Question Answering Tool
Sallam Abualhaija, Chetan Arora, and Lionel C. Briand (University of Luxembourg, Luxembourg; Deakin University, Australia; University of Ottawa, Canada) We introduce COREQQA, a tool for assisting requirements engineers in acquiring a better understanding of compliance requirements by means of automated Question Answering. Extracting compliance-related requirements by manually navigating through a legal document is both time-consuming and error-prone. COREQQA enables requirements engineers to pose questions in natural language about a compliance-related topic given some legal document, e.g., asking about data breach. The tool then automatically navigates through the legal document and returns to the requirements engineer a list of text passages containing the possible answers to the input question. For better readability, the tool also highlights the likely answers in these passages. The engineer can then use this output for specifying compliance requirements. COREQQA is developed using advanced large-scale language models from BERT’s family. COREQQA has been evaluated on four legal documents. The results of this evaluation are briefly presented in the paper. The tool is publicly available on Zenodo (https://doi.org/10.5281/zenodo.6653514).
@InProceedings{ESEC/FSE22p1682, author = {Sallam Abualhaija and Chetan Arora and Lionel C. Briand}, title = {COREQQA: A COmpliance REQuirements Understanding using Question Answering Tool}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1682--1686}, doi = {10.1145/3540250.3558926}, year = {2022}, } Publisher's Version
ESEC/FSE '22-DEMO: "TAPHSIR: Towards AnaPHoric ..."
TAPHSIR: Towards AnaPHoric Ambiguity Detection and ReSolution In Requirements
Saad Ezzini, Sallam Abualhaija, Chetan Arora, and Mehrdad Sabetzadeh (University of Luxembourg, Luxembourg; Deakin University, Australia; University of Ottawa, Canada) We introduce TAPHSIR – a tool for anaphoric ambiguity detection and anaphora resolution in requirements. TAPHSIR facilitates reviewing the use of pronouns in a requirements specification and revising those pronouns that can lead to misunderstandings during the development process. To this end, TAPHSIR detects the requirements which have potential anaphoric ambiguity and further attempts to interpret anaphora occurrences automatically. TAPHSIR employs a hybrid solution composed of an ambiguity detection solution based on machine learning and an anaphora resolution solution based on a variant of the BERT language model. Given a requirements specification, TAPHSIR decides for each pronoun occurrence in the specification whether the pronoun is ambiguous or unambiguous, and further provides an automatic interpretation for the pronoun. The output generated by TAPHSIR can be easily reviewed and validated by requirements engineers. TAPHSIR is publicly available on Zenodo (https://doi.org/10.5281/zenodo.5902117).
@InProceedings{ESEC/FSE22p1677, author = {Saad Ezzini and Sallam Abualhaija and Chetan Arora and Mehrdad Sabetzadeh}, title = {TAPHSIR: Towards AnaPHoric Ambiguity Detection and ReSolution In Requirements}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1677--1681}, doi = {10.1145/3540250.3558928}, year = {2022}, } Publisher's Version |
|
Agrawal, Apoorva |
ESEC/FSE '22-IND: "Nalanda: A Socio-technical ..."
Nalanda: A Socio-technical Graph Platform for Building Software Analytics Tools at Enterprise Scale
Chandra Maddila, Suhas Shanbhogue, Apoorva Agrawal, Thomas Zimmermann, Chetan Bansal, Nicole Forsgren, Divyanshu Agrawal, Kim Herzig, and Arie van Deursen (Microsoft Research, USA; Microsoft Research, India; Microsoft, USA; Delft University of Technology, Netherlands) Software development is information-dense knowledge work that requires collaboration with other developers and awareness of artifacts such as work items, pull requests, and file changes. With the speed of development increasing, information overload and information discovery are challenges for people developing and maintaining these systems. Finding information about similar code changes and experts is difficult for software engineers, especially when they work in large software systems or have just recently joined a project. In this paper, we build a large scale data platform named Nalanda platform to address the challenges of information overload and discovery. Nalanda contains two subsystems: (1) a large scale socio-technical graph system, named Nalanda graph system, and (2) a large scale index system, named Nalanda index system that aims at satisfying the information needs of software developers. To show the versatility of the Nalanda platform, we built two applications: (1) a software analytics application with a news feed named MyNalanda that has Daily Active Users (DAU) of 290 and Monthly Active Users (MAU) of 590, and (2) a recommendation system for related work items and pull requests that accomplished similar tasks (artifact recommendation) and a recommendation system for subject matter experts (expert recommendation), augmented by the Nalanda socio-technical graph. Initial studies of the two applications found that developers and engineering managers are favorable toward continued use of the news feed application for information discovery. 
The studies also found that developers agreed that a system like Nalanda artifact and expert recommendation application could reduce the time spent and the number of places needed to visit to find information.
@InProceedings{ESEC/FSE22p1246, author = {Chandra Maddila and Suhas Shanbhogue and Apoorva Agrawal and Thomas Zimmermann and Chetan Bansal and Nicole Forsgren and Divyanshu Agrawal and Kim Herzig and Arie van Deursen}, title = {Nalanda: A Socio-technical Graph Platform for Building Software Analytics Tools at Enterprise Scale}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1246--1256}, doi = {10.1145/3540250.3558949}, year = {2022}, } Publisher's Version Info |
|
Agrawal, Divyanshu |
ESEC/FSE '22-IND: "Nalanda: A Socio-technical ..."
Nalanda: A Socio-technical Graph Platform for Building Software Analytics Tools at Enterprise Scale
Chandra Maddila, Suhas Shanbhogue, Apoorva Agrawal, Thomas Zimmermann, Chetan Bansal, Nicole Forsgren, Divyanshu Agrawal, Kim Herzig, and Arie van Deursen (Microsoft Research, USA; Microsoft Research, India; Microsoft, USA; Delft University of Technology, Netherlands) Software development is information-dense knowledge work that requires collaboration with other developers and awareness of artifacts such as work items, pull requests, and file changes. With the speed of development increasing, information overload and information discovery are challenges for people developing and maintaining these systems. Finding information about similar code changes and experts is difficult for software engineers, especially when they work in large software systems or have just recently joined a project. In this paper, we build a large scale data platform named Nalanda platform to address the challenges of information overload and discovery. Nalanda contains two subsystems: (1) a large scale socio-technical graph system, named Nalanda graph system, and (2) a large scale index system, named Nalanda index system that aims at satisfying the information needs of software developers. To show the versatility of the Nalanda platform, we built two applications: (1) a software analytics application with a news feed named MyNalanda that has Daily Active Users (DAU) of 290 and Monthly Active Users (MAU) of 590, and (2) a recommendation system for related work items and pull requests that accomplished similar tasks (artifact recommendation) and a recommendation system for subject matter experts (expert recommendation), augmented by the Nalanda socio-technical graph. Initial studies of the two applications found that developers and engineering managers are favorable toward continued use of the news feed application for information discovery. 
The studies also found that developers agreed that a system like Nalanda artifact and expert recommendation application could reduce the time spent and the number of places needed to visit to find information.
@InProceedings{ESEC/FSE22p1246, author = {Chandra Maddila and Suhas Shanbhogue and Apoorva Agrawal and Thomas Zimmermann and Chetan Bansal and Nicole Forsgren and Divyanshu Agrawal and Kim Herzig and Arie van Deursen}, title = {Nalanda: A Socio-technical Graph Platform for Building Software Analytics Tools at Enterprise Scale}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1246--1256}, doi = {10.1145/3540250.3558949}, year = {2022}, } Publisher's Version Info |
|
Ahmadi, Zahra |
ESEC/FSE '22-DEMO: "MANDO-GURU: Vulnerability ..."
MANDO-GURU: Vulnerability Detection for Smart Contract Source Code by Heterogeneous Graph Embeddings
Hoang H. Nguyen, Nhat-Minh Nguyen, Hong-Phuc Doan, Zahra Ahmadi, Thanh-Nam Doan, and Lingxiao Jiang (Leibniz Universität Hannover, Germany; Singapore Management University, Singapore; Hanoi University of Science and Technology, Vietnam) Smart contracts are increasingly used with blockchain systems for high-value applications. It is highly desired to ensure the quality of smart contract source code before they are deployed. This paper proposes a new deep learning-based tool, MANDO-GURU, that aims to accurately detect vulnerabilities in smart contracts at both coarse-grained contract-level and fine-grained line-level. Using a combination of control-flow graphs and call graphs of Solidity code, we design new heterogeneous graph attention neural networks to encode more structural and potentially semantic relations among different types of nodes and edges of such graphs and use the encoded embeddings of the graphs and nodes to detect vulnerabilities. Our validation of real-world smart contract datasets shows that MANDO-GURU can significantly improve many other vulnerability detection techniques by up to 24% in terms of the F1-score at the contract level, depending on vulnerability types. It is the first learning-based tool for Ethereum smart contracts that identifies vulnerabilities at the line level and significantly improves the traditional code analysis-based techniques by up to 63.4%. Our tool is publicly available at https://github.com/MANDO-Project/ge-sc-machine. A test version is currently deployed at http://mandoguru.com, and a demo video of our tool is available at http://mandoguru.com/demo-video.
@InProceedings{ESEC/FSE22p1736, author = {Hoang H. Nguyen and Nhat-Minh Nguyen and Hong-Phuc Doan and Zahra Ahmadi and Thanh-Nam Doan and Lingxiao Jiang}, title = {MANDO-GURU: Vulnerability Detection for Smart Contract Source Code by Heterogeneous Graph Embeddings}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1736--1740}, doi = {10.1145/3540250.3558927}, year = {2022}, } Publisher's Version Video Info |
|
Ahmed, Bestoun S. |
ESEC/FSE '22-IND: "Testing of Machine Learning ..."
Testing of Machine Learning Models with Limited Samples: An Industrial Vacuum Pumping Application
Ayan Chatterjee, Bestoun S. Ahmed, Erik Hallin, and Anton Engman (Karlstad University, Sweden; Uddeholms, Sweden) There is often a scarcity of training data for machine learning (ML) classification and regression models in industrial production, especially for time-consuming or sparsely run manufacturing processes. Traditionally, a majority of the limited ground-truth data is used for training, while a handful of samples are left for testing. In that case, the number of test samples is inadequate to properly evaluate the robustness of the ML models under test (i.e., the system under test) for classification and regression. Furthermore, the output of these ML models may be inaccurate or even fail if the input data differ from the expected. This is the case for ML models used in the Electroslag Remelting (ESR) process in the refined steel industry to predict the pressure in a vacuum chamber. A vacuum pumping event that occurs once a workday generates a few hundred samples in a year of pumping for training and testing. In the absence of adequate training and test samples, this paper first presents a method to generate a fresh set of augmented samples based on vacuum pumping principles. Based on the generated augmented samples, three test scenarios and one test oracle are presented to assess the robustness of an ML model used for production on an industrial scale. Experiments are conducted with real industrial production data obtained from Uddeholms AB steel company. The evaluations indicate that Ensemble and Neural Network are the most robust when trained on augmented data using the proposed testing strategy. The evaluation also demonstrates the proposed method's effectiveness in checking and improving ML algorithms' robustness in such situations. The work improves software testing's state-of-the-art robustness testing in similar settings. 
Finally, the paper presents an MLOps implementation of the proposed approach for real-time ML model prediction and action on the edge node and automated continuous delivery of ML software from the cloud.
@InProceedings{ESEC/FSE22p1280, author = {Ayan Chatterjee and Bestoun S. Ahmed and Erik Hallin and Anton Engman}, title = {Testing of Machine Learning Models with Limited Samples: An Industrial Vacuum Pumping Application}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1280--1290}, doi = {10.1145/3540250.3558943}, year = {2022}, } Publisher's Version |
|
Ahmed, Iftekhar |
ESEC/FSE '22: "A Case Study of Implicit Mentoring, ..."
A Case Study of Implicit Mentoring, Its Prevalence, and Impact in Apache
Zixuan Feng, Amreeta Chatterjee, Anita Sarma, and Iftekhar Ahmed (Oregon State University, USA; University of California at Irvine, USA) Mentoring is traditionally viewed as a dyadic, top-down apprenticeship. This perspective, however, overlooks other forms of informal mentoring taking place in everyday activities in which developers invest time and effort. Here, we investigate informal mentoring taking place in Open Source Software (OSS). We define a specific type of informal mentoring—implicit mentoring—situations where contributors guide others through instructions and suggestions embedded in everyday (OSS) activities. We defined implicit mentoring by first performing a review of related work on mentoring, and then through formative interviews with OSS contributors and member-checking. Next, through an empirical investigation of Pull Requests (PRs) in 37 Apache Projects, we built a classifier to extract implicit mentoring. Our analysis of 107,895 PRs shows that implicit mentoring does occur through code reviews (27.41% of all PRs included implicit mentoring) and is beneficial for both mentors and mentees. We analyzed the impact of implicit mentoring on OSS contributors by investigating their contributions and learning trajectories in their projects. Through an online survey (N=231), we then triangulated these results and identified the potential benefits of implicit mentoring from OSS contributors’ perspectives.
@InProceedings{ESEC/FSE22p797, author = {Zixuan Feng and Amreeta Chatterjee and Anita Sarma and Iftekhar Ahmed}, title = {A Case Study of Implicit Mentoring, Its Prevalence, and Impact in Apache}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {797--809}, doi = {10.1145/3540250.3549167}, year = {2022}, } Publisher's Version |
|
Ahmed, Muhammad Ejaz |
ESEC/FSE '22: "Cross-Language Android Permission ..."
Cross-Language Android Permission Specification
Chaoran Li, Xiao Chen, Ruoxi Sun, Minhui Xue, Sheng Wen, Muhammad Ejaz Ahmed, Seyit Camtepe, and Yang Xiang (Swinburne University of Technology, Australia; Monash University, Australia; University of Adelaide, Australia; CSIRO’s Data61, Australia) The Android system manages access to sensitive APIs by permission enforcement. An application (app) must declare proper permissions before invoking specific Android APIs. However, there is no official documentation providing the complete list of permission-protected APIs and the corresponding permissions to date. Researchers have spent significant efforts extracting such API protection mapping from the Android API framework, which leverages static code analysis to determine if specific permissions are required before accessing an API. Nevertheless, none of them has attempted to analyze the protection mapping in the native library (i.e., code written in C and C++), an essential component of the Android framework that handles communication with the lower-level hardware, such as cameras and sensors. While the protection mapping can be utilized to detect various security vulnerabilities in Android apps, such as permission over-privilege, imprecise mapping will lead to false results in detecting such security vulnerabilities. To fill this gap, we thereby propose to construct the protection mapping involved in the native libraries of the Android framework to present a complete and accurate specification of Android API protection. We develop a prototype system, named NatiDroid, to facilitate the cross-language static analysis and compare its performance with two state-of-the-practice tools, termed Axplorer and Arcade. We evaluate NatiDroid on more than 11,000 Android apps, including system apps from custom Android ROMs and third-party apps from the Google Play. 
Our NatiDroid can identify up to 464 new API-permission mappings, in contrast to the worst-case results derived from both Axplorer and Arcade, where approximately 71% of apps have at least one false positive in permission over-privilege. We have disclosed all the potential vulnerabilities detected to the stakeholders.
@InProceedings{ESEC/FSE22p772, author = {Chaoran Li and Xiao Chen and Ruoxi Sun and Minhui Xue and Sheng Wen and Muhammad Ejaz Ahmed and Seyit Camtepe and Yang Xiang}, title = {Cross-Language Android Permission Specification}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {772--783}, doi = {10.1145/3540250.3549142}, year = {2022}, } Publisher's Version |
|
Ahmed, Toufique |
ESEC/FSE '22: "NatGen: Generative Pre-training ..."
NatGen: Generative Pre-training by “Naturalizing” Source Code
Saikat Chakraborty, Toufique Ahmed, Yangruibo Ding, Premkumar T. Devanbu, and Baishakhi Ray (Columbia University, USA; University of California at Davis, USA) Pre-trained Generative Language models (e.g., PLBART, CodeT5, SPT-Code) for source code yielded strong results on several tasks in the past few years, including code generation and translation. These models have adopted varying pre-training objectives to learn statistics of code construction from very large-scale corpora in a self-supervised fashion; the success of pre-trained models largely hinges on these pre-training objectives. This paper proposes a new pre-training objective, “Naturalizing” of source code, exploiting code’s bimodal, dual-channel (formal & natural channels) nature. Unlike natural language, code’s bimodal, dual-channel nature allows us to generate semantically equivalent code at scale. We introduce six classes of semantic preserving transformations to introduce unnatural forms of code, and then force our model to produce more natural original programs written by developers. Learning to generate equivalent, but more natural code, at scale, over large corpora of open-source code, without explicit manual supervision, helps the model learn to both ingest & generate code. We fine-tune our model in three generative Software Engineering tasks: code generation, code translation, and code refinement with limited human-curated labeled data and achieve state-of-the-art performance rivaling CodeT5. We show that our pre-trained model is especially competitive at zero-shot and few-shot learning, and better at learning code properties (e.g., syntax, data flow).
@InProceedings{ESEC/FSE22p18, author = {Saikat Chakraborty and Toufique Ahmed and Yangruibo Ding and Premkumar T. Devanbu and Baishakhi Ray}, title = {NatGen: Generative Pre-training by “Naturalizing” Source Code}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {18--30}, doi = {10.1145/3540250.3549162}, year = {2022}, } Publisher's Version |
|
Akomolede, Tobi |
ESEC/FSE '22-IND: "Leveraging Test Plan Quality ..."
Leveraging Test Plan Quality to Improve Code Review Efficacy
Lawrence Chen, Rui Abreu, Tobi Akomolede, Peter C. Rigby, Satish Chandra, and Nachiappan Nagappan (Meta Platforms, USA; Concordia University, Canada) In modern code reviews, many artifacts play roles in knowledge-sharing and documentation: summaries, test plans, and comments, etc. Improving developer tools and facilitating better code reviews require an understanding of the quality of pull requests and their artifacts. This is difficult to measure, however, because they are often free-form natural language and unstructured text data. In this paper, we focus on measuring the quality of test plans at Meta. Test plans are used as a communication mechanism between the author of a pull request and its reviewers, serving as walkthroughs to help confirm that the changed code is behaving as expected. We collected developer opinions on over 650 test plans from more than 500 Meta developers, then introduced a transformer-based model to leverage the success of natural language processing (NLP) techniques in the code review domain. In our study, we show that the learned model is able to capture the sentiment of developers and reflect a correlation of test plan quality with review engagement and reversions: compared to a decision tree model, our proposed transformer-based model achieves a 7% higher F1-score. Finally, we present a case study of how such a metric may be useful in experiments to inform improvements in developer tools and experiences.
@InProceedings{ESEC/FSE22p1320, author = {Lawrence Chen and Rui Abreu and Tobi Akomolede and Peter C. Rigby and Satish Chandra and Nachiappan Nagappan}, title = {Leveraging Test Plan Quality to Improve Code Review Efficacy}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1320--1330}, doi = {10.1145/3540250.3558952}, year = {2022}, } Publisher's Version |
|
Aleti, Aldeida |
ESEC/FSE '22: "CommentFinder: A Simpler, ..."
CommentFinder: A Simpler, Faster, More Accurate Code Review Comments Recommendation
Yang Hong, Chakkrit Tantithamthavorn, Patanamon Thongtanunam, and Aldeida Aleti (Monash University, Australia; University of Melbourne, Australia) Code review is an effective quality assurance practice, but can be labor-intensive since developers have to manually review the code and provide written feedback. Recently, a Deep Learning (DL)-based approach was introduced to automatically recommend code review comments based on changed methods. While the approach showed promising results, it requires expensive computational resources and time, which limits its use in practice. To address this limitation, we propose CommentFinder – a retrieval-based approach to recommend code review comments. Through an empirical evaluation of 151,019 changed methods, we evaluate the effectiveness and efficiency of CommentFinder against the state-of-the-art approach. We find that when recommending the best-1 review comment candidate, our CommentFinder is 32% better than prior work in recommending the correct code review comment. In addition, CommentFinder is 49 times faster than the prior work. These findings highlight that our CommentFinder could help reviewers reduce manual effort by recommending code review comments, while requiring less computational time. @InProceedings{ESEC/FSE22p507, author = {Yang Hong and Chakkrit Tantithamthavorn and Patanamon Thongtanunam and Aldeida Aleti}, title = {CommentFinder: A Simpler, Faster, More Accurate Code Review Comments Recommendation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {507--519}, doi = {10.1145/3540250.3549119}, year = {2022}, } Publisher's Version |
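To make the idea of a retrieval-based recommender concrete, here is a minimal Python sketch. It is not CommentFinder's implementation: the token-count vectors, the cosine ranking, and the names `tokenize`, `cosine`, `recommend`, and `HISTORY` are all assumptions chosen for illustration. It looks up the most similar previously reviewed method and reuses the comment that was attached to it.

```python
import math
import re
from collections import Counter


def tokenize(code: str) -> Counter:
    """Split source text into identifier and number tokens and count them."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+", code))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(changed_method: str, history: list[tuple[str, str]], k: int = 1) -> list[str]:
    """Return the comments attached to the k most similar past methods."""
    query = tokenize(changed_method)
    ranked = sorted(history, key=lambda pair: cosine(query, tokenize(pair[0])), reverse=True)
    return [comment for _, comment in ranked[:k]]


# Hypothetical review history: (changed method, reviewer comment) pairs.
HISTORY = [
    ("def read(path): f = open(path); return f.read()",
     "Use a context manager so the file is closed."),
    ("def total(xs): s = 0\n for x in xs: s += x\n return s",
     "Consider the built-in sum()."),
]

print(recommend("def load(path): f = open(path); data = f.read(); return data", HISTORY))
```

Because nothing is trained, the cost per query is one similarity computation per stored method, which is the kind of trade-off a retrieval approach makes against a learned generator.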
|
Ali, Shaukat |
ESEC/FSE '22-IND: "Are Elevator Software Robust ..."
Are Elevator Software Robust against Uncertainties? Results and Experiences from an Industrial Case Study
Liping Han, Tao Yue, Shaukat Ali, Aitor Arrieta, and Maite Arratibel (Nanjing University of Aeronautics and Astronautics, China; Simula Research Laboratory, Norway; Mondragon University, Spain; Orona, Spain) Industrial elevator systems are complex Cyber-Physical Systems operating in uncertain environments and experiencing uncertain passenger behaviors, hardware delays, and software errors. Identifying, understanding, and classifying such uncertainties are essential to enable system designers to reason about uncertainties and subsequently develop solutions for empowering elevator systems to deal with uncertainties systematically. To this end, we present a method, called RuCynefin, based on the Cynefin framework to classify uncertainties in industrial elevator systems from our industrial partner (Orona, Spain), results of which can then be used for assessing their robustness. RuCynefin is equipped with a novel classification algorithm to identify the Cynefin contexts for a variety of uncertainties in industrial elevator systems, and a novel metric for measuring the robustness using the uncertainty classification. We evaluated RuCynefin with an industrial case study of 90 dispatchers from Orona to assess their robustness against uncertainties. Results show that RuCynefin could effectively identify several situations for which certain dispatchers were not robust. Specifically, 93% of such versions showed some degree of low robustness against uncertainties. We also provide insights on the potential practical usages of RuCynefin, which are useful for practitioners in this field. @InProceedings{ESEC/FSE22p1331, author = {Liping Han and Tao Yue and Shaukat Ali and Aitor Arrieta and Maite Arratibel}, title = {Are Elevator Software Robust against Uncertainties? 
Results and Experiences from an Industrial Case Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1331--1342}, doi = {10.1145/3540250.3558955}, year = {2022}, } Publisher's Version ESEC/FSE '22-IND: "Uncertainty-Aware Transfer ..." Uncertainty-Aware Transfer Learning to Evolve Digital Twins for Industrial Elevators Qinghua Xu, Shaukat Ali, Tao Yue, and Maite Arratibel (Simula Research Laboratory, Norway; University of Oslo, Norway; Orona, Spain) Digital twins are increasingly developed to support the development, operation, and maintenance of cyber-physical systems such as industrial elevators. However, industrial elevators continuously evolve due to changes in physical installations, introducing new software features, updating existing ones, and making changes due to regulations (e.g., enforcing restricted elevator capacity due to COVID-19), etc. Thus, digital twin functionalities (often built on neural network-based models) need to evolve themselves constantly to be synchronized with the industrial elevators. Such an evolution is preferred to be automated, as manual evolution is time-consuming and error-prone. Moreover, collecting sufficient data to re-train neural network models of digital twins could be expensive or even infeasible. To this end, we propose unceRtaInty-aware tranSfer lEarning enriched Digital Twins LATTICE, a transfer learning based approach capable of transferring knowledge about the waiting time prediction capability of a digital twin of an industrial elevator across different scenarios. LATTICE also leverages uncertainty quantification to further improve its effectiveness. To evaluate LATTICE, we conducted experiments with 10 versions of an elevator dispatching software from Orona, Spain, which are deployed in a Software in the Loop (SiL) environment. Experiment results show that LATTICE, on average, improves the Mean Squared Error by 13.131% and the utilization of uncertainty quantification further improves it by 2.71%. 
@InProceedings{ESEC/FSE22p1257, author = {Qinghua Xu and Shaukat Ali and Tao Yue and Maite Arratibel}, title = {Uncertainty-Aware Transfer Learning to Evolve Digital Twins for Industrial Elevators}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1257--1268}, doi = {10.1145/3540250.3558957}, year = {2022}, } Publisher's Version |
|
Alon, Yoav |
ESEC/FSE '22: "Using Graph Neural Networks ..."
Using Graph Neural Networks for Program Termination
Yoav Alon and Cristina David (University of Bristol, UK) Termination analyses investigate the termination behavior of programs, intending to detect nontermination, which is known to cause a variety of program bugs (e.g. hanging programs, denial-of-service vulnerabilities). Beyond formal approaches, various attempts have been made to estimate the termination behavior of programs using neural networks. However, the majority of these approaches continue to rely on formal methods to provide strong soundness guarantees and consequently suffer from similar limitations. In this paper, we move away from formal methods and embrace the stochastic nature of machine learning models. Instead of aiming for rigorous guarantees that can be interpreted by solvers, our objective is to provide an estimation of a program's termination behavior and of the likely reason for nontermination (when applicable) that a programmer can use for debugging purposes. Compared to previous approaches using neural networks for program termination, we also take advantage of the graph representation of programs by employing Graph Neural Networks. To further assist programmers in understanding and debugging nontermination bugs, we adapt the notions of attention and semantic segmentation, previously used for other application domains, to programs. Overall, we designed and implemented classifiers for program termination based on Graph Convolutional Networks and Graph Attention Networks, as well as a semantic segmentation Graph Neural Network that localizes AST nodes likely to cause nontermination. We also illustrated how the information provided by semantic segmentation can be combined with program slicing to further aid debugging. 
@InProceedings{ESEC/FSE22p910, author = {Yoav Alon and Cristina David}, title = {Using Graph Neural Networks for Program Termination}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {910--921}, doi = {10.1145/3540250.3549095}, year = {2022}, } Publisher's Version Artifacts Functional |
|
Alonso, Juan C. |
ESEC/FSE '22-SRC: "Automated Generation of Test ..."
Automated Generation of Test Oracles for RESTful APIs
Juan C. Alonso (University of Seville, Spain) Test case generation tools for RESTful APIs have proliferated in recent years. However, despite their promising results, they all share the same limitation: they can only detect crashes (i.e., server errors) and disconformities with the API specification. In this paper, we present a technique for the automated generation of test oracles for RESTful APIs through the detection of invariants. In practice, our approach aims to learn the expected properties of the output by analysing previous API requests and their corresponding responses. For this, we extended the popular tool Daikon for dynamic detection of likely invariants. A preliminary evaluation conducted on a set of 8 operations from 6 industrial APIs reveals a total precision of 66.5% (reaching 100% in 2 operations). Moreover, our approach revealed 6 reproducible bugs in APIs with millions of users: Amadeus, GitHub and OMDb. @InProceedings{ESEC/FSE22p1808, author = {Juan C. Alonso}, title = {Automated Generation of Test Oracles for RESTful APIs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1808--1810}, doi = {10.1145/3540250.3559080}, year = {2022}, } Publisher's Version |
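The invariant-based oracle idea can be sketched in a few lines of Python. This is an illustration only and does not use Daikon or reproduce the paper's tool: the candidate properties, the response fields (`total`, `limit`, `items`), and the function names are hypothetical. Candidates that no observed response falsifies are kept as likely invariants, and those are then checked against new responses.

```python
from typing import Any, Callable

Response = dict[str, Any]

# Hypothetical candidate properties over a search-style API response.
CANDIDATES: dict[str, Callable[[Response], bool]] = {
    "total >= 0": lambda r: r["total"] >= 0,
    "len(items) <= limit": lambda r: len(r["items"]) <= r["limit"],
    "len(items) == total": lambda r: len(r["items"]) == r["total"],
}


def likely_invariants(responses: list[Response]) -> list[str]:
    """Keep the candidates that no observed response falsifies."""
    return [name for name, holds in CANDIDATES.items()
            if all(holds(r) for r in responses)]


def check(response: Response, invariants: list[str]) -> list[str]:
    """Use the surviving invariants as an oracle: return the violated ones."""
    return [name for name in invariants if not CANDIDATES[name](response)]


# Assumed sample of previously recorded responses.
observed = [
    {"total": 12, "limit": 5, "items": [1, 2, 3, 4, 5]},
    {"total": 2, "limit": 5, "items": [1, 2]},
]
learned = likely_invariants(observed)
print(learned)

# A new response that returns more items than the requested limit.
print(check({"total": 9, "limit": 3, "items": [1, 2, 3, 4]}, learned))
```

An invariant learned this way is only "likely": a small or unrepresentative sample can let a false property survive, which is one reason precision is reported rather than assumed.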
|
Alshangiti, Moayad |
ESEC/FSE '22: "Hierarchical Bayesian Multi-kernel ..."
Hierarchical Bayesian Multi-kernel Learning for Integrated Classification and Summarization of App Reviews
Moayad Alshangiti, Weishi Shi, Eduardo Lima, Xumin Liu, and Qi Yu (University of Jeddah, Saudi Arabia; Rochester Institute of Technology, USA) App stores enable users to share their experiences directly with the developers in the form of app reviews. Recent studies have shown that the feedback received from users is a valuable source of information for requirements extraction, which encourages app developers to leverage the reviews for app update and maintenance purposes. Follow-up studies proposed automated techniques to help developers filter the large volume of daily and noisy reviews and/or summarize their content. However, all previous studies approached the app reviews classification and summarization as separate tasks, which complicated the process and introduced unnecessary overhead. Moreover, none of those approaches explored the potential of utilizing the hierarchical relationships that exist between the labels of app reviews for the purpose of building a more accurate model. In this work, we propose Hierarchical Multi-Kernel Relevance Vector Machines (HMK-RVM), a Bayesian multi-kernel technique that integrates app review classification and summarization using a unified model. Moreover, it can provide insights into the learned patterns and underlying data for easier model interpretation. We evaluated our proposed approach on two real-world datasets and showed that in addition to the gained insights, the model produces equal or better results than the state of the art. @InProceedings{ESEC/FSE22p558, author = {Moayad Alshangiti and Weishi Shi and Eduardo Lima and Xumin Liu and Qi Yu}, title = {Hierarchical Bayesian Multi-kernel Learning for Integrated Classification and Summarization of App Reviews}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {558--569}, doi = {10.1145/3540250.3549174}, year = {2022}, } Publisher's Version |
|
Alshayban, Abdulaziz |
ESEC/FSE '22: "AccessiText: Automated Detection ..."
AccessiText: Automated Detection of Text Accessibility Issues in Android Apps
Abdulaziz Alshayban and Sam Malek (University of California at Irvine, USA) For 15% of the world population with disabilities, accessibility is arguably the most critical software quality attribute. The growing reliance of users with disability on mobile apps to complete their day-to-day tasks further stresses the need for accessible software. Mobile operating systems, such as iOS and Android, provide various integrated assistive services to help individuals with disabilities perform tasks that could otherwise be difficult or not possible. However, for these assistive services to work correctly, developers have to support them in their app by following a set of best practices and accessibility guidelines. Text Scaling Assistive Service (TSAS) is utilized by people with low vision, to increase the text size and make apps accessible to them. However, the use of TSAS with incompatible apps can result in unexpected behavior introducing accessibility barriers to users. This paper presents AccessiText, an automated testing technique for text accessibility issues arising from incompatibility between apps and TSAS. As a first step, we identify five different types of text accessibility issues by analyzing more than 600 candidate issues reported by users in (i) app reviews for Android and iOS, and (ii) Twitter data collected from public Twitter accounts. To automatically detect such issues, AccessiText utilizes the UI screenshots and various metadata information extracted using dynamic analysis, and then applies various heuristics informed by the different types of text accessibility issues identified earlier. Evaluation of AccessiText on 30 real-world Android apps corroborates its effectiveness by achieving 88.27% precision and 95.76% recall on average in detecting text accessibility issues. 
@InProceedings{ESEC/FSE22p984, author = {Abdulaziz Alshayban and Sam Malek}, title = {AccessiText: Automated Detection of Text Accessibility Issues in Android Apps}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {984--995}, doi = {10.1145/3540250.3549118}, year = {2022}, } Publisher's Version |
|
Amiri, Mohammad Javad |
ESEC/FSE '22: "Declarative Smart Contracts ..."
Declarative Smart Contracts
Haoxian Chen, Gerald Whitters, Mohammad Javad Amiri, Yuepeng Wang, and Boon Thau Loo (University of Pennsylvania, USA; Simon Fraser University, Canada) This paper presents DeCon, a declarative programming language for implementing smart contracts and specifying contract-level properties. Driven by the observation that smart contract operations and contract-level properties can be naturally expressed as relational constraints, DeCon models each smart contract as a set of relational tables that store transaction records. This relational representation of smart contracts enables convenient specification of contract properties, facilitates run-time monitoring of potential property violations, and brings clarity to contract debugging via data provenance. Specifically, a DeCon program consists of a set of declarative rules and violation query rules over the relational representation, describing the smart contract implementation and contract-level properties, respectively. We have developed a tool that can compile DeCon programs into executable Solidity programs, with instrumentation for run-time property monitoring. Our case studies demonstrate that DeCon can implement realistic smart contracts such as ERC20 and ERC721 digital tokens. Our evaluation results reveal the marginal overhead of DeCon compared to the open-source reference implementation, incurring 14% median gas overhead for execution, and another 16% median gas overhead for run-time verification. @InProceedings{ESEC/FSE22p281, author = {Haoxian Chen and Gerald Whitters and Mohammad Javad Amiri and Yuepeng Wang and Boon Thau Loo}, title = {Declarative Smart Contracts}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {281--293}, doi = {10.1145/3540250.3549121}, year = {2022}, } Publisher's Version Artifacts Functional |
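The relational view of a contract described above can be illustrated with a short Python sketch. This is not DeCon syntax and not the authors' compiler output: the `Mint` and `Transfer` record types, the derived `balances` view, and the negative-balance violation query are assumptions made for illustration of a token contract kept as tables of transaction records.

```python
from collections import defaultdict
from typing import NamedTuple


class Mint(NamedTuple):
    """One row of a hypothetical table of mint transactions."""
    account: str
    amount: int


class Transfer(NamedTuple):
    """One row of a hypothetical table of transfer transactions."""
    sender: str
    recipient: str
    amount: int


def balances(mints: list[Mint], transfers: list[Transfer]) -> dict[str, int]:
    """Derive each balance as a view over the transaction tables."""
    totals: dict[str, int] = defaultdict(int)
    for m in mints:
        totals[m.account] += m.amount
    for t in transfers:
        totals[t.sender] -= t.amount
        totals[t.recipient] += t.amount
    return dict(totals)


def negative_balance_violations(mints: list[Mint], transfers: list[Transfer]) -> list[str]:
    """Violation query: accounts whose derived balance is below zero."""
    return sorted(a for a, b in balances(mints, transfers).items() if b < 0)


mints = [Mint("alice", 100)]
transfers = [Transfer("alice", "bob", 60), Transfer("bob", "carol", 80)]

print(balances(mints, transfers))
print(negative_balance_violations(mints, transfers))
```

Keeping the raw records rather than only the current balances is what makes the provenance question answerable: the offending row (here the second transfer) can be traced from the violated property back to the transaction that caused it.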
|
Amusuo, Paschal C. |
ESEC/FSE '22-IVR: "Reflections on Software Failure ..."
Reflections on Software Failure Analysis
Paschal C. Amusuo, Aishwarya Sharma, Siddharth R. Rao, Abbey Vincent, and James C. Davis (Purdue University, USA) Failure studies are important in revealing the root causes, behaviors, and life cycle of defects in software systems. These studies either focus on understanding the characteristics of defects in specific classes of systems, or the characteristics of a specific type of defect in the systems it manifests in. Failure studies have influenced various software engineering research directions, especially in the area of software evolution, defect detection, and program repair. In this paper, we reflect on the conduct of failure studies in software engineering. We reviewed a sample of 52 failure study papers. We identified several recurring problems in these studies, some of which hinder the ability of software engineering community to trust or replicate the results. Based on our findings, we suggest future research directions, including identifying and analyzing failure causal chains, standardizing the conduct of failure studies, and tool support for faster defect analysis. @InProceedings{ESEC/FSE22p1615, author = {Paschal C. Amusuo and Aishwarya Sharma and Siddharth R. Rao and Abbey Vincent and James C. Davis}, title = {Reflections on Software Failure Analysis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1615--1620}, doi = {10.1145/3540250.3560879}, year = {2022}, } Publisher's Version |
|
An, Seungmin |
ESEC/FSE '22: "Automatically Deriving JavaScript ..."
Automatically Deriving JavaScript Static Analyzers from Specifications using Meta-level Static Analysis
Jihyeok Park, Seungmin An, and Sukyoung Ryu (Oracle, Australia; KAIST, South Korea) JavaScript is one of the most dominant programming languages. However, despite its popularity, it is a challenging task to correctly understand the behaviors of JavaScript programs because of their highly dynamic nature. Researchers have developed various static analyzers that strive to conform to ECMA-262, the standard specification of JavaScript. Unfortunately, all the existing JavaScript static analyzers require manual updates for new language features. This problem has become more critical since 2015 because the JavaScript language itself rapidly evolves with a yearly release cadence and open development process. In this paper, we present JSAVER, the first tool that automatically derives JavaScript static analyzers from language specifications. The main idea of our approach is to extract a definitional interpreter from ECMA-262 and perform a meta-level static analysis with the extracted interpreter. A meta-level static analysis is a novel technique that indirectly analyzes programs by analyzing a definitional interpreter with the programs. We also describe how to indirectly configure abstract domains and analysis sensitivities in a meta-level static analysis. For evaluation, we derived a static analyzer from the latest ECMA-262 (ES12, 2021) using JSAVER. The derived analyzer soundly analyzed all applicable 18,556 official conformance tests with 99.0% of precision in 590 ms on average. In addition, we demonstrate the configurability and adaptability of JSAVER with several case studies. @InProceedings{ESEC/FSE22p1022, author = {Jihyeok Park and Seungmin An and Sukyoung Ryu}, title = {Automatically Deriving JavaScript Static Analyzers from Specifications using Meta-level Static Analysis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1022--1034}, doi = {10.1145/3540250.3549097}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Andrews, Clayton |
ESEC/FSE '22-IND: "Workgraph: Personal Focus ..."
Workgraph: Personal Focus vs. Interruption for Engineers at Meta
Yifen Chen, Peter C. Rigby, Yulin Chen, Kun Jiang, Nader Dehghani, Qianying Huang, Peter Cottle, Clayton Andrews, Noah Lee, and Nachiappan Nagappan (Meta, USA; Concordia University, Canada) All engineers dislike interruptions because they take away from the deep focus time needed to write complex code. Our goal is to reduce unnecessary interruptions at Meta. We first describe our Workgraph platform that logs how engineers use our internal work tools at Meta. Using these anonymized logs, we create personal-focus sessions. Personal-focus sessions are defined in opposition to interruption and are the amount of time until the engineer is interrupted by, for example, a work chat message. We report descriptive statistics related to how long engineers are able to focus. We find that at Meta, engineers have a total of 14.25 hours of personal-focus time per week. These numbers are comparable with those reported by other software firms. We then create a Random Forest model to understand which factors influence the median daily personal-focus time. We find that the more time an engineer spends in the IDE the longer their focus. We also find that the more central an engineer is in the social work network, the shorter their personal-focus time. Other factors such as role and domain/pillar have little impact on personal-focus at Meta. To help engineers achieve longer blocks of personal-focus and help them stay in flow, Meta developed the AutoFocus tool that blocks work chat notifications when an engineer is working on code for 12 minutes or longer. AutoFocus allows the sender to still force a work chat message using “@notify” ensuring that urgent messages still get through, but allowing the sender to reflect on the importance of the message. In a large experiment, we find that AutoFocus increases the amount of personal-focus time by 20.27%, and it has now been rolled out widely at Meta. @InProceedings{ESEC/FSE22p1390, author = {Yifen Chen and Peter C. Rigby and Yulin Chen and Kun Jiang and Nader Dehghani and Qianying Huang and Peter Cottle and Clayton Andrews and Noah Lee and Nachiappan Nagappan}, title = {Workgraph: Personal Focus vs. Interruption for Engineers at Meta}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1390--1397}, doi = {10.1145/3540250.3558961}, year = {2022}, } Publisher's Version |
|
Apel, Sven |
ESEC/FSE '22-IND: "Sometimes You Have to Treat ..."
Sometimes You Have to Treat the Symptoms: Tackling Model Drift in an Industrial Clone-and-Own Software Product Line
Christof Tinnes, Wolfgang Rössler, Uwe Hohenstein, Torsten Kühn, Andreas Biesdorf, and Sven Apel (Siemens, Germany; Siemens Mobility, Germany; Saarland University, Germany) Many industrial software product lines use a clone-and-own approach for reuse among software products. As a result, the different products in the product line may drift apart, which implies increased efforts for tasks such as change propagation, domain analysis, and quality assurance. While many solutions have been proposed in the literature, these are often difficult to apply in a real-world setting. We study this drift of products in a concrete large-scale industrial model-driven clone-and-own software product line in the railway domain at our industry partner. For this purpose, we conducted interviews and a survey, and we investigated the models in the model history of this project. We found that increased efforts are mainly caused by large model differences and increased communication efforts. We argue that, in the short-term, treating the symptoms (i.e., handling large model differences) can help to keep efforts for software product-line engineering acceptable — instead of employing sophisticated variability management. To treat the symptoms, we employ a solution based on semantic-lifting to simplify model differences. Using the interviews and the survey, we evaluate the feasibility of variability management approaches and the semantic-lifting approach in the context of this project. @InProceedings{ESEC/FSE22p1355, author = {Christof Tinnes and Wolfgang Rössler and Uwe Hohenstein and Torsten Kühn and Andreas Biesdorf and Sven Apel}, title = {Sometimes You Have to Treat the Symptoms: Tackling Model Drift in an Industrial Clone-and-Own Software Product Line}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1355--1366}, doi = {10.1145/3540250.3558960}, year = {2022}, } Publisher's Version ESEC/FSE '22: "Correlates of Programmer Efficacy ..." 
Correlates of Programmer Efficacy and Their Link to Experience: A Combined EEG and Eye-Tracking Study Norman Peitek, Annabelle Bergum, Maurice Rekrut, Jonas Mucke, Matthias Nadig, Chris Parnin, Janet Siegmund, and Sven Apel (Saarland University, Germany; German Research Center for Artificial Intelligence, Germany; Chemnitz University of Technology, Germany; North Carolina State University, USA) Background: Despite similar education and background, programmers can exhibit vast differences in efficacy. While research has identified some potential factors, such as programming experience and domain knowledge, the effect of these factors on programmers' efficacy is not well understood. Aims: We aim at unraveling the relationship between efficacy (speed and correctness) and measures of programming experience. We further investigate the correlates of programmer efficacy in terms of reading behavior and cognitive load. Method: For this purpose, we conducted a controlled experiment with 37 participants using electroencephalography (EEG) and eye tracking. We asked participants to comprehend up to 32 Java source-code snippets and observed their eye gaze and neural correlates of cognitive load. We analyzed the correlation of participants' efficacy with popular programming experience measures. Results: We found that programmers with high efficacy read source code more targeted and with lower cognitive load. Commonly used experience levels do not predict programmer efficacy well, but self-estimation and indicators of learning eagerness are fairly accurate. Implications: The identified correlates of programmer efficacy can be used for future research and practice (e.g., hiring). Future research should also consider efficacy as a group sampling method, rather than using simple experience measures. 
@InProceedings{ESEC/FSE22p120, author = {Norman Peitek and Annabelle Bergum and Maurice Rekrut and Jonas Mucke and Matthias Nadig and Chris Parnin and Janet Siegmund and Sven Apel}, title = {Correlates of Programmer Efficacy and Their Link to Experience: A Combined EEG and Eye-Tracking Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {120--131}, doi = {10.1145/3540250.3549084}, year = {2022}, } Publisher's Version Info |
|
Arcaini, Paolo |
ESEC/FSE '22-DEMO: "JSIMutate: Understanding Performance ..."
JSIMutate: Understanding Performance Results through Mutations
Thomas Laurent, Paolo Arcaini, Catia Trubiani, and Anthony Ventresque (Lero, Ireland; University College Dublin, Ireland; National Institute of Informatics, Japan; Gran Sasso Science Institute, Italy) Understanding the performance characteristics of software systems is particularly relevant when looking at design alternatives. However, it is a very challenging problem, due to the complexity of interpreting the role and incidence of the different system elements on performance metrics of interest, such as system response time or resources utilisation. This work introduces JSIMutate, a tool that makes use of queueing network performance models and enables the analysis of mutations of a model reflecting possible design changes to support designers in identifying the model elements that contribute to improving or worsening the system's performance. @InProceedings{ESEC/FSE22p1721, author = {Thomas Laurent and Paolo Arcaini and Catia Trubiani and Anthony Ventresque}, title = {JSIMutate: Understanding Performance Results through Mutations}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1721--1725}, doi = {10.1145/3540250.3558930}, year = {2022}, } Publisher's Version |
|
Arora, Chetan |
ESEC/FSE '22-DEMO: "COREQQA: A COmpliance REQuirements ..."
COREQQA: A COmpliance REQuirements Understanding using Question Answering Tool
Sallam Abualhaija, Chetan Arora, and Lionel C. Briand (University of Luxembourg, Luxembourg; Deakin University, Australia; University of Ottawa, Canada) We introduce COREQQA, a tool for assisting requirements engineers in acquiring a better understanding of compliance requirements by means of automated Question Answering. Extracting compliance-related requirements by manually navigating through a legal document is both time-consuming and error-prone. COREQQA enables requirements engineers to pose questions in natural language about a compliance-related topic given some legal document, e.g., asking about data breach. The tool then automatically navigates through the legal document and returns to the requirements engineer a list of text passages containing the possible answers to the input question. For better readability, the tool also highlights the likely answers in these passages. The engineer can then use this output for specifying compliance requirements. COREQQA is developed using advanced large-scale language models from BERT’s family. COREQQA has been evaluated on four legal documents. The results of this evaluation are briefly presented in the paper. The tool is publicly available on Zenodo (https://doi.org/10.5281/zenodo.6653514). @InProceedings{ESEC/FSE22p1682, author = {Sallam Abualhaija and Chetan Arora and Lionel C. Briand}, title = {COREQQA: A COmpliance REQuirements Understanding using Question Answering Tool}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1682--1686}, doi = {10.1145/3540250.3558926}, year = {2022}, } Publisher's Version ESEC/FSE '22-DEMO: "TAPHSIR: Towards AnaPHoric ..." TAPHSIR: Towards AnaPHoric Ambiguity Detection and ReSolution In Requirements Saad Ezzini, Sallam Abualhaija, Chetan Arora, and Mehrdad Sabetzadeh (University of Luxembourg, Luxembourg; Deakin University, Australia; University of Ottawa, Canada) We introduce TAPHSIR – a tool for anaphoric ambiguity detection and anaphora resolution in requirements. 
TAPHSIR facilitates reviewing the use of pronouns in a requirements specification and revising those pronouns that can lead to misunderstandings during the development process. To this end, TAPHSIR detects the requirements which have potential anaphoric ambiguity and further attempts interpreting anaphora occurrences automatically. TAPHSIR employs a hybrid solution composed of an ambiguity detection solution based on machine learning and an anaphora resolution solution based on a variant of the BERT language model. Given a requirements specification, TAPHSIR decides for each pronoun occurrence in the specification whether the pronoun is ambiguous or unambiguous, and further provides an automatic interpretation for the pronoun. The output generated by TAPHSIR can be easily reviewed and validated by requirements engineers. TAPHSIR is publicly available on Zenodo (https://doi.org/10.5281/zenodo.5902117). @InProceedings{ESEC/FSE22p1677, author = {Saad Ezzini and Sallam Abualhaija and Chetan Arora and Mehrdad Sabetzadeh}, title = {TAPHSIR: Towards AnaPHoric Ambiguity Detection and ReSolution In Requirements}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1677--1681}, doi = {10.1145/3540250.3558928}, year = {2022}, } Publisher's Version |
|
Arratibel, Maite |
ESEC/FSE '22-IND: "Are Elevator Software Robust ..."
Are Elevator Software Robust against Uncertainties? Results and Experiences from an Industrial Case Study
Liping Han, Tao Yue, Shaukat Ali, Aitor Arrieta, and Maite Arratibel (Nanjing University of Aeronautics and Astronautics, China; Simula Research Laboratory, Norway; Mondragon University, Spain; Orona, Spain) Industrial elevator systems are complex Cyber-Physical Systems operating in uncertain environments and experiencing uncertain passenger behaviors, hardware delays, and software errors. Identifying, understanding, and classifying such uncertainties are essential to enable system designers to reason about uncertainties and subsequently develop solutions for empowering elevator systems to deal with uncertainties systematically. To this end, we present a method, called RuCynefin, based on the Cynefin framework to classify uncertainties in industrial elevator systems from our industrial partner (Orona, Spain), results of which can then be used for assessing their robustness. RuCynefin is equipped with a novel classification algorithm to identify the Cynefin contexts for a variety of uncertainties in industrial elevator systems, and a novel metric for measuring the robustness using the uncertainty classification. We evaluated RuCynefin with an industrial case study of 90 dispatchers from Orona to assess their robustness against uncertainties. Results show that RuCynefin could effectively identify several situations for which certain dispatchers were not robust. Specifically, 93% of such versions showed some degree of low robustness against uncertainties. We also provide insights on the potential practical usages of RuCynefin, which are useful for practitioners in this field. @InProceedings{ESEC/FSE22p1331, author = {Liping Han and Tao Yue and Shaukat Ali and Aitor Arrieta and Maite Arratibel}, title = {Are Elevator Software Robust against Uncertainties? Results and Experiences from an Industrial Case Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1331--1342}, doi = {10.1145/3540250.3558955}, year = {2022}, } Publisher's Version
ESEC/FSE '22-IND: "Uncertainty-Aware Transfer ..."
Uncertainty-Aware Transfer Learning to Evolve Digital Twins for Industrial Elevators
Qinghua Xu, Shaukat Ali, Tao Yue, and Maite Arratibel (Simula Research Laboratory, Norway; University of Oslo, Norway; Orona, Spain) Digital twins are increasingly developed to support the development, operation, and maintenance of cyber-physical systems such as industrial elevators. However, industrial elevators continuously evolve due to changes in physical installations, introducing new software features, updating existing ones, and making changes due to regulations (e.g., enforcing restricted elevator capacity due to COVID-19), etc. Thus, digital twin functionalities (often built on neural network-based models) need to evolve themselves constantly to be synchronized with the industrial elevators. Such an evolution is preferred to be automated, as manual evolution is time-consuming and error-prone. Moreover, collecting sufficient data to re-train neural network models of digital twins could be expensive or even infeasible. To this end, we propose unceRtaInty-aware tranSfer lEarning enriched Digital Twins LATTICE, a transfer learning based approach capable of transferring knowledge about the waiting time prediction capability of a digital twin of an industrial elevator across different scenarios. LATTICE also leverages uncertainty quantification to further improve its effectiveness. To evaluate LATTICE, we conducted experiments with 10 versions of an elevator dispatching software from Orona, Spain, which are deployed in a Software in the Loop (SiL) environment. Experiment results show that LATTICE, on average, improves the Mean Squared Error by 13.131% and the utilization of uncertainty quantification further improves it by 2.71%.
@InProceedings{ESEC/FSE22p1257, author = {Qinghua Xu and Shaukat Ali and Tao Yue and Maite Arratibel}, title = {Uncertainty-Aware Transfer Learning to Evolve Digital Twins for Industrial Elevators}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1257--1268}, doi = {10.1145/3540250.3558957}, year = {2022}, } Publisher's Version |
|
Arrieta, Aitor |
ESEC/FSE '22-IND: "Are Elevator Software Robust ..."
Are Elevator Software Robust against Uncertainties? Results and Experiences from an Industrial Case Study
Liping Han, Tao Yue, Shaukat Ali, Aitor Arrieta, and Maite Arratibel (Nanjing University of Aeronautics and Astronautics, China; Simula Research Laboratory, Norway; Mondragon University, Spain; Orona, Spain) Industrial elevator systems are complex Cyber-Physical Systems operating in uncertain environments and experiencing uncertain passenger behaviors, hardware delays, and software errors. Identifying, understanding, and classifying such uncertainties are essential to enable system designers to reason about uncertainties and subsequently develop solutions for empowering elevator systems to deal with uncertainties systematically. To this end, we present a method, called RuCynefin, based on the Cynefin framework to classify uncertainties in industrial elevator systems from our industrial partner (Orona, Spain), results of which can then be used for assessing their robustness. RuCynefin is equipped with a novel classification algorithm to identify the Cynefin contexts for a variety of uncertainties in industrial elevator systems, and a novel metric for measuring the robustness using the uncertainty classification. We evaluated RuCynefin with an industrial case study of 90 dispatchers from Orona to assess their robustness against uncertainties. Results show that RuCynefin could effectively identify several situations for which certain dispatchers were not robust. Specifically, 93% of such versions showed some degree of low robustness against uncertainties. We also provide insights on the potential practical usages of RuCynefin, which are useful for practitioners in this field. @InProceedings{ESEC/FSE22p1331, author = {Liping Han and Tao Yue and Shaukat Ali and Aitor Arrieta and Maite Arratibel}, title = {Are Elevator Software Robust against Uncertainties? Results and Experiences from an Industrial Case Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1331--1342}, doi = {10.1145/3540250.3558955}, year = {2022}, } Publisher's Version |
|
Arya, Deeksha M. |
ESEC/FSE '22-DOC: "This Is Your Cue! Assisting ..."
This Is Your Cue! Assisting Search Behaviour with Resource Style Properties
Deeksha M. Arya (McGill University, Canada) When learning a software technology, programmers face a large variety of resources in different styles and catering to different requirements. Although search engines are helpful to filter relevant resources, programmers are still required to manually go through a number of resources before they find one pertinent to their needs. Prior work has largely concentrated on helping programmers find the precise location of relevant information within a resource. Our work focuses on helping programmers assess the pertinence of resources to differentiate between resources. We investigated how programmers find learning resources online via a diary and interview study, and observed that programmers use certain cues to determine whether to access a resource. Based on our findings, we investigate the extent to which we can support the cue-following process via a prototype tool. Our research supports programmers’ search behaviour for software technology learning resources to inform resource creators on important factors that programmers look for during their search. @InProceedings{ESEC/FSE22p1770, author = {Deeksha M. Arya}, title = {This Is Your Cue! Assisting Search Behaviour with Resource Style Properties}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1770--1774}, doi = {10.1145/3540250.3558909}, year = {2022}, } Publisher's Version |
|
Arzt, Steven |
ESEC/FSE '22: "Security Code Smells in Apps: ..."
Security Code Smells in Apps: Are We Getting Better?
Steven Arzt (Fraunhofer SIT, Germany; ATHENE, Germany) Users increasingly rely on mobile apps for everyday tasks, including security- and privacy-sensitive tasks such as online banking, e-health, and e-government. Additionally, a wealth of sensors captures the movements and habits of the users for fitness tracking and convenience. Despite legal regulations imposing requirements and limits on the processing of privacy-sensitive data, users must still trust the app developers to apply sufficient protections. In this paper, we investigate the state of security in Android apps and how security-related code smells have evolved since the introduction of the Android operating system. With an analysis of 300 apps per year over 12 years between 2010 and 2021 from the Google Play Store, we find that the number of code scanner findings per thousand lines of code decreases over time. Still, this development is offset by the increase in code size. Apps have more and more findings, suggesting that the overall security level decreases. This trend is driven by flaws in the use of cryptography, insecure compiler flags, insecure uses of WebView components, and insecure uses of language features such as reflection. Based on our data, we argue for stricter controls on apps before admission to the store. @InProceedings{ESEC/FSE22p245, author = {Steven Arzt}, title = {Security Code Smells in Apps: Are We Getting Better?}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {245--255}, doi = {10.1145/3540250.3549091}, year = {2022}, } Publisher's Version |
|
Atakishiyev, Abdurahman |
ESEC/FSE '22-IND: "Towards Developer-Centered ..."
Towards Developer-Centered Automatic Program Repair: Findings from Bloomberg
Emily Rowan Winter, Vesna Nowack, David Bowes, Steve Counsell, Tracy Hall, Sæmundur Haraldsson, John Woodward, Serkan Kirbas, Etienne Windels, Olayori McBello, Abdurahman Atakishiyev, Kevin Kells, and Matthew Pagano (Lancaster University, UK; Brunel University London, UK; University of Stirling, UK; Queen Mary University of London, UK; Bloomberg, UK) This paper reports on qualitative research into automatic program repair (APR) at Bloomberg. Six focus groups were conducted with a total of seventeen participants (including both developers of the APR tool and developers using the tool) to consider: the development at Bloomberg of a prototype APR tool (Fixie); developers’ early experiences using the tool; and developers’ perspectives on how they would like to interact with the tool in future. APR is developing rapidly and it is important to understand in greater detail developers' experiences using this emerging technology. In this paper, we provide in-depth, qualitative data from an industrial setting. We found that the development of APR at Bloomberg had become increasingly user-centered, emphasising how fixes were presented to developers, as well as particular features, such as customisability. From the focus groups with developers who had used Fixie, we found particular concern with the pragmatic aspects of APR, such as how and when fixes were presented to them. Based on our findings, we make a series of recommendations to inform future APR development, highlighting how APR tools should 'start small', be customisable, and fit with developers' workflows. We also suggest that APR tools should capitalise on the promise of repair bots and draw on advances in explainable AI. 
@InProceedings{ESEC/FSE22p1578, author = {Emily Rowan Winter and Vesna Nowack and David Bowes and Steve Counsell and Tracy Hall and Sæmundur Haraldsson and John Woodward and Serkan Kirbas and Etienne Windels and Olayori McBello and Abdurahman Atakishiyev and Kevin Kells and Matthew Pagano}, title = {Towards Developer-Centered Automatic Program Repair: Findings from Bloomberg}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1578--1588}, doi = {10.1145/3540250.3558953}, year = {2022}, } Publisher's Version |
|
AuBuchon, Jake |
ESEC/FSE '22: "Pair Programming Conversations ..."
Pair Programming Conversations with Agents vs. Developers: Challenges and Opportunities for SE Community
Peter Robe, Sandeep K. Kuttal, Jake AuBuchon, and Jacob Hart (University of Tulsa, USA) Recent research has shown the feasibility of an interactive pair-programming conversational agent, but implementing such an agent poses three challenges: a lack of benchmark datasets, absence of software engineering specific labels, and the need to understand developer conversations. To address these challenges, we conducted a Wizard of Oz study with 14 participants pair programming with a simulated agent and collected 4,443 developer-agent utterances. Based on this dataset, we created 26 software engineering labels using an open coding process to develop a hierarchical classification scheme. To understand labeled developer-agent conversations, we compared the accuracy of three state-of-the-art transformer-based language models, BERT, GPT-2, and XLNet, which performed interchangeably. In order to begin creating a developer-agent dataset, researchers and practitioners need to conduct resource intensive Wizard of Oz studies. Presently, there exist vast amounts of developer-developer conversations on video hosting websites. To investigate the feasibility of using developer-developer conversations, we labeled a publicly available developer-developer dataset (3,436 utterances) with our hierarchical classification scheme and found that a BERT model trained on developer-developer data performed ~10% worse than the BERT trained on developer-agent data, but when using transfer-learning, accuracy improved. Finally, our qualitative analysis revealed that developer-developer conversations are more implicit, neutral, and opinionated than developer-agent conversations. Our results have implications for software engineering researchers and practitioners developing conversational agents. @InProceedings{ESEC/FSE22p319, author = {Peter Robe and Sandeep K. Kuttal and Jake AuBuchon and Jacob Hart}, title = {Pair Programming Conversations with Agents vs. Developers: Challenges and Opportunities for SE Community}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {319--331}, doi = {10.1145/3540250.3549127}, year = {2022}, } Publisher's Version |
|
Ayoup, Patrick |
ESEC/FSE '22-IND: "Achievement Unlocked: A Case ..."
Achievement Unlocked: A Case Study on Gamifying DevOps Practices in Industry
Patrick Ayoup, Diego Elias Costa, and Emad Shihab (Concordia University, Canada; Université du Québec à Montréal, Canada) Gamification is the use of game elements such as points, leaderboards, and badges in a non-game context to encourage a desired behavior from individuals interacting with an environment. Recently, gamification has found its way into software engineering contexts as a means to promote certain activities to practitioners. Previous studies investigated the use of gamification to promote the adoption of a variety of tools and practices, however, these studies were either performed in an educational environment or in small to medium-sized teams of developers in the industry. We performed a large-scale mixed-methods study on the effects of badge-based gamification in promoting the adoption of DevOps practices in a very large company and evaluated how practice adoption is associated with changes in key delivery, quality, and throughput metrics of 333 software projects. We observed an accelerated adoption of some gamified DevOps practices by at least 60%, with increased adoption rates up to 6x. We found mixed results when associating badge adoption and metric changes: teams that earned testing badges showed an increase in bug fixing commits but output fewer commits and pull requests; teams that earned code review and quality tooling badges exhibited faster delivery metrics. Finally, our empirical study was supplemented by a survey with 45 developers where 73% of respondents found badges to be helpful for learning about and adopting new standardized practices. Our results contribute to the rich knowledge on gamification with a unique and important perspective from real industry practitioners. 
@InProceedings{ESEC/FSE22p1343, author = {Patrick Ayoup and Diego Elias Costa and Emad Shihab}, title = {Achievement Unlocked: A Case Study on Gamifying DevOps Practices in Industry}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1343--1354}, doi = {10.1145/3540250.3558948}, year = {2022}, } Publisher's Version |
|
Bacchelli, Alberto |
ESEC/FSE '22: "Software Security during Modern ..."
Software Security during Modern Code Review: The Developer’s Perspective
Larissa Braz and Alberto Bacchelli (University of Zurich, Switzerland) To avoid software vulnerabilities, organizations are shifting security to earlier stages of the software development, such as at code review time. In this paper, we aim to understand the developers’ perspective on assessing software security during code review, the challenges they encounter, and the support that companies and projects provide. To this end, we conduct a two-step investigation: we interview 10 professional developers and survey 182 practitioners about software security assessment during code review. The outcome is an overview of how developers perceive software security during code review and a set of identified challenges. Our study revealed that most developers do not immediately report to focus on security issues during code review. Only after being asked about software security, developers state to always consider it during review and acknowledge its importance. Most companies do not provide security training, yet expect developers to still ensure security during reviews. Accordingly, developers report the lack of training and security knowledge as the main challenges they face when checking for security issues. In addition, they have challenges with third-party libraries and to identify interactions between parts of code that could have security implications. Moreover, security may be disregarded during reviews due to developers’ assumptions about the security dynamic of the application they develop. Preprint: https://arxiv.org/abs/2208.04261 Data and materials: https://doi.org/10.5281/zenodo.6969369 @InProceedings{ESEC/FSE22p810, author = {Larissa Braz and Alberto Bacchelli}, title = {Software Security during Modern Code Review: The Developer’s Perspective}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {810--821}, doi = {10.1145/3540250.3549135}, year = {2022}, } Publisher's Version Info ESEC/FSE '22: "First Come First Served: The ..." 
First Come First Served: The Impact of File Position on Code Review
Enrico Fregnan, Larissa Braz, Marco D'Ambros, Gül Çalıklı, and Alberto Bacchelli (University of Zurich, Switzerland; USI Lugano, Switzerland; University of Glasgow, UK) The most popular code review tools (e.g., Gerrit and GitHub) present the files to review sorted in alphabetical order. Could this choice or, more generally, the relative position in which a file is presented bias the outcome of code reviews? We investigate this hypothesis by triangulating complementary evidence in a two-step study. First, we observe developers’ code review activity. We analyze the review comments pertaining to 219,476 Pull Requests (PRs) from 138 popular Java projects on GitHub. We found files shown earlier in a PR to receive more comments than files shown later, also when controlling for possible confounding factors: e.g., the presence of discussion threads or the lines added in a file. Second, we measure the impact of file position on defect finding in code review. Recruiting 106 participants, we conduct an online controlled experiment in which we measure participants’ performance in detecting two unrelated defects seeded into two different files. Participants are assigned to one of two treatments in which the position of the defective files is switched. For one type of defect, participants are not affected by its file’s position; for the other, they have 64% lower odds to identify it when its file is last as opposed to first. Overall, our findings provide evidence that the relative position in which files are presented has an impact on code reviews’ outcome; we discuss these results and implications for tool design and code review.
@InProceedings{ESEC/FSE22p483, author = {Enrico Fregnan and Larissa Braz and Marco D'Ambros and Gül Çalıklı and Alberto Bacchelli}, title = {First Come First Served: The Impact of File Position on Code Review}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {483--494}, doi = {10.1145/3540250.3549177}, year = {2022}, } Publisher's Version Artifacts Reusable |
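The "64% lower odds" figure reported in the abstract above corresponds to an odds ratio of 0.36, which is not the same as a 64% lower probability. As a minimal sketch of the arithmetic, the snippet below converts a baseline detection probability into the probability implied by that odds ratio. The baseline of 0.50 is purely illustrative and is not a figure from the paper, and all function names are hypothetical.

```python
def probability_to_odds(p: float) -> float:
    """Convert a probability in [0, 1) to odds."""
    return p / (1.0 - p)


def odds_to_probability(odds: float) -> float:
    """Convert non-negative odds back to a probability."""
    return odds / (1.0 + odds)


def apply_odds_ratio(baseline_p: float, odds_ratio: float) -> float:
    """Probability obtained by scaling the baseline odds by an odds ratio."""
    return odds_to_probability(probability_to_odds(baseline_p) * odds_ratio)


# "64% lower odds" corresponds to an odds ratio of 1 - 0.64 = 0.36.
ODDS_RATIO_LAST_VS_FIRST = 1.0 - 0.64

# Illustrative baseline only (an assumption, not a figure from the paper):
# the probability of finding the defect when its file is shown first.
baseline = 0.50
print(round(apply_odds_ratio(baseline, ODDS_RATIO_LAST_VS_FIRST), 3))
```

Because odds are a nonlinear function of probability, the implied drop in detection probability depends on the baseline chosen, so the same odds ratio yields different probability differences for easy and hard defects.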
|
Bagheri, Hamid |
ESEC/FSE '22: "Parasol: Efficient Parallel ..."
Parasol: Efficient Parallel Synthesis of Large Model Spaces
Clay Stevens and Hamid Bagheri (University of Nebraska-Lincoln, USA) Formal analysis is an invaluable tool for software engineers, yet state-of-the-art formal analysis techniques suffer from well-known limitations in terms of scalability. In particular, some software design domains—such as tradeoff analysis and security analysis—require systematic exploration of potentially huge model spaces, which further exacerbates the problem. Despite this present and urgent challenge, few techniques exist to support the systematic exploration of large model spaces. This paper introduces Parasol, an approach and accompanying tool suite, to improve the scalability of large-scale formal model space exploration. Parasol presents a novel parallel model space synthesis approach, backed with unsupervised learning to automatically derive domain knowledge, guiding a balanced partitioning of the model space. This allows Parasol to synthesize the models in each partition in parallel, significantly reducing synthesis time and making large-scale systematic model space exploration for real-world systems more tractable. Our empirical results corroborate that Parasol substantially reduces (by 460% on average) the time required for model space synthesis, compared to state-of-the-art model space synthesis techniques relying on both incremental and parallel constraint solving technologies as well as competing, non-learning-based partitioning methods. @InProceedings{ESEC/FSE22p620, author = {Clay Stevens and Hamid Bagheri}, title = {Parasol: Efficient Parallel Synthesis of Large Model Spaces}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {620--632}, doi = {10.1145/3540250.3549157}, year = {2022}, } Publisher's Version |
|
Baltes, Sebastian |
ESEC/FSE '22-IVR: "Paving the Way for Mature ..."
Paving the Way for Mature Secondary Research: The Seven Types of Literature Review
Paul Ralph and Sebastian Baltes (Dalhousie University, Canada; University of Adelaide, Australia) Confusion over different kinds of secondary research, and their divergent purposes, is undermining the effectiveness and usefulness of secondary studies in software engineering. This short paper therefore explains the differences between ad hoc review, case survey, critical review, meta-analysis (aka systematic literature review), meta-synthesis (aka thematic analysis), rapid review and scoping review (aka systematic mapping study). These definitions and associated guidelines help researchers better select and describe their literature reviews, while helping reviewers select more appropriate evaluation criteria. @InProceedings{ESEC/FSE22p1632, author = {Paul Ralph and Sebastian Baltes}, title = {Paving the Way for Mature Secondary Research: The Seven Types of Literature Review}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1632--1636}, doi = {10.1145/3540250.3560877}, year = {2022}, } Publisher's Version |
|
Bansal, Chetan |
ESEC/FSE '22-IND: "AutoTSG: Learning and Synthesis ..."
AutoTSG: Learning and Synthesis for Incident Troubleshooting
Manish Shetty, Chetan Bansal, Sai Pramod Upadhyayula, Arjun Radhakrishna, and Anurag Gupta (Microsoft Research, India; Microsoft, USA) Incident management is a key aspect of operating large-scale cloud services. To aid with faster and efficient resolution of incidents, engineering teams document frequent troubleshooting steps in the form of Troubleshooting Guides (TSGs), to be used by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute necessary steps. This results in a plethora of issues such as on-call fatigue, reduced productivity, and human errors. In this work, we conduct a large-scale empirical study of over 4K+ TSGs mapped to incidents and find that TSGs are widely used and help significantly reduce mitigation efforts. We then analyze feedback on TSGs provided by 400+ OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To alleviate these gaps, we investigate the automation of TSGs and propose AutoTSG -- a novel framework for automation of TSGs to executable workflows by combining machine learning and program synthesis. Our evaluation of AutoTSG on 50 TSGs shows the effectiveness in both identifying TSG statements (accuracy 0.89) and parsing them for execution (precision 0.94 and recall 0.91). Lastly, we survey ten Microsoft engineers and show the importance of TSG automation and the usefulness of AutoTSG. @InProceedings{ESEC/FSE22p1477, author = {Manish Shetty and Chetan Bansal and Sai Pramod Upadhyayula and Arjun Radhakrishna and Anurag Gupta}, title = {AutoTSG: Learning and Synthesis for Incident Troubleshooting}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1477--1488}, doi = {10.1145/3540250.3558958}, year = {2022}, } Publisher's Version ESEC/FSE '22-IND: "Nalanda: A Socio-technical ..." 
Nalanda: A Socio-technical Graph Platform for Building Software Analytics Tools at Enterprise Scale Chandra Maddila, Suhas Shanbhogue, Apoorva Agrawal, Thomas Zimmermann, Chetan Bansal, Nicole Forsgren, Divyanshu Agrawal, Kim Herzig, and Arie van Deursen (Microsoft Research, USA; Microsoft Research, India; Microsoft, USA; Delft University of Technology, Netherlands) Software development is information-dense knowledge work that requires collaboration with other developers and awareness of artifacts such as work items, pull requests, and file changes. With the speed of development increasing, information overload and information discovery are challenges for people developing and maintaining these systems. Finding information about similar code changes and experts is difficult for software engineers, especially when they work in large software systems or have just recently joined a project. In this paper, we build a large scale data platform named Nalanda platform to address the challenges of information overload and discovery. Nalanda contains two subsystems: (1) a large scale socio-technical graph system, named Nalanda graph system, and (2) a large scale index system, named Nalanda index system that aims at satisfying the information needs of software developers. To show the versatility of the Nalanda platform, we built two applications: (1) a software analytics application with a news feed named MyNalanda that has Daily Active Users (DAU) of 290 and Monthly Active Users (MAU) of 590, and (2) a recommendation system for related work items and pull requests that accomplished similar tasks (artifact recommendation) and a recommendation system for subject matter experts (expert recommendation), augmented by the Nalanda socio-technical graph. Initial studies of the two applications found that developers and engineering managers are favorable toward continued use of the news feed application for information discovery. 
The studies also found that developers agreed that a system like Nalanda artifact and expert recommendation application could reduce the time spent and the number of places needed to visit to find information. @InProceedings{ESEC/FSE22p1246, author = {Chandra Maddila and Suhas Shanbhogue and Apoorva Agrawal and Thomas Zimmermann and Chetan Bansal and Nicole Forsgren and Divyanshu Agrawal and Kim Herzig and Arie van Deursen}, title = {Nalanda: A Socio-technical Graph Platform for Building Software Analytics Tools at Enterprise Scale}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1246--1256}, doi = {10.1145/3540250.3558949}, year = {2022}, } Publisher's Version Info |
|
Bao, Lingfeng |
ESEC/FSE '22: "Automated Unearthing of Dangerous ..."
Automated Unearthing of Dangerous Issue Reports
Shengyi Pan, Jiayuan Zhou, Filipe Roseiro Cogo, Xin Xia, Lingfeng Bao, Xing Hu, Shanping Li, and Ahmed E. Hassan (Zhejiang University, China; Huawei, Canada; Huawei, China; Queen’s University, Canada) The coordinated vulnerability disclosure (CVD) process is commonly adopted for open source software (OSS) vulnerability management, which recommends privately reporting discovered vulnerabilities and keeping relevant information secret until the official disclosure. However, in practice, due to various reasons (e.g., lacking security domain expertise or the sense of security management), many vulnerabilities are first reported via public issue reports (IRs) before their official disclosure. Such IRs are dangerous IRs, since attackers can take advantage of the leaked vulnerability information to launch zero-day attacks. It is crucial to identify such dangerous IRs at an early stage, such that OSS users can start the vulnerability remediation process earlier and OSS maintainers can manage the dangerous IRs in a timely manner. In this paper, we propose and evaluate a deep learning based approach, namely MemVul, to automatically identify dangerous IRs at the time they are reported. MemVul augments the neural networks with a memory component, which stores the external vulnerability knowledge from Common Weakness Enumeration (CWE). We rely on publicly accessible CVE-referred IRs (CIRs) to operationalize the concept of dangerous IR. We mine 3,937 CIRs distributed across 1,390 OSS projects hosted on GitHub. Evaluated under a practical scenario of high data imbalance, MemVul achieves the best trade-off between precision and recall among all baselines. In particular, the F1-score of MemVul (i.e., 0.49) improves the best performing baseline by 44%. For IRs that are predicted as CIRs but not reported to CVE, we conduct a user study to investigate their usefulness to OSS stakeholders.
We observe that 82% (41 out of 50) of these IRs are security-related and 28 of them are suggested by security experts to be publicly disclosed, indicating MemVul is capable of identifying undisclosed dangerous IRs. @InProceedings{ESEC/FSE22p834, author = {Shengyi Pan and Jiayuan Zhou and Filipe Roseiro Cogo and Xin Xia and Lingfeng Bao and Xing Hu and Shanping Li and Ahmed E. Hassan}, title = {Automated Unearthing of Dangerous Issue Reports}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {834--846}, doi = {10.1145/3540250.3549156}, year = {2022}, } Publisher's Version |
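The evaluation figures in the abstract above can be sanity-checked with a few lines of arithmetic. The sketch below states the standard F1 definition (harmonic mean of precision and recall) and backs out the baseline F1 implied by "0.49 improves the best performing baseline by 44%". The implied baseline is a back-of-the-envelope value derived from the rounded numbers in the abstract, not a figure reported there, and the function names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_baseline(improved: float, relative_gain: float) -> float:
    """Baseline value such that baseline * (1 + relative_gain) == improved."""
    return improved / (1.0 + relative_gain)


MEMVUL_F1 = 0.49       # reported in the abstract
RELATIVE_GAIN = 0.44   # "improves the best performing baseline by 44%"

# Derived estimate only: roughly 0.34 given the rounded inputs.
print(round(implied_baseline(MEMVUL_F1, RELATIVE_GAIN), 2))

# User-study share: 41 of 50 predicted IRs judged security-related.
print(41 / 50)
```

An F1-score below 0.5 may look low in isolation, but under heavy class imbalance the relative gain over baselines is the more informative comparison.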
|
Bao, Xuanlin |
ESEC/FSE '22-DEMO: "CodeMatcher: A Tool for Large-Scale ..."
CodeMatcher: A Tool for Large-Scale Code Search Based on Query Semantics Matching
Chao Liu, Xuanlin Bao, Xin Xia, Meng Yan, David Lo, and Ting Zhang (Chongqing University, China; Huawei, China; Singapore Management University, Singapore) Due to the emergence of large-scale codebases, such as GitHub and Gitee, searching and reusing existing code can help developers substantially improve software development productivity. Over the years, many code search tools have been developed. Early tools leveraged information retrieval (IR) techniques to perform an efficient code search for a frequently changed large-scale codebase. However, the search accuracy was low due to the semantic mismatch between query and code. In recent years, many tools have leveraged Deep Learning (DL) techniques to address this issue. But the DL-based tools are slow and the search accuracy is unstable. In this paper, we present an IR-based tool, CodeMatcher, which inherits the advantages of the DL-based tools in query semantics matching. Generally, CodeMatcher first builds an index for a large-scale codebase to accelerate the search response time. For a given search query, it addresses irrelevant and noisy words in the query, then retrieves candidate code from the indexed codebase via iterative fuzzy search, and finally reranks the candidates based on two designed measures of semantic matching between query and candidates. We implemented CodeMatcher as a search engine website. To verify the effectiveness of our tool, we evaluated CodeMatcher on 41k+ open-source Java repositories. Experimental results showed that CodeMatcher can achieve an industrial-level response time (0.3s) with a common server with an Intel-i7 CPU. On the search accuracy, CodeMatcher significantly outperforms three state-of-the-art tools (DeepCS, UNIF, and CodeHow) and two online search engines (GitHub search and Google search).
@InProceedings{ESEC/FSE22p1642, author = {Chao Liu and Xuanlin Bao and Xin Xia and Meng Yan and David Lo and Ting Zhang}, title = {CodeMatcher: A Tool for Large-Scale Code Search Based on Query Semantics Matching}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1642--1646}, doi = {10.1145/3540250.3558935}, year = {2022}, } Publisher's Version Video Info |
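The three-stage flow described in the abstract above (clean the query, retrieve candidates by iterative fuzzy search over an index, then rerank) can be illustrated with a toy sketch. This is not CodeMatcher's implementation: the stop-word list, the rule for relaxing the query, and the scoring function are all assumptions chosen for brevity, and the abstract does not specify the tool's two reranking measures.

```python
import re

# Hypothetical list of "noisy" query words; the real tool's list is not given.
STOP_WORDS = {"how", "to", "a", "an", "the", "in", "of", "for", "and", "with"}


def tokenize(text):
    """Split text into lowercase tokens, breaking camelCase identifiers."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", spaced)]


def build_index(method_names):
    """Inverted index: token -> set of method names containing that token."""
    index = {}
    for name in method_names:
        for token in set(tokenize(name)):
            index.setdefault(token, set()).add(name)
    return index


def clean_query(query):
    """Drop noisy words from the query (stage 1)."""
    return [t for t in tokenize(query) if t not in STOP_WORDS]


def fuzzy_retrieve(index, words):
    """Stage 2: require all words; on a miss, drop the rarest word and retry."""
    remaining = list(words)
    while remaining:
        candidates = set.intersection(*(index.get(w, set()) for w in remaining))
        if candidates:
            return candidates
        rarest = min(remaining, key=lambda w: len(index.get(w, ())))
        remaining.remove(rarest)
    return set()


def rerank(candidates, words):
    """Stage 3: more matched query words first, then shorter names, then A-Z."""
    def sort_key(name):
        tokens = tokenize(name)
        matched = sum(1 for w in words if w in tokens)
        return (-matched, len(tokens), name)
    return sorted(candidates, key=sort_key)


def search(index, query):
    words = clean_query(query)
    if not words:
        return []
    return rerank(fuzzy_retrieve(index, words), words)


# Toy "codebase" of method names (illustrative only).
INDEX = build_index(
    ["readFileToString", "writeStringToFile", "parseJsonFile", "closeStream"]
)
print(search(INDEX, "how to read a file to string"))
print(search(INDEX, "convert json file"))
```

Building the index once up front is what keeps per-query work small; the relaxation loop only touches posting sets for the handful of words in the query.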
|
Baral, Kesina |
ESEC/FSE '22: "Avgust: Automating Usage-Based ..."
Avgust: Automating Usage-Based Test Generation from Videos of App Executions
Yixue Zhao, Saghar Talebipour, Kesina Baral, Hyojae Park, Leon Yee, Safwat Ali Khan, Yuriy Brun, Nenad Medvidović, and Kevin Moran (University of Massachusetts at Amherst, USA; University of Southern California, USA; George Mason University, USA; Sharon High School, USA; Valley Christian High School, USA) Writing and maintaining UI tests for mobile apps is a time-consuming and tedious task. While decades of research have produced automated approaches for UI test generation, these approaches typically focus on testing for crashes or maximizing code coverage. By contrast, recent research has shown that developers prefer usage-based tests, which center around specific uses of app features, to help support activities such as regression testing. Very few existing techniques support the generation of such tests, as doing so requires automating the difficult task of understanding the semantics of UI screens and user inputs. In this paper, we introduce Avgust, which automates key steps of generating usage-based tests. Avgust uses neural models for image understanding to process video recordings of app uses to synthesize an app-agnostic state-machine encoding of those uses. Then, Avgust uses this encoding to synthesize test cases for a new target app. We evaluate Avgust on 374 videos of common uses of 18 popular apps and show that 69% of the tests Avgust generates successfully execute the desired usage, and that Avgust’s classifiers outperform the state of the art. @InProceedings{ESEC/FSE22p421, author = {Yixue Zhao and Saghar Talebipour and Kesina Baral and Hyojae Park and Leon Yee and Safwat Ali Khan and Yuriy Brun and Nenad Medvidović and Kevin Moran}, title = {Avgust: Automating Usage-Based Test Generation from Videos of App Executions}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {421--433}, doi = {10.1145/3540250.3549134}, year = {2022}, } Publisher's Version Info Artifacts Functional |
|
Barr, Earl T. |
ESEC/FSE '22: "Modus: A Datalog Dialect for ..."
Modus: A Datalog Dialect for Building Container Images
Chris Tomy, Tingmao Wang, Earl T. Barr, and Sergey Mechtaev (University College London, UK) Containers help share and deploy software by packaging it with all its dependencies. Tools, like Docker or Kubernetes, spawn containers from images as specified by a build system’s language, such as Dockerfile. A build system takes many parameters to build an image, including OS and application versions. These build parameters can interact: setting one can restrict another. Dockerfile lacks support for reifying and constraining these interactions, thus forcing developers to write a build script per workflow. As a result, developers have resorted to creating ad-hoc solutions such as templates or domain-specific frameworks that harm performance and complicate maintenance because they are verbose and mix languages. To address this problem, we introduce Modus, a Datalog dialect for building container images. Modus' key insight is that container definitions naturally map to proof trees of Horn clauses. In these trees, container configurations correspond to logical facts, build instructions correspond to logic rules, and the build tree is computed as the minimal proof of the Datalog query specifying the target image. Modus relies on Datalog’s expressivity to specify complex workflows with concision and facilitate automatic parallelisation. We evaluated Modus by porting build systems of six popular Docker Hub images to Modus. Modus reduced the code size by 20.1% compared to the used ad-hoc solutions, while imposing a negligible performance overhead, preserving the original image size and image efficiency. We also provide a detailed analysis of porting OpenJDK image build system to Modus. @InProceedings{ESEC/FSE22p595, author = {Chris Tomy and Tingmao Wang and Earl T. 
Barr and Sergey Mechtaev}, title = {Modus: A Datalog Dialect for Building Container Images}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {595--606}, doi = {10.1145/3540250.3549133}, year = {2022}, } Publisher's Version Info Artifacts Reusable |
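The Modus abstract above says that container definitions map to proof trees of Horn clauses, with configurations as facts and build instructions as rules. The following is a minimal sketch of that idea only, not Modus syntax or its implementation: it uses propositional Horn clauses (no Datalog variables), the rule names are hypothetical, and it returns the first proof it finds rather than a guaranteed minimal one.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# A proof tree is (goal, [proof trees of the subgoals]).
ProofTree = Tuple[str, list]

# Hypothetical rule base: each head maps to alternative bodies.
# An empty body makes the head a fact (e.g. an available base image).
RULES: Dict[str, List[List[str]]] = {
    "base_image": [[]],
    "jdk_installed": [["base_image"]],
    "source_copied": [["base_image"]],
    "app_built": [["jdk_installed", "source_copied"]],
}


def prove(goal: str,
          rules: Dict[str, List[List[str]]],
          seen: FrozenSet[str] = frozenset()) -> Optional[ProofTree]:
    """Backward-chain from `goal`; return a proof tree or None."""
    if goal in seen:  # avoid looping on cyclic rules
        return None
    for body in rules.get(goal, []):
        subtrees = []
        for subgoal in body:
            tree = prove(subgoal, rules, seen | {goal})
            if tree is None:
                break  # this rule body fails, try the next alternative
            subtrees.append(tree)
        else:
            # Every subgoal was proven (or the body was empty: a fact).
            return (goal, subtrees)
    return None


if __name__ == "__main__":
    print(prove("app_built", RULES))
```

In this sketch the returned tree plays the role of a build plan: leaves are facts and each inner node is a build step whose children must be available first. Sibling subtrees have no dependency on each other, which is the kind of structure that makes parallel building possible.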
|
Bartocci, Ezio |
ESEC/FSE '22-DEMO: "FIM: Fault Injection and Mutation ..."
FIM: Fault Injection and Mutation for Simulink
Ezio Bartocci, Leonardo Mariani, Dejan Ničković, and Drishti Yadav (TU Wien, Austria; University of Milano-Bicocca, Italy; AIT, Austria) We introduce FIM, an open-source toolkit for automated fault injection and mutant generation in Simulink models. FIM allows the injection of faults into specific parts, supporting common types of faults and mutation operators whose parameters can be customized to control the time of fault actuation and persistence. Additional flags allow the user to activate the individual fault blocks during testing to observe their effects on the overall system reliability. We provide insights into the design and architecture of FIM, and evaluate its performance on a case study from the avionics domain. @InProceedings{ESEC/FSE22p1716, author = {Ezio Bartocci and Leonardo Mariani and Dejan Ničković and Drishti Yadav}, title = {FIM: Fault Injection and Mutation for Simulink}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1716--1720}, doi = {10.1145/3540250.3558932}, year = {2022}, } Publisher's Version |
|
Beckmann, Jennifer |
ESEC/FSE '22-IND: "Discovering Feature Flag Interdependencies ..."
Discovering Feature Flag Interdependencies in Microsoft Office
Michael Schröder, Katja Kevic, Dan Gopstein, Brendan Murphy, and Jennifer Beckmann (TU Wien, Austria; Microsoft, UK; Microsoft, USA) Feature flags are a popular method to control functionality in released code. They enable rapid development and deployment, but can also quickly accumulate technical debt. Complex interactions between feature flags can go unnoticed, especially if interdependent flags are located far apart in the code, and these unknown dependencies could become a source of serious bugs. Testing all possible combinations of feature flags is infeasible in large systems like Microsoft Office, which has about 12000 active flags. The goal of our research is to aid product teams in improving system reliability by providing an approach to automatically discover feature flag interdependencies. We use probabilistic reasoning to infer causal relationships from feature flag query logs. Our approach is language-agnostic, scales easily to large heterogeneous codebases, and is robust against noise such as code drift or imperfect log data. We evaluated our approach on real-world query logs from Microsoft Office and are able to achieve over 90% precision while recalling non-trivial indirect feature flag relationships across different source files. We also investigated re-occurring patterns of relationships and describe applications for targeted testing, determining deployment velocity, error mitigation, and diagnostics. @InProceedings{ESEC/FSE22p1419, author = {Michael Schröder and Katja Kevic and Dan Gopstein and Brendan Murphy and Jennifer Beckmann}, title = {Discovering Feature Flag Interdependencies in Microsoft Office}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1419--1429}, doi = {10.1145/3540250.3558942}, year = {2022}, } Publisher's Version |
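The abstract above describes inferring relationships between feature flags from query logs. As a rough illustration of why such logs carry dependency information, the sketch below estimates, for each ordered pair of flags, how often the second flag is queried in sessions where the first is queried. This is a plain co-occurrence estimate under an assumed log format (one list of flag names per session), not the probabilistic causal inference the paper uses, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def conditional_query_rates(
    sessions: Iterable[List[str]],
) -> Dict[Tuple[str, str], float]:
    """Estimate P(b queried | a queried) for each co-occurring pair (a, b)."""
    seen: Counter = Counter()      # sessions in which each flag was queried
    together: Counter = Counter()  # sessions in which both flags were queried
    for session in sessions:
        flags = set(session)
        for flag in flags:
            seen[flag] += 1
        for a, b in combinations(sorted(flags), 2):
            together[(a, b)] += 1
            together[(b, a)] += 1
    return {(a, b): n / seen[a] for (a, b), n in together.items()}


if __name__ == "__main__":
    # Hypothetical log: flag "B" is only ever queried when "A" is queried.
    logs = [["A", "B"], ["A"], ["A", "B"], ["C"]]
    rates = conditional_query_rates(logs)
    print(rates[("B", "A")], rates[("A", "B")])
```

A rate of 1.0 for the pair (B, A) together with a lower rate for (A, B) would hint that queries of B happen only on code paths that also query A, which is one candidate for a dependency worth testing. Real logs are noisy, so a threshold or a proper statistical model would be needed in practice.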
|
Behroozi, Mahnaz |
ESEC/FSE '22: "Asynchronous Technical Interviews: ..."
Asynchronous Technical Interviews: Reducing the Effect of Supervised Think-Aloud on Communication Ability
Mahnaz Behroozi, Chris Parnin, and Chris Brown (IBM, USA; North Carolina State University, USA; Virginia Tech, USA) Software engineers often face a critical test before landing a job—passing a technical interview. During these sessions, candidates must write code while thinking aloud as they work toward a solution to a problem under the watchful eye of an interviewer. While thinking aloud during technical interviews gives interviewers a picture of candidates’ problem-solving ability, surprisingly, these types of interviews often prevent candidates from communicating their thought process effectively. To understand if poor performance related to interviewer presence can be reduced while preserving communication and technical skills, we introduce asynchronous technical interviews—where candidates submit recordings of think-aloud and coding. We compare this approach to traditional whiteboard interviews and find that, by eliminating interviewer supervision, asynchronicity significantly improved the clarity of think-aloud via increased informativeness and reduced stress. Moreover, we discovered asynchronous technical interviews preserved, and in some cases even enhanced, technical problem-solving strategies and code quality. This work offers insight into asynchronous technical interviews as a design for supporting communication during interviews, and discusses trade-offs and guidelines for implementing this approach in software engineering hiring practices. @InProceedings{ESEC/FSE22p294, author = {Mahnaz Behroozi and Chris Parnin and Chris Brown}, title = {Asynchronous Technical Interviews: Reducing the Effect of Supervised Think-Aloud on Communication Ability}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {294--305}, doi = {10.1145/3540250.3549168}, year = {2022}, } Publisher's Version |
|
Bell, Jonathan |
ESEC/FSE '22: "A Retrospective Study of One ..."
A Retrospective Study of One Decade of Artifact Evaluations
Stefan Winter, Christopher S. Timperley, Ben Hermann, Jürgen Cito, Jonathan Bell, Michael Hilton, and Dirk Beyer (LMU Munich, Germany; Carnegie Mellon University, USA; TU Dortmund, Germany; TU Wien, Austria; Northeastern University, USA) Most software engineering research involves the development of a prototype, a proof of concept, or a measurement apparatus. Together with the data collected in the research process, they are collectively referred to as research artifacts and are subject to artifact evaluation (AE) at scientific conferences. Since its initiation in the SE community at ESEC/FSE 2011, both the goals and the process of AE have evolved and today expectations towards AE are strongly linked with reproducible research results and reusable tools that other researchers can build their work on. However, to date little evidence has been provided that artifacts which have passed AE actually live up to these high expectations, i.e., to which degree AE processes contribute to AE's goals and whether the overhead they impose is justified. We aim to fill this gap by providing an in-depth analysis of research artifacts from a decade of software engineering (SE) and programming languages (PL) conferences, based on which we reflect on the goals and mechanisms of AE in our community. In summary, our analyses (1) suggest that articles with artifacts do not generally have better visibility in the community, (2) provide evidence how evaluated and not evaluated artifacts differ with respect to different quality criteria, and (3) highlight opportunities for further improving AE processes. @InProceedings{ESEC/FSE22p145, author = {Stefan Winter and Christopher S. 
Timperley and Ben Hermann and Jürgen Cito and Jonathan Bell and Michael Hilton and Dirk Beyer}, title = {A Retrospective Study of One Decade of Artifact Evaluations}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {145--156}, doi = {10.1145/3540250.3549172}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Bergum, Annabelle |
ESEC/FSE '22: "Correlates of Programmer Efficacy ..."
Correlates of Programmer Efficacy and Their Link to Experience: A Combined EEG and Eye-Tracking Study
Norman Peitek, Annabelle Bergum, Maurice Rekrut, Jonas Mucke, Matthias Nadig, Chris Parnin, Janet Siegmund, and Sven Apel (Saarland University, Germany; German Research Center for Artificial Intelligence, Germany; Chemnitz University of Technology, Germany; North Carolina State University, USA) Background: Despite similar education and background, programmers can exhibit vast differences in efficacy. While research has identified some potential factors, such as programming experience and domain knowledge, the effect of these factors on programmers' efficacy is not well understood. Aims: We aim at unraveling the relationship between efficacy (speed and correctness) and measures of programming experience. We further investigate the correlates of programmer efficacy in terms of reading behavior and cognitive load. Method: For this purpose, we conducted a controlled experiment with 37 participants using electroencephalography (EEG) and eye tracking. We asked participants to comprehend up to 32 Java source-code snippets and observed their eye gaze and neural correlates of cognitive load. We analyzed the correlation of participants' efficacy with popular programming experience measures. Results: We found that programmers with high efficacy read source code more targeted and with lower cognitive load. Commonly used experience levels do not predict programmer efficacy well, but self-estimation and indicators of learning eagerness are fairly accurate. Implications: The identified correlates of programmer efficacy can be used for future research and practice (e.g., hiring). Future research should also consider efficacy as a group sampling method, rather than using simple experience measures. 
@InProceedings{ESEC/FSE22p120, author = {Norman Peitek and Annabelle Bergum and Maurice Rekrut and Jonas Mucke and Matthias Nadig and Chris Parnin and Janet Siegmund and Sven Apel}, title = {Correlates of Programmer Efficacy and Their Link to Experience: A Combined EEG and Eye-Tracking Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {120--131}, doi = {10.1145/3540250.3549084}, year = {2022}, } Publisher's Version Info |
|
Beyer, Dirk |
ESEC/FSE '22: "A Retrospective Study of One ..."
A Retrospective Study of One Decade of Artifact Evaluations
Stefan Winter, Christopher S. Timperley, Ben Hermann, Jürgen Cito, Jonathan Bell, Michael Hilton, and Dirk Beyer (LMU Munich, Germany; Carnegie Mellon University, USA; TU Dortmund, Germany; TU Wien, Austria; Northeastern University, USA) Most software engineering research involves the development of a prototype, a proof of concept, or a measurement apparatus. Together with the data collected in the research process, they are collectively referred to as research artifacts and are subject to artifact evaluation (AE) at scientific conferences. Since its initiation in the SE community at ESEC/FSE 2011, both the goals and the process of AE have evolved and today expectations towards AE are strongly linked with reproducible research results and reusable tools that other researchers can build their work on. However, to date little evidence has been provided that artifacts which have passed AE actually live up to these high expectations, i.e., to which degree AE processes contribute to AE's goals and whether the overhead they impose is justified. We aim to fill this gap by providing an in-depth analysis of research artifacts from a decade of software engineering (SE) and programming languages (PL) conferences, based on which we reflect on the goals and mechanisms of AE in our community. In summary, our analyses (1) suggest that articles with artifacts do not generally have better visibility in the community, (2) provide evidence how evaluated and not evaluated artifacts differ with respect to different quality criteria, and (3) highlight opportunities for further improving AE processes. @InProceedings{ESEC/FSE22p145, author = {Stefan Winter and Christopher S. 
Timperley and Ben Hermann and Jürgen Cito and Jonathan Bell and Michael Hilton and Dirk Beyer}, title = {A Retrospective Study of One Decade of Artifact Evaluations}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {145--156}, doi = {10.1145/3540250.3549172}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Bezzubov, Alexander |
ESEC/FSE '22-IND: "All You Need Is Logs: Improving ..."
All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs
Vitaliy Bibaev, Alexey Kalina, Vadim Lomshakov, Yaroslav Golubev, Alexander Bezzubov, Nikita Povarov, and Timofey Bryksin (JetBrains, Serbia; JetBrains, Germany; JetBrains, Russia; JetBrains Research, Serbia; JetBrains, Netherlands; JetBrains Research, Cyprus) In this work, we propose an approach for collecting completion usage logs from the users in an IDE and using them to train a machine learning based model for ranking completion candidates. We developed a set of features that describe completion candidates and their context, and deployed their anonymized collection in the Early Access Program of IntelliJ-based IDEs. We used the logs to collect a dataset of code completions from users, and employed it to train a ranking CatBoost model. Then, we evaluated it in two settings: on a held-out set of the collected completions and in a separate A/B test on two different groups of users in the IDE. Our evaluation shows that using a simple ranking model trained on the past user behavior logs significantly improved code completion experience. Compared to the default heuristics-based ranking, our model demonstrated a decrease in the number of typing actions necessary to perform the completion in the IDE from 2.073 to 1.832. The approach adheres to privacy requirements and legal constraints, since it does not require collecting personal information, performing all the necessary anonymization on the client's side. Importantly, it can be improved continuously: implementing new features, collecting new data, and evaluating new models - this way, we have been using it in production since the end of 2020. 
@InProceedings{ESEC/FSE22p1269, author = {Vitaliy Bibaev and Alexey Kalina and Vadim Lomshakov and Yaroslav Golubev and Alexander Bezzubov and Nikita Povarov and Timofey Bryksin}, title = {All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1269--1279}, doi = {10.1145/3540250.3558968}, year = {2022}, } Publisher's Version |
|
Bibaev, Vitaliy |
ESEC/FSE '22-IND: "All You Need Is Logs: Improving ..."
All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs
Vitaliy Bibaev, Alexey Kalina, Vadim Lomshakov, Yaroslav Golubev, Alexander Bezzubov, Nikita Povarov, and Timofey Bryksin (JetBrains, Serbia; JetBrains, Germany; JetBrains, Russia; JetBrains Research, Serbia; JetBrains, Netherlands; JetBrains Research, Cyprus) In this work, we propose an approach for collecting completion usage logs from the users in an IDE and using them to train a machine learning based model for ranking completion candidates. We developed a set of features that describe completion candidates and their context, and deployed their anonymized collection in the Early Access Program of IntelliJ-based IDEs. We used the logs to collect a dataset of code completions from users, and employed it to train a ranking CatBoost model. Then, we evaluated it in two settings: on a held-out set of the collected completions and in a separate A/B test on two different groups of users in the IDE. Our evaluation shows that using a simple ranking model trained on the past user behavior logs significantly improved code completion experience. Compared to the default heuristics-based ranking, our model demonstrated a decrease in the number of typing actions necessary to perform the completion in the IDE from 2.073 to 1.832. The approach adheres to privacy requirements and legal constraints, since it does not require collecting personal information, performing all the necessary anonymization on the client's side. Importantly, it can be improved continuously: implementing new features, collecting new data, and evaluating new models - this way, we have been using it in production since the end of 2020. 
@InProceedings{ESEC/FSE22p1269, author = {Vitaliy Bibaev and Alexey Kalina and Vadim Lomshakov and Yaroslav Golubev and Alexander Bezzubov and Nikita Povarov and Timofey Bryksin}, title = {All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1269--1279}, doi = {10.1145/3540250.3558968}, year = {2022}, } Publisher's Version |
|
Biesdorf, Andreas |
ESEC/FSE '22-IND: "Sometimes You Have to Treat ..."
Sometimes You Have to Treat the Symptoms: Tackling Model Drift in an Industrial Clone-and-Own Software Product Line
Christof Tinnes, Wolfgang Rössler, Uwe Hohenstein, Torsten Kühn, Andreas Biesdorf, and Sven Apel (Siemens, Germany; Siemens Mobility, Germany; Saarland University, Germany) Many industrial software product lines use a clone-and-own approach for reuse among software products. As a result, the different products in the product line may drift apart, which implies increased efforts for tasks such as change propagation, domain analysis, and quality assurance. While many solutions have been proposed in the literature, these are often difficult to apply in a real-world setting. We study this drift of products in a concrete large-scale industrial model-driven clone-and-own software product line in the railway domain at our industry partner. For this purpose, we conducted interviews and a survey, and we investigated the models in the model history of this project. We found that increased efforts are mainly caused by large model differences and increased communication efforts. We argue that, in the short-term, treating the symptoms (i.e., handling large model differences) can help to keep efforts for software product-line engineering acceptable — instead of employing sophisticated variability management. To treat the symptoms, we employ a solution based on semantic-lifting to simplify model differences. Using the interviews and the survey, we evaluate the feasibility of variability management approaches and the semantic-lifting approach in the context of this project. @InProceedings{ESEC/FSE22p1355, author = {Christof Tinnes and Wolfgang Rössler and Uwe Hohenstein and Torsten Kühn and Andreas Biesdorf and Sven Apel}, title = {Sometimes You Have to Treat the Symptoms: Tackling Model Drift in an Industrial Clone-and-Own Software Product Line}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1355--1366}, doi = {10.1145/3540250.3558960}, year = {2022}, } Publisher's Version |
|
Bird, Christian |
ESEC/FSE '22: "Program Merge Conflict Resolution ..."
Program Merge Conflict Resolution via Neural Transformers
Alexey Svyatkovskiy, Sarah Fakhoury, Negar Ghorbani, Todd Mytkowicz, Elizabeth Dinella, Christian Bird, Jinu Jang, Neel Sundaresan, and Shuvendu K. Lahiri (Microsoft, USA; Washington State University, USA; University of California at Irvine, USA; Microsoft Research, USA; University of Pennsylvania, USA) Collaborative software development is an integral part of the modern software development life cycle, essential to the success of large-scale software projects. When multiple developers make concurrent changes around the same lines of code, a merge conflict may occur. Such conflicts stall pull requests and continuous integration pipelines for hours to several days, seriously hurting developer productivity. To address this problem, we introduce MergeBERT, a novel neural program merge framework based on token-level three-way differencing and a transformer encoder model. By exploiting the restricted nature of merge conflict resolutions, we reformulate the task of generating the resolution sequence as a classification task over a set of primitive merge patterns extracted from real-world merge commit data. Our model achieves 63–68% accuracy for merge resolution synthesis, yielding nearly a 3× performance improvement over existing semi-structured, and 2× improvement over neural program merge tools. Finally, we demonstrate that MergeBERT is sufficiently flexible to work with source code files in Java, JavaScript, TypeScript, and C# programming languages. To measure the practical use of MergeBERT, we conduct a user study to evaluate MergeBERT suggestions with 25 developers from large OSS projects on 122 real-world conflicts they encountered. Results suggest that in practice, MergeBERT resolutions would be accepted at a higher rate than estimated by automatic metrics for precision and accuracy. Additionally, we use participant feedback to identify future avenues for improvement of MergeBERT. 
@InProceedings{ESEC/FSE22p822, author = {Alexey Svyatkovskiy and Sarah Fakhoury and Negar Ghorbani and Todd Mytkowicz and Elizabeth Dinella and Christian Bird and Jinu Jang and Neel Sundaresan and Shuvendu K. Lahiri}, title = {Program Merge Conflict Resolution via Neural Transformers}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {822--833}, doi = {10.1145/3540250.3549163}, year = {2022}, } Publisher's Version |
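The MergeBERT abstract above reformulates resolution generation as classification over a set of primitive merge patterns. The sketch below illustrates that reformulation under stated assumptions: the five patterns listed are a hypothetical set chosen for illustration (the abstract does not enumerate them), and the function only derives class labels from a known resolution; it does not include the token-level differencing or the transformer model.

```python
from typing import Callable, Dict, List

Tokens = List[str]

# Hypothetical primitive patterns: each builds a resolution from the two
# conflicting versions (a, b) and their common ancestor (base).
PATTERNS: Dict[str, Callable[[Tokens, Tokens, Tokens], Tokens]] = {
    "take_a": lambda a, b, base: a,
    "take_b": lambda a, b, base: b,
    "take_base": lambda a, b, base: base,
    "a_then_b": lambda a, b, base: a + b,
    "b_then_a": lambda a, b, base: b + a,
}


def label_resolution(a: Tokens, b: Tokens, base: Tokens,
                     resolved: Tokens) -> List[str]:
    """Return the names of all patterns that reproduce `resolved`."""
    return [name for name, build in PATTERNS.items()
            if build(a, b, base) == resolved]


if __name__ == "__main__":
    a = ["x", "=", "1"]
    b = ["x", "=", "2"]
    base = ["x", "=", "0"]
    print(label_resolution(a, b, base, ["x", "=", "2"]))
```

Labels obtained this way from historical merge commits could serve as training targets for a classifier, so that at prediction time the model chooses a pattern instead of generating tokens one by one. Resolutions that match no pattern (an empty list here) fall outside what such a classifier can express.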
|
Biswas, Sumon |
ESEC/FSE '22: "23 Shades of Self-Admitted ..."
23 Shades of Self-Admitted Technical Debt: An Empirical Study on Machine Learning Software
David OBrien, Sumon Biswas, Sayem Imtiaz, Rabe Abdalkareem, Emad Shihab, and Hridesh Rajan (Iowa State University, USA; Carleton University, Canada; Concordia University, Canada) In software development, the term “technical debt” (TD) is used to characterize short-term solutions and workarounds implemented in source code which may incur a long-term cost. Technical debt has a variety of forms and can thus affect multiple qualities of software including but not limited to its legibility, performance, and structure. In this paper, we have conducted a comprehensive study on the technical debts in machine learning (ML) based software. TD can appear differently in ML software by infecting the data that ML models are trained on, thus affecting the functional behavior of ML systems. The growing inclusion of ML components in modern software systems have introduced a new set of TDs. Does ML software have similar TDs to traditional software? If not, what are the new types of ML specific TDs? Which ML pipeline stages do these debts appear? Do these debts differ in ML tools and applications and when they get removed? Currently, we do not know the state of the ML TDs in the wild. To address these questions, we mined 68,820 self-admitted technical debts (SATD) from all the revisions of a curated dataset consisting of 2,641 popular ML repositories from GitHub, along with their introduction and removal. By applying an open-coding scheme and following upon prior works, we provide a comprehensive taxonomy of ML SATDs. Our study analyzes ML SATD type organizations, their frequencies within stages of ML software, the differences between ML SATDs in applications and tools, and quantifies the removal of ML SATDs. The findings discovered suggest implications for ML developers and researchers to create maintainable ML systems. 
@InProceedings{ESEC/FSE22p734, author = {David OBrien and Sumon Biswas and Sayem Imtiaz and Rabe Abdalkareem and Emad Shihab and Hridesh Rajan}, title = {23 Shades of Self-Admitted Technical Debt: An Empirical Study on Machine Learning Software}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {734--746}, doi = {10.1145/3540250.3549088}, year = {2022}, } Publisher's Version Artifacts Functional |
|
Bittner, Paul Maximilian |
ESEC/FSE '22: "Classifying Edits to Variability ..."
Classifying Edits to Variability in Source Code
Paul Maximilian Bittner, Christof Tinnes, Alexander Schultheiß, Sören Viegener, Timo Kehrer, and Thomas Thüm (University of Ulm, Germany; Siemens, Germany; Humboldt University of Berlin, Germany; University of Bern, Switzerland) For highly configurable software systems, such as the Linux kernel, maintaining and evolving variability information along changes to source code poses a major challenge. While source code itself may be edited, also feature-to-code mappings may be introduced, removed, or changed. In practice, such edits are often conducted ad-hoc and without proper documentation. To support the maintenance and evolution of variability, it is desirable to understand the impact of each edit on the variability. We propose the first complete and unambiguous classification of edits to variability in source code by means of a catalog of edit classes. This catalog is based on a scheme that can be used to build classifications that are complete and unambiguous by construction. To this end, we introduce a complete and sound model for edits to variability. In about 21.5ms per commit, we validate the correctness and suitability of our classification by classifying each edit in 1.7 million commits in the change histories of 44 open-source software systems automatically. We are able to classify all edits with syntactically correct feature-to-code mappings and find that all our edit classes occur in practice. @InProceedings{ESEC/FSE22p196, author = {Paul Maximilian Bittner and Christof Tinnes and Alexander Schultheiß and Sören Viegener and Timo Kehrer and Thomas Thüm}, title = {Classifying Edits to Variability in Source Code}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {196--208}, doi = {10.1145/3540250.3549108}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Bo, Lili |
ESEC/FSE '22-DEMO: "KVS: A Tool for Knowledge-Driven ..."
KVS: A Tool for Knowledge-Driven Vulnerability Searching
Xingqi Cheng, Xiaobing Sun, Lili Bo, and Ying Wei (Yangzhou University, China; Heidelberg University, Germany; Nanjing University, China) It is difficult to quickly locate and search for specific vulnerabilities and their solutions because vulnerability information is scattered in the existing vulnerability management library. To alleviate this problem, we extract knowledge from vulnerability reports and organize the vulnerability information into the form of a knowledge graph. Then, we implement a tool for knowledge-driven vulnerability searching, KVS. This tool mainly uses the BERT model to realize the vulnerability named entity recognition and construct the vulnerability knowledge graph (VulKG). Finally, we can search vulnerabilities of interest based on VulKG. The URL of this tool is https://cinnqi.github.io/Neo4j-D3-VKG/. Video of our demo is available at https://youtu.be/FT1BaLUGPk0. @InProceedings{ESEC/FSE22p1731, author = {Xingqi Cheng and Xiaobing Sun and Lili Bo and Ying Wei}, title = {KVS: A Tool for Knowledge-Driven Vulnerability Searching}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1731--1735}, doi = {10.1145/3540250.3558920}, year = {2022}, } Publisher's Version Video Info |
|
Bowes, David |
ESEC/FSE '22-IND: "Towards Developer-Centered ..."
Towards Developer-Centered Automatic Program Repair: Findings from Bloomberg
Emily Rowan Winter, Vesna Nowack, David Bowes, Steve Counsell, Tracy Hall, Sæmundur Haraldsson, John Woodward, Serkan Kirbas, Etienne Windels, Olayori McBello, Abdurahman Atakishiyev, Kevin Kells, and Matthew Pagano (Lancaster University, UK; Brunel University London, UK; University of Stirling, UK; Queen Mary University of London, UK; Bloomberg, UK) This paper reports on qualitative research into automatic program repair (APR) at Bloomberg. Six focus groups were conducted with a total of seventeen participants (including both developers of the APR tool and developers using the tool) to consider: the development at Bloomberg of a prototype APR tool (Fixie); developers’ early experiences using the tool; and developers’ perspectives on how they would like to interact with the tool in future. APR is developing rapidly and it is important to understand in greater detail developers' experiences using this emerging technology. In this paper, we provide in-depth, qualitative data from an industrial setting. We found that the development of APR at Bloomberg had become increasingly user-centered, emphasising how fixes were presented to developers, as well as particular features, such as customisability. From the focus groups with developers who had used Fixie, we found particular concern with the pragmatic aspects of APR, such as how and when fixes were presented to them. Based on our findings, we make a series of recommendations to inform future APR development, highlighting how APR tools should 'start small', be customisable, and fit with developers' workflows. We also suggest that APR tools should capitalise on the promise of repair bots and draw on advances in explainable AI. 
@InProceedings{ESEC/FSE22p1578, author = {Emily Rowan Winter and Vesna Nowack and David Bowes and Steve Counsell and Tracy Hall and Sæmundur Haraldsson and John Woodward and Serkan Kirbas and Etienne Windels and Olayori McBello and Abdurahman Atakishiyev and Kevin Kells and Matthew Pagano}, title = {Towards Developer-Centered Automatic Program Repair: Findings from Bloomberg}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1578--1588}, doi = {10.1145/3540250.3558953}, year = {2022}, } Publisher's Version |
|
Boyalakuntla, Kowndinya |
ESEC/FSE '22-DEMO: "eGEN: An Energy-Saving Modeling ..."
eGEN: An Energy-Saving Modeling Language and Code Generator for Location-Sensing of Mobile Apps
Kowndinya Boyalakuntla, Marimuthu Chinnakali, Sridhar Chimalakonda, and Chandrasekaran K (IIT Tirupati, India; National Institute of Technology Karnataka, India) Given the limited tool support for energy-saving strategies during the design phase of android applications, developing battery-aware, location-based android applications is a non-trivial task for developers. To this end, we propose eGEN, consisting of (1) a Domain-Specific Modeling Language (DSML) and (2) a code generator to specify and create native battery-aware, location-based mobile applications. We evaluated eGEN by instrumenting the generated battery-aware code in five location-based, open-source android applications and compared the energy consumption with non-eGEN versions. The experimental results show 188 mA (8.34% of battery per hour) of average reduction in battery consumption while showing only 97 meters of degradation in location accuracy over three kilometers of a cycling path. Hence, we see this tool as a first step in helping developers write battery-aware code in location-based android applications. The GitHub repository with source code and all artifacts is available at https://github.com/Kowndinya2000/egen, and the tool demo video at https://youtu.be/Iadfh4cCw8I. @InProceedings{ESEC/FSE22p1697, author = {Kowndinya Boyalakuntla and Marimuthu Chinnakali and Sridhar Chimalakonda and Chandrasekaran K}, title = {eGEN: An Energy-Saving Modeling Language and Code Generator for Location-Sensing of Mobile Apps}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1697--1700}, doi = {10.1145/3540250.3558914}, year = {2022}, } Publisher's Version Video |
|
Braz, Larissa |
ESEC/FSE '22: "Software Security during Modern ..."
Software Security during Modern Code Review: The Developer’s Perspective
Larissa Braz and Alberto Bacchelli (University of Zurich, Switzerland) To avoid software vulnerabilities, organizations are shifting security to earlier stages of the software development, such as at code review time. In this paper, we aim to understand the developers’ perspective on assessing software security during code review, the challenges they encounter, and the support that companies and projects provide. To this end, we conduct a two-step investigation: we interview 10 professional developers and survey 182 practitioners about software security assessment during code review. The outcome is an overview of how developers perceive software security during code review and a set of identified challenges. Our study revealed that most developers do not immediately report to focus on security issues during code review. Only after being asked about software security, developers state to always consider it during review and acknowledge its importance. Most companies do not provide security training, yet expect developers to still ensure security during reviews. Accordingly, developers report the lack of training and security knowledge as the main challenges they face when checking for security issues. In addition, they have challenges with third-party libraries and to identify interactions between parts of code that could have security implications. Moreover, security may be disregarded during reviews due to developers’ assumptions about the security dynamic of the application they develop. Preprint: https://arxiv.org/abs/2208.04261 Data and materials: https://doi.org/10.5281/zenodo.6969369 @InProceedings{ESEC/FSE22p810, author = {Larissa Braz and Alberto Bacchelli}, title = {Software Security during Modern Code Review: The Developer’s Perspective}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {810--821}, doi = {10.1145/3540250.3549135}, year = {2022}, } Publisher's Version Info
ESEC/FSE '22: "First Come First Served: The ..."
First Come First Served: The Impact of File Position on Code Review
Enrico Fregnan, Larissa Braz, Marco D'Ambros, Gül Çalıklı, and Alberto Bacchelli (University of Zurich, Switzerland; USI Lugano, Switzerland; University of Glasgow, UK) The most popular code review tools (e.g., Gerrit and GitHub) present the files to review sorted in alphabetical order. Could this choice or, more generally, the relative position in which a file is presented bias the outcome of code reviews? We investigate this hypothesis by triangulating complementary evidence in a two-step study. First, we observe developers’ code review activity. We analyze the review comments pertaining to 219,476 Pull Requests (PRs) from 138 popular Java projects on GitHub. We found files shown earlier in a PR to receive more comments than files shown later, also when controlling for possible confounding factors: e.g., the presence of discussion threads or the lines added in a file. Second, we measure the impact of file position on defect finding in code review. Recruiting 106 participants, we conduct an online controlled experiment in which we measure participants’ performance in detecting two unrelated defects seeded into two different files. Participants are assigned to one of two treatments in which the position of the defective files is switched. For one type of defect, participants are not affected by its file’s position; for the other, they have 64% lower odds to identify it when its file is last as opposed to first. Overall, our findings provide evidence that the relative position in which files are presented has an impact on code reviews’ outcome; we discuss these results and implications for tool design and code review.
@InProceedings{ESEC/FSE22p483, author = {Enrico Fregnan and Larissa Braz and Marco D'Ambros and Gül Çalıklı and Alberto Bacchelli}, title = {First Come First Served: The Impact of File Position on Code Review}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {483--494}, doi = {10.1145/3540250.3549177}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Briand, Lionel C. |
ESEC/FSE '22-DEMO: "COREQQA: A COmpliance REQuirements ..."
COREQQA: A COmpliance REQuirements Understanding using Question Answering Tool
Sallam Abualhaija, Chetan Arora, and Lionel C. Briand (University of Luxembourg, Luxembourg; Deakin University, Australia; University of Ottawa, Canada) We introduce COREQQA, a tool for assisting requirements engineers in acquiring a better understanding of compliance requirements by means of automated Question Answering. Extracting compliance-related requirements by manually navigating through a legal document is both time-consuming and error-prone. COREQQA enables requirements engineers to pose questions in natural language about a compliance-related topic given some legal document, e.g., asking about data breach. The tool then automatically navigates through the legal document and returns to the requirements engineer a list of text passages containing the possible answers to the input question. For better readability, the tool also highlights the likely answers in these passages. The engineer can then use this output for specifying compliance requirements. COREQQA is developed using advanced large-scale language models from BERT’s family. COREQQA has been evaluated on four legal documents. The results of this evaluation are briefly presented in the paper. The tool is publicly available on Zenodo (https://doi.org/10.5281/zenodo.6653514). @InProceedings{ESEC/FSE22p1682, author = {Sallam Abualhaija and Chetan Arora and Lionel C. Briand}, title = {COREQQA: A COmpliance REQuirements Understanding using Question Answering Tool}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1682--1686}, doi = {10.1145/3540250.3558926}, year = {2022}, } Publisher's Version |
|
Brown, Chris |
ESEC/FSE '22: "Asynchronous Technical Interviews: ..."
Asynchronous Technical Interviews: Reducing the Effect of Supervised Think-Aloud on Communication Ability
Mahnaz Behroozi, Chris Parnin, and Chris Brown (IBM, USA; North Carolina State University, USA; Virginia Tech, USA) Software engineers often face a critical test before landing a job—passing a technical interview. During these sessions, candidates must write code while thinking aloud as they work toward a solution to a problem under the watchful eye of an interviewer. While thinking aloud during technical interviews gives interviewers a picture of candidates’ problem-solving ability, surprisingly, these types of interviews often prevent candidates from communicating their thought process effectively. To understand if poor performance related to interviewer presence can be reduced while preserving communication and technical skills, we introduce asynchronous technical interviews—where candidates submit recordings of think-aloud and coding. We compare this approach to traditional whiteboard interviews and find that, by eliminating interviewer supervision, asynchronicity significantly improved the clarity of think-aloud via increased informativeness and reduced stress. Moreover, we discovered asynchronous technical interviews preserved, and in some cases even enhanced, technical problem-solving strategies and code quality. This work offers insight into asynchronous technical interviews as a design for supporting communication during interviews, and discusses trade-offs and guidelines for implementing this approach in software engineering hiring practices. @InProceedings{ESEC/FSE22p294, author = {Mahnaz Behroozi and Chris Parnin and Chris Brown}, title = {Asynchronous Technical Interviews: Reducing the Effect of Supervised Think-Aloud on Communication Ability}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {294--305}, doi = {10.1145/3540250.3549168}, year = {2022}, } Publisher's Version |
|
Brun, Yuriy |
ESEC/FSE '22: "Avgust: Automating Usage-Based ..."
Avgust: Automating Usage-Based Test Generation from Videos of App Executions
Yixue Zhao, Saghar Talebipour, Kesina Baral, Hyojae Park, Leon Yee, Safwat Ali Khan, Yuriy Brun, Nenad Medvidović, and Kevin Moran (University of Massachusetts at Amherst, USA; University of Southern California, USA; George Mason University, USA; Sharon High School, USA; Valley Christian High School, USA) Writing and maintaining UI tests for mobile apps is a time-consuming and tedious task. While decades of research have produced automated approaches for UI test generation, these approaches typically focus on testing for crashes or maximizing code coverage. By contrast, recent research has shown that developers prefer usage-based tests, which center around specific uses of app features, to help support activities such as regression testing. Very few existing techniques support the generation of such tests, as doing so requires automating the difficult task of understanding the semantics of UI screens and user inputs. In this paper, we introduce Avgust, which automates key steps of generating usage-based tests. Avgust uses neural models for image understanding to process video recordings of app uses to synthesize an app-agnostic state-machine encoding of those uses. Then, Avgust uses this encoding to synthesize test cases for a new target app. We evaluate Avgust on 374 videos of common uses of 18 popular apps and show that 69% of the tests Avgust generates successfully execute the desired usage, and that Avgust’s classifiers outperform the state of the art. @InProceedings{ESEC/FSE22p421, author = {Yixue Zhao and Saghar Talebipour and Kesina Baral and Hyojae Park and Leon Yee and Safwat Ali Khan and Yuriy Brun and Nenad Medvidović and Kevin Moran}, title = {Avgust: Automating Usage-Based Test Generation from Videos of App Executions}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {421--433}, doi = {10.1145/3540250.3549134}, year = {2022}, } Publisher's Version Info Artifacts Functional |
|
Bryksin, Timofey |
ESEC/FSE '22-IND: "All You Need Is Logs: Improving ..."
All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs
Vitaliy Bibaev, Alexey Kalina, Vadim Lomshakov, Yaroslav Golubev, Alexander Bezzubov, Nikita Povarov, and Timofey Bryksin (JetBrains, Serbia; JetBrains, Germany; JetBrains, Russia; JetBrains Research, Serbia; JetBrains, Netherlands; JetBrains Research, Cyprus) In this work, we propose an approach for collecting completion usage logs from the users in an IDE and using them to train a machine learning based model for ranking completion candidates. We developed a set of features that describe completion candidates and their context, and deployed their anonymized collection in the Early Access Program of IntelliJ-based IDEs. We used the logs to collect a dataset of code completions from users, and employed it to train a ranking CatBoost model. Then, we evaluated it in two settings: on a held-out set of the collected completions and in a separate A/B test on two different groups of users in the IDE. Our evaluation shows that using a simple ranking model trained on the past user behavior logs significantly improved code completion experience. Compared to the default heuristics-based ranking, our model demonstrated a decrease in the number of typing actions necessary to perform the completion in the IDE from 2.073 to 1.832. The approach adheres to privacy requirements and legal constraints, since it does not require collecting personal information, performing all the necessary anonymization on the client's side. Importantly, it can be improved continuously: implementing new features, collecting new data, and evaluating new models - this way, we have been using it in production since the end of 2020. 
@InProceedings{ESEC/FSE22p1269, author = {Vitaliy Bibaev and Alexey Kalina and Vadim Lomshakov and Yaroslav Golubev and Alexander Bezzubov and Nikita Povarov and Timofey Bryksin}, title = {All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1269--1279}, doi = {10.1145/3540250.3558968}, year = {2022}, } Publisher's Version |
|
Bultan, Tevfik |
ESEC/FSE '22-DEMO: "TSA: A Tool to Detect and ..."
TSA: A Tool to Detect and Quantify Network Side-Channels
İsmet Burak Kadron and Tevfik Bultan (University of California at Santa Barbara, USA) Mobile applications, Internet of Things devices and web services are pervasive and they all encrypt the communications between servers and clients to not have information leakages. While the network traffic is encrypted, packet sizes and timings are still visible to an eavesdropper and these properties can leak information and sacrifice user privacy. We present TSA, a black box network side-channel analysis tool which detects and quantifies side-channel information leakages. TSA provides the users with the means to automate trace gathering by providing a framework in which the users can write mutators for the inputs to the system under analysis. TSA can also take as input traces directly for analysis if the user prefers to gather them separately. TSA is open-source and available as a Python package and a command-line tool. TSA demo, tool and benchmarks are available at https://github.com/kadron/tsa-tool. @InProceedings{ESEC/FSE22p1760, author = {İsmet Burak Kadron and Tevfik Bultan}, title = {TSA: A Tool to Detect and Quantify Network Side-Channels}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1760--1764}, doi = {10.1145/3540250.3558938}, year = {2022}, } Publisher's Version |
|
Businge, John |
ESEC/FSE '22: "PaReco: Patched Clones and ..."
PaReco: Patched Clones and Missed Patches among the Divergent Variants of a Software Family
Poedjadevie Kadjel Ramkisoen, John Businge, Brent van Bladel, Alexandre Decan, Serge Demeyer, Coen De Roover, and Foutse Khomh (University of Antwerp, Belgium; Flanders Make, Belgium; University of Nevada at Las Vegas, USA; University of Mons, Belgium; F.R.S.-FNRS, Belgium; Vrije Universiteit Brussel, Belgium; Polytechnique Montréal, Canada) Re-using whole repositories as a starting point for new projects is often done by maintaining a variant fork parallel to the original. However, the common artifacts between both are not always kept up to date. As a result, patches are not optimally integrated across the two repositories, which may lead to sub-optimal maintenance between the variant and the original project. A bug existing in both repositories can be patched in one but not the other (we see this as a missed opportunity) or it can be manually patched in both probably by different developers (we see this as effort duplication). In this paper we present a tool (named PaReCo) which relies on clone detection to mine cases of missed opportunity and effort duplication from a pool of patches. We analyzed 364 (source to target) variant pairs with 8,323 patches resulting in a curated dataset containing 1,116 cases of effort duplication and 1,008 cases of missed opportunities. We achieve a precision of 91%, recall of 80%, accuracy of 88%, and F1-score of 85%. Furthermore, we investigated the time interval between patches and found out that, on average, missed patches in the target variants have been introduced in the source variants 52 weeks earlier. Consequently, PaReCo can be used to manage variability in “time” by automatically identifying interesting patches in later project releases to be backported to supported earlier releases. 
@InProceedings{ESEC/FSE22p646, author = {Poedjadevie Kadjel Ramkisoen and John Businge and Brent van Bladel and Alexandre Decan and Serge Demeyer and Coen De Roover and Foutse Khomh}, title = {PaReco: Patched Clones and Missed Patches among the Divergent Variants of a Software Family}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {646--658}, doi = {10.1145/3540250.3549112}, year = {2022}, } Publisher's Version |
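The precision, recall, and F1-score reported in the PaReco abstract can be cross-checked against each other, since F1 is conventionally the harmonic mean of precision and recall. The sketch below assumes that standard definition (the abstract does not spell out the formula) and uses the rounded percentages quoted above; the function name `f1_score` is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the abstract: precision 91%, recall 80%.
pareco_f1 = f1_score(0.91, 0.80)
print(f"F1 = {pareco_f1:.2%}")  # prints "F1 = 85.15%"
```

The result rounds to the 85% stated in the abstract, so the three figures are mutually consistent under this definition.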
|
Cai, Haipeng |
ESEC/FSE '22-DEMO: "PolyFax: A Toolkit for Characterizing ..."
PolyFax: A Toolkit for Characterizing Multi-language Software
Wen Li, Li Li, and Haipeng Cai (Washington State University, USA; Monash University, Australia) Today’s software systems are mostly developed in multiple languages (i.e., multi-language software), yet tool support for understanding and assuring these systems is rare. To facilitate future research on multi-language software engineering, this paper presents PolyFax, a toolkit that offers automated means for dataset collection from GitHub and two analysis utilities--a vulnerability-fixing commit categorization tool (VCC) and a language interfacing mechanism identification/categorization tool (LIC). The VCC tool immediately assists with assessing the vulnerability proneness of a given multi-language project based on its version histories, while the LIC tool enables dissection of the most important aspect of the construction of multi-language systems. Application of PolyFax to 7,113 multi-language projects with 12.6 million commits showed its practical usefulness in terms of promising efficiency and accuracy for studying multi-language software. @InProceedings{ESEC/FSE22p1662, author = {Wen Li and Li Li and Haipeng Cai}, title = {PolyFax: A Toolkit for Characterizing Multi-language Software}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1662--1666}, doi = {10.1145/3540250.3558925}, year = {2022}, } Publisher's Version
ESEC/FSE '22-IVR: "Language-Agnostic Dynamic ..."
Language-Agnostic Dynamic Analysis of Multilingual Code: Promises, Pitfalls, and Prospects
Haoran Yang, Wen Li, and Haipeng Cai (Washington State University, USA) Analyzing multilingual code holistically is key to systematic quality assurance of real-world software which is mostly developed in multiple computer languages. Toward such analyses, state-of-the-art approaches propose an almost-fully language-agnostic methodology and apply it to dynamic dependence analysis/slicing of multilingual code, showing great promises.
We investigated this methodology through a technical analysis followed by a replication study applying it to 10 real-world multilingual projects of diverse language combinations. Our results revealed critical practicality (i.e., having the levels of efficiency/scalability, precision, and extensibility to various language combinations for practical use) challenges to the methodology. Based on the results, we reflect on the underlying pitfalls of the language-agnostic design that leads to such challenges. Finally, looking forward to the prospects of dynamic analysis for multilingual code, we identify a new research direction towards better practicality and precision while not sacrificing extensibility much, as supported by preliminary results. The key takeaway is that pursuing fully language-agnostic analysis may be both impractical and unnecessary, and striving for a better balance between language independence and practicality may be more fruitful. @InProceedings{ESEC/FSE22p1621, author = {Haoran Yang and Wen Li and Haipeng Cai}, title = {Language-Agnostic Dynamic Analysis of Multilingual Code: Promises, Pitfalls, and Prospects}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1621--1626}, doi = {10.1145/3540250.3560880}, year = {2022}, } Publisher's Version
ESEC/FSE '22: "On the Vulnerability Proneness ..."
On the Vulnerability Proneness of Multilingual Code
Wen Li, Li Li, and Haipeng Cai (Washington State University, USA; Monash University, Australia) Software construction using multiple languages has long been a norm, yet it is still unclear if multilingual code construction has significant security implications and real security consequences. This paper aims to address this question with a large-scale study of popular multi-language projects on GitHub and their evolution histories, enabled by our novel techniques for multilingual code characterization.
We found statistically significant associations between the proneness of multilingual code to vulnerabilities (in general and of specific categories) and its language selection. We also found this association is correlated with that of the language interfacing mechanism, not that of individual languages. We validated our statistical findings with in-depth case studies on actual vulnerabilities, explained via the mechanism and language selection. Our results call for immediate actions to assess and defend against multilingual vulnerabilities, for which we provide practical recommendations. @InProceedings{ESEC/FSE22p847, author = {Wen Li and Li Li and Haipeng Cai}, title = {On the Vulnerability Proneness of Multilingual Code}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {847--859}, doi = {10.1145/3540250.3549173}, year = {2022}, } Publisher's Version Artifacts Reusable
ESEC/FSE '22: "Generating Realistic Vulnerabilities ..."
Generating Realistic Vulnerabilities via Neural Code Editing: An Empirical Study
Yu Nong, Yuzhe Ou, Michael Pradel, Feng Chen, and Haipeng Cai (Washington State University, USA; University of Texas at Dallas, USA; University of Stuttgart, Germany) The availability of large-scale, realistic vulnerability datasets is essential both for benchmarking existing techniques and for developing effective new data-driven approaches for software security. Yet such datasets are critically lacking. A promising solution is to generate such datasets by injecting vulnerabilities into real-world programs, which are richly available. Thus, in this paper, we explore the feasibility of vulnerability injection through neural code editing. With a synthetic dataset and a real-world one, we investigate the potential and gaps of three state-of-the-art neural code editors for vulnerability injection.
We find that the studied editors have critical limitations on the real-world dataset, where the best accuracy is only 10.03%, versus 79.40% on the synthetic dataset. While the graph-based editors are more effective (successfully injecting vulnerabilities in up to 34.93% of real-world testing samples) than the sequence-based one (0 success), they still suffer from complex code structures and fall short for long edits due to their insufficient designs of the preprocessing and deep learning (DL) models. We reveal the promise of neural code editing for generating realistic vulnerable samples, as they help boost the effectiveness of DL-based vulnerability detectors by up to 49.51% in terms of F1 score. We also provide insights into the gaps in current editors (e.g., they are good at deleting but not at replacing code) and actionable suggestions for addressing them (e.g., designing effective editing primitives). @InProceedings{ESEC/FSE22p1097, author = {Yu Nong and Yuzhe Ou and Michael Pradel and Feng Chen and Haipeng Cai}, title = {Generating Realistic Vulnerabilities via Neural Code Editing: An Empirical Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1097--1109}, doi = {10.1145/3540250.3549128}, year = {2022}, } Publisher's Version Artifacts Functional |
|
Cai, Shaowei |
ESEC/FSE '22: "SamplingCA: Effective and ..."
SamplingCA: Effective and Efficient Sampling-Based Pairwise Testing for Highly Configurable Software Systems
Chuan Luo, Qiyuan Zhao, Shaowei Cai, Hongyu Zhang, and Chunming Hu (Beihang University, China; Shanghai Jiao Tong University, China; Institute of Software at Chinese Academy of Sciences, China; University of Newcastle, Australia) Combinatorial interaction testing (CIT) is an effective paradigm for testing highly configurable systems, and its goal is to generate a t-wise covering array (CA) as a test suite, where t is the strength of testing. It is recognized that pairwise testing (i.e., CIT with t=2) is the most common CIT technique, and has high fault detection capability in practice. The problem of pairwise CA generation (PCAG), which is a core problem in pairwise testing, aims at generating a pairwise CA (i.e., 2-wise CA) of minimum size, subject to hard constraints. The PCAG problem is a hard combinatorial optimization problem, which urgently requires practical methods for generating pairwise CAs (PCAs) of small sizes. However, existing PCAG algorithms suffer from the severe scalability issue; that is, when solving large-scale PCAG instances, existing state-of-the-art PCAG algorithms usually cost a fairly long time to generate large PCAs, which would make the testing of highly configurable systems both ineffective and inefficient. In this paper, we propose a novel and effective sampling-based approach dubbed SamplingCA for solving the PCAG problem. SamplingCA first utilizes sampling techniques to obtain a small test suite that covers valid pairwise tuples as many as possible, and then adds a few more test cases into the test suite to ensure that all valid pairwise tuples are covered. Extensive experiments on 125 public PCAG instances show that our approach can generate much smaller PCAs than its state-of-the-art competitors, indicating the effectiveness of SamplingCA. Also, our experiments show that SamplingCA runs one to two orders of magnitude faster than its competitors, demonstrating the efficiency of SamplingCA. 
Our results confirm that SamplingCA is able to address the scalability issue and considerably pushes forward the state of the art in PCAG solving. @InProceedings{ESEC/FSE22p1185, author = {Chuan Luo and Qiyuan Zhao and Shaowei Cai and Hongyu Zhang and Chunming Hu}, title = {SamplingCA: Effective and Efficient Sampling-Based Pairwise Testing for Highly Configurable Software Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1185--1197}, doi = {10.1145/3540250.3549155}, year = {2022}, } Publisher's Version Artifacts Reusable |
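The two-phase idea described in the SamplingCA abstract (sample a small suite first, then add test cases until every valid pairwise tuple is covered) can be illustrated on a toy instance. The sketch below is not the authors' algorithm: it assumes Boolean options, enumerates all valid configurations by brute force (feasible only at toy sizes), and uses a plain greedy pick in the second phase. The names `pairs_of` and `pairwise_covering_array`, and the example constraint, are hypothetical.

```python
import itertools
import random


def pairs_of(config):
    """All 2-wise tuples ((i, vi), (j, vj)), i < j, covered by one configuration."""
    return {((i, config[i]), (j, config[j]))
            for i, j in itertools.combinations(range(len(config)), 2)}


def pairwise_covering_array(n_options, is_valid, n_samples=3, seed=0):
    """Two-phase sketch: sample valid configurations, then patch uncovered pairs."""
    rng = random.Random(seed)
    # Assumption: brute-force enumeration replaces any real constraint solving.
    valid = [c for c in itertools.product((0, 1), repeat=n_options) if is_valid(c)]
    # A pair counts as valid if at least one valid configuration contains it.
    target = set().union(*(pairs_of(c) for c in valid))
    suite, covered = [], set()
    # Phase 1: keep sampled configurations that cover at least one new pair.
    for c in rng.sample(valid, min(n_samples, len(valid))):
        new = pairs_of(c) - covered
        if new:
            suite.append(c)
            covered |= new
    # Phase 2: greedily add configurations until every valid pair is covered.
    while covered != target:
        best = max(valid, key=lambda c: len(pairs_of(c) - covered))
        suite.append(best)
        covered |= pairs_of(best)
    return suite, target


# Hypothetical hard constraint: enabling option 0 requires option 1.
suite, target = pairwise_covering_array(4, lambda c: not c[0] or c[1])
print(len(suite), "test cases cover", len(target), "valid pairs")
```

On this toy instance the pair (option 0 = 1, option 1 = 0) is excluded from the target because no valid configuration contains it, which is the sense in which only valid pairwise tuples need covering.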
|
Cai, Yuandao |
ESEC/FSE '22: "Peahen: Fast and Precise Static ..."
Peahen: Fast and Precise Static Deadlock Detection via Context Reduction
Yuandao Cai, Chengfeng Ye, Qingkai Shi, and Charles Zhang (Hong Kong University of Science and Technology, China; Ant Group, China) Deadlocks still severely inflict reliability and security issues upon software systems of the modern age. Worse still, as we note, in prior static deadlock detectors, good precision does not go hand-in-hand with high scalability --- their approaches are either context-insensitive, thereby engendering many false positives, or suffer from the calling context explosion to reach context-sensitive, thus compromising good efficiency. In this paper, we advocate Peahen, geared towards precise yet also scalable static deadlock detection. At its crux, Peahen decomposes the computational effort for embracing high precision into two cooperative analysis stages: (i) context-insensitive lock-graph construction, which selectively encodes the essential lock-acquisition information on each edge, and (ii) three precise yet lazy refinements, which incorporate such edge information into progressively refining the deadlock cycles in the lock graph only for a few interesting calling contexts. Our extensive experiments yield promising results: Peahen dramatically out-performs the state-of-the-art tools on accuracy without losing scalability; it can efficiently check million-line systems at a low false positive rate; and it has uncovered many confirmed deadlocks in dozens of mature open-source systems. @InProceedings{ESEC/FSE22p784, author = {Yuandao Cai and Chengfeng Ye and Qingkai Shi and Charles Zhang}, title = {Peahen: Fast and Precise Static Deadlock Detection via Context Reduction}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {784--796}, doi = {10.1145/3540250.3549110}, year = {2022}, } Publisher's Version |
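To make the notion of a lock graph concrete, the sketch below builds a graph with an edge from lock `a` to lock `b` whenever `b` is acquired while `a` is held, and reports lock pairs taken in opposite orders. It is a simplified illustration, not Peahen's analysis: per-thread event lists stand in for statically derived lock-acquisition facts, only two-lock cycles are considered, and the "distinct threads" filter is a generic stand-in for the paper's refinements. All names are hypothetical.

```python
from collections import defaultdict


def build_lock_graph(traces):
    """traces: {thread: [("acq" | "rel", lock), ...]} -> {(held, acquired): {threads}}."""
    edges = defaultdict(set)
    for thread, events in traces.items():
        held = []
        for op, lock in events:
            if op == "acq":
                for h in held:
                    edges[(h, lock)].add(thread)  # lock acquired while h is held
                held.append(lock)
            else:
                held.remove(lock)
    return edges


def two_lock_cycles(edges):
    """Report lock pairs acquired in opposite orders by different threads."""
    found = []
    for (a, b), threads_ab in edges.items():
        threads_ba = edges.get((b, a), set())
        # Stand-in refinement: the two edges must come from distinct threads.
        if a < b and any(t != u for t in threads_ab for u in threads_ba):
            found.append((a, b))
    return found


traces = {
    "T1": [("acq", "A"), ("acq", "B"), ("rel", "B"), ("rel", "A")],
    "T2": [("acq", "B"), ("acq", "A"), ("rel", "A"), ("rel", "B")],
    "T3": [("acq", "C"), ("acq", "A"), ("rel", "A"), ("rel", "C")],
}
print(two_lock_cycles(build_lock_graph(traces)))  # prints [('A', 'B')]
```

Storing the acquiring threads on each edge mirrors, in a very reduced form, the idea of keeping extra information on lock-graph edges so that candidate cycles can be filtered afterwards.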
|
Çalıklı, Gül |
ESEC/FSE '22: "First Come First Served: The ..."
First Come First Served: The Impact of File Position on Code Review
Enrico Fregnan, Larissa Braz, Marco D'Ambros, Gül Çalıklı, and Alberto Bacchelli (University of Zurich, Switzerland; USI Lugano, Switzerland; University of Glasgow, UK) The most popular code review tools (e.g., Gerrit and GitHub) present the files to review sorted in alphabetical order. Could this choice or, more generally, the relative position in which a file is presented bias the outcome of code reviews? We investigate this hypothesis by triangulating complementary evidence in a two-step study. First, we observe developers’ code review activity. We analyze the review comments pertaining to 219,476 Pull Requests (PRs) from 138 popular Java projects on GitHub. We found files shown earlier in a PR to receive more comments than files shown later, also when controlling for possible confounding factors: e.g., the presence of discussion threads or the lines added in a file. Second, we measure the impact of file position on defect finding in code review. Recruiting 106 participants, we conduct an online controlled experiment in which we measure participants’ performance in detecting two unrelated defects seeded into two different files. Participants are assigned to one of two treatments in which the position of the defective files is switched. For one type of defect, participants are not affected by its file’s position; for the other, they have 64% lower odds to identify it when its file is last as opposed to first. Overall, our findings provide evidence that the relative position in which files are presented has an impact on code reviews’ outcome; we discuss these results and implications for tool design and code review.
@InProceedings{ESEC/FSE22p483, author = {Enrico Fregnan and Larissa Braz and Marco D'Ambros and Gül Çalıklı and Alberto Bacchelli}, title = {First Come First Served: The Impact of File Position on Code Review}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {483--494}, doi = {10.1145/3540250.3549177}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Camtepe, Seyit |
ESEC/FSE '22: "Cross-Language Android Permission ..."
Cross-Language Android Permission Specification
Chaoran Li, Xiao Chen, Ruoxi Sun, Minhui Xue, Sheng Wen, Muhammad Ejaz Ahmed, Seyit Camtepe, and Yang Xiang (Swinburne University of Technology, Australia; Monash University, Australia; University of Adelaide, Australia; CSIRO’s Data61, Australia) The Android system manages access to sensitive APIs by permission enforcement. An application (app) must declare proper permissions before invoking specific Android APIs. However, there is no official documentation providing the complete list of permission-protected APIs and the corresponding permissions to date. Researchers have spent significant efforts extracting such API protection mapping from the Android API framework, which leverages static code analysis to determine if specific permissions are required before accessing an API. Nevertheless, none of them has attempted to analyze the protection mapping in the native library (i.e., code written in C and C++), an essential component of the Android framework that handles communication with the lower-level hardware, such as cameras and sensors. While the protection mapping can be utilized to detect various security vulnerabilities in Android apps, such as permission over-privilege, imprecise mapping will lead to false results in detecting such security vulnerabilities. To fill this gap, we thereby propose to construct the protection mapping involved in the native libraries of the Android framework to present a complete and accurate specification of Android API protection. We develop a prototype system, named NatiDroid, to facilitate the cross-language static analysis and compare its performance with two state-of-the-practice tools, termed Axplorer and Arcade. We evaluate NatiDroid on more than 11,000 Android apps, including system apps from custom Android ROMs and third-party apps from the Google Play. 
Our NatiDroid can identify up to 464 new API-permission mappings, in contrast to the worst-case results derived from both Axplorer and Arcade, where approximately 71% apps have at least one false positive in permission over-privilege. We have disclosed all the potential vulnerabilities detected to the stakeholders. @InProceedings{ESEC/FSE22p772, author = {Chaoran Li and Xiao Chen and Ruoxi Sun and Minhui Xue and Sheng Wen and Muhammad Ejaz Ahmed and Seyit Camtepe and Yang Xiang}, title = {Cross-Language Android Permission Specification}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {772--783}, doi = {10.1145/3540250.3549142}, year = {2022}, } Publisher's Version |
|
Canning, Mark |
ESEC/FSE '22-IND: "What Improves Developer Productivity ..."
What Improves Developer Productivity at Google? Code Quality
Lan Cheng, Emerson Murphy-Hill, Mark Canning, Ciera Jaspan, Collin Green, Andrea Knight, Nan Zhang, and Elizabeth Kammer (Google, USA) Understanding what affects software developer productivity can help organizations choose wise investments in their technical and social environment. But the research literature either focuses on what correlates with developer productivity in ecologically valid settings or focuses on what causes developer productivity in highly constrained settings. In this paper, we bridge the gap by studying software developers at Google through two analyses. In the first analysis, we use panel data with 39 productivity factors, finding that code quality, technical debt, infrastructure tools and support, team communication, goals and priorities, and organizational change and process are all causally linked to self-reported developer productivity. In the second analysis, we use a lagged panel analysis to strengthen our causal claims. We find that increases in perceived code quality tend to be followed by increased perceived developer productivity, but not vice versa, providing the strongest evidence to date that code quality affects individual developer productivity. @InProceedings{ESEC/FSE22p1302, author = {Lan Cheng and Emerson Murphy-Hill and Mark Canning and Ciera Jaspan and Collin Green and Andrea Knight and Nan Zhang and Elizabeth Kammer}, title = {What Improves Developer Productivity at Google? Code Quality}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1302--1313}, doi = {10.1145/3540250.3558940}, year = {2022}, } Publisher's Version Archive submitted (330 kB) |
|
Cao, Junming |
ESEC/FSE '22: "Understanding Performance ..."
Understanding Performance Problems in Deep Learning Systems
Junming Cao, Bihuan Chen, Chao Sun, Longjie Hu, Shuaihong Wu, and Xin Peng (Fudan University, China) Deep learning (DL) has been widely applied to many domains. Unique challenges in engineering DL systems are posed by the programming paradigm shift from traditional systems to DL systems, and performance is one of the challenges. Performance problems (PPs) in DL systems can cause severe consequences such as excessive resource consumption and financial loss. While bugs in DL systems have been extensively investigated, PPs in DL systems have hardly been explored. To bridge this gap, we present the first comprehensive study to i) characterize symptoms, root causes, and introducing and exposing stages of PPs in DL systems developed in TensorFlow and Keras, with 224 PPs collected from 210 StackOverflow posts, and to ii) assess the capability of existing performance analysis approaches in tackling PPs, with a constructed benchmark of 58 PPs in DL systems. Our findings shed light on the implications on developing high-performance DL systems, and detecting and localizing PPs in DL systems. To demonstrate the usefulness of our findings, we develop a static checker DeepPerf to detect three types of PPs. It has detected 488 new PPs in 130 GitHub projects. 105 and 27 PPs have been confirmed and fixed. @InProceedings{ESEC/FSE22p357, author = {Junming Cao and Bihuan Chen and Chao Sun and Longjie Hu and Shuaihong Wu and Xin Peng}, title = {Understanding Performance Problems in Deep Learning Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {357--369}, doi = {10.1145/3540250.3549123}, year = {2022}, } Publisher's Version Info Artifacts Reusable |
|
Cao, Li |
ESEC/FSE '22: "Actionable and Interpretable ..."
Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems
Zeyan Li, Nengwen Zhao, Mingjie Li, Xianglin Lu, Lixin Wang, Dongdong Chang, Xiaohui Nie, Li Cao, Wenchi Zhang, Kaixin Sui, Yanhua Wang, Xu Du, Guoqiang Duan, and Dan Pei (Tsinghua University, China; China Construction Bank, China; BizSeer, China) Fault localization is challenging in an online service system due to its monitoring data's large volume and variety and complex dependencies across/within its components (e.g., services or databases). Furthermore, engineers require fault localization solutions to be actionable and interpretable, which existing research approaches cannot satisfy. Therefore, the common industry practice is that, for a specific online service system, its experienced engineers focus on localization for recurring failures based on the knowledge accumulated about the system and historical failures. More specifically, 1) they can identify the underlying root causes and take mitigation actions when pinpointing a group of indicative metrics on the faulty component; 2) their diagnosis knowledge is roughly based on how one failure might affect the components in the whole system. Although the above common practice is actionable and interpretable, it is largely manual, thus slow and sometimes inaccurate. In this paper, we aim to automate this practice through machine learning. That is, we propose an actionable and interpretable fault localization approach, DejaVu, for recurring failures in online service systems. For a specific online service system, DejaVu takes historical failures and dependencies in the system as input and trains a localization model offline; for an incoming failure, the trained model online recommends where the failure occurs (i.e., the faulty components) and which kind of failure occurs (i.e., the indicative group of metrics) (thus actionable), which are further interpreted both globally and locally (thus interpretable). 
Based on the evaluation on 601 failures from three production systems and one open-source benchmark, in less than one second, DejaVu can rank the ground truths at 1.66∼5.03-th among a long candidate list on average, outperforming baselines by 54.52%. @InProceedings{ESEC/FSE22p996, author = {Zeyan Li and Nengwen Zhao and Mingjie Li and Xianglin Lu and Lixin Wang and Dongdong Chang and Xiaohui Nie and Li Cao and Wenchi Zhang and Kaixin Sui and Yanhua Wang and Xu Du and Guoqiang Duan and Dan Pei}, title = {Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {996--1008}, doi = {10.1145/3540250.3549092}, year = {2022}, } Publisher's Version |
|
Cao, Liqing |
ESEC/FSE '22: "Generic Sensitivity: Customizing ..."
Generic Sensitivity: Customizing Context-Sensitive Pointer Analysis for Generics
Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Yongheng Huang, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TianqiSoft, China) Generic programming has been extensively used in object-oriented programs such as Java. However, existing context-sensitive pointer analyses perform poorly in analyzing generics. This paper introduces generic sensitivity, a new context customization scheme targeting generics. We design our context customization scheme in such a way that generic instantiation sites, i.e., locations instantiating generic classes/methods with concrete types, are always preserved as key context elements. This is realized by augmenting contexts with a type variable lookup map, which is efficiently updated during the analysis in a context-sensitive manner. We have implemented different variants of generic-sensitive analysis in Wala and experimental results show that the generic customization scheme can significantly improve performance and precision of context-sensitive pointer analyses. For instance, generic context customization significantly improves precision of 1-object-sensitive analysis, with an average speedup of 1.8×. In addition, generic context customization enables a 1-object-sensitive analysis to achieve overall better precision than a 2-object-sensitive analysis, with an average speedup of 12.6× (62× for chart). @InProceedings{ESEC/FSE22p1110, author = {Haofeng Li and Jie Lu and Haining Meng and Liqing Cao and Yongheng Huang and Lian Li and Lin Gao}, title = {Generic Sensitivity: Customizing Context-Sensitive Pointer Analysis for Generics}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1110--1121}, doi = {10.1145/3540250.3549122}, year = {2022}, } Publisher's Version |
|
Cassee, Nathan |
ESEC/FSE '22-DOC: "Sentiment in Software Engineering: ..."
Sentiment in Software Engineering: Detection and Application
Nathan Cassee (Eindhoven University of Technology, Netherlands) In software engineering the role of human aspects is an important one, especially as developers indicate that they experience a wide range of emotions while developing software. Within software engineering researchers have sought to understand the role emotions and sentiment play in the development of software by studying issues, pull-requests and commit messages. To detect sentiment, automated tools are used, and in this doctoral thesis we plan to study the use of these sentiment analysis tools, their applications, best practices for their usage and the effect of non-natural language on their performance. In addition to studying the application of sentiment analysis tools, we also aim to study self-admitted technical debt and bots in software engineering, to understand why developers express sentiment and what they signal when they express sentiment. Through studying both the application of sentiment analysis tools and the role of sentiment in software engineering, we hope to provide practical recommendations for both researchers and developers. @InProceedings{ESEC/FSE22p1800, author = {Nathan Cassee}, title = {Sentiment in Software Engineering: Detection and Application}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1800--1804}, doi = {10.1145/3540250.3558908}, year = {2022}, } Publisher's Version |
|
Chakraborty, Saikat |
ESEC/FSE '22: "NatGen: Generative Pre-training ..."
NatGen: Generative Pre-training by “Naturalizing” Source Code
Saikat Chakraborty, Toufique Ahmed, Yangruibo Ding, Premkumar T. Devanbu, and Baishakhi Ray (Columbia University, USA; University of California at Davis, USA) Pre-trained Generative Language models (e.g., PLBART, CodeT5, SPT-Code) for source code yielded strong results on several tasks in the past few years, including code generation and translation. These models have adopted varying pre-training objectives to learn statistics of code construction from very large-scale corpora in a self-supervised fashion; the success of pre-trained models largely hinges on these pre-training objectives. This paper proposes a new pre-training objective, “Naturalizing” of source code, exploiting code’s bimodal, dual-channel (formal & natural channels) nature. Unlike natural language, code’s bimodal, dual-channel nature allows us to generate semantically equivalent code at scale. We introduce six classes of semantic preserving transformations to introduce unnatural forms of code, and then force our model to produce more natural original programs written by developers. Learning to generate equivalent, but more natural code, at scale, over large corpora of open-source code, without explicit manual supervision, helps the model learn to both ingest & generate code. We fine-tune our model in three generative Software Engineering tasks: code generation, code translation, and code refinement with limited human-curated labeled data and achieve state-of-the-art performance rivaling CodeT5. We show that our pre-trained model is especially competitive at zero-shot and few-shot learning, and better at learning code properties (e.g., syntax, data flow).
@InProceedings{ESEC/FSE22p18, author = {Saikat Chakraborty and Toufique Ahmed and Yangruibo Ding and Premkumar T. Devanbu and Baishakhi Ray}, title = {NatGen: Generative Pre-training by “Naturalizing” Source Code}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {18--30}, doi = {10.1145/3540250.3549162}, year = {2022}, } Publisher's Version |
|
Chandra, Satish |
ESEC/FSE '22-IND: "Leveraging Test Plan Quality ..."
Leveraging Test Plan Quality to Improve Code Review Efficacy
Lawrence Chen, Rui Abreu, Tobi Akomolede, Peter C. Rigby, Satish Chandra, and Nachiappan Nagappan (Meta Platforms, USA; Concordia University, Canada) In modern code reviews, many artifacts play roles in knowledge-sharing and documentation: summaries, test plans, and comments, etc. Improving developer tools and facilitating better code reviews require an understanding of the quality of pull requests and their artifacts. This is difficult to measure, however, because they are often free-form natural language and unstructured text data. In this paper, we focus on measuring the quality of test plans at Meta. Test plans are used as a communication mechanism between the author of a pull request and its reviewers, serving as walkthroughs to help confirm that the changed code is behaving as expected. We collected developer opinions on over 650 test plans from more than 500 Meta developers, then introduced a transformer-based model to leverage the success of natural language processing (NLP) techniques in the code review domain. In our study, we show that the learned model is able to capture the sentiment of developers and reflect a correlation of test plan quality with review engagement and reversions: compared to a decision tree model, our proposed transformer-based model achieves a 7% higher F1-score. Finally, we present a case study of how such a metric may be useful in experiments to inform improvements in developer tools and experiences. @InProceedings{ESEC/FSE22p1320, author = {Lawrence Chen and Rui Abreu and Tobi Akomolede and Peter C. Rigby and Satish Chandra and Nachiappan Nagappan}, title = {Leveraging Test Plan Quality to Improve Code Review Efficacy}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1320--1330}, doi = {10.1145/3540250.3558952}, year = {2022}, } Publisher's Version |
|
Chang, Dongdong |
ESEC/FSE '22: "Actionable and Interpretable ..."
Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems
Zeyan Li, Nengwen Zhao, Mingjie Li, Xianglin Lu, Lixin Wang, Dongdong Chang, Xiaohui Nie, Li Cao, Wenchi Zhang, Kaixin Sui, Yanhua Wang, Xu Du, Guoqiang Duan, and Dan Pei (Tsinghua University, China; China Construction Bank, China; BizSeer, China) Fault localization is challenging in an online service system due to its monitoring data's large volume and variety and complex dependencies across/within its components (e.g., services or databases). Furthermore, engineers require fault localization solutions to be actionable and interpretable, which existing research approaches cannot satisfy. Therefore, the common industry practice is that, for a specific online service system, its experienced engineers focus on localization for recurring failures based on the knowledge accumulated about the system and historical failures. More specifically, 1) they can identify the underlying root causes and take mitigation actions when pinpointing a group of indicative metrics on the faulty component; 2) their diagnosis knowledge is roughly based on how one failure might affect the components in the whole system. Although the above common practice is actionable and interpretable, it is largely manual, thus slow and sometimes inaccurate. In this paper, we aim to automate this practice through machine learning. That is, we propose an actionable and interpretable fault localization approach, DejaVu, for recurring failures in online service systems. For a specific online service system, DejaVu takes historical failures and dependencies in the system as input and trains a localization model offline; for an incoming failure, the trained model online recommends where the failure occurs (i.e., the faulty components) and which kind of failure occurs (i.e., the indicative group of metrics) (thus actionable), which are further interpreted both globally and locally (thus interpretable). 
Based on the evaluation on 601 failures from three production systems and one open-source benchmark, in less than one second, DejaVu can rank the ground truths at 1.66∼5.03-th among a long candidate list on average, outperforming baselines by 54.52%. @InProceedings{ESEC/FSE22p996, author = {Zeyan Li and Nengwen Zhao and Mingjie Li and Xianglin Lu and Lixin Wang and Dongdong Chang and Xiaohui Nie and Li Cao and Wenchi Zhang and Kaixin Sui and Yanhua Wang and Xu Du and Guoqiang Duan and Dan Pei}, title = {Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {996--1008}, doi = {10.1145/3540250.3549092}, year = {2022}, } Publisher's Version |
|
Chang, Zhiyuan |
ESEC/FSE '22: "Putting Them under Microscope: ..."
Putting Them under Microscope: A Fine-Grained Approach for Detecting Redundant Test Cases in Natural Language
Zhiyuan Chang, Mingyang Li, Junjie Wang, Qing Wang, and Shoubin Li (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China) Natural language (NL) documentation is the bridge between software managers and testers, and NL test cases are prevalent in system-level testing and other quality assurance activities. Due to reasons such as requirements redundancy, parallel testing, tester turn-over within long evolving history, there are inevitably lots of redundant test cases, which significantly increase the cost. Previous redundancy detection approaches typically treat the textual descriptions as a whole to compare their similarity and suffer from low precision. Our observation reveals that a test case can have explicit test-oriented entities, such as tested function Components, Constraints, etc; and there are also specific relations between these entities. This inspires us with a potential opportunity for accurate redundancy detection. In this paper, we first define five test-oriented entity categories and four associated relation categories, and re-formulate the NL test case redundancy detection problem as the comparison of detailed testing content guided by the test-oriented entities and relations. Following that, we propose Tscope, a fine-grained approach for redundant NL test case detection by dissecting test cases into atomic test tuple(s) with the entities restricted by associated relations. To serve as the test case dissection, Tscope designs a context-aware model for the automatic entity and relation extraction. Evaluation on 3,467 test cases from ten projects shows Tscope could achieve 91.8% precision, 74.8% recall and 82.4% F1, significantly outperforming state-of-the-art approaches and commonly-used classifiers. This new formulation of the NL test case redundant detection problem can motivate the follow-up studies in further improving this task and other related tasks involving NL descriptions. 
@InProceedings{ESEC/FSE22p1161, author = {Zhiyuan Chang and Mingyang Li and Junjie Wang and Qing Wang and Shoubin Li}, title = {Putting Them under Microscope: A Fine-Grained Approach for Detecting Redundant Test Cases in Natural Language}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1161--1172}, doi = {10.1145/3540250.3549089}, year = {2022}, } Publisher's Version |
|
Chaparro, Oscar |
ESEC/FSE '22: "Toward Interactive Bug Reporting ..."
Toward Interactive Bug Reporting for (Android App) End-Users
Yang Song, Junayed Mahmud, Ying Zhou, Oscar Chaparro, Kevin Moran, Andrian Marcus, and Denys Poshyvanyk (College of William and Mary, USA; George Mason University, USA; University of Texas at Dallas, USA) Many software bugs are reported manually, particularly bugs that manifest themselves visually in the user interface. End-users typically report these bugs via app reviewing websites, issue trackers, or in-app built-in bug reporting tools, if available. While these systems have various features that facilitate bug reporting (e.g., textual templates or forms), they often provide limited guidance, concrete feedback, or quality verification to end-users, who are often inexperienced at reporting bugs and submit low-quality bug reports that lead to excessive developer effort in bug report management tasks. We propose an interactive bug reporting system for end-users (Burt), implemented as a task-oriented chatbot. Unlike existing bug reporting systems, Burt provides guided reporting of essential bug report elements (i.e., the observed behavior, expected behavior, and steps to reproduce the bug), instant quality verification, and graphical suggestions for these elements. We implemented a version of Burt for Android and conducted an empirical evaluation study with end-users, who reported 12 bugs from six Android apps studied in prior work. The reporters found that Burt’s guidance and automated suggestions/clarifications are useful and Burt is easy to use. We found that Burt reports contain higher-quality information than reports collected via a template-based bug reporting system. Improvements to Burt, informed by the reporters, include support for various wordings to describe bug report elements and improved quality verification. Our work marks an important paradigm shift from static to interactive bug reporting for end-users. 
@InProceedings{ESEC/FSE22p344, author = {Yang Song and Junayed Mahmud and Ying Zhou and Oscar Chaparro and Kevin Moran and Andrian Marcus and Denys Poshyvanyk}, title = {Toward Interactive Bug Reporting for (Android App) End-Users}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {344--356}, doi = {10.1145/3540250.3549131}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Chatterjee, Amreeta |
ESEC/FSE '22: "A Case Study of Implicit Mentoring, ..."
A Case Study of Implicit Mentoring, Its Prevalence, and Impact in Apache
Zixuan Feng, Amreeta Chatterjee, Anita Sarma, and Iftekhar Ahmed (Oregon State University, USA; University of California at Irvine, USA) Mentoring is traditionally viewed as a dyadic, top-down apprenticeship. This perspective, however, overlooks other forms of informal mentoring taking place in everyday activities in which developers invest time and effort. Here, we investigate informal mentoring taking place in Open Source Software (OSS). We define a specific type of informal mentoring—implicit mentoring—situations where contributors guide others through instructions and suggestions embedded in everyday (OSS) activities. We defined implicit mentoring by first performing a review of related work on mentoring, and then through formative interviews with OSS contributors and member-checking. Next, through an empirical investigation of Pull Requests (PRs) in 37 Apache Projects, we built a classifier to extract implicit mentoring. Our analysis of 107,895 PRs shows that implicit mentoring does occur through code reviews (27.41% of all PRs included implicit mentoring) and is beneficial for both mentors and mentees. We analyzed the impact of implicit mentoring on OSS contributors by investigating their contributions and learning trajectories in their projects. Through an online survey (N=231), we then triangulated these results and identified the potential benefits of implicit mentoring from OSS contributors’ perspectives. @InProceedings{ESEC/FSE22p797, author = {Zixuan Feng and Amreeta Chatterjee and Anita Sarma and Iftekhar Ahmed}, title = {A Case Study of Implicit Mentoring, Its Prevalence, and Impact in Apache}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {797--809}, doi = {10.1145/3540250.3549167}, year = {2022}, } Publisher's Version |
|
Chatterjee, Ayan |
ESEC/FSE '22-IND: "Testing of Machine Learning ..."
Testing of Machine Learning Models with Limited Samples: An Industrial Vacuum Pumping Application
Ayan Chatterjee, Bestoun S. Ahmed, Erik Hallin, and Anton Engman (Karlstad University, Sweden; Uddeholms, Sweden) There is often a scarcity of training data for machine learning (ML) classification and regression models in industrial production, especially for time-consuming or sparsely run manufacturing processes. Traditionally, a majority of the limited ground-truth data is used for training, while a handful of samples are left for testing. In that case, the number of test samples is inadequate to properly evaluate the robustness of the ML models under test (i.e., the system under test) for classification and regression. Furthermore, the output of these ML models may be inaccurate or even fail if the input data differ from the expected. This is the case for ML models used in the Electroslag Remelting (ESR) process in the refined steel industry to predict the pressure in a vacuum chamber. A vacuum pumping event that occurs once a workday generates a few hundred samples in a year of pumping for training and testing. In the absence of adequate training and test samples, this paper first presents a method to generate a fresh set of augmented samples based on vacuum pumping principles. Based on the generated augmented samples, three test scenarios and one test oracle are presented to assess the robustness of an ML model used for production on an industrial scale. Experiments are conducted with real industrial production data obtained from Uddeholms AB steel company. The evaluations indicate that Ensemble and Neural Network are the most robust when trained on augmented data using the proposed testing strategy. The evaluation also demonstrates the proposed method's effectiveness in checking and improving ML algorithms' robustness in such situations. The work improves software testing's state-of-the-art robustness testing in similar settings. 
Finally, the paper presents an MLOps implementation of the proposed approach for real-time ML model prediction and action on the edge node and automated continuous delivery of ML software from the cloud. @InProceedings{ESEC/FSE22p1280, author = {Ayan Chatterjee and Bestoun S. Ahmed and Erik Hallin and Anton Engman}, title = {Testing of Machine Learning Models with Limited Samples: An Industrial Vacuum Pumping Application}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1280--1290}, doi = {10.1145/3540250.3558943}, year = {2022}, } Publisher's Version |
|
Chauhan, Jigyasa |
ESEC/FSE '22: "An Exploratory Study on the ..."
An Exploratory Study on the Predominant Programming Paradigms in Python Code
Robert Dyer and Jigyasa Chauhan (University of Nebraska-Lincoln, USA) Python is a multi-paradigm programming language that fully supports object-oriented (OO) programming. The language allows writing code in a non-procedural imperative manner, using procedures, using classes, or in a functional style. To date, no one has studied what paradigm(s), if any, are predominant in Python code and projects. In this work, we first define a technique to classify Python files into predominant paradigm(s). We then automate our approach and evaluate it against human judgements, showing over 80% agreement. We then analyze over 100k open-source Python projects, automatically classifying each source file and investigating the paradigm distributions. The results indicate Python developers tend to heavily favor OO features. We also observed a positive correlation between OO and procedural paradigms and the size of the project. And despite few files or projects being predominantly functional, we still found many functional feature uses. @InProceedings{ESEC/FSE22p684, author = {Robert Dyer and Jigyasa Chauhan}, title = {An Exploratory Study on the Predominant Programming Paradigms in Python Code}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {684--695}, doi = {10.1145/3540250.3549158}, year = {2022}, } Publisher's Version Info Artifacts Reusable |
|
Chechik, Marsha |
ESEC/FSE '22-INV: "On Safety, Assurance, and ..."
On Safety, Assurance, and Reliability: A Software Engineering Perspective (Keynote)
Marsha Chechik (University of Toronto, Canada) From financial services platforms to social networks to vehicle control, software has come to mediate many activities of daily life. Governing bodies and standards organizations have responded to this trend by creating regulations and standards to address issues such as safety, security and privacy. In this environment, the compliance of software development to standards and regulations has emerged as a key requirement. Compliance claims and arguments are often captured in assurance cases, with linked evidence of compliance. Evidence can come from test cases, verification proofs, human judgement, or a combination of these. That is, we try to build (safety-critical) systems carefully according to well justified methods and articulate these justifications in an assurance case that is ultimately judged by a human. Building safety arguments for traditional software systems is difficult — they are lengthy and expensive to maintain, especially as software undergoes change. Safety is also notoriously noncompositional — each subsystem might be safe but together they may create unsafe behaviors. It is also easy to miss cases, which in the simplest case would mean developing an argument for when a condition is true but missing arguing for a false condition. Furthermore, many ML-based systems are becoming safety-critical. For example, recent Tesla self-driving cars misclassified emergency vehicles and caused multiple crashes. ML-based systems typically do not have precisely specified and machine-verifiable requirements. While some safety requirements can be stated clearly: “the system should detect all pedestrians at a crossing”, these requirements are for the entire system, making them too high-level for safety analysis of individual components. Thus, systems with ML components (MLCs) add a significant layer of complexity for safety assurance. 
I argue that safety assurance should be an integral part of building safe and reliable software systems, but this process needs support from advanced software engineering and software analysis. In this talk, I outline a few approaches for development of principled, tool-supported methodologies for creating and managing assurance arguments. I then describe some of the recent work on specifying and verifying reliability requirements for machine-learned components in safety-critical domains. @InProceedings{ESEC/FSE22p2, author = {Marsha Chechik}, title = {On Safety, Assurance, and Reliability: A Software Engineering Perspective (Keynote)}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {2--2}, doi = {10.1145/3540250.3569443}, year = {2022}, } Publisher's Version |
|
Chen, Bihuan |
ESEC/FSE '22: "Understanding Performance ..."
Understanding Performance Problems in Deep Learning Systems
Junming Cao, Bihuan Chen, Chao Sun, Longjie Hu, Shuaihong Wu, and Xin Peng (Fudan University, China) Deep learning (DL) has been widely applied to many domains. Unique challenges in engineering DL systems are posed by the programming paradigm shift from traditional systems to DL systems, and performance is one of the challenges. Performance problems (PPs) in DL systems can cause severe consequences such as excessive resource consumption and financial loss. While bugs in DL systems have been extensively investigated, PPs in DL systems have hardly been explored. To bridge this gap, we present the first comprehensive study to i) characterize symptoms, root causes, and introducing and exposing stages of PPs in DL systems developed in TensorFlow and Keras, with 224 PPs collected from 210 StackOverflow posts, and to ii) assess the capability of existing performance analysis approaches in tackling PPs, with a constructed benchmark of 58 PPs in DL systems. Our findings shed light on the implications on developing high-performance DL systems, and detecting and localizing PPs in DL systems. To demonstrate the usefulness of our findings, we develop a static checker DeepPerf to detect three types of PPs. It has detected 488 new PPs in 130 GitHub projects. 105 and 27 PPs have been confirmed and fixed. @InProceedings{ESEC/FSE22p357, author = {Junming Cao and Bihuan Chen and Chao Sun and Longjie Hu and Shuaihong Wu and Xin Peng}, title = {Understanding Performance Problems in Deep Learning Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {357--369}, doi = {10.1145/3540250.3549123}, year = {2022}, } Publisher's Version Info Artifacts Reusable
ESEC/FSE '22: "Tracking Patches for Open ..."
Tracking Patches for Open Source Software Vulnerabilities
Congying Xu, Bihuan Chen, Chenhao Lu, Kaifeng Huang, Xin Peng, and Yang Liu (Fudan University, China; Nanyang Technological University, Singapore) Open source software (OSS) vulnerabilities threaten the security of software systems that use OSS. Vulnerability databases provide valuable information (e.g., vulnerable version and patch) to mitigate OSS vulnerabilities. There arises a growing concern about the information quality of vulnerability databases. However, it is unclear what the quality of patches in existing vulnerability databases is; and existing manual or heuristic-based approaches for patch tracking are either too expensive or too specific to apply to all OSS vulnerabilities. To address these problems, we first conduct an empirical study to understand the quality and characteristics of patches for OSS vulnerabilities in two industrial vulnerability databases. Inspired by our study, we then propose the first automated approach, Tracer, to track patches for OSS vulnerabilities from multiple knowledge sources. Our evaluation has demonstrated that i) Tracer can track patches for up to 273.8% more vulnerabilities than heuristic-based approaches while achieving a higher F1-score by up to 116.8%; and ii) Tracer can complement industrial vulnerability databases. Our evaluation has also indicated the generality and practical usefulness of Tracer. @InProceedings{ESEC/FSE22p860, author = {Congying Xu and Bihuan Chen and Chenhao Lu and Kaifeng Huang and Xin Peng and Yang Liu}, title = {Tracking Patches for Open Source Software Vulnerabilities}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {860--871}, doi = {10.1145/3540250.3549125}, year = {2022}, } Publisher's Version |
|
Chen, Chunyang |
ESEC/FSE '22: "Psychologically-Inspired, ..."
Psychologically-Inspired, Unsupervised Inference of Perceptual Groups of GUI Widgets from GUI Images
Mulong Xie, Zhenchang Xing, Sidong Feng, Xiwei Xu, Liming Zhu, and Chunyang Chen (Australian National University, Australia; CSIRO’s Data61, Australia; Monash University, Australia; UNSW, Australia) Graphical User Interface (GUI) is not merely a collection of individual and unrelated widgets, but rather partitions discrete widgets into groups by various visual cues, thus forming higher-order perceptual units such as tab, menu, card or list. The ability to automatically segment a GUI into perceptual groups of widgets constitutes a fundamental component of visual intelligence to automate GUI design, implementation and automation tasks. Although humans can partition a GUI into meaningful perceptual groups of widgets in a highly reliable way, perceptual grouping is still an open challenge for computational approaches. Existing methods rely on ad-hoc heuristics or supervised machine learning that is dependent on specific GUI implementations and runtime information. Research in psychology and biological vision has formulated a set of principles (i.e., Gestalt theory of perception) that describe how humans group elements in visual scenes based on visual cues like connectivity, similarity, proximity and continuity. These principles are domain-independent and have been widely adopted by practitioners to structure content on GUIs to improve aesthetic pleasantness and usability. Inspired by these principles, we present a novel unsupervised image-based method for inferring perceptual groups of GUI widgets. Our method requires only GUI pixel images, is independent of GUI implementation, and does not require any training data. The evaluation on a dataset of 1,091 GUIs collected from 772 mobile apps and 20 UI design mockups shows that our method significantly outperforms the state-of-the-art ad-hoc heuristics-based baseline. Our perceptual grouping method creates opportunities for improving UI-related software engineering tasks. 
@InProceedings{ESEC/FSE22p332, author = {Mulong Xie and Zhenchang Xing and Sidong Feng and Xiwei Xu and Liming Zhu and Chunyang Chen}, title = {Psychologically-Inspired, Unsupervised Inference of Perceptual Groups of GUI Widgets from GUI Images}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {332--343}, doi = {10.1145/3540250.3549138}, year = {2022}, } Publisher's Version Info |
|
Chen, Feng |
ESEC/FSE '22: "Generating Realistic Vulnerabilities ..."
Generating Realistic Vulnerabilities via Neural Code Editing: An Empirical Study
Yu Nong, Yuzhe Ou, Michael Pradel, Feng Chen, and Haipeng Cai (Washington State University, USA; University of Texas at Dallas, USA; University of Stuttgart, Germany) The availability of large-scale, realistic vulnerability datasets is essential both for benchmarking existing techniques and for developing effective new data-driven approaches for software security. Yet such datasets are critically lacking. A promising solution is to generate such datasets by injecting vulnerabilities into real-world programs, which are richly available. Thus, in this paper, we explore the feasibility of vulnerability injection through neural code editing. With a synthetic dataset and a real-world one, we investigate the potential and gaps of three state-of-the-art neural code editors for vulnerability injection. We find that the studied editors have critical limitations on the real-world dataset, where the best accuracy is only 10.03%, versus 79.40% on the synthetic dataset. While the graph-based editors are more effective (successfully injecting vulnerabilities in up to 34.93% of real-world testing samples) than the sequence-based one (0 success), they still suffer from complex code structures and fall short for long edits due to their insufficient designs of the preprocessing and deep learning (DL) models. We reveal the promise of neural code editing for generating realistic vulnerable samples, as they help boost the effectiveness of DL-based vulnerability detectors by up to 49.51% in terms of F1 score. We also provide insights into the gaps in current editors (e.g., they are good at deleting but not at replacing code) and actionable suggestions for addressing them (e.g., designing effective editing primitives). 
@InProceedings{ESEC/FSE22p1097, author = {Yu Nong and Yuzhe Ou and Michael Pradel and Feng Chen and Haipeng Cai}, title = {Generating Realistic Vulnerabilities via Neural Code Editing: An Empirical Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1097--1109}, doi = {10.1145/3540250.3549128}, year = {2022}, } Publisher's Version Artifacts Functional |
|
Chen, Haoxian |
ESEC/FSE '22: "Declarative Smart Contracts ..."
Declarative Smart Contracts
Haoxian Chen, Gerald Whitters, Mohammad Javad Amiri, Yuepeng Wang, and Boon Thau Loo (University of Pennsylvania, USA; Simon Fraser University, Canada) This paper presents DeCon, a declarative programming language for implementing smart contracts and specifying contract-level properties. Driven by the observation that smart contract operations and contract-level properties can be naturally expressed as relational constraints, DeCon models each smart contract as a set of relational tables that store transaction records. This relational representation of smart contracts enables convenient specification of contract properties, facilitates run-time monitoring of potential property violations, and brings clarity to contract debugging via data provenance. Specifically, a DeCon program consists of a set of declarative rules and violation query rules over the relational representation, describing the smart contract implementation and contract-level properties, respectively. We have developed a tool that can compile DeCon programs into executable Solidity programs, with instrumentation for run-time property monitoring. Our case studies demonstrate that DeCon can implement realistic smart contracts such as ERC20 and ERC721 digital tokens. Our evaluation results reveal the marginal overhead of DeCon compared to the open-source reference implementation, incurring 14% median gas overhead for execution, and another 16% median gas overhead for run-time verification. @InProceedings{ESEC/FSE22p281, author = {Haoxian Chen and Gerald Whitters and Mohammad Javad Amiri and Yuepeng Wang and Boon Thau Loo}, title = {Declarative Smart Contracts}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {281--293}, doi = {10.1145/3540250.3549121}, year = {2022}, } Publisher's Version Artifacts Functional |
|
Chen, Lawrence |
ESEC/FSE '22-IND: "Leveraging Test Plan Quality ..."
Leveraging Test Plan Quality to Improve Code Review Efficacy
Lawrence Chen, Rui Abreu, Tobi Akomolede, Peter C. Rigby, Satish Chandra, and Nachiappan Nagappan (Meta Platforms, USA; Concordia University, Canada) In modern code reviews, many artifacts play roles in knowledge sharing and documentation: summaries, test plans, comments, etc. Improving developer tools and facilitating better code reviews require an understanding of the quality of pull requests and their artifacts. This is difficult to measure, however, because they are often free-form natural language and unstructured text data. In this paper, we focus on measuring the quality of test plans at Meta. Test plans are used as a communication mechanism between the author of a pull request and its reviewers, serving as walkthroughs to help confirm that the changed code is behaving as expected. We collected developer opinions on over 650 test plans from more than 500 Meta developers, then introduced a transformer-based model to leverage the success of natural language processing (NLP) techniques in the code review domain. In our study, we show that the learned model is able to capture the sentiment of developers and reflect a correlation of test plan quality with review engagement and reversions: compared to a decision tree model, our proposed transformer-based model achieves a 7% higher F1-score. Finally, we present a case study of how such a metric may be useful in experiments to inform improvements in developer tools and experiences. @InProceedings{ESEC/FSE22p1320, author = {Lawrence Chen and Rui Abreu and Tobi Akomolede and Peter C. Rigby and Satish Chandra and Nachiappan Nagappan}, title = {Leveraging Test Plan Quality to Improve Code Review Efficacy}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1320--1330}, doi = {10.1145/3540250.3558952}, year = {2022}, } Publisher's Version
ESEC/FSE '22-IND: "Understanding Why We Cannot ..."
Understanding Why We Cannot Model How Long a Code Review Will Take: An Industrial Case Study
Lawrence Chen, Peter C. Rigby, and Nachiappan Nagappan (Meta, USA; Concordia University, Canada) Code review is an effective practice for finding defects, but because it is manually intensive it can slow down the continuous integration of changes. Our goal was to understand the factors that influenced the time a change, i.e., a diff at Meta, would spend in review. A developer survey showed that diff reviews start to feel slow after they have been waiting for review for around 24 hours. We built a review time predictor model to identify potential factors that may be causing reviews to take longer, which we could use to predict when would be the best time to nudge reviewers or to identify diff-related factors that we may need to address. The strongest feature of the time spent in review model we built was the day of the week because diffs submitted near the weekend may have to wait for Monday for review. After removing time on weekends, the remaining features, including the size of the diff and the number of meetings the reviewers have, did not provide substantial predictive power, so the model could not predict how long a code review would take. We contributed to the effort to reduce stale diffs by suggesting that diffs be nudged near the start of the workday and that diffs published near the weekend be nudged sooner on Friday to avoid waiting the entire weekend. We use a nudging threshold rather than a model because we showed that TimeInReview cannot be accurately modelled. The NudgeBot has been rolled out to over 30k developers at Meta. @InProceedings{ESEC/FSE22p1314, author = {Lawrence Chen and Peter C. Rigby and Nachiappan Nagappan}, title = {Understanding Why We Cannot Model How Long a Code Review Will Take: An Industrial Case Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1314--1319}, doi = {10.1145/3540250.3558945}, year = {2022}, } Publisher's Version
ESEC/FSE '22: "Using Nudges to Accelerate ..."
Using Nudges to Accelerate Code Reviews at Scale
Qianhua Shan, David Sukhdeo, Qianying Huang, Seth Rogers, Lawrence Chen, Elise Paradis, Peter C. Rigby, and Nachiappan Nagappan (Meta, USA; Concordia University, Canada) We describe a large-scale study to reduce the amount of time code review takes. Each quarter at Meta we survey developers. Combining sentiment data from a developer experience survey and telemetry data from our diff review tool, we address, “When does a diff review feel too slow?” From the sentiment data alone, we learn that 84.7% of developers are satisfied with the time their diffs spend in review. By enriching the survey results with telemetry for each respondent, we determined that sentiment is closely associated with the 75th percentile time in review for that respondent’s diffs, i.e., those that take more than 24 hours. To encourage developers to act on stale diffs that have had no action for 24 or more hours, we designed a NudgeBot to notify, i.e., nudge, reviewers. To determine who to nudge when a diff is stale, we created a model to rank the reviewers based on the probability that they will make a comment or perform some other action on a diff. This model outperformed models that looked at files the reviewer had modified in the past. Combining this information with prior author-review relationships, we achieved an MRR and AUC of .81 and .88, respectively. To evaluate NudgeBot in production, we conducted an A/B cluster-randomized experiment on over 30k engineers. We observed a substantial, statistically significant decrease in both time in review (-6.8%, p=0.049) and time to first reviewer action (-9.9%, p=0.010). We also used guard metrics to ensure that most reviews were still done in fewer than 24 hours and that reviewers still spend the same amount of time looking at diffs, and saw no statistically significant change in these metrics. NudgeBot is now rolled out company-wide and is used daily by thousands of engineers at Meta.
@InProceedings{ESEC/FSE22p472, author = {Qianhua Shan and David Sukhdeo and Qianying Huang and Seth Rogers and Lawrence Chen and Elise Paradis and Peter C. Rigby and Nachiappan Nagappan}, title = {Using Nudges to Accelerate Code Reviews at Scale}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {472--482}, doi = {10.1145/3540250.3549104}, year = {2022}, } Publisher's Version |
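The reviewer-ranking result above is reported as a mean reciprocal rank (MRR). As a minimal sketch of how that metric is usually computed: for each stale diff, take the reciprocal of the position of the first ranked reviewer who actually acted, then average over all diffs. The function name, reviewer names, and rankings below are invented for illustration and are not data or code from the paper.

```python
from typing import List, Set


def mean_reciprocal_rank(rankings: List[List[str]], acted: List[Set[str]]) -> float:
    """Average of 1/rank of the first relevant reviewer in each ranked list.

    A ranked list that contains no relevant reviewer contributes 0.
    """
    if len(rankings) != len(acted):
        raise ValueError("rankings and acted must have the same length")
    if not rankings:
        raise ValueError("at least one ranking is required")
    total = 0.0
    for ranked, relevant in zip(rankings, acted):
        for position, reviewer in enumerate(ranked, start=1):
            if reviewer in relevant:
                total += 1.0 / position
                break  # only the first relevant reviewer counts
    return total / len(rankings)


# Hypothetical example: three stale diffs, each with a ranked reviewer list
# and the set of reviewers who later acted on the diff.
rankings = [["ana", "bo", "cy"], ["bo", "cy", "ana"], ["cy", "ana", "bo"]]
acted = [{"ana"}, {"cy"}, {"dee"}]
mrr = mean_reciprocal_rank(rankings, acted)  # (1/1 + 1/2 + 0) / 3
```

An MRR close to 1 means the reviewer who acts is typically at or near the top of the ranked list, which is what matters when only one or two reviewers are nudged per diff.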
|
Chen, Qiuyuan |
ESEC/FSE '22: "What Motivates Software Practitioners ..."
What Motivates Software Practitioners to Contribute to Inner Source?
Zhiyuan Wan, Xin Xia, Yun Zhang, David Lo, Daibing Zhou, Qiuyuan Chen, and Ahmed E. Hassan (Zhejiang University, China; Huawei, China; Zhejiang University City College, China; Singapore Management University, Singapore; Queen’s University, Canada) Software development organizations have adopted open source development practices to support or augment their software development processes, a phenomenon referred to as inner source. Given the rapid adoption of inner source, we wonder what motivates software practitioners to contribute to inner source projects. We followed a mixed-methods approach--a qualitative phase of interviews with 20 interviewees, followed by a quantitative phase of an exploratory survey with 124 respondents from 13 countries across four continents. Our study uncovers practitioners' motivation to contribute to inner source projects, as well as how the motivation differs from what motivates practitioners to participate in open source projects. We also investigate how software practitioners' motivation impacts their contribution level and continuance intention in inner source projects. Based on our findings, we outline directions for future research and provide recommendations for organizations and software practitioners. @InProceedings{ESEC/FSE22p132, author = {Zhiyuan Wan and Xin Xia and Yun Zhang and David Lo and Daibing Zhou and Qiuyuan Chen and Ahmed E. Hassan}, title = {What Motivates Software Practitioners to Contribute to Inner Source?}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {132--144}, doi = {10.1145/3540250.3549148}, year = {2022}, } Publisher's Version |
|
Chen, Rong |
ESEC/FSE '22: "Detecting Simulink Compiler ..."
Detecting Simulink Compiler Bugs via Controllable Zombie Blocks Mutation
Shikai Guo, He Jiang, Zhihao Xu, Xiaochen Li, Zhilei Ren, Zhide Zhou, and Rong Chen (Dalian Maritime University, China; Dalian University of Technology, China) As a popular Cyber-Physical System (CPS) development tool chain, MathWorks Simulink is widely used to prototype CPS models in safety-critical applications, e.g., aerospace and healthcare. It is crucial to ensure the correctness and reliability of the Simulink compiler (i.e., the compiler module of Simulink) in practice since all CPS models depend on compilation. However, Simulink compiler testing is challenging due to millions of lines of source code and the lack of a complete formal language specification. Although several methods have been proposed to automatically test the Simulink compiler, there still remain two challenges to be tackled, namely the limited variant space and the insufficient mutation diversity. To address these challenges, we propose COMBAT, a new differential testing method for Simulink compiler testing. COMBAT includes an EMI (Equivalence Modulo Input) mutation component and a diverse variant generation component. The EMI mutation component inserts assertion statements (e.g., If/While blocks) at arbitrary points of the seed CPS model. These statements break each insertion point into true and false branches. Then, COMBAT feeds all the data passed through the insertion point into the true branch to preserve the equivalence of CPS variants. In such a way, the body of the false branch could be viewed as a new variant space, thus addressing the first challenge. The diverse variant generation component uses Markov chain Monte Carlo optimization to sample the seed CPS model and generate complex mutations of long sequences of blocks in the variant space, thus addressing the second challenge. Experiments demonstrate that COMBAT significantly outperforms the state-of-the-art approaches in Simulink compiler testing. 
Within five months, COMBAT has reported 16 valid bugs for Simulink R2021b, of which 11 bugs have been confirmed as new bugs by MathWorks Support. @InProceedings{ESEC/FSE22p1061, author = {Shikai Guo and He Jiang and Zhihao Xu and Xiaochen Li and Zhilei Ren and Zhide Zhou and Rong Chen}, title = {Detecting Simulink Compiler Bugs via Controllable Zombie Blocks Mutation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1061--1072}, doi = {10.1145/3540250.3549159}, year = {2022}, } Publisher's Version |
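The EMI mutation idea in the abstract above can be illustrated with a toy sketch. This is not COMBAT and does not touch Simulink: a "model" here is assumed to be a plain list of Python functions, and `insert_zombie_branch` is a hypothetical helper. It inserts a branch whose condition always holds, routes the data through the true branch, and leaves the false branch free to hold arbitrary (never executed) blocks, so seed and variant should agree on every input.

```python
from typing import Callable, List

Block = Callable[[float], float]


def run(model: List[Block], x: float) -> float:
    """Execute the blocks of a toy model in sequence."""
    for block in model:
        x = block(x)
    return x


def insert_zombie_branch(model: List[Block], index: int, zombie: Block) -> List[Block]:
    """Return a variant of `model` with an always-true branch inserted at `index`."""
    always_true = True  # stands in for an assertion that holds on all inputs

    def branch(x: float) -> float:
        if always_true:
            return x          # true branch: data passes through unchanged
        return zombie(x)      # false branch: the mutable "zombie" region, never run

    return model[:index] + [branch] + model[index:]


# Hypothetical seed model and one variant with an arbitrary zombie block.
seed = [lambda x: x + 1.0, lambda x: x * 2.0]
variant = insert_zombie_branch(seed, 1, lambda x: x ** 0.5 - 100.0)

# Differential check: any disagreement would point at the toolchain under test.
inputs = [-3.0, 0.0, 2.5]
equivalent = all(run(seed, v) == run(variant, v) for v in inputs)
```

Because the zombie block is unreachable, it can be mutated freely without changing the expected output, which is what makes the false branch usable as a variant space.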
|
Chen, Sen |
ESEC/FSE '22: "Large-Scale Analysis of Non-Termination ..."
Large-Scale Analysis of Non-Termination Bugs in Real-World OSS Projects
Xiuhan Shi, Xiaofei Xie, Yi Li, Yao Zhang, Sen Chen, and Xiaohong Li (Tianjin University, China; Singapore Management University, Singapore; Nanyang Technological University, Singapore) Termination is a crucial program property. Non-termination bugs can be subtle to detect and may remain hidden for long before they take effect. Many real-world programs still suffer from vast consequences (e.g., no response) caused by non-termination bugs. As a classic problem, termination proving has been studied for many years. Many termination checking tools and techniques have been developed and demonstrated effectiveness on existing well-established benchmarks. However, the capability of these tools in finding practical non-termination bugs has yet to be tested on real-world projects. To fill in this gap, in this paper, we conducted the first large-scale empirical study of non-termination bugs in real-world OSS projects. Specifically, we first devoted substantial manual efforts in collecting and analyzing 445 non-termination bugs from 3,142 GitHub commits and provided a systematic classification of the bugs based on their root causes. We constructed a new benchmark set characterizing the real-world bugs with simplified programs, including a non-termination dataset with 56 real and reproducible non-termination bugs and a termination dataset with 58 fixed programs. With the constructed benchmark, we evaluated five state-of-the-art termination analysis tools. The results show that the capabilities of the tested tools to make correct verdicts have obviously dropped compared with the existing benchmarks. Meanwhile, we identified the challenges and limitations that these tools face by analyzing the root causes of their unhandled bugs. Finally, we summarized the challenges and future research directions for detecting non-termination bugs in real-world projects. 
@InProceedings{ESEC/FSE22p256, author = {Xiuhan Shi and Xiaofei Xie and Yi Li and Yao Zhang and Sen Chen and Xiaohong Li}, title = {Large-Scale Analysis of Non-Termination Bugs in Real-World OSS Projects}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {256--268}, doi = {10.1145/3540250.3549129}, year = {2022}, } Publisher's Version Info |
|
Chen, Simin |
ESEC/FSE '22: "NMTSloth: Understanding and ..."
NMTSloth: Understanding and Testing Efficiency Degradation of Neural Machine Translation Systems
Simin Chen, Cong Liu, Mirazul Haque, Zihe Song, and Wei Yang (University of Texas at Dallas, USA) Neural Machine Translation (NMT) systems have received much recent attention due to their human-level accuracy. While existing works mostly focus on either improving accuracy or testing accuracy robustness, the computation efficiency of NMT systems, which is of paramount importance due to often vast translation demands and real-time requirements, has surprisingly received little attention. In this paper, we make the first attempt to understand and test potential computation efficiency robustness in state-of-the-art NMT systems. By analyzing the working mechanism and implementation of 1,455 publicly accessible NMT systems, we observe a fundamental property in NMT systems that could be manipulated in an adversarial manner to reduce computation efficiency significantly. Our interesting observation is that the output length determines the computation efficiency of NMT systems instead of the input, where the output length depends on two factors: an often sufficiently large yet pessimistic pre-configured threshold controlling the max number of iterations and a runtime generated end of sentence (EOS) token. Our key motivation is to generate test inputs that could sufficiently delay the generation of EOS such that NMT systems would have to go through enough iterations to satisfy the pre-configured threshold. We present NMTSloth, which develops a gradient-guided technique that searches for a minimal and unnoticeable perturbation at character-level, token-level, and structure-level, which sufficiently delays the appearance of EOS and forces these inputs to reach the naturally-unreachable threshold. To demonstrate the effectiveness of NMTSloth, we conduct a systematic evaluation on three publicly available NMT systems: Google T5, AllenAI WMT14, and Helsinki-NLP translators. 
Experimental results show that NMTSloth can increase NMT systems' response latency and energy consumption by 85% to 3153% and 86% to 3052%, respectively, by perturbing just one character or token in the input sentence. Our case study shows that inputs generated by NMTSloth significantly affect the battery power in real-world mobile devices (i.e., drain more than 30 times as much battery power as normal inputs). @InProceedings{ESEC/FSE22p1148, author = {Simin Chen and Cong Liu and Mirazul Haque and Zihe Song and Wei Yang}, title = {NMTSloth: Understanding and Testing Efficiency Degradation of Neural Machine Translation Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1148--1160}, doi = {10.1145/3540250.3549102}, year = {2022}, } Publisher's Version |
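The property this abstract relies on, that decoding cost is driven by when EOS appears and is capped only by a pre-configured iteration threshold, can be seen in a minimal greedy decoding loop. The sketch below is an illustration under stated assumptions, not NMTSloth's gradient-guided search: `toy_model` is a hypothetical stand-in for a real decoder, and the `"~"` marker is an invented trigger that suppresses EOS.

```python
from typing import Callable, List

EOS = "<eos>"


def greedy_decode(next_token: Callable[[List[str], List[str]], str],
                  source: List[str], max_length: int) -> List[str]:
    """Generate tokens until EOS is produced or max_length steps have run."""
    output: List[str] = []
    for _ in range(max_length):       # pre-configured iteration threshold
        token = next_token(source, output)
        if token == EOS:              # runtime-generated stop signal
            break
        output.append(token)
    return output


def toy_model(source: List[str], output: List[str]) -> str:
    """Hypothetical decoder: copies the source, then emits EOS.

    If the source contains the marker "~", EOS is never emitted.
    """
    if "~" in source:
        return "x"
    if len(output) < len(source):
        return source[len(output)]
    return EOS


# Two inputs of the same length lead to very different numbers of decoder steps.
normal = greedy_decode(toy_model, ["a", "b", "c"], max_length=200)
delayed = greedy_decode(toy_model, ["a", "~", "c"], max_length=200)
```

Here the benign input stops after three tokens, while the perturbed input of equal length runs all 200 iterations, which is the kind of gap in latency and energy the abstract describes.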
|
Chen, Taolue |
ESEC/FSE '22: "DeJITLeak: Eliminating JIT-Induced ..."
DeJITLeak: Eliminating JIT-Induced Timing Side-Channel Leaks
Qi Qin, JulianAndres JiYang, Fu Song, Taolue Chen, and Xinyu Xing (ShanghaiTech University, China; Birkbeck University of London, UK; Northwestern University, USA) Timing side-channels can be exploited to infer secret information when the execution time of a program is correlated with secrets. Recent work has shown that Just-In-Time (JIT) compilation can introduce new timing side-channels in programs even if they are time-balanced at the source code level. In this paper, we propose a novel approach to eliminate JIT-induced leaks. We first formalise timing side-channel security under JIT compilation via the notion of time-balancing, laying the foundation for reasoning about programs with JIT compilation. We then propose to eliminate JIT-induced leaks via a fine-grained JIT compilation. To this end, we provide an automated approach to generate compilation policies and a novel type system to guarantee its soundness. We develop a tool DeJITLeak for real-world Java and implement the fine-grained JIT compilation in HotSpot JVM. Experimental results show that DeJITLeak can effectively and efficiently eliminate JIT-induced leaks on three widely adopted benchmarks in the setting of side-channel detection. @InProceedings{ESEC/FSE22p872, author = {Qi Qin and JulianAndres JiYang and Fu Song and Taolue Chen and Xinyu Xing}, title = {DeJITLeak: Eliminating JIT-Induced Timing Side-Channel Leaks}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {872--884}, doi = {10.1145/3540250.3549150}, year = {2022}, } Publisher's Version Info Artifacts Reusable |
|
Chen, Wei |
ESEC/FSE '22: "MOSAT: Finding Safety Violations ..."
MOSAT: Finding Safety Violations of Autonomous Driving Systems using Multi-objective Genetic Algorithm
Haoxiang Tian, Yan Jiang, Guoquan Wu, Jiren Yan, Jun Wei, Wei Chen, Shuo Li, and Dan Ye (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; University of Chinese Academy of Sciences Nanjing College, China; China Southern Power Grid, China; University of Chinese Academy of Sciences Chongqing School, China; Nanjing Institute of Software Technology, China) Autonomous Driving Systems (ADSs) are safety-critical systems, and safety violations of Autonomous Vehicles (AVs) in real traffic will cause huge losses. Therefore, ADSs must be fully tested before being deployed on real-world roads. Simulation testing is essential to find safety violations of ADSs. This paper proposes MOSAT, a multi-objective search-based testing framework, which constructs diverse and adversarial driving environments to expose safety violations of ADSs. Specifically, based on atomic driving maneuvers, MOSAT introduces the motif pattern, which describes a sequence of maneuvers that can challenge an ADS effectively. MOSAT constructs test scenarios from atomic maneuvers and motif patterns, and uses a multi-objective genetic algorithm to search for adversarial and diverse test scenarios. Moreover, in order to test the performance of an ADS comprehensively during long-mile driving, we design a novel continuous simulation testing technique, which runs the scenarios generated by multiple parallel search processes alternately in the simulator and can continuously create different perturbations to the ADS. We demonstrate MOSAT on an industrial-grade platform, Baidu Apollo, and the experimental results show that MOSAT can effectively generate safety-critical scenarios to crash ADSs and it exposes 11 distinct types of safety violations in a short period of time. It also outperforms state-of-the-art techniques by finding 6 more distinct safety violations on the same road. 
@InProceedings{ESEC/FSE22p94, author = {Haoxiang Tian and Yan Jiang and Guoquan Wu and Jiren Yan and Jun Wei and Wei Chen and Shuo Li and Dan Ye}, title = {MOSAT: Finding Safety Violations of Autonomous Driving Systems using Multi-objective Genetic Algorithm}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {94--106}, doi = {10.1145/3540250.3549100}, year = {2022}, } Publisher's Version |
|
Chen, Xiao |
ESEC/FSE '22: "Are We Building on the Rock? ..."
Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization
Lin Shi, Fangwen Mu, Xiao Chen, Song Wang, Junjie Wang, Ye Yang, Ge Li, Xin Xia, and Qing Wang (Institute of Software at Chinese Academy of Sciences, China; York University, Canada; Stevens Institute of Technology, USA; Peking University, China; Huawei, China) Code summarization, the task of generating useful comments given the code, has long been of interest. Most of the existing code summarization models are trained and validated on widely-used code comment benchmark datasets. However, little is known about the quality of the benchmark datasets built from real-world projects. Are the benchmark datasets as good as expected? To bridge the gap, we conduct a systematic research to assess and improve the quality of four benchmark datasets widely used for code summarization tasks. First, we propose an automated code-comment cleaning tool that can accurately detect noisy data caused by inappropriate data preprocessing operations from existing benchmark datasets. Then, we apply the tool to further assess the data quality of the four benchmark datasets, based on the detected noises. Finally, we conduct comparative experiments to investigate the impact of noisy data on the performance of code summarization models. The results show that these data preprocessing noises widely exist in all four benchmark datasets, and removing these noisy data leads to a significant improvement on the performance of code summarization. We believe that the findings and insights will enable a better understanding of data quality in code summarization tasks, and pave the way for relevant research and practice. @InProceedings{ESEC/FSE22p107, author = {Lin Shi and Fangwen Mu and Xiao Chen and Song Wang and Junjie Wang and Ye Yang and Ge Li and Xin Xia and Qing Wang}, title = {Are We Building on the Rock? 
On the Importance of Data Preprocessing for Code Summarization}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {107--119}, doi = {10.1145/3540250.3549145}, year = {2022}, } Publisher's Version ESEC/FSE '22: "Cross-Language Android Permission ..." Cross-Language Android Permission Specification Chaoran Li, Xiao Chen, Ruoxi Sun, Minhui Xue, Sheng Wen, Muhammad Ejaz Ahmed, Seyit Camtepe, and Yang Xiang (Swinburne University of Technology, Australia; Monash University, Australia; University of Adelaide, Australia; CSIRO’s Data61, Australia) The Android system manages access to sensitive APIs by permission enforcement. An application (app) must declare proper permissions before invoking specific Android APIs. However, there is no official documentation providing the complete list of permission-protected APIs and the corresponding permissions to date. Researchers have spent significant efforts extracting such API protection mapping from the Android API framework, which leverages static code analysis to determine if specific permissions are required before accessing an API. Nevertheless, none of them has attempted to analyze the protection mapping in the native library (i.e., code written in C and C++), an essential component of the Android framework that handles communication with the lower-level hardware, such as cameras and sensors. While the protection mapping can be utilized to detect various security vulnerabilities in Android apps, such as permission over-privilege, imprecise mapping will lead to false results in detecting such security vulnerabilities. To fill this gap, we thereby propose to construct the protection mapping involved in the native libraries of the Android framework to present a complete and accurate specification of Android API protection. We develop a prototype system, named NatiDroid, to facilitate the cross-language static analysis and compare its performance with two state-of-the-practice tools, termed Axplorer and Arcade. 
We evaluate NatiDroid on more than 11,000 Android apps, including system apps from custom Android ROMs and third-party apps from Google Play. NatiDroid can identify up to 464 new API-permission mappings, in contrast to the worst-case results derived from both Axplorer and Arcade, where approximately 71% of apps have at least one false positive in permission over-privilege. We have disclosed all the potential vulnerabilities detected to the stakeholders. @InProceedings{ESEC/FSE22p772, author = {Chaoran Li and Xiao Chen and Ruoxi Sun and Minhui Xue and Sheng Wen and Muhammad Ejaz Ahmed and Seyit Camtepe and Yang Xiang}, title = {Cross-Language Android Permission Specification}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {772--783}, doi = {10.1145/3540250.3549142}, year = {2022}, } Publisher's Version |
|
Chen, Yifen |
ESEC/FSE '22-IND: "Workgraph: Personal Focus ..."
Workgraph: Personal Focus vs. Interruption for Engineers at Meta
Yifen Chen, Peter C. Rigby, Yulin Chen, Kun Jiang, Nader Dehghani, Qianying Huang, Peter Cottle, Clayton Andrews, Noah Lee, and Nachiappan Nagappan (Meta, USA; Concordia University, Canada) All engineers dislike interruptions because they take away from the deep focus time needed to write complex code. Our goal is to reduce unnecessary interruptions at Meta. We first describe our Workgraph platform that logs how engineers use our internal work tools at Meta. Using these anonymized logs, we create personal-focus sessions. Personal-focus sessions are defined in opposition to interruption and are the amount of time until the engineer is interrupted by, for example, a work chat message. We report descriptive statistics on how long engineers are able to focus. We find that at Meta, engineers have a total of 14.25 hours of personal-focus time per week. These numbers are comparable with those reported by other software firms. We then create a Random Forest model to understand which factors influence the median daily personal-focus time. We find that the more time an engineer spends in the IDE, the longer their focus. We also find that the more central an engineer is in the social work network, the shorter their personal-focus time. Other factors such as role and domain/pillar have little impact on personal-focus at Meta. To help engineers achieve longer blocks of personal-focus and help them stay in flow, Meta developed the AutoFocus tool that blocks work chat notifications when an engineer is working on code for 12 minutes or longer. AutoFocus allows the sender to still force a work chat message using “@notify”, ensuring that urgent messages still get through while prompting the sender to reflect on the importance of the message. In a large experiment, we find that AutoFocus increases the amount of personal-focus time by 20.27%, and it has now been rolled out widely at Meta. @InProceedings{ESEC/FSE22p1390, author = {Yifen Chen and Peter C. Rigby and Yulin Chen and Kun Jiang and Nader Dehghani and Qianying Huang and Peter Cottle and Clayton Andrews and Noah Lee and Nachiappan Nagappan}, title = {Workgraph: Personal Focus vs. Interruption for Engineers at Meta}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1390--1397}, doi = {10.1145/3540250.3558961}, year = {2022}, } Publisher's Version |
|
Chen, Yimin |
ESEC/FSE '22-DEMO: "Clang __usercall: Towards ..."
Clang __usercall: Towards Native Support for User Defined Calling Conventions
Jared Widberg, Sashank Narain, and Yimin Chen (University of Massachusetts at Lowell, USA) In reverse engineering, interfacing with C/C++ functions is of great interest because it provides much more flexibility for product development and security purposes. However, interfacing with functions that use user defined calling conventions has been a great challenge due to the lack of sufficient and user-friendly tooling. In this work, we design and implement Clang __usercall, which aims to provide programmers with an elegant and familiar syntax to specify user defined calling conventions on functions in C/C++ source code. Our key novelties lie in mimicking the most popular syntax and adapting Clang for interfacing purposes. Our preliminary user study shows that our solution outperforms the existing ones in multiple key aspects, including user experience and required lines of code. Clang __usercall has also been added to the Compiler Explorer website. @InProceedings{ESEC/FSE22p1746, author = {Jared Widberg and Sashank Narain and Yimin Chen}, title = {Clang __usercall: Towards Native Support for User Defined Calling Conventions}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1746--1750}, doi = {10.1145/3540250.3558921}, year = {2022}, } Publisher's Version |
|
Chen, Yiru |
ESEC/FSE '22: "TraceCRL: Contrastive Representation ..."
TraceCRL: Contrastive Representation Learning for Microservice Trace Analysis
Chenxi Zhang, Xin Peng, Tong Zhou, Chaofeng Sha, Zhenghui Yan, Yiru Chen, and Hong Yang (Fudan University, China) Due to the large amount and high complexity of trace data, microservice trace analysis tasks such as anomaly detection, fault diagnosis, and tail-based sampling widely adopt machine learning technology. These trace analysis approaches usually use a preprocessing step to map structured features of traces to vector representations in an ad-hoc way. Therefore, they may lose important information such as topological dependencies between service operations. In this paper, we propose TraceCRL, a trace representation learning approach based on contrastive learning and graph neural networks, which can incorporate graph-structured information in the downstream trace analysis tasks. Given a trace, TraceCRL constructs an operation invocation graph where nodes represent service operations and edges represent operation invocations together with predefined features for invocation status and related metrics. Based on the operation invocation graphs of traces, TraceCRL uses a contrastive learning method to train a graph neural network-based model for trace representation. In particular, TraceCRL employs six trace data augmentation strategies to alleviate the problems of class collision and uniformity of representation in contrastive learning. Our experimental studies show that TraceCRL can significantly improve the performance of trace anomaly detection and offline trace sampling. They also confirm the effectiveness of the trace augmentation strategies and the efficiency of TraceCRL. @InProceedings{ESEC/FSE22p1221, author = {Chenxi Zhang and Xin Peng and Tong Zhou and Chaofeng Sha and Zhenghui Yan and Yiru Chen and Hong Yang}, title = {TraceCRL: Contrastive Representation Learning for Microservice Trace Analysis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1221--1232}, doi = {10.1145/3540250.3549146}, year = {2022}, } Publisher's Version |
|
Chen, Yulin |
ESEC/FSE '22-IND: "Workgraph: Personal Focus ..."
Workgraph: Personal Focus vs. Interruption for Engineers at Meta
Yifen Chen, Peter C. Rigby, Yulin Chen, Kun Jiang, Nader Dehghani, Qianying Huang, Peter Cottle, Clayton Andrews, Noah Lee, and Nachiappan Nagappan (Meta, USA; Concordia University, Canada) All engineers dislike interruptions because they take away from the deep focus time needed to write complex code. Our goal is to reduce unnecessary interruptions at Meta. We first describe our Workgraph platform that logs how engineers use our internal work tools at Meta. Using these anonymized logs, we create personal-focus sessions. Personal-focus sessions are defined in opposition to interruption and are the amount of time until the engineer is interrupted by, for example, a work chat message. We report descriptive statistics on how long engineers are able to focus. We find that at Meta, engineers have a total of 14.25 hours of personal-focus time per week. These numbers are comparable with those reported by other software firms. We then create a Random Forest model to understand which factors influence the median daily personal-focus time. We find that the more time an engineer spends in the IDE, the longer their focus. We also find that the more central an engineer is in the social work network, the shorter their personal-focus time. Other factors such as role and domain/pillar have little impact on personal-focus at Meta. To help engineers achieve longer blocks of personal-focus and help them stay in flow, Meta developed the AutoFocus tool that blocks work chat notifications when an engineer is working on code for 12 minutes or longer. AutoFocus allows the sender to still force a work chat message using “@notify”, ensuring that urgent messages still get through while prompting the sender to reflect on the importance of the message. In a large experiment, we find that AutoFocus increases the amount of personal-focus time by 20.27%, and it has now been rolled out widely at Meta. @InProceedings{ESEC/FSE22p1390, author = {Yifen Chen and Peter C. Rigby and Yulin Chen and Kun Jiang and Nader Dehghani and Qianying Huang and Peter Cottle and Clayton Andrews and Noah Lee and Nachiappan Nagappan}, title = {Workgraph: Personal Focus vs. Interruption for Engineers at Meta}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1390--1397}, doi = {10.1145/3540250.3558961}, year = {2022}, } Publisher's Version |
|
Chen, Zhenpeng |
ESEC/FSE '22: "MAAT: A Novel Ensemble Approach ..."
MAAT: A Novel Ensemble Approach to Addressing Fairness and Performance Bugs for Machine Learning Software
Zhenpeng Chen, Jie M. Zhang, Federica Sarro, and Mark Harman (University College London, UK; King’s College London, UK) Machine Learning (ML) software can lead to unfair and unethical decisions, making software fairness bugs an increasingly significant concern for software engineers. However, addressing fairness bugs often comes at the cost of introducing more ML performance (e.g., accuracy) bugs. In this paper, we propose MAAT, a novel ensemble approach to improving fairness-performance trade-off for ML software. Conventional ensemble methods combine different models with identical learning objectives. MAAT, instead, combines models optimized for different objectives: fairness and ML performance. We conduct an extensive evaluation of MAAT with 5 state-of-the-art methods, 9 software decision tasks, and 15 fairness-performance measurements. The results show that MAAT significantly outperforms the state-of-the-art. In particular, MAAT beats the trade-off baseline constructed by a recent benchmarking tool in 92.2% of the overall cases evaluated, 12.2 percentage points more than the best technique currently available. Moreover, the superiority of MAAT over the state-of-the-art holds on all the tasks and measurements that we study. We have made publicly available the code and data of this work to allow for future replication and extension. @InProceedings{ESEC/FSE22p1122, author = {Zhenpeng Chen and Jie M. Zhang and Federica Sarro and Mark Harman}, title = {MAAT: A Novel Ensemble Approach to Addressing Fairness and Performance Bugs for Machine Learning Software}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1122--1134}, doi = {10.1145/3540250.3549093}, year = {2022}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Chen, Zhenyu |
ESEC/FSE '22-DEMO: "SemCluster: A Semi-supervised ..."
SemCluster: A Semi-supervised Clustering Tool for Crowdsourced Test Reports with Deep Image Understanding
Mingzhe Du, Shengcheng Yu, Chunrong Fang, Tongyu Li, Heyuan Zhang, and Zhenyu Chen (Nanjing University, China) Due to the openness of crowdsourced testing, mobile app crowdsourced testing has been subject to duplicate reports. The previous research methods extract the textual features of the crowdsourced test reports, combine with shallow image analysis, and perform unsupervised clustering on the crowdsourced test reports to clarify the duplication of crowdsourced test reports and solve the problem. However, these methods ignore the semantic connection between textual descriptions and screenshots, making the clustering results unsatisfactory and the deduplication effect less accurate. This paper proposes a semi-supervised clustering tool for crowdsourced test reports with deep image understanding, namely SemCluster, which makes the most of the semantic connection between textual descriptions and screenshots by constructing semantic binding rules and performing semi-supervised clustering. SemCluster improves six metrics of clustering results in the experiment compared to the state-of-the-art method, which verifies that SemCluster has achieved a good deduplication effect. The demo can be found at: https://sites.google.com/view/semcluster-demo. @InProceedings{ESEC/FSE22p1756, author = {Mingzhe Du and Shengcheng Yu and Chunrong Fang and Tongyu Li and Heyuan Zhang and Zhenyu Chen}, title = {SemCluster: A Semi-supervised Clustering Tool for Crowdsourced Test Reports with Deep Image Understanding}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1756--1759}, doi = {10.1145/3540250.3558933}, year = {2022}, } Publisher's Version |
|
Cheng, Lan |
ESEC/FSE '22-IND: "What Improves Developer Productivity ..."
What Improves Developer Productivity at Google? Code Quality
Lan Cheng, Emerson Murphy-Hill, Mark Canning, Ciera Jaspan, Collin Green, Andrea Knight, Nan Zhang, and Elizabeth Kammer (Google, USA) Understanding what affects software developer productivity can help organizations choose wise investments in their technical and social environment. But the research literature either focuses on what correlates with developer productivity in ecologically valid settings or focuses on what causes developer productivity in highly constrained settings. In this paper, we bridge the gap by studying software developers at Google through two analyses. In the first analysis, we use panel data with 39 productivity factors, finding that code quality, technical debt, infrastructure tools and support, team communication, goals and priorities, and organizational change and process are all causally linked to self-reported developer productivity. In the second analysis, we use a lagged panel analysis to strengthen our causal claims. We find that increases in perceived code quality tend to be followed by increased perceived developer productivity, but not vice versa, providing the strongest evidence to date that code quality affects individual developer productivity. @InProceedings{ESEC/FSE22p1302, author = {Lan Cheng and Emerson Murphy-Hill and Mark Canning and Ciera Jaspan and Collin Green and Andrea Knight and Nan Zhang and Elizabeth Kammer}, title = {What Improves Developer Productivity at Google? Code Quality}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1302--1313}, doi = {10.1145/3540250.3558940}, year = {2022}, } Publisher's Version Archive submitted (330 kB) |
|
Cheng, Xingqi |
ESEC/FSE '22-DEMO: "KVS: A Tool for Knowledge-Driven ..."
KVS: A Tool for Knowledge-Driven Vulnerability Searching
Xingqi Cheng, Xiaobing Sun, Lili Bo, and Ying Wei (Yangzhou University, China; Heidelberg University, Germany; Nanjing University, China) It is difficult to quickly locate and search for specific vulnerabilities and their solutions because vulnerability information is scattered in the existing vulnerability management library. To alleviate this problem, we extract knowledge from vulnerability reports and organize the vulnerability information into the form of a knowledge graph. Then, we implement a tool for knowledge-driven vulnerability searching, KVS. This tool mainly uses the BERT model to perform vulnerability named entity recognition and construct the vulnerability knowledge graph (VulKG). Finally, we can search for vulnerabilities of interest based on VulKG. The URL of this tool is https://cinnqi.github.io/Neo4j-D3-VKG/. A video of our demo is available at https://youtu.be/FT1BaLUGPk0. @InProceedings{ESEC/FSE22p1731, author = {Xingqi Cheng and Xiaobing Sun and Lili Bo and Ying Wei}, title = {KVS: A Tool for Knowledge-Driven Vulnerability Searching}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1731--1735}, doi = {10.1145/3540250.3558920}, year = {2022}, } Publisher's Version Video Info |
|
Chimalakonda, Sridhar |
ESEC/FSE '22-DEMO: "eGEN: An Energy-Saving Modeling ..."
eGEN: An Energy-Saving Modeling Language and Code Generator for Location-Sensing of Mobile Apps
Kowndinya Boyalakuntla, Marimuthu Chinnakali, Sridhar Chimalakonda, and Chandrasekaran K (IIT Tirupati, India; National Institute of Technology Karnataka, India) Given the limited tool support for energy-saving strategies during the design phase of android applications, developing battery-aware, location-based android applications is a non-trivial task for developers. To this end, we propose eGEN, consisting of (1) a Domain-Specific Modeling Language (DSML) and (2) a code generator to specify and create native battery-aware, location-based mobile applications. We evaluated eGEN by instrumenting the generated battery-aware code in five location-based, open-source android applications and compared the energy consumption with non-eGEN versions. The experimental results show 188 mA (8.34% of battery per hour) of average reduction in battery consumption while showing only 97 meters of degradation in location accuracy over three kilometers of a cycling path. Hence, we see this tool as a first step in helping developers write battery-aware code in location-based android applications. The GitHub repository with source code and all artifacts is available at https://github.com/Kowndinya2000/egen, and the tool demo video at https://youtu.be/Iadfh4cCw8I. @InProceedings{ESEC/FSE22p1697, author = {Kowndinya Boyalakuntla and Marimuthu Chinnakali and Sridhar Chimalakonda and Chandrasekaran K}, title = {eGEN: An Energy-Saving Modeling Language and Code Generator for Location-Sensing of Mobile Apps}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1697--1700}, doi = {10.1145/3540250.3558914}, year = {2022}, } Publisher's Version Video ESEC/FSE '22-IVR: "Exploring the Under-Explored ..." 
Exploring the Under-Explored Terrain of Non-open Source Data for Software Engineering through the Lens of Federated Learning Shriram Shanbhag and Sridhar Chimalakonda (IIT Tirupati, India) The availability of open source projects on platforms like GitHub has led to the wide use of the artifacts from these projects in software engineering research. These publicly available artifacts have been used to train artificial intelligence models used in various empirical studies and the development of tools. However, these advancements have missed out on the artifacts from non-open source projects due to the unavailability of the data. A major cause for the unavailability of the data from non-open source repositories is the issue concerning data privacy. In this paper, we propose using federated learning to address the issue of data privacy to enable the use of data from non-open source to train AI models used in software engineering research. We believe that this can potentially enable industries to collaborate with software engineering researchers without concerns about privacy. We present the preliminary evaluation of the use of federated learning to train a classifier to label bug-fix commits from an existing study to demonstrate its feasibility. The federated approach achieved an F1 score of 0.83 compared to a score of 0.84 using the centralized approach. We also present our vision of the potential implications of the use of federated learning in software engineering research. @InProceedings{ESEC/FSE22p1610, author = {Shriram Shanbhag and Sridhar Chimalakonda}, title = {Exploring the Under-Explored Terrain of Non-open Source Data for Software Engineering through the Lens of Federated Learning}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1610--1614}, doi = {10.1145/3540250.3560883}, year = {2022}, } Publisher's Version |
|
Chinnakali, Marimuthu |
ESEC/FSE '22-DEMO: "eGEN: An Energy-Saving Modeling ..."
eGEN: An Energy-Saving Modeling Language and Code Generator for Location-Sensing of Mobile Apps
Kowndinya Boyalakuntla, Marimuthu Chinnakali, Sridhar Chimalakonda, and Chandrasekaran K (IIT Tirupati, India; National Institute of Technology Karnataka, India) Given the limited tool support for energy-saving strategies during the design phase of Android applications, developing battery-aware, location-based Android applications is a non-trivial task for developers. To this end, we propose eGEN, consisting of (1) a Domain-Specific Modeling Language (DSML) and (2) a code generator to specify and create native battery-aware, location-based mobile applications. We evaluated eGEN by instrumenting the generated battery-aware code in five location-based, open-source Android applications and compared the energy consumption with non-eGEN versions. The experimental results show 188 mA (8.34% of battery per hour) of average reduction in battery consumption while showing only 97 meters of degradation in location accuracy over three kilometers of a cycling path. Hence, we see this tool as a first step in helping developers write battery-aware code in location-based Android applications. The GitHub repository with source code and all artifacts is available at https://github.com/Kowndinya2000/egen, and the tool demo video at https://youtu.be/Iadfh4cCw8I. @InProceedings{ESEC/FSE22p1697, author = {Kowndinya Boyalakuntla and Marimuthu Chinnakali and Sridhar Chimalakonda and Chandrasekaran K}, title = {eGEN: An Energy-Saving Modeling Language and Code Generator for Location-Sensing of Mobile Apps}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1697--1700}, doi = {10.1145/3540250.3558914}, year = {2022}, } Publisher's Version Video |
|
Christakis, Maria |
ESEC/FSE '22-IND: "Input Splitting for Cloud-Based ..."
Input Splitting for Cloud-Based Static Application Security Testing Platforms
Maria Christakis, Thomas Cottenier, Antonio Filieri, Linghui Luo, Muhammad Numair Mansur, Lee Pike, Nicolás Rosner, Martin Schäf, Aritra Sengupta, and Willem Visser (MPI-SWS, Germany; Amazon Web Services, USA; Amazon Web Services, Germany) As software development teams adopt DevSecOps practices, application security is increasingly the responsibility of development teams, who are required to set up their own Static Application Security Testing (SAST) infrastructure. Since development teams often do not have the necessary infrastructure and expertise to set up a custom SAST solution, there is an increased need for cloud-based SAST platforms that operate as a service and run a variety of static analyzers. Adding a new static analyzer to a cloud-based SAST platform can be challenging because static analyzers vary greatly in complexity, from linters that scale efficiently to interprocedural dataflow engines that use cubic or even more complex algorithms. Careful manual evaluation is needed to decide whether a new analyzer would slow down the overall response time of the platform or time out too often. We explore the question of whether this can be simplified by splitting the input to the analyzer into partitions and analyzing the partitions independently. Depending on the complexity of the static analyzer, the partition size can be adjusted to curtail the overall response time. We report on an experiment where we run different analysis tools with and without splitting the inputs. The experimental results show that simple splitting strategies can effectively reduce the running time and memory usage per partition without significantly affecting the findings produced by the tool. 
@InProceedings{ESEC/FSE22p1367, author = {Maria Christakis and Thomas Cottenier and Antonio Filieri and Linghui Luo and Muhammad Numair Mansur and Lee Pike and Nicolás Rosner and Martin Schäf and Aritra Sengupta and Willem Visser}, title = {Input Splitting for Cloud-Based Static Application Security Testing Platforms}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1367--1378}, doi = {10.1145/3540250.3558944}, year = {2022}, } Publisher's Version |
|
Chung, Hyunhee |
ESEC/FSE '22-IND: "An Empirical Study of Deep ..."
An Empirical Study of Deep Transfer Learning-Based Program Repair for Kotlin Projects
Misoo Kim, Youngkyoung Kim, Hohyeon Jeong, Jinseok Heo, Sungoh Kim, Hyunhee Chung, and Eunseok Lee (Sungkyunkwan University, South Korea; Samsung Electronics, South Korea) Deep learning-based automated program repair (DL-APR) can automatically fix software bugs and has received significant attention in the industry because of its potential to significantly reduce software development and maintenance costs. The Samsung mobile experience (MX) team is currently switching from Java to Kotlin projects. This study reviews the application of DL-APR, which automatically fixes defects that arise during this switching process; however, the shortage of Kotlin defect-fixing datasets in the Samsung MX team precludes us from fully utilizing the power of deep learning. Therefore, strategies are needed to effectively reuse the pretrained DL-APR model. This demand can be met using the Kotlin defect-fixing datasets constructed from industrial and open-source repositories, and transfer learning. This study aims to validate the performance of the pretrained DL-APR model in fixing defects in the Samsung Kotlin projects, then improve its performance by applying transfer learning. We show that transfer learning with open-source and industrial Kotlin defect-fixing datasets can improve the defect-fixing performance of the existing DL-APR by 307%. Furthermore, we confirmed that the performance was improved by 532% compared with the baseline DL-APR model as a result of transferring the knowledge of an industrial (non-defect) bug-fixing dataset. We also discovered that the embedded vectors and overlapping code tokens of the code-change pairs are valuable features for selecting useful knowledge transfer instances, improving the performance of APR models by up to 696%. Our study demonstrates the feasibility of transfer learning for practitioners who are considering applying DL-APR to industrial software. 
@InProceedings{ESEC/FSE22p1441, author = {Misoo Kim and Youngkyoung Kim and Hohyeon Jeong and Jinseok Heo and Sungoh Kim and Hyunhee Chung and Eunseok Lee}, title = {An Empirical Study of Deep Transfer Learning-Based Program Repair for Kotlin Projects}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1441--1452}, doi = {10.1145/3540250.3558967}, year = {2022}, } Publisher's Version |
|
Cito, Jürgen |
ESEC/FSE '22: "A Retrospective Study of One ..."
A Retrospective Study of One Decade of Artifact Evaluations
Stefan Winter, Christopher S. Timperley, Ben Hermann, Jürgen Cito, Jonathan Bell, Michael Hilton, and Dirk Beyer (LMU Munich, Germany; Carnegie Mellon University, USA; TU Dortmund, Germany; TU Wien, Austria; Northeastern University, USA) Most software engineering research involves the development of a prototype, a proof of concept, or a measurement apparatus. Together with the data collected in the research process, they are collectively referred to as research artifacts and are subject to artifact evaluation (AE) at scientific conferences. Since its initiation in the SE community at ESEC/FSE 2011, both the goals and the process of AE have evolved and today expectations towards AE are strongly linked with reproducible research results and reusable tools that other researchers can build their work on. However, to date little evidence has been provided that artifacts which have passed AE actually live up to these high expectations, i.e., to which degree AE processes contribute to AE's goals and whether the overhead they impose is justified. We aim to fill this gap by providing an in-depth analysis of research artifacts from a decade of software engineering (SE) and programming languages (PL) conferences, based on which we reflect on the goals and mechanisms of AE in our community. In summary, our analyses (1) suggest that articles with artifacts do not generally have better visibility in the community, (2) provide evidence how evaluated and not evaluated artifacts differ with respect to different quality criteria, and (3) highlight opportunities for further improving AE processes. @InProceedings{ESEC/FSE22p145, author = {Stefan Winter and Christopher S. 
Timperley and Ben Hermann and Jürgen Cito and Jonathan Bell and Michael Hilton and Dirk Beyer}, title = {A Retrospective Study of One Decade of Artifact Evaluations}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {145--156}, doi = {10.1145/3540250.3549172}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Clement, Colin B. |
ESEC/FSE '22-IND: "Exploring and Evaluating Personalized ..."
Exploring and Evaluating Personalized Models for Code Generation
Andrei Zlotchevski, Dawn Drain, Alexey Svyatkovskiy, Colin B. Clement, Neel Sundaresan, and Michele Tufano (McGill University, Canada; Anthropic, USA; Microsoft, USA) Large Transformer models achieved the state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain -- for example, question-answering on a given topic -- generalization remains an on-going challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model's parameters, allowing tuning of the token embeddings and softmax layer only or the final layer alone; (iii) prefix tuning, which keeps model parameters frozen, but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off in total compute cost and predictive performance, which we evaluate by code and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios. @InProceedings{ESEC/FSE22p1500, author = {Andrei Zlotchevski and Dawn Drain and Alexey Svyatkovskiy and Colin B. 
Clement and Neel Sundaresan and Michele Tufano}, title = {Exploring and Evaluating Personalized Models for Code Generation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1500--1508}, doi = {10.1145/3540250.3558959}, year = {2022}, } Publisher's Version ESEC/FSE '22: "DeepDev-PERF: A Deep Learning-Based ..." DeepDev-PERF: A Deep Learning-Based Approach for Improving Software Performance Spandan Garg, Roshanak Zilouchian Moghaddam, Colin B. Clement, Neel Sundaresan, and Chen Wu (Microsoft, USA; Microsoft, China) Improving software performance is an important yet challenging part of the software development cycle. Today, the majority of performance inefficiencies are identified and patched by performance experts. Recent advancements in deep learning approaches and the wide-spread availability of open-source data creates a great opportunity to automate the identification and patching of performance problems. In this paper, we present DeepDev-PERF, a transformer-based approach to suggest performance improvements for C# applications. We pretrain DeepDev-PERF on English and Source code corpora, followed by finetuning for the task of generating performance improvement patches for C# applications. Our evaluation shows that our model can generate the same performance improvement suggestion as the developer fix in 53 @InProceedings{ESEC/FSE22p948, author = {Spandan Garg and Roshanak Zilouchian Moghaddam and Colin B. Clement and Neel Sundaresan and Chen Wu}, title = {DeepDev-PERF: A Deep Learning-Based Approach for Improving Software Performance}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {948--958}, doi = {10.1145/3540250.3549096}, year = {2022}, } Publisher's Version |
|
Cogo, Filipe Roseiro |
ESEC/FSE '22: "Automated Unearthing of Dangerous ..."
Automated Unearthing of Dangerous Issue Reports
Shengyi Pan, Jiayuan Zhou, Filipe Roseiro Cogo, Xin Xia, Lingfeng Bao, Xing Hu, Shanping Li, and Ahmed E. Hassan (Zhejiang University, China; Huawei, Canada; Huawei, China; Queen’s University, Canada) The coordinated vulnerability disclosure (CVD) process is commonly adopted for open source software (OSS) vulnerability management; it recommends privately reporting discovered vulnerabilities and keeping relevant information secret until the official disclosure. However, in practice, for various reasons (e.g., lack of security domain expertise or of a sense of security management), many vulnerabilities are first reported via public issue reports (IRs) before their official disclosure. Such IRs are dangerous IRs, since attackers can take advantage of the leaked vulnerability information to launch zero-day attacks. It is crucial to identify such dangerous IRs at an early stage, so that OSS users can start the vulnerability remediation process earlier and OSS maintainers can manage the dangerous IRs in a timely manner. In this paper, we propose and evaluate a deep learning based approach, namely MemVul, to automatically identify dangerous IRs at the time they are reported. MemVul augments the neural networks with a memory component, which stores the external vulnerability knowledge from Common Weakness Enumeration (CWE). We rely on publicly accessible CVE-referred IRs (CIRs) to operationalize the concept of dangerous IR. We mine 3,937 CIRs distributed across 1,390 OSS projects hosted on GitHub. Evaluated under a practical scenario of high data imbalance, MemVul achieves the best trade-off between precision and recall among all baselines. In particular, the F1-score of MemVul (i.e., 0.49) improves the best performing baseline by 44%. For IRs that are predicted as CIRs but not reported to CVE, we conduct a user study to investigate their usefulness to OSS stakeholders. 
We observe that 82% (41 out of 50) of these IRs are security-related and 28 of them are suggested by security experts to be publicly disclosed, indicating MemVul is capable of identifying undisclosed dangerous IRs. @InProceedings{ESEC/FSE22p834, author = {Shengyi Pan and Jiayuan Zhou and Filipe Roseiro Cogo and Xin Xia and Lingfeng Bao and Xing Hu and Shanping Li and Ahmed E. Hassan}, title = {Automated Unearthing of Dangerous Issue Reports}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {834--846}, doi = {10.1145/3540250.3549156}, year = {2022}, } Publisher's Version |
|
Costa, Diego Elias |
ESEC/FSE '22-IND: "Achievement Unlocked: A Case ..."
Achievement Unlocked: A Case Study on Gamifying DevOps Practices in Industry
Patrick Ayoup, Diego Elias Costa, and Emad Shihab (Concordia University, Canada; Université du Québec à Montréal, Canada) Gamification is the use of game elements such as points, leaderboards, and badges in a non-game context to encourage a desired behavior from individuals interacting with an environment. Recently, gamification has found its way into software engineering contexts as a means to promote certain activities to practitioners. Previous studies investigated the use of gamification to promote the adoption of a variety of tools and practices, however, these studies were either performed in an educational environment or in small to medium-sized teams of developers in the industry. We performed a large-scale mixed-methods study on the effects of badge-based gamification in promoting the adoption of DevOps practices in a very large company and evaluated how practice adoption is associated with changes in key delivery, quality, and throughput metrics of 333 software projects. We observed an accelerated adoption of some gamified DevOps practices by at least 60%, with increased adoption rates up to 6x. We found mixed results when associating badge adoption and metric changes: teams that earned testing badges showed an increase in bug fixing commits but output fewer commits and pull requests; teams that earned code review and quality tooling badges exhibited faster delivery metrics. Finally, our empirical study was supplemented by a survey with 45 developers where 73% of respondents found badges to be helpful for learning about and adopting new standardized practices. Our results contribute to the rich knowledge on gamification with a unique and important perspective from real industry practitioners. 
@InProceedings{ESEC/FSE22p1343, author = {Patrick Ayoup and Diego Elias Costa and Emad Shihab}, title = {Achievement Unlocked: A Case Study on Gamifying DevOps Practices in Industry}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1343--1354}, doi = {10.1145/3540250.3558948}, year = {2022}, } Publisher's Version |
|
Cottenier, Thomas |
ESEC/FSE '22-IND: "Input Splitting for Cloud-Based ..."
Input Splitting for Cloud-Based Static Application Security Testing Platforms
Maria Christakis, Thomas Cottenier, Antonio Filieri, Linghui Luo, Muhammad Numair Mansur, Lee Pike, Nicolás Rosner, Martin Schäf, Aritra Sengupta, and Willem Visser (MPI-SWS, Germany; Amazon Web Services, USA; Amazon Web Services, Germany) As software development teams adopt DevSecOps practices, application security is increasingly the responsibility of development teams, who are required to set up their own Static Application Security Testing (SAST) infrastructure. Since development teams often do not have the necessary infrastructure and expertise to set up a custom SAST solution, there is an increased need for cloud-based SAST platforms that operate as a service and run a variety of static analyzers. Adding a new static analyzer to a cloud-based SAST platform can be challenging because static analyzers greatly vary in complexity, from linters that scale efficiently to interprocedural dataflow engines that use cubic or even more complex algorithms. Careful manual evaluation is needed to decide whether a new analyzer would slow down the overall response time of the platform or may timeout too often. We explore the question of whether this can be simplified by splitting the input to the analyzer into partitions and analyzing the partitions independently. Depending on the complexity of the static analyzer, the partition size can be adjusted to curtail the overall response time. We report on an experiment where we run different analysis tools with and without splitting the inputs. The experimental results show that simple splitting strategies can effectively reduce the running time and memory usage per partition without significantly affecting the findings produced by the tool. 
@InProceedings{ESEC/FSE22p1367, author = {Maria Christakis and Thomas Cottenier and Antonio Filieri and Linghui Luo and Muhammad Numair Mansur and Lee Pike and Nicolás Rosner and Martin Schäf and Aritra Sengupta and Willem Visser}, title = {Input Splitting for Cloud-Based Static Application Security Testing Platforms}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1367--1378}, doi = {10.1145/3540250.3558944}, year = {2022}, } Publisher's Version |
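The input-splitting idea described in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `split_into_partitions` and `analyze_with_splitting`, the fixed-size chunking strategy, and the representation of findings as strings are all assumptions made for illustration. The sketch partitions a list of input files, runs an analyzer on each partition independently, and merges the findings.

```python
from typing import Callable, List, Sequence


def split_into_partitions(files: Sequence[str], partition_size: int) -> List[List[str]]:
    """Chunk the input files into consecutive partitions of at most partition_size."""
    if partition_size < 1:
        raise ValueError("partition_size must be at least 1")
    return [list(files[i:i + partition_size])
            for i in range(0, len(files), partition_size)]


def analyze_with_splitting(
    files: Sequence[str],
    analyzer: Callable[[List[str]], List[str]],
    partition_size: int,
) -> List[str]:
    """Run the analyzer on each partition independently and merge the findings.

    Findings are de-duplicated while preserving the order of first appearance.
    """
    merged: List[str] = []
    seen = set()
    for partition in split_into_partitions(files, partition_size):
        for finding in analyzer(partition):
            if finding not in seen:
                seen.add(finding)
                merged.append(finding)
    return merged


if __name__ == "__main__":
    # Hypothetical toy analyzer: flags files by name only, so it is purely
    # intraprocedural and splitting cannot change its findings.
    def toy_analyzer(partition: List[str]) -> List[str]:
        return [f"{name}: suspicious name" for name in partition if "unsafe" in name]

    inputs = ["a.py", "unsafe_b.py", "c.py", "unsafe_d.py", "e.py"]
    print(analyze_with_splitting(inputs, toy_analyzer, partition_size=2))
```

A smaller `partition_size` bounds the work per analyzer invocation, which is the lever the abstract describes for curtailing response time. An analyzer that tracks dataflow across files could lose findings that span partitions, which is why the paper measures the effect of splitting on the findings.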
|
Cottle, Peter |
ESEC/FSE '22-IND: "Workgraph: Personal Focus ..."
Workgraph: Personal Focus vs. Interruption for Engineers at Meta
Yifen Chen, Peter C. Rigby, Yulin Chen, Kun Jiang, Nader Dehghani, Qianying Huang, Peter Cottle, Clayton Andrews, Noah Lee, and Nachiappan Nagappan (Meta, USA; Concordia University, Canada) All engineers dislike interruptions because they take away from the deep focus time needed to write complex code. Our goal is to reduce unnecessary interruptions at Meta. We first describe our Workgraph platform that logs how engineers use our internal work tools at Meta. Using these anonymized logs, we create sessions. Sessions are defined in opposition to interruption and are the amount of time until the engineer is interrupted by, for example, a work chat message. We report descriptive statistics on how long engineers are able to focus. We find that at Meta, engineers have a total of 14.25 hours of personal-focus time per week. These numbers are comparable with those reported by other software firms. We then create a Random Forest model to understand which factors influence the median daily personal-focus time. We find that the more time an engineer spends in the IDE, the longer their focus. We also find that the more central an engineer is in the social work network, the shorter their personal-focus time. Other factors such as role and domain/pillar have little impact on personal-focus at Meta. To help engineers achieve longer blocks of personal-focus and help them stay in flow, Meta developed the AutoFocus tool that blocks work chat notifications when an engineer is working on code for 12 minutes or longer. AutoFocus allows the sender to still force a work chat message using “@notify”, ensuring that urgent messages still get through while prompting the sender to reflect on the importance of the message. In a large experiment, we find that AutoFocus increases the amount of personal-focus time by 20.27%, and it has now been rolled out widely at Meta. @InProceedings{ESEC/FSE22p1390, author = {Yifen Chen and Peter C. 
Rigby and Yulin Chen and Kun Jiang and Nader Dehghani and Qianying Huang and Peter Cottle and Clayton Andrews and Noah Lee and Nachiappan Nagappan}, title = {Workgraph: Personal Focus vs. Interruption for Engineers at Meta}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1390--1397}, doi = {10.1145/3540250.3558961}, year = {2022}, } Publisher's Version |
|
Counsell, Steve |
ESEC/FSE '22-IND: "Towards Developer-Centered ..."
Towards Developer-Centered Automatic Program Repair: Findings from Bloomberg
Emily Rowan Winter, Vesna Nowack, David Bowes, Steve Counsell, Tracy Hall, Sæmundur Haraldsson, John Woodward, Serkan Kirbas, Etienne Windels, Olayori McBello, Abdurahman Atakishiyev, Kevin Kells, and Matthew Pagano (Lancaster University, UK; Brunel University London, UK; University of Stirling, UK; Queen Mary University of London, UK; Bloomberg, UK) This paper reports on qualitative research into automatic program repair (APR) at Bloomberg. Six focus groups were conducted with a total of seventeen participants (including both developers of the APR tool and developers using the tool) to consider: the development at Bloomberg of a prototype APR tool (Fixie); developers’ early experiences using the tool; and developers’ perspectives on how they would like to interact with the tool in future. APR is developing rapidly and it is important to understand in greater detail developers' experiences using this emerging technology. In this paper, we provide in-depth, qualitative data from an industrial setting. We found that the development of APR at Bloomberg had become increasingly user-centered, emphasising how fixes were presented to developers, as well as particular features, such as customisability. From the focus groups with developers who had used Fixie, we found particular concern with the pragmatic aspects of APR, such as how and when fixes were presented to them. Based on our findings, we make a series of recommendations to inform future APR development, highlighting how APR tools should 'start small', be customisable, and fit with developers' workflows. We also suggest that APR tools should capitalise on the promise of repair bots and draw on advances in explainable AI. 
@InProceedings{ESEC/FSE22p1578, author = {Emily Rowan Winter and Vesna Nowack and David Bowes and Steve Counsell and Tracy Hall and Sæmundur Haraldsson and John Woodward and Serkan Kirbas and Etienne Windels and Olayori McBello and Abdurahman Atakishiyev and Kevin Kells and Matthew Pagano}, title = {Towards Developer-Centered Automatic Program Repair: Findings from Bloomberg}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1578--1588}, doi = {10.1145/3540250.3558953}, year = {2022}, } Publisher's Version |
|
Cui, Yunna |
ESEC/FSE '22-IND: "Trace Analysis Based Microservice ..."
Trace Analysis Based Microservice Architecture Measurement
Xin Peng, Chenxi Zhang, Zhongyuan Zhao, Akasaka Isami, Xiaofeng Guo, and Yunna Cui (Fudan University, China; Alibaba Group, China) Microservice architecture design highly relies on expert experience and may often result in improper service decomposition. Moreover, a microservice architecture is likely to degrade with the continuous evolution of services. Architecture measurement is thus important for the long-term evolution of microservice architectures. Due to the independent and dynamic nature of services, source code analysis based approaches cannot well capture the interactions between services. In this paper, we propose a trace analysis based microservice architecture measurement approach. We define a trace data model for microservice architecture measurement, which enables fine-grained analysis of the execution processes of requests and the interactions between interfaces and services. Based on the data model, we define 14 architectural metrics to measure the service independence and invocation chain complexity of a microservice system. We implement the approach and conduct three case studies with a student course project, an open-source microservice benchmark system, and three industrial microservice systems. The results show that our approach can well characterize the independence and invocation chain complexity of microservice architectures and help developers to identify microservice architecture issues caused by improper service decomposition and architecture degradation. @InProceedings{ESEC/FSE22p1589, author = {Xin Peng and Chenxi Zhang and Zhongyuan Zhao and Akasaka Isami and Xiaofeng Guo and Yunna Cui}, title = {Trace Analysis Based Microservice Architecture Measurement}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1589--1599}, doi = {10.1145/3540250.3558951}, year = {2022}, } Publisher's Version |
|
Cunha, Alcino |
ESEC/FSE '22: "Quantitative Relational Modelling ..."
Quantitative Relational Modelling with QAlloy
Pedro Silva, José N. Oliveira, Nuno Macedo, and Alcino Cunha (University of Minho, Portugal; INESC TEC, Portugal; University of Porto, Portugal) Alloy is a popular language and tool for formal software design. A key factor to this popularity is its relational logic, an elegant specification language with a minimal syntax and semantics. However, many software problems nowadays involve both structural and quantitative requirements, and Alloy's relational logic is not well suited to reason about the latter. This paper introduces QAlloy, an extension of Alloy with quantitative relations that add integer quantities to associations between domain elements. Having integers internalised in relations, instead of being explicit domain elements like in standard Alloy, allows quantitative requirements to be specified in QAlloy with a similar elegance to structural requirements, with the side-effect of providing basic dimensional analysis support via the type system. The QAlloy Analyzer also implements an SMT-based engine that enables quantities to be unbounded, thus avoiding many problems that may arise with the current bounded integer semantics of Alloy. @InProceedings{ESEC/FSE22p885, author = {Pedro Silva and José N. Oliveira and Nuno Macedo and Alcino Cunha}, title = {Quantitative Relational Modelling with QAlloy}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {885--896}, doi = {10.1145/3540250.3549154}, year = {2022}, } Publisher's Version Artifacts Functional |
|
D'Ambros, Marco |
ESEC/FSE '22: "First Come First Served: The ..."
First Come First Served: The Impact of File Position on Code Review
Enrico Fregnan, Larissa Braz, Marco D'Ambros, Gül Çalıklı, and Alberto Bacchelli (University of Zurich, Switzerland; USI Lugano, Switzerland; University of Glasgow, UK) The most popular code review tools (e.g., Gerrit and GitHub) present the files to review sorted in alphabetical order. Could this choice or, more generally, the relative position in which a file is presented bias the outcome of code reviews? We investigate this hypothesis by triangulating complementary evidence in a two-step study. First, we observe developers’ code review activity. We analyze the review comments pertaining to 219,476 Pull Requests (PRs) from 138 popular Java projects on GitHub. We found files shown earlier in a PR to receive more comments than files shown later, also when controlling for possible confounding factors: e.g., the presence of discussion threads or the lines added in a file. Second, we measure the impact of file position on defect finding in code review. Recruiting 106 participants, we conduct an online controlled experiment in which we measure participants’ performance in detecting two unrelated defects seeded into two different files. Participants are assigned to one of two treatments in which the position of the defective files is switched. For one type of defect, participants are not affected by its file’s position; for the other, they have 64% lower odds to identify it when its file is last as opposed to first. Overall, our findings provide evidence that the relative position in which files are presented has an impact on code reviews’ outcome; we discuss these results and implications for tool design and code review. 
@InProceedings{ESEC/FSE22p483, author = {Enrico Fregnan and Larissa Braz and Marco D'Ambros and Gül Çalıklı and Alberto Bacchelli}, title = {First Come First Served: The Impact of File Position on Code Review}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {483--494}, doi = {10.1145/3540250.3549177}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Dang, Yingnong |
ESEC/FSE '22-IND: "An Empirical Investigation ..."
An Empirical Investigation of Missing Data Handling in Cloud Node Failure Prediction
Minghua Ma, Yudong Liu, Yuang Tong, Haozhe Li, Pu Zhao, Yong Xu, Hongyu Zhang, Shilin He, Lu Wang, Yingnong Dang, Saravanakumar Rajmohan, and Qingwei Lin (Microsoft Research, China; University of Newcastle, Australia; Microsoft, USA) Cloud computing systems have become increasingly popular in recent years. A typical cloud system utilizes millions of computing nodes as the basic infrastructure. Node failure has been identified as one of the most prevalent causes of cloud system downtime. To improve the reliability of cloud systems, many previous studies collected monitoring metrics from nodes and built models to predict node failures before the failures happen. However, based on our experience with large-scale real-world cloud systems in Microsoft, we find that the task of predicting node failure is severely hampered by missing data. There is a large amount of missing data, and the online latest data utilized for prediction is even worse. As a result, the real-time performance of the node prediction model is limited. In this paper, we first characterize the missing data problem for node failure prediction. Then, we evaluate several existing data interpolation approaches, and find that node dimension interpolation approaches outperform time dimension ones and deep learning based interpolation is the best for early prediction. Our findings can help academics and engineers address the missing data problem in cloud node failure prediction and other data-driven software engineering scenarios. 
@InProceedings{ESEC/FSE22p1453, author = {Minghua Ma and Yudong Liu and Yuang Tong and Haozhe Li and Pu Zhao and Yong Xu and Hongyu Zhang and Shilin He and Lu Wang and Yingnong Dang and Saravanakumar Rajmohan and Qingwei Lin}, title = {An Empirical Investigation of Missing Data Handling in Cloud Node Failure Prediction}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1453--1464}, doi = {10.1145/3540250.3558946}, year = {2022}, } Publisher's Version ESEC/FSE '22-IND: "An Empirical Study of Log ..." An Empirical Study of Log Analysis at Microsoft Shilin He, Xu Zhang, Pinjia He, Yong Xu, Liqun Li, Yu Kang, Minghua Ma, Yining Wei, Yingnong Dang, Saravanakumar Rajmohan, and Qingwei Lin (Microsoft Research, China; Chinese University of Hong Kong at Shenzhen, China; Microsoft Azure, China; Microsoft Azure, USA; Microsoft 365, USA) Logs are crucial to the management and maintenance of software systems. In recent years, log analysis research has achieved notable progress on various topics such as log parsing and log-based anomaly detection. However, the real voices from front-line practitioners are seldom heard. For example, what are the pain points of log analysis in practice? In this work, we conduct a comprehensive survey study on log analysis at Microsoft. We collected feedback from 105 employees through a questionnaire of 13 questions and individual interviews with 12 employees. We summarize the format, scenario, method, tool, and pain points of log analysis. Additionally, by comparing the industrial practices with academic research, we discuss the gaps between academia and industry, and future opportunities on log analysis with four inspiring findings. Particularly, we observe a huge gap exists between log anomaly detection research and failure alerting practices regarding the goal, technique, efficiency, etc. 
Moreover, data-driven log parsing, which has been widely studied in recent research, can be alternatively achieved by simply logging template IDs during software development. We hope this paper could uncover the real needs of industrial practitioners and the unnoticed yet significant gap between industry and academia, and inspire interesting future directions that converge efforts from both sides. @InProceedings{ESEC/FSE22p1465, author = {Shilin He and Xu Zhang and Pinjia He and Yong Xu and Liqun Li and Yu Kang and Minghua Ma and Yining Wei and Yingnong Dang and Saravanakumar Rajmohan and Qingwei Lin}, title = {An Empirical Study of Log Analysis at Microsoft}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1465--1476}, doi = {10.1145/3540250.3558963}, year = {2022}, } Publisher's Version ESEC/FSE '22: "SPINE: A Scalable Log Parser ..." SPINE: A Scalable Log Parser with Feedback Guidance Xuheng Wang, Xu Zhang, Liqun Li, Shilin He, Hongyu Zhang, Yudong Liu, Lingling Zheng, Yu Kang, Qingwei Lin, Yingnong Dang, Saravanakumar Rajmohan, and Dongmei Zhang (Tsinghua University, China; Microsoft Research, China; University of Newcastle, Australia; Microsoft Azure, USA; Microsoft 365, USA) Log parsing, which extracts log templates and parameters, is a critical prerequisite step for automated log analysis techniques. Though existing log parsers have achieved promising accuracy on public log datasets, they still face many challenges when applied in the industry. Through studying the characteristics of real-world log data and analyzing the limitations of existing log parsers, we identify two problems. Firstly, it is non-trivial to scale a log parser to a vast number of logs, especially in real-world scenarios where the log data is extremely imbalanced. Secondly, existing log parsers overlook the importance of user feedback, which is imperative for parser fine-tuning under the continuous evolution of log data. 
To overcome the challenges, we propose SPINE, which is a highly scalable log parser with user feedback guidance. Based on our log parser equipped with initial grouping and progressive clustering, we propose a novel log data scheduling algorithm to improve the efficiency of parallelization under large-scale imbalanced log data. Besides, we introduce user feedback to make the parser adapt quickly to the evolving logs. We evaluated SPINE on 16 public log datasets. SPINE achieves more than 0.90 parsing accuracy on average with the highest parsing efficiency, which outperforms the state-of-the-art log parsers. We also evaluated SPINE in the production environment of Microsoft, in which SPINE can parse 30 million logs in less than 8 minutes under 16 executors, achieving near real-time performance. In addition, our evaluations show that SPINE can consistently achieve good accuracy under log evolution with a moderate amount of user feedback. @InProceedings{ESEC/FSE22p1198, author = {Xuheng Wang and Xu Zhang and Liqun Li and Shilin He and Hongyu Zhang and Yudong Liu and Lingling Zheng and Yu Kang and Qingwei Lin and Yingnong Dang and Saravanakumar Rajmohan and Dongmei Zhang}, title = {SPINE: A Scalable Log Parser with Feedback Guidance}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1198--1208}, doi = {10.1145/3540250.3549176}, year = {2022}, } Publisher's Version |
|
David, Cristina |
ESEC/FSE '22: "Using Graph Neural Networks ..."
Using Graph Neural Networks for Program Termination
Yoav Alon and Cristina David (University of Bristol, UK) Termination analyses investigate the termination behavior of programs, intending to detect nontermination, which is known to cause a variety of program bugs (e.g. hanging programs, denial-of-service vulnerabilities). Beyond formal approaches, various attempts have been made to estimate the termination behavior of programs using neural networks. However, the majority of these approaches continue to rely on formal methods to provide strong soundness guarantees and consequently suffer from similar limitations. In this paper, we move away from formal methods and embrace the stochastic nature of machine learning models. Instead of aiming for rigorous guarantees that can be interpreted by solvers, our objective is to provide an estimation of a program's termination behavior and of the likely reason for nontermination (when applicable) that a programmer can use for debugging purposes. Compared to previous approaches using neural networks for program termination, we also take advantage of the graph representation of programs by employing Graph Neural Networks. To further assist programmers in understanding and debugging nontermination bugs, we adapt the notions of attention and semantic segmentation, previously used for other application domains, to programs. Overall, we designed and implemented classifiers for program termination based on Graph Convolutional Networks and Graph Attention Networks, as well as a semantic segmentation Graph Neural Network that localizes AST nodes likely to cause nontermination. We also illustrated how the information provided by semantic segmentation can be combined with program slicing to further aid debugging. 
@InProceedings{ESEC/FSE22p910, author = {Yoav Alon and Cristina David}, title = {Using Graph Neural Networks for Program Termination}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {910--921}, doi = {10.1145/3540250.3549095}, year = {2022}, } Publisher's Version Artifacts Functional |
|
David, Yaniv |
ESEC/FSE '22: "NeuDep: Neural Binary Memory ..."
NeuDep: Neural Binary Memory Dependence Analysis
Kexin Pei, Dongdong She, Michael Wang, Scott Geng, Zhou Xuan, Yaniv David, Junfeng Yang, Suman Jana, and Baishakhi Ray (Columbia University, USA; Massachusetts Institute of Technology, USA; Purdue University, USA) Determining whether multiple instructions can access the same memory location is a critical task in binary analysis. It is challenging as statically computing precise alias information is undecidable in theory. The problem aggravates at the binary level due to the presence of compiler optimizations and the absence of symbols and types. Existing approaches either produce significant spurious dependencies due to conservative analysis or scale poorly to complex binaries. We present a new machine-learning-based approach to predict memory dependencies by exploiting the model's learned knowledge about how binary programs execute. Our approach features (i) a self-supervised procedure that pretrains a neural net to reason over binary code and its dynamic value flows through memory addresses, followed by (ii) supervised finetuning to infer the memory dependencies statically. To facilitate efficient learning, we develop dedicated neural architectures to encode the heterogeneous inputs (i.e., code, data values, and memory addresses from traces) with specific modules and fuse them with a composition learning strategy. We implement our approach in NeuDep and evaluate it on 41 popular software projects compiled by 2 compilers, 4 optimizations, and 4 obfuscation passes. We demonstrate that NeuDep is more precise (1.5x) and faster (3.5x) than the current state-of-the-art. Extensive probing studies on security-critical reverse engineering tasks suggest that NeuDep understands memory access patterns, learns function signatures, and is able to match indirect calls. All these tasks either assist or benefit from inferring memory dependencies. Notably, NeuDep also outperforms the current state-of-the-art on these tasks. 
@InProceedings{ESEC/FSE22p747, author = {Kexin Pei and Dongdong She and Michael Wang and Scott Geng and Zhou Xuan and Yaniv David and Junfeng Yang and Suman Jana and Baishakhi Ray}, title = {NeuDep: Neural Binary Memory Dependence Analysis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {747--759}, doi = {10.1145/3540250.3549147}, year = {2022}, } Publisher's Version |
|
Davis, James C. |
ESEC/FSE '22-IVR: "Discrepancies among Pre-trained ..."
Discrepancies among Pre-trained Deep Neural Networks: A New Threat to Model Zoo Reliability
Diego Montes, Pongpatapee Peerapatanapokin, Jeff Schultz, Chengjun Guo, Wenxin Jiang, and James C. Davis (Purdue University, USA) Training deep neural networks (DNNs) takes significant time and resources. A practice for expedited deployment is to use pre-trained deep neural networks (PTNNs), often from model zoos--collections of PTNNs; yet, the reliability of model zoos remains unexamined. In the absence of an industry standard for the implementation and performance of PTNNs, engineers cannot confidently incorporate them into production systems. As a first step, discovering potential discrepancies between PTNNs across model zoos would reveal a threat to model zoo reliability. Prior works indicated existing variances in deep learning systems in terms of accuracy. However, broader measures of reliability for PTNNs from model zoos are unexplored. This work measures notable discrepancies between accuracy, latency, and architecture of 36 PTNNs across four model zoos. Among the top 10 discrepancies, we find differences of 1.23%-2.62% in accuracy and 9%-131% in latency. We also find mismatches in architecture for well-known DNN architectures (e.g., ResNet and AlexNet). Our findings call for future works on empirical validation, automated tools for measurement, and best practices for implementation. @InProceedings{ESEC/FSE22p1605, author = {Diego Montes and Pongpatapee Peerapatanapokin and Jeff Schultz and Chengjun Guo and Wenxin Jiang and James C. Davis}, title = {Discrepancies among Pre-trained Deep Neural Networks: A New Threat to Model Zoo Reliability}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1605--1609}, doi = {10.1145/3540250.3560881}, year = {2022}, } Publisher's Version ESEC/FSE '22-IVR: "Reflections on Software Failure ..." Reflections on Software Failure Analysis Paschal C. Amusuo, Aishwarya Sharma, Siddharth R. Rao, Abbey Vincent, and James C. 
Davis (Purdue University, USA) Failure studies are important in revealing the root causes, behaviors, and life cycle of defects in software systems. These studies either focus on understanding the characteristics of defects in specific classes of systems, or the characteristics of a specific type of defect in the systems in which it manifests. Failure studies have influenced various software engineering research directions, especially in the areas of software evolution, defect detection, and program repair. In this paper, we reflect on the conduct of failure studies in software engineering. We reviewed a sample of 52 failure study papers. We identified several recurring problems in these studies, some of which hinder the ability of the software engineering community to trust or replicate the results. Based on our findings, we suggest future research directions, including identifying and analyzing failure causal chains, standardizing the conduct of failure studies, and tool support for faster defect analysis. @InProceedings{ESEC/FSE22p1615, author = {Paschal C. Amusuo and Aishwarya Sharma and Siddharth R. Rao and Abbey Vincent and James C. Davis}, title = {Reflections on Software Failure Analysis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1615--1620}, doi = {10.1145/3540250.3560879}, year = {2022}, } Publisher's Version |
|
Decan, Alexandre |
ESEC/FSE '22: "PaReco: Patched Clones and ..."
PaReco: Patched Clones and Missed Patches among the Divergent Variants of a Software Family
Poedjadevie Kadjel Ramkisoen, John Businge, Brent van Bladel, Alexandre Decan, Serge Demeyer, Coen De Roover, and Foutse Khomh (University of Antwerp, Belgium; Flanders Make, Belgium; University of Nevada at Las Vegas, USA; University of Mons, Belgium; F.R.S.-FNRS, Belgium; Vrije Universiteit Brussel, Belgium; Polytechnique Montréal, Canada) Re-using whole repositories as a starting point for new projects is often done by maintaining a variant fork parallel to the original. However, the common artifacts between both are not always kept up to date. As a result, patches are not optimally integrated across the two repositories, which may lead to sub-optimal maintenance between the variant and the original project. A bug existing in both repositories can be patched in one but not the other (we see this as a missed opportunity) or it can be manually patched in both probably by different developers (we see this as effort duplication). In this paper we present a tool (named PaReCo) which relies on clone detection to mine cases of missed opportunity and effort duplication from a pool of patches. We analyzed 364 (source to target) variant pairs with 8,323 patches resulting in a curated dataset containing 1,116 cases of effort duplication and 1,008 cases of missed opportunities. We achieve a precision of 91%, recall of 80%, accuracy of 88%, and F1-score of 85%. Furthermore, we investigated the time interval between patches and found out that, on average, missed patches in the target variants have been introduced in the source variants 52 weeks earlier. Consequently, PaReCo can be used to manage variability in “time” by automatically identifying interesting patches in later project releases to be backported to supported earlier releases. 
@InProceedings{ESEC/FSE22p646, author = {Poedjadevie Kadjel Ramkisoen and John Businge and Brent van Bladel and Alexandre Decan and Serge Demeyer and Coen De Roover and Foutse Khomh}, title = {PaReco: Patched Clones and Missed Patches among the Divergent Variants of a Software Family}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {646--658}, doi = {10.1145/3540250.3549112}, year = {2022}, } Publisher's Version |
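As a quick consistency check on the metrics reported in the abstract above, F1 is the harmonic mean of precision and recall. The snippet below is only an illustrative calculation and is not part of PaReCo; the function name `f1_score` is our own.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the abstract: precision 91%, recall 80%.
reported_precision = 0.91
reported_recall = 0.80

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")  # consistent with the reported 85% after rounding
```

The reported accuracy (88%) cannot be re-derived this way, since it depends on the underlying confusion-matrix counts, which the abstract does not give.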
|
Dehghani, Nader |
ESEC/FSE '22-IND: "Workgraph: Personal Focus ..."
Workgraph: Personal Focus vs. Interruption for Engineers at Meta
Yifen Chen, Peter C. Rigby, Yulin Chen, Kun Jiang, Nader Dehghani, Qianying Huang, Peter Cottle, Clayton Andrews, Noah Lee, and Nachiappan Nagappan (Meta, USA; Concordia University, Canada) All engineers dislike interruptions because they take away from the deep focus time needed to write complex code. Our goal is to reduce unnecessary interruptions at Meta. We first describe our Workgraph platform that logs how engineers use our internal work tools at Meta. Using these anonymized logs, we create sessions. Sessions are defined in opposition to interruption and are the amount of time until the engineer is interrupted by, for example, a work chat message. We report descriptive statistics on how long engineers are able to focus. We find that at Meta, engineers have a total of 14.25 hours of personal-focus time per week. These numbers are comparable with those reported by other software firms. We then create a Random Forest model to understand which factors influence the median daily personal-focus time. We find that the more time an engineer spends in the IDE, the longer their focus. We also find that the more central an engineer is in the social work network, the shorter their personal-focus time. Other factors such as role and domain/pillar have little impact on personal-focus at Meta. To help engineers achieve longer blocks of personal-focus and help them stay in flow, Meta developed the AutoFocus tool that blocks work chat notifications when an engineer is working on code for 12 minutes or longer. AutoFocus allows the sender to still force a work chat message using “@notify” ensuring that urgent messages still get through, but allowing the sender to reflect on the importance of the message. In a large experiment, we find that AutoFocus increases the amount of personal-focus time by 20.27%, and it has now been rolled out widely at Meta. @InProceedings{ESEC/FSE22p1390, author = {Yifen Chen and Peter C. 
Rigby and Yulin Chen and Kun Jiang and Nader Dehghani and Qianying Huang and Peter Cottle and Clayton Andrews and Noah Lee and Nachiappan Nagappan}, title = {Workgraph: Personal Focus vs. Interruption for Engineers at Meta}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1390--1397}, doi = {10.1145/3540250.3558961}, year = {2022}, } Publisher's Version |
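The session definition and the AutoFocus rule described in the abstract above can be sketched as follows. This is a minimal illustration under our own assumptions, not Meta's implementation: times are plain minutes, and the names `focus_sessions`, `should_deliver`, and `AUTOFOCUS_THRESHOLD_MIN` are hypothetical.

```python
from typing import List

# From the abstract: notifications are blocked after 12 minutes of coding.
AUTOFOCUS_THRESHOLD_MIN = 12


def focus_sessions(start: float, end: float, interruptions: List[float]) -> List[float]:
    """Split the work period [start, end] into uninterrupted spans.

    Each returned value is the length (in minutes) of one session,
    i.e. the time that elapses until the next interruption.
    """
    sessions = []
    cursor = start
    for t in sorted(interruptions):
        if start < t < end:
            sessions.append(t - cursor)
            cursor = t
    sessions.append(end - cursor)
    return sessions


def should_deliver(message: str, minutes_coding: float) -> bool:
    """Hypothetical AutoFocus-style rule for a chat notification.

    A message is held back once the recipient has been coding for the
    threshold time, unless the sender forces it with "@notify".
    """
    if "@notify" in message:
        return True
    return minutes_coding < AUTOFOCUS_THRESHOLD_MIN


print(focus_sessions(0, 60, [10, 45]))
print(should_deliver("quick question", 20))
print(should_deliver("@notify outage!", 20))
```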
|
Demeyer, Serge |
ESEC/FSE '22: "PaReco: Patched Clones and ..."
PaReco: Patched Clones and Missed Patches among the Divergent Variants of a Software Family
Poedjadevie Kadjel Ramkisoen, John Businge, Brent van Bladel, Alexandre Decan, Serge Demeyer, Coen De Roover, and Foutse Khomh (University of Antwerp, Belgium; Flanders Make, Belgium; University of Nevada at Las Vegas, USA; University of Mons, Belgium; F.R.S.-FNRS, Belgium; Vrije Universiteit Brussel, Belgium; Polytechnique Montréal, Canada) Re-using whole repositories as a starting point for new projects is often done by maintaining a variant fork parallel to the original. However, the common artifacts between both are not always kept up to date. As a result, patches are not optimally integrated across the two repositories, which may lead to sub-optimal maintenance between the variant and the original project. A bug existing in both repositories can be patched in one but not the other (we see this as a missed opportunity) or it can be manually patched in both probably by different developers (we see this as effort duplication). In this paper we present a tool (named PaReCo) which relies on clone detection to mine cases of missed opportunity and effort duplication from a pool of patches. We analyzed 364 (source to target) variant pairs with 8,323 patches resulting in a curated dataset containing 1,116 cases of effort duplication and 1,008 cases of missed opportunities. We achieve a precision of 91%, recall of 80%, accuracy of 88%, and F1-score of 85%. Furthermore, we investigated the time interval between patches and found out that, on average, missed patches in the target variants have been introduced in the source variants 52 weeks earlier. Consequently, PaReCo can be used to manage variability in “time” by automatically identifying interesting patches in later project releases to be backported to supported earlier releases. 
@InProceedings{ESEC/FSE22p646, author = {Poedjadevie Kadjel Ramkisoen and John Businge and Brent van Bladel and Alexandre Decan and Serge Demeyer and Coen De Roover and Foutse Khomh}, title = {PaReco: Patched Clones and Missed Patches among the Divergent Variants of a Software Family}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {646--658}, doi = {10.1145/3540250.3549112}, year = {2022}, } Publisher's Version |
|
Deng, Yao |
ESEC/FSE '22: "Testing of Autonomous Driving ..."
Testing of Autonomous Driving Systems: Where Are We and Where Should We Go?
Guannan Lou, Yao Deng, Xi Zheng, Mengshi Zhang, and Tianyi Zhang (Macquarie University, Australia; Meta, USA; Purdue University, USA) Autonomous driving has shown great potential to reform modern transportation. Yet its reliability and safety have drawn a lot of attention and concerns. Compared with traditional software systems, autonomous driving systems (ADSs) often use deep neural networks in tandem with logic-based modules. This new paradigm poses unique challenges for software testing. Despite the recent development of new ADS testing techniques, it is not clear to what extent those techniques have addressed the needs of ADS practitioners. To fill this gap, we present the first comprehensive study to identify the current practices and needs of ADS testing. We conducted semi-structured interviews with developers from 10 autonomous driving companies and surveyed 100 developers who have worked on autonomous driving systems. A systematic analysis of the interview and survey data revealed 7 common practices and 4 emerging needs of autonomous driving testing. Through a comprehensive literature review, we developed a taxonomy of existing ADS testing techniques and analyzed the gap between ADS research and practitioners’ needs. Finally, we proposed several future directions for SE researchers, such as developing test reduction techniques to accelerate simulation-based ADS testing. @InProceedings{ESEC/FSE22p31, author = {Guannan Lou and Yao Deng and Xi Zheng and Mengshi Zhang and Tianyi Zhang}, title = {Testing of Autonomous Driving Systems: Where Are We and Where Should We Go?}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {31--43}, doi = {10.1145/3540250.3549111}, year = {2022}, } Publisher's Version ESEC/FSE '22: "Scenario-Based Test Reduction ..." 
Scenario-Based Test Reduction and Prioritization for Multi-Module Autonomous Driving Systems Yao Deng, Xi Zheng, Mengshi Zhang, Guannan Lou, and Tianyi Zhang (Macquarie University, Australia; Meta, USA; Purdue University, USA) When developing autonomous driving systems (ADS), developers often need to replay previously collected driving recordings to check the correctness of newly introduced changes to the system. However, simply replaying the entire recording is not necessary given the high redundancy of driving scenes in a recording (e.g., keeping the same lane for 10 minutes on a highway). In this paper, we propose a novel test reduction and prioritization approach for multi-module ADS. First, our approach automatically encodes frames in a driving recording to feature vectors based on a driving scene schema. Then, the given recording is sliced into segments based on the similarity of consecutive vectors. Lengthy segments are truncated to reduce the length of a recording and redundant segments with the same vector are removed. The remaining segments are prioritized based on both the coverage and the rarity of driving scenes. We implemented this approach on an industry-level, multi-module ADS called Apollo and evaluated it on three road maps in various regression settings. The results show that our approach significantly reduced the original recordings by over 34% while keeping comparable test effectiveness, identifying almost all injected faults. Furthermore, our test prioritization method achieves about 22% to 39% and 41% to 53% improvements over three baselines in terms of both the average percentage of faults detected (APFD) and TOP-K. 
@InProceedings{ESEC/FSE22p82, author = {Yao Deng and Xi Zheng and Mengshi Zhang and Guannan Lou and Tianyi Zhang}, title = {Scenario-Based Test Reduction and Prioritization for Multi-Module Autonomous Driving Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {82--93}, doi = {10.1145/3540250.3549152}, year = {2022}, } Publisher's Version |
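The reduction steps named in the abstract above (slice by consecutive vectors, truncate long segments, drop repeated segments) can be sketched as below. This is a simplified illustration, not the authors' implementation: it treats two frames as similar only when their feature vectors are exactly equal, and the names `slice_segments`, `reduce_recording`, and `max_len` are our own assumptions.

```python
from typing import List, Tuple

Vector = Tuple[int, ...]
Segment = Tuple[Vector, int]  # (feature vector, number of frames)


def slice_segments(frames: List[Vector]) -> List[Segment]:
    """Group runs of consecutive identical feature vectors into segments."""
    segments: List[Segment] = []
    for vec in frames:
        if segments and segments[-1][0] == vec:
            segments[-1] = (vec, segments[-1][1] + 1)
        else:
            segments.append((vec, 1))
    return segments


def reduce_recording(frames: List[Vector], max_len: int = 3) -> List[Segment]:
    """Truncate long segments and drop segments whose vector was already seen."""
    seen = set()
    reduced: List[Segment] = []
    for vec, length in slice_segments(frames):
        if vec in seen:
            continue  # redundant segment with the same vector
        seen.add(vec)
        reduced.append((vec, min(length, max_len)))
    return reduced


# Toy recording: 5 frames of scene A, 2 of scene B, then 4 more of scene A.
scene_a, scene_b = (0, 1), (1, 1)
recording = [scene_a] * 5 + [scene_b] * 2 + [scene_a] * 4
print(reduce_recording(recording))
```

Prioritizing the remaining segments by coverage and rarity is left out of this sketch.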
|
Deng, Yinlin |
ESEC/FSE '22: "Fuzzing Deep-Learning Libraries ..."
Fuzzing Deep-Learning Libraries via Automated Relational API Inference
Yinlin Deng, Chenyuan Yang, Anjiang Wei, and Lingming Zhang (University of Illinois at Urbana-Champaign, USA; Stanford University, USA) Deep Learning (DL) has gained wide attention in recent years. Meanwhile, bugs in DL systems can lead to serious consequences, and may even threaten human lives. As a result, a growing body of research has been dedicated to DL model testing. However, there is still limited work on testing DL libraries, e.g., PyTorch and TensorFlow, which serve as the foundations for building, training, and running DL models. Prior work on fuzzing DL libraries can only generate tests for APIs which have been invoked by documentation examples, developer tests, or DL models, leaving a large number of APIs untested. In this paper, we propose DeepREL, the first approach to automatically inferring relational APIs for more effective DL library fuzzing. Our basic hypothesis is that for a DL library under test, there may exist a number of APIs sharing similar input parameters and outputs; in this way, we can easily “borrow” test inputs from invoked APIs to test other relational APIs. Furthermore, we formalize the notion of value equivalence and status equivalence for relational APIs to serve as the oracle for effective bug finding. We have implemented DeepREL as a fully automated end-to-end relational API inference and fuzzing technique for DL libraries, which 1) automatically infers potential API relations based on API syntactic/semantic information, 2) synthesizes concrete test programs for invoking relational APIs, 3) validates the inferred relational APIs via representative test inputs, and finally 4) performs fuzzing on the verified relational APIs to find potential inconsistencies. Our evaluation on two of the most popular DL libraries, PyTorch and TensorFlow, demonstrates that DeepREL can cover 157% more APIs than state-of-the-art FreeFuzz. 
To date, DeepREL has detected 162 bugs in total, with 106 already confirmed by the developers as previously unknown bugs. Surprisingly, DeepREL has detected 13.5% of the high-priority bugs for the entire PyTorch issue-tracking system in a three-month period. Besides the 162 code bugs, we have also detected 14 documentation bugs (all confirmed). @InProceedings{ESEC/FSE22p44, author = {Yinlin Deng and Chenyuan Yang and Anjiang Wei and Lingming Zhang}, title = {Fuzzing Deep-Learning Libraries via Automated Relational API Inference}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {44--56}, doi = {10.1145/3540250.3549085}, year = {2022}, } Publisher's Version |
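The two oracles mentioned in the abstract above, value equivalence and status equivalence, can be illustrated with a small sketch. It is not DeepREL's code: it uses Python standard-library functions in place of PyTorch or TensorFlow APIs, and the helper names `run`, `status_equivalent`, and `value_equivalent` are hypothetical.

```python
import math


def run(api, args):
    """Invoke api(*args) and return ('ok', value) or ('error', exception name)."""
    try:
        return ("ok", api(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)


def status_equivalent(api_a, api_b, args) -> bool:
    """Both APIs either succeed or both fail on the same input."""
    return run(api_a, args)[0] == run(api_b, args)[0]


def value_equivalent(api_a, api_b, args, tol: float = 1e-9) -> bool:
    """Both APIs succeed and return numerically close results."""
    status_a, out_a = run(api_a, args)
    status_b, out_b = run(api_b, args)
    if status_a != "ok" or status_b != "ok":
        return False
    return math.isclose(out_a, out_b, rel_tol=tol, abs_tol=tol)


# Stand-in "relational APIs": builtin sum and math.fsum share inputs and outputs.
numbers = [0.5, 0.25, 0.125]
print(value_equivalent(sum, math.fsum, (numbers,)))
print(status_equivalent(sum, math.fsum, (["a"],)))   # both raise
print(status_equivalent(math.sqrt, abs, (-1.0,)))    # only sqrt raises
```

A mismatch under either check on an API pair that was expected to match is the kind of inconsistency a fuzzer would flag for inspection.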
|
De Roover, Coen |
ESEC/FSE '22: "PaReco: Patched Clones and ..."
PaReco: Patched Clones and Missed Patches among the Divergent Variants of a Software Family
Poedjadevie Kadjel Ramkisoen, John Businge, Brent van Bladel, Alexandre Decan, Serge Demeyer, Coen De Roover, and Foutse Khomh (University of Antwerp, Belgium; Flanders Make, Belgium; University of Nevada at Las Vegas, USA; University of Mons, Belgium; F.R.S.-FNRS, Belgium; Vrije Universiteit Brussel, Belgium; Polytechnique Montréal, Canada) Re-using whole repositories as a starting point for new projects is often done by maintaining a variant fork parallel to the original. However, the common artifacts between both are not always kept up to date. As a result, patches are not optimally integrated across the two repositories, which may lead to sub-optimal maintenance between the variant and the original project. A bug existing in both repositories can be patched in one but not the other (we see this as a missed opportunity) or it can be manually patched in both probably by different developers (we see this as effort duplication). In this paper we present a tool (named PaReCo) which relies on clone detection to mine cases of missed opportunity and effort duplication from a pool of patches. We analyzed 364 (source to target) variant pairs with 8,323 patches resulting in a curated dataset containing 1,116 cases of effort duplication and 1,008 cases of missed opportunities. We achieve a precision of 91%, recall of 80%, accuracy of 88%, and F1-score of 85%. Furthermore, we investigated the time interval between patches and found out that, on average, missed patches in the target variants have been introduced in the source variants 52 weeks earlier. Consequently, PaReCo can be used to manage variability in “time” by automatically identifying interesting patches in later project releases to be backported to supported earlier releases. 
@InProceedings{ESEC/FSE22p646, author = {Poedjadevie Kadjel Ramkisoen and John Businge and Brent van Bladel and Alexandre Decan and Serge Demeyer and Coen De Roover and Foutse Khomh}, title = {PaReco: Patched Clones and Missed Patches among the Divergent Variants of a Software Family}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {646--658}, doi = {10.1145/3540250.3549112}, year = {2022}, } Publisher's Version |
|
Devanbu, Premkumar T. |
ESEC/FSE '22: "NatGen: Generative Pre-training ..."
NatGen: Generative Pre-training by “Naturalizing” Source Code
Saikat Chakraborty, Toufique Ahmed, Yangruibo Ding, Premkumar T. Devanbu, and Baishakhi Ray (Columbia University, USA; University of California at Davis, USA) Pre-trained Generative Language models (e.g., PLBART, CodeT5, SPT-Code) for source code yielded strong results on several tasks in the past few years, including code generation and translation. These models have adopted varying pre-training objectives to learn statistics of code construction from very large-scale corpora in a self-supervised fashion; the success of pre-trained models largely hinges on these pre-training objectives. This paper proposes a new pre-training objective, “Naturalizing” of source code, exploiting code’s bimodal, dual-channel (formal & natural channels) nature. Unlike natural language, code’s bimodal, dual-channel nature allows us to generate semantically equivalent code at scale. We introduce six classes of semantic preserving transformations to introduce unnatural forms of code, and then force our model to produce more natural original programs written by developers. Learning to generate equivalent, but more natural code, at scale, over large corpora of open-source code, without explicit manual supervision, helps the model learn to both ingest & generate code. We fine-tune our model in three generative Software Engineering tasks: code generation, code translation, and code refinement with limited human-curated labeled data and achieve state-of-the-art performance rivaling CodeT5. We show that our pre-trained model is especially competitive at zero-shot and few-shot learning, and better at learning code properties (e.g., syntax, data flow). @InProceedings{ESEC/FSE22p18, author = {Saikat Chakraborty and Toufique Ahmed and Yangruibo Ding and Premkumar T. 
Devanbu and Baishakhi Ray}, title = {NatGen: Generative Pre-training by “Naturalizing” Source Code}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {18--30}, doi = {10.1145/3540250.3549162}, year = {2022}, } Publisher's Version |
|
Di Grazia, Luca |
ESEC/FSE '22: "The Evolution of Type Annotations ..."
The Evolution of Type Annotations in Python: An Empirical Study
Luca Di Grazia and Michael Pradel (University of Stuttgart, Germany) Type annotations and gradual type checkers attempt to reveal errors and facilitate maintenance in dynamically typed programming languages. Despite the availability of these features and tools, it is currently unclear how quickly developers are adopting them, what strategies they follow when doing so, and whether adding type annotations reveals more type errors. This paper presents the first large-scale empirical study of the evolution of type annotations and type errors in Python. The study is based on an analysis of 1,414,936 type annotation changes, which we extract from 1,123,393 commits among 9,655 projects. Our results show that (i) type annotations are getting more popular, and once added, often remain unchanged in the projects for a long time, (ii) projects follow three evolution patterns for type annotation usage -- regular annotation, type sprints, and occasional uses -- and that the used pattern correlates with the number of contributors, (iii) more type annotations help find more type errors (0.704 correlation), but nevertheless, many commits (78.3%) are committed despite having such errors. Our findings show that better developer training and automated techniques for adding type annotations are needed, as most code still remains unannotated, and they call for a better integration of gradual type checking into the development process. @InProceedings{ESEC/FSE22p209, author = {Luca Di Grazia and Michael Pradel}, title = {The Evolution of Type Annotations in Python: An Empirical Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {209--220}, doi = {10.1145/3540250.3549114}, year = {2022}, } Publisher's Version Info Artifacts Reusable |
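To make concrete what a "type annotation change" between two revisions looks like, the sketch below counts annotated parameter and return slots with Python's standard `ast` module. It is an illustration under our own assumptions, not the study's tooling: it ignores `*args`/`**kwargs` and variable annotations, and `count_annotations` is a hypothetical helper name.

```python
import ast


def count_annotations(source: str) -> dict:
    """Count annotated vs. total annotation slots (parameters and returns)."""
    tree = ast.parse(source)
    annotated = 0
    total = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
            for param in params:
                total += 1
                annotated += int(param.annotation is not None)
            total += 1  # the return-type slot
            annotated += int(node.returns is not None)
    return {"annotated": annotated, "total": total}


# Two revisions of the same function, before and after adding annotations.
before = "def area(w, h):\n    return w * h\n"
after = "def area(w: float, h: float) -> float:\n    return w * h\n"

print(count_annotations(before))
print(count_annotations(after))
```

Comparing the two counts across consecutive commits gives a rough per-commit measure of annotation growth.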
|
Dinella, Elizabeth |
ESEC/FSE '22: "Program Merge Conflict Resolution ..."
Program Merge Conflict Resolution via Neural Transformers
Alexey Svyatkovskiy, Sarah Fakhoury, Negar Ghorbani, Todd Mytkowicz, Elizabeth Dinella, Christian Bird, Jinu Jang, Neel Sundaresan, and Shuvendu K. Lahiri (Microsoft, USA; Washington State University, USA; University of California at Irvine, USA; Microsoft Research, USA; University of Pennsylvania, USA) Collaborative software development is an integral part of the modern software development life cycle, essential to the success of large-scale software projects. When multiple developers make concurrent changes around the same lines of code, a merge conflict may occur. Such conflicts stall pull requests and continuous integration pipelines for hours to several days, seriously hurting developer productivity. To address this problem, we introduce MergeBERT, a novel neural program merge framework based on token-level three-way differencing and a transformer encoder model. By exploiting the restricted nature of merge conflict resolutions, we reformulate the task of generating the resolution sequence as a classification task over a set of primitive merge patterns extracted from real-world merge commit data. Our model achieves 63–68% accuracy for merge resolution synthesis, yielding nearly a 3× performance improvement over existing semi-structured, and 2× improvement over neural program merge tools. Finally, we demonstrate that MergeBERT is sufficiently flexible to work with source code files in Java, JavaScript, TypeScript, and C# programming languages. To measure the practical use of MergeBERT, we conduct a user study to evaluate MergeBERT suggestions with 25 developers from large OSS projects on 122 real-world conflicts they encountered. Results suggest that in practice, MergeBERT resolutions would be accepted at a higher rate than estimated by automatic metrics for precision and accuracy. Additionally, we use participant feedback to identify future avenues for improvement of MergeBERT. 
@InProceedings{ESEC/FSE22p822, author = {Alexey Svyatkovskiy and Sarah Fakhoury and Negar Ghorbani and Todd Mytkowicz and Elizabeth Dinella and Christian Bird and Jinu Jang and Neel Sundaresan and Shuvendu K. Lahiri}, title = {Program Merge Conflict Resolution via Neural Transformers}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {822--833}, doi = {10.1145/3540250.3549163}, year = {2022}, } Publisher's Version |
|
Ding, Xuhua |
ESEC/FSE '22-DEMO: "FastKLEE: Faster Symbolic ..."
FastKLEE: Faster Symbolic Execution via Reducing Redundant Bound Checking of Type-Safe Pointers
Haoxin Tu, Lingxiao Jiang, Xuhua Ding, and He Jiang (Singapore Management University, Singapore; Dalian University of Technology, China) Symbolic execution (SE) has been widely adopted for automatic program analysis and software testing. Many SE engines (e.g., KLEE or Angr) need to interpret certain Intermediate Representations (IR) of code during execution, which may be slow and costly. Although a number of studies have proposed to accelerate SE, few of them consider optimizing the internal interpretation operations. In this paper, we propose FastKLEE, a faster SE engine that aims to speed up execution via reducing redundant bound checking of type-safe pointers during IR code interpretation. Specifically, in FastKLEE, a type inference system is first leveraged to classify pointer types (i.e., safe or unsafe) for the most frequently interpreted read/write instructions. Then, a customized memory operation is designed to perform bound checking for only the unsafe pointers and omit redundant checking on safe pointers. We implement FastKLEE on top of the well-known SE engine KLEE and combine it with the notable type inference system CCured. Evaluation results demonstrate that, compared with the state-of-the-art approach KLEE, FastKLEE reduces the time to explore the same number (i.e., 10k) of execution paths by up to 9.1% (5.6% on average). FastKLEE is open-sourced at https://github.com/haoxintu/FastKLEE. A video demo of FastKLEE is available at https://youtu.be/fjV_a3kt-mo. @InProceedings{ESEC/FSE22p1741, author = {Haoxin Tu and Lingxiao Jiang and Xuhua Ding and He Jiang}, title = {FastKLEE: Faster Symbolic Execution via Reducing Redundant Bound Checking of Type-Safe Pointers}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1741--1745}, doi = {10.1145/3540250.3558919}, year = {2022}, } Publisher's Version Video Info |
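The core idea in the abstract above, checking bounds only for pointers classified as unsafe, can be shown with a toy interpreter memory. This is a sketch under stated assumptions and is unrelated to KLEE's actual memory model: the `safe` flag stands in for the verdict of a type inference system, and the `Memory` and `OutOfBounds` names are hypothetical.

```python
class OutOfBounds(Exception):
    """Raised when a checked access falls outside the memory object."""


class Memory:
    def __init__(self, size: int):
        self.cells = [0] * size
        self.checks_performed = 0  # counts how many bound checks actually ran

    def _check(self, addr: int) -> None:
        self.checks_performed += 1
        if not 0 <= addr < len(self.cells):
            raise OutOfBounds(addr)

    def load(self, addr: int, safe: bool = False) -> int:
        # Pointers pre-classified as safe skip the (redundant) bound check.
        if not safe:
            self._check(addr)
        return self.cells[addr]

    def store(self, addr: int, value: int, safe: bool = False) -> None:
        if not safe:
            self._check(addr)
        self.cells[addr] = value


mem = Memory(4)
mem.store(1, 7, safe=True)       # safe pointer: no check
print(mem.load(1, safe=True))    # safe pointer: no check
print(mem.load(2))               # unsafe pointer: one check
print(mem.checks_performed)
```

The saving comes from the skipped `_check` calls; correctness rests entirely on the safe/unsafe classification being sound.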
|
Ding, Yangruibo |
ESEC/FSE '22: "NatGen: Generative Pre-training ..."
NatGen: Generative Pre-training by “Naturalizing” Source Code
Saikat Chakraborty, Toufique Ahmed, Yangruibo Ding, Premkumar T. Devanbu, and Baishakhi Ray (Columbia University, USA; University of California at Davis, USA) Pre-trained Generative Language models (e.g., PLBART, CodeT5, SPT-Code) for source code yielded strong results on several tasks in the past few years, including code generation and translation. These models have adopted varying pre-training objectives to learn statistics of code construction from very large-scale corpora in a self-supervised fashion; the success of pre-trained models largely hinges on these pre-training objectives. This paper proposes a new pre-training objective, “Naturalizing” of source code, exploiting code’s bimodal, dual-channel (formal & natural channels) nature. Unlike natural language, code’s bimodal, dual-channel nature allows us to generate semantically equivalent code at scale. We introduce six classes of semantic preserving transformations to introduce unnatural forms of code, and then force our model to produce more natural original programs written by developers. Learning to generate equivalent, but more natural code, at scale, over large corpora of open-source code, without explicit manual supervision, helps the model learn to both ingest & generate code. We fine-tune our model in three generative Software Engineering tasks: code generation, code translation, and code refinement with limited human-curated labeled data and achieve state-of-the-art performance rivaling CodeT5. We show that our pre-trained model is especially competitive at zero-shot and few-shot learning, and better at learning code properties (e.g., syntax, data flow). @InProceedings{ESEC/FSE22p18, author = {Saikat Chakraborty and Toufique Ahmed and Yangruibo Ding and Premkumar T. 
Devanbu and Baishakhi Ray}, title = {NatGen: Generative Pre-training by “Naturalizing” Source Code}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {18--30}, doi = {10.1145/3540250.3549162}, year = {2022}, } Publisher's Version |
|
Doan, Hong-Phuc |
ESEC/FSE '22-DEMO: "MANDO-GURU: Vulnerability ..."
MANDO-GURU: Vulnerability Detection for Smart Contract Source Code by Heterogeneous Graph Embeddings
Hoang H. Nguyen, Nhat-Minh Nguyen, Hong-Phuc Doan, Zahra Ahmadi, Thanh-Nam Doan, and Lingxiao Jiang (Leibniz Universität Hannover, Germany; Singapore Management University, Singapore; Hanoi University of Science and Technology, Vietnam) Smart contracts are increasingly used with blockchain systems for high-value applications. It is highly desirable to ensure the quality of smart contract source code before it is deployed. This paper proposes a new deep learning-based tool, MANDO-GURU, that aims to accurately detect vulnerabilities in smart contracts at both coarse-grained contract-level and fine-grained line-level. Using a combination of control-flow graphs and call graphs of Solidity code, we design new heterogeneous graph attention neural networks to encode more structural and potentially semantic relations among different types of nodes and edges of such graphs and use the encoded embeddings of the graphs and nodes to detect vulnerabilities. Our validation on real-world smart contract datasets shows that MANDO-GURU can significantly improve on many other vulnerability detection techniques by up to 24% in terms of the F1-score at the contract level, depending on vulnerability types. It is the first learning-based tool for Ethereum smart contracts that identifies vulnerabilities at the line level and significantly improves on traditional code analysis-based techniques by up to 63.4%. Our tool is publicly available at https://github.com/MANDO-Project/ge-sc-machine. A test version is currently deployed at http://mandoguru.com, and a demo video of our tool is available at http://mandoguru.com/demo-video. @InProceedings{ESEC/FSE22p1736, author = {Hoang H. 
Nguyen and Nhat-Minh Nguyen and Hong-Phuc Doan and Zahra Ahmadi and Thanh-Nam Doan and Lingxiao Jiang}, title = {MANDO-GURU: Vulnerability Detection for Smart Contract Source Code by Heterogeneous Graph Embeddings}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1736--1740}, doi = {10.1145/3540250.3558927}, year = {2022}, } Publisher's Version Video Info |
|
Doan, Thanh-Nam |
ESEC/FSE '22-DEMO: "MANDO-GURU: Vulnerability ..."
MANDO-GURU: Vulnerability Detection for Smart Contract Source Code by Heterogeneous Graph Embeddings
Hoang H. Nguyen, Nhat-Minh Nguyen, Hong-Phuc Doan, Zahra Ahmadi, Thanh-Nam Doan, and Lingxiao Jiang (Leibniz Universität Hannover, Germany; Singapore Management University, Singapore; Hanoi University of Science and Technology, Vietnam) Smart contracts are increasingly used with blockchain systems for high-value applications. It is highly desirable to ensure the quality of smart contract source code before it is deployed. This paper proposes a new deep learning-based tool, MANDO-GURU, that aims to accurately detect vulnerabilities in smart contracts at both coarse-grained contract-level and fine-grained line-level. Using a combination of control-flow graphs and call graphs of Solidity code, we design new heterogeneous graph attention neural networks to encode more structural and potentially semantic relations among different types of nodes and edges of such graphs and use the encoded embeddings of the graphs and nodes to detect vulnerabilities. Our validation on real-world smart contract datasets shows that MANDO-GURU can significantly improve on many other vulnerability detection techniques by up to 24% in terms of the F1-score at the contract level, depending on vulnerability types. It is the first learning-based tool for Ethereum smart contracts that identifies vulnerabilities at the line level and significantly improves on traditional code analysis-based techniques by up to 63.4%. Our tool is publicly available at https://github.com/MANDO-Project/ge-sc-machine. A test version is currently deployed at http://mandoguru.com, and a demo video of our tool is available at http://mandoguru.com/demo-video. @InProceedings{ESEC/FSE22p1736, author = {Hoang H. 
Nguyen and Nhat-Minh Nguyen and Hong-Phuc Doan and Zahra Ahmadi and Thanh-Nam Doan and Lingxiao Jiang}, title = {MANDO-GURU: Vulnerability Detection for Smart Contract Source Code by Heterogeneous Graph Embeddings}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1736--1740}, doi = {10.1145/3540250.3558927}, year = {2022}, } Publisher's Version Video Info |
|
Dolby, Julian |
ESEC/FSE '22-TUT: "Program Analysis using WALA ..."
Program Analysis using WALA (Tutorial)
Joanna C. S. Santos and Julian Dolby (University of Notre Dame, USA; IBM Research, USA) Static analysis is widely used in research and practice for multiple purposes such as fault localization, vulnerability detection, code clone identification, code refactoring, optimization, etc. Since implementing static analyzers is a non-trivial task, engineers often rely on existing frameworks to implement their techniques. The IBM T.J. Watson Libraries for Analysis (WALA) is one such framework; it allows the analysis of multiple environments, such as Java bytecode (and related languages), JavaScript, Android, Python, etc. In this tutorial, we walk through the process of using WALA for program analysis. First, the tutorial will cover the background knowledge necessary to understand the technical implementation details of the explained algorithms and techniques. Subsequently, we provide a technical overview of the WALA framework and its support for analyzing code in multiple programming languages and frameworks. Then, we will do several live demonstrations of using WALA to implement client analyses. We will focus on two common uses of analysis: taint analysis, a form of security analysis, and the use of analysis graphs for machine learning of code. @InProceedings{ESEC/FSE22p1819, author = {Joanna C. S. Santos and Julian Dolby}, title = {Program Analysis using WALA (Tutorial)}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1819--1819}, doi = {10.1145/3540250.3569449}, year = {2022}, } Publisher's Version |
|
Dong, Jin Song |
ESEC/FSE '22-DEMO: "RegMiner: Mining Replicable ..."
RegMiner: Mining Replicable Regression Dataset from Code Repositories
Xuezhi Song, Yun Lin, Yijian Wu, Yifan Zhang, Siang Hwee Ng, Xin Peng, Jin Song Dong, and Hong Mei (Fudan University, China; Shanghai Jiao Tong University, China; National University of Singapore, Singapore; Peking University, China) In this work, we introduce a tool, RegMiner, to automate the process of collecting replicable regression bugs from a set of Git repositories. In the code commit history, RegMiner searches for regressions where a test can pass a regression-fixing commit, fail a regression-inducing commit, and pass a previous working commit again. Technically, RegMiner (1) identifies potential regression-fixing commits from the code evolution history, (2) migrates the test and its code dependencies in the commit over the history, and (3) minimizes the compilation overhead during the regression search. Our experiments show that RegMiner can successfully collect 1035 regressions over 147 projects in 8 weeks, creating the largest replicable regression dataset within the shortest period, to the best of our knowledge. In addition, our experiments further show that (1) RegMiner can construct the regression dataset with very high precision and acceptable recall, and (2) the constructed regression dataset is of high authenticity and diversity. The source code of RegMiner is available at https://github.com/SongXueZhi/RegMiner, the mined regression dataset is available at https://regminer.github.io/, and the demonstration video is available at https://youtu.be/yzcM9Y4unok. @InProceedings{ESEC/FSE22p1711, author = {Xuezhi Song and Yun Lin and Yijian Wu and Yifan Zhang and Siang Hwee Ng and Xin Peng and Jin Song Dong and Hong Mei}, title = {RegMiner: Mining Replicable Regression Dataset from Code Repositories}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1711--1715}, doi = {10.1145/3540250.3558929}, year = {2022}, } Publisher's Version Video Info |
|
Dong, Liming |
ESEC/FSE '22: "Semi-supervised Pre-processing ..."
Semi-supervised Pre-processing for Learning-Based Traceability Framework on Real-World Software Projects
Liming Dong, He Zhang, Wei Liu, Zhiluo Weng, and Hongyu Kuang (Nanjing University, China) The traceability of software artifacts has been recognized as an important factor to support various activities in software development processes. However, traceability can be difficult and time-consuming to create and maintain manually, so automated approaches have gained much attention. Unfortunately, existing automated approaches for traceability suffer from practical issues. This paper aims to gain an understanding of the potential causes of the underperformance of state-of-the-art, ML-based trace link classifiers applied in real-world projects. By investigating different industrial datasets, we found that two critical (and classic) challenges, i.e. data imbalance and sparsity problems, lie in real-world projects’ traceability automation. To overcome these challenges, we developed a framework called SPLINT to incorporate hybrid textual similarity measures and semi-supervised learning strategies as enhancements to the learning-based traceability approaches. We carried out experiments with six open-source platforms and ten industry datasets. The results confirm that SPLINT is able to operate at higher performance on two communities’ datasets. Specifically, the industrial datasets, which significantly suffer from data imbalance and sparsity problems, show an increase in F2-score over 14% and AUC over 8% on average. The adjusted class-balancing and self-training policies used in SPLINT (CBST-Adjust) also work effectively for the selection of pseudo-labels on minor classes from unlabeled trace sets, demonstrating SPLINT’s practicability.
@InProceedings{ESEC/FSE22p570, author = {Liming Dong and He Zhang and Wei Liu and Zhiluo Weng and Hongyu Kuang}, title = {Semi-supervised Pre-processing for Learning-Based Traceability Framework on Real-World Software Projects}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {570--582}, doi = {10.1145/3540250.3549151}, year = {2022}, } Publisher's Version |
|
Dong, Yihong |
ESEC/FSE '22-IND: "Incorporating Domain Knowledge ..."
Incorporating Domain Knowledge through Task Augmentation for Front-End JavaScript Code Generation
Sijie Shen, Xiang Zhu, Yihong Dong, Qizhi Guo, Yankun Zhen, and Ge Li (Peking University, China; Alibaba Group, China) Code generation aims to generate a code snippet automatically from natural language descriptions. Generally, the mainstream code generation methods rely on a large amount of paired training data, including both the natural language description and the code. However, in some domain-specific scenarios, building such a large paired corpus for code generation is difficult because there is no directly available pairing data, and a lot of effort is required to manually write the code descriptions to construct a high-quality training dataset. Due to the limited training data, the generation model cannot be well trained and is likely to overfit, making the model's performance unsatisfactory for real-world use. To this end, in this paper, we propose a task augmentation method that incorporates domain knowledge into code generation models through auxiliary tasks and a Subtoken-TranX model by extending the original TranX model to support subtoken-level code generation. To verify our proposed approach, we collect a real-world code generation dataset and conduct experiments on it. Our experimental results demonstrate that the subtoken-level TranX model outperforms the original TranX model and the Transformer model on our dataset, and the exact match accuracy of Subtoken-TranX improves significantly by 12.75% with the help of our task augmentation method. The model performance on several code categories has satisfied the requirements for application in industrial systems. Our proposed approach has been adopted by Alibaba's BizCook platform. To the best of our knowledge, this is the first domain code generation system adopted in industrial development environments.
@InProceedings{ESEC/FSE22p1533, author = {Sijie Shen and Xiang Zhu and Yihong Dong and Qizhi Guo and Yankun Zhen and Ge Li}, title = {Incorporating Domain Knowledge through Task Augmentation for Front-End JavaScript Code Generation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1533--1543}, doi = {10.1145/3540250.3558965}, year = {2022}, } Publisher's Version |
|
Dong, Zikan |
ESEC/FSE '22-IND: "What Did You Pack in My App? ..."
What Did You Pack in My App? A Systematic Analysis of Commercial Android Packers
Zikan Dong, Hongxuan Liu, Liu Wang, Xiapu Luo, Yao Guo, Guoai Xu, Xusheng Xiao, and Haoyu Wang (Beijing University of Posts and Telecommunications, China; Peking University, China; Hong Kong Polytechnic University, China; Case Western Reserve University, USA; Huazhong University of Science and Technology, China) Commercial Android packers have been widely used by developers as a way to protect their apps from being tampered with. However, app packers are usually provided as online services developed by security vendors, and the packed apps are well protected. It is thus hard to know what exactly is packed in the app, and few existing studies in the community have systematically analyzed the behaviors of commercial app packers. In this paper, we propose PackDiff, a dynamic analysis system to inspect the fine-grained behaviors of commercial packers. By instrumenting the Android system, PackDiff records the runtime behaviors of Android apps (e.g., Linux system call invocations, Java API calls, Binder interactions, etc.), which are further processed to pinpoint the additional sensitive behaviors introduced by packers. By applying PackDiff to roughly 200 apps protected by seven commercial packers, we observe the disappointing facts of existing commercial packers. Most app packers have introduced unnecessary behaviors (e.g., accessing sensitive data), serious performance and compatibility issues, and they can even be abused to create evasive malware and repackaged apps, which contradicts their design purposes. @InProceedings{ESEC/FSE22p1430, author = {Zikan Dong and Hongxuan Liu and Liu Wang and Xiapu Luo and Yao Guo and Guoai Xu and Xusheng Xiao and Haoyu Wang}, title = {What Did You Pack in My App? A Systematic Analysis of Commercial Android Packers}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1430--1440}, doi = {10.1145/3540250.3558969}, year = {2022}, } Publisher's Version |
|
Drain, Dawn |
ESEC/FSE '22-IND: "Exploring and Evaluating Personalized ..."
Exploring and Evaluating Personalized Models for Code Generation
Andrei Zlotchevski, Dawn Drain, Alexey Svyatkovskiy, Colin B. Clement, Neel Sundaresan, and Michele Tufano (McGill University, Canada; Anthropic, USA; Microsoft, USA) Large Transformer models achieved the state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain -- for example, question-answering on a given topic -- generalization remains an on-going challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model's parameters, allowing tuning of the token embeddings and softmax layer only or the final layer alone; (iii) prefix tuning, which keeps model parameters frozen, but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off in total compute cost and predictive performance, which we evaluate by code and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios. @InProceedings{ESEC/FSE22p1500, author = {Andrei Zlotchevski and Dawn Drain and Alexey Svyatkovskiy and Colin B. Clement and Neel Sundaresan and Michele Tufano}, title = {Exploring and Evaluating Personalized Models for Code Generation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1500--1508}, doi = {10.1145/3540250.3558959}, year = {2022}, } Publisher's Version |
|
D'Souza, Deepak |
ESEC/FSE '22: "Static Executes-Before Analysis ..."
Static Executes-Before Analysis for Event Driven Programs
Rekha Pai, Abhishek Uppar, Akshatha Shenoy, Pranshul Kushwaha, and Deepak D'Souza (IISc Bangalore, India; TCS Research, India) The executes-before relation between tasks is fundamental in the analysis of Event Driven Programs with several downstream applications like race detection and identifying redundant synchronizations. We present a sound, efficient, and effective static analysis technique to compute executes-before pairs of tasks for a general class of event driven programs. The analysis is based on a small but comprehensive set of rules evaluated on a novel structure called the task post graph of a program. We show how to use the executes-before information to identify disjoint-blocks in event driven programs and further use them to improve the precision of data race detection for these programs. We have implemented our analysis in the Flowdroid framework in a tool called AndRacer and evaluated it on several Android apps, bringing out the scalability, recall, and improved precision of the analyses. @InProceedings{ESEC/FSE22p233, author = {Rekha Pai and Abhishek Uppar and Akshatha Shenoy and Pranshul Kushwaha and Deepak D'Souza}, title = {Static Executes-Before Analysis for Event Driven Programs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {233--244}, doi = {10.1145/3540250.3549116}, year = {2022}, } Publisher's Version Artifacts Functional |
|
Du, Mingzhe |
ESEC/FSE '22-DEMO: "SemCluster: A Semi-supervised ..."
SemCluster: A Semi-supervised Clustering Tool for Crowdsourced Test Reports with Deep Image Understanding
Mingzhe Du, Shengcheng Yu, Chunrong Fang, Tongyu Li, Heyuan Zhang, and Zhenyu Chen (Nanjing University, China) Due to the openness of crowdsourced testing, mobile app crowdsourced testing has been subject to duplicate reports. The previous research methods extract the textual features of the crowdsourced test reports, combine with shallow image analysis, and perform unsupervised clustering on the crowdsourced test reports to clarify the duplication of crowdsourced test reports and solve the problem. However, these methods ignore the semantic connection between textual descriptions and screenshots, making the clustering results unsatisfactory and the deduplication effect less accurate. This paper proposes a semi-supervised clustering tool for crowdsourced test reports with deep image understanding, namely SemCluster, which makes the most of the semantic connection between textual descriptions and screenshots by constructing semantic binding rules and performing semi-supervised clustering. SemCluster improves six metrics of clustering results in the experiment compared to the state-of-the-art method, which verifies that SemCluster has achieved a good deduplication effect. The demo can be found at: https://sites.google.com/view/semcluster-demo. @InProceedings{ESEC/FSE22p1756, author = {Mingzhe Du and Shengcheng Yu and Chunrong Fang and Tongyu Li and Heyuan Zhang and Zhenyu Chen}, title = {SemCluster: A Semi-supervised Clustering Tool for Crowdsourced Test Reports with Deep Image Understanding}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1756--1759}, doi = {10.1145/3540250.3558933}, year = {2022}, } Publisher's Version |
|
Du, Xu |
ESEC/FSE '22: "Actionable and Interpretable ..."
Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems
Zeyan Li, Nengwen Zhao, Mingjie Li, Xianglin Lu, Lixin Wang, Dongdong Chang, Xiaohui Nie, Li Cao, Wenchi Zhang, Kaixin Sui, Yanhua Wang, Xu Du, Guoqiang Duan, and Dan Pei (Tsinghua University, China; China Construction Bank, China; BizSeer, China) Fault localization is challenging in an online service system due to its monitoring data's large volume and variety and complex dependencies across/within its components (e.g., services or databases). Furthermore, engineers require fault localization solutions to be actionable and interpretable, which existing research approaches cannot satisfy. Therefore, the common industry practice is that, for a specific online service system, its experienced engineers focus on localization for recurring failures based on the knowledge accumulated about the system and historical failures. More specifically, 1) they can identify the underlying root causes and take mitigation actions when pinpointing a group of indicative metrics on the faulty component; 2) their diagnosis knowledge is roughly based on how one failure might affect the components in the whole system. Although the above common practice is actionable and interpretable, it is largely manual, thus slow and sometimes inaccurate. In this paper, we aim to automate this practice through machine learning. That is, we propose an actionable and interpretable fault localization approach, DejaVu, for recurring failures in online service systems. For a specific online service system, DejaVu takes historical failures and dependencies in the system as input and trains a localization model offline; for an incoming failure, the trained model online recommends where the failure occurs (i.e., the faulty components) and which kind of failure occurs (i.e., the indicative group of metrics) (thus actionable), which are further interpreted both globally and locally (thus interpretable). 
Based on the evaluation on 601 failures from three production systems and one open-source benchmark, in less than one second, DejaVu can rank the ground truths at 1.66∼5.03-th among a long candidate list on average, outperforming baselines by 54.52%. @InProceedings{ESEC/FSE22p996, author = {Zeyan Li and Nengwen Zhao and Mingjie Li and Xianglin Lu and Lixin Wang and Dongdong Chang and Xiaohui Nie and Li Cao and Wenchi Zhang and Kaixin Sui and Yanhua Wang and Xu Du and Guoqiang Duan and Dan Pei}, title = {Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {996--1008}, doi = {10.1145/3540250.3549092}, year = {2022}, } Publisher's Version |
|
Duan, Guoqiang |
ESEC/FSE '22: "Actionable and Interpretable ..."
Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems
Zeyan Li, Nengwen Zhao, Mingjie Li, Xianglin Lu, Lixin Wang, Dongdong Chang, Xiaohui Nie, Li Cao, Wenchi Zhang, Kaixin Sui, Yanhua Wang, Xu Du, Guoqiang Duan, and Dan Pei (Tsinghua University, China; China Construction Bank, China; BizSeer, China) Fault localization is challenging in an online service system due to its monitoring data's large volume and variety and complex dependencies across/within its components (e.g., services or databases). Furthermore, engineers require fault localization solutions to be actionable and interpretable, which existing research approaches cannot satisfy. Therefore, the common industry practice is that, for a specific online service system, its experienced engineers focus on localization for recurring failures based on the knowledge accumulated about the system and historical failures. More specifically, 1) they can identify the underlying root causes and take mitigation actions when pinpointing a group of indicative metrics on the faulty component; 2) their diagnosis knowledge is roughly based on how one failure might affect the components in the whole system. Although the above common practice is actionable and interpretable, it is largely manual, thus slow and sometimes inaccurate. In this paper, we aim to automate this practice through machine learning. That is, we propose an actionable and interpretable fault localization approach, DejaVu, for recurring failures in online service systems. For a specific online service system, DejaVu takes historical failures and dependencies in the system as input and trains a localization model offline; for an incoming failure, the trained model online recommends where the failure occurs (i.e., the faulty components) and which kind of failure occurs (i.e., the indicative group of metrics) (thus actionable), which are further interpreted both globally and locally (thus interpretable). 
Based on the evaluation on 601 failures from three production systems and one open-source benchmark, in less than one second, DejaVu can rank the ground truths at 1.66∼5.03-th among a long candidate list on average, outperforming baselines by 54.52%. @InProceedings{ESEC/FSE22p996, author = {Zeyan Li and Nengwen Zhao and Mingjie Li and Xianglin Lu and Lixin Wang and Dongdong Chang and Xiaohui Nie and Li Cao and Wenchi Zhang and Kaixin Sui and Yanhua Wang and Xu Du and Guoqiang Duan and Dan Pei}, title = {Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {996--1008}, doi = {10.1145/3540250.3549092}, year = {2022}, } Publisher's Version |
|
Duan, Nan |
ESEC/FSE '22: "Automating Code Review Activities ..."
Automating Code Review Activities by Large-Scale Pre-training
Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundaresan (Peking University, China; Microsoft Research, China; Sun Yat-sen University, China; LinkedIn, USA; Microsoft, USA) Code review is an essential part of the software development lifecycle since it aims at guaranteeing the quality of code. Modern code review activities necessitate developers viewing, understanding and even running the programs to assess logic, functionality, latency, style and other factors. It turns out that developers have to spend far too much time reviewing the code of their peers. Accordingly, it is in significant demand to automate the code review process. In this research, we focus on utilizing pre-training techniques for the tasks in the code review scenario. We collect a large-scale dataset of real-world code changes and code reviews from open-source projects in nine of the most popular programming languages. To better understand code diffs and reviews, we propose CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically for the code review scenario. To evaluate our model, we focus on three key tasks related to code review activities, including code change quality estimation, review comment generation and code refinement. Furthermore, we establish a high-quality benchmark dataset based on our collected data for these three tasks and conduct comprehensive experiments on it. The experimental results demonstrate that our model outperforms the previous state-of-the-art pre-training approaches in all tasks. Further analysis shows that our proposed pre-training tasks and the multilingual pre-training dataset benefit the model on the understanding of code changes and reviews.
@InProceedings{ESEC/FSE22p1035, author = {Zhiyu Li and Shuai Lu and Daya Guo and Nan Duan and Shailesh Jannu and Grant Jenks and Deep Majumder and Jared Green and Alexey Svyatkovskiy and Shengyu Fu and Neel Sundaresan}, title = {Automating Code Review Activities by Large-Scale Pre-training}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1035--1047}, doi = {10.1145/3540250.3549081}, year = {2022}, } Publisher's Version |
|
Dyer, Robert |
ESEC/FSE '22: "An Exploratory Study on the ..."
An Exploratory Study on the Predominant Programming Paradigms in Python Code
Robert Dyer and Jigyasa Chauhan (University of Nebraska-Lincoln, USA) Python is a multi-paradigm programming language that fully supports object-oriented (OO) programming. The language allows writing code in a non-procedural imperative manner, using procedures, using classes, or in a functional style. To date, no one has studied what paradigm(s), if any, are predominant in Python code and projects. In this work, we first define a technique to classify Python files into predominant paradigm(s). We then automate our approach and evaluate it against human judgements, showing over 80% agreement. We then analyze over 100k open-source Python projects, automatically classifying each source file and investigating the paradigm distributions. The results indicate Python developers tend to heavily favor OO features. We also observed a positive correlation between OO and procedural paradigms and the size of the project. And despite few files or projects being predominantly functional, we still found many functional feature uses. @InProceedings{ESEC/FSE22p684, author = {Robert Dyer and Jigyasa Chauhan}, title = {An Exploratory Study on the Predominant Programming Paradigms in Python Code}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {684--695}, doi = {10.1145/3540250.3549158}, year = {2022}, } Publisher's Version Info Artifacts Reusable
ESEC/FSE '22-TUT: "Performing Large-Scale Mining ..."
Performing Large-Scale Mining Studies: From Start to Finish (Tutorial)
Robert Dyer and Samuel W. Flint (University of Nebraska-Lincoln, USA) Modern software engineering research often relies on mining open-source software repositories, to either provide motivation for their research problems and/or evaluation of the proposed approach. Mining ultra-large-scale software repositories is still a difficult task, requiring substantial expertise and access to significant hardware. Tools such as Boa can help researchers easily mine large numbers of open-source repositories. There has also recently been more of a push toward open science, with an emphasis on making replication packages available. Building such replication packages incurs additional workload for researchers. In this tutorial, we teach how to use the Boa infrastructure for mining software repository data. We leverage Boa’s VS Code IDE extension to help write and submit Boa queries, and also leverage Boa’s study template to show how researchers can more easily analyze the output from Boa and automatically produce a suitable replication package that is published on Zenodo. @InProceedings{ESEC/FSE22p1822, author = {Robert Dyer and Samuel W. Flint}, title = {Performing Large-Scale Mining Studies: From Start to Finish (Tutorial)}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1822--1822}, doi = {10.1145/3540250.3569448}, year = {2022}, } Publisher's Version |
|
Eberlein, Martin |
ESEC/FSE '22-DOC: "Explaining and Debugging Pathological ..."
Explaining and Debugging Pathological Program Behavior
Martin Eberlein (Humboldt University of Berlin, Germany) Programs fail. But which part of the input is responsible for the failure? To resolve the issue, developers must first understand how and why the program behaves as it does, notably when it deviates from the expected outcome. A program’s behavior is essentially the set of all its executions. This set is usually diverse, unpredictable, and generally unbounded. A pathological program behavior occurs once the actual outcome does not match the expected behavior. Consequently, developers must fix these issues to ensure the built system is the desired software. In our upcoming research, we want to focus on providing developers with a detailed description of the root causes that resulted in the program’s unwanted behavior. Thus, we aim to automatically produce explanations that capture the circumstances of arbitrary program behavior by correlating individual input elements (features) and their corresponding execution outcome. To this end, we use the scientific method and combine generative and predictive models, allowing us (i) to learn the statistical relations between the features of the inputs and the program behavior and (ii) to generate new inputs to refine or refute our current explanatory prediction model. @InProceedings{ESEC/FSE22p1795, author = {Martin Eberlein}, title = {Explaining and Debugging Pathological Program Behavior}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1795--1799}, doi = {10.1145/3540250.3558910}, year = {2022}, } Publisher's Version |
|
Eghbali, Aryaz |
ESEC/FSE '22: "DynaPyt: A Dynamic Analysis ..."
DynaPyt: A Dynamic Analysis Framework for Python
Aryaz Eghbali and Michael Pradel (University of Stuttgart, Germany) Python is a widely used programming language that powers important application domains such as machine learning, data analysis, and web applications. For many programs in these domains it is consequential to analyze aspects like security and performance, and with Python’s dynamic nature, it is crucial to be able to dynamically analyze Python programs. However, existing tools and frameworks do not provide the means to implement dynamic analyses easily and practitioners resort to implementing an ad-hoc dynamic analysis for their own use case. This work presents DynaPyt, the first general-purpose framework for heavy-weight dynamic analysis of Python programs. Compared to existing tools for other programming languages, our framework provides a wider range of analysis hooks arranged in a hierarchical structure, which allows developers to concisely implement analyses. DynaPyt features selective instrumentation and execution modification as well. We evaluate our framework on test suites of 9 popular open-source Python projects, 1,268,545 lines of code in total, and show that it, by and large, preserves the semantics of the original execution. The running time of DynaPyt is between 1.2x and 16x the original execution time, which is in line with similar frameworks designed for other languages, and 5.6%–88.6% faster than analyses using a built-in tracing API offered by Python. We also implement multiple analyses, show the simplicity of implementing them and some potential use cases of DynaPyt. Among the analyses implemented are: an analysis to detect a memory blow up in PyTorch programs, a taint analysis to detect SQL injections, and an analysis to warn about a runtime performance anti-pattern.
@InProceedings{ESEC/FSE22p760, author = {Aryaz Eghbali and Michael Pradel}, title = {DynaPyt: A Dynamic Analysis Framework for Python}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {760--771}, doi = {10.1145/3540250.3549126}, year = {2022}, } Publisher's Version Info Artifacts Functional |
|
Engman, Anton |
ESEC/FSE '22-IND: "Testing of Machine Learning ..."
Testing of Machine Learning Models with Limited Samples: An Industrial Vacuum Pumping Application
Ayan Chatterjee, Bestoun S. Ahmed, Erik Hallin, and Anton Engman (Karlstad University, Sweden; Uddeholms, Sweden) There is often a scarcity of training data for machine learning (ML) classification and regression models in industrial production, especially for time-consuming or sparsely run manufacturing processes. Traditionally, a majority of the limited ground-truth data is used for training, while a handful of samples are left for testing. In that case, the number of test samples is inadequate to properly evaluate the robustness of the ML models under test (i.e., the system under test) for classification and regression. Furthermore, the output of these ML models may be inaccurate or even fail if the input data differ from the expected. This is the case for ML models used in the Electroslag Remelting (ESR) process in the refined steel industry to predict the pressure in a vacuum chamber. A vacuum pumping event that occurs once a workday generates a few hundred samples in a year of pumping for training and testing. In the absence of adequate training and test samples, this paper first presents a method to generate a fresh set of augmented samples based on vacuum pumping principles. Based on the generated augmented samples, three test scenarios and one test oracle are presented to assess the robustness of an ML model used for production on an industrial scale. Experiments are conducted with real industrial production data obtained from Uddeholms AB steel company. The evaluations indicate that Ensemble and Neural Network are the most robust when trained on augmented data using the proposed testing strategy. The evaluation also demonstrates the proposed method's effectiveness in checking and improving ML algorithms' robustness in such situations. The work improves software testing's state-of-the-art robustness testing in similar settings. 
Finally, the paper presents an MLOps implementation of the proposed approach for real-time ML model prediction and action on the edge node and automated continuous delivery of ML software from the cloud. @InProceedings{ESEC/FSE22p1280, author = {Ayan Chatterjee and Bestoun S. Ahmed and Erik Hallin and Anton Engman}, title = {Testing of Machine Learning Models with Limited Samples: An Industrial Vacuum Pumping Application}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1280--1290}, doi = {10.1145/3540250.3558943}, year = {2022}, } Publisher's Version |
|
Ezzini, Saad |
ESEC/FSE '22-DEMO: "WikiDoMiner: Wikipedia Domain-Specific ..."
WikiDoMiner: Wikipedia Domain-Specific Miner
Saad Ezzini, Sallam Abualhaija, and Mehrdad Sabetzadeh (University of Luxembourg, Luxembourg; University of Ottawa, Canada) We introduce WikiDoMiner – a tool for automatically generating domain-specific corpora by crawling Wikipedia. WikiDoMiner helps requirements engineers create an external knowledge resource that is specific to the underlying domain of a given requirements specification (RS). Being able to build such a resource is important since domain-specific datasets are scarce. WikiDoMiner generates a corpus by first extracting a set of domain-specific keywords from a given RS, and then querying Wikipedia for these keywords. The output of WikiDoMiner is a set of Wikipedia articles relevant to the domain of the input RS. Mining Wikipedia for domain-specific knowledge can be beneficial for multiple requirements engineering tasks, e.g., ambiguity handling, requirements classification, and question answering. WikiDoMiner is publicly available on Zenodo under an open-source license (https://doi.org/10.5281/zenodo.6672682). @InProceedings{ESEC/FSE22p1706, author = {Saad Ezzini and Sallam Abualhaija and Mehrdad Sabetzadeh}, title = {WikiDoMiner: Wikipedia Domain-Specific Miner}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1706--1710}, doi = {10.1145/3540250.3558916}, year = {2022}, } Publisher's Version
ESEC/FSE '22-DEMO: "TAPHSIR: Towards AnaPHoric ..."
TAPHSIR: Towards AnaPHoric Ambiguity Detection and ReSolution In Requirements
Saad Ezzini, Sallam Abualhaija, Chetan Arora, and Mehrdad Sabetzadeh (University of Luxembourg, Luxembourg; Deakin University, Australia; University of Ottawa, Canada) We introduce TAPHSIR – a tool for anaphoric ambiguity detection and anaphora resolution in requirements. TAPHSIR facilitates reviewing the use of pronouns in a requirements specification and revising those pronouns that can lead to misunderstandings during the development process. To this end, TAPHSIR detects the requirements which have potential anaphoric ambiguity and further attempts to interpret anaphora occurrences automatically. TAPHSIR employs a hybrid solution composed of an ambiguity detection solution based on machine learning and an anaphora resolution solution based on a variant of the BERT language model. Given a requirements specification, TAPHSIR decides for each pronoun occurrence in the specification whether the pronoun is ambiguous or unambiguous, and further provides an automatic interpretation for the pronoun. The output generated by TAPHSIR can be easily reviewed and validated by requirements engineers. TAPHSIR is publicly available on Zenodo (https://doi.org/10.5281/zenodo.5902117). @InProceedings{ESEC/FSE22p1677, author = {Saad Ezzini and Sallam Abualhaija and Chetan Arora and Mehrdad Sabetzadeh}, title = {TAPHSIR: Towards AnaPHoric Ambiguity Detection and ReSolution In Requirements}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1677--1681}, doi = {10.1145/3540250.3558928}, year = {2022}, } Publisher's Version |
|
Fakhoury, Sarah |
ESEC/FSE '22: "Program Merge Conflict Resolution ..."
Program Merge Conflict Resolution via Neural Transformers
Alexey Svyatkovskiy, Sarah Fakhoury, Negar Ghorbani, Todd Mytkowicz, Elizabeth Dinella, Christian Bird, Jinu Jang, Neel Sundaresan, and Shuvendu K. Lahiri (Microsoft, USA; Washington State University, USA; University of California at Irvine, USA; Microsoft Research, USA; University of Pennsylvania, USA) Collaborative software development is an integral part of the modern software development life cycle, essential to the success of large-scale software projects. When multiple developers make concurrent changes around the same lines of code, a merge conflict may occur. Such conflicts stall pull requests and continuous integration pipelines for hours to several days, seriously hurting developer productivity. To address this problem, we introduce MergeBERT, a novel neural program merge framework based on token-level three-way differencing and a transformer encoder model. By exploiting the restricted nature of merge conflict resolutions, we reformulate the task of generating the resolution sequence as a classification task over a set of primitive merge patterns extracted from real-world merge commit data. Our model achieves 63–68% accuracy for merge resolution synthesis, yielding nearly a 3× performance improvement over existing semi-structured, and 2× improvement over neural program merge tools. Finally, we demonstrate that MergeBERT is sufficiently flexible to work with source code files in Java, JavaScript, TypeScript, and C# programming languages. To measure the practical use of MergeBERT, we conduct a user study to evaluate MergeBERT suggestions with 25 developers from large OSS projects on 122 real-world conflicts they encountered. Results suggest that in practice, MergeBERT resolutions would be accepted at a higher rate than estimated by automatic metrics for precision and accuracy. Additionally, we use participant feedback to identify future avenues for improvement of MergeBERT. 
@InProceedings{ESEC/FSE22p822, author = {Alexey Svyatkovskiy and Sarah Fakhoury and Negar Ghorbani and Todd Mytkowicz and Elizabeth Dinella and Christian Bird and Jinu Jang and Neel Sundaresan and Shuvendu K. Lahiri}, title = {Program Merge Conflict Resolution via Neural Transformers}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {822--833}, doi = {10.1145/3540250.3549163}, year = {2022}, } Publisher's Version |
|
Fang, Chunrong |
ESEC/FSE '22-DEMO: "SemCluster: A Semi-supervised ..."
SemCluster: A Semi-supervised Clustering Tool for Crowdsourced Test Reports with Deep Image Understanding
Mingzhe Du, Shengcheng Yu, Chunrong Fang, Tongyu Li, Heyuan Zhang, and Zhenyu Chen (Nanjing University, China) Due to the openness of crowdsourced testing, mobile app crowdsourced testing has been subject to duplicate reports. The previous research methods extract the textual features of the crowdsourced test reports, combine with shallow image analysis, and perform unsupervised clustering on the crowdsourced test reports to clarify the duplication of crowdsourced test reports and solve the problem. However, these methods ignore the semantic connection between textual descriptions and screenshots, making the clustering results unsatisfactory and the deduplication effect less accurate. This paper proposes a semi-supervised clustering tool for crowdsourced test reports with deep image understanding, namely SemCluster, which makes the most of the semantic connection between textual descriptions and screenshots by constructing semantic binding rules and performing semi-supervised clustering. SemCluster improves six metrics of clustering results in the experiment compared to the state-of-the-art method, which verifies that SemCluster has achieved a good deduplication effect. The demo can be found at: https://sites.google.com/view/semcluster-demo. @InProceedings{ESEC/FSE22p1756, author = {Mingzhe Du and Shengcheng Yu and Chunrong Fang and Tongyu Li and Heyuan Zhang and Zhenyu Chen}, title = {SemCluster: A Semi-supervised Clustering Tool for Crowdsourced Test Reports with Deep Image Understanding}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1756--1759}, doi = {10.1145/3540250.3558933}, year = {2022}, } Publisher's Version
ESEC/FSE '22: "RULER: Discriminative and ..."
RULER: Discriminative and Iterative Adversarial Training for Deep Neural Network Fairness
Guanhong Tao, Weisong Sun, Tingxu Han, Chunrong Fang, and Xiangyu Zhang (Purdue University, USA; Nanjing University, China) Deep Neural Networks (DNNs) are becoming an integral part of many real-world applications, such as autonomous driving and financial management. While these models enable autonomy, there are however concerns regarding their ethics in decision making. For instance, fairness is an aspect that requires particular attention. A number of fairness testing techniques have been proposed to address this issue, e.g., by generating test cases called individual discriminatory instances for repairing DNNs. Although they have demonstrated great potential, they tend to generate many test cases that are not directly effective in improving fairness and incur substantial computation overhead. We propose a new model repair technique, RULER, by discriminating sensitive and non-sensitive attributes during test case generation for model repair. The generated cases are then used in training to improve DNN fairness. RULER balances the trade-off between accuracy and fairness by decomposing the training procedure into two phases and introducing a novel iterative adversarial training method for fairness. Compared to the state-of-the-art techniques on four datasets, RULER has 7-28 times more effective repair test cases generated, is 10-15 times faster in test generation, and has 26-43% more fairness improvement on average. @InProceedings{ESEC/FSE22p1173, author = {Guanhong Tao and Weisong Sun and Tingxu Han and Chunrong Fang and Xiangyu Zhang}, title = {RULER: Discriminative and Iterative Adversarial Training for Deep Neural Network Fairness}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1173--1184}, doi = {10.1145/3540250.3549169}, year = {2022}, } Publisher's Version |
|
Fang, Yuzhou |
ESEC/FSE '22: "An Empirical Study of Blockchain ..."
An Empirical Study of Blockchain System Vulnerabilities: Modules, Types, and Patterns
Xiao Yi, Daoyuan Wu, Lingxiao Jiang, Yuzhou Fang, Kehuan Zhang, and Wei Zhang (Chinese University of Hong Kong, China; Singapore Management University, Singapore; Nanjing University of Posts and Telecommunications, China) Blockchain, as a distributed ledger technology, is becoming increasingly popular, especially for enabling valuable cryptocurrencies and smart contracts. However, blockchain software systems inevitably have many bugs. Although bugs in smart contracts have been extensively investigated, security bugs of the underlying blockchain systems are much less explored. In this paper, we conduct an empirical study on blockchain system vulnerabilities from four representative blockchains: Bitcoin, Ethereum, Monero, and Stellar. Specifically, we first design a systematic filtering process to effectively identify 1,037 vulnerabilities and their 2,317 patches from 34,245 issues/PRs (pull requests) and 85,164 commits on GitHub. We thus build the first blockchain vulnerability dataset, which is available at https://github.com/VPRLab/BlkVulnDataset. We then perform unique analyses of this dataset at three levels, including (i) file-level vulnerable module categorization by identifying and correlating module paths across projects, (ii) text-level vulnerability type clustering by natural language processing and similarity-based sentence clustering, and (iii) code-level vulnerability pattern analysis by generating and clustering code change signatures that capture both syntactic and semantic information of patch code fragments. 
Our analyses reveal three key findings: (i) some blockchain modules are more susceptible than the others; notably, each of the modules related to consensus, wallet, and networking has over 200 issues; (ii) about 70% of blockchain vulnerabilities are of traditional types, but we also identify four new types specific to blockchains; and (iii) we obtain 21 blockchain-specific vulnerability patterns that capture unique blockchain attributes and statuses, and demonstrate that they can be used to detect similar vulnerabilities in other popular blockchains, such as Dogecoin, Bitcoin SV, and Zcash. @InProceedings{ESEC/FSE22p709, author = {Xiao Yi and Daoyuan Wu and Lingxiao Jiang and Yuzhou Fang and Kehuan Zhang and Wei Zhang}, title = {An Empirical Study of Blockchain System Vulnerabilities: Modules, Types, and Patterns}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {709--721}, doi = {10.1145/3540250.3549105}, year = {2022}, } Publisher's Version |
|
Fedyukovich, Grigory |
ESEC/FSE '22: "Multi-Phase Invariant Synthesis ..."
Multi-Phase Invariant Synthesis
Daniel Riley and Grigory Fedyukovich (Florida State University, USA) Loops with multiple phases are challenging to verify because they require disjunctive invariants. Invariants could also have the form of implication between a precondition for the phase and a lemma that is valid throughout the phase. Such invariant structure is however not widely supported in state-of-the-art verification. We present a novel SMT-based approach to synthesize implication invariants for multi-phase loops. Our technique computes Model Based Projections to discover the program's phases and leverages data learning to get relationships among loop variables at an arbitrary place in the loop. It is effective in the challenging cases of mutually-dependent periodic phases, where many implication invariants need to be discovered simultaneously. Our approach has shown promising results in its ability to verify programs with complex phase structures. We have implemented and evaluated our algorithm against several state-of-the-art solvers. @InProceedings{ESEC/FSE22p607, author = {Daniel Riley and Grigory Fedyukovich}, title = {Multi-Phase Invariant Synthesis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {607--619}, doi = {10.1145/3540250.3549166}, year = {2022}, } Publisher's Version Artifacts Functional |
|
Feng, Sidong |
ESEC/FSE '22: "Psychologically-Inspired, ..."
Psychologically-Inspired, Unsupervised Inference of Perceptual Groups of GUI Widgets from GUI Images
Mulong Xie, Zhenchang Xing, Sidong Feng, Xiwei Xu, Liming Zhu, and Chunyang Chen (Australian National University, Australia; CSIRO’s Data61, Australia; Monash University, Australia; UNSW, Australia) A Graphical User Interface (GUI) is not merely a collection of individual and unrelated widgets; rather, it partitions discrete widgets into groups by various visual cues, thus forming higher-order perceptual units such as tab, menu, card or list. The ability to automatically segment a GUI into perceptual groups of widgets constitutes a fundamental component of visual intelligence to automate GUI design, implementation and automation tasks. Although humans can partition a GUI into meaningful perceptual groups of widgets in a highly reliable way, perceptual grouping is still an open challenge for computational approaches. Existing methods rely on ad-hoc heuristics or supervised machine learning that is dependent on specific GUI implementations and runtime information. Research in psychology and biological vision has formulated a set of principles (i.e., Gestalt theory of perception) that describe how humans group elements in visual scenes based on visual cues like connectivity, similarity, proximity and continuity. These principles are domain-independent and have been widely adopted by practitioners to structure content on GUIs to improve aesthetic pleasantness and usability. Inspired by these principles, we present a novel unsupervised image-based method for inferring perceptual groups of GUI widgets. Our method requires only GUI pixel images, is independent of GUI implementation, and does not require any training data. The evaluation on a dataset of 1,091 GUIs collected from 772 mobile apps and 20 UI design mockups shows that our method significantly outperforms the state-of-the-art ad-hoc heuristics-based baseline. Our perceptual grouping method creates opportunities for improving UI-related software engineering tasks. 
@InProceedings{ESEC/FSE22p332, author = {Mulong Xie and Zhenchang Xing and Sidong Feng and Xiwei Xu and Liming Zhu and Chunyang Chen}, title = {Psychologically-Inspired, Unsupervised Inference of Perceptual Groups of GUI Widgets from GUI Images}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {332--343}, doi = {10.1145/3540250.3549138}, year = {2022}, } Publisher's Version Info |
|
Feng, Zixuan |
ESEC/FSE '22: "A Case Study of Implicit Mentoring, ..."
A Case Study of Implicit Mentoring, Its Prevalence, and Impact in Apache
Zixuan Feng, Amreeta Chatterjee, Anita Sarma, and Iftekhar Ahmed (Oregon State University, USA; University of California at Irvine, USA) Mentoring is traditionally viewed as a dyadic, top-down apprenticeship. This perspective, however, overlooks other forms of informal mentoring taking place in everyday activities in which developers invest time and effort. Here, we investigate informal mentoring taking place in Open Source Software (OSS). We define a specific type of informal mentoring—implicit mentoring—situations where contributors guide others through instructions and suggestions embedded in everyday (OSS) activities. We defined implicit mentoring by first performing a review of related work on mentoring, and then through formative interviews with OSS contributors and member-checking. Next, through an empirical investigation of Pull Requests (PRs) in 37 Apache Projects, we built a classifier to extract implicit mentoring. Our analysis of 107,895 PRs shows that implicit mentoring does occur through code reviews (27.41% of all PRs included implicit mentoring) and is beneficial for both mentors and mentees. We analyzed the impact of implicit mentoring on OSS contributors by investigating their contributions and learning trajectories in their projects. Through an online survey (N=231), we then triangulated these results and identified the potential benefits of implicit mentoring from OSS contributors’ perspectives. @InProceedings{ESEC/FSE22p797, author = {Zixuan Feng and Amreeta Chatterjee and Anita Sarma and Iftekhar Ahmed}, title = {A Case Study of Implicit Mentoring, Its Prevalence, and Impact in Apache}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {797--809}, doi = {10.1145/3540250.3549167}, year = {2022}, } Publisher's Version |
|
Fiedor, Jan |
ESEC/FSE '22-IND: "Unite: An Adapter for Transforming ..."
Unite: An Adapter for Transforming Analysis Tools to Web Services via OSLC
Ondřej Vašíček, Jan Fiedor, Tomáš Kratochvíla, Bohuslav Křena, Aleš Smrčka, and Tomáš Vojnar (Brno University of Technology, Czechia; Honeywell International, Czechia) This paper describes Unite, a new tool intended as an adapter for transforming non-interactive command-line analysis tools to OSLC-compliant web services. Unite aims to make such tools easier to adopt and more convenient to use by allowing them to be accessible, both locally and remotely, in a unified way and to be easily integrated into various development environments. Open Services for Lifecycle Collaboration (OSLC) is an open standard for tool integration and was chosen for this task due to its robustness, extensibility, support of data from various domains, and its growing popularity. The work is motivated by allowing existing analysis tools to be more widely used with a strong emphasis on widening their industrial usage. We have implemented Unite and used it with multiple existing static as well as dynamic analysis and verification tools, and then successfully deployed it internationally in the industry to automate verification tasks for development teams in Honeywell. We discuss Honeywell's experience with using Unite and with OSLC in general. Moreover, we also provide the Unite Client (UniC) for Eclipse to allow users to easily run various analysis tools directly from the Eclipse IDE. @InProceedings{ESEC/FSE22p1408, author = {Ondřej Vašíček and Jan Fiedor and Tomáš Kratochvíla and Bohuslav Křena and Aleš Smrčka and Tomáš Vojnar}, title = {Unite: An Adapter for Transforming Analysis Tools to Web Services via OSLC}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1408--1418}, doi = {10.1145/3540250.3558939}, year = {2022}, } Publisher's Version |
|
Filieri, Antonio |
ESEC/FSE '22-IND: "Input Splitting for Cloud-Based ..."
Input Splitting for Cloud-Based Static Application Security Testing Platforms
Maria Christakis, Thomas Cottenier, Antonio Filieri, Linghui Luo, Muhammad Numair Mansur, Lee Pike, Nicolás Rosner, Martin Schäf, Aritra Sengupta, and Willem Visser (MPI-SWS, Germany; Amazon Web Services, USA; Amazon Web Services, Germany) As software development teams adopt DevSecOps practices, application security is increasingly the responsibility of development teams, who are required to set up their own Static Application Security Testing (SAST) infrastructure. Since development teams often do not have the necessary infrastructure and expertise to set up a custom SAST solution, there is an increased need for cloud-based SAST platforms that operate as a service and run a variety of static analyzers. Adding a new static analyzer to a cloud-based SAST platform can be challenging because static analyzers greatly vary in complexity, from linters that scale efficiently to interprocedural dataflow engines that use cubic or even more complex algorithms. Careful manual evaluation is needed to decide whether a new analyzer would slow down the overall response time of the platform or time out too often. We explore the question of whether this can be simplified by splitting the input to the analyzer into partitions and analyzing the partitions independently. Depending on the complexity of the static analyzer, the partition size can be adjusted to curtail the overall response time. We report on an experiment where we run different analysis tools with and without splitting the inputs. The experimental results show that simple splitting strategies can effectively reduce the running time and memory usage per partition without significantly affecting the findings produced by the tool. 
@InProceedings{ESEC/FSE22p1367, author = {Maria Christakis and Thomas Cottenier and Antonio Filieri and Linghui Luo and Muhammad Numair Mansur and Lee Pike and Nicolás Rosner and Martin Schäf and Aritra Sengupta and Willem Visser}, title = {Input Splitting for Cloud-Based Static Application Security Testing Platforms}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1367--1378}, doi = {10.1145/3540250.3558944}, year = {2022}, } Publisher's Version |
|
Filkov, Vladimir |
ESEC/FSE '22: "Code, Quality, and Process ..."
Code, Quality, and Process Metrics in Graduated and Retired ASFI Projects
Ștefan Stănciulescu, Likang Yin, and Vladimir Filkov (University of California at Davis, USA) Recent work on open source sustainability shows that successful trajectories of projects in the Apache Software Foundation Incubator (ASFI) can be predicted early on, using a set of socio-technical measures. Because OSS projects are socio-technical systems centered around code artifacts, we hypothesize that sustainable projects may exhibit different code and process patterns than unsustainable ones, and that those patterns can grow more apparent as projects evolve over time. Here we studied the code and coding processes of over 200 ASFI projects, and found that ASFI graduated projects have different patterns of code quality and complexity than retired ones. The same holds for the coding processes – e.g., feature commits or bug-fixing commits are correlated with project graduation success. We find that minor contributors and major contributors (who contribute <5%, respectively >=95% commits) associate with graduation outcomes, implying that also having developers who contribute fewer commits is important for a project’s success. This study provides evidence that OSS projects, especially nascent ones, can benefit from introspection and instrumentation using multidimensional modeling of the whole system, including code, processes, and code quality measures, and how they are interconnected over time. @InProceedings{ESEC/FSE22p495, author = {Ștefan Stănciulescu and Likang Yin and Vladimir Filkov}, title = {Code, Quality, and Process Metrics in Graduated and Retired ASFI Projects}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {495--506}, doi = {10.1145/3540250.3549132}, year = {2022}, } Publisher's Version |
|
Flint, Samuel W. |
ESEC/FSE '22-TUT: "Performing Large-Scale Mining ..."
Performing Large-Scale Mining Studies: From Start to Finish (Tutorial)
Robert Dyer and Samuel W. Flint (University of Nebraska-Lincoln, USA) Modern software engineering research often relies on mining open-source software repositories, to either provide motivation for their research problems and/or evaluation of the proposed approach. Mining ultra-large-scale software repositories is still a difficult task, requiring substantial expertise and access to significant hardware. Tools such as Boa can help researchers easily mine large numbers of open-source repositories. There has also recently been more of a push toward open science, with an emphasis on making replication packages available. Building such replication packages incurs additional workload for researchers. In this tutorial, we teach how to use the Boa infrastructure for mining software repository data. We leverage Boa’s VS Code IDE extension to help write and submit Boa queries, and also leverage Boa’s study template to show how researchers can more easily analyze the output from Boa and automatically produce a suitable replication package that is published on Zenodo. @InProceedings{ESEC/FSE22p1822, author = {Robert Dyer and Samuel W. Flint}, title = {Performing Large-Scale Mining Studies: From Start to Finish (Tutorial)}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1822--1822}, doi = {10.1145/3540250.3569448}, year = {2022}, } Publisher's Version |
|
Ford, Denae |
ESEC/FSE '22: "Understanding Skills for OSS ..."
Understanding Skills for OSS Communities on GitHub
Jenny T. Liang, Thomas Zimmermann, and Denae Ford (University of Washington, USA; Microsoft Research, USA) The development of open source software (OSS) is a broad field which requires diverse skill sets. For example, maintainers help lead the project and promote its longevity, technical writers assist with documentation, bug reporters identify defects in software, and developers program the software. However, it is unknown which skills are used in OSS development as well as OSS contributors' general attitudes towards skills in OSS. In this paper, we address this gap by administering a survey to a diverse set of 455 OSS contributors. Guided by these responses as well as prior literature on software development expertise and social factors of OSS, we develop a model of skills in OSS that considers the many contexts OSS contributors work in. This model has 45 skills in the following 9 categories: technical skills, working styles, problem solving, contribution types, project-specific skills, interpersonal skills, external relations, management, and characteristics. Through a mix of qualitative and quantitative analyses, we find that OSS contributors are actively motivated to improve skills and perceive many benefits in sharing their skills with others. We then use this analysis to derive a set of design implications and best practices for those who incorporate skills into OSS tools and platforms, such as GitHub. @InProceedings{ESEC/FSE22p170, author = {Jenny T. Liang and Thomas Zimmermann and Denae Ford}, title = {Understanding Skills for OSS Communities on GitHub}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {170--182}, doi = {10.1145/3540250.3549082}, year = {2022}, } Publisher's Version Info |
|
Forsgren, Nicole |
ESEC/FSE '22-IND: "Nalanda: A Socio-technical ..."
Nalanda: A Socio-technical Graph Platform for Building Software Analytics Tools at Enterprise Scale
Chandra Maddila, Suhas Shanbhogue, Apoorva Agrawal, Thomas Zimmermann, Chetan Bansal, Nicole Forsgren, Divyanshu Agrawal, Kim Herzig, and Arie van Deursen (Microsoft Research, USA; Microsoft Research, India; Microsoft, USA; Delft University of Technology, Netherlands) Software development is information-dense knowledge work that requires collaboration with other developers and awareness of artifacts such as work items, pull requests, and file changes. With the speed of development increasing, information overload and information discovery are challenges for people developing and maintaining these systems. Finding information about similar code changes and experts is difficult for software engineers, especially when they work in large software systems or have just recently joined a project. In this paper, we build a large scale data platform named Nalanda platform to address the challenges of information overload and discovery. Nalanda contains two subsystems: (1) a large scale socio-technical graph system, named Nalanda graph system, and (2) a large scale index system, named Nalanda index system that aims at satisfying the information needs of software developers. To show the versatility of the Nalanda platform, we built two applications: (1) a software analytics application with a news feed named MyNalanda that has Daily Active Users (DAU) of 290 and Monthly Active Users (MAU) of 590, and (2) a recommendation system for related work items and pull requests that accomplished similar tasks (artifact recommendation) and a recommendation system for subject matter experts (expert recommendation), augmented by the Nalanda socio-technical graph. Initial studies of the two applications found that developers and engineering managers are favorable toward continued use of the news feed application for information discovery. 
The studies also found that developers agreed that a system like the Nalanda artifact and expert recommendation application could reduce the time spent and the number of places they need to visit to find information. @InProceedings{ESEC/FSE22p1246, author = {Chandra Maddila and Suhas Shanbhogue and Apoorva Agrawal and Thomas Zimmermann and Chetan Bansal and Nicole Forsgren and Divyanshu Agrawal and Kim Herzig and Arie van Deursen}, title = {Nalanda: A Socio-technical Graph Platform for Building Software Analytics Tools at Enterprise Scale}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1246--1256}, doi = {10.1145/3540250.3558949}, year = {2022}, } Publisher's Version Info |
|
Fregnan, Enrico |
ESEC/FSE '22: "First Come First Served: The ..."
First Come First Served: The Impact of File Position on Code Review
Enrico Fregnan, Larissa Braz, Marco D'Ambros, Gül Çalıklı, and Alberto Bacchelli (University of Zurich, Switzerland; USI Lugano, Switzerland; University of Glasgow, UK) The most popular code review tools (e.g., Gerrit and GitHub) present the files to review sorted in alphabetical order. Could this choice or, more generally, the relative position in which a file is presented bias the outcome of code reviews? We investigate this hypothesis by triangulating complementary evidence in a two-step study. First, we observe developers’ code review activity. We analyze the review comments pertaining to 219,476 Pull Requests (PRs) from 138 popular Java projects on GitHub. We found files shown earlier in a PR to receive more comments than files shown later, also when controlling for possible confounding factors: e.g., the presence of discussion threads or the lines added in a file. Second, we measure the impact of file position on defect finding in code review. Recruiting 106 participants, we conduct an online controlled experiment in which we measure participants’ performance in detecting two unrelated defects seeded into two different files. Participants are assigned to one of two treatments in which the position of the defective files is switched. For one type of defect, participants are not affected by its file’s position; for the other, they have 64% lower odds to identify it when its file is last as opposed to first. Overall, our findings provide evidence that the relative position in which files are presented has an impact on code reviews’ outcome; we discuss these results and implications for tool design and code review. 
@InProceedings{ESEC/FSE22p483, author = {Enrico Fregnan and Larissa Braz and Marco D'Ambros and Gül Çalıklı and Alberto Bacchelli}, title = {First Come First Served: The Impact of File Position on Code Review}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {483--494}, doi = {10.1145/3540250.3549177}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Fresno-Aranda, Rafael |
ESEC/FSE '22-DOC: "Automated Capacity Analysis ..."
Automated Capacity Analysis of Limitation-Aware Microservices Architectures
Rafael Fresno-Aranda (University of Seville, Spain) Over the last years, the concept of API economy has fostered the creation of an ecosystem of public APIs used as business elements. These APIs include various pricing plans, which allow developers to consume an API for a specific price and under certain conditions. These conditions include capacity limits, a.k.a. limitations, that limit the usage of the API. Additionally, modern web applications are usually based on a microservices architecture (MSA), in which multiple services communicate with each other through public APIs using a standardized paradigm, commonly RESTful. When an MSA consumes external APIs with limitations, it is necessary to analyse the impact of these limitations in its capacity. These MSAs are known as Limitation-Aware Microservices Architecture (LAMA). This PhD dissertation aims to provide an automated framework to analyse the capacity of a LAMA given the formal description of its internal topology and the external pricing plans. This framework would be used to solve analysis operations, which deal with the extraction of useful information that helps developers build their LAMAs. @InProceedings{ESEC/FSE22p1780, author = {Rafael Fresno-Aranda}, title = {Automated Capacity Analysis of Limitation-Aware Microservices Architectures}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1780--1784}, doi = {10.1145/3540250.3558905}, year = {2022}, } Publisher's Version |
|
Fu, Michael |
ESEC/FSE '22: "VulRepair: A T5-Based Automated ..."
VulRepair: A T5-Based Automated Software Vulnerability Repair
Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Dinh Phung (Monash University, Australia; University of Adelaide, Australia) As software vulnerabilities grow in volume and complexity, researchers proposed various Artificial Intelligence (AI)-based approaches to help under-resourced security analysts to find, detect, and localize vulnerabilities. However, security analysts still have to spend a huge amount of effort to manually fix or repair such vulnerable functions. Recent work proposed an NMT-based Automated Vulnerability Repair, but it is still far from perfect due to various limitations. In this paper, we propose VulRepair, a T5-based automated software vulnerability repair approach that leverages the pre-training and BPE components to address various technical limitations of prior work. Through an extensive experiment with over 8,482 vulnerability fixes from 1,754 real-world software projects, we find that our VulRepair achieves a Perfect Prediction of 44%, which is 13%-21% more accurate than competitive baseline approaches. These results lead us to conclude that our VulRepair is considerably more accurate than two baseline approaches, highlighting the substantial advancement of NMT-based Automated Vulnerability Repairs. Our additional investigation also shows that our VulRepair can accurately repair as many as 745 out of 1,706 real-world well-known vulnerabilities (e.g., Use After Free, Improper Input Validation, OS Command Injection), demonstrating the practicality and significance of our VulRepair for generating vulnerability repairs, helping under-resourced security analysts on fixing vulnerabilities. 
@InProceedings{ESEC/FSE22p935, author = {Michael Fu and Chakkrit Tantithamthavorn and Trung Le and Van Nguyen and Dinh Phung}, title = {VulRepair: A T5-Based Automated Software Vulnerability Repair}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {935--947}, doi = {10.1145/3540250.3549098}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Fu, Shengyu |
ESEC/FSE '22: "Automating Code Review Activities ..."
Automating Code Review Activities by Large-Scale Pre-training
Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundaresan (Peking University, China; Microsoft Research, China; Sun Yat-sen University, China; LinkedIn, USA; Microsoft, USA) Code review is an essential part of the software development lifecycle since it aims at guaranteeing the quality of code. Modern code review activities require developers to view, understand, and even run the programs to assess logic, functionality, latency, style and other factors. As a result, developers have to spend far too much time reviewing the code of their peers. Accordingly, there is significant demand to automate the code review process. In this research, we focus on utilizing pre-training techniques for the tasks in the code review scenario. We collect a large-scale dataset of real-world code changes and code reviews from open-source projects in nine of the most popular programming languages. To better understand code diffs and reviews, we propose CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically for the code review scenario. To evaluate our model, we focus on three key tasks related to code review activities, including code change quality estimation, review comment generation and code refinement. Furthermore, we establish a high-quality benchmark dataset based on our collected data for these three tasks and conduct comprehensive experiments on it. The experimental results demonstrate that our model outperforms the previous state-of-the-art pre-training approaches in all tasks. Further analysis shows that our proposed pre-training tasks and the multilingual pre-training dataset benefit the model on the understanding of code changes and reviews. 
@InProceedings{ESEC/FSE22p1035, author = {Zhiyu Li and Shuai Lu and Daya Guo and Nan Duan and Shailesh Jannu and Grant Jenks and Deep Majumder and Jared Green and Alexey Svyatkovskiy and Shengyu Fu and Neel Sundaresan}, title = {Automating Code Review Activities by Large-Scale Pre-training}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1035--1047}, doi = {10.1145/3540250.3549081}, year = {2022}, } Publisher's Version |
|
Fu, Ying |
ESEC/FSE '22-IND: "Industry Practice of Configuration ..."
Industry Practice of Configuration Auto-tuning for Cloud Applications and Services
Runzhe Wang, Qinglong Wang, Yuxi Hu, Heyuan Shi, Yuheng Shen, Yu Zhan, Ying Fu, Zheng Liu, Xiaohai Shi, and Yu Jiang (Alibaba Group, China; Central South University, China; Tsinghua University, China; Ant Group, China; Zhejiang University, China) Auto-tuning attracts increasing attention in industry practice to optimize the performance of a system with many configurable parameters. It is particularly useful for cloud applications and services since they have complex system hierarchies and intricate knob correlations. However, existing tools and algorithms rarely consider practical problems such as workload pressure control, support for distributed deployment, and expensive time costs, which are highly important for enterprise cloud applications and services. In this work, we significantly extend an open source tuning tool, KeenTune, to optimize several typical enterprise cloud applications and services. Our practice is in collaboration with enterprise users and tuning tool developers to address the aforementioned problems. Specifically, we highlight five key challenges from our experiences and provide a set of solutions accordingly. Through applying the improved tuning tool to different application scenarios, we achieve 2%-14% improvements for the performance of MySQL, OceanBase, nginx, ingress-nginx, and 5%-70% improvements for the performance of ACK cloud container service. @InProceedings{ESEC/FSE22p1555, author = {Runzhe Wang and Qinglong Wang and Yuxi Hu and Heyuan Shi and Yuheng Shen and Yu Zhan and Ying Fu and Zheng Liu and Xiaohai Shi and Yu Jiang}, title = {Industry Practice of Configuration Auto-tuning for Cloud Applications and Services}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1555--1565}, doi = {10.1145/3540250.3558962}, year = {2022}, } Publisher's Version
ESEC/FSE '22-IND: "Investigating and Improving ..."
Investigating and Improving Log Parsing in Practice
Ying Fu, Meng Yan, Jian Xu, Jianguo Li, Zhongxin Liu, Xiaohong Zhang, and Dan Yang (Chongqing University, China; Ant Group, China; Zhejiang University, China) Logs are widely used for system behavior diagnosis by automatic log mining. Log parsing is an important data preprocessing step that converts semi-structured log messages into structured data as the feature input for log mining. Currently, many studies are devoted to proposing new log parsers. However, to the best of our knowledge, no previous study comprehensively investigates the effectiveness of log parsers in industrial practice. To investigate the effectiveness of the log parsers in industrial practice, in this paper, we conduct an empirical study on the effectiveness of six state-of-the-art log parsers on 10 microservice applications of Ant Group. Our empirical results highlight two challenges for log parsing in practice: 1) Various separators. A log message contains various separators, and the separators also vary across event templates and applications. Current log parsers cannot perform well because they do not consider various separators. 2) Various lengths due to nested objects. The log messages belonging to the same event template may also have various lengths due to nested objects. The log messages of 6 out of 10 microservice applications at Ant Group have various lengths due to nested objects, and 4 out of 6 state-of-the-art log parsers cannot deal with them. In this paper, we propose an improved log parser named Drain+ based on a state-of-the-art log parser Drain. Drain+ includes two innovative components to address the above two challenges: a statistical-based separators generation component, which generates separators automatically for log message splitting, and a candidate event template merging component, which merges the candidate event templates by a template similarity method. We evaluate the effectiveness of Drain+ on 10 microservice applications of Ant Group and 16 public datasets. The results show that Drain+ outperforms the six state-of-the-art log parsers on industrial applications and public datasets. Finally, we summarize our observations on the road ahead for log parsing to inspire other researchers and practitioners. @InProceedings{ESEC/FSE22p1566, author = {Ying Fu and Meng Yan and Jian Xu and Jianguo Li and Zhongxin Liu and Xiaohong Zhang and Dan Yang}, title = {Investigating and Improving Log Parsing in Practice}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1566--1577}, doi = {10.1145/3540250.3558947}, year = {2022}, } Publisher's Version |
|
Gall, Harald C. |
ESEC/FSE '22: "On-the-Fly Syntax Highlighting ..."
On-the-Fly Syntax Highlighting using Neural Networks
Marco Edoardo Palma, Pasquale Salza, and Harald C. Gall (University of Zurich, Switzerland) With the presence of online collaborative tools for software developers, source code is shared and consulted frequently, from code viewers to merge requests and code snippets. Typically, code highlighting quality in such scenarios is sacrificed in favor of system responsiveness. In these on-the-fly settings, performing a formal grammatical analysis of the source code is not only expensive, but also intractable for the many times the input is an invalid derivation of the language. Indeed, current popular highlighters heavily rely on a system of regular expressions, typically far from the specification of the language's lexer. Due to their complexity, regular expressions need to be periodically updated as more feedback is collected from the users, and their design does not lend itself to detecting more complex language formations. This paper delivers a deep learning-based approach suitable for on-the-fly grammatical code highlighting of correct and incorrect language derivations, such as code viewers and snippets. It focuses on alleviating the burden on the developers, who can reuse the language's parsing strategy to produce the desired highlighting specification. Moreover, this approach is compared to current online syntax highlighting tools and formal methods in terms of accuracy and execution time, across different levels of grammatical coverage, for three mainstream programming languages. The results obtained show how the proposed approach can consistently achieve near-perfect accuracy in its predictions, thereby outperforming regular expression-based strategies. @InProceedings{ESEC/FSE22p269, author = {Marco Edoardo Palma and Pasquale Salza and Harald C. Gall}, title = {On-the-Fly Syntax Highlighting using Neural Networks}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {269--280}, doi = {10.1145/3540250.3549109}, year = {2022}, } Publisher's Version Info |
|
Gao, Cuiyun |
ESEC/FSE '22: "No More Fine-Tuning? An Experimental ..."
No More Fine-Tuning? An Experimental Evaluation of Prompt Tuning in Code Intelligence
Chaozheng Wang, Yuanhang Yang, Cuiyun Gao, Yun Peng, Hongyu Zhang, and Michael R. Lyu (Harbin Institute of Technology, China; Chinese University of Hong Kong, China; University of Newcastle, Australia) Pre-trained models have been shown effective in many code intelligence tasks. These models are pre-trained on large-scale unlabeled corpus and then fine-tuned in downstream tasks. However, as the inputs to pre-training and downstream tasks are in different forms, it is hard to fully explore the knowledge of pre-trained models. Besides, the performance of fine-tuning strongly relies on the amount of downstream data, while in practice, the scenarios with scarce data are common. Recent studies in the natural language processing (NLP) field show that prompt tuning, a new paradigm for tuning, alleviates the above issues and achieves promising results in various NLP tasks. In prompt tuning, the prompts inserted during tuning provide task-specific knowledge, which is especially beneficial for tasks with relatively scarce data. In this paper, we empirically evaluate the usage and effect of prompt tuning in code intelligence tasks. We conduct prompt tuning on popular pre-trained models CodeBERT and CodeT5 and experiment with three code intelligence tasks including defect prediction, code summarization, and code translation. Our experimental results show that prompt tuning consistently outperforms fine-tuning in all three tasks. In addition, prompt tuning shows great potential in low-resource scenarios, e.g., improving the BLEU scores of fine-tuning by more than 26% on average for code summarization. Our results suggest that instead of fine-tuning, we could adapt prompt tuning for code intelligence tasks to achieve better performance, especially when lacking task-specific data. @InProceedings{ESEC/FSE22p382, author = {Chaozheng Wang and Yuanhang Yang and Cuiyun Gao and Yun Peng and Hongyu Zhang and Michael R. Lyu}, title = {No More Fine-Tuning? An Experimental Evaluation of Prompt Tuning in Code Intelligence}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {382--394}, doi = {10.1145/3540250.3549113}, year = {2022}, } Publisher's Version |
|
Gao, Lin |
ESEC/FSE '22: "Generic Sensitivity: Customizing ..."
Generic Sensitivity: Customizing Context-Sensitive Pointer Analysis for Generics
Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Yongheng Huang, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TianqiSoft, China) Generic programming has been extensively used in object-oriented programs such as Java. However, existing context-sensitive pointer analyses perform poorly in analyzing generics. This paper introduces generic sensitivity, a new context customization scheme targeting generics. We design our context customization scheme in such a way that generic instantiation sites, i.e., locations instantiating generic classes/methods with concrete types, are always preserved as key context elements. This is realized by augmenting contexts with a type variable lookup map, which is efficiently updated during the analysis in a context-sensitive manner. We have implemented different variants of generic-sensitive analysis in Wala and experimental results show that the generic customization scheme can significantly improve performance and precision of context-sensitive pointer analyses. For instance, generic context customization significantly improves precision of 1-object-sensitive analysis, with an average speedup of 1.8×. In addition, generic context customization enables a 1-object-sensitive analysis to achieve overall better precision than a 2-object-sensitive analysis, with an average speedup of 12.6× (62× for chart). @InProceedings{ESEC/FSE22p1110, author = {Haofeng Li and Jie Lu and Haining Meng and Liqing Cao and Yongheng Huang and Lian Li and Lin Gao}, title = {Generic Sensitivity: Customizing Context-Sensitive Pointer Analysis for Generics}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1110--1121}, doi = {10.1145/3540250.3549122}, year = {2022}, } Publisher's Version |
|
Gao, Yuhao |
ESEC/FSE '22: "Demystifying the Underground ..."
Demystifying the Underground Ecosystem of Account Registration Bots
Yuhao Gao, Guoai Xu, Li Li, Xiapu Luo, Chenyu Wang, and Yulei Sui (University of Technology Sydney, Australia; Beijing University of Posts and Telecommunications, China; Harbin Institute of Technology, China; Monash University, Australia; Hong Kong Polytechnic University, China) Member services are a core part of most online systems. For example, member services in online social networks and video platforms make it possible to serve users customized content or track their footprint for a recommendation. However, there is a dark side to membership that lurks behind influencer marketing, coupon harvesting, and spreading fake news. All these activities rely heavily on owning masses of fake accounts, and to create new accounts efficiently, malicious registrants use automated registration bots with anti-human verification services that can easily bypass a website’s security strategies. In this paper, we take the first step toward understanding the underground ecosystem of account registration bots, and in particular, the anti-human verification services they use. From a comprehensive analysis, we determined the three most popular types of anti-human verification services. We then conducted experiments on these services from an attacker’s perspective to verify their effectiveness. The results show that all can easily bypass the security strategies website providers put in place to prevent fake registrations, such as SMS verification, CAPTCHA and IP monitoring. We further estimated the market size of the underground registration ecosystem, placing it at about US $4.8 million to $128.1 million per year. Our study demonstrates the urgency with which we need to think about the effectiveness of our registration security strategies and should prompt us to develop new strategies for better protection. 
@InProceedings{ESEC/FSE22p897, author = {Yuhao Gao and Guoai Xu and Li Li and Xiapu Luo and Chenyu Wang and Yulei Sui}, title = {Demystifying the Underground Ecosystem of Account Registration Bots}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {897--909}, doi = {10.1145/3540250.3549090}, year = {2022}, } Publisher's Version |
|
Garg, Spandan |
ESEC/FSE '22: "DeepDev-PERF: A Deep Learning-Based ..."
DeepDev-PERF: A Deep Learning-Based Approach for Improving Software Performance
Spandan Garg, Roshanak Zilouchian Moghaddam, Colin B. Clement, Neel Sundaresan, and Chen Wu (Microsoft, USA; Microsoft, China) Improving software performance is an important yet challenging part of the software development cycle. Today, the majority of performance inefficiencies are identified and patched by performance experts. Recent advancements in deep learning approaches and the wide-spread availability of open-source data creates a great opportunity to automate the identification and patching of performance problems. In this paper, we present DeepDev-PERF, a transformer-based approach to suggest performance improvements for C# applications. We pretrain DeepDev-PERF on English and Source code corpora, followed by finetuning for the task of generating performance improvement patches for C# applications. Our evaluation shows that our model can generate the same performance improvement suggestion as the developer fix in 53 @InProceedings{ESEC/FSE22p948, author = {Spandan Garg and Roshanak Zilouchian Moghaddam and Colin B. Clement and Neel Sundaresan and Chen Wu}, title = {DeepDev-PERF: A Deep Learning-Based Approach for Improving Software Performance}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {948--958}, doi = {10.1145/3540250.3549096}, year = {2022}, } Publisher's Version |
|
Ge, Jidong |
ESEC/FSE '22: "Lighting Up Supervised Learning ..."
Lighting Up Supervised Learning in User Review-Based Code Localization: Dataset and Benchmark
Xinwen Hu, Yu Guo, Jianjie Lu, Zheling Zhu, Chuanyi Li, Jidong Ge, Liguo Huang, and Bin Luo (Nanjing University, China; Southern Methodist University, USA) As User Reviews (URs) of mobile Apps are proven to provide valuable feedback for maintaining and evolving applications, how to make full use of URs more efficiently in the release cycle of mobile Apps has become a widely concerned and researched topic in the Software Engineering (SE) community. In order to speed up the completion of coding work related to URs to shorten the release cycle as much as possible, the task of User Review-based code localization is proposed and studied in depth. However, due to the lack of large-scale ground truth dataset (i.e., truly related <UR, Code> pairs), existing methods are all unsupervised learning-based. In order to light up supervised learning approaches, which are driven by large labeled datasets, for Review2Code, and to compare their performances with unsupervised learning-based methods, we first introduce a large-scale human-labeled <UR, Code> ground truth dataset, including the annotation process and statistical analysis. Then, a benchmark consisting of two SOTA unsupervised learning-based and four supervised learning-based Review2Code methods is constructed based on this dataset. We believe that this paper can provide a basis for in-depth exploration of the supervised learning-based Review2Code solutions. @InProceedings{ESEC/FSE22p533, author = {Xinwen Hu and Yu Guo and Jianjie Lu and Zheling Zhu and Chuanyi Li and Jidong Ge and Liguo Huang and Bin Luo}, title = {Lighting Up Supervised Learning in User Review-Based Code Localization: Dataset and Benchmark}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {533--545}, doi = {10.1145/3540250.3549141}, year = {2022}, } Publisher's Version |
|
Geng, Scott |
ESEC/FSE '22: "NeuDep: Neural Binary Memory ..."
NeuDep: Neural Binary Memory Dependence Analysis
Kexin Pei, Dongdong She, Michael Wang, Scott Geng, Zhou Xuan, Yaniv David, Junfeng Yang, Suman Jana, and Baishakhi Ray (Columbia University, USA; Massachusetts Institute of Technology, USA; Purdue University, USA) Determining whether multiple instructions can access the same memory location is a critical task in binary analysis. It is challenging as statically computing precise alias information is undecidable in theory. The problem aggravates at the binary level due to the presence of compiler optimizations and the absence of symbols and types. Existing approaches either produce significant spurious dependencies due to conservative analysis or scale poorly to complex binaries. We present a new machine-learning-based approach to predict memory dependencies by exploiting the model's learned knowledge about how binary programs execute. Our approach features (i) a self-supervised procedure that pretrains a neural net to reason over binary code and its dynamic value flows through memory addresses, followed by (ii) supervised finetuning to infer the memory dependencies statically. To facilitate efficient learning, we develop dedicated neural architectures to encode the heterogeneous inputs (i.e., code, data values, and memory addresses from traces) with specific modules and fuse them with a composition learning strategy. We implement our approach in NeuDep and evaluate it on 41 popular software projects compiled by 2 compilers, 4 optimizations, and 4 obfuscation passes. We demonstrate that NeuDep is more precise (1.5x) and faster (3.5x) than the current state-of-the-art. Extensive probing studies on security-critical reverse engineering tasks suggest that NeuDep understands memory access patterns, learns function signatures, and is able to match indirect calls. All these tasks either assist or benefit from inferring memory dependencies. Notably, NeuDep also outperforms the current state-of-the-art on these tasks. 
@InProceedings{ESEC/FSE22p747, author = {Kexin Pei and Dongdong She and Michael Wang and Scott Geng and Zhou Xuan and Yaniv David and Junfeng Yang and Suman Jana and Baishakhi Ray}, title = {NeuDep: Neural Binary Memory Dependence Analysis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {747--759}, doi = {10.1145/3540250.3549147}, year = {2022}, } Publisher's Version |
|
Ghorbani, Negar |
ESEC/FSE '22: "Program Merge Conflict Resolution ..."
Program Merge Conflict Resolution via Neural Transformers
Alexey Svyatkovskiy, Sarah Fakhoury, Negar Ghorbani, Todd Mytkowicz, Elizabeth Dinella, Christian Bird, Jinu Jang, Neel Sundaresan, and Shuvendu K. Lahiri (Microsoft, USA; Washington State University, USA; University of California at Irvine, USA; Microsoft Research, USA; University of Pennsylvania, USA) Collaborative software development is an integral part of the modern software development life cycle, essential to the success of large-scale software projects. When multiple developers make concurrent changes around the same lines of code, a merge conflict may occur. Such conflicts stall pull requests and continuous integration pipelines for hours to several days, seriously hurting developer productivity. To address this problem, we introduce MergeBERT, a novel neural program merge framework based on token-level three-way differencing and a transformer encoder model. By exploiting the restricted nature of merge conflict resolutions, we reformulate the task of generating the resolution sequence as a classification task over a set of primitive merge patterns extracted from real-world merge commit data. Our model achieves 63–68% accuracy for merge resolution synthesis, yielding nearly a 3× performance improvement over existing semi-structured, and 2× improvement over neural program merge tools. Finally, we demonstrate that MergeBERT is sufficiently flexible to work with source code files in Java, JavaScript, TypeScript, and C# programming languages. To measure the practical use of MergeBERT, we conduct a user study to evaluate MergeBERT suggestions with 25 developers from large OSS projects on 122 real-world conflicts they encountered. Results suggest that in practice, MergeBERT resolutions would be accepted at a higher rate than estimated by automatic metrics for precision and accuracy. Additionally, we use participant feedback to identify future avenues for improvement of MergeBERT. 
@InProceedings{ESEC/FSE22p822, author = {Alexey Svyatkovskiy and Sarah Fakhoury and Negar Ghorbani and Todd Mytkowicz and Elizabeth Dinella and Christian Bird and Jinu Jang and Neel Sundaresan and Shuvendu K. Lahiri}, title = {Program Merge Conflict Resolution via Neural Transformers}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {822--833}, doi = {10.1145/3540250.3549163}, year = {2022}, } Publisher's Version |
|
Giacobbe, Mirco |
ESEC/FSE '22: "Neural Termination Analysis ..."
Neural Termination Analysis
Mirco Giacobbe, Daniel Kroening, and Julian Parsert (University of Birmingham, UK; University of Oxford, UK) We introduce a novel approach to the automated termination analysis of computer programs: we use neural networks to represent ranking functions. Ranking functions map program states to values that are bounded from below and decrease as a program runs; the existence of a ranking function proves that the program terminates. We train a neural network from sampled execution traces of a program so that the network's output decreases along the traces; then, we use symbolic reasoning to formally verify that it generalises to all possible executions. Upon the affirmative answer we obtain a formal certificate of termination for the program, which we call a neural ranking function. We demonstrate that, thanks to the ability of neural networks to represent nonlinear functions, our method succeeds over programs that are beyond the reach of state-of-the-art tools. This includes programs that use disjunctions in their loop conditions and programs that include nonlinear expressions. @InProceedings{ESEC/FSE22p633, author = {Mirco Giacobbe and Daniel Kroening and Julian Parsert}, title = {Neural Termination Analysis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {633--645}, doi = {10.1145/3540250.3549120}, year = {2022}, } Publisher's Version |
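To make the notion of a ranking function in the abstract above concrete, here is a minimal Python sketch. It only checks a candidate function against sampled traces of a toy loop; the toy loop, the hand-written `candidate_rank`, and all helper names are assumptions made for illustration, and the sketch covers neither the neural network nor the symbolic verification step that the paper describes.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, int]


def sample_trace(x: int, y: int) -> List[State]:
    """Run a toy loop with a disjunctive condition and record every state visited."""
    trace = [(x, y)]
    while x > 0 or y > 0:
        if x > 0:
            x -= 1
        else:
            y -= 1
        trace.append((x, y))
    return trace


def decreases_along(
    trace: Sequence[State],
    rank: Callable[[State], float],
    lower_bound: float = 0.0,
) -> bool:
    """True if rank stays at or above lower_bound and strictly decreases at every step."""
    values = [rank(state) for state in trace]
    bounded = all(v >= lower_bound for v in values)
    decreasing = all(a > b for a, b in zip(values, values[1:]))
    return bounded and decreasing


def candidate_rank(state: State) -> float:
    """Hand-written candidate (hypothetical stand-in for a trained network)."""
    x, y = state
    return max(x, 0) + max(y, 0)


if __name__ == "__main__":
    trace = sample_trace(2, 1)
    print(decreases_along(trace, candidate_rank))       # candidate passes on this trace
    print(decreases_along(trace, lambda s: float(s[0])))  # x alone stalls once x reaches 0
```

Passing on sampled traces is only evidence, not a proof: a termination certificate requires showing the decrease for all reachable states, which is the part the abstract assigns to symbolic reasoning.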
|
Gligoric, Milos |
ESEC/FSE '22-DEMO: "Python-by-Contract Dataset ..."
Python-by-Contract Dataset
Jiyang Zhang, Marko Ristin, Phillip Schanely, Hans Wernher van de Venn, and Milos Gligoric (University of Texas at Austin, USA; Zurich University of Applied Sciences, Switzerland) Design-by-contract as a programming technique is becoming popular in the Python community, as various tools have been developed for automatically testing code based on its contracts. However, there is no sufficiently large and representative Python code base with contracts to evaluate these different testing tools. We present the Python-by-contract dataset, containing 514 Python functions annotated with contracts using the icontract library. We show that our Python-by-contract dataset can be easily used by existing testing tools that take advantage of contracts. The demo video can be found at https://youtu.be/08wZN-xh6mY. @InProceedings{ESEC/FSE22p1652, author = {Jiyang Zhang and Marko Ristin and Phillip Schanely and Hans Wernher van de Venn and Milos Gligoric}, title = {Python-by-Contract Dataset}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1652--1656}, doi = {10.1145/3540250.3558917}, year = {2022}, } Publisher's Version |
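For readers unfamiliar with design-by-contract, the following sketch shows what a precondition on a Python function can look like. It deliberately uses a small standard-library-only `require` decorator written for this illustration; it is not the icontract API, and `ContractViolation`, `require`, and `integer_sqrt` are hypothetical names that do not come from the dataset.

```python
import functools
import inspect
import math


class ContractViolation(Exception):
    """Raised when a precondition does not hold (hypothetical name)."""


def require(condition, description=""):
    """Attach a precondition to a function.

    The condition is called with those arguments of the decorated
    function whose names match the condition's own parameter names.
    """
    wanted = set(inspect.signature(condition).parameters)

    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            relevant = {k: v for k, v in bound.arguments.items() if k in wanted}
            if not condition(**relevant):
                raise ContractViolation(description or "precondition failed")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@require(lambda n: n >= 0, "n must be non-negative")
def integer_sqrt(n: int) -> int:
    """Return the floor of the square root of n."""
    return math.isqrt(n)


if __name__ == "__main__":
    print(integer_sqrt(10))
    try:
        integer_sqrt(-1)
    except ContractViolation as error:
        print("rejected:", error)
```

A contract-aware testing tool can use such a precondition to discard invalid inputs before calling the function, which is the kind of use the abstract has in mind for the annotated functions.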
|
Golubev, Yaroslav |
ESEC/FSE '22-IND: "All You Need Is Logs: Improving ..."
All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs
Vitaliy Bibaev, Alexey Kalina, Vadim Lomshakov, Yaroslav Golubev, Alexander Bezzubov, Nikita Povarov, and Timofey Bryksin (JetBrains, Serbia; JetBrains, Germany; JetBrains, Russia; JetBrains Research, Serbia; JetBrains, Netherlands; JetBrains Research, Cyprus) In this work, we propose an approach for collecting completion usage logs from the users in an IDE and using them to train a machine learning based model for ranking completion candidates. We developed a set of features that describe completion candidates and their context, and deployed their anonymized collection in the Early Access Program of IntelliJ-based IDEs. We used the logs to collect a dataset of code completions from users, and employed it to train a ranking CatBoost model. Then, we evaluated it in two settings: on a held-out set of the collected completions and in a separate A/B test on two different groups of users in the IDE. Our evaluation shows that using a simple ranking model trained on the past user behavior logs significantly improved code completion experience. Compared to the default heuristics-based ranking, our model demonstrated a decrease in the number of typing actions necessary to perform the completion in the IDE from 2.073 to 1.832. The approach adheres to privacy requirements and legal constraints, since it does not require collecting personal information, performing all the necessary anonymization on the client's side. Importantly, it can be improved continuously: implementing new features, collecting new data, and evaluating new models - this way, we have been using it in production since the end of 2020. 
@InProceedings{ESEC/FSE22p1269, author = {Vitaliy Bibaev and Alexey Kalina and Vadim Lomshakov and Yaroslav Golubev and Alexander Bezzubov and Nikita Povarov and Timofey Bryksin}, title = {All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1269--1279}, doi = {10.1145/3540250.3558968}, year = {2022}, } Publisher's Version |
|
Gong, Siyi |
ESEC/FSE '22-IVR: "A Study on Identifying Code ..."
A Study on Identifying Code Author from Real Development
Siyi Gong and Hao Zhong (Shanghai Jiao Tong University, China) Identifying code authors is important in many research topics, and various approaches have been proposed. Although these approaches achieve promising results on their datasets, their true effectiveness is still in question. To the best of our knowledge, only one large-scale study was conducted to explore the impacts of related factors (e.g., the temporal effect and the distribution of files per author). This study selected Google Code Jam programs as their subjects, but such programs are quite different from the source files that programmers write in daily development. To understand their effectiveness and challenges, we replicate their study and use their approach to analyze source files that are retrieved from real projects. The prior study claims that the temporal effect and the distribution of files per author have only minor impacts on their trained models. In contrast, we find that in 85.48% of the pairs of training and testing sets, a trained model is less accurate when the temporal effect is considered, and in total, the average accuracy decreases by 0.4298. In addition, when we use the real distribution of files as inputs, their approach can accurately identify only one or two core code authors, although a project can have more than ten authors. By revealing the limitations of the prior approach, our study sheds light on where to make future improvements. @InProceedings{ESEC/FSE22p1627, author = {Siyi Gong and Hao Zhong}, title = {A Study on Identifying Code Author from Real Development}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1627--1631}, doi = {10.1145/3540250.3560878}, year = {2022}, } Publisher's Version |
|
Gopinath, Rahul |
ESEC/FSE '22-DEMO: "CLIFuzzer: Mining Grammars ..."
CLIFuzzer: Mining Grammars for Command-Line Invocations
Abhilash Gupta, Rahul Gopinath, and Andreas Zeller (CISPA Helmholtz Center for Information Security, Germany) The behavior of command-line utilities can be strongly influenced by passing command-line options and arguments, i.e., configuration settings that enable, disable, or otherwise influence parts of the code to be executed. Hence, systematic testing of command-line utilities requires testing them with diverse configurations of supported command-line options. We introduce CLIFuzzer, a tool that takes an executable program and, using dynamic analysis to track input processing, automatically extracts a full set of its options, arguments, and argument types. This set forms a grammar that represents the valid sequences of valid options and arguments. Producing invocations from this grammar, we can fuzz the program with an endless list of random configurations, covering the related code. This leads to increased coverage and new bugs over purely mutation-based fuzzers. @InProceedings{ESEC/FSE22p1667, author = {Abhilash Gupta and Rahul Gopinath and Andreas Zeller}, title = {CLIFuzzer: Mining Grammars for Command-Line Invocations}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1667--1671}, doi = {10.1145/3540250.3558918}, year = {2022}, } Publisher's Version |
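The idea of producing invocations from a grammar of options and arguments can be sketched in a few lines of Python. The grammar below is hand-written and entirely hypothetical (the options `-v`, `-n`, `--mode` and the file name are made up); CLIFuzzer, according to the abstract, mines such a grammar from the program via dynamic analysis, which this sketch does not attempt.

```python
import random

# Hypothetical grammar for an imaginary utility; keys are nonterminals,
# values are lists of alternative expansions.
GRAMMAR = {
    "<invocation>": [["<options>", "<file>"]],
    "<options>": [[], ["<option>", "<options>"]],
    "<option>": [["-v"], ["-n", "<int>"], ["--mode", "<mode>"]],
    "<int>": [["0"], ["1"], ["42"]],
    "<mode>": [["fast"], ["safe"]],
    "<file>": [["input.txt"]],
}


def expand(symbol, rng, depth=0, max_depth=8):
    """Expand a grammar symbol into a flat list of command-line tokens."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal token
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Force termination by taking the shortest alternative.
        alternatives = [min(alternatives, key=len)]
    tokens = []
    for part in rng.choice(alternatives):
        tokens.extend(expand(part, rng, depth + 1, max_depth))
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for _ in range(5):
        print(" ".join(expand("<invocation>", rng)))
```

Each generated token list could be passed to `subprocess.run` together with the path of the program under test; that step is left out here because it depends on the target program.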
|
Gopstein, Dan |
ESEC/FSE '22-IND: "Discovering Feature Flag Interdependencies ..."
Discovering Feature Flag Interdependencies in Microsoft Office
Michael Schröder, Katja Kevic, Dan Gopstein, Brendan Murphy, and Jennifer Beckmann (TU Wien, Austria; Microsoft, UK; Microsoft, USA) Feature flags are a popular method to control functionality in released code. They enable rapid development and deployment, but can also quickly accumulate technical debt. Complex interactions between feature flags can go unnoticed, especially if interdependent flags are located far apart in the code, and these unknown dependencies could become a source of serious bugs. Testing all possible combinations of feature flags is infeasible in large systems like Microsoft Office, which has about 12000 active flags. The goal of our research is to aid product teams in improving system reliability by providing an approach to automatically discover feature flag interdependencies. We use probabilistic reasoning to infer causal relationships from feature flag query logs. Our approach is language-agnostic, scales easily to large heterogeneous codebases, and is robust against noise such as code drift or imperfect log data. We evaluated our approach on real-world query logs from Microsoft Office and are able to achieve over 90% precision while recalling non-trivial indirect feature flag relationships across different source files. We also investigated re-occurring patterns of relationships and describe applications for targeted testing, determining deployment velocity, error mitigation, and diagnostics. @InProceedings{ESEC/FSE22p1419, author = {Michael Schröder and Katja Kevic and Dan Gopstein and Brendan Murphy and Jennifer Beckmann}, title = {Discovering Feature Flag Interdependencies in Microsoft Office}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1419--1429}, doi = {10.1145/3540250.3558942}, year = {2022}, } Publisher's Version |
|
Green, Collin |
ESEC/FSE '22-IND: "What Improves Developer Productivity ..."
What Improves Developer Productivity at Google? Code Quality
Lan Cheng, Emerson Murphy-Hill, Mark Canning, Ciera Jaspan, Collin Green, Andrea Knight, Nan Zhang, and Elizabeth Kammer (Google, USA) Understanding what affects software developer productivity can help organizations choose wise investments in their technical and social environment. But the research literature either focuses on what correlates with developer productivity in ecologically valid settings or focuses on what causes developer productivity in highly constrained settings. In this paper, we bridge the gap by studying software developers at Google through two analyses. In the first analysis, we use panel data with 39 productivity factors, finding that code quality, technical debt, infrastructure tools and support, team communication, goals and priorities, and organizational change and process are all causally linked to self-reported developer productivity. In the second analysis, we use a lagged panel analysis to strengthen our causal claims. We find that increases in perceived code quality tend to be followed by increased perceived developer productivity, but not vice versa, providing the strongest evidence to date that code quality affects individual developer productivity. @InProceedings{ESEC/FSE22p1302, author = {Lan Cheng and Emerson Murphy-Hill and Mark Canning and Ciera Jaspan and Collin Green and Andrea Knight and Nan Zhang and Elizabeth Kammer}, title = {What Improves Developer Productivity at Google? Code Quality}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1302--1313}, doi = {10.1145/3540250.3558940}, year = {2022}, } Publisher's Version Archive submitted (330 kB) |
|
Green, Jared |
ESEC/FSE '22: "Automating Code Review Activities ..."
Automating Code Review Activities by Large-Scale Pre-training
Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundaresan (Peking University, China; Microsoft Research, China; Sun Yat-sen University, China; LinkedIn, USA; Microsoft, USA) Code review is an essential part of the software development lifecycle since it aims at guaranteeing the quality of code. Modern code review activities require developers to view, understand, and even run the programs to assess logic, functionality, latency, style, and other factors. As a result, developers have to spend far too much time reviewing the code of their peers. Accordingly, there is significant demand for automating the code review process. In this research, we focus on utilizing pre-training techniques for the tasks in the code review scenario. We collect a large-scale dataset of real-world code changes and code reviews from open-source projects in nine of the most popular programming languages. To better understand code diffs and reviews, we propose CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically for the code review scenario. To evaluate our model, we focus on three key tasks related to code review activities: code change quality estimation, review comment generation, and code refinement. Furthermore, we establish a high-quality benchmark dataset based on our collected data for these three tasks and conduct comprehensive experiments on it. The experimental results demonstrate that our model outperforms the previous state-of-the-art pre-training approaches in all tasks. Further analysis shows that our proposed pre-training tasks and the multilingual pre-training dataset benefit the model's understanding of code changes and reviews. 
@InProceedings{ESEC/FSE22p1035, author = {Zhiyu Li and Shuai Lu and Daya Guo and Nan Duan and Shailesh Jannu and Grant Jenks and Deep Majumder and Jared Green and Alexey Svyatkovskiy and Shengyu Fu and Neel Sundaresan}, title = {Automating Code Review Activities by Large-Scale Pre-training}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1035--1047}, doi = {10.1145/3540250.3549081}, year = {2022}, } Publisher's Version |
|
Gu, Taotao |
ESEC/FSE '22-IND: "Group-Based Corpus Scheduling ..."
Group-Based Corpus Scheduling for Parallel Fuzzing
Taotao Gu, Xiang Li, Shuaibing Lu, Jianwen Tian, Yuanping Nie, Xiaohui Kuang, Zhechao Lin, Chenyifan Liu, Jie Liang, and Yu Jiang (Academy of Military Sciences, China; National University of Defense Technology, China; Tsinghua University, China) Parallel fuzzing relies on hardware resources to guarantee test throughput and efficiency. In industrial practice, it is well known that parallel fuzzing faces the challenge of task division, but most works neglect the important process of corpus allocation. In this paper, we propose a group-based corpus scheduling strategy to address these two issues; the strategy has been accepted by the LLVM community. We implement a parallel fuzzer based on this strategy, called glibFuzzer. glibFuzzer first groups the global corpus into different subsets and then assigns energy scores and difference scores to them. The energy scores are mainly determined by the seed size and the length of coverage information, and the difference score describes the degree of difference in the code covered by different subsets of seeds. In each round of key local corpus construction, the master node selects high-quality seeds by combining the two scores to improve test efficiency and avoid task conflict. To prove the effectiveness of the strategy, we conducted an extensive evaluation on real-world programs and FuzzBench. After 4×24 CPU-hours, glibFuzzer covered 22.02% more branches and executed 19.42 times more test cases than libFuzzer in 18 real-world programs. glibFuzzer showed an average branch coverage increase of 73.02%, 55.02%, and 55.86% over AFL, PAFL, and UniFuzz, respectively. More importantly, glibFuzzer found over 100 unique vulnerabilities. 
@InProceedings{ESEC/FSE22p1521, author = {Taotao Gu and Xiang Li and Shuaibing Lu and Jianwen Tian and Yuanping Nie and Xiaohui Kuang and Zhechao Lin and Chenyifan Liu and Jie Liang and Yu Jiang}, title = {Group-Based Corpus Scheduling for Parallel Fuzzing}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1521--1532}, doi = {10.1145/3540250.3560885}, year = {2022}, } Publisher's Version |
|
Gu, Xiaodong |
ESEC/FSE '22: "Diet Code Is Healthy: Simplifying ..."
Diet Code Is Healthy: Simplifying Programs for Pre-trained Models of Code
Zhaowei Zhang, Hongyu Zhang, Beijun Shen, and Xiaodong Gu (Shanghai Jiao Tong University, China; University of Newcastle, Australia) Pre-trained code representation models such as CodeBERT have demonstrated superior performance in a variety of software engineering tasks, yet they are often computationally heavy, with a complexity that grows quadratically with the length of the input sequence. Our empirical analysis of CodeBERT's attention reveals that CodeBERT pays more attention to certain types of tokens and statements such as keywords and data-relevant statements. Based on these findings, we propose DietCode, which aims at lightweight leverage of large pre-trained models for source code. DietCode simplifies the input program of CodeBERT with three strategies, namely, word dropout, frequency filtering, and an attention-based strategy that selects statements and tokens that receive the most attention weights during pre-training. Hence, it gives a substantial reduction in the computational cost without hampering the model performance. Experimental results on two downstream tasks show that DietCode provides comparable results to CodeBERT with 40% less computational cost in fine-tuning and testing. @InProceedings{ESEC/FSE22p1073, author = {Zhaowei Zhang and Hongyu Zhang and Beijun Shen and Xiaodong Gu}, title = {Diet Code Is Healthy: Simplifying Programs for Pre-trained Models of Code}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1073--1084}, doi = {10.1145/3540250.3549094}, year = {2022}, } Publisher's Version |
|
Gulwani, Sumit |
ESEC/FSE '22-INV: "AI-Assisted Programming: Applications, ..."
AI-Assisted Programming: Applications, User Experiences, and Neuro-Symbolic Techniques (Keynote)
Sumit Gulwani (Microsoft, USA) AI can enhance programming experiences for a diverse set of programmers: from professional developers and data scientists (proficient programmers) who need help in software engineering and data wrangling, all the way to spreadsheet users (low-code programmers) who need help in authoring formulas, and students (novice programmers) who seek hints when stuck with their programming homework. To communicate their need to AI, users can express their intent explicitly—as input-output examples or natural-language specification—or implicitly—where they encounter a bug (and expect AI to suggest a fix), or simply allow AI to observe their last few lines of code or edits (to have it suggest the next steps). The task of synthesizing an intended program snippet from the user’s intent is both a search and a ranking problem. Search is required to discover candidate programs that correspond to the (often ambiguous) intent, and ranking is required to pick the best program from multiple plausible alternatives. This creates a fertile playground for combining symbolic-reasoning techniques, which model the semantics of programming operators, and machine-learning techniques, which can model human preferences in programming. Recent advances in large language models like Codex offer further promise to advance such neuro-symbolic techniques. Finally, a few critical requirements in AI-assisted programming are usability, precision, and trust; and they create opportunities for innovative user experiences and interactivity paradigms. In this talk, I will explain these concepts using some existing successes, including the Flash Fill feature in Excel, Data Connectors in PowerQuery, and IntelliCode/CoPilot in Visual Studio. I will also describe several new opportunities in AI-assisted programming, which can drive the next set of foundational neuro-symbolic advances. 
@InProceedings{ESEC/FSE22p1, author = {Sumit Gulwani}, title = {AI-Assisted Programming: Applications, User Experiences, and Neuro-Symbolic Techniques (Keynote)}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1--1}, doi = {10.1145/3540250.3569444}, year = {2022}, } Publisher's Version
ESEC/FSE '22: "NL2Viz: Natural Language to ..."
NL2Viz: Natural Language to Visualization via Constrained Syntax-Guided Synthesis
Zhengkai Wu, Vu Le, Ashish Tiwari, Sumit Gulwani, Arjun Radhakrishna, Ivan Radiček, Gustavo Soares, Xinyu Wang, Zhenwen Li, and Tao Xie (University of Illinois at Urbana-Champaign, USA; Microsoft, USA; University of Michigan, USA; Peking University, China) Recent development in NL2CODE (Natural Language to Code) research allows end-users, especially novice programmers to create a concrete implementation of their ideas such as data visualization by providing natural language (NL) instructions. An NL2CODE system often fails to achieve its goal due to three major challenges: the user's words have contextual semantics, the user may not include all details needed for code generation, and the system results are imperfect and require further refinement. To address the aforementioned three challenges for NL to Visualization, we propose a new approach and its supporting tool named NL2VIZ with three salient features: (1) leveraging not only the user's NL input but also the data and program context that the NL query is upon, (2) using hard/soft constraints to reflect different confidence levels in the constraints retrieved from the user input and data/program context, and (3) providing support for result refinement and reuse. We implement NL2VIZ in the Jupyter Notebook environment and evaluate NL2VIZ on a real-world visualization benchmark and a public dataset to show the effectiveness of NL2VIZ. We also conduct a user study involving 6 data scientist professionals to demonstrate the usability of NL2VIZ, the readability of the generated code, and NL2VIZ's effectiveness in helping users generate desired visualizations effectively and efficiently.
@InProceedings{ESEC/FSE22p972, author = {Zhengkai Wu and Vu Le and Ashish Tiwari and Sumit Gulwani and Arjun Radhakrishna and Ivan Radiček and Gustavo Soares and Xinyu Wang and Zhenwen Li and Tao Xie}, title = {NL2Viz: Natural Language to Visualization via Constrained Syntax-Guided Synthesis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {972--983}, doi = {10.1145/3540250.3549140}, year = {2022}, } Publisher's Version |
|
Guo, Chengjun |
ESEC/FSE '22-IVR: "Discrepancies among Pre-trained ..."
Discrepancies among Pre-trained Deep Neural Networks: A New Threat to Model Zoo Reliability
Diego Montes, Pongpatapee Peerapatanapokin, Jeff Schultz, Chengjun Guo, Wenxin Jiang, and James C. Davis (Purdue University, USA) Training deep neural networks (DNNs) takes significant time and resources. A practice for expedited deployment is to use pre-trained deep neural networks (PTNNs), often from model zoos--collections of PTNNs; yet, the reliability of model zoos remains unexamined. In the absence of an industry standard for the implementation and performance of PTNNs, engineers cannot confidently incorporate them into production systems. As a first step, discovering potential discrepancies between PTNNs across model zoos would reveal a threat to model zoo reliability. Prior works indicated existing variances in deep learning systems in terms of accuracy. However, broader measures of reliability for PTNNs from model zoos are unexplored. This work measures notable discrepancies between accuracy, latency, and architecture of 36 PTNNs across four model zoos. Among the top 10 discrepancies, we find differences of 1.23%-2.62% in accuracy and 9%-131% in latency. We also find mismatches in architecture for well-known DNN architectures (e.g., ResNet and AlexNet). Our findings call for future works on empirical validation, automated tools for measurement, and best practices for implementation. @InProceedings{ESEC/FSE22p1605, author = {Diego Montes and Pongpatapee Peerapatanapokin and Jeff Schultz and Chengjun Guo and Wenxin Jiang and James C. Davis}, title = {Discrepancies among Pre-trained Deep Neural Networks: A New Threat to Model Zoo Reliability}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1605--1609}, doi = {10.1145/3540250.3560881}, year = {2022}, } Publisher's Version |
|
Guo, Daya |
ESEC/FSE '22: "Automating Code Review Activities ..."
Automating Code Review Activities by Large-Scale Pre-training
Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundaresan (Peking University, China; Microsoft Research, China; Sun Yat-sen University, China; LinkedIn, USA; Microsoft, USA) Code review is an essential part of the software development lifecycle since it aims at guaranteeing the quality of code. Modern code review activities require developers to view, understand, and even run the programs to assess logic, functionality, latency, style, and other factors. As a result, developers have to spend far too much time reviewing the code of their peers. Accordingly, there is significant demand for automating the code review process. In this research, we focus on utilizing pre-training techniques for the tasks in the code review scenario. We collect a large-scale dataset of real-world code changes and code reviews from open-source projects in nine of the most popular programming languages. To better understand code diffs and reviews, we propose CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically for the code review scenario. To evaluate our model, we focus on three key tasks related to code review activities: code change quality estimation, review comment generation, and code refinement. Furthermore, we establish a high-quality benchmark dataset based on our collected data for these three tasks and conduct comprehensive experiments on it. The experimental results demonstrate that our model outperforms the previous state-of-the-art pre-training approaches in all tasks. Further analysis shows that our proposed pre-training tasks and the multilingual pre-training dataset benefit the model's understanding of code changes and reviews. 
@InProceedings{ESEC/FSE22p1035, author = {Zhiyu Li and Shuai Lu and Daya Guo and Nan Duan and Shailesh Jannu and Grant Jenks and Deep Majumder and Jared Green and Alexey Svyatkovskiy and Shengyu Fu and Neel Sundaresan}, title = {Automating Code Review Activities by Large-Scale Pre-training}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1035--1047}, doi = {10.1145/3540250.3549081}, year = {2022}, } Publisher's Version |
|
Guo, Lihua |
ESEC/FSE '22: "Minerva: Browser API Fuzzing ..."
Minerva: Browser API Fuzzing with Dynamic Mod-Ref Analysis
Chijin Zhou, Quan Zhang, Mingzhe Wang, Lihua Guo, Jie Liang, Zhe Liu, Mathias Payer, and Yu Jiang (Tsinghua University, China; Nanjing University of Aeronautics and Astronautics, China; EPFL, Switzerland) Browser APIs are essential to the modern web experience. Due to their large number and complexity, they vastly expand the attack surface of browsers. To detect vulnerabilities in these APIs, fuzzers generate test cases with a large number of random API invocations. However, the massive search space formed by arbitrary API combinations hinders their effectiveness: since randomly picked API invocations are unlikely to interfere with each other (i.e., compute on partially shared data), few interesting API interactions are explored. Consequently, reducing the search space by revealing inter-API relations is a major challenge in browser fuzzing. We propose Minerva, an efficient browser fuzzer for browser API bug detection. The key idea is to leverage API interference relations to reduce redundancy and improve coverage. Minerva consists of two modules: dynamic mod-ref analysis and guided code generation. Before fuzzing starts, the dynamic mod-ref analysis module builds an API interference graph. It first automatically identifies individual browser APIs from the browser’s code base. Next, it instruments the browser to dynamically collect mod-ref relations between APIs. During fuzzing, the guided code generation module synthesizes highly relevant API invocations guided by the mod-ref relations. We evaluate Minerva on three mainstream browsers, i.e., Safari, Firefox, and Chromium. Compared to state-of-the-art fuzzers, Minerva improves edge coverage by 19.63% to 229.62% and finds 2x to 3x more unique bugs. In addition, Minerva has discovered 35 previously unknown bugs, of which 20 have been fixed, with 5 CVEs assigned and acknowledged by browser vendors. 
@InProceedings{ESEC/FSE22p1135, author = {Chijin Zhou and Quan Zhang and Mingzhe Wang and Lihua Guo and Jie Liang and Zhe Liu and Mathias Payer and Yu Jiang}, title = {Minerva: Browser API Fuzzing with Dynamic Mod-Ref Analysis}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1135--1147}, doi = {10.1145/3540250.3549107}, year = {2022}, } Publisher's Version |
|
Guo, Qizhi |
ESEC/FSE '22-IND: "Incorporating Domain Knowledge ..."
Incorporating Domain Knowledge through Task Augmentation for Front-End JavaScript Code Generation
Sijie Shen, Xiang Zhu, Yihong Dong, Qizhi Guo, Yankun Zhen, and Ge Li (Peking University, China; Alibaba Group, China) Code generation aims to generate a code snippet automatically from natural language descriptions. Generally, the mainstream code generation methods rely on a large amount of paired training data, including both the natural language description and the code. However, in some domain-specific scenarios, building such a large paired corpus for code generation is difficult because there is no directly available pairing data, and a lot of effort is required to manually write the code descriptions to construct a high-quality training dataset. Due to the limited training data, the generation model cannot be well trained and is likely to overfit, making the model's performance unsatisfactory for real-world use. To this end, in this paper, we propose a task augmentation method that incorporates domain knowledge into code generation models through auxiliary tasks, and a Subtoken-TranX model that extends the original TranX model to support subtoken-level code generation. To verify our proposed approach, we collect a real-world code generation dataset and conduct experiments on it. Our experimental results demonstrate that the subtoken-level TranX model outperforms the original TranX model and the Transformer model on our dataset, and the exact match accuracy of Subtoken-TranX improves significantly by 12.75% with the help of our task augmentation method. The model performance on several code categories has satisfied the requirements for application in industrial systems. Our proposed approach has been adopted by Alibaba's BizCook platform. To the best of our knowledge, this is the first domain code generation system adopted in industrial development environments. 
@InProceedings{ESEC/FSE22p1533, author = {Sijie Shen and Xiang Zhu and Yihong Dong and Qizhi Guo and Yankun Zhen and Ge Li}, title = {Incorporating Domain Knowledge through Task Augmentation for Front-End JavaScript Code Generation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1533--1543}, doi = {10.1145/3540250.3558965}, year = {2022}, } Publisher's Version |
|
Guo, Shikai |
ESEC/FSE '22: "Detecting Simulink Compiler ..."
Detecting Simulink Compiler Bugs via Controllable Zombie Blocks Mutation
Shikai Guo, He Jiang, Zhihao Xu, Xiaochen Li, Zhilei Ren, Zhide Zhou, and Rong Chen (Dalian Maritime University, China; Dalian University of Technology, China) As a popular Cyber-Physical System (CPS) development tool chain, MathWorks Simulink is widely used to prototype CPS models in safety-critical applications, e.g., aerospace and healthcare. It is crucial to ensure the correctness and reliability of the Simulink compiler (i.e., the compiler module of Simulink) in practice since all CPS models depend on compilation. However, Simulink compiler testing is challenging due to millions of lines of source code and the lack of a complete formal language specification. Although several methods have been proposed to automatically test the Simulink compiler, there still remain two challenges to be tackled, namely the limited variant space and the insufficient mutation diversity. To address these challenges, we propose COMBAT, a new differential testing method for Simulink compiler testing. COMBAT includes an EMI (Equivalence Modulo Input) mutation component and a diverse variant generation component. The EMI mutation component inserts assertion statements (e.g., If/While blocks) at arbitrary points of the seed CPS model. These statements break each insertion point into true and false branches. Then, COMBAT feeds all the data passed through the insertion point into the true branch to preserve the equivalence of CPS variants. In this way, the body of the false branch can be viewed as a new variant space, thus addressing the first challenge. The diverse variant generation component uses Markov chain Monte Carlo optimization to sample the seed CPS model and generate complex mutations of long sequences of blocks in the variant space, thus addressing the second challenge. Experiments demonstrate that COMBAT significantly outperforms the state-of-the-art approaches in Simulink compiler testing. 
Within five months, COMBAT has reported 16 valid bugs for Simulink R2021b, of which 11 bugs have been confirmed as new bugs by MathWorks Support. @InProceedings{ESEC/FSE22p1061, author = {Shikai Guo and He Jiang and Zhihao Xu and Xiaochen Li and Zhilei Ren and Zhide Zhou and Rong Chen}, title = {Detecting Simulink Compiler Bugs via Controllable Zombie Blocks Mutation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1061--1072}, doi = {10.1145/3540250.3549159}, year = {2022}, } Publisher's Version |
|
Guo, Xiaofeng |
ESEC/FSE '22-IND: "Trace Analysis Based Microservice ..."
Trace Analysis Based Microservice Architecture Measurement
Xin Peng, Chenxi Zhang, Zhongyuan Zhao, Akasaka Isami, Xiaofeng Guo, and Yunna Cui (Fudan University, China; Alibaba Group, China) Microservice architecture design highly relies on expert experience and may often result in improper service decomposition. Moreover, a microservice architecture is likely to degrade with the continuous evolution of services. Architecture measurement is thus important for the long-term evolution of microservice architectures. Due to the independent and dynamic nature of services, source code analysis based approaches cannot well capture the interactions between services. In this paper, we propose a trace analysis based microservice architecture measurement approach. We define a trace data model for microservice architecture measurement, which enables fine-grained analysis of the execution processes of requests and the interactions between interfaces and services. Based on the data model, we define 14 architectural metrics to measure the service independence and invocation chain complexity of a microservice system. We implement the approach and conduct three case studies with a student course project, an open-source microservice benchmark system, and three industrial microservice systems. The results show that our approach can well characterize the independence and invocation chain complexity of microservice architectures and help developers to identify microservice architecture issues caused by improper service decomposition and architecture degradation. @InProceedings{ESEC/FSE22p1589, author = {Xin Peng and Chenxi Zhang and Zhongyuan Zhao and Akasaka Isami and Xiaofeng Guo and Yunna Cui}, title = {Trace Analysis Based Microservice Architecture Measurement}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1589--1599}, doi = {10.1145/3540250.3558951}, year = {2022}, } Publisher's Version |
|
Guo, Yao |
ESEC/FSE '22-IND: "What Did You Pack in My App? ..."
What Did You Pack in My App? A Systematic Analysis of Commercial Android Packers
Zikan Dong, Hongxuan Liu, Liu Wang, Xiapu Luo, Yao Guo, Guoai Xu, Xusheng Xiao, and Haoyu Wang (Beijing University of Posts and Telecommunications, China; Peking University, China; Hong Kong Polytechnic University, China; Case Western Reserve University, USA; Huazhong University of Science and Technology, China) Commercial Android packers have been widely used by developers as a way to protect their apps from being tampered with. However, app packers are usually provided as online services developed by security vendors, and the packed apps are well protected. It is thus hard to know what exactly is packed in the app, and few existing studies in the community have systematically analyzed the behaviors of commercial app packers. In this paper, we propose PackDiff, a dynamic analysis system to inspect the fine-grained behaviors of commercial packers. By instrumenting the Android system, PackDiff records the runtime behaviors of Android apps (e.g., Linux system call invocations, Java API calls, Binder interactions, etc.), which are further processed to pinpoint the additional sensitive behaviors introduced by packers. By applying PackDiff to roughly 200 apps protected by seven commercial packers, we observe disappointing facts about existing commercial packers. Most app packers have introduced unnecessary behaviors (e.g., accessing sensitive data) and serious performance and compatibility issues, and they can even be abused to create evasive malware and repackaged apps, which contradicts their design purposes. @InProceedings{ESEC/FSE22p1430, author = {Zikan Dong and Hongxuan Liu and Liu Wang and Xiapu Luo and Yao Guo and Guoai Xu and Xusheng Xiao and Haoyu Wang}, title = {What Did You Pack in My App? A Systematic Analysis of Commercial Android Packers}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1430--1440}, doi = {10.1145/3540250.3558969}, year = {2022}, } Publisher's Version |
|
Guo, Yu |
ESEC/FSE '22: "Lighting Up Supervised Learning ..."
Lighting Up Supervised Learning in User Review-Based Code Localization: Dataset and Benchmark
Xinwen Hu, Yu Guo, Jianjie Lu, Zheling Zhu, Chuanyi Li, Jidong Ge, Liguo Huang, and Bin Luo (Nanjing University, China; Southern Methodist University, USA) As User Reviews (URs) of mobile Apps are proven to provide valuable feedback for maintaining and evolving applications, how to make fuller and more efficient use of URs in the release cycle of mobile Apps has become a widely studied topic in the Software Engineering (SE) community. In order to speed up the completion of coding work related to URs and shorten the release cycle as much as possible, the task of User Review-based code localization (Review2Code) has been proposed and studied in depth. However, due to the lack of a large-scale ground truth dataset (i.e., truly related <UR, Code> pairs), existing methods are all unsupervised learning-based. In order to light up supervised learning approaches, which are driven by large labeled datasets, for Review2Code, and to compare their performances with unsupervised learning-based methods, we first introduce a large-scale human-labeled <UR, Code> ground truth dataset, including the annotation process and statistical analysis. Then, a benchmark consisting of two SOTA unsupervised learning-based and four supervised learning-based Review2Code methods is constructed based on this dataset. We believe that this paper can provide a basis for in-depth exploration of the supervised learning-based Review2Code solutions. @InProceedings{ESEC/FSE22p533, author = {Xinwen Hu and Yu Guo and Jianjie Lu and Zheling Zhu and Chuanyi Li and Jidong Ge and Liguo Huang and Bin Luo}, title = {Lighting Up Supervised Learning in User Review-Based Code Localization: Dataset and Benchmark}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {533--545}, doi = {10.1145/3540250.3549141}, year = {2022}, } Publisher's Version |
|
Gupta, Abhilash |
ESEC/FSE '22-DEMO: "CLIFuzzer: Mining Grammars ..."
CLIFuzzer: Mining Grammars for Command-Line Invocations
Abhilash Gupta, Rahul Gopinath, and Andreas Zeller (CISPA Helmholtz Center for Information Security, Germany) The behavior of command-line utilities can be strongly influenced by passing command-line options and arguments: configuration settings that enable, disable, or otherwise influence parts of the code to be executed. Hence, systematic testing of command-line utilities requires testing them with diverse configurations of supported command-line options. We introduce CLIFuzzer, a tool that takes an executable program and, using dynamic analysis to track input processing, automatically extracts a full set of its options, arguments, and argument types. This set forms a grammar that represents the valid sequences of valid options and arguments. Producing invocations from this grammar, we can fuzz the program with an endless list of random configurations, covering the related code. This leads to increased coverage and new bugs over purely mutation-based fuzzers. @InProceedings{ESEC/FSE22p1667, author = {Abhilash Gupta and Rahul Gopinath and Andreas Zeller}, title = {CLIFuzzer: Mining Grammars for Command-Line Invocations}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1667--1671}, doi = {10.1145/3540250.3558918}, year = {2022}, } Publisher's Version |
|
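The idea of producing invocations from an option grammar can be illustrated with a minimal sketch. This is not CLIFuzzer's implementation: the utility name `mytool`, the hand-written `OPTION_GRAMMAR`, and the argument types below are all hypothetical, whereas the tool described above mines such information automatically through dynamic analysis.

```python
import random

# Hypothetical option grammar for an imaginary utility `mytool`.
# Each option maps to the (assumed) type of its argument, or None for a flag.
OPTION_GRAMMAR = {
    "-v": None,        # flag without an argument
    "-n": "int",       # option taking an integer
    "--name": "str",   # option taking a string
    "-o": "path",      # option taking a file path
}

def random_value(arg_type, rng):
    """Produce a random argument of the given (assumed) type."""
    if arg_type == "int":
        return str(rng.randint(-1000, 1000))
    if arg_type == "str":
        return "".join(rng.choice("abcxyz") for _ in range(rng.randint(1, 8)))
    if arg_type == "path":
        return "/tmp/" + "".join(rng.choice("abc") for _ in range(4))
    raise ValueError(f"unknown argument type: {arg_type}")

def random_invocation(program, grammar, rng):
    """Pick a random subset of options and fill in typed arguments."""
    argv = [program]
    options = list(grammar)
    for opt in rng.sample(options, rng.randint(0, len(options))):
        argv.append(opt)
        arg_type = grammar[opt]
        if arg_type is not None:
            argv.append(random_value(arg_type, rng))
    return argv

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(" ".join(random_invocation("mytool", OPTION_GRAMMAR, rng)))
```

Each generated list could be passed to a process runner to exercise the program under test; the sketch stops at generation so that it stays self-contained.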
Gupta, Anurag |
ESEC/FSE '22-IND: "AutoTSG: Learning and Synthesis ..."
AutoTSG: Learning and Synthesis for Incident Troubleshooting
Manish Shetty, Chetan Bansal, Sai Pramod Upadhyayula, Arjun Radhakrishna, and Anurag Gupta (Microsoft Research, India; Microsoft, USA) Incident management is a key aspect of operating large-scale cloud services. To aid faster and more efficient resolution of incidents, engineering teams document frequent troubleshooting steps in the form of Troubleshooting Guides (TSGs), to be used by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute necessary steps. This results in a plethora of issues such as on-call fatigue, reduced productivity, and human errors. In this work, we conduct a large-scale empirical study of over 4K TSGs mapped to incidents and find that TSGs are widely used and help significantly reduce mitigation efforts. We then analyze feedback on TSGs provided by 400+ OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To alleviate these gaps, we investigate the automation of TSGs and propose AutoTSG -- a novel framework for automation of TSGs to executable workflows by combining machine learning and program synthesis. Our evaluation of AutoTSG on 50 TSGs shows its effectiveness in both identifying TSG statements (accuracy 0.89) and parsing them for execution (precision 0.94 and recall 0.91). Lastly, we survey ten Microsoft engineers and show the importance of TSG automation and the usefulness of AutoTSG. @InProceedings{ESEC/FSE22p1477, author = {Manish Shetty and Chetan Bansal and Sai Pramod Upadhyayula and Arjun Radhakrishna and Anurag Gupta}, title = {AutoTSG: Learning and Synthesis for Incident Troubleshooting}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1477--1488}, doi = {10.1145/3540250.3558958}, year = {2022}, } Publisher's Version |
|
Hall, Tracy |
ESEC/FSE '22-IND: "Towards Developer-Centered ..."
Towards Developer-Centered Automatic Program Repair: Findings from Bloomberg
Emily Rowan Winter, Vesna Nowack, David Bowes, Steve Counsell, Tracy Hall, Sæmundur Haraldsson, John Woodward, Serkan Kirbas, Etienne Windels, Olayori McBello, Abdurahman Atakishiyev, Kevin Kells, and Matthew Pagano (Lancaster University, UK; Brunel University London, UK; University of Stirling, UK; Queen Mary University of London, UK; Bloomberg, UK) This paper reports on qualitative research into automatic program repair (APR) at Bloomberg. Six focus groups were conducted with a total of seventeen participants (including both developers of the APR tool and developers using the tool) to consider: the development at Bloomberg of a prototype APR tool (Fixie); developers’ early experiences using the tool; and developers’ perspectives on how they would like to interact with the tool in future. APR is developing rapidly and it is important to understand in greater detail developers' experiences using this emerging technology. In this paper, we provide in-depth, qualitative data from an industrial setting. We found that the development of APR at Bloomberg had become increasingly user-centered, emphasising how fixes were presented to developers, as well as particular features, such as customisability. From the focus groups with developers who had used Fixie, we found particular concern with the pragmatic aspects of APR, such as how and when fixes were presented to them. Based on our findings, we make a series of recommendations to inform future APR development, highlighting how APR tools should 'start small', be customisable, and fit with developers' workflows. We also suggest that APR tools should capitalise on the promise of repair bots and draw on advances in explainable AI. 
@InProceedings{ESEC/FSE22p1578, author = {Emily Rowan Winter and Vesna Nowack and David Bowes and Steve Counsell and Tracy Hall and Sæmundur Haraldsson and John Woodward and Serkan Kirbas and Etienne Windels and Olayori McBello and Abdurahman Atakishiyev and Kevin Kells and Matthew Pagano}, title = {Towards Developer-Centered Automatic Program Repair: Findings from Bloomberg}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1578--1588}, doi = {10.1145/3540250.3558953}, year = {2022}, } Publisher's Version |
|
Hallin, Erik |
ESEC/FSE '22-IND: "Testing of Machine Learning ..."
Testing of Machine Learning Models with Limited Samples: An Industrial Vacuum Pumping Application
Ayan Chatterjee, Bestoun S. Ahmed, Erik Hallin, and Anton Engman (Karlstad University, Sweden; Uddeholms, Sweden) There is often a scarcity of training data for machine learning (ML) classification and regression models in industrial production, especially for time-consuming or sparsely run manufacturing processes. Traditionally, a majority of the limited ground-truth data is used for training, while a handful of samples are left for testing. In that case, the number of test samples is inadequate to properly evaluate the robustness of the ML models under test (i.e., the system under test) for classification and regression. Furthermore, the output of these ML models may be inaccurate or even fail if the input data differ from the expected. This is the case for ML models used in the Electroslag Remelting (ESR) process in the refined steel industry to predict the pressure in a vacuum chamber. A vacuum pumping event that occurs once a workday generates a few hundred samples in a year of pumping for training and testing. In the absence of adequate training and test samples, this paper first presents a method to generate a fresh set of augmented samples based on vacuum pumping principles. Based on the generated augmented samples, three test scenarios and one test oracle are presented to assess the robustness of an ML model used for production on an industrial scale. Experiments are conducted with real industrial production data obtained from Uddeholms AB steel company. The evaluations indicate that Ensemble and Neural Network are the most robust when trained on augmented data using the proposed testing strategy. The evaluation also demonstrates the proposed method's effectiveness in checking and improving ML algorithms' robustness in such situations. The work improves software testing's state-of-the-art robustness testing in similar settings. 
Finally, the paper presents an MLOps implementation of the proposed approach for real-time ML model prediction and action on the edge node and automated continuous delivery of ML software from the cloud. @InProceedings{ESEC/FSE22p1280, author = {Ayan Chatterjee and Bestoun S. Ahmed and Erik Hallin and Anton Engman}, title = {Testing of Machine Learning Models with Limited Samples: An Industrial Vacuum Pumping Application}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1280--1290}, doi = {10.1145/3540250.3558943}, year = {2022}, } Publisher's Version |
|
Han, DongGyun |
ESEC/FSE '22-DEMO: "iTiger: An Automatic Issue ..."
iTiger: An Automatic Issue Title Generation Tool
Ting Zhang, Ivana Clairine Irsan, Ferdian Thung, DongGyun Han, David Lo, and Lingxiao Jiang (Singapore Management University, Singapore; Royal Holloway University of London, UK) In both commercial and open-source software, bug reports or issues are used to track bugs or feature requests. However, the quality of issues can differ a lot. Prior research has found that bug reports with good quality tend to gain more attention than the ones with poor quality. As an essential component of an issue, title quality is an important aspect of issue quality. Moreover, issues are usually presented in a list view, where only the issue title and some metadata are present. In this case, a concise and accurate title is crucial for readers to grasp the general concept of the issue and facilitate the issue triaging. Previous work formulated the issue title generation task as a one-sentence summarization task. A sequence-to-sequence model was employed to solve this task. However, it requires a large amount of domain-specific training data to attain good performance in issue title generation. Recently, pre-trained models, which learned knowledge from large-scale general corpora, have shown much success in software engineering tasks. In this work, we make the first attempt to fine-tune BART, which has been pre-trained using English corpora, to generate issue titles. We implemented the fine-tuned BART as a web tool named iTiger, which can suggest an issue title based on the issue description. iTiger is fine-tuned on 267,094 GitHub issues. We compared iTiger with the state-of-the-art method, i.e., iTAPE, on 33,438 issues. 
The automatic evaluation shows that iTiger outperforms iTAPE by 29.7 Demo URL: https://youtu.be/-JMWR9-lR78 Source code and replication package URL: https://github.com/soarsmu/iTiger @InProceedings{ESEC/FSE22p1637, author = {Ting Zhang and Ivana Clairine Irsan and Ferdian Thung and DongGyun Han and David Lo and Lingxiao Jiang}, title = {iTiger: An Automatic Issue Title Generation Tool}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1637--1641}, doi = {10.1145/3540250.3558934}, year = {2022}, } Publisher's Version Video |
|
Han, Liping |
ESEC/FSE '22-IND: "Are Elevator Software Robust ..."
Are Elevator Software Robust against Uncertainties? Results and Experiences from an Industrial Case Study
Liping Han, Tao Yue, Shaukat Ali, Aitor Arrieta, and Maite Arratibel (Nanjing University of Aeronautics and Astronautics, China; Simula Research Laboratory, Norway; Mondragon University, Spain; Orona, Spain) Industrial elevator systems are complex Cyber-Physical Systems operating in uncertain environments and experiencing uncertain passenger behaviors, hardware delays, and software errors. Identifying, understanding, and classifying such uncertainties are essential to enable system designers to reason about uncertainties and subsequently develop solutions for empowering elevator systems to deal with uncertainties systematically. To this end, we present a method, called RuCynefin, based on the Cynefin framework to classify uncertainties in industrial elevator systems from our industrial partner (Orona, Spain), results of which can then be used for assessing their robustness. RuCynefin is equipped with a novel classification algorithm to identify the Cynefin contexts for a variety of uncertainties in industrial elevator systems, and a novel metric for measuring the robustness using the uncertainty classification. We evaluated RuCynefin with an industrial case study of 90 dispatchers from Orona to assess their robustness against uncertainties. Results show that RuCynefin could effectively identify several situations for which certain dispatchers were not robust. Specifically, 93% of such versions showed some degree of low robustness against uncertainties. We also provide insights on the potential practical usages of RuCynefin, which are useful for practitioners in this field. @InProceedings{ESEC/FSE22p1331, author = {Liping Han and Tao Yue and Shaukat Ali and Aitor Arrieta and Maite Arratibel}, title = {Are Elevator Software Robust against Uncertainties? Results and Experiences from an Industrial Case Study}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1331--1342}, doi = {10.1145/3540250.3558955}, year = {2022}, } Publisher's Version |
|
Han, Tingxu |
ESEC/FSE '22: "RULER: Discriminative and ..."
RULER: Discriminative and Iterative Adversarial Training for Deep Neural Network Fairness
Guanhong Tao, Weisong Sun, Tingxu Han, Chunrong Fang, and Xiangyu Zhang (Purdue University, USA; Nanjing University, China) Deep Neural Networks (DNNs) are becoming an integral part of many real-world applications, such as autonomous driving and financial management. While these models enable autonomy, there are however concerns regarding their ethics in decision making. For instance, fairness is an aspect that requires particular attention. A number of fairness testing techniques have been proposed to address this issue, e.g., by generating test cases called individual discriminatory instances for repairing DNNs. Although they have demonstrated great potential, they tend to generate many test cases that are not directly effective in improving fairness and incur substantial computation overhead. We propose a new model repair technique, RULER, by discriminating sensitive and non-sensitive attributes during test case generation for model repair. The generated cases are then used in training to improve DNN fairness. RULER balances the trade-off between accuracy and fairness by decomposing the training procedure into two phases and introducing a novel iterative adversarial training method for fairness. Compared to the state-of-the-art techniques on four datasets, RULER has 7-28 times more effective repair test cases generated, is 10-15 times faster in test generation, and has 26-43% more fairness improvement on average. @InProceedings{ESEC/FSE22p1173, author = {Guanhong Tao and Weisong Sun and Tingxu Han and Chunrong Fang and Xiangyu Zhang}, title = {RULER: Discriminative and Iterative Adversarial Training for Deep Neural Network Fairness}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1173--1184}, doi = {10.1145/3540250.3549169}, year = {2022}, } Publisher's Version |
|
Haque, Mirazul |
ESEC/FSE '22: "NMTSloth: Understanding and ..."
NMTSloth: Understanding and Testing Efficiency Degradation of Neural Machine Translation Systems
Simin Chen, Cong Liu, Mirazul Haque, Zihe Song, and Wei Yang (University of Texas at Dallas, USA) Neural Machine Translation (NMT) systems have received much recent attention due to their human-level accuracy. While existing works mostly focus on either improving accuracy or testing accuracy robustness, the computation efficiency of NMT systems, which is of paramount importance due to often vast translation demands and real-time requirements, has surprisingly received little attention. In this paper, we make the first attempt to understand and test potential computation efficiency robustness in state-of-the-art NMT systems. By analyzing the working mechanism and implementation of 1455 publicly accessible NMT systems, we observe a fundamental property in NMT systems that could be manipulated in an adversarial manner to reduce computation efficiency significantly. Our key observation is that the output length, rather than the input length, determines the computation efficiency of NMT systems, where the output length depends on two factors: an often sufficiently large yet pessimistic pre-configured threshold controlling the maximum number of iterations and a runtime-generated end of sentence (EOS) token. Our key motivation is to generate test inputs that could sufficiently delay the generation of EOS such that NMT systems would have to go through enough iterations to reach the pre-configured threshold. We present NMTSloth, which develops a gradient-guided technique that searches for a minimal and unnoticeable perturbation at character-level, token-level, and structure-level, which sufficiently delays the appearance of EOS and forces these inputs to reach the naturally-unreachable threshold. To demonstrate the effectiveness of NMTSloth, we conduct a systematic evaluation on three publicly available NMT systems: Google T5, AllenAI WMT14, and Helsinki-NLP translators. 
Experimental results show that NMTSloth can increase NMT systems' response latency and energy consumption by 85% to 3153% and 86% to 3052%, respectively, by perturbing just one character or token in the input sentence. Our case study shows that inputs generated by NMTSloth significantly affect the battery power in real-world mobile devices (i.e., they drain more than 30 times the battery power of normal inputs). @InProceedings{ESEC/FSE22p1148, author = {Simin Chen and Cong Liu and Mirazul Haque and Zihe Song and Wei Yang}, title = {NMTSloth: Understanding and Testing Efficiency Degradation of Neural Machine Translation Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1148--1160}, doi = {10.1145/3540250.3549102}, year = {2022}, } Publisher's Version |
|
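The dependence of decoding cost on the EOS token and the iteration threshold, as described in the NMTSloth abstract, can be illustrated with a toy decoding loop. This is only a sketch under stated assumptions: `greedy_decode`, `normal_model`, and `delayed_model` are hypothetical stand-ins, not part of NMTSloth or of any real NMT library, and no actual model or perturbation search is involved.

```python
EOS = "<eos>"

def greedy_decode(next_token, max_length):
    """Run a toy autoregressive loop; return the output and iteration count.

    `next_token` is a hypothetical stand-in for one decoder step of an NMT
    model: it maps the tokens generated so far to the next token.
    """
    output = []
    steps = 0
    while steps < max_length:
        token = next_token(output)
        steps += 1
        if token == EOS:
            break  # decoding stops as soon as EOS appears
        output.append(token)
    return output, steps

def normal_model(prefix):
    # Emits EOS after three tokens, as a well-behaved input might.
    return EOS if len(prefix) >= 3 else f"w{len(prefix)}"

def delayed_model(prefix):
    # Never emits EOS, mimicking an input whose perturbation delays EOS.
    return f"w{len(prefix)}"

if __name__ == "__main__":
    for name, model in [("normal", normal_model), ("delayed", delayed_model)]:
        _, steps = greedy_decode(model, max_length=200)
        print(name, "decoder iterations:", steps)
```

In this toy setting the input that never yields EOS runs until `max_length`, which is the behavior the abstract attributes to efficiency degradation; the value 200 is an arbitrary choice for illustration.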
Haraldsson, Sæmundur |
ESEC/FSE '22-IND: "Towards Developer-Centered ..."
Towards Developer-Centered Automatic Program Repair: Findings from Bloomberg
Emily Rowan Winter, Vesna Nowack, David Bowes, Steve Counsell, Tracy Hall, Sæmundur Haraldsson, John Woodward, Serkan Kirbas, Etienne Windels, Olayori McBello, Abdurahman Atakishiyev, Kevin Kells, and Matthew Pagano (Lancaster University, UK; Brunel University London, UK; University of Stirling, UK; Queen Mary University of London, UK; Bloomberg, UK) This paper reports on qualitative research into automatic program repair (APR) at Bloomberg. Six focus groups were conducted with a total of seventeen participants (including both developers of the APR tool and developers using the tool) to consider: the development at Bloomberg of a prototype APR tool (Fixie); developers’ early experiences using the tool; and developers’ perspectives on how they would like to interact with the tool in future. APR is developing rapidly and it is important to understand in greater detail developers' experiences using this emerging technology. In this paper, we provide in-depth, qualitative data from an industrial setting. We found that the development of APR at Bloomberg had become increasingly user-centered, emphasising how fixes were presented to developers, as well as particular features, such as customisability. From the focus groups with developers who had used Fixie, we found particular concern with the pragmatic aspects of APR, such as how and when fixes were presented to them. Based on our findings, we make a series of recommendations to inform future APR development, highlighting how APR tools should 'start small', be customisable, and fit with developers' workflows. We also suggest that APR tools should capitalise on the promise of repair bots and draw on advances in explainable AI. 
@InProceedings{ESEC/FSE22p1578, author = {Emily Rowan Winter and Vesna Nowack and David Bowes and Steve Counsell and Tracy Hall and Sæmundur Haraldsson and John Woodward and Serkan Kirbas and Etienne Windels and Olayori McBello and Abdurahman Atakishiyev and Kevin Kells and Matthew Pagano}, title = {Towards Developer-Centered Automatic Program Repair: Findings from Bloomberg}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1578--1588}, doi = {10.1145/3540250.3558953}, year = {2022}, } Publisher's Version |
|
Harman, Mark |
ESEC/FSE '22: "MAAT: A Novel Ensemble Approach ..."
MAAT: A Novel Ensemble Approach to Addressing Fairness and Performance Bugs for Machine Learning Software
Zhenpeng Chen, Jie M. Zhang, Federica Sarro, and Mark Harman (University College London, UK; King’s College London, UK) Machine Learning (ML) software can lead to unfair and unethical decisions, making software fairness bugs an increasingly significant concern for software engineers. However, addressing fairness bugs often comes at the cost of introducing more ML performance (e.g., accuracy) bugs. In this paper, we propose MAAT, a novel ensemble approach to improving fairness-performance trade-off for ML software. Conventional ensemble methods combine different models with identical learning objectives. MAAT, instead, combines models optimized for different objectives: fairness and ML performance. We conduct an extensive evaluation of MAAT with 5 state-of-the-art methods, 9 software decision tasks, and 15 fairness-performance measurements. The results show that MAAT significantly outperforms the state-of-the-art. In particular, MAAT beats the trade-off baseline constructed by a recent benchmarking tool in 92.2% of the overall cases evaluated, 12.2 percentage points more than the best technique currently available. Moreover, the superiority of MAAT over the state-of-the-art holds on all the tasks and measurements that we study. We have made publicly available the code and data of this work to allow for future replication and extension. @InProceedings{ESEC/FSE22p1122, author = {Zhenpeng Chen and Jie M. Zhang and Federica Sarro and Mark Harman}, title = {MAAT: A Novel Ensemble Approach to Addressing Fairness and Performance Bugs for Machine Learning Software}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1122--1134}, doi = {10.1145/3540250.3549093}, year = {2022}, } Publisher's Version Published Artifact Artifacts Available Artifacts Reusable |
|
Hart, Jacob |
ESEC/FSE '22: "Pair Programming Conversations ..."
Pair Programming Conversations with Agents vs. Developers: Challenges and Opportunities for SE Community
Peter Robe, Sandeep K. Kuttal, Jake AuBuchon, and Jacob Hart (University of Tulsa, USA) Recent research has shown the feasibility of an interactive pair-programming conversational agent, but implementing such an agent poses three challenges: a lack of benchmark datasets, an absence of software engineering specific labels, and the need to understand developer conversations. To address these challenges, we conducted a Wizard of Oz study with 14 participants pair programming with a simulated agent and collected 4,443 developer-agent utterances. Based on this dataset, we created 26 software engineering labels using an open coding process to develop a hierarchical classification scheme. To understand labeled developer-agent conversations, we compared the accuracy of three state-of-the-art transformer-based language models, BERT, GPT-2, and XLNet, which performed interchangeably. In order to begin creating a developer-agent dataset, researchers and practitioners need to conduct resource-intensive Wizard of Oz studies. Presently, there exist vast amounts of developer-developer conversations on video hosting websites. To investigate the feasibility of using developer-developer conversations, we labeled a publicly available developer-developer dataset (3,436 utterances) with our hierarchical classification scheme and found that a BERT model trained on developer-developer data performed ~10% worse than the BERT trained on developer-agent data, but when using transfer-learning, accuracy improved. Finally, our qualitative analysis revealed that developer-developer conversations are more implicit, neutral, and opinionated than developer-agent conversations. Our results have implications for software engineering researchers and practitioners developing conversational agents. @InProceedings{ESEC/FSE22p319, author = {Peter Robe and Sandeep K. Kuttal and Jake AuBuchon and Jacob Hart}, title = {Pair Programming Conversations with Agents vs. Developers: Challenges and Opportunities for SE Community}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {319--331}, doi = {10.1145/3540250.3549127}, year = {2022}, } Publisher's Version |
|
Haryono, Stefanus Agus |
ESEC/FSE '22: "AutoPruner: Transformer-Based ..."
AutoPruner: Transformer-Based Call Graph Pruning
Thanh Le-Cong, Hong Jin Kang, Truong Giang Nguyen, Stefanus Agus Haryono, David Lo, Xuan-Bach D. Le, and Quyet Thang Huynh (Singapore Management University, Singapore; University of Melbourne, Australia; Hanoi University of Science and Technology, Vietnam) Constructing a static call graph requires trade-offs between soundness and precision. Program analysis techniques for constructing call graphs are unfortunately usually imprecise. To address this problem, researchers have recently proposed call graph pruning empowered by machine learning to post-process call graphs constructed by static analysis. A machine learning model is built to capture information from the call graph by extracting structural features for use in a random forest classifier. It then removes edges that are predicted to be false positives. Despite the improvements shown by machine learning models, they are still limited as they do not consider the source code semantics and thus often are not able to effectively distinguish true and false positives. In this paper, we present a novel call graph pruning technique, AutoPruner, for eliminating false positives in call graphs via both statistical semantic and structural analysis. Given a call graph constructed by traditional static analysis tools, AutoPruner takes a Transformer-based approach to capture the semantic relationships between the caller and callee functions associated with each edge in the call graph. To do so, AutoPruner fine-tunes a model of code that was pre-trained on a large corpus to represent source code based on descriptions of its semantics. Next, the model is used to extract semantic features from the functions related to each edge in the call graph. AutoPruner uses these semantic features together with the structural features extracted from the call graph to classify each edge via a feed-forward neural network. 
Our empirical evaluation on a benchmark dataset of real-world programs shows that AutoPruner outperforms the state-of-the-art baselines, improving on F-measure by up to 13% in identifying false-positive edges in a static call graph. Moreover, AutoPruner achieves improvements on two client analyses, including halving the false alarm rate on null pointer analysis and over 10% improvements on monomorphic call-site detection. Additionally, our ablation study and qualitative analysis show that the semantic features extracted by AutoPruner capture a remarkable amount of information for distinguishing between true and false positives. @InProceedings{ESEC/FSE22p520, author = {Thanh Le-Cong and Hong Jin Kang and Truong Giang Nguyen and Stefanus Agus Haryono and David Lo and Xuan-Bach D. Le and Quyet Thang Huynh}, title = {AutoPruner: Transformer-Based Call Graph Pruning}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {520--532}, doi = {10.1145/3540250.3549175}, year = {2022}, } Publisher's Version Artifacts Functional |
|
Harzevili, Nima Shiri |
ESEC/FSE '22: "API Recommendation for Machine ..."
API Recommendation for Machine Learning Libraries: How Far Are We?
Moshi Wei, Yuchao Huang, Junjie Wang, Jiho Shin, Nima Shiri Harzevili, and Song Wang (York University, Canada; Institute of Software at Chinese Academy of Sciences, China) Application Programming Interfaces (APIs) are designed to help developers build software more effectively. Recommending the right APIs for specific tasks is gaining increasing attention among researchers and developers. However, most of the existing approaches are mainly evaluated for general programming tasks using statically typed programming languages such as Java. Little is known about their practical effectiveness and usefulness for machine learning (ML) programming tasks with dynamically typed programming languages such as Python, whose paradigms are fundamentally different from general programming tasks. This is of great value considering the increasing popularity of ML and the large number of new questions appearing on question answering websites. In this work, we set out to investigate the effectiveness of existing API recommendation approaches for Python-based ML programming tasks from Stack Overflow (SO). Specifically, we conducted an empirical study of six widely-used Python-based ML libraries using two state-of-the-art API recommendation approaches, i.e., BIKER and DeepAPI. We found that the existing approaches perform poorly for two main reasons: (1) Python-based ML tasks often require significantly long API sequences; and (2) there are common API usage patterns in Python-based ML programming tasks that existing approaches cannot handle. Inspired by our findings, we proposed a simple but effective frequent itemset mining-based approach, FIMAX, to boost existing API recommendation approaches for Python-based ML programming tasks by leveraging the common API usage information from SO questions. 
Our evaluation shows that FIMAX improves existing state-of-the-art API recommendation approaches by up to 54.3% and 57.4% in MRR and MAP, respectively. Our user study with 14 developers further demonstrates the practicality of FIMAX for API recommendation. @InProceedings{ESEC/FSE22p370, author = {Moshi Wei and Yuchao Huang and Junjie Wang and Jiho Shin and Nima Shiri Harzevili and Song Wang}, title = {API Recommendation for Machine Learning Libraries: How Far Are We?}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {370--381}, doi = {10.1145/3540250.3549124}, year = {2022}, } Publisher's Version |
|
Hassan, Ahmed E. |
ESEC/FSE '22: "Automated Unearthing of Dangerous ..."
Automated Unearthing of Dangerous Issue Reports
Shengyi Pan, Jiayuan Zhou, Filipe Roseiro Cogo, Xin Xia, Lingfeng Bao, Xing Hu, Shanping Li, and Ahmed E. Hassan (Zhejiang University, China; Huawei, Canada; Huawei, China; Queen’s University, Canada) The coordinated vulnerability disclosure (CVD) process is commonly adopted for open source software (OSS) vulnerability management. It suggests privately reporting discovered vulnerabilities and keeping relevant information secret until the official disclosure. However, in practice, due to various reasons (e.g., lacking security domain expertise or the sense of security management), many vulnerabilities are first reported via public issue reports (IRs) before their official disclosure. Such IRs are dangerous IRs, since attackers can take advantage of the leaked vulnerability information to launch zero-day attacks. It is crucial to identify such dangerous IRs at an early stage, such that OSS users can start the vulnerability remediation process earlier and OSS maintainers can manage the dangerous IRs in a timely manner. In this paper, we propose and evaluate a deep learning based approach, namely MemVul, to automatically identify dangerous IRs at the time they are reported. MemVul augments the neural networks with a memory component, which stores the external vulnerability knowledge from Common Weakness Enumeration (CWE). We rely on publicly accessible CVE-referred IRs (CIRs) to operationalize the concept of dangerous IR. We mine 3,937 CIRs distributed across 1,390 OSS projects hosted on GitHub. Evaluated under a practical scenario of high data imbalance, MemVul achieves the best trade-off between precision and recall among all baselines. In particular, the F1-score of MemVul (i.e., 0.49) improves the best performing baseline by 44%. For IRs that are predicted as CIRs but not reported to CVE, we conduct a user study to investigate their usefulness to OSS stakeholders. 
We observe that 82% (41 out of 50) of these IRs are security-related and 28 of them are suggested by security experts to be publicly disclosed, indicating MemVul is capable of identifying undisclosed dangerous IRs. @InProceedings{ESEC/FSE22p834, author = {Shengyi Pan and Jiayuan Zhou and Filipe Roseiro Cogo and Xin Xia and Lingfeng Bao and Xing Hu and Shanping Li and Ahmed E. Hassan}, title = {Automated Unearthing of Dangerous Issue Reports}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {834--846}, doi = {10.1145/3540250.3549156}, year = {2022}, } Publisher's Version
ESEC/FSE '22: "What Motivates Software Practitioners ..."
What Motivates Software Practitioners to Contribute to Inner Source?
Zhiyuan Wan, Xin Xia, Yun Zhang, David Lo, Daibing Zhou, Qiuyuan Chen, and Ahmed E. Hassan (Zhejiang University, China; Huawei, China; Zhejiang University City College, China; Singapore Management University, Singapore; Queen’s University, Canada) Software development organizations have adopted open source development practices to support or augment their software development processes, a phenomenon referred to as inner source. Given the rapid adoption of inner source, we wonder what motivates software practitioners to contribute to inner source projects. We followed a mixed-methods approach--a qualitative phase of interviews with 20 interviewees, followed by a quantitative phase of an exploratory survey with 124 respondents from 13 countries across four continents. Our study uncovers practitioners' motivation to contribute to inner source projects, as well as how the motivation differs from what motivates practitioners to participate in open source projects. We also investigate how software practitioners' motivation impacts their contribution level and continuance intention in inner source projects. Based on our findings, we outline directions for future research and provide recommendations for organizations and software practitioners. 
@InProceedings{ESEC/FSE22p132, author = {Zhiyuan Wan and Xin Xia and Yun Zhang and David Lo and Daibing Zhou and Qiuyuan Chen and Ahmed E. Hassan}, title = {What Motivates Software Practitioners to Contribute to Inner Source?}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {132--144}, doi = {10.1145/3540250.3549148}, year = {2022}, } Publisher's Version |
|
He, Hao |
ESEC/FSE '22-DEMO: "GFI-Bot: Automated Good First ..."
GFI-Bot: Automated Good First Issue Recommendation on GitHub
Hao He, Haonan Su, Wenxin Xiao, Runzhi He, and Minghui Zhou (Peking University, China) To facilitate newcomer onboarding, GitHub recommends the use of "good first issue" (GFI) labels to signal issues suitable for newcomers to resolve. However, previous research shows that manually labeled GFIs are scarce and inappropriate, showing a need for automated recommendations. In this paper, we present GFI-Bot (accessible at https://gfibot.io), a proof-of-concept machine learning powered bot for automated GFI recommendation in practice. Project maintainers can configure GFI-Bot to discover and label possible GFIs so that newcomers can easily locate issues for making their first contributions. GFI-Bot also provides a high-quality, up-to-date dataset for advancing GFI recommendation research. @InProceedings{ESEC/FSE22p1751, author = {Hao He and Haonan Su and Wenxin Xiao and Runzhi He and Minghui Zhou}, title = {GFI-Bot: Automated Good First Issue Recommendation on GitHub}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1751--1755}, doi = {10.1145/3540250.3558922}, year = {2022}, } Publisher's Version |
|
He, Pinjia |
ESEC/FSE '22-IND: "An Empirical Study of Log ..."
An Empirical Study of Log Analysis at Microsoft
Shilin He, Xu Zhang, Pinjia He, Yong Xu, Liqun Li, Yu Kang, Minghua Ma, Yining Wei, Yingnong Dang, Saravanakumar Rajmohan, and Qingwei Lin (Microsoft Research, China; Chinese University of Hong Kong at Shenzhen, China; Microsoft Azure, China; Microsoft Azure, USA; Microsoft 365, USA) Logs are crucial to the management and maintenance of software systems. In recent years, log analysis research has achieved notable progress on various topics such as log parsing and log-based anomaly detection. However, the real voices from front-line practitioners are seldom heard. For example, what are the pain points of log analysis in practice? In this work, we conduct a comprehensive survey study on log analysis at Microsoft. We collected feedback from 105 employees through a questionnaire of 13 questions and individual interviews with 12 employees. We summarize the format, scenario, method, tool, and pain points of log analysis. Additionally, by comparing the industrial practices with academic research, we discuss the gaps between academia and industry, and future opportunities on log analysis with four inspiring findings. Particularly, we observe a huge gap exists between log anomaly detection research and failure alerting practices regarding the goal, technique, efficiency, etc. Moreover, data-driven log parsing, which has been widely studied in recent research, can be alternatively achieved by simply logging template IDs during software development. We hope this paper could uncover the real needs of industrial practitioners and the unnoticed yet significant gap between industry and academia, and inspire interesting future directions that converge efforts from both sides. 
@InProceedings{ESEC/FSE22p1465, author = {Shilin He and Xu Zhang and Pinjia He and Yong Xu and Liqun Li and Yu Kang and Minghua Ma and Yining Wei and Yingnong Dang and Saravanakumar Rajmohan and Qingwei Lin}, title = {An Empirical Study of Log Analysis at Microsoft}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1465--1476}, doi = {10.1145/3540250.3558963}, year = {2022}, } Publisher's Version |
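The abstract's remark that data-driven log parsing can be sidestepped by logging template IDs at development time can be illustrated with a minimal sketch. The template registry, the IDs, and the helper names `log_event` and `parse_event` are hypothetical and are not taken from the paper; the sketch only shows why no clustering-based parser is needed once each log line carries its template ID.

```python
import json

# Hypothetical template registry; the IDs and texts are illustrative only.
TEMPLATES = {
    "T001": "Connection to {host} timed out after {seconds}s",
    "T002": "Node {node} marked unhealthy",
}


def log_event(template_id, **params):
    """Emit one structured log line carrying the template ID and its parameters."""
    if template_id not in TEMPLATES:
        raise KeyError(f"unknown template: {template_id}")
    return json.dumps({"template_id": template_id, "params": params}, sort_keys=True)


def parse_event(line):
    """Recover the template ID, parameters, and rendered message by lookup alone."""
    record = json.loads(line)
    template_id = record["template_id"]
    message = TEMPLATES[template_id].format(**record["params"])
    return template_id, record["params"], message


line = log_event("T001", host="db-7", seconds=30)
template_id, params, message = parse_event(line)
print(template_id, params, message)
```

Because the template is known at logging time, "parsing" reduces to a dictionary lookup; the trade-off is that every logging call site must be written against the registry.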
|
He, Runzhi |
ESEC/FSE '22-DEMO: "GFI-Bot: Automated Good First ..."
GFI-Bot: Automated Good First Issue Recommendation on GitHub
Hao He, Haonan Su, Wenxin Xiao, Runzhi He, and Minghui Zhou (Peking University, China) To facilitate newcomer onboarding, GitHub recommends the use of "good first issue" (GFI) labels to signal issues suitable for newcomers to resolve. However, previous research shows that manually labeled GFIs are scarce and inappropriate, showing a need for automated recommendations. In this paper, we present GFI-Bot (accessible at https://gfibot.io), a proof-of-concept machine learning powered bot for automated GFI recommendation in practice. Project maintainers can configure GFI-Bot to discover and label possible GFIs so that newcomers can easily locate issues for making their first contributions. GFI-Bot also provides a high-quality, up-to-date dataset for advancing GFI recommendation research. @InProceedings{ESEC/FSE22p1751, author = {Hao He and Haonan Su and Wenxin Xiao and Runzhi He and Minghui Zhou}, title = {GFI-Bot: Automated Good First Issue Recommendation on GitHub}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1751--1755}, doi = {10.1145/3540250.3558922}, year = {2022}, } Publisher's Version |
|
He, Shilin |
ESEC/FSE '22-IND: "An Empirical Investigation ..."
An Empirical Investigation of Missing Data Handling in Cloud Node Failure Prediction
Minghua Ma, Yudong Liu, Yuang Tong, Haozhe Li, Pu Zhao, Yong Xu, Hongyu Zhang, Shilin He, Lu Wang, Yingnong Dang, Saravanakumar Rajmohan, and Qingwei Lin (Microsoft Research, China; University of Newcastle, Australia; Microsoft, USA) Cloud computing systems have become increasingly popular in recent years. A typical cloud system utilizes millions of computing nodes as the basic infrastructure. Node failure has been identified as one of the most prevalent causes of cloud system downtime. To improve the reliability of cloud systems, many previous studies collected monitoring metrics from nodes and built models to predict node failures before the failures happen. However, based on our experience with large-scale real-world cloud systems in Microsoft, we find that the task of predicting node failure is severely hampered by missing data. There is a large amount of missing data, and the online latest data utilized for prediction is even worse. As a result, the real-time performance of the node prediction model is limited. In this paper, we first characterize the missing data problem for node failure prediction. Then, we evaluate several existing data interpolation approaches, and find that node dimension interpolation approaches outperform time dimension ones and deep learning based interpolation is the best for early prediction. Our findings can help academics and engineers address the missing data problem in cloud node failure prediction and other data-driven software engineering scenarios. 
@InProceedings{ESEC/FSE22p1453, author = {Minghua Ma and Yudong Liu and Yuang Tong and Haozhe Li and Pu Zhao and Yong Xu and Hongyu Zhang and Shilin He and Lu Wang and Yingnong Dang and Saravanakumar Rajmohan and Qingwei Lin}, title = {An Empirical Investigation of Missing Data Handling in Cloud Node Failure Prediction}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1453--1464}, doi = {10.1145/3540250.3558946}, year = {2022}, } Publisher's Version ESEC/FSE '22-IND: "An Empirical Study of Log ..." An Empirical Study of Log Analysis at Microsoft Shilin He, Xu Zhang, Pinjia He, Yong Xu, Liqun Li, Yu Kang, Minghua Ma, Yining Wei, Yingnong Dang, Saravanakumar Rajmohan, and Qingwei Lin (Microsoft Research, China; Chinese University of Hong Kong at Shenzhen, China; Microsoft Azure, China; Microsoft Azure, USA; Microsoft 365, USA) Logs are crucial to the management and maintenance of software systems. In recent years, log analysis research has achieved notable progress on various topics such as log parsing and log-based anomaly detection. However, the real voices from front-line practitioners are seldom heard. For example, what are the pain points of log analysis in practice? In this work, we conduct a comprehensive survey study on log analysis at Microsoft. We collected feedback from 105 employees through a questionnaire of 13 questions and individual interviews with 12 employees. We summarize the format, scenario, method, tool, and pain points of log analysis. Additionally, by comparing the industrial practices with academic research, we discuss the gaps between academia and industry, and future opportunities on log analysis with four inspiring findings. Particularly, we observe a huge gap exists between log anomaly detection research and failure alerting practices regarding the goal, technique, efficiency, etc. 
Moreover, data-driven log parsing, which has been widely studied in recent research, can be alternatively achieved by simply logging template IDs during software development. We hope this paper could uncover the real needs of industrial practitioners and the unnoticed yet significant gap between industry and academia, and inspire interesting future directions that converge efforts from both sides. @InProceedings{ESEC/FSE22p1465, author = {Shilin He and Xu Zhang and Pinjia He and Yong Xu and Liqun Li and Yu Kang and Minghua Ma and Yining Wei and Yingnong Dang and Saravanakumar Rajmohan and Qingwei Lin}, title = {An Empirical Study of Log Analysis at Microsoft}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1465--1476}, doi = {10.1145/3540250.3558963}, year = {2022}, } Publisher's Version ESEC/FSE '22: "SPINE: A Scalable Log Parser ..." SPINE: A Scalable Log Parser with Feedback Guidance Xuheng Wang, Xu Zhang, Liqun Li, Shilin He, Hongyu Zhang, Yudong Liu, Lingling Zheng, Yu Kang, Qingwei Lin, Yingnong Dang, Saravanakumar Rajmohan, and Dongmei Zhang (Tsinghua University, China; Microsoft Research, China; University of Newcastle, Australia; Microsoft Azure, USA; Microsoft 365, USA) Log parsing, which extracts log templates and parameters, is a critical prerequisite step for automated log analysis techniques. Though existing log parsers have achieved promising accuracy on public log datasets, they still face many challenges when applied in the industry. Through studying the characteristics of real-world log data and analyzing the limitations of existing log parsers, we identify two problems. Firstly, it is non-trivial to scale a log parser to a vast number of logs, especially in real-world scenarios where the log data is extremely imbalanced. Secondly, existing log parsers overlook the importance of user feedback, which is imperative for parser fine-tuning under the continuous evolution of log data. 
To overcome the challenges, we propose SPINE, which is a highly scalable log parser with user feedback guidance. Based on our log parser equipped with initial grouping and progressive clustering, we propose a novel log data scheduling algorithm to improve the efficiency of parallelization under the large-scale imbalanced log data. Besides, we introduce user feedback to make the parser adapt quickly to the evolving logs. We evaluated SPINE on 16 public log datasets. SPINE achieves more than 0.90 parsing accuracy on average with the highest parsing efficiency, which outperforms the state-of-the-art log parsers. We also evaluated SPINE in the production environment of Microsoft, in which SPINE can parse 30 million logs in less than 8 minutes under 16 executors, achieving near real-time performance. In addition, our evaluations show that SPINE can consistently achieve good accuracy under log evolution with a moderate amount of user feedback. @InProceedings{ESEC/FSE22p1198, author = {Xuheng Wang and Xu Zhang and Liqun Li and Shilin He and Hongyu Zhang and Yudong Liu and Lingling Zheng and Yu Kang and Qingwei Lin and Yingnong Dang and Saravanakumar Rajmohan and Dongmei Zhang}, title = {SPINE: A Scalable Log Parser with Feedback Guidance}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1198--1208}, doi = {10.1145/3540250.3549176}, year = {2022}, } Publisher's Version |
|
Heo, Jinseok |
ESEC/FSE '22-IND: "An Empirical Study of Deep ..."
An Empirical Study of Deep Transfer Learning-Based Program Repair for Kotlin Projects
Misoo Kim, Youngkyoung Kim, Hohyeon Jeong, Jinseok Heo, Sungoh Kim, Hyunhee Chung, and Eunseok Lee (Sungkyunkwan University, South Korea; Samsung Electronics, South Korea) Deep learning-based automated program repair (DL-APR) can automatically fix software bugs and has received significant attention in the industry because of its potential to significantly reduce software development and maintenance costs. The Samsung mobile experience (MX) team is currently switching from Java to Kotlin projects. This study reviews the application of DL-APR, which automatically fixes defects that arise during this switching process; however, the shortage of Kotlin defect-fixing datasets in Samsung MX team precludes us from fully utilizing the power of deep learning. Therefore, strategies are needed to effectively reuse the pretrained DL-APR model. This demand can be met using the Kotlin defect-fixing datasets constructed from industrial and open-source repositories, and transfer learning. This study aims to validate the performance of the pretrained DL-APR model in fixing defects in the Samsung Kotlin projects, then improve its performance by applying transfer learning. We show that transfer learning with open source and industrial Kotlin defect-fixing datasets can improve the defect-fixing performance of the existing DL-APR by 307%. Furthermore, we confirmed that the performance was improved by 532% compared with the baseline DL-APR model as a result of transferring the knowledge of an industrial (non-defect) bug-fixing dataset. We also discovered that the embedded vectors and overlapping code tokens of the code-change pairs are valuable features for selecting useful knowledge transfer instances by improving the performance of APR models by up to 696%. Our study demonstrates the possibility of applying transfer learning to practitioners who review the application of DL-APR to industrial software. 
@InProceedings{ESEC/FSE22p1441, author = {Misoo Kim and Youngkyoung Kim and Hohyeon Jeong and Jinseok Heo and Sungoh Kim and Hyunhee Chung and Eunseok Lee}, title = {An Empirical Study of Deep Transfer Learning-Based Program Repair for Kotlin Projects}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1441--1452}, doi = {10.1145/3540250.3558967}, year = {2022}, } Publisher's Version |
|
Hermann, Ben |
ESEC/FSE '22: "A Retrospective Study of One ..."
A Retrospective Study of One Decade of Artifact Evaluations
Stefan Winter, Christopher S. Timperley, Ben Hermann, Jürgen Cito, Jonathan Bell, Michael Hilton, and Dirk Beyer (LMU Munich, Germany; Carnegie Mellon University, USA; TU Dortmund, Germany; TU Wien, Austria; Northeastern University, USA) Most software engineering research involves the development of a prototype, a proof of concept, or a measurement apparatus. Together with the data collected in the research process, they are collectively referred to as research artifacts and are subject to artifact evaluation (AE) at scientific conferences. Since its initiation in the SE community at ESEC/FSE 2011, both the goals and the process of AE have evolved and today expectations towards AE are strongly linked with reproducible research results and reusable tools that other researchers can build their work on. However, to date little evidence has been provided that artifacts which have passed AE actually live up to these high expectations, i.e., to which degree AE processes contribute to AE's goals and whether the overhead they impose is justified. We aim to fill this gap by providing an in-depth analysis of research artifacts from a decade of software engineering (SE) and programming languages (PL) conferences, based on which we reflect on the goals and mechanisms of AE in our community. In summary, our analyses (1) suggest that articles with artifacts do not generally have better visibility in the community, (2) provide evidence how evaluated and not evaluated artifacts differ with respect to different quality criteria, and (3) highlight opportunities for further improving AE processes. @InProceedings{ESEC/FSE22p145, author = {Stefan Winter and Christopher S. 
Timperley and Ben Hermann and Jürgen Cito and Jonathan Bell and Michael Hilton and Dirk Beyer}, title = {A Retrospective Study of One Decade of Artifact Evaluations}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {145--156}, doi = {10.1145/3540250.3549172}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Herzig, Kim |
ESEC/FSE '22-IND: "Nalanda: A Socio-technical ..."
Nalanda: A Socio-technical Graph Platform for Building Software Analytics Tools at Enterprise Scale
Chandra Maddila, Suhas Shanbhogue, Apoorva Agrawal, Thomas Zimmermann, Chetan Bansal, Nicole Forsgren, Divyanshu Agrawal, Kim Herzig, and Arie van Deursen (Microsoft Research, USA; Microsoft Research, India; Microsoft, USA; Delft University of Technology, Netherlands) Software development is information-dense knowledge work that requires collaboration with other developers and awareness of artifacts such as work items, pull requests, and file changes. With the speed of development increasing, information overload and information discovery are challenges for people developing and maintaining these systems. Finding information about similar code changes and experts is difficult for software engineers, especially when they work in large software systems or have just recently joined a project. In this paper, we build a large scale data platform named Nalanda platform to address the challenges of information overload and discovery. Nalanda contains two subsystems: (1) a large scale socio-technical graph system, named Nalanda graph system, and (2) a large scale index system, named Nalanda index system that aims at satisfying the information needs of software developers. To show the versatility of the Nalanda platform, we built two applications: (1) a software analytics application with a news feed named MyNalanda that has Daily Active Users (DAU) of 290 and Monthly Active Users (MAU) of 590, and (2) a recommendation system for related work items and pull requests that accomplished similar tasks (artifact recommendation) and a recommendation system for subject matter experts (expert recommendation), augmented by the Nalanda socio-technical graph. Initial studies of the two applications found that developers and engineering managers are favorable toward continued use of the news feed application for information discovery. 
The studies also found that developers agreed that a system like Nalanda artifact and expert recommendation application could reduce the time spent and the number of places needed to visit to find information. @InProceedings{ESEC/FSE22p1246, author = {Chandra Maddila and Suhas Shanbhogue and Apoorva Agrawal and Thomas Zimmermann and Chetan Bansal and Nicole Forsgren and Divyanshu Agrawal and Kim Herzig and Arie van Deursen}, title = {Nalanda: A Socio-technical Graph Platform for Building Software Analytics Tools at Enterprise Scale}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1246--1256}, doi = {10.1145/3540250.3558949}, year = {2022}, } Publisher's Version Info |
|
Hilton, Michael |
ESEC/FSE '22: "A Retrospective Study of One ..."
A Retrospective Study of One Decade of Artifact Evaluations
Stefan Winter, Christopher S. Timperley, Ben Hermann, Jürgen Cito, Jonathan Bell, Michael Hilton, and Dirk Beyer (LMU Munich, Germany; Carnegie Mellon University, USA; TU Dortmund, Germany; TU Wien, Austria; Northeastern University, USA) Most software engineering research involves the development of a prototype, a proof of concept, or a measurement apparatus. Together with the data collected in the research process, they are collectively referred to as research artifacts and are subject to artifact evaluation (AE) at scientific conferences. Since its initiation in the SE community at ESEC/FSE 2011, both the goals and the process of AE have evolved and today expectations towards AE are strongly linked with reproducible research results and reusable tools that other researchers can build their work on. However, to date little evidence has been provided that artifacts which have passed AE actually live up to these high expectations, i.e., to which degree AE processes contribute to AE's goals and whether the overhead they impose is justified. We aim to fill this gap by providing an in-depth analysis of research artifacts from a decade of software engineering (SE) and programming languages (PL) conferences, based on which we reflect on the goals and mechanisms of AE in our community. In summary, our analyses (1) suggest that articles with artifacts do not generally have better visibility in the community, (2) provide evidence how evaluated and not evaluated artifacts differ with respect to different quality criteria, and (3) highlight opportunities for further improving AE processes. @InProceedings{ESEC/FSE22p145, author = {Stefan Winter and Christopher S. 
Timperley and Ben Hermann and Jürgen Cito and Jonathan Bell and Michael Hilton and Dirk Beyer}, title = {A Retrospective Study of One Decade of Artifact Evaluations}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {145--156}, doi = {10.1145/3540250.3549172}, year = {2022}, } Publisher's Version Artifacts Reusable |
|
Hoffmann, Henry |
ESEC/FSE '22: "AgileCtrl: A Self-Adaptive ..."
AgileCtrl: A Self-Adaptive Framework for Configuration Tuning
Shu Wang, Henry Hoffmann, and Shan Lu (LinkedIn, USA; University of Chicago, USA) Software systems increasingly expose performance-sensitive configuration parameters, or PerfConfs, to users. Unfortunately, the right settings of these PerfConfs are difficult to decide and often change at run time. To address this problem, prior research has proposed self-adaptive frameworks that automatically monitor the software’s behavior and dynamically tune configurations to provide the desired performance despite dynamic changes. However, these frameworks often require configuration themselves; sometimes explicitly in the form of additional parameters, sometimes implicitly in the form of training. This paper proposes a new framework, AgileCtrl, that eliminates the need of configuration for a large family of control-based self-adaptive frameworks. AgileCtrl’s key insight is to not just monitor the original software, but additionally to monitor its adaptations and reconfigure itself when its internal adaptation mechanisms are not meeting software requirements. We evaluate AgileCtrl by comparing against recent control-based approaches to self-adaptation that require user configuration. Across a number of case studies, we find AgileCtrl withstands model errors up to 106×, saves the system from performance oscillation and crashes, and improves the performance up to 53%. It also auto-adjusts improper performance goals while improving the performance by 50%. @InProceedings{ESEC/FSE22p459, author = {Shu Wang and Henry Hoffmann and Shan Lu}, title = {AgileCtrl: A Self-Adaptive Framework for Configuration Tuning}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {459--471}, doi = {10.1145/3540250.3549136}, year = {2022}, } Publisher's Version |
|
Hohenstein, Uwe |
ESEC/FSE '22-IND: "Sometimes You Have to Treat ..."
Sometimes You Have to Treat the Symptoms: Tackling Model Drift in an Industrial Clone-and-Own Software Product Line
Christof Tinnes, Wolfgang Rössler, Uwe Hohenstein, Torsten Kühn, Andreas Biesdorf, and Sven Apel (Siemens, Germany; Siemens Mobility, Germany; Saarland University, Germany) Many industrial software product lines use a clone-and-own approach for reuse among software products. As a result, the different products in the product line may drift apart, which implies increased efforts for tasks such as change propagation, domain analysis, and quality assurance. While many solutions have been proposed in the literature, these are often difficult to apply in a real-world setting. We study this drift of products in a concrete large-scale industrial model-driven clone-and-own software product line in the railway domain at our industry partner. For this purpose, we conducted interviews and a survey, and we investigated the models in the model history of this project. We found that increased efforts are mainly caused by large model differences and increased communication efforts. We argue that, in the short-term, treating the symptoms (i.e., handling large model differences) can help to keep efforts for software product-line engineering acceptable — instead of employing sophisticated variability management. To treat the symptoms, we employ a solution based on semantic-lifting to simplify model differences. Using the interviews and the survey, we evaluate the feasibility of variability management approaches and the semantic-lifting approach in the context of this project. @InProceedings{ESEC/FSE22p1355, author = {Christof Tinnes and Wolfgang Rössler and Uwe Hohenstein and Torsten Kühn and Andreas Biesdorf and Sven Apel}, title = {Sometimes You Have to Treat the Symptoms: Tackling Model Drift in an Industrial Clone-and-Own Software Product Line}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1355--1366}, doi = {10.1145/3540250.3558960}, year = {2022}, } Publisher's Version |
|
Hong, Yang |
ESEC/FSE '22: "CommentFinder: A Simpler, ..."
CommentFinder: A Simpler, Faster, More Accurate Code Review Comments Recommendation
Yang Hong, Chakkrit Tantithamthavorn, Patanamon Thongtanunam, and Aldeida Aleti (Monash University, Australia; University of Melbourne, Australia) Code review is an effective quality assurance practice, but can be labor-intensive since developers have to manually review the code and provide written feedback. Recently, a Deep Learning (DL)-based approach was introduced to automatically recommend code review comments based on changed methods. While the approach showed promising results, it requires expensive computational resource and time which limits its use in practice. To address this limitation, we propose CommentFinder – a retrieval-based approach to recommend code review comments. Through an empirical evaluation of 151,019 changed methods, we evaluate the effectiveness and efficiency of CommentFinder against the state-of-the-art approach. We find that when recommending the best-1 review comment candidate, our CommentFinder is 32% better than prior work in recommending the correct code review comment. In addition, CommentFinder is 49 times faster than the prior work. These findings highlight that our CommentFinder could help reviewers to reduce the manual efforts by recommending code review comments, while requiring less computational time. @InProceedings{ESEC/FSE22p507, author = {Yang Hong and Chakkrit Tantithamthavorn and Patanamon Thongtanunam and Aldeida Aleti}, title = {CommentFinder: A Simpler, Faster, More Accurate Code Review Comments Recommendation}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {507--519}, doi = {10.1145/3540250.3549119}, year = {2022}, } Publisher's Version |
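To make the idea of a retrieval-based recommender concrete, the following is a minimal sketch, not CommentFinder's actual implementation. It assumes whitespace tokenization, bag-of-words vectors, and cosine similarity; the function names and the two-entry corpus are hypothetical. Given a changed method, it returns the review comments attached to the most similar previously reviewed methods.

```python
import math
from collections import Counter


def tokenize(code):
    # Naive whitespace tokenization (an assumption); a real system would use a lexer.
    return code.split()


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(changed_method, corpus, k=1):
    """Return the comments of the k corpus methods most similar to changed_method."""
    query = Counter(tokenize(changed_method))
    ranked = sorted(
        corpus,
        key=lambda pair: cosine(query, Counter(tokenize(pair[0]))),
        reverse=True,
    )
    return [comment for _, comment in ranked[:k]]


# Hypothetical (method, review comment) pairs standing in for historical review data.
corpus = [
    ("public int size ( ) { return items . length ; }", "Consider a null check on items."),
    ("for ( int i = 0 ; i < n ; i ++ ) { sum += a [ i ] ; }", "Use an enhanced for loop here."),
]
top = recommend("for ( int j = 0 ; j < n ; j ++ ) { total += a [ j ] ; }", corpus)
print(top)
```

No model training is involved: the cost is one similarity computation per stored method, which is the kind of saving in computational time that the abstract attributes to retrieval over a deep learning model.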
|
Hu, Chunming |
ESEC/FSE '22: "SamplingCA: Effective and ..."
SamplingCA: Effective and Efficient Sampling-Based Pairwise Testing for Highly Configurable Software Systems
Chuan Luo, Qiyuan Zhao, Shaowei Cai, Hongyu Zhang, and Chunming Hu (Beihang University, China; Shanghai Jiao Tong University, China; Institute of Software at Chinese Academy of Sciences, China; University of Newcastle, Australia) Combinatorial interaction testing (CIT) is an effective paradigm for testing highly configurable systems, and its goal is to generate a t-wise covering array (CA) as a test suite, where t is the strength of testing. It is recognized that pairwise testing (i.e., CIT with t=2) is the most common CIT technique, and has high fault detection capability in practice. The problem of pairwise CA generation (PCAG), which is a core problem in pairwise testing, aims at generating a pairwise CA (i.e., 2-wise CA) of minimum size, subject to hard constraints. The PCAG problem is a hard combinatorial optimization problem, which urgently requires practical methods for generating pairwise CAs (PCAs) of small sizes. However, existing PCAG algorithms suffer from a severe scalability issue; that is, when solving large-scale PCAG instances, existing state-of-the-art PCAG algorithms usually take a fairly long time to generate large PCAs, which would make the testing of highly configurable systems both ineffective and inefficient. In this paper, we propose a novel and effective sampling-based approach dubbed SamplingCA for solving the PCAG problem. SamplingCA first utilizes sampling techniques to obtain a small test suite that covers as many valid pairwise tuples as possible, and then adds a few more test cases into the test suite to ensure that all valid pairwise tuples are covered. Extensive experiments on 125 public PCAG instances show that our approach can generate much smaller PCAs than its state-of-the-art competitors, indicating the effectiveness of SamplingCA. Also, our experiments show that SamplingCA runs one to two orders of magnitude faster than its competitors, demonstrating the efficiency of SamplingCA. 
Our results confirm that SamplingCA is able to address the scalability issue and considerably pushes forward the state of the art in PCAG solving. @InProceedings{ESEC/FSE22p1185, author = {Chuan Luo and Qiyuan Zhao and Shaowei Cai and Hongyu Zhang and Chunming Hu}, title = {SamplingCA: Effective and Efficient Sampling-Based Pairwise Testing for Highly Configurable Software Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1185--1197}, doi = {10.1145/3540250.3549155}, year = {2022}, } Publisher's Version Artifacts Reusable |
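The definition of a pairwise covering array under hard constraints can be checked mechanically on a toy instance. The sketch below is not SamplingCA's algorithm: it enumerates every configuration by brute force, which is feasible only for tiny examples, and the three binary options and the constraint are hypothetical. It shows what "all valid pairwise tuples are covered" means.

```python
from itertools import combinations, product


def pairs_of(config):
    """All (option index, value, option index, value) pairs in one configuration."""
    return {(i, vi, j, vj) for (i, vi), (j, vj) in combinations(enumerate(config), 2)}


def valid_pairs(options, is_valid):
    """Pairs that occur in at least one configuration satisfying the constraints."""
    pairs = set()
    for config in product(*options):
        if is_valid(config):
            pairs |= pairs_of(config)
    return pairs


def is_pairwise_ca(test_suite, options, is_valid):
    """True if every test is valid and the suite covers every valid pair."""
    if not all(is_valid(t) for t in test_suite):
        return False
    covered = set()
    for t in test_suite:
        covered |= pairs_of(t)
    return valid_pairs(options, is_valid) <= covered


# Hypothetical instance: three binary options; options 0 and 1 cannot both be enabled.
options = [(0, 1), (0, 1), (0, 1)]
is_valid = lambda c: not (c[0] == 1 and c[1] == 1)

suite = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 0, 0), (0, 1, 0)]
print(len(valid_pairs(options, is_valid)), is_pairwise_ca(suite, options, is_valid))
```

In this toy instance the first three tests already cover most valid pairs and the last two close the remaining gaps, which loosely mirrors the two-phase structure the abstract describes (sample first, then add a few tests for full coverage).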
|
Hu, Longjie |
ESEC/FSE '22: "Understanding Performance ..."
Understanding Performance Problems in Deep Learning Systems
Junming Cao, Bihuan Chen, Chao Sun, Longjie Hu, Shuaihong Wu, and Xin Peng (Fudan University, China) Deep learning (DL) has been widely applied to many domains. Unique challenges in engineering DL systems are posed by the programming paradigm shift from traditional systems to DL systems, and performance is one of the challenges. Performance problems (PPs) in DL systems can cause severe consequences such as excessive resource consumption and financial loss. While bugs in DL systems have been extensively investigated, PPs in DL systems have hardly been explored. To bridge this gap, we present the first comprehensive study to i) characterize symptoms, root causes, and introducing and exposing stages of PPs in DL systems developed in TensorFlow and Keras, with 224 PPs collected from 210 StackOverflow posts, and to ii) assess the capability of existing performance analysis approaches in tackling PPs, with a constructed benchmark of 58 PPs in DL systems. Our findings shed light on the implications on developing high-performance DL systems, and detecting and localizing PPs in DL systems. To demonstrate the usefulness of our findings, we develop a static checker DeepPerf to detect three types of PPs. It has detected 488 new PPs in 130 GitHub projects. 105 and 27 PPs have been confirmed and fixed, respectively. @InProceedings{ESEC/FSE22p357, author = {Junming Cao and Bihuan Chen and Chao Sun and Longjie Hu and Shuaihong Wu and Xin Peng}, title = {Understanding Performance Problems in Deep Learning Systems}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {357--369}, doi = {10.1145/3540250.3549123}, year = {2022}, } Publisher's Version Info Artifacts Reusable |
|
Hu, Xing |
ESEC/FSE '22: "Automated Unearthing of Dangerous ..."
Automated Unearthing of Dangerous Issue Reports
Shengyi Pan, Jiayuan Zhou, Filipe Roseiro Cogo, Xin Xia, Lingfeng Bao, Xing Hu, Shanping Li, and Ahmed E. Hassan (Zhejiang University, China; Huawei, Canada; Huawei, China; Queen’s University, Canada) The coordinated vulnerability disclosure (CVD) process is commonly adopted for open source software (OSS) vulnerability management; it recommends privately reporting discovered vulnerabilities and keeping relevant information secret until the official disclosure. However, in practice, due to various reasons (e.g., lacking security domain expertise or the sense of security management), many vulnerabilities are first reported via public issue reports (IRs) before their official disclosure. Such IRs are dangerous IRs, since attackers can take advantage of the leaked vulnerability information to launch zero-day attacks. It is crucial to identify such dangerous IRs at an early stage, so that OSS users can start the vulnerability remediation process earlier and OSS maintainers can manage the dangerous IRs in a timely manner. In this paper, we propose and evaluate a deep learning based approach, namely MemVul, to automatically identify dangerous IRs at the time they are reported. MemVul augments the neural networks with a memory component, which stores the external vulnerability knowledge from Common Weakness Enumeration (CWE). We rely on publicly accessible CVE-referred IRs (CIRs) to operationalize the concept of dangerous IR. We mine 3,937 CIRs distributed across 1,390 OSS projects hosted on GitHub. Evaluated under a practical scenario of high data imbalance, MemVul achieves the best trade-off between precision and recall among all baselines. In particular, the F1-score of MemVul (i.e., 0.49) improves the best performing baseline by 44%. For IRs that are predicted as CIRs but not reported to CVE, we conduct a user study to investigate their usefulness to OSS stakeholders. 
We observe that 82% (41 out of 50) of these IRs are security-related and 28 of them are suggested by security experts to be publicly disclosed, indicating MemVul is capable of identifying undisclosed dangerous IRs. @InProceedings{ESEC/FSE22p834, author = {Shengyi Pan and Jiayuan Zhou and Filipe Roseiro Cogo and Xin Xia and Lingfeng Bao and Xing Hu and Shanping Li and Ahmed E. Hassan}, title = {Automated Unearthing of Dangerous Issue Reports}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {834--846}, doi = {10.1145/3540250.3549156}, year = {2022}, } Publisher's Version |
|
Hu, Xinwen |
ESEC/FSE '22: "Lighting Up Supervised Learning ..."
Lighting Up Supervised Learning in User Review-Based Code Localization: Dataset and Benchmark
Xinwen Hu, Yu Guo, Jianjie Lu, Zheling Zhu, Chuanyi Li, Jidong Ge, Liguo Huang, and Bin Luo (Nanjing University, China; Southern Methodist University, USA) As User Reviews (URs) of mobile Apps are proven to provide valuable feedback for maintaining and evolving applications, how to make full use of URs more efficiently in the release cycle of mobile Apps has become a widely concerned and researched topic in the Software Engineering (SE) community. In order to speed up the completion of coding work related to URs to shorten the release cycle as much as possible, the task of User Review-based code localization is proposed and studied in depth. However, due to the lack of large-scale ground truth dataset (i.e., truly related <UR, Code> pairs), existing methods are all unsupervised learning-based. In order to light up supervised learning approaches, which are driven by large labeled datasets, for Review2Code, and to compare their performances with unsupervised learning-based methods, we first introduce a large-scale human-labeled <UR, Code> ground truth dataset, including the annotation process and statistical analysis. Then, a benchmark consisting of two SOTA unsupervised learning-based and four supervised learning-based Review2Code methods is constructed based on this dataset. We believe that this paper can provide a basis for in-depth exploration of the supervised learning-based Review2Code solutions. @InProceedings{ESEC/FSE22p533, author = {Xinwen Hu and Yu Guo and Jianjie Lu and Zheling Zhu and Chuanyi Li and Jidong Ge and Liguo Huang and Bin Luo}, title = {Lighting Up Supervised Learning in User Review-Based Code Localization: Dataset and Benchmark}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {533--545}, doi = {10.1145/3540250.3549141}, year = {2022}, } Publisher's Version |
|
Hu, Yang |
ESEC/FSE '22: "SymMC: Approximate Model Enumeration ..."
SymMC: Approximate Model Enumeration and Counting using Symmetry Information for Alloy Specifications
Wenxi Wang, Yang Hu, Kenneth L. McMillan, and Sarfraz Khurshid (University of Texas at Austin, USA) Specifying and analyzing critical properties of software systems plays an important role in the development of reliable systems. Alloy is a mature tool-set that provides a first-order relational logic for writing specifications, and a fully automatic powerful backend for analyzing the specifications. It has been widely applied in areas including verification, security, and synthesis. Symmetry breaking is a useful approach for pruning the search space to efficiently check the satisfiability of combinatorial problems. As the backend solver of Alloy, Kodkod does the partial symmetry breaking (PaSB) for Alloy specifications. While full symmetry breaking remains challenging to scale, a recent study showed that Kodkod PaSB could significantly reduce the model counting time, albeit at the cost of producing only partial model counts. However, the desired term is either the isomorphic count under no symmetry breaking, or the non-isomorphic models/count under full symmetry breaking. This paper presents an approach called SymMC, which utilizes the symmetry information to compute all the desired terms for Alloy specifications. To make SymMC scalable, we propose approximate algorithms based on sampling to estimate the desired terms. We show that our proposed estimators have consistency and upper bound properties. To our knowledge, SymMC is the first approach that automatically approximates non-isomorphic model enumeration/counting for Alloy specifications. Thanks to the non-isomorphic model counting, SymMC also provides the first automatic quantification measurement on the solution space pruning ability of Kodkod PaSB. Furthermore, empirical evaluations show that SymMC provides a competitive isomorphic counting approach for Alloy specifications compared to the state-of-the-art model counters. @InProceedings{ESEC/FSE22p1209, author = {Wenxi Wang and Yang Hu and Kenneth L. 
McMillan and Sarfraz Khurshid}, title = {SymMC: Approximate Model Enumeration and Counting using Symmetry Information for Alloy Specifications}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1209--1220}, doi = {10.1145/3540250.3549161}, year = {2022}, } Publisher's Version |
|
Hu, Yuxi |
ESEC/FSE '22-IND: "Industry Practice of Configuration ..."
Industry Practice of Configuration Auto-tuning for Cloud Applications and Services
Runzhe Wang, Qinglong Wang, Yuxi Hu, Heyuan Shi, Yuheng Shen, Yu Zhan, Ying Fu, Zheng Liu, Xiaohai Shi, and Yu Jiang (Alibaba Group, China; Central South University, China; Tsinghua University, China; Ant Group, China; Zhejiang University, China) Auto-tuning attracts increasing attention in industry practice to optimize the performance of a system with many configurable parameters. It is particularly useful for cloud applications and services since they have complex system hierarchies and intricate knob correlations. However, existing tools and algorithms rarely consider practical problems such as workload pressure control, the support for distributed deployment, and expensive time costs, etc., which are utterly important for enterprise cloud applications and services. In this work, we significantly extend an open source tuning tool – KeenTune to optimize several typical enterprise cloud applications and services. Our practice is in collaboration with enterprise users and tuning tool developers to address the aforementioned problems. Specifically, we highlight five key challenges from our experiences and provide a set of solutions accordingly. Through applying the improved tuning tool to different application scenarios, we achieve 2%-14% improvements for the performance of MySQL, OceanBase, nginx, ingress-nginx, and 5%-70% improvements for the performance of ACK cloud container service. @InProceedings{ESEC/FSE22p1555, author = {Runzhe Wang and Qinglong Wang and Yuxi Hu and Heyuan Shi and Yuheng Shen and Yu Zhan and Ying Fu and Zheng Liu and Xiaohai Shi and Yu Jiang}, title = {Industry Practice of Configuration Auto-tuning for Cloud Applications and Services}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1555--1565}, doi = {10.1145/3540250.3558962}, year = {2022}, } Publisher's Version |
|
Hua, Zihan |
ESEC/FSE '22: "AUGER: Automatically Generating ..."
AUGER: Automatically Generating Review Comments with Pre-training Models
Lingwei Li, Li Yang, Huaxi Jiang, Jun Yan, Tiejian Luo, Zihan Hua, Geng Liang, and Chun Zuo (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Wuhan University, China; Sinosoft, China) Code review is one of the best practices as a powerful safeguard for software quality. In practice, senior or highly skilled reviewers inspect source code and provide constructive comments, considering what authors may ignore, for example, some special cases. The collaborative validation between contributors results in code being highly qualified and less chance of bugs. However, since personal knowledge is limited and varies, the efficiency and effectiveness of code review practice are worthy of further improvement. In fact, it still takes a colossal and time-consuming effort to deliver useful review comments. This paper explores a synergy of multiple practical review comments to enhance code review and proposes AUGER (AUtomatically GEnerating Review comments): a review comments generator with pre-training models. We first collect empirical review data from 11 notable Java projects and construct a dataset of 10,882 code changes. By leveraging Text-to-Text Transfer Transformer (T5) models, the framework synthesizes valuable knowledge in the training stage and effectively outperforms baselines by 37.38% in ROUGE-L. 29% of our automatic review comments are considered useful according to prior studies. The inference generates just in 20 seconds and is also open to training further. Moreover, the performance also gets improved when thoroughly analyzed in case study. 
@InProceedings{ESEC/FSE22p1009, author = {Lingwei Li and Li Yang and Huaxi Jiang and Jun Yan and Tiejian Luo and Zihan Hua and Geng Liang and Chun Zuo}, title = {AUGER: Automatically Generating Review Comments with Pre-training Models}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1009--1021}, doi = {10.1145/3540250.3549099}, year = {2022}, } Publisher's Version |
|
Huang, Kaifeng |
ESEC/FSE '22: "Tracking Patches for Open ..."
Tracking Patches for Open Source Software Vulnerabilities
Congying Xu, Bihuan Chen, Chenhao Lu, Kaifeng Huang, Xin Peng, and Yang Liu (Fudan University, China; Nanyang Technological University, Singapore) Open source software (OSS) vulnerabilities threaten the security of software systems that use OSS. Vulnerability databases provide valuable information (e.g., vulnerable version and patch) to mitigate OSS vulnerabilities. There arises a growing concern about the information quality of vulnerability databases. However, it is unclear what the quality of patches in existing vulnerability databases is; and existing manual or heuristic-based approaches for patch tracking are either too expensive or too specific to apply to all OSS vulnerabilities. To address these problems, we first conduct an empirical study to understand the quality and characteristics of patches for OSS vulnerabilities in two industrial vulnerability databases. Inspired by our study, we then propose the first automated approach, Tracer, to track patches for OSS vulnerabilities from multiple knowledge sources. Our evaluation has demonstrated that i) Tracer can track patches for up to 273.8% more vulnerabilities than heuristic-based approaches while achieving a higher F1-score by up to 116.8%; and ii) Tracer can complement industrial vulnerability databases. Our evaluation has also indicated the generality and practical usefulness of Tracer. @InProceedings{ESEC/FSE22p860, author = {Congying Xu and Bihuan Chen and Chenhao Lu and Kaifeng Huang and Xin Peng and Yang Liu}, title = {Tracking Patches for Open Source Software Vulnerabilities}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {860--871}, doi = {10.1145/3540250.3549125}, year = {2022}, } Publisher's Version |
|
Huang, Liguo |
ESEC/FSE '22: "Lighting Up Supervised Learning ..."
Lighting Up Supervised Learning in User Review-Based Code Localization: Dataset and Benchmark
Xinwen Hu, Yu Guo, Jianjie Lu, Zheling Zhu, Chuanyi Li, Jidong Ge, Liguo Huang, and Bin Luo (Nanjing University, China; Southern Methodist University, USA) As User Reviews (URs) of mobile Apps are proven to provide valuable feedback for maintaining and evolving applications, how to make full use of URs more efficiently in the release cycle of mobile Apps has become a widely concerned and researched topic in the Software Engineering (SE) community. In order to speed up the completion of coding work related to URs to shorten the release cycle as much as possible, the task of User Review-based code localization is proposed and studied in depth. However, due to the lack of large-scale ground truth dataset (i.e., truly related <UR, Code> pairs), existing methods are all unsupervised learning-based. In order to light up supervised learning approaches, which are driven by large labeled datasets, for Review2Code, and to compare their performances with unsupervised learning-based methods, we first introduce a large-scale human-labeled <UR, Code> ground truth dataset, including the annotation process and statistical analysis. Then, a benchmark consisting of two SOTA unsupervised learning-based and four supervised learning-based Review2Code methods is constructed based on this dataset. We believe that this paper can provide a basis for in-depth exploration of the supervised learning-based Review2Code solutions. @InProceedings{ESEC/FSE22p533, author = {Xinwen Hu and Yu Guo and Jianjie Lu and Zheling Zhu and Chuanyi Li and Jidong Ge and Liguo Huang and Bin Luo}, title = {Lighting Up Supervised Learning in User Review-Based Code Localization: Dataset and Benchmark}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {533--545}, doi = {10.1145/3540250.3549141}, year = {2022}, } Publisher's Version |
|
Huang, Qianying |
ESEC/FSE '22-IND: "Workgraph: Personal Focus ..."
Workgraph: Personal Focus vs. Interruption for Engineers at Meta
Yifen Chen, Peter C. Rigby, Yulin Chen, Kun Jiang, Nader Dehghani, Qianying Huang, Peter Cottle, Clayton Andrews, Noah Lee, and Nachiappan Nagappan (Meta, USA; Concordia University, Canada) All engineers dislike interruptions because they take away from the deep focus time needed to write complex code. Our goal is to reduce unnecessary interruptions at Meta. We first describe our Workgraph platform that logs how engineers use our internal work tools at Meta. Using these anonymized logs, we create sessions. Sessions are defined in opposition to interruption and are the amount of time until the engineer is interrupted by, for example, a work chat message. We describe descriptive statistics related to how long engineers are able to focus. We find that at Meta, engineers have a total of 14.25 hours of personal-focus time per week. These numbers are comparable with those reported by other software firms. We then create a Random Forest model to understand which factors influence the median daily personal-focus time. We find that the more time an engineer spends in the IDE the longer their focus. We also find that the more central an engineer is in the social work network, the shorter their personal-focus time. Other factors such as role and domain/pillar have little impact on personal-focus at Meta. To help engineers achieve longer blocks of personal-focus and help them stay in flow, Meta developed the AutoFocus tool that blocks work chat notifications when an engineer is working on code for 12 minutes or longer. AutoFocus allows the sender to still force a work chat message using “@notify” ensuring that urgent messages still get through, but allowing the sender to reflect on the importance of the message. In a large experiment, we find that AutoFocus increases the amount of personal-focus time by 20.27%, and it has now been rolled out widely at Meta. @InProceedings{ESEC/FSE22p1390, author = {Yifen Chen and Peter C. 
Rigby and Yulin Chen and Kun Jiang and Nader Dehghani and Qianying Huang and Peter Cottle and Clayton Andrews and Noah Lee and Nachiappan Nagappan}, title = {Workgraph: Personal Focus vs. Interruption for Engineers at Meta}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1390--1397}, doi = {10.1145/3540250.3558961}, year = {2022}, } Publisher's Version
ESEC/FSE '22: "Using Nudges to Accelerate ..."
Using Nudges to Accelerate Code Reviews at Scale
Qianhua Shan, David Sukhdeo, Qianying Huang, Seth Rogers, Lawrence Chen, Elise Paradis, Peter C. Rigby, and Nachiappan Nagappan (Meta, USA; Concordia University, Canada) We describe a large-scale study to reduce the amount of time code review takes. Each quarter at Meta we survey developers. Combining sentiment data from a developer experience survey and telemetry data from our diff review tool, we address, “When does a diff review feel too slow?” From the sentiment data alone, we learn that 84.7% of developers are satisfied with the time their diffs spend in review. By enriching the survey results with telemetry for each respondent, we determined that sentiment is closely associated with the 75th percentile time in review for that respondent’s diffs, i.e., those that take more than 24 hours. To encourage developers to act on stale diffs that have had no action for 24 or more hours, we designed a NudgeBot to notify, i.e., nudge, reviewers. To determine who to nudge when a diff is stale, we created a model to rank the reviewers based on the probability that they will make a comment or perform some other action on a diff. This model outperformed models that looked at files the reviewer had modified in the past. Combining this information with prior author-review relationships, we achieved an MRR and AUC of .81 and .88, respectively. To evaluate NudgeBot in production, we conducted an A/B cluster-randomized experiment on over 30k engineers. 
We observed a substantial, statistically significant decrease in both time in review (-6.8%, p=0.049) and time to first reviewer action (-9.9%, p=0.010). We also used guard metrics to ensure that most reviews were still done in fewer than 24 hours and that reviewers still spend the same amount of time looking at diffs, and saw no statistically significant change in these metrics. NudgeBot is now rolled out company-wide and is used daily by thousands of engineers at Meta. @InProceedings{ESEC/FSE22p472, author = {Qianhua Shan and David Sukhdeo and Qianying Huang and Seth Rogers and Lawrence Chen and Elise Paradis and Peter C. Rigby and Nachiappan Nagappan}, title = {Using Nudges to Accelerate Code Reviews at Scale}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {472--482}, doi = {10.1145/3540250.3549104}, year = {2022}, } Publisher's Version |
|
Huang, Yongheng |
ESEC/FSE '22: "Generic Sensitivity: Customizing ..."
Generic Sensitivity: Customizing Context-Sensitive Pointer Analysis for Generics
Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Yongheng Huang, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TianqiSoft, China) Generic programming has been extensively used in object-oriented programs such as Java. However, existing context-sensitive pointer analyses perform poorly in analyzing generics. This paper introduces generic sensitivity, a new context customization scheme targeting generics. We design our context customization scheme in such a way that generic instantiation sites, i.e., locations instantiating generic classes/methods with concrete types, are always preserved as key context elements. This is realized by augmenting contexts with a type variable lookup map, which is efficiently updated during the analysis in a context-sensitive manner. We have implemented different variants of generic-sensitive analysis in Wala and experimental results show that the generic customization scheme can significantly improve performance and precision of context-sensitive pointer analyses. For instance, generic context customization significantly improves precision of 1-object-sensitive analysis, with an average speedup of 1.8×. In addition, generic context customization enables a 1-object-sensitive analysis to achieve overall better precision than a 2-object-sensitive analysis, with an average speedup of 12.6× (62× for chart). @InProceedings{ESEC/FSE22p1110, author = {Haofeng Li and Jie Lu and Haining Meng and Liqing Cao and Yongheng Huang and Lian Li and Lin Gao}, title = {Generic Sensitivity: Customizing Context-Sensitive Pointer Analysis for Generics}, booktitle = {Proc.\ ESEC/FSE}, publisher = {ACM}, pages = {1110--1121}, doi = {10.1145/3540250.3549122}, year = {2022}, } Publisher's Version |
|
Huang, Yuchao | ESEC/FSE '22: "API Recommendation for Machine ..." |