ESEC/FSE 2022

30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022), November 14–18, 2022, Singapore, Singapore

ESEC/FSE 2022 – Proceedings


Frontmatter

Title Page


Message from the Chairs
On behalf of all members of the organizing committee, we are delighted to welcome everyone to the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) 2022. The event continues the long, distinguished ESEC/FSE tradition of presenting the most innovative research, and facilitating interactions between scientists and engineers who are passionate about advancing the theory and practice of software engineering.

ESEC/FSE 2022 Organization
ESEC/FSE 2022 Committee Listings

ESEC/FSE 2022 Sponsors and Supporters


Invited Contributions

Keynotes

AI-Assisted Programming: Applications, User Experiences, and Neuro-Symbolic Techniques (Keynote)
Sumit Gulwani
(Microsoft, USA)
AI can enhance programming experiences for a diverse set of programmers: from professional developers and data scientists (proficient programmers) who need help in software engineering and data wrangling, all the way to spreadsheet users (low-code programmers) who need help in authoring formulas, and students (novice programmers) who seek hints when stuck with their programming homework. To communicate their need to AI, users can express their intent explicitly—as input-output examples or natural-language specification—or implicitly—where they encounter a bug (and expect AI to suggest a fix), or simply allow AI to observe their last few lines of code or edits (to have it suggest the next steps).
The task of synthesizing an intended program snippet from the user’s intent is both a search and a ranking problem. Search is required to discover candidate programs that correspond to the (often ambiguous) intent, and ranking is required to pick the best program from multiple plausible alternatives. This creates a fertile playground for combining symbolic-reasoning techniques, which model the semantics of programming operators, and machine-learning techniques, which can model human preferences in programming. Recent advances in large language models like Codex offer further promise to advance such neuro-symbolic techniques.
Finally, a few critical requirements in AI-assisted programming are usability, precision, and trust; and they create opportunities for innovative user experiences and interactivity paradigms. In this talk, I will explain these concepts using some existing successes, including the Flash Fill feature in Excel, Data Connectors in PowerQuery, and IntelliCode/CoPilot in Visual Studio. I will also describe several new opportunities in AI-assisted programming, which can drive the next set of foundational neuro-symbolic advances.
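
To make the search-plus-ranking framing concrete, here is a toy enumerative synthesizer — a minimal sketch, not any of the production systems named above: it enumerates small string-transformation pipelines, keeps those consistent with the user's input-output examples (search), and orders the survivors by a simple cost prior (ranking). All operator names and costs are invented for illustration.

```python
from itertools import product

# Tiny DSL of string operations: each candidate program is a pipeline of steps.
OPS = {
    "lower": str.lower,
    "upper": str.upper,
    "strip": str.strip,
    "title": str.title,
}
# Hypothetical naturalness prior: cheaper pipelines rank higher.
OP_COST = {"lower": 1, "upper": 1, "strip": 1, "title": 2}

def run(program, text):
    for op in program:
        text = OPS[op](text)
    return text

def synthesize(examples, max_len=2):
    """Enumerate pipelines up to max_len ops (search); return the ones
    consistent with all examples, best-ranked first (ranking)."""
    candidates = []
    for n in range(1, max_len + 1):
        for program in product(OPS, repeat=n):
            if all(run(program, i) == o for i, o in examples):
                candidates.append(program)
    return sorted(candidates, key=lambda p: sum(OP_COST[op] for op in p))

print(synthesize([("  Hello World ", "hello world")]))
# e.g. [('lower', 'strip'), ('strip', 'lower')]
```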

On Safety, Assurance, and Reliability: A Software Engineering Perspective (Keynote)
Marsha Chechik
(University of Toronto, Canada)
From financial services platforms to social networks to vehicle control, software has come to mediate many activities of daily life. Governing bodies and standards organizations have responded to this trend by creating regulations and standards to address issues such as safety, security and privacy. In this environment, the compliance of software development to standards and regulations has emerged as a key requirement. Compliance claims and arguments are often captured in assurance cases, with linked evidence of compliance. Evidence can come from test cases, verification proofs, human judgement, or a combination of these. That is, we try to build (safety-critical) systems carefully according to well justified methods and articulate these justifications in an assurance case that is ultimately judged by a human.
Building safety arguments for traditional software systems is difficult — they are lengthy and expensive to maintain, especially as software undergoes change. Safety is also notoriously non-compositional — each subsystem might be safe, but together they may create unsafe behaviors. It is also easy to miss cases; in the simplest instance, this means developing an argument for when a condition is true but failing to argue the case when it is false. Furthermore, many ML-based systems are becoming safety-critical. For example, recent Tesla self-driving cars misclassified emergency vehicles and caused multiple crashes. ML-based systems typically do not have precisely specified and machine-verifiable requirements. While some safety requirements can be stated clearly (e.g., “the system should detect all pedestrians at a crossing”), such requirements concern the entire system, making them too high-level for safety analysis of individual components. Thus, systems with ML components (MLCs) add a significant layer of complexity for safety assurance.
I argue that safety assurance should be an integral part of building safe and reliable software systems, but this process needs support from advanced software engineering and software analysis. In this talk, I outline a few approaches for development of principled, tool-supported methodologies for creating and managing assurance arguments. I then describe some of the recent work on specifying and verifying reliability requirements for machine-learned components in safety-critical domains.


Impact Award Paper Keynote

Task Modularity and the Emergence of Software Value Streams (Impact Award Paper Keynote)
Gail C. Murphy and Mik Kersten
(University of British Columbia, Canada; Tasktop Technologies, Canada)
The creation of techniques, languages and tools to support building software from modular parts has enabled the development and evolution of large complex software systems. For many years, the focus of modularity was on the structure of the software system. The thinking was that the right modularity would enable software teams involved in different pieces of the system to work as independently as possible. As a community, we learned over the years that a system has no one optimal modularity, and in fact, the work that is undertaken to add new features or fix defects often crosscuts the modularity of the software. The research we conducted on task contexts and Mylyn over fifteen years ago recognized that capturing the activity performed on development tasks provided a means to make explicit the emergent modularity of a system. With Mylyn, we explored how the activity of one developer could enable the surfacing of emergent modularity to help a developer perform tasks on a software system. Through stewardship of the Mylyn open source project and the creation of Tasktop Technologies, we brought these ideas into use in industry. Over the past decade, through interactions with practicing developers and their organizations, we have learned more about how modularity emerges at the individual developer, team, and organizational levels. Tasktop’s products moved accordingly from supporting the activity of individual developers to supporting the streams of value that teams within an organization produce. In this talk, we will discuss how following the flow of tasks within software development supports value stream management and outline open research questions about the socio-technical aspects of developing complex software systems.


Invited Tutorials

Academic Prototyping (Invited Tutorial)
Andreas Zeller
(CISPA Helmholtz Center for Information Security, Germany)
Much of our research requires building tools to evaluate and demonstrate new approaches. Yet, tool building can take large amounts of time and resources. And it brings risks: the original idea might not work, rendering all efforts futile; and after the student in charge has left, the tool becomes a maintenance problem.
In this tutorial, I present tools and techniques for the quick creation of prototypes for academic research. Building prototypes in and for Python is at least ten times faster than traditional engineering in and for C or Java; it thus frees lots of time for creating and refining approaches. Using Jupyter Notebooks to code and run experiments saves rationales, settings, examples, and results in self-contained documents that can be read and reused easily by other students and researchers. Once the basics of the new approach are settled (and only then!) can one go and port the new approach to “traditional” languages – or move on to the next challenge.
You don’t believe me? Watch me illustrate this approach by building a symbolic fuzzer (think KLEE) for Python from scratch, live coding the capturing and solving of path conditions in less than 20 minutes. From there, we will determine what it is that makes languages like Python and tools like Jupyter Notebooks so suitable for academic prototyping, and the consequences this has (or should have) for organizing our research.
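
For a flavor of the path-condition manipulation the tutorial live-codes, the sketch below uses the Z3 Python bindings (assuming the z3-solver package is installed) to negate a branch condition and solve for an input that covers the other branch; it is a minimal illustration, not the tutorial's actual fuzzer.

```python
from z3 import Int, Solver, Not, sat

# Program under test: which branch does f take?
def f(x):
    if x * 2 > 10:
        return "big"
    return "small"

# Path condition captured from a concrete run f(3): the branch was not taken.
x = Int("x")
taken = Not(x * 2 > 10)

# Negate the branch condition to steer execution down the unexplored path.
s = Solver()
s.add(x * 2 > 10)
if s.check() == sat:
    new_input = s.model()[x].as_long()
    print(f(new_input))  # "big" -- the other branch is now covered
```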

Multi-perspective Representation Learning for Source Code Analytics (Invited Tutorial)
Zhi Jin
(Peking University, China)
Programming languages are artificial and highly restricted languages. But source code is there to tell computers, as well as programmers, what to do, as an act of communication. Despite code's weird syntax and the many delimiters it is riddled with, the good news is that a very large corpus of open-source code is available. That makes it reasonable to apply machine learning techniques to source code to enable source code analytics.
Although there are plenty of deep learning frameworks in the field of NLP, source code analytics has different characteristics. Beyond the conventional reading of code, understanding its meaning involves many perspectives. The source code representation could be the token sequence, the API call sequence, the data dependency graph, the control flow graph, the program hierarchy, and so on. This tutorial recounts the long, ongoing, and fruitful journey of exploiting the potential power of deep learning techniques in source code analytics. It highlights how code representation models can be utilized to support software engineers in performing different tasks that require proficient programming knowledge. The exploratory work shows that code does carry learnable knowledge — more precisely, learnable tacit knowledge. Although such knowledge is not easily transferable between humans, it can be transferred between automated programming tasks. The tutorial concludes with a vision for future research in source code analytics.


Main Research

Machine Learning I

Adaptive Fairness Improvement Based on Causality Analysis
Mengdi Zhang and Jun Sun
(Singapore Management University, Singapore)
Given a discriminating neural network, the problem of fairness improvement is to systematically reduce discrimination without significantly sacrificing its performance (i.e., accuracy). Multiple categories of fairness-improving methods have been proposed for neural networks, including pre-processing, in-processing, and post-processing. Our empirical study, however, shows that these methods are not always effective (e.g., they may improve fairness only at the price of a huge accuracy drop) and are sometimes even harmful (e.g., they may worsen both fairness and accuracy). In this work, we propose an approach which adaptively chooses the fairness-improving method based on causality analysis. That is, we choose the method based on how the neurons and attributes responsible for unfairness are distributed among the input attributes and the hidden neurons. Our experimental evaluation shows that our approach is effective (i.e., it always identifies the best fairness-improving method) and efficient (i.e., with an average time overhead of 5 minutes).

NatGen: Generative Pre-training by “Naturalizing” Source Code
Saikat Chakraborty, Toufique Ahmed, Yangruibo Ding, Premkumar T. Devanbu, and Baishakhi Ray
(Columbia University, USA; University of California at Davis, USA)
Pre-trained generative language models (e.g., PLBART, CodeT5, SPT-Code) for source code have yielded strong results on several tasks in the past few years, including code generation and translation. These models have adopted varying pre-training objectives to learn statistics of code construction from very large-scale corpora in a self-supervised fashion; the success of pre-trained models largely hinges on these pre-training objectives. This paper proposes a new pre-training objective, “Naturalizing” of source code, exploiting code’s bimodal, dual-channel (formal & natural channels) nature. Unlike natural language, code’s bimodal, dual-channel nature allows us to generate semantically equivalent code at scale. We introduce six classes of semantic-preserving transformations to introduce unnatural forms of code, and then train our model to recover the more natural, original programs written by developers. Learning to generate equivalent but more natural code, at scale, over large corpora of open-source code, without explicit manual supervision, helps the model learn to both ingest and generate code. We fine-tune our model on three generative software engineering tasks: code generation, code translation, and code refinement with limited human-curated labeled data, achieving state-of-the-art performance rivaling CodeT5. We show that our pre-trained model is especially competitive at zero-shot and few-shot learning, and better at learning code properties (e.g., syntax, data flow).
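
To picture what one semantic-preserving "un-naturalizing" transformation might look like, the following sketch rewrites a natural while-loop into an equivalent but less natural form, yielding an (unnatural, natural) training pair. It is an analogy in Python with invented names, not the paper's actual transformation suite.

```python
import ast

def unnaturalize_while(source):
    """Rewrite `while cond: body` into the semantically equivalent but less
    natural `while True: if not cond: break; body` form, producing an
    (unnatural, natural) pair for training a naturalization model."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.While) and not (
            isinstance(node.test, ast.Constant) and node.test.value is True
        ):
            guard = ast.If(
                test=ast.UnaryOp(op=ast.Not(), operand=node.test),
                body=[ast.Break()],
                orelse=[],
            )
            node.test = ast.Constant(value=True)
            node.body.insert(0, guard)
    return ast.unparse(ast.fix_missing_locations(tree))

natural = "while n > 1:\n    n -= 1"
print(unnaturalize_while(natural))
# while True:
#     if not n > 1:
#         break
#     n -= 1
```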


Software Testing I

Testing of Autonomous Driving Systems: Where Are We and Where Should We Go?
Guannan Lou, Yao Deng, Xi Zheng, Mengshi Zhang, and Tianyi Zhang
(Macquarie University, Australia; Meta, USA; Purdue University, USA)
Autonomous driving has shown great potential to reform modern transportation. Yet its reliability and safety have drawn a lot of attention and concerns. Compared with traditional software systems, autonomous driving systems (ADSs) often use deep neural networks in tandem with logic-based modules. This new paradigm poses unique challenges for software testing. Despite the recent development of new ADS testing techniques, it is not clear to what extent those techniques have addressed the needs of ADS practitioners. To fill this gap, we present the first comprehensive study to identify the current practices and needs of ADS testing. We conducted semi-structured interviews with developers from 10 autonomous driving companies and surveyed 100 developers who have worked on autonomous driving systems. A systematic analysis of the interview and survey data revealed 7 common practices and 4 emerging needs of autonomous driving testing. Through a comprehensive literature review, we developed a taxonomy of existing ADS testing techniques and analyzed the gap between ADS research and practitioners’ needs. Finally, we proposed several future directions for SE researchers, such as developing test reduction techniques to accelerate simulation-based ADS testing.

Fuzzing Deep-Learning Libraries via Automated Relational API Inference
Yinlin Deng, Chenyuan Yang, Anjiang Wei, and Lingming Zhang
(University of Illinois at Urbana-Champaign, USA; Stanford University, USA)
Deep Learning (DL) has gained wide attention in recent years. Meanwhile, bugs in DL systems can lead to serious consequences, and may even threaten human lives. As a result, a growing body of research has been dedicated to DL model testing. However, there is still limited work on testing DL libraries, e.g., PyTorch and TensorFlow, which serve as the foundations for building, training, and running DL models. Prior work on fuzzing DL libraries can only generate tests for APIs which have been invoked by documentation examples, developer tests, or DL models, leaving a large number of APIs untested. In this paper, we propose DeepREL, the first approach to automatically inferring relational APIs for more effective DL library fuzzing. Our basic hypothesis is that for a DL library under test, there may exist a number of APIs sharing similar input parameters and outputs; in this way, we can easily “borrow” test inputs from invoked APIs to test other relational APIs. Furthermore, we formalize the notion of value equivalence and status equivalence for relational APIs to serve as the oracle for effective bug finding. We have implemented DeepREL as a fully automated end-to-end relational API inference and fuzzing technique for DL libraries, which 1) automatically infers potential API relations based on API syntactic/semantic information, 2) synthesizes concrete test programs for invoking relational APIs, 3) validates the inferred relational APIs via representative test inputs, and finally 4) performs fuzzing on the verified relational APIs to find potential inconsistencies. Our evaluation on two of the most popular DL libraries, PyTorch and TensorFlow, demonstrates that DeepREL can cover 157% more APIs than state-of-the-art FreeFuzz. To date, DeepREL has detected 162 bugs in total, with 106 already confirmed by the developers as previously unknown bugs. Surprisingly, DeepREL has detected 13.5% of the high-priority bugs for the entire PyTorch issue-tracking system in a three-month period. Also, besides the 162 code bugs, we have also detected 14 documentation bugs (all confirmed).
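
The value- and status-equivalence oracles can be sketched in a few lines. Assuming torch.abs and torch.absolute have been inferred as a relational pair (they are documented aliases), the check below reuses one API's inputs to fuzz the other; this illustrates the oracle idea only, not DeepREL itself.

```python
import torch

def check_pair(api_a, api_b, inputs):
    """Value equivalence: same outputs on the same inputs.
    Status equivalence: both succeed, or both raise."""
    for x in inputs:
        status_a = status_b = "ok"
        out_a = out_b = None
        try:
            out_a = api_a(x)
        except Exception:
            status_a = "raise"
        try:
            out_b = api_b(x)
        except Exception:
            status_b = "raise"
        if status_a != status_b:
            print("status inconsistency on", x)
        elif status_a == "ok" and not torch.equal(out_a, out_b):
            print("value inconsistency on", x)

inputs = [torch.tensor([-1.0, 2.0]), torch.randn(3, 3)]
check_pair(torch.abs, torch.absolute, inputs)  # aliases: no report expected
```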

SEDiff: Scope-Aware Differential Fuzzing to Test Internal Function Models in Symbolic Execution
Penghui Li, Wei Meng, and Kangjie Lu
(Chinese University of Hong Kong, China; University of Minnesota, USA)
Symbolic execution has become a foundational program analysis technique. Performing symbolic execution unavoidably encounters internal functions (e.g., library functions) that provide basic operations such as string processing. Many symbolic execution engines construct internal function models that abstract function behaviors for scalability and compatibility concerns. Due to the high complexity of constructing the models, developers intentionally summarize only partial behaviors of a function, namely modeled functionalities, in the models. The correctness of the internal function models is critical because it would impact all applications of symbolic execution, e.g., bug detection and model checking.
A naive solution to testing the correctness of internal function models is to cross-check whether the behaviors of the models comply with their corresponding original function implementations. However, such a solution would mostly detect overwhelming inconsistencies concerning the unmodeled functionalities, which are out of the scope of models and thus considered false reports. We argue that a reasonable testing approach should target only the functionalities that developers intend to model. While being necessary, automatically identifying the modeled functionalities, i.e., the scope, is a significant challenge.
In this paper, we propose a scope-aware differential testing framework, SEDiff, to tackle this problem. We design a novel algorithm to automatically map the modeled functionalities to the code in the original implementations. SEDiff then applies scope-aware grey-box differential fuzzing to the relevant code in the original implementations. It is also equipped with a new scope-aware input generator and a tailored bug checker that efficiently and correctly detect erroneous inconsistencies. We extensively evaluated SEDiff on several popular real-world symbolic execution engines targeting binaries, the web, and kernels. Our manual investigation shows that SEDiff precisely identifies the modeled functionalities and detects 46 new bugs in the internal function models used in these symbolic execution engines.
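
A stripped-down version of the idea, outside any real engine: differentially test a hand-written "model" of C's atoi against a reference implementation, but only on inputs within the model's intended scope, so that unmodeled behavior (here, leading whitespace) does not flood the report with false inconsistencies. The model and the scope predicate below are hypothetical.

```python
import random

def atoi_model(s):
    """Hypothetical internal-function model of C's atoi: it models only an
    optional sign followed by digits (its modeled functionality)."""
    sign, i = 1, 0
    if s[:1] in ("+", "-"):
        sign, i = (-1 if s[0] == "-" else 1), 1
    val = 0
    for ch in s[i:]:
        if not ch.isdigit():
            break
        val = val * 10 + int(ch)
    return sign * val

def in_scope(s):
    # The model deliberately ignores leading whitespace, so such inputs are
    # out of scope: comparing them would only produce false reports.
    return not s[:1].isspace()

def reference_atoi(s):
    """Stand-in for the original implementation (handles leading whitespace)."""
    s = s.lstrip()
    num = ""
    if s[:1] in ("+", "-"):
        num, s = s[0], s[1:]
    while s and s[0].isdigit():
        num, s = num + s[0], s[1:]
    return int(num) if num not in ("", "+", "-") else 0

alphabet = "0123456789+- \t."
for _ in range(10_000):
    s = "".join(random.choices(alphabet, k=random.randint(0, 6)))
    if in_scope(s):  # scope-aware: skip inputs the model never intended to handle
        assert atoi_model(s) == reference_atoi(s), s
```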

Perfect Is the Enemy of Test Oracle
Ali Reza Ibrahimzada, Yigit Varli, Dilara Tekinoglu, and Reyhaneh Jabbarvand
(University of Illinois at Urbana-Champaign, USA; Middle East Technical University, USA; University of Massachusetts at Amherst, USA)
Automation of test oracles is one of the most challenging facets of software testing, but remains comparatively less addressed than automated test input generation. Test oracles rely on a ground truth that can distinguish between correct and buggy behavior to determine whether a test fails (detects a bug) or passes. What makes the oracle problem challenging and undecidable is the assumption that the ground truth should know the exact expected, correct, or buggy behavior. However, we argue that one can still build an accurate oracle without knowing the exact correct or buggy behavior, but only how these two might differ. This paper presents SEER, a learning-based approach that, in the absence of test assertions or other types of oracle, can determine whether a unit test passes or fails on a given method under test (MUT). To build the ground truth, SEER jointly embeds unit tests and the implementations of MUTs into a unified vector space, in such a way that the neural representations of tests are similar to those of the MUTs they pass on, but dissimilar to those of the MUTs they fail on. A classifier built on top of this vector representation serves as the oracle, generating a “fail” label when test inputs detect a bug in the MUT, and a “pass” label otherwise. Our extensive experiments applying SEER to more than 5K unit tests from a diverse set of open-source Java projects show that the produced oracle is (1) effective in predicting the fail or pass labels, achieving an overall accuracy, precision, recall, and F1 measure of 93%, 86%, 94%, and 90%, (2) generalizable, predicting the labels for the unit tests of projects that were not in the training or validation set with negligible performance drop, and (3) efficient, detecting the existence of bugs in only 6.5 milliseconds on average. Moreover, by interpreting the neural model and looking at it beyond a closed-box solution, we confirm that the oracle is valid, i.e., it predicts the labels by learning relevant features.
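
The unified-embedding idea can be pictured with a minimal sketch; the hash-based encoder below is a hypothetical stand-in for the paper's learned neural embedding, and the threshold is invented.

```python
import numpy as np

def embed(tokens, dim=64):
    """Stand-in encoder: hash tokens into a normalized bag-of-words vector.
    In the real approach this is a trained joint embedding of test and MUT."""
    v = np.zeros(dim)
    for t in tokens:
        v[hash(t) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def oracle(test_tokens, mut_tokens, threshold=0.5):
    """Cosine similarity in the shared space decides the label. Training
    pulls passing (test, MUT) pairs together and pushes failing pairs apart,
    which is what makes the threshold a meaningful decision boundary."""
    sim = float(embed(test_tokens) @ embed(mut_tokens))
    return "pass" if sim >= threshold else "fail"

# Purely illustrative token lists; output depends on the toy hash encoder.
print(oracle(["assertEquals", "sort", "list"], ["void", "sort", "list", "int"]))
```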

Artifacts Functional
Scenario-Based Test Reduction and Prioritization for Multi-Module Autonomous Driving Systems
Yao Deng, Xi Zheng, Mengshi Zhang, Guannan Lou, and Tianyi Zhang
(Macquarie University, Australia; Meta, USA; Purdue University, USA)
When developing autonomous driving systems (ADS), developers often need to replay previously collected driving recordings to check the correctness of newly introduced changes to the system. However, simply replaying the entire recording is not necessary given the high redundancy of driving scenes in a recording (e.g., keeping the same lane for 10 minutes on a highway). In this paper, we propose a novel test reduction and prioritization approach for multi-module ADS. First, our approach automatically encodes frames in a driving recording to feature vectors based on a driving scene schema. Then, the given recording is sliced into segments based on the similarity of consecutive vectors. Lengthy segments are truncated to reduce the length of a recording and redundant segments with the same vector are removed. The remaining segments are prioritized based on both the coverage and the rarity of driving scenes. We implemented this approach on an industry-level, multi-module ADS called Apollo and evaluated it on three road maps in various regression settings. The results show that our approach significantly reduced the original recordings by over 34% while keeping comparable test effectiveness, identifying almost all injected faults. Furthermore, our test prioritization method achieves about 22% to 39% and 41% to 53% improvements over three baselines in terms of both the average percentage of faults detected (APFD) and TOP-K.
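
Schematically, the reduction and prioritization pipeline might look as follows; the scene features, thresholds, and rarity score are simplified assumptions for illustration, not the paper's exact definitions.

```python
from collections import Counter

def slice_segments(frames):
    """Slice a recording into segments of consecutive, identical scene vectors."""
    segments = []
    for f in frames:
        if segments and segments[-1][-1] == f:
            segments[-1].append(f)
        else:
            segments.append([f])
    return segments

def reduce_and_prioritize(frames, max_len=3):
    segments = slice_segments(frames)
    # Truncate lengthy segments; drop segments whose scene was already covered.
    seen, reduced = set(), []
    for seg in segments:
        if seg[0] not in seen:
            seen.add(seg[0])
            reduced.append(seg[:max_len])
    # Prioritize rare scenes first (low global frequency = high rarity).
    freq = Counter(f for seg in reduced for f in seg)
    return sorted(reduced, key=lambda seg: freq[seg[0]])

# Hypothetical per-frame scene vectors: (maneuver, traffic light, #nearby cars)
recording = [("keep", "green", 0)] * 10 + [("turn", "red", 2)] + [("keep", "green", 0)] * 5
print(reduce_and_prioritize(recording))
# [[('turn', 'red', 2)], [('keep', 'green', 0)] * 3]: rare scene first, dedup'd
```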

MOSAT: Finding Safety Violations of Autonomous Driving Systems using Multi-objective Genetic Algorithm
Haoxiang Tian, Yan Jiang, Guoquan Wu, Jiren Yan, Jun Wei, Wei Chen, Shuo Li, and Dan Ye
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; University of Chinese Academy of Sciences Nanjing College, China; China Southern Power Grid, China; University of Chinese Academy of Sciences Chongqing School, China; Nanjing Institute of Software Technology, China)
Autonomous Driving Systems (ADSs) are safety-critical systems, and safety violations of Autonomous Vehicles (AVs) in real traffic cause huge losses. Therefore, ADSs must be fully tested before being deployed on real-world roads. Simulation testing is essential for finding safety violations of ADSs. This paper proposes MOSAT, a multi-objective search-based testing framework, which constructs diverse and adversarial driving environments to expose safety violations of ADSs. Specifically, based on atomic driving maneuvers, MOSAT introduces motif patterns, which describe sequences of maneuvers that can challenge an ADS effectively. MOSAT constructs test scenarios from atomic maneuvers and motif patterns, and uses a multi-objective genetic algorithm to search for adversarial and diverse test scenarios. Moreover, in order to test the performance of an ADS comprehensively during long-mile driving, we design a novel continuous simulation testing technique, which alternately runs the scenarios generated by multiple parallel search processes in the simulator and can continuously create different perturbations to the ADS. We demonstrate MOSAT on an industrial-grade platform, Baidu Apollo, and the experimental results show that MOSAT can effectively generate safety-critical scenarios that crash ADSs, exposing 11 distinct types of safety violations in a short period of time. It also outperforms state-of-the-art techniques by finding 6 more distinct safety violations on the same road.


Empirical I

Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization
Lin Shi, Fangwen Mu, Xiao Chen, Song Wang, Junjie Wang, Ye Yang, Ge Li, Xin Xia, and Qing Wang
(Institute of Software at Chinese Academy of Sciences, China; York University, Canada; Stevens Institute of Technology, USA; Peking University, China; Huawei, China)
Code summarization, the task of generating useful comments given the code, has long been of interest. Most of the existing code summarization models are trained and validated on widely used code comment benchmark datasets. However, little is known about the quality of the benchmark datasets built from real-world projects. Are the benchmark datasets as good as expected? To bridge the gap, we conduct a systematic study to assess and improve the quality of four benchmark datasets widely used for code summarization tasks. First, we propose an automated code-comment cleaning tool that can accurately detect noisy data caused by inappropriate data preprocessing operations in existing benchmark datasets. Then, we apply the tool to further assess the data quality of the four benchmark datasets based on the detected noise. Finally, we conduct comparative experiments to investigate the impact of noisy data on the performance of code summarization models. The results show that such preprocessing noise exists widely in all four benchmark datasets, and that removing it leads to a significant improvement in the performance of code summarization. We believe these findings and insights will enable a better understanding of data quality in code summarization tasks and pave the way for relevant research and practice.

Correlates of Programmer Efficacy and Their Link to Experience: A Combined EEG and Eye-Tracking Study
Norman Peitek, Annabelle Bergum, Maurice Rekrut, Jonas Mucke, Matthias Nadig, Chris Parnin, Janet Siegmund, and Sven Apel
(Saarland University, Germany; German Research Center for Artificial Intelligence, Germany; Chemnitz University of Technology, Germany; North Carolina State University, USA)
Background: Despite similar education and background, programmers can exhibit vast differences in efficacy. While research has identified some potential factors, such as programming experience and domain knowledge, the effect of these factors on programmers' efficacy is not well understood. Aims: We aim at unraveling the relationship between efficacy (speed and correctness) and measures of programming experience. We further investigate the correlates of programmer efficacy in terms of reading behavior and cognitive load. Method: For this purpose, we conducted a controlled experiment with 37 participants using electroencephalography (EEG) and eye tracking. We asked participants to comprehend up to 32 Java source-code snippets and observed their eye gaze and neural correlates of cognitive load. We analyzed the correlation of participants' efficacy with popular programming experience measures. Results: We found that programmers with high efficacy read source code in a more targeted manner and with lower cognitive load. Commonly used experience levels do not predict programmer efficacy well, but self-estimation and indicators of learning eagerness are fairly accurate. Implications: The identified correlates of programmer efficacy can be used for future research and practice (e.g., hiring). Future research should also consider efficacy as a group sampling method, rather than using simple experience measures.

What Motivates Software Practitioners to Contribute to Inner Source?
Zhiyuan Wan, Xin Xia, Yun Zhang, David Lo, Daibing Zhou, Qiuyuan Chen, and Ahmed E. Hassan
(Zhejiang University, China; Huawei, China; Zhejiang University City College, China; Singapore Management University, Singapore; Queen’s University, Canada)
Software development organizations have adopted open source development practices to support or augment their software development processes, a phenomenon referred to as inner source. Given the rapid adoption of inner source, we wonder what motivates software practitioners to contribute to inner source projects. We followed a mixed-methods approach--a qualitative phase of interviews with 20 interviewees, followed by a quantitative phase of an exploratory survey with 124 respondents from 13 countries across four continents. Our study uncovers practitioners' motivation to contribute to inner source projects, as well as how the motivation differs from what motivates practitioners to participate in open source projects. We also investigate how software practitioners' motivation impacts their contribution level and continuance intention in inner source projects. Based on our findings, we outline directions for future research and provide recommendations for organizations and software practitioners.


Community

A Retrospective Study of One Decade of Artifact Evaluations
Stefan Winter, Christopher S. Timperley, Ben Hermann, Jürgen Cito, Jonathan Bell, Michael Hilton, and Dirk Beyer
(LMU Munich, Germany; Carnegie Mellon University, USA; TU Dortmund, Germany; TU Wien, Austria; Northeastern University, USA)
Most software engineering research involves the development of a prototype, a proof of concept, or a measurement apparatus. Together with the data collected in the research process, they are collectively referred to as research artifacts and are subject to artifact evaluation (AE) at scientific conferences. Since its initiation in the SE community at ESEC/FSE 2011, both the goals and the process of AE have evolved and today expectations towards AE are strongly linked with reproducible research results and reusable tools that other researchers can build their work on. However, to date little evidence has been provided that artifacts which have passed AE actually live up to these high expectations, i.e., to which degree AE processes contribute to AE's goals and whether the overhead they impose is justified.
We aim to fill this gap by providing an in-depth analysis of research artifacts from a decade of software engineering (SE) and programming languages (PL) conferences, based on which we reflect on the goals and mechanisms of AE in our community. In summary, our analyses (1) suggest that articles with artifacts do not generally have better visibility in the community, (2) provide evidence of how evaluated and non-evaluated artifacts differ with respect to different quality criteria, and (3) highlight opportunities for further improving AE processes.

Artifacts Reusable
Quantifying Community Evolution in Developer Social Networks
Liang Wang, Ying Li, Jierui Zhang, and Xianping Tao
(Nanjing University, China)
Understanding the evolution of communities in developer social networks (DSNs) around open source software (OSS) projects can provide valuable insights about the socio-technical process of OSS development. Existing studies show that the evolutionary behaviors of social communities can be effectively described using patterns including split, shrink, merge, expand, emerge, and extinct. However, existing pattern-based approaches are limited in supporting quantitative analysis, and are potentially problematic in treating the patterns as mutually exclusive when describing community evolution. In this work, we propose that different patterns can occur simultaneously between every pair of communities during evolution, just to different degrees. Four entropy-based indices are devised to measure the degree of community split, shrink, merge, and expand, respectively, which can provide a comprehensive and quantitative measure of community evolution in DSNs. The indices have properties desirable for quantifying community evolution, including monotonicity, and bounded maximum and minimum values that correspond to meaningful cases. They can also be combined to describe more patterns, such as community emerge and extinct. We conduct studies with real-world OSS projects to evaluate the validity of the proposed indices. The results suggest the proposed indices can effectively capture community evolution, and are consistent with existing approaches in detecting evolution patterns in DSNs with an accuracy of 94.1%. The results also show that the indices are useful in predicting OSS team productivity with an accuracy of 0.718. In summary, the proposed approach is among the first to quantify the degree of community evolution with respect to different patterns, which is promising for supporting future research and applications involving DSNs and OSS development.
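
As an illustration of what an entropy-based evolution index can look like (the formula below is an assumption for exposition, not necessarily the paper's exact definition), a split index can be computed as the normalized entropy of how a community's members spread over successor communities: 0 when the members stay together, 1 when they split evenly.

```python
import math
from collections import Counter

def split_index(members_before, assignment_after):
    """Normalized entropy of how one community's members are distributed
    across successor communities: 0 = stayed together, 1 = maximally split."""
    dest = Counter(assignment_after[m] for m in members_before)
    n = len(members_before)
    if len(dest) <= 1:
        return 0.0
    h = -sum((c / n) * math.log2(c / n) for c in dest.values())
    return h / math.log2(len(dest))  # normalize by the maximum entropy

community = ["alice", "bob", "carol", "dave"]
after = {"alice": "c1", "bob": "c1", "carol": "c2", "dave": "c3"}
print(round(split_index(community, after), 3))  # 0.946: heavily split
```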

Understanding Skills for OSS Communities on GitHub
Jenny T. Liang, Thomas Zimmermann, and Denae Ford
(University of Washington, USA; Microsoft Research, USA)
The development of open source software (OSS) is a broad field which requires diverse skill sets. For example, maintainers help lead the project and promote its longevity, technical writers assist with documentation, bug reporters identify defects in software, and developers program the software. However, little is known about which skills are used in OSS development or about OSS contributors' general attitudes towards skills in OSS. In this paper, we address this gap by administering a survey to a diverse set of 455 OSS contributors. Guided by these responses as well as prior literature on software development expertise and social factors of OSS, we develop a model of skills in OSS that considers the many contexts OSS contributors work in. This model has 45 skills in the following 9 categories: technical skills, working styles, problem solving, contribution types, project-specific skills, interpersonal skills, external relations, management, and characteristics. Through a mix of qualitative and quantitative analyses, we find that OSS contributors are actively motivated to improve skills and perceive many benefits in sharing their skills with others. We then use this analysis to derive a set of design implications and best practices for those who incorporate skills into OSS tools and platforms, such as GitHub.


Software Evolution

Accurate Method and Variable Tracking in Commit History
Mehran Jodavi and Nikolaos Tsantalis
(Concordia University, Canada)
Tracking program elements in the commit history of a project is essential for supporting various software maintenance, comprehension, and evolution tasks. Accuracy is of paramount importance for the adoption of program element tracking tools by developers and researchers. To this end, we propose CodeTracker, a refactoring-aware tool that can generate the commit change history for method and variable declarations with very high accuracy. More specifically, CodeTracker has 99.9% precision and recall in method tracking, surpassing the previous state-of-the-art tool, CodeShovel, with a comparable execution time. CodeTracker is the first tool of its kind that can track the change history of variables, with 99.7% precision and 99.8% recall. To evaluate its accuracy in variable tracking, we extended the oracle created by Grund et al. for the evaluation of CodeShovel with the complete change history of all 1,345 variables and parameters declared in the 200 methods comprising the Grund et al. oracle. We make our tool and extended oracle publicly available to enable the replication of our experiments and to facilitate future research on program element tracking techniques.

Artifacts Functional
Classifying Edits to Variability in Source Code
Paul Maximilian Bittner, Christof Tinnes, Alexander Schultheiß, Sören Viegener, Timo Kehrer, and Thomas Thüm
(University of Ulm, Germany; Siemens, Germany; Humboldt University of Berlin, Germany; University of Bern, Switzerland)
For highly configurable software systems, such as the Linux kernel, maintaining and evolving variability information along with changes to source code poses a major challenge. While source code itself may be edited, feature-to-code mappings may also be introduced, removed, or changed. In practice, such edits are often conducted ad hoc and without proper documentation. To support the maintenance and evolution of variability, it is desirable to understand the impact of each edit on the variability. We propose the first complete and unambiguous classification of edits to variability in source code by means of a catalog of edit classes. This catalog is based on a scheme that can be used to build classifications that are complete and unambiguous by construction. To this end, we introduce a complete and sound model for edits to variability. We validate the correctness and suitability of our classification by automatically classifying each edit in 1.7 million commits from the change histories of 44 open-source software systems, at about 21.5 ms per commit. We are able to classify all edits with syntactically correct feature-to-code mappings and find that all our edit classes occur in practice.

Artifacts Reusable
The Evolution of Type Annotations in Python: An Empirical Study
Luca Di Grazia and Michael Pradel
(University of Stuttgart, Germany)
Type annotations and gradual type checkers attempt to reveal errors and facilitate maintenance in dynamically typed programming languages. Despite the availability of these features and tools, it is currently unclear how quickly developers are adopting them, what strategies they follow when doing so, and whether adding type annotations reveals more type errors. This paper presents the first large-scale empirical study of the evolution of type annotations and type errors in Python. The study is based on an analysis of 1,414,936 type annotation changes, which we extract from 1,123,393 commits among 9,655 projects. Our results show that (i) type annotations are getting more popular, and once added, often remain unchanged in the projects for a long time, (ii) projects follow three evolution patterns for type annotation usage -- regular annotation, type sprints, and occasional uses -- and that the used pattern correlates with the number of contributors, (iii) more type annotations help find more type errors (0.704 correlation), but nevertheless, many commits (78.3%) are committed despite having such errors. Our findings show that better developer training and automated techniques for adding type annotations are needed, as most code still remains unannotated, and they call for a better integration of gradual type checking into the development process.
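
The raw material for such a study is easy to extract with Python's own ast module. Below is a sketch of a per-file measurement that, diffed across a commit pair, would yield annotation changes; it is an illustration of the kind of extraction the study scales up, not the authors' tooling.

```python
import ast

def collect_annotations(source):
    """Return (location, annotation) pairs for parameter, return, and
    variable annotations in a Python source file."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for arg in node.args.args:
                if arg.annotation is not None:
                    found.append((f"{node.name}({arg.arg})",
                                  ast.unparse(arg.annotation)))
            if node.returns is not None:
                found.append((f"{node.name}()->", ast.unparse(node.returns)))
        elif isinstance(node, ast.AnnAssign):
            found.append((ast.unparse(node.target), ast.unparse(node.annotation)))
    return found

src = "def greet(name: str) -> str:\n    msg: str = 'hi ' + name\n    return msg"
print(collect_annotations(src))
# [('greet(name)', 'str'), ('greet()->', 'str'), ('msg', 'str')]
```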

Artifacts Reusable
UTANGO: Untangling Commits with Context-Aware, Graph-Based, Code Change Clustering Learning Model
Yi Li, Shaohua Wang, and Tien N. Nguyen
(New Jersey Institute of Technology, USA; University of Texas at Dallas, USA)
During software evolution, developers make several changes and commit them to the repositories. Unfortunately, many commits tangle different purposes, both hampering program comprehension and reducing separation of concerns. Automated approaches with deterministic solutions have been proposed to untangle commits. However, specifying an effective clustering criterion for the changes in a commit is challenging for those approaches. In this work, we present UTango, a machine learning (ML)-based approach that learns to untangle the changes in a commit. We develop a novel code change clustering learning model that learns to cluster the code changes, represented by embeddings, into different groups with different concerns. We adapt the agglomerative clustering algorithm into a supervised-learning clustering model operating on the learned code change embeddings via trainable parameters and a loss function that compares the predicted clusters with the correct ones during training. To facilitate our clustering learning model, we develop a context-aware, graph-based code change representation learning model, leveraging a label-aware, graph-based convolution network to produce contextualized embeddings for code changes that integrate program dependencies and the surrounding contexts of the changes. The contexts and cloned code are also explicitly represented, helping UTango distinguish the concerns. Our empirical evaluation on C# and Java datasets with 1,612 and 14k tangled commits shows that it achieves accuracy 28.6%–462.5% and 13.3%–100.0% higher, relative to the state-of-the-art commit-untangling approaches for C# and Java, respectively.
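
The final clustering step can be pictured with scikit-learn's off-the-shelf agglomerative clustering over change embeddings (the vectors below are made up); UTango's contribution is learning the embeddings and the clustering jointly, which this sketch does not attempt.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical embeddings of five changed hunks from one tangled commit:
# two refactoring hunks, two bug-fix hunks, one documentation tweak.
change_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],   # close to the first hunk -> same concern
    [0.1, 0.9, 0.0],
    [0.0, 0.8, 0.2],   # close to the third hunk -> same concern
    [0.1, 0.1, 0.9],
])

# Cut the dendrogram by distance rather than fixing the cluster count.
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5)
labels = clustering.fit_predict(change_embeddings)
print(labels)  # e.g. [0 0 1 1 2]: the commit untangles into three concerns
```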


Program Analysis I

Static Executes-Before Analysis for Event Driven Programs
Rekha Pai, Abhishek Uppar, Akshatha Shenoy, Pranshul Kushwaha, and Deepak D'Souza
(IISc Bangalore, India; TCS Research, India)
The executes-before relation between tasks is fundamental in the analysis of event-driven programs, with several downstream applications such as race detection and identifying redundant synchronizations. We present a sound, efficient, and effective static analysis technique to compute executes-before pairs of tasks for a general class of event-driven programs. The analysis is based on a small but comprehensive set of rules evaluated on a novel structure called the task post graph of a program. We show how to use the executes-before information to identify disjoint-blocks in event-driven programs and further use them to improve the precision of data race detection for these programs. We have implemented our analysis in the Flowdroid framework in a tool called AndRacer and evaluated it on several Android apps, bringing out the scalability, recall, and improved precision of the analyses.
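
The graph-reachability core of such an analysis can be sketched over a hypothetical, hand-built task post graph; the paper's rule set is considerably more refined than this.

```python
from collections import deque

# Hypothetical task post graph: an edge t -> u means task t posts task u, so
# on a single-threaded event loop t's handler completes before u's starts.
post_graph = {
    "onCreate": ["loadConfig", "renderUI"],
    "loadConfig": ["fetchData"],
    "renderUI": [],
    "fetchData": [],
}

def executes_before(graph, a, b):
    """a executes before b if b is reachable from a in the post graph."""
    worklist, seen = deque(graph[a]), set()
    while worklist:
        t = worklist.popleft()
        if t == b:
            return True
        if t not in seen:
            seen.add(t)
            worklist.extend(graph[t])
    return False

print(executes_before(post_graph, "onCreate", "fetchData"))  # True
print(executes_before(post_graph, "renderUI", "fetchData"))  # False: may race
```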

Artifacts Functional
Security Code Smells in Apps: Are We Getting Better?
Steven Arzt
(Fraunhofer SIT, Germany; ATHENE, Germany)
Users increasingly rely on mobile apps for everyday tasks, including security- and privacy-sensitive tasks such as online banking, e-health, and e-government. Additionally, a wealth of sensors captures the movements and habits of the users for fitness tracking and convenience. Despite legal regulations imposing requirements and limits on the processing of privacy-sensitive data, users must still trust the app developers to apply sufficient protections. In this paper, we investigate the state of security in Android apps and how security-related code smells have evolved since the introduction of the Android operating system.
With an analysis of 300 apps per year over 12 years between 2010 and 2021 from the Google Play Store, we find that the number of code scanner findings per thousand lines of code decreases over time. Still, this development is offset by the increase in code size. Apps have more and more findings, suggesting that the overall security level decreases. This trend is driven by flaws in the use of cryptography, insecure compiler flags, insecure uses of WebView components, and insecure uses of language features such as reflection. Based on our data, we argue for stricter controls on apps before admission to the store.

Large-Scale Analysis of Non-Termination Bugs in Real-World OSS Projects
Xiuhan Shi, Xiaofei Xie, Yi Li, Yao Zhang, Sen Chen, and Xiaohong Li
(Tianjin University, China; Singapore Management University, Singapore; Nanyang Technological University, Singapore)
Termination is a crucial program property. Non-termination bugs can be subtle to detect and may remain hidden for long before they take effect. Many real-world programs still suffer from severe consequences (e.g., no response) caused by non-termination bugs. As a classic problem, termination proving has been studied for many years. Many termination checking tools and techniques have been developed and have demonstrated effectiveness on existing well-established benchmarks. However, the capability of these tools in finding practical non-termination bugs has yet to be tested on real-world projects. To fill this gap, in this paper, we conducted the first large-scale empirical study of non-termination bugs in real-world OSS projects. Specifically, we first devoted substantial manual effort to collecting and analyzing 445 non-termination bugs from 3,142 GitHub commits and provided a systematic classification of the bugs based on their root causes. We constructed a new benchmark set characterizing the real-world bugs with simplified programs, including a non-termination dataset with 56 real and reproducible non-termination bugs and a termination dataset with 58 fixed programs. With the constructed benchmark, we evaluated five state-of-the-art termination analysis tools. The results show that the capabilities of the tested tools to make correct verdicts drop markedly compared with their performance on existing benchmarks. Meanwhile, we identified the challenges and limitations that these tools face by analyzing the root causes of the bugs they fail to handle. Finally, we summarized the challenges and future research directions for detecting non-termination bugs in real-world projects.

On-the-Fly Syntax Highlighting using Neural Networks
Marco Edoardo Palma, Pasquale Salza, and Harald C. Gall
(University of Zurich, Switzerland)
With the presence of online collaborative tools for software developers, source code is shared and consulted frequently, from code viewers to merge requests and code snippets. Typically, code highlighting quality in such scenarios is sacrificed in favor of system responsiveness. In these on-the-fly settings, performing a formal grammatical analysis of the source code is not only expensive, but also intractable for the many times the input is an invalid derivation of the language. Indeed, current popular highlighters rely heavily on a system of regular expressions, typically far from the specification of the language's lexer. Due to their complexity, these regular expressions need to be updated periodically as more feedback is collected from users, and their design hinders the detection of more complex language formations. This paper delivers a deep learning-based approach suitable for on-the-fly grammatical code highlighting of correct and incorrect language derivations, in settings such as code viewers and snippets. It focuses on alleviating the burden on developers, who can reuse the language's parsing strategy to produce the desired highlighting specification. Moreover, the approach is compared to today's online syntax highlighting tools and to formal methods in terms of accuracy and execution time, across different levels of grammatical coverage, for three mainstream programming languages. The results show that the proposed approach consistently achieves near-perfect accuracy in its predictions, thereby outperforming regular-expression-based strategies.

Declarative Smart Contracts
Haoxian Chen, Gerald Whitters, Mohammad Javad Amiri, Yuepeng Wang, and Boon Thau Loo
(University of Pennsylvania, USA; Simon Fraser University, Canada)
This paper presents DeCon, a declarative programming language for implementing smart contracts and specifying contract-level properties. Driven by the observation that smart contract operations and contract-level properties can be naturally expressed as relational constraints, DeCon models each smart contract as a set of relational tables that store transaction records. This relational representation of smart contracts enables convenient specification of contract properties, facilitates run-time monitoring of potential property violations, and brings clarity to contract debugging via data provenance. Specifically, a DeCon program consists of a set of declarative rules and violation query rules over the relational representation, describing the smart contract implementation and contract-level properties, respectively. We have developed a tool that can compile DeCon programs into executable Solidity programs, with instrumentation for run-time property monitoring. Our case studies demonstrate that DeCon can implement realistic smart contracts such as ERC20 and ERC721 digital tokens. Our evaluation results show that DeCon incurs only marginal overhead compared to the open-source reference implementation: a 14% median gas overhead for execution, and another 16% median gas overhead for run-time verification.
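
The relational view can be mimicked in ordinary Python (DeCon itself is a declarative language compiled to Solidity; the schema below is a hypothetical analogy): transfers are rows, balances are a derived relation, and a contract-level property is a violation query that must stay empty.

```python
# Transaction log relation: (sender, receiver, amount), as in an ERC20 token.
transfers = [("alice", "bob", 30), ("bob", "carol", 10), ("alice", "carol", 80)]
initial = {"alice": 100, "bob": 0, "carol": 0}

def balances(transfers, initial):
    """Derived relation: fold the transfer log into current balances."""
    bal = dict(initial)
    for sender, receiver, amount in transfers:
        bal[sender] -= amount
        bal[receiver] += amount
    return bal

def violations(transfers, initial):
    """Violation query: a transfer whose amount exceeded the sender's balance
    at that point in the log. A correct contract keeps this relation empty."""
    bal, bad = dict(initial), []
    for row in transfers:
        sender, _, amount = row
        if bal[sender] < amount:
            bad.append(row)
        else:
            bal = balances([row], bal)
    return bad

print(violations(transfers, initial))  # [('alice', 'carol', 80)]: overdraft
```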

Artifacts Functional

Human/Computer Interaction

Asynchronous Technical Interviews: Reducing the Effect of Supervised Think-Aloud on Communication Ability
Mahnaz Behroozi, Chris Parnin, and Chris Brown
(IBM, USA; North Carolina State University, USA; Virginia Tech, USA)
Software engineers often face a critical test before landing a job—passing a technical interview. During these sessions, candidates must write code while thinking aloud as they work toward a solution to a problem under the watchful eye of an interviewer. While thinking aloud during technical interviews gives interviewers a picture of candidates’ problem-solving ability, surprisingly, these types of interviews often prevent candidates from communicating their thought process effectively. To understand if poor performance related to interviewer presence can be reduced while preserving communication and technical skills, we introduce asynchronous technical interviews—where candidates submit recordings of think-aloud and coding. We compare this approach to traditional whiteboard interviews and find that, by eliminating interviewer supervision, asynchronicity significantly improved the clarity of think-aloud via increased informativeness and reduced stress. Moreover, we discovered asynchronous technical interviews preserved, and in some cases even enhanced, technical problem-solving strategies and code quality. This work offers insight into asynchronous technical interviews as a design for supporting communication during interviews, and discusses trade-offs and guidelines for implementing this approach in software engineering hiring practices.

How to Formulate Specific How-To Questions in Software Development?
Mingwei Liu, Xin Peng, Andrian Marcus, Christoph Treude, Jiazhan Xie, Huanjun Xu, and Yanjun Yang
(Fudan University, China; University of Texas at Dallas, USA; University of Melbourne, Australia)
Developers often ask how-to questions using search engines, technical Q&A communities, and interactive Q&A systems to seek help for specific programming tasks. However, they often do not formulate the questions in a specific way, making it hard for the systems to return the best answers. We propose an approach (TaskKG4Q) that interactively helps developers formulate a programming-related how-to question. TaskKG4Q uses a programming task knowledge graph (task KG for short) mined from Stack Overflow questions, which provides a hierarchical conceptual structure for tasks in terms of [actions], [objects], and [constraints]. An empirical evaluation of the intrinsic quality of the task KG revealed that 75.0% of the annotated questions in the task KG are correct. The comparison between TaskKG4Q and two baselines revealed that TaskKG4Q can help developers formulate more specific how-to questions. Moreover, an empirical study with novice programmers revealed that they write more effective questions for finding answers to their programming tasks on Stack Overflow.

Pair Programming Conversations with Agents vs. Developers: Challenges and Opportunities for SE Community
Peter Robe, Sandeep K. Kuttal, Jake AuBuchon, and Jacob Hart
(University of Tulsa, USA)
Recent research has shown the feasibility of an interactive pair-programming conversational agent, but implementing such an agent poses three challenges: a lack of benchmark datasets, the absence of software engineering specific labels, and the need to understand developer conversations. To address these challenges, we conducted a Wizard of Oz study with 14 participants pair programming with a simulated agent and collected 4,443 developer-agent utterances. Based on this dataset, we created 26 software engineering labels using an open coding process to develop a hierarchical classification scheme. To understand labeled developer-agent conversations, we compared the accuracy of three state-of-the-art transformer-based language models, BERT, GPT-2, and XLNet, which performed comparably. In order to begin creating a developer-agent dataset, researchers and practitioners need to conduct resource-intensive Wizard of Oz studies. Presently, there exist vast amounts of developer-developer conversations on video hosting websites. To investigate the feasibility of using developer-developer conversations, we labeled a publicly available developer-developer dataset (3,436 utterances) with our hierarchical classification scheme and found that a BERT model trained on developer-developer data performed ~10% worse than the BERT trained on developer-agent data, but that accuracy improved when using transfer learning. Finally, our qualitative analysis revealed that developer-developer conversations are more implicit, neutral, and opinionated than developer-agent conversations. Our results have implications for software engineering researchers and practitioners developing conversational agents.

Psychologically-Inspired, Unsupervised Inference of Perceptual Groups of GUI Widgets from GUI Images
Mulong Xie, Zhenchang Xing, Sidong Feng, Xiwei Xu, Liming Zhu, and Chunyang Chen
(Australian National University, Australia; CSIRO’s Data61, Australia; Monash University, Australia; UNSW, Australia)
A Graphical User Interface (GUI) is not merely a collection of individual and unrelated widgets; rather, various visual cues partition its discrete widgets into groups, forming higher-order perceptual units such as tabs, menus, cards, or lists. The ability to automatically segment a GUI into perceptual groups of widgets constitutes a fundamental component of visual intelligence to automate GUI design, implementation, and automation tasks. Although humans can partition a GUI into meaningful perceptual groups of widgets in a highly reliable way, perceptual grouping is still an open challenge for computational approaches. Existing methods rely on ad-hoc heuristics or supervised machine learning that is dependent on specific GUI implementations and runtime information. Research in psychology and biological vision has formulated a set of principles (i.e., the Gestalt theory of perception) that describe how humans group elements in visual scenes based on visual cues like connectivity, similarity, proximity, and continuity. These principles are domain-independent and have been widely adopted by practitioners to structure content on GUIs to improve aesthetic pleasantness and usability. Inspired by these principles, we present a novel unsupervised image-based method for inferring perceptual groups of GUI widgets. Our method requires only GUI pixel images, is independent of GUI implementation, and does not require any training data. The evaluation on a dataset of 1,091 GUIs collected from 772 mobile apps and 20 UI design mockups shows that our method significantly outperforms the state-of-the-art ad-hoc heuristics-based baseline. Our perceptual grouping method creates opportunities for improving UI-related software engineering tasks.
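
The proximity principle alone already yields a useful unsupervised grouping pass. Below is a minimal sketch over widget bounding boxes with a hypothetical pixel threshold; the paper combines several Gestalt cues, not just proximity.

```python
def gap(a, b):
    """Smallest horizontal/vertical gap between boxes (x1, y1, x2, y2);
    zero on an axis where the boxes overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)

def group_by_proximity(boxes, threshold=15):
    """Union widgets whose boxes lie within `threshold` pixels (Gestalt
    proximity): connected components of the 'is near' relation."""
    parent = list(range(len(boxes)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if gap(boxes[i], boxes[j]) <= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Three list items close together plus one distant button: two groups.
widgets = [(10, 10, 90, 30), (10, 40, 90, 60), (10, 70, 90, 90), (200, 300, 260, 330)]
print(group_by_proximity(widgets))  # [[0, 1, 2], [3]]
```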

Publisher's Version Info
Toward Interactive Bug Reporting for (Android App) End-Users
Yang Song, Junayed Mahmud, Ying Zhou, Oscar Chaparro, Kevin Moran, Andrian Marcus, and Denys Poshyvanyk
(College of William and Mary, USA; George Mason University, USA; University of Texas at Dallas, USA)
Many software bugs are reported manually, particularly bugs that manifest themselves visually in the user interface. End-users typically report these bugs via app reviewing websites, issue trackers, or built-in in-app bug reporting tools, if available. While these systems have various features that facilitate bug reporting (e.g., textual templates or forms), they often provide end-users with limited guidance, concrete feedback, or quality verification; end-users are often inexperienced at reporting bugs and submit low-quality bug reports that lead to excessive developer effort in bug report management tasks. We propose an interactive bug reporting system for end-users (Burt), implemented as a task-oriented chatbot. Unlike existing bug reporting systems, Burt provides guided reporting of essential bug report elements (i.e., the observed behavior, expected behavior, and steps to reproduce the bug), instant quality verification, and graphical suggestions for these elements. We implemented a version of Burt for Android and conducted an empirical evaluation study with end-users, who reported 12 bugs from six Android apps studied in prior work. The reporters found Burt’s guidance and automated suggestions/clarifications useful and Burt easy to use. We found that Burt reports contain higher-quality information than reports collected via a template-based bug reporting system. Improvements to Burt, informed by the reporters, include support for various wordings to describe bug report elements and improved quality verification. Our work marks an important paradigm shift from static to interactive bug reporting for end-users.

Publisher's Version Artifacts Reusable

Machine Learning II

Understanding Performance Problems in Deep Learning Systems
Junming Cao, Bihuan Chen, Chao Sun, Longjie Hu, Shuaihong Wu, and Xin Peng
(Fudan University, China)
Deep learning (DL) has been widely applied to many domains. The programming paradigm shift from traditional systems to DL systems poses unique challenges for engineering DL systems, and performance is one of them. Performance problems (PPs) in DL systems can cause severe consequences such as excessive resource consumption and financial loss. While bugs in DL systems have been extensively investigated, PPs in DL systems have hardly been explored. To bridge this gap, we present the first comprehensive study to i) characterize the symptoms, root causes, and introducing and exposing stages of PPs in DL systems developed in TensorFlow and Keras, with 224 PPs collected from 210 StackOverflow posts, and to ii) assess the capability of existing performance analysis approaches in tackling PPs, with a constructed benchmark of 58 PPs in DL systems. Our findings shed light on developing high-performance DL systems and on detecting and localizing PPs in DL systems. To demonstrate the usefulness of our findings, we developed a static checker, DeepPerf, to detect three types of PPs. It has detected 488 new PPs in 130 GitHub projects; of these, 105 have been confirmed and 27 fixed.

Publisher's Version Info Artifacts Reusable
API Recommendation for Machine Learning Libraries: How Far Are We?
Moshi Wei, Yuchao Huang, Junjie Wang, Jiho Shin, Nima Shiri Harzevili, and Song Wang
(York University, Canada; Institute of Software at Chinese Academy of Sciences, China)
Application Programming Interfaces (APIs) are designed to help developers build software more effectively. Recommending the right APIs for specific tasks is gaining increasing attention among researchers and developers. However, most existing approaches are evaluated on general programming tasks using statically typed programming languages such as Java. Little is known about their practical effectiveness and usefulness for machine learning (ML) programming tasks with dynamically typed programming languages such as Python, whose paradigms differ fundamentally from general programming tasks. This question matters given the increasing popularity of ML and the large number of new questions appearing on question answering websites. In this work, we set out to investigate the effectiveness of existing API recommendation approaches for Python-based ML programming tasks from Stack Overflow (SO). Specifically, we conducted an empirical study of six widely used Python-based ML libraries using two state-of-the-art API recommendation approaches, BIKER and DeepAPI. We found that the existing approaches perform poorly for two main reasons: (1) Python-based ML tasks often require significantly longer API sequences; and (2) there are common API usage patterns in Python-based ML programming tasks that existing approaches cannot handle. Inspired by these findings, we propose FIMAX, a simple but effective frequent itemset mining-based approach that enhances existing API recommendation approaches for Python-based ML programming tasks by leveraging common API usage information from SO questions. Our evaluation shows that FIMAX improves existing state-of-the-art API recommendation approaches by up to 54.3% and 57.4% in MRR and MAP, respectively. Our user study with 14 developers further demonstrates the practicality of FIMAX for API recommendation.
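
The frequent-itemset idea behind FIMAX can be illustrated with a minimal sketch (our toy rendering, not the paper's implementation; the function names and corpus below are hypothetical): mine API pairs that frequently co-occur in SO snippets, then boost candidate APIs that co-occur with APIs already in the developer's context.

```python
from itertools import combinations
from collections import Counter

def mine_frequent_pairs(api_sequences, min_support=0.3):
    """Count API pairs that co-occur in the same snippet and keep those
    appearing in at least min_support of all sequences."""
    counts = Counter()
    for seq in api_sequences:
        for pair in combinations(sorted(set(seq)), 2):
            counts[pair] += 1
    threshold = min_support * len(api_sequences)
    return {pair for pair, c in counts.items() if c >= threshold}

def rerank(candidates, context_apis, frequent_pairs):
    """Boost candidates that frequently co-occur with APIs already
    present in the developer's context."""
    def score(api):
        return sum(1 for c in context_apis
                   if tuple(sorted((api, c))) in frequent_pairs)
    return sorted(candidates, key=score, reverse=True)

# Toy corpus of API sequences mined from (hypothetical) SO snippets.
corpus = [
    ["fit", "predict"], ["fit", "transform"], ["fit", "predict"],
    ["compile", "fit"], ["fit", "predict", "score"],
]
pairs = mine_frequent_pairs(corpus)
print(rerank(["transform", "predict"], ["fit"], pairs))  # predict first
```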

Publisher's Version
No More Fine-Tuning? An Experimental Evaluation of Prompt Tuning in Code Intelligence
Chaozheng Wang, Yuanhang Yang, Cuiyun Gao, Yun Peng, Hongyu Zhang, and Michael R. Lyu
(Harbin Institute of Technology, China; Chinese University of Hong Kong, China; University of Newcastle, Australia)
Pre-trained models have been shown to be effective in many code intelligence tasks. These models are pre-trained on large-scale unlabeled corpora and then fine-tuned on downstream tasks. However, as the inputs to pre-training and downstream tasks take different forms, it is hard to fully exploit the knowledge of pre-trained models. Moreover, the performance of fine-tuning strongly relies on the amount of downstream data, while in practice scenarios with scarce data are common. Recent studies in the natural language processing (NLP) field show that prompt tuning, a new tuning paradigm, alleviates the above issues and achieves promising results in various NLP tasks. In prompt tuning, the prompts inserted during tuning provide task-specific knowledge, which is especially beneficial for tasks with relatively scarce data. In this paper, we empirically evaluate the usage and effect of prompt tuning in code intelligence tasks. We conduct prompt tuning on the popular pre-trained models CodeBERT and CodeT5 and experiment with three code intelligence tasks: defect prediction, code summarization, and code translation. Our experimental results show that prompt tuning consistently outperforms fine-tuning in all three tasks. In addition, prompt tuning shows great potential in low-resource scenarios, e.g., improving the BLEU scores of fine-tuning by more than 26% on average for code summarization. Our results suggest that, instead of fine-tuning, we could adopt prompt tuning for code intelligence tasks to achieve better performance, especially when task-specific data is scarce.
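
The core difference between the two paradigms can be sketched in a few lines (an illustration of hard prompting in general, not the paper's exact templates; the template and verbalizer below are hypothetical): prompt tuning reformulates classification as filling a masked slot in a template, so the tuning objective matches the pre-training objective.

```python
# Minimal sketch of hard prompt construction for defect prediction,
# assuming a masked language model; all names are illustrative.
TEMPLATE = "Code: {code} The code is [MASK]."
VERBALIZER = {"defective": 1, "clean": 0}   # mask word -> class label

def make_prompt(code_snippet: str) -> str:
    """Wrap the raw input in a cloze-style template so the tuning
    objective matches the model's pre-training (mask prediction)."""
    return TEMPLATE.format(code=code_snippet)

print(make_prompt("if (p = NULL) { ... }"))
# A fine-tuned classifier would instead consume the raw snippet and
# emit a label from a task-specific head trained from scratch.
```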

Publisher's Version

Software Testing II

Cross-Device Record and Replay for Android Apps
Cong Li, Yanyan Jiang, and Chang Xu
(Nanjing University, China)
Cross-device replay for Android apps is challenging because apps must adapt or even restructure their GUIs responsively upon screen-size or orientation changes across devices. As a first exploratory work, this paper demonstrates that cross-device record and replay can be made simple and practical with a one-pass, greedy algorithm in the Rx framework that leverages the least-surprise principle of GUI design. Experimental results over 1,000 replay settings encouragingly show that our implemented Rx prototype tool effectively solved non-trivial cross-device replay cases beyond the scope of any known non-search-based work, while remaining competitive with state-of-the-art techniques on same-device replay.
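
As a toy illustration of greedy widget matching across devices (ours, not the Rx algorithm itself; the attribute names are hypothetical), one can pick, in a single pass, the target-device widget most similar to the widget acted on during recording:

```python
def similarity(w1, w2):
    """Attribute-level similarity between two GUI widgets; a crude
    stand-in for least-surprise matching."""
    keys = ("resource_id", "text", "class")
    return sum(w1.get(k) == w2.get(k) and w1.get(k) is not None
               for k in keys)

def greedy_match(recorded_widget, candidate_widgets):
    """One-pass greedy selection of the target-device widget that most
    resembles the recorded widget."""
    return max(candidate_widgets,
               key=lambda w: similarity(recorded_widget, w))

recorded = {"resource_id": "btn_send", "text": "Send", "class": "Button"}
candidates = [
    {"resource_id": "btn_cancel", "text": "Cancel", "class": "Button"},
    {"resource_id": "btn_send", "text": "Send", "class": "Button"},
]
print(greedy_match(recorded, candidates))   # picks the Send button
```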

Publisher's Version Info
Online Testing of RESTful APIs: Promises and Challenges
Alberto Martin-Lopez, Sergio Segura, and Antonio Ruiz-Cortés
(University of Seville, Spain)
Online testing of web APIs—testing APIs in production—is gaining traction in industry. Platforms such as RapidAPI and Sauce Labs provide online testing and monitoring services of web APIs 24/7, typically by re-executing manually designed test cases on the target APIs on a regular basis. In parallel, research on the automated generation of test cases for RESTful APIs has seen significant advances in recent years. However, despite their promising results in the lab, it is unclear whether research tools would scale to industrial-size settings and, more importantly, how they would perform in an online testing setup, which is increasingly common in practice. In this paper, we report the results of an empirical study on the use of automated test case generation methods for online testing of RESTful APIs. Specifically, we used the RESTest framework to automatically generate and execute test cases on 13 industrial APIs for 15 days non-stop, resulting in over one million test cases. To scale to this level, we had to transition from a monolithic tool approach to a multi-bot architecture with over 200 bots working cooperatively on tasks like test generation and reporting. As a result, we uncovered about 390K failures, which we conservatively triaged into 254 bugs, 65 of which have been acknowledged or fixed by developers to date. Among others, we identified confirmed faults in the APIs of Amadeus, Foursquare, Yelp, and YouTube, accessed by millions of applications worldwide. More importantly, our reports have guided developers in improving their APIs, including bug fixes and documentation updates in the APIs of Amadeus and YouTube. Our results show the potential of online testing of RESTful APIs as the next must-have feature in industry, but also some of the key challenges to overcome for its full adoption in practice.

Publisher's Version Artifacts Reusable
Avgust: Automating Usage-Based Test Generation from Videos of App Executions
Yixue Zhao, Saghar Talebipour, Kesina Baral, Hyojae Park, Leon Yee, Safwat Ali Khan, Yuriy Brun, Nenad Medvidović, and Kevin Moran
(University of Massachusetts at Amherst, USA; University of Southern California, USA; George Mason University, USA; Sharon High School, USA; Valley Christian High School, USA)
Writing and maintaining UI tests for mobile apps is a time-consuming and tedious task. While decades of research have produced automated approaches for UI test generation, these approaches typically focus on testing for crashes or maximizing code coverage. By contrast, recent research has shown that developers prefer usage-based tests, which center around specific uses of app features, to help support activities such as regression testing. Very few existing techniques support the generation of such tests, as doing so requires automating the difficult task of understanding the semantics of UI screens and user inputs. In this paper, we introduce Avgust, which automates key steps of generating usage-based tests. Avgust uses neural models for image understanding to process video recordings of app uses to synthesize an app-agnostic state-machine encoding of those uses. Then, Avgust uses this encoding to synthesize test cases for a new target app. We evaluate Avgust on 374 videos of common uses of 18 popular apps and show that 69% of the tests Avgust generates successfully execute the desired usage, and that Avgust’s classifiers outperform the state of the art.

Publisher's Version Info Artifacts Functional
Detecting Non-crashing Functional Bugs in Android Apps via Deep-State Differential Analysis
Jue Wang, Yanyan Jiang, Ting Su, Shaohua Li, Chang Xu, Jian Lu, and Zhendong Su
(Nanjing University, China; East China Normal University, China; ETH Zurich, Switzerland)
Non-crashing functional bugs of Android apps can seriously affect user experience. Often buried in rare program paths, such bugs are difficult to detect but lead to severe consequences. Unfortunately, very few automatic functional bug oracles for Android apps exist, and they are all specific to limited types of bugs. In this paper, we introduce a novel technique named deep-state differential analysis, which brings the classical "bugs as deviant behaviors" oracle to Android apps as a generic automatic test oracle. Our oracle builds on two observations about the execution of automatically generated test inputs: (1) there can be a large number of traces reaching internal app states with similar GUI layouts, and only a small portion of them reach an erroneous app state, and (2) when performing the same sequence of actions on similar GUI layouts, the possible outcomes are limited. Therefore, for each set of test inputs terminating at similar GUI layouts, we manifest comparable app behaviors by appending the same events to these inputs, cluster the manifested behaviors, and identify minority behaviors as possible anomalies. We also calibrate the distribution of these test inputs via a novel input calibration procedure, ensuring that it remains balanced despite rare bug occurrences.
We implemented the deep-state differential analysis algorithm as an exploratory prototype Odin and evaluated it against 17 popular real-world Android apps. Odin successfully identified 28 non-crashing functional bugs (five of which were previously unknown) of various root causes with reasonable precision. Detailed comparisons and analyses show that a large fraction (11/28) of these bugs cannot be detected by state-of-the-art techniques.
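
The "bugs as deviant behaviors" oracle can be rendered as a small sketch (our simplification, not Odin's implementation; the layout and event names are hypothetical): group executions by (layout, appended events), cluster their outcomes by equality, and flag minority outcomes.

```python
from collections import Counter, defaultdict

def flag_anomalies(observations, minority_ratio=0.1):
    """Group (layout, appended_events, outcome) executions, cluster the
    outcomes of each group by equality, and flag minority outcomes as
    possible anomalies."""
    groups = defaultdict(list)
    for layout, events, outcome in observations:
        groups[(layout, events)].append(outcome)
    suspects = []
    for key, outcomes in groups.items():
        for outcome, n in Counter(outcomes).items():
            if n <= minority_ratio * len(outcomes):
                suspects.append((key, outcome))
    return suspects

# 19 executions behave alike; one deviates after the same event.
obs = [("list_screen", "tap_item", "detail_screen")] * 19 + \
      [("list_screen", "tap_item", "blank_screen")]
print(flag_anomalies(obs))   # flags the blank_screen minority
```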

Publisher's Version
RoboFuzz: Fuzzing Robotic Systems over Robot Operating System (ROS) for Finding Correctness Bugs
Seulbae Kim and Taesoo Kim
(Georgia Institute of Technology, USA)
Robotic systems are becoming an integral part of human lives. Responding to the increasing demand for robot production, Robot Operating System (ROS), an open-source middleware suite for robotic development, is gaining traction by providing practical tools and libraries for quickly developing robots. In this paper, we are concerned with a relatively less-tested class of bugs in ROS and ROS-based robotic systems, called semantic correctness bugs, including violations of specifications, violations of physical laws, and cyber-physical discrepancies. These bugs often stem from the cyber-physical nature of robotic systems, in which noisy hardware components are intertwined with software components, and thus cannot be detected by existing fuzzing approaches, which mostly focus on finding memory-safety bugs.
We propose RoboFuzz, a feedback-driven fuzzing framework that integrates with ROS and enables testing for correctness bugs. RoboFuzz features (1) data type-aware mutation for effectively stressing data-driven ROS systems, (2) hybrid execution for acquiring robotic states from both the real world and a simulator, capturing unforeseen cyber-physical discrepancies, (3) an oracle handler that identifies correctness bugs by checking the execution states against predefined correctness oracles, and (4) a semantic feedback engine that provides augmented guidance to the input mutator, complementing classic code coverage-based feedback, which is less effective for distributed, data-driven robots. By encoding the correctness invariants of ROS and four ROS-compatible robotic systems into specialized oracles, RoboFuzz detected 30 previously unknown bugs, of which 25 have been acknowledged and six fixed.
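
As a minimal illustration of the oracle and mutation ideas (ours, not RoboFuzz's code; the field names are hypothetical), a physical-law oracle might scan an execution trace for speed-limit violations while a type-aware mutator perturbs numeric velocity commands instead of flipping raw bytes:

```python
import random

def speed_oracle(trace, max_speed=2.0):
    """Correctness oracle: flag states whose measured speed violates
    the commanded limit (a physical-law / specification check)."""
    return [s for s in trace if abs(s["speed"]) > max_speed]

def mutate_command(cmd):
    """Data type-aware mutation: perturb a numeric velocity command
    rather than mutating raw message bytes."""
    return {"linear_x": cmd["linear_x"] + random.uniform(-5.0, 5.0)}

print("fuzzed command:", mutate_command({"linear_x": 1.0}))
# Hypothetical states returned by a simulator or the real robot.
trace = [{"speed": 0.5}, {"speed": 1.9}, {"speed": 2.7}]
violations = speed_oracle(trace)
if violations:
    print("correctness bug candidate:", violations)
```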

Publisher's Version Artifacts Reusable

Empirical II

AgileCtrl: A Self-Adaptive Framework for Configuration Tuning
Shu Wang, Henry Hoffmann, and Shan Lu
(LinkedIn, USA; University of Chicago, USA)
Software systems increasingly expose performance-sensitive configuration parameters, or PerfConfs, to users. Unfortunately, the right settings of these PerfConfs are difficult to decide and often change at run time. To address this problem, prior research has proposed self-adaptive frameworks that automatically monitor the software’s behavior and dynamically tune configurations to provide the desired performance despite dynamic changes. However, these frameworks often require configuration themselves; sometimes explicitly in the form of additional parameters, sometimes implicitly in the form of training.
This paper proposes a new framework, AgileCtrl, that eliminates the need for configuration for a large family of control-based self-adaptive frameworks. AgileCtrl’s key insight is to monitor not just the original software but also its own adaptations, and to reconfigure itself when its internal adaptation mechanisms are not meeting software requirements. We evaluate AgileCtrl by comparing it against recent control-based approaches to self-adaptation that require user configuration. Across a number of case studies, we find that AgileCtrl withstands model errors up to 10^6×, saves the system from performance oscillation and crashes, and improves performance by up to 53%. It also auto-adjusts improper performance goals while improving performance by 50%.
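
A control-based self-adaptive loop of the kind AgileCtrl builds on can be sketched in a few lines (illustrative only; AgileCtrl's contribution is monitoring and re-tuning such a controller automatically, which is omitted here):

```python
def control_step(measured_latency, goal_latency, threads, gain=0.5):
    """One proportional-control step: nudge a PerfConf (here, a thread
    count) toward a latency goal. Positive error means the system is
    too slow, so the controller adds capacity."""
    error = measured_latency - goal_latency
    return max(1, round(threads + gain * error))

threads = 4
for latency_ms in (120, 95, 80, 62):          # hypothetical measurements
    threads = control_step(latency_ms, goal_latency=60, threads=threads)
    print("threads ->", threads)
```

The `gain` parameter is exactly the kind of framework configuration that AgileCtrl aims to make self-adjusting.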

Publisher's Version
Using Nudges to Accelerate Code Reviews at Scale
Qianhua Shan, David Sukhdeo, Qianying Huang, Seth Rogers, Lawrence Chen, Elise Paradis, Peter C. Rigby, and Nachiappan Nagappan
(Meta, USA; Concordia University, Canada)
We describe a large-scale study to reduce the amount of time code review takes. Each quarter at Meta we survey developers. Combining sentiment data from a developer experience survey and telemetry data from our diff review tool, we address the question, “When does a diff review feel too slow?” From the sentiment data alone, we learn that 84.7% of developers are satisfied with the time their diffs spend in review. By enriching the survey results with telemetry for each respondent, we determined that sentiment is closely associated with the 75th percentile time in review for that respondent’s diffs, i.e., those that take more than 24 hours.
To encourage developers to act on stale diffs that have had no action for 24 or more hours, we designed a NudgeBot to notify, i.e., nudge, reviewers. To determine whom to nudge when a diff is stale, we created a model to rank the reviewers based on the probability that they will make a comment or perform some other action on a diff. This model outperformed models that looked at files the reviewer had modified in the past. Combining this information with prior author-reviewer relationships, we achieved an MRR and AUC of .81 and .88, respectively.
To evaluate NudgeBot in production, we conducted an A/B cluster-randomized experiment on over 30k engineers. We observed a substantial, statistically significant decrease in both time in review (-6.8%, p=0.049) and time to first reviewer action (-9.9%, p=0.010). We also used guard metrics to ensure that most reviews were still done in fewer than 24 hours and that reviewers still spent the same amount of time looking at diffs, and saw no statistically significant change in these metrics. NudgeBot is now rolled out company-wide and is used daily by thousands of engineers at Meta.

Publisher's Version
First Come First Served: The Impact of File Position on Code Review
Enrico Fregnan, Larissa Braz, Marco D'Ambros, Gül Çalıklı, and Alberto Bacchelli
(University of Zurich, Switzerland; USI Lugano, Switzerland; University of Glasgow, UK)
The most popular code review tools (e.g., Gerrit and GitHub) present the files to review sorted in alphabetical order. Could this choice or, more generally, the relative position in which a file is presented bias the outcome of code reviews? We investigate this hypothesis by triangulating complementary evidence in a two-step study. First, we observe developers’ code review activity. We analyze the review comments pertaining to 219,476 Pull Requests (PRs) from 138 popular Java projects on GitHub. We found files shown earlier in a PR to receive more comments than files shown later, even when controlling for possible confounding factors, e.g., the presence of discussion threads or the number of lines added in a file. Second, we measure the impact of file position on defect finding in code review. Recruiting 106 participants, we conducted an online controlled experiment in which we measured participants’ performance in detecting two unrelated defects seeded into two different files. Participants were assigned to one of two treatments in which the position of the defective files is switched. For one type of defect, participants were not affected by its file’s position; for the other, the odds of identifying it were 64% lower when its file was shown last as opposed to first. Overall, our findings provide evidence that the relative position in which files are presented has an impact on code reviews’ outcome; we discuss these results and their implications for tool design and code review.

Publisher's Version Artifacts Reusable
Code, Quality, and Process Metrics in Graduated and Retired ASFI Projects
Ștefan Stănciulescu, Likang Yin, and Vladimir Filkov
(University of California at Davis, USA)
Recent work on open source sustainability shows that successful trajectories of projects in the Apache Software Foundation Incubator (ASFI) can be predicted early on, using a set of socio-technical measures. Because OSS projects are socio-technical systems centered around code artifacts, we hypothesize that sustainable projects may exhibit different code and process patterns than unsustainable ones, and that those patterns can grow more apparent as projects evolve over time. Here we studied the code and coding processes of over 200 ASFI projects, and found that ASFI graduated projects have different patterns of code quality and complexity than retired ones. The same holds for the coding processes; e.g., feature commits and bug-fixing commits are correlated with project graduation success. We find that both minor contributors (who contribute <5% of commits) and major contributors (who contribute >=95% of commits) are associated with graduation outcomes, implying that developers who contribute fewer commits are also important for a project’s success. This study provides evidence that OSS projects, especially nascent ones, can benefit from introspection and instrumentation using multidimensional modeling of the whole system, including code, processes, and code quality measures, and how they are interconnected over time.

Publisher's Version
CommentFinder: A Simpler, Faster, More Accurate Code Review Comments Recommendation
Yang Hong, Chakkrit Tantithamthavorn, Patanamon Thongtanunam, and Aldeida Aleti
(Monash University, Australia; University of Melbourne, Australia)
Code review is an effective quality assurance practice, but it can be labor-intensive since developers have to manually review the code and provide written feedback. Recently, a Deep Learning (DL)-based approach was introduced to automatically recommend code review comments based on changed methods. While the approach showed promising results, it requires expensive computational resources and time, which limits its use in practice. To address this limitation, we propose CommentFinder, a retrieval-based approach to recommend code review comments. Through an empirical evaluation of 151,019 changed methods, we evaluate the effectiveness and efficiency of CommentFinder against the state-of-the-art approach. We find that when recommending the best-1 review comment candidate, CommentFinder is 32% better than prior work at recommending the correct code review comment. In addition, CommentFinder is 49 times faster than the prior work. These findings highlight that CommentFinder could help reviewers reduce manual effort by recommending code review comments, while requiring less computational time.
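
Retrieval-based recommendation of this flavor can be sketched minimally (our illustration, not CommentFinder's actual similarity function): represent each changed method as a bag of tokens and return the comments attached to the most similar past methods.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-token vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(changed_method: str, history, k=1):
    """Return review comments attached to the k most similar past
    methods. history is a list of (method_source, comment) pairs."""
    query = Counter(changed_method.split())
    ranked = sorted(history,
                    key=lambda mc: cosine(query, Counter(mc[0].split())),
                    reverse=True)
    return [comment for _, comment in ranked[:k]]

history = [
    ("int sum ( int a , int b ) { return a + b ; }", "Add an overflow check?"),
    ("void log ( String msg ) { System . out . println ( msg ) ; }",
     "Use a logger instead of println."),
]
print(recommend("void log ( String s ) { System . out . println ( s ) ; }",
                history))
```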

Publisher's Version

Machine Learning III

AutoPruner: Transformer-Based Call Graph Pruning
Thanh Le-Cong, Hong Jin Kang, Truong Giang Nguyen, Stefanus Agus Haryono, David Lo, Xuan-Bach D. Le, and Quyet Thang Huynh
(Singapore Management University, Singapore; University of Melbourne, Australia; Hanoi University of Science and Technology, Vietnam)
Constructing a static call graph requires trade-offs between soundness and precision. Program analysis techniques for constructing call graphs are unfortunately usually imprecise. To address this problem, researchers have recently proposed call graph pruning empowered by machine learning to post-process call graphs constructed by static analysis. A machine learning model is built to capture information from the call graph by extracting structural features for use in a random forest classifier. It then removes edges that are predicted to be false positives. Despite the improvements shown by such machine learning models, they are still limited because they do not consider source code semantics and thus often cannot effectively distinguish true from false positives.
In this paper, we present a novel call graph pruning technique, AutoPruner, for eliminating false positives in call graphs via both statistical semantic and structural analysis. Given a call graph constructed by traditional static analysis tools, AutoPruner takes a Transformer-based approach to capture the semantic relationships between the caller and callee functions associated with each edge in the call graph. To do so, AutoPruner fine-tunes a model of code that was pre-trained on a large corpus to represent source code based on descriptions of its semantics. Next, the model is used to extract semantic features from the functions related to each edge in the call graph. AutoPruner uses these semantic features together with the structural features extracted from the call graph to classify each edge via a feed-forward neural network. Our empirical evaluation on a benchmark dataset of real-world programs shows that AutoPruner outperforms the state-of-the-art baselines, improving on F-measure by up to 13% in identifying false-positive edges in a static call graph. Moreover, AutoPruner achieves improvements on two client analyses, including halving the false alarm rate on null pointer analysis and over 10% improvements on monomorphic call-site detection. Additionally, our ablation study and qualitative analysis show that the semantic features extracted by AutoPruner capture a remarkable amount of information for distinguishing between true and false positives.
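
The fusion step can be sketched as follows (a toy stand-in with random weights; AutoPruner's real model fine-tunes a pre-trained Transformer to produce the semantic features): concatenate semantic and structural feature vectors and score each call-graph edge with a small feed-forward network.

```python
import numpy as np

def classify_edges(semantic_feats, structural_feats, W1, b1, W2, b2):
    """Feed-forward scoring of call-graph edges from concatenated
    semantic (e.g., Transformer embeddings) and structural features.
    Weights are assumed pre-trained; this only shows the fusion step."""
    x = np.concatenate([semantic_feats, structural_feats], axis=1)
    h = np.maximum(0, x @ W1 + b1)             # ReLU hidden layer
    logits = h @ W2 + b2
    keep = 1 / (1 + np.exp(-logits)) >= 0.5    # sigmoid + threshold
    return keep.ravel()                        # True = keep the edge

rng = np.random.default_rng(0)
sem = rng.normal(size=(4, 8))     # 4 edges, 8-dim semantic features
struct = rng.normal(size=(4, 3))  # 4 edges, 3-dim structural features
W1, b1 = rng.normal(size=(11, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
print(classify_edges(sem, struct, W1, b1, W2, b2))
```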

Publisher's Version Artifacts Functional
Lighting Up Supervised Learning in User Review-Based Code Localization: Dataset and Benchmark
Xinwen Hu, Yu Guo, Jianjie Lu, Zheling Zhu, Chuanyi Li, Jidong Ge, Liguo Huang, and Bin Luo
(Nanjing University, China; Southern Methodist University, USA)
As User Reviews (URs) of mobile Apps are proven to provide valuable feedback for maintaining and evolving applications, how to make full use of URs in the release cycle of mobile Apps has become a widely researched topic in the Software Engineering (SE) community. To speed up the completion of coding work related to URs and thereby shorten the release cycle, the task of User Review-based code localization (Review2Code) has been proposed and studied in depth. However, due to the lack of a large-scale ground-truth dataset (i.e., truly related <UR, Code> pairs), existing methods are all based on unsupervised learning. To enable supervised learning approaches for Review2Code, which are driven by large labeled datasets, and to compare their performance with unsupervised learning-based methods, we first introduce a large-scale human-labeled <UR, Code> ground-truth dataset, including the annotation process and statistical analysis. Then, a benchmark consisting of two SOTA unsupervised learning-based and four supervised learning-based Review2Code methods is constructed on this dataset. We believe this paper can provide a basis for in-depth exploration of supervised learning-based Review2Code solutions.

Publisher's Version
CORMS: A GitHub and Gerrit Based Hybrid Code Reviewer Recommendation Approach for Modern Code Review
Prahar Pandya and Saurabh Tiwari
(DA-IICT Gandhinagar, India)
Modern Code Review (MCR) techniques are widely adopted on both open-source software platforms and in organizations to ensure the quality of their software products. However, selecting reviewers for code review becomes cumbersome as development teams grow, and recommending inappropriate reviewers can add time and effort before a review is completed effectively. In this paper, we extended the baseline reviewer recommendation framework, RevFinder, to handle issues with newly created files, retired reviewers, the external validity of results, and the accuracy of the state-of-the-art RevFinder. Our proposed hybrid approach, CORMS, performs similarity analysis to compute similarities among file paths, projects/sub-projects, and author information, and uses prediction models to recommend reviewers based on the subject of the change. We conducted a detailed analysis on 20 widely used projects from both Gerrit and GitHub to compare our results with RevFinder. Our results reveal that, on average, CORMS achieves top-1, top-3, top-5, and top-10 accuracies and a Mean Reciprocal Rank (MRR) of 45.1%, 67.5%, 74.6%, 79.9%, and 0.58 across the 20 projects, improving on RevFinder by 44.9%, 34.4%, 20.8%, 12.3%, and 18.4%, respectively.

Publisher's Version
Hierarchical Bayesian Multi-kernel Learning for Integrated Classification and Summarization of App Reviews
Moayad Alshangiti, Weishi Shi, Eduardo Lima, Xumin Liu, and Qi Yu
(University of Jeddah, Saudi Arabia; Rochester Institute of Technology, USA)
App stores enable users to share their experiences directly with developers in the form of app reviews. Recent studies have shown that this user feedback is a valuable source of information for requirements extraction, which encourages app developers to leverage the reviews for app update and maintenance purposes. Follow-up studies proposed automated techniques to help developers filter the large volume of daily, noisy reviews and/or summarize their content. However, all previous studies approached app review classification and summarization as separate tasks, which complicated the process and introduced unnecessary overhead. Moreover, none of those approaches explored the hierarchical relationships that exist between app review labels as a way to build a more accurate model. In this work, we propose Hierarchical Multi-Kernel Relevance Vector Machines (HMK-RVM), a Bayesian multi-kernel technique that integrates app review classification and summarization in a unified model. Moreover, it provides insights into the learned patterns and underlying data for easier model interpretation. We evaluated our proposed approach on two real-world datasets and showed that, in addition to the gained insights, the model produces results equal to or better than the state of the art.

Publisher's Version
Semi-supervised Pre-processing for Learning-Based Traceability Framework on Real-World Software Projects
Liming Dong, He Zhang, Wei Liu, Zhiluo Weng, and Hongyu Kuang
(Nanjing University, China)
The traceability of software artifacts has been recognized as an important factor in supporting various activities in software development processes. However, traceability can be difficult and time-consuming to create and maintain manually, so automated approaches have gained much attention. Unfortunately, existing automated approaches to traceability suffer from practical issues. This paper aims to understand why state-of-the-art ML-based trace link classifiers underperform when applied to real-world projects. By investigating different industrial datasets, we found that two critical (and classic) challenges, i.e., data imbalance and data sparsity, underlie traceability automation in real-world projects. To overcome these challenges, we developed a framework called SPLINT that incorporates hybrid textual similarity measures and semi-supervised learning strategies as enhancements to learning-based traceability approaches. We carried out experiments with six open-source platforms and ten industry datasets. The results confirm that SPLINT performs better on both communities’ datasets. In particular, the industrial datasets, which suffer significantly from data imbalance and sparsity, show an average increase of over 14% in F2-score and over 8% in AUC. The adjusted class-balancing and self-training policies used in SPLINT (CBST-Adjust) also work effectively for the selection of pseudo-labels of minor classes from unlabeled trace sets, demonstrating SPLINT’s practicability.

Publisher's Version

Formal Methods

Input Invariants
Dominic Steinhöfel and Andreas Zeller
(CISPA Helmholtz Center for Information Security, Germany)
How can we generate valid system inputs? Grammar-based fuzzers are highly efficient in producing syntactically valid system inputs. However, programs will often reject inputs that are semantically invalid. We introduce ISLa, a declarative specification language for context-sensitive properties of structured system inputs based on context-free grammars. With ISLa, it is possible to specify input constraints like "a variable has to be defined before it is used," "the 'file name' block must be 100 bytes long," or "the number of columns in all CSV rows must be identical."
Such constraints go into the ISLa fuzzer, which leverages the power of solvers like Z3 to solve semantic constraints and, on top, handles quantifiers and predicates over grammar structure. We show that a few ISLa constraints suffice to produce 100% semantically valid inputs while still maintaining input diversity. ISLa can also parse and precisely validate inputs against semantic constraints.
ISLa constraints can be mined from existing input samples. For this, our ISLearn prototype uses a catalog of common patterns, instantiates these over input elements, and retains those candidates that hold for the inputs observed and whose instantiations are fully accepted by input-processing programs. The resulting constraints can then again be used for fuzzing and parsing.
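
ISLa solves such constraints directly on top of the grammar; the naive generate-and-filter sketch below (ours, using a toy CSV generator) only illustrates the underlying problem of pairing syntactic generation with a semantic check, such as the identical-column-count constraint quoted above:

```python
import random

def random_csv():
    """Toy CSV generator: 1-3 rows, each with 1-3 integer fields."""
    return "\n".join(
        ",".join(str(random.randint(0, 9))
                 for _ in range(random.randint(1, 3)))
        for _ in range(random.randint(1, 3)))

def columns_identical(csv_text):
    """Semantic constraint from the abstract: all CSV rows must have
    the same number of columns."""
    return len({len(line.split(",")) for line in csv_text.splitlines()}) == 1

# Naive generate-and-filter; ISLa instead solves the constraint while
# generating, avoiding the wasted syntactically-valid samples.
valid = [s for s in (random_csv() for _ in range(1000)) if columns_identical(s)]
print(len(valid), "valid samples, e.g.:", valid[0] if valid else None)
```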

Publisher's Version Info Artifacts Reusable
Modus: A Datalog Dialect for Building Container Images
Chris Tomy, Tingmao Wang, Earl T. Barr, and Sergey Mechtaev
(University College London, UK)
Containers help share and deploy software by packaging it with all its dependencies. Tools, like Docker or Kubernetes, spawn containers from images as specified by a build system’s language, such as Dockerfile. A build system takes many parameters to build an image, including OS and application versions. These build parameters can interact: setting one can restrict another. Dockerfile lacks support for reifying and constraining these interactions, thus forcing developers to write a build script per workflow. As a result, developers have resorted to creating ad-hoc solutions such as templates or domain-specific frameworks that harm performance and complicate maintenance because they are verbose and mix languages.
To address this problem, we introduce Modus, a Datalog dialect for building container images. Modus' key insight is that container definitions naturally map to proof trees of Horn clauses. In these trees, container configurations correspond to logical facts, build instructions correspond to logic rules, and the build tree is computed as the minimal proof of the Datalog query specifying the target image. Modus relies on Datalog’s expressivity to specify complex workflows with concision and facilitate automatic parallelisation.
We evaluated Modus by porting the build systems of six popular Docker Hub images to Modus. Modus reduced the code size by 20.1% compared to the ad-hoc solutions previously used, while imposing a negligible performance overhead and preserving the original image size and image efficiency. We also provide a detailed analysis of porting the OpenJDK image build system to Modus.

Publisher's Version Info Artifacts Reusable
Multi-Phase Invariant Synthesis
Daniel Riley and Grigory Fedyukovich
(Florida State University, USA)
Loops with multiple phases are challenging to verify because they require disjunctive invariants. Invariants can also take the form of an implication between a precondition for a phase and a lemma that is valid throughout that phase. Such invariant structure is, however, not widely supported in state-of-the-art verification. We present a novel SMT-based approach to synthesize implication invariants for multi-phase loops. Our technique computes Model Based Projections to discover the program’s phases and leverages data learning to obtain relationships among loop variables at an arbitrary place in the loop. It is effective in the challenging cases of mutually-dependent periodic phases, where many implication invariants need to be discovered simultaneously. Our approach has shown promising results in its ability to verify programs with complex phase structures. We have implemented our algorithm and evaluated it against several state-of-the-art solvers.
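
As a toy illustration of such implication invariants (our example, not one from the paper), consider the two-phase loop x := 0; y := 0; while x < 100 do { if x < 50 then x := x + 1 else { x := x + 1; y := y + 1 } }. Each phase contributes a precondition-lemma implication, and their conjunction is inductive:

```latex
% Phase 1 (x below 50): y stays at its initial value.
% Phase 2 (x at or above 50): y tracks the distance travelled in phase 2.
\[
  (x \le 50 \rightarrow y = 0) \;\wedge\; (x \ge 50 \rightarrow y = x - 50)
\]
```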

Publisher's Version Artifacts Functional
Parasol: Efficient Parallel Synthesis of Large Model Spaces
Clay Stevens and Hamid Bagheri
(University of Nebraska-Lincoln, USA)
Formal analysis is an invaluable tool for software engineers, yet state-of-the-art formal analysis techniques suffer from well-known limitations in terms of scalability. In particular, some software design domains—such as tradeoff analysis and security analysis—require systematic exploration of potentially huge model spaces, which further exacerbates the problem. Despite this present and urgent challenge, few techniques exist to support the systematic exploration of large model spaces. This paper introduces Parasol, an approach and accompanying tool suite, to improve the scalability of large-scale formal model space exploration. Parasol presents a novel parallel model space synthesis approach, backed with unsupervised learning to automatically derive domain knowledge, guiding a balanced partitioning of the model space. This allows Parasol to synthesize the models in each partition in parallel, significantly reducing synthesis time and making large-scale systematic model space exploration for real-world systems more tractable. Our empirical results corroborate that Parasol substantially reduces (by 460% on average) the time required for model space synthesis, compared to state-of-the-art model space synthesis techniques relying on both incremental and parallel constraint solving technologies as well as competing, non-learning-based partitioning methods.

Publisher's Version
Neural Termination Analysis
Mirco Giacobbe, Daniel Kroening, and Julian Parsert
(University of Birmingham, UK; University of Oxford, UK)
We introduce a novel approach to the automated termination analysis of computer programs: we use neural networks to represent ranking functions. Ranking functions map program states to values that are bounded from below and decrease as a program runs; the existence of a ranking function proves that the program terminates. We train a neural network from sampled execution traces of a program so that the network's output decreases along the traces; then, we use symbolic reasoning to formally verify that it generalises to all possible executions. Upon an affirmative answer, we obtain a formal certificate of termination for the program, which we call a neural ranking function. We demonstrate that, thanks to the ability of neural networks to represent nonlinear functions, our method succeeds over programs that are beyond the reach of state-of-the-art tools. This includes programs that use disjunctions in their loop conditions and programs that include nonlinear expressions.
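
The trace-driven training step can be caricatured with a linear "network" (our sketch; the paper uses genuine neural networks, and the symbolic verification step is omitted here): learn weights so that the candidate ranking function decreases along every sampled transition.

```python
import random

def traces(n=50):
    """Sample executions of:  while x > 0: x, y = x - 1, y + 2"""
    out = []
    for _ in range(n):
        x, y = random.randint(1, 20), random.randint(0, 5)
        t = [(x, y)]
        while x > 0:
            x, y = x - 1, y + 2
            t.append((x, y))
        out.append(t)
    return out

# Candidate ranking function f(x, y) = w[0]*x + w[1]*y, trained so that
# f decreases by at least 1 along every observed transition.
w = [0.0, 0.0]
for t in traces():
    for s, s2 in zip(t, t[1:]):
        decrease = w[0] * (s[0] - s2[0]) + w[1] * (s[1] - s2[1])
        if decrease < 1:                    # perceptron-style update
            w[0] += s[0] - s2[0]
            w[1] += s[1] - s2[1]
print(w)   # here: [1.0, -2.0], i.e., f = x - 2y decreases by 5 per step
# A real tool would now verify symbolically that f decreases and is
# bounded below on *all* executions, not just the sampled ones.
```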

Publisher's Version

Debugging/Localization

PaReco: Patched Clones and Missed Patches among the Divergent Variants of a Software Family
Poedjadevie Kadjel Ramkisoen, John Businge, Brent van Bladel, Alexandre Decan, Serge Demeyer, Coen De Roover, and Foutse Khomh
(University of Antwerp, Belgium; Flanders Make, Belgium; University of Nevada at Las Vegas, USA; University of Mons, Belgium; F.R.S.-FNRS, Belgium; Vrije Universiteit Brussel, Belgium; Polytechnique Montréal, Canada)
Re-using whole repositories as a starting point for new projects is often done by maintaining a variant fork parallel to the original. However, the common artifacts between the two are not always kept up to date. As a result, patches are not optimally integrated across the two repositories, which may lead to sub-optimal maintenance between the variant and the original project. A bug existing in both repositories can be patched in one but not the other (we see this as a missed opportunity), or it can be manually patched in both, probably by different developers (we see this as effort duplication). In this paper we present a tool (named PaReCo) which relies on clone detection to mine cases of missed opportunity and effort duplication from a pool of patches. We analyzed 364 (source to target) variant pairs with 8,323 patches, resulting in a curated dataset containing 1,116 cases of effort duplication and 1,008 cases of missed opportunities. We achieve a precision of 91%, recall of 80%, accuracy of 88%, and F1-score of 85%. Furthermore, we investigated the time interval between patches and found that, on average, missed patches in the target variants had been introduced in the source variants 52 weeks earlier. Consequently, PaReCo can be used to manage variability in “time” by automatically identifying interesting patches in later project releases to be backported to supported earlier releases.

Publisher's Version
Fault Localization to Detect Co-change Fixing Locations
Yi Li, Shaohua Wang, and Tien N. Nguyen
(New Jersey Institute of Technology, USA; University of Texas at Dallas, USA)
Fault Localization (FL) is a precursor step to most Automated Program Repair (APR) approaches, which fix the faulty statements identified by FL tools. We present FixLocator, a Deep Learning (DL)-based fault localization approach supporting the detection of faulty statements in one or multiple methods that need to be modified together in the same fix. We call these co-change (CC) fixing locations for a fault. We treat this FL problem as dual-task learning with two models. The method-level FL model, MethFL, learns the methods to be fixed together. The statement-level FL model, StmtFL, learns the statements to be co-fixed. Correct learning in one model can benefit the other, and vice versa. Thus, we train them simultaneously, soft-sharing the models’ parameters via cross-stitch units to enable the propagation of the impact of MethFL and StmtFL onto each other. Moreover, we explore a novel feature for FL: the co-changed statements. We also use a graph-based convolutional network to integrate different types of program dependencies.
Our empirical results show that FixLocator relatively improves over the state-of-the-art statement-level FL baselines by locating 26.5%–155.6% more CC fixing statements. To evaluate its usefulness in APR, we used FixLocator in combination with the state-of-the-art APR tools. The results show that FixLocator+DEAR (the original FL in DEAR replaced by FixLocator) and FixLocator+CURE improve relatively over the original DEAR and Ochiai+CURE by 10.5% and 42.9% in terms of the number of fixed bugs.

Publisher's Version Info
The Best of Both Worlds: Integrating Semantic Features with Expert Features for Defect Prediction and Localization
Chao Ni, Wei Wang, Kaiwen Yang, Xin Xia, Kui Liu, and David Lo
(Zhejiang University, China; Huawei, China; Singapore Management University, Singapore)
To improve software quality, just-in-time defect prediction (JIT-DP), which identifies defect-inducing commits, and just-in-time defect localization (JIT-DL), which identifies defect-inducing code lines in commits, have been widely studied by learning semantic features or expert features respectively, and have indeed achieved promising performance. Semantic features and expert features describe code change commits from different aspects; however, the best of the two feature types has not yet been fully explored together to boost just-in-time defect prediction and localization. Additionally, JIT-DP identifies defects at the coarse commit level, while JIT-DL, as the subsequent task, cannot accurately localize defect-inducing code lines in a commit without JIT-DP. We hypothesize that the two JIT tasks can be combined to boost the accurate prediction and localization of defect-inducing commits by integrating semantic features with expert features. Therefore, we propose a unified model, JIT-Fine, for just-in-time defect prediction and localization that leverages the best of semantic features and expert features. To assess the feasibility of JIT-Fine, we first build a large-scale, line-level, manually labeled dataset, JIT-Defects4J. Then, we make a comprehensive comparison with six state-of-the-art baselines under various settings using ten performance measures grouped into two types: effort-agnostic and effort-aware. The experimental results indicate that JIT-Fine outperforms all state-of-the-art baselines on both JIT-DP and JIT-DL tasks in terms of all ten performance measures, with substantial improvements (i.e., 10%-629% in terms of effort-agnostic measures on JIT-DP, 5%-54% in terms of effort-aware measures on JIT-DP, and 4%-117% in terms of effort-aware measures on JIT-DL).

Publisher's Version

Mining Software Repositories

An Exploratory Study on the Predominant Programming Paradigms in Python Code
Robert Dyer and Jigyasa Chauhan
(University of Nebraska-Lincoln, USA)
Python is a multi-paradigm programming language that fully supports object-oriented (OO) programming. The language allows writing code in a non-procedural imperative manner, using procedures, using classes, or in a functional style. To date, no one has studied what paradigm(s), if any, are predominant in Python code and projects. In this work, we first define a technique to classify Python files by their predominant paradigm(s). We then automate our approach and evaluate it against human judgements, showing over 80% agreement. We then analyze over 100k open-source Python projects, automatically classifying each source file and investigating the paradigm distributions. The results indicate Python developers tend to heavily favor OO features. We also observed a positive correlation between the OO and procedural paradigms and project size. And although few files or projects are predominantly functional, we still found many uses of functional features.
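
A crude version of such a classifier (ours, not the paper's technique) can be built on Python's own ast module by counting paradigm indicators per file:

```python
import ast

def paradigm_features(source: str) -> dict:
    """Count simple paradigm indicators in a Python file: class
    definitions (OO), function definitions (procedural), and lambdas
    plus comprehensions (functional). Deliberately crude: e.g., it
    counts methods inside classes as procedural features."""
    counts = {"oo": 0, "procedural": 0, "functional": 0}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            counts["oo"] += 1
        elif isinstance(node, ast.FunctionDef):
            counts["procedural"] += 1
        elif isinstance(node, (ast.Lambda, ast.ListComp, ast.GeneratorExp)):
            counts["functional"] += 1
    return counts

print(paradigm_features(
    "class A:\n    def m(self): return [x * x for x in range(3)]\n"))
```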

Publisher's Version Info Artifacts Reusable
Making Python Code Idiomatic by Automatic Refactoring Non-idiomatic Python Code with Pythonic Idioms
Zejun Zhang, Zhenchang Xing, Xin Xia, Xiwei Xu, and Liming Zhu
(Australian National University, Australia; CSIRO’s Data61, Australia; Huawei, China)
Compared to other programming languages (e.g., Java), Python has more idioms for making Python code concise and efficient. Although pythonic idioms are well accepted in the Python community, Python programmers often face challenges in using them, for example, being unaware of certain pythonic idioms or not knowing how to use them properly. Based on an analysis of 7,638 Python repositories on GitHub, we find that non-idiomatic Python code that could be implemented with pythonic idioms occurs frequently and widely. Unfortunately, there is no tool for automatically refactoring such non-idiomatic code into idiomatic code. In this paper, we design and implement an automatic refactoring tool to make Python code idiomatic. We identify nine pythonic idioms by systematically contrasting the abstract syntax grammars of Python and Java. We then define syntactic patterns for detecting non-idiomatic code for each pythonic idiom. Finally, we devise atomic AST-rewriting operations and refactoring steps to refactor non-idiomatic code into idiomatic code. We tested and reviewed over 4,115 refactorings applied to 1,065 Python projects from GitHub, and submitted 90 pull requests for 90 randomly sampled refactorings to 84 projects. These evaluations confirm the high accuracy, practicality, and usefulness of our refactoring tool on real-world Python code.
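
For instance, the list comprehension is one classic pythonic idiom; a refactoring of the kind described would rewrite an explicit accumulation loop into its comprehension form (illustrative example, ours, not necessarily one of the paper's nine idioms):

```python
# Non-idiomatic: building a list with an explicit loop.
squares = []
for x in range(10):
    if x % 2 == 0:
        squares.append(x * x)

# Idiomatic: the equivalent list-comprehension form such a
# refactoring would produce.
squares = [x * x for x in range(10) if x % 2 == 0]
```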

Publisher's Version
An Empirical Study of Blockchain System Vulnerabilities: Modules, Types, and Patterns
Xiao Yi, Daoyuan Wu, Lingxiao Jiang, Yuzhou Fang, Kehuan Zhang, and Wei Zhang
(Chinese University of Hong Kong, China; Singapore Management University, Singapore; Nanjing University of Posts and Telecommunications, China)
Blockchain, as a distributed ledger technology, has become increasingly popular, especially for enabling valuable cryptocurrencies and smart contracts. However, blockchain software systems inevitably contain many bugs. Although bugs in smart contracts have been extensively investigated, security bugs of the underlying blockchain systems are much less explored. In this paper, we conduct an empirical study of blockchain system vulnerabilities in four representative blockchains: Bitcoin, Ethereum, Monero, and Stellar. Specifically, we first design a systematic filtering process to effectively identify 1,037 vulnerabilities and their 2,317 patches from 34,245 issues/PRs (pull requests) and 85,164 commits on GitHub. We thus build the first blockchain vulnerability dataset, which is available at https://github.com/VPRLab/BlkVulnDataset. We then perform unique analyses of this dataset at three levels, including (i) file-level vulnerable module categorization by identifying and correlating module paths across projects, (ii) text-level vulnerability type clustering by natural language processing and similarity-based sentence clustering, and (iii) code-level vulnerability pattern analysis by generating and clustering code change signatures that capture both syntactic and semantic information of patch code fragments.
Our analyses reveal three key findings: (i) some blockchain modules are more susceptible than others; notably, each of the modules related to consensus, wallet, and networking has over 200 issues; (ii) about 70% of blockchain vulnerabilities are of traditional types, but we also identify four new types specific to blockchains; and (iii) we obtain 21 blockchain-specific vulnerability patterns that capture unique blockchain attributes and statuses, and demonstrate that they can be used to detect similar vulnerabilities in other popular blockchains, such as Dogecoin, Bitcoin SV, and Zcash.

Publisher's Version
How to Better Utilize Code Graphs in Semantic Code Search?
Yucen Shi, Ying Yin, Zhengkui Wang, David Lo, Tao Zhang, Xin Xia, Yuhai Zhao, and Bowen Xu
(Northeastern University, China; Singapore Institute of Technology, Singapore; Singapore Management University, Singapore; Macau University of Science and Technology, China; Huawei, China)
Semantic code search greatly facilitates software reuse, enabling users to find code snippets that closely match user-specified natural language queries. Due to the rich expressive power of code graphs (e.g., the control-flow graph and program dependency graph), both mainstream research lines (i.e., multi-modal models and pre-trained models) have attempted to incorporate code graphs for code modelling. However, they still have some limitations: first, there is still much room for improvement in search effectiveness; second, they have not fully considered the unique features of code graphs.
In this paper, we propose a Graph-to-Sequence Converter, namely G2SC. By converting code graphs into lossless sequences, G2SC makes it possible to address the problem of small-graph learning using sequence feature learning, and to capture both the edge and node attribute information of code graphs. Thus, the effectiveness of code search can be greatly improved. In particular, G2SC first converts the code graph into a unique corresponding node sequence by a specific graph traversal strategy. Then, it obtains a statement sequence by replacing each node with its corresponding statement. A set of carefully designed graph traversal strategies guarantees that the process is one-to-one and reversible. G2SC captures rich semantic relationships (i.e., control flow, data flow, node/relationship properties) and provides learning model-friendly data transformation. It can be flexibly integrated with existing models to better utilize the code graphs. As a proof-of-concept application, we present two G2SC-enabled models: GSMM (G2SC-enabled multi-modal model) and GSCodeBERT (G2SC-enabled CodeBERT model). Extensive experiment results on two real large-scale datasets demonstrate that GSMM and GSCodeBERT can greatly improve the state-of-the-art models MMAN and GraphCodeBERT by 92% and 22% on R@1, and 63% and 11.5% on MRR, respectively.
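
The conversion can be sketched with a deterministic traversal over a toy code graph (our simplification; G2SC's actual strategies handle general graphs losslessly): emit each node's statement, keep edge labels inline, and emit back-references for revisited nodes so the sequence stays invertible.

```python
def graph_to_sequence(entry, edges, statements):
    """Deterministic DFS that linearizes a small code graph into a
    statement sequence, keeping edge labels inline and emitting
    back-references so the traversal remains invertible."""
    seq, seen = [], set()

    def visit(node):
        if node in seen:
            seq.append(f"<ref:{node}>")     # back-reference, no re-expansion
            return
        seen.add(node)
        seq.append(statements[node])
        for label, succ in sorted(edges.get(node, [])):
            seq.append(f"<{label}>")        # edge label kept inline
            visit(succ)

    visit(entry)
    return seq

statements = {0: "x = read()", 1: "if x > 0", 2: "y = x", 3: "print(y)"}
edges = {0: [("flow", 1)], 1: [("true", 2)], 2: [("flow", 3)]}
print(graph_to_sequence(0, edges, statements))
```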

Publisher's Version
23 Shades of Self-Admitted Technical Debt: An Empirical Study on Machine Learning Software
David OBrien, Sumon Biswas, Sayem Imtiaz, Rabe Abdalkareem, Emad Shihab, and Hridesh Rajan
(Iowa State University, USA; Carleton University, Canada; Concordia University, Canada)
In software development, the term “technical debt” (TD) is used to characterize short-term solutions and workarounds implemented in source code that may incur a long-term cost. Technical debt takes a variety of forms and can thus affect multiple qualities of software, including but not limited to its legibility, performance, and structure. In this paper, we have conducted a comprehensive study of technical debt in machine learning (ML)-based software. TD can appear differently in ML software by infecting the data that ML models are trained on, thus affecting the functional behavior of ML systems. The growing inclusion of ML components in modern software systems has introduced a new set of TDs. Does ML software have TDs similar to traditional software? If not, what are the new types of ML-specific TDs? In which ML pipeline stages do these debts appear? Do these debts differ between ML tools and applications, and when are they removed? Currently, we do not know the state of ML TDs in the wild. To address these questions, we mined 68,820 self-admitted technical debts (SATDs) from all the revisions of a curated dataset consisting of 2,641 popular ML repositories from GitHub, along with their introduction and removal. By applying an open-coding scheme and following up on prior works, we provide a comprehensive taxonomy of ML SATDs. Our study analyzes ML SATD type organizations, their frequencies within stages of ML software, and the differences between ML SATDs in applications and tools, and quantifies the removal of ML SATDs. These findings suggest implications for ML developers and researchers seeking to create maintainable ML systems.

Publisher's Version Artifacts Functional

Program Analysis II

NeuDep: Neural Binary Memory Dependence Analysis
Kexin Pei, Dongdong She, Michael Wang, Scott Geng, Zhou Xuan, Yaniv David, Junfeng Yang, Suman Jana, and Baishakhi Ray
(Columbia University, USA; Massachusetts Institute of Technology, USA; Purdue University, USA)
Determining whether multiple instructions can access the same memory location is a critical task in binary analysis. It is challenging because statically computing precise alias information is undecidable in theory. The problem is aggravated at the binary level due to the presence of compiler optimizations and the absence of symbols and types. Existing approaches either produce significant spurious dependencies due to conservative analysis or scale poorly to complex binaries.
We present a new machine-learning-based approach to predict memory dependencies by exploiting the model's learned knowledge about how binary programs execute. Our approach features (i) a self-supervised procedure that pretrains a neural net to reason over binary code and its dynamic value flows through memory addresses, followed by (ii) supervised finetuning to infer the memory dependencies statically. To facilitate efficient learning, we develop dedicated neural architectures to encode the heterogeneous inputs (i.e., code, data values, and memory addresses from traces) with specific modules and fuse them with a composition learning strategy.
We implement our approach in NeuDep and evaluate it on 41 popular software projects compiled by 2 compilers, 4 optimizations, and 4 obfuscation passes. We demonstrate that NeuDep is more precise (1.5x) and faster (3.5x) than the current state-of-the-art. Extensive probing studies on security-critical reverse engineering tasks suggest that NeuDep understands memory access patterns, learns function signatures, and is able to match indirect calls. All these tasks either assist or benefit from inferring memory dependencies. Notably, NeuDep also outperforms the current state-of-the-art on these tasks.

Publisher's Version
DynaPyt: A Dynamic Analysis Framework for Python
Aryaz Eghbali and Michael Pradel
(University of Stuttgart, Germany)
Python is a widely used programming language that powers important application domains such as machine learning, data analysis, and web applications. For many programs in these domains, it is important to analyze aspects such as security and performance, and given Python’s dynamic nature, it is crucial to be able to analyze Python programs dynamically. However, existing tools and frameworks do not provide the means to implement dynamic analyses easily, so practitioners resort to implementing ad-hoc dynamic analyses for their own use cases. This work presents DynaPyt, the first general-purpose framework for heavy-weight dynamic analysis of Python programs. Compared to existing tools for other programming languages, our framework provides a wider range of analysis hooks, arranged in a hierarchical structure, which allows developers to implement analyses concisely. DynaPyt also features selective instrumentation and execution modification. We evaluate our framework on the test suites of 9 popular open-source Python projects, 1,268,545 lines of code in total, and show that it, by and large, preserves the semantics of the original execution. The running time of DynaPyt is between 1.2x and 16x the original execution time, which is in line with similar frameworks designed for other languages, and 5.6%–88.6% faster than analyses built on Python's built-in tracing API. We also implement multiple analyses, showing the simplicity of implementing them and some potential use cases of DynaPyt. The implemented analyses include one that detects memory blow-up in PyTorch programs, a taint analysis that detects SQL injections, and an analysis that warns about a runtime performance anti-pattern.
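
As a sketch of what a hook-based dynamic analysis can look like, the hypothetical class below counts binary operations and flags suspicious string concatenation; the hook name and structure are assumptions for illustration, not DynaPyt's actual API.

    # Hypothetical hook-based analysis in the spirit of DynaPyt.
    class SqlConcatAnalysis:
        def __init__(self):
            self.binary_ops = 0

        # assumed hook: invoked by the framework after every instrumented binary operation
        def binary_operation(self, op, left, right, result):
            self.binary_ops += 1
            if op == "+" and isinstance(result, str) and "SELECT" in result.upper():
                print("possible SQL string built via concatenation:", result[:60])
            return result  # an analysis may also modify execution by returning a new value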

Publisher's Version Info Artifacts Functional
Cross-Language Android Permission Specification
Chaoran Li, Xiao Chen, Ruoxi Sun, Minhui Xue, Sheng Wen, Muhammad Ejaz Ahmed, Seyit Camtepe, and Yang Xiang
(Swinburne University of Technology, Australia; Monash University, Australia; University of Adelaide, Australia; CSIRO’s Data61, Australia)
The Android system manages access to sensitive APIs via permission enforcement. An application (app) must declare the proper permissions before invoking specific Android APIs. However, to date there is no official documentation providing the complete list of permission-protected APIs and their corresponding permissions. Researchers have spent significant effort extracting such API protection mappings from the Android framework, leveraging static code analysis to determine whether specific permissions are required before accessing an API. Nevertheless, none of them has attempted to analyze the protection mapping in the native libraries (i.e., code written in C and C++), an essential component of the Android framework that handles communication with lower-level hardware, such as cameras and sensors. While the protection mapping can be utilized to detect various security vulnerabilities in Android apps, such as permission over-privilege, an imprecise mapping will lead to false results when detecting such vulnerabilities. To fill this gap, we propose to construct the protection mapping involved in the native libraries of the Android framework, yielding a complete and accurate specification of Android API protection. We develop a prototype system, named NatiDroid, to facilitate the cross-language static analysis, and compare its performance with two state-of-the-practice tools, Axplorer and Arcade. We evaluate NatiDroid on more than 11,000 Android apps, including system apps from custom Android ROMs and third-party apps from Google Play. NatiDroid can identify up to 464 new API-permission mappings, in contrast to the worst-case results derived from both Axplorer and Arcade, where approximately 71% of apps have at least one false positive in permission over-privilege detection. We have disclosed all detected potential vulnerabilities to the relevant stakeholders.

Publisher's Version
Peahen: Fast and Precise Static Deadlock Detection via Context Reduction
Yuandao Cai, Chengfeng Ye, Qingkai Shi, and Charles Zhang
(Hong Kong University of Science and Technology, China; Ant Group, China)
Deadlocks still inflict severe reliability and security issues upon modern software systems. Worse still, in prior static deadlock detectors, good precision does not go hand-in-hand with high scalability: their approaches are either context-insensitive, engendering many false positives, or suffer from calling-context explosion in pursuit of context sensitivity, compromising efficiency. In this paper, we present Peahen, geared towards precise yet scalable static deadlock detection. At its crux, Peahen decomposes the computational effort for achieving high precision into two cooperative analysis stages: (i) context-insensitive lock-graph construction, which selectively encodes the essential lock-acquisition information on each edge, and (ii) three precise yet lazy refinements, which incorporate this edge information into progressively refining the deadlock cycles in the lock graph, only for a few interesting calling contexts.

Our extensive experiments yield promising results: Peahen dramatically outperforms the state-of-the-art tools on accuracy without losing scalability; it can efficiently check million-line systems at a low false-positive rate; and it has uncovered many confirmed deadlocks in dozens of mature open-source systems.
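
The core object here is the lock graph: an edge x -> y records that some thread acquires lock y while holding lock x, and a cycle indicates a potential deadlock. The following self-contained Python sketch illustrates only this cycle-detection core, not Peahen's edge encoding or its context-sensitive refinements.

    # Minimal lock-graph cycle check: an edge (x, y) means "acquire y while holding x".
    from collections import defaultdict

    def has_deadlock_cycle(edges):
        graph = defaultdict(list)
        for held, acquired in edges:
            graph[held].append(acquired)
        WHITE, GRAY, BLACK = 0, 1, 2
        color = defaultdict(int)

        def dfs(node):
            color[node] = GRAY
            for succ in graph[node]:
                if color[succ] == GRAY or (color[succ] == WHITE and dfs(succ)):
                    return True
            color[node] = BLACK
            return False

        return any(color[n] == WHITE and dfs(n) for n in list(graph))

    # Thread 1 takes A then B; thread 2 takes B then A -> cycle -> potential deadlock.
    print(has_deadlock_cycle([("A", "B"), ("B", "A")]))  # True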

Publisher's Version

Collaboration

A Case Study of Implicit Mentoring, Its Prevalence, and Impact in Apache
Zixuan Feng, Amreeta Chatterjee, Anita Sarma, and Iftekhar Ahmed
(Oregon State University, USA; University of California at Irvine, USA)
Mentoring is traditionally viewed as a dyadic, top-down apprenticeship. This perspective, however, overlooks other forms of informal mentoring that take place in everyday activities in which developers invest time and effort. Here, we investigate informal mentoring in Open Source Software (OSS). We define a specific type of informal mentoring, implicit mentoring: situations where contributors guide others through instructions and suggestions embedded in everyday OSS activities. We arrived at this definition by first reviewing related work on mentoring, and then conducting formative interviews with OSS contributors and member-checking. Next, through an empirical investigation of pull requests (PRs) in 37 Apache projects, we built a classifier to extract instances of implicit mentoring. Our analysis of 107,895 PRs shows that implicit mentoring does occur through code reviews (27.41% of all PRs included implicit mentoring) and is beneficial for both mentors and mentees. We analyzed the impact of implicit mentoring on OSS contributors by investigating their contributions and learning trajectories in their projects. Through an online survey (N=231), we then triangulated these results and identified the potential benefits of implicit mentoring from OSS contributors’ perspectives.

Publisher's Version
Software Security during Modern Code Review: The Developer’s Perspective
Larissa Braz and Alberto Bacchelli
(University of Zurich, Switzerland)
To avoid software vulnerabilities, organizations are shifting security to earlier stages of software development, such as code review time. In this paper, we aim to understand developers’ perspectives on assessing software security during code review, the challenges they encounter, and the support that companies and projects provide. To this end, we conduct a two-step investigation: we interview 10 professional developers and survey 182 practitioners about software security assessment during code review. The outcome is an overview of how developers perceive software security during code review and a set of identified challenges. Our study reveals that most developers do not, unprompted, report focusing on security issues during code review. Only after being asked explicitly about software security do developers state that they always consider it during review and acknowledge its importance. Most companies do not provide security training, yet still expect developers to ensure security during reviews. Accordingly, developers report a lack of training and security knowledge as the main challenges they face when checking for security issues. In addition, they face challenges with third-party libraries and with identifying interactions between parts of the code that could have security implications. Moreover, security may be disregarded during reviews due to developers’ assumptions about the security dynamics of the application they develop. Preprint: https://arxiv.org/abs/2208.04261 Data and materials: https://doi.org/10.5281/zenodo.6969369

Publisher's Version Info
Program Merge Conflict Resolution via Neural Transformers
Alexey Svyatkovskiy, Sarah Fakhoury, Negar Ghorbani, Todd Mytkowicz, Elizabeth Dinella, Christian Bird, Jinu Jang, Neel Sundaresan, and Shuvendu K. Lahiri
(Microsoft, USA; Washington State University, USA; University of California at Irvine, USA; Microsoft Research, USA; University of Pennsylvania, USA)
Collaborative software development is an integral part of the modern software development life cycle, essential to the success of large-scale software projects. When multiple developers make concurrent changes around the same lines of code, a merge conflict may occur. Such conflicts stall pull requests and continuous integration pipelines for hours to several days, seriously hurting developer productivity. To address this problem, we introduce MergeBERT, a novel neural program merge framework based on token-level three-way differencing and a transformer encoder model. By exploiting the restricted nature of merge conflict resolutions, we reformulate the task of generating the resolution sequence as a classification task over a set of primitive merge patterns extracted from real-world merge commit data. Our model achieves 63–68% accuracy for merge resolution synthesis, yielding nearly a 3× performance improvement over existing semi-structured merge tools and a 2× improvement over neural program merge tools. Finally, we demonstrate that MergeBERT is sufficiently flexible to work with source code files in the Java, JavaScript, TypeScript, and C# programming languages. To measure the practical value of MergeBERT, we conducted a user study with 25 developers from large OSS projects, evaluating MergeBERT's suggestions on 122 real-world conflicts they encountered. Results suggest that, in practice, MergeBERT's resolutions would be accepted at a higher rate than estimated by automatic metrics for precision and accuracy. Additionally, we use participant feedback to identify future avenues for improving MergeBERT.
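
To illustrate the reformulation, a conflict maps to a small set of primitive resolution candidates over which a classifier chooses. The pattern set below is an invented simplification; MergeBERT's labels are extracted from real merge commit data.

    # Enumerate primitive resolution candidates for a three-way merge conflict.
    def resolution_candidates(base, ours, theirs):
        # Each candidate is one "primitive merge pattern"; a classifier
        # (MergeBERT uses a transformer encoder) picks among such patterns.
        return {
            "take_ours": ours,
            "take_theirs": theirs,
            "ours_then_theirs": ours + theirs,
            "theirs_then_ours": theirs + ours,
            "take_base": base,
        }

    cands = resolution_candidates(["x = 1"], ["x = 2"], ["x = 1", "log(x)"])
    print(cands["ours_then_theirs"])  # ['x = 2', 'x = 1', 'log(x)']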

Publisher's Version

Security

Automated Unearthing of Dangerous Issue Reports
Shengyi Pan, Jiayuan Zhou, Filipe Roseiro Cogo, Xin Xia, Lingfeng Bao, Xing Hu, Shanping Li, and Ahmed E. Hassan
(Zhejiang University, China; Huawei, Canada; Huawei, China; Queen’s University, Canada)
The coordinated vulnerability disclosure (CVD) process is commonly adopted for open source software (OSS) vulnerability management; it recommends reporting discovered vulnerabilities privately and keeping relevant information secret until the official disclosure. However, in practice, for various reasons (e.g., lack of security domain expertise or of a sense of security management), many vulnerabilities are first reported via public issue reports (IRs) before their official disclosure. Such IRs are dangerous, since attackers can take advantage of the leaked vulnerability information to launch zero-day attacks. It is crucial to identify such dangerous IRs at an early stage, so that OSS users can start the vulnerability remediation process earlier and OSS maintainers can manage the dangerous IRs in a timely manner. In this paper, we propose and evaluate a deep learning based approach, MemVul, to automatically identify dangerous IRs at the time they are reported. MemVul augments neural networks with a memory component that stores external vulnerability knowledge from the Common Weakness Enumeration (CWE). We rely on publicly accessible CVE-referred IRs (CIRs) to operationalize the concept of a dangerous IR. We mine 3,937 CIRs distributed across 1,390 OSS projects hosted on GitHub. Evaluated under a practical scenario of high data imbalance, MemVul achieves the best trade-off between precision and recall among all baselines. In particular, the F1-score of MemVul (0.49) improves on the best-performing baseline by 44%. For IRs that are predicted as CIRs but not reported to CVE, we conduct a user study to investigate their usefulness to OSS stakeholders. We observe that 82% (41 out of 50) of these IRs are security-related, and security experts suggest that 28 of them should be publicly disclosed, indicating that MemVul is capable of identifying undisclosed dangerous IRs.

Publisher's Version
On the Vulnerability Proneness of Multilingual Code
Wen Li, Li Li, and Haipeng Cai
(Washington State University, USA; Monash University, Australia)
Software construction using multiple languages has long been the norm, yet it is still unclear whether multilingual code construction has significant security implications and real security consequences. This paper aims to address this question with a large-scale study of popular multi-language projects on GitHub and their evolution histories, enabled by our novel techniques for multilingual code characterization. We found statistically significant associations between the proneness of multilingual code to vulnerabilities (in general and of specific categories) and its language selection. We also found that this association correlates with the language interfacing mechanism rather than with individual languages. We validated our statistical findings with in-depth case studies on actual vulnerabilities, explained via the interfacing mechanism and language selection. Our results call for immediate actions to assess and defend against multilingual vulnerabilities, for which we provide practical recommendations.

Publisher's Version Artifacts Reusable
Tracking Patches for Open Source Software Vulnerabilities
Congying Xu, Bihuan Chen, Chenhao Lu, Kaifeng Huang, Xin Peng, and Yang Liu
(Fudan University, China; Nanyang Technological University, Singapore)
Open source software (OSS) vulnerabilities threaten the security of software systems that use OSS. Vulnerability databases provide valuable information (e.g., vulnerable versions and patches) for mitigating OSS vulnerabilities, but there is growing concern about the quality of the information they contain. It remains unclear what the quality of patches in existing vulnerability databases is, and existing manual or heuristic-based approaches for patch tracking are either too expensive or too specific to apply to all OSS vulnerabilities.

To address these problems, we first conduct an empirical study to understand the quality and characteristics of patches for OSS vulnerabilities in two industrial vulnerability databases. Inspired by our study, we then propose the first automated approach, Tracer, to track patches for OSS vulnerabilities across multiple knowledge sources. Our evaluation demonstrates that i) Tracer can track patches for up to 273.8% more vulnerabilities than heuristic-based approaches while achieving an F1-score up to 116.8% higher; and ii) Tracer can complement industrial vulnerability databases. Our evaluation also indicates the generality and practical usefulness of Tracer.

Publisher's Version
DeJITLeak: Eliminating JIT-Induced Timing Side-Channel Leaks
Qi Qin, JulianAndres JiYang, Fu Song, Taolue Chen, and Xinyu Xing
(ShanghaiTech University, China; Birkbeck University of London, UK; Northwestern University, USA)
Timing side-channels can be exploited to infer secret information when the execution time of a program is correlated with secrets. Recent work has shown that Just-In-Time (JIT) compilation can introduce new timing side-channels into programs even if they are time-balanced at the source code level. In this paper, we propose a novel approach to eliminating JIT-induced leaks. We first formalise timing side-channel security under JIT compilation via the notion of time-balancing, laying the foundation for reasoning about programs under JIT compilation. We then propose to eliminate JIT-induced leaks via fine-grained JIT compilation. To this end, we provide an automated approach to generating compilation policies and a novel type system to guarantee their soundness. We develop DeJITLeak, a tool for real-world Java, and implement the fine-grained JIT compilation in the HotSpot JVM. Experimental results show that DeJITLeak can effectively and efficiently eliminate JIT-induced leaks on three widely adopted benchmarks in the setting of side-channel detection.

Publisher's Version Info Artifacts Reusable

Dependability

Quantitative Relational Modelling with QAlloy
Pedro Silva, José N. Oliveira, Nuno Macedo, and Alcino Cunha
(University of Minho, Portugal; INESC TEC, Portugal; University of Porto, Portugal)
Alloy is a popular language and tool for formal software design. A key factor in this popularity is its relational logic, an elegant specification language with a minimal syntax and semantics. However, many software problems nowadays involve both structural and quantitative requirements, and Alloy's relational logic is not well suited to reasoning about the latter. This paper introduces QAlloy, an extension of Alloy with quantitative relations that add integer quantities to associations between domain elements. Having integers internalised in relations, instead of being explicit domain elements as in standard Alloy, allows quantitative requirements to be specified in QAlloy with an elegance similar to that of structural requirements, with the side-effect of providing basic dimensional analysis support via the type system. The QAlloy Analyzer also implements an SMT-based engine that enables quantities to be unbounded, thus avoiding many problems that may arise with the current bounded integer semantics of Alloy.

Publisher's Version Artifacts Functional
Demystifying the Underground Ecosystem of Account Registration Bots
Yuhao Gao, Guoai Xu, Li Li, Xiapu Luo, Chenyu Wang, and Yulei Sui
(University of Technology Sydney, Australia; Beijing University of Posts and Telecommunications, China; Harbin Institute of Technology, China; Monash University, Australia; Hong Kong Polytechnic University, China)
Member services are a core part of most online systems. For example, member services in online social networks and video platforms make it possible to serve users customized content or track their footprints for recommendations. However, there is a dark side to membership, one that lurks behind influencer marketing, coupon harvesting, and the spreading of fake news. All these activities rely heavily on owning masses of fake accounts, and to create new accounts efficiently, malicious registrants use automated registration bots with anti-human verification services that can easily bypass a website’s security strategies.
In this paper, we take the first step toward understanding the underground ecosystem of account registration bots and, in particular, the anti-human verification services they use. From a comprehensive analysis, we determined the three most popular types of anti-human verification services. We then conducted experiments on these services from an attacker’s perspective to verify their effectiveness. The results show that all of them can easily bypass the security strategies website providers put in place to prevent fake registrations, such as SMS verification, CAPTCHAs, and IP monitoring. We further estimated the market size of the underground registration ecosystem, placing it at roughly US$4.8–128.1 million per year. Our study demonstrates the urgency of rethinking the effectiveness of current registration security strategies and should prompt the development of new strategies for better protection.

Publisher's Version
Using Graph Neural Networks for Program Termination
Yoav Alon and Cristina David
(University of Bristol, UK)
Termination analyses investigate the termination behavior of programs, intending to detect nontermination, which is known to cause a variety of program bugs (e.g., hanging programs, denial-of-service vulnerabilities). Beyond formal approaches, various attempts have been made to estimate the termination behavior of programs using neural networks. However, the majority of these approaches continue to rely on formal methods to provide strong soundness guarantees and consequently suffer from similar limitations. In this paper, we move away from formal methods and embrace the stochastic nature of machine learning models. Instead of aiming for rigorous guarantees that can be interpreted by solvers, our objective is to provide an estimate of a program's termination behavior, and of the likely reason for nontermination (when applicable), that a programmer can use for debugging purposes. Compared to previous approaches using neural networks for program termination, we also take advantage of the graph representation of programs by employing Graph Neural Networks. To further assist programmers in understanding and debugging nontermination bugs, we adapt the notions of attention and semantic segmentation, previously used in other application domains, to programs. Overall, we design and implement classifiers for program termination based on Graph Convolutional Networks and Graph Attention Networks, as well as a semantic segmentation Graph Neural Network that localizes AST nodes likely to cause nontermination. We also illustrate how the information provided by semantic segmentation can be combined with program slicing to further aid debugging.

Publisher's Version Artifacts Functional

Program Repair/Synthesis

PyTER: Effective Program Repair for Python Type Errors
Wonseok Oh and Hakjoo Oh
(Korea University, South Korea)
We present PyTER, an automated program repair (APR) technique for Python type errors. Python developers struggle with type error exceptions, which are prevalent and difficult to fix. Despite their importance, however, automatically repairing type errors in dynamically typed languages such as Python has received little attention in the APR community, and no existing techniques are readily available for practical use. PyTER is the first technique carefully designed to fix diverse type errors in real-world Python applications. To this end, we present a novel APR approach that uses dynamic and static analyses to infer correct and incorrect types of program variables, and leverages their difference to effectively identify faulty locations and patch candidates. We evaluated PyTER on 93 type errors collected from open-source projects. The result shows that PyTER is able to fix 48.4% of them with a precision of 77.6%.
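
As a minimal illustration of the bug class and the shape of patch involved (both snippets are invented, not drawn from the paper's benchmark):

    # Buggy: `age` arrives as a string, so the comparison raises TypeError.
    def is_adult_buggy(age):
        return age > 18        # TypeError if age == "21"

    # Patched: coerce to the type that dynamic/static analysis infers as correct.
    def is_adult_fixed(age):
        return int(age) > 18   # candidate patch at the faulty location

    print(is_adult_fixed("21"))  # True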

Publisher's Version Artifacts Reusable
VulRepair: A T5-Based Automated Software Vulnerability Repair
Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Dinh Phung
(Monash University, Australia; University of Adelaide, Australia)
As software vulnerabilities grow in volume and complexity, researchers have proposed various Artificial Intelligence (AI)-based approaches to help under-resourced security analysts find, detect, and localize vulnerabilities. However, security analysts still have to spend a huge amount of effort manually fixing or repairing such vulnerable functions. Recent work proposed an NMT-based automated vulnerability repair approach, but it is still far from perfect due to various limitations. In this paper, we propose VulRepair, a T5-based automated software vulnerability repair approach that leverages pre-training and BPE components to address the technical limitations of prior work. Through an extensive experiment with 8,482 vulnerability fixes from 1,754 real-world software projects, we find that VulRepair achieves a perfect prediction rate of 44%, which is 13%-21% more accurate than competitive baseline approaches. These results lead us to conclude that VulRepair is considerably more accurate than the two baseline approaches, highlighting the substantial advancement of NMT-based automated vulnerability repair. Our additional investigation shows that VulRepair can accurately repair as many as 745 out of 1,706 real-world well-known vulnerabilities (e.g., Use After Free, Improper Input Validation, OS Command Injection), demonstrating the practicality and significance of VulRepair for generating vulnerability repairs and helping under-resourced security analysts fix vulnerabilities.

Publisher's Version Artifacts Reusable
DeepDev-PERF: A Deep Learning-Based Approach for Improving Software Performance
Spandan Garg, Roshanak Zilouchian Moghaddam, Colin B. Clement, Neel Sundaresan, and Chen Wu
(Microsoft, USA; Microsoft, China)
Improving software performance is an important yet challenging part of the software development cycle. Today, the majority of performance inefficiencies are identified and patched by performance experts. Recent advancements in deep learning approaches and the widespread availability of open-source data create a great opportunity to automate the identification and patching of performance problems. In this paper, we present DeepDev-PERF, a transformer-based approach to suggesting performance improvements for C# applications. We pretrain DeepDev-PERF on English and source code corpora, followed by finetuning for the task of generating performance improvement patches for C# applications. Our evaluation shows that our model can generate the same performance improvement suggestion as the developer fix in 53% of cases.

Publisher's Version
Less Training, More Repairing Please: Revisiting Automated Program Repair via Zero-Shot Learning
Chunqiu Steven Xia and Lingming Zhang
(University of Illinois at Urbana-Champaign, USA)
Due to the promising future of Automated Program Repair (APR), researchers have proposed various APR techniques, including heuristic-based, template-based, and constraint-based techniques. Among these classic APR techniques, template-based ones have been widely recognized as the state of the art. However, such template-based techniques require predefined templates to perform repair, which limits their effectiveness. To this end, researchers have leveraged recent advances in Deep Learning to further improve APR. Such learning-based techniques typically view APR as a Neural Machine Translation (NMT) problem, using the buggy/fixed code snippets as the source/target languages for translation. In this way, these techniques rely heavily on large numbers of high-quality bug-fixing commits, which can be extremely costly and challenging to construct, and this reliance may limit their edit variety and context representation.
In this paper, we aim to revisit the learning-based APR problem and propose AlphaRepair, the first cloze-style (or infilling-style) APR approach, which directly leverages large pre-trained code models for APR without any fine-tuning/retraining on historical bug fixes. Our main insight is that, instead of modeling what a repair edit should look like (an NMT task), we can directly predict what the correct code is based on the context information (a cloze, or text infilling, task). Although our approach is general and can be built on various pre-trained code models, we have implemented AlphaRepair as a practical multilingual APR tool based on the recent CodeBERT model. Our evaluation of AlphaRepair on the widely used Defects4J benchmark shows for the first time that learning-based APR without any historical bug fixes can already outperform state-of-the-art APR techniques. We also studied the impact of different design choices and show that AlphaRepair performs even better on a newer version of Defects4J (2.0), with 3.3X more fixes than the best-performing baseline, indicating that AlphaRepair can potentially avoid the dataset-overfitting issue of existing techniques. Additionally, we demonstrate the multilingual repair ability of AlphaRepair by evaluating it on the QuixBugs dataset, where it achieves state-of-the-art results on both the Java and Python versions.
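
The cloze formulation can be sketched with an off-the-shelf masked language model: mask a suspicious token and rank the model's infillings as patch candidates. The sketch below assumes the Hugging Face transformers package and the microsoft/codebert-base-mlm checkpoint are available; AlphaRepair itself adds repair-specific mask templates and candidate re-ranking.

    # Cloze-style repair sketch: ask a masked LM for replacements of a suspicious token.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="microsoft/codebert-base-mlm")  # assumed checkpoint
    buggy_line = "if (x <mask> 0) return -x;"  # suspect the comparison operator
    for cand in fill(buggy_line)[:3]:
        print(cand["token_str"], cand["score"])  # ranked candidate tokens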

Publisher's Version
NL2Viz: Natural Language to Visualization via Constrained Syntax-Guided Synthesis
Zhengkai Wu, Vu Le, Ashish Tiwari, Sumit Gulwani, Arjun Radhakrishna, Ivan Radiček, Gustavo Soares, Xinyu Wang, Zhenwen Li, and Tao Xie
(University of Illinois at Urbana-Champaign, USA; Microsoft, USA; University of Michigan, USA; Peking University, China)
Recent developments in NL2CODE (Natural Language to Code) research allow end-users, especially novice programmers, to create concrete implementations of their ideas, such as data visualizations, by providing natural language (NL) instructions. An NL2CODE system often fails to achieve its goal due to three major challenges: the user's words have contextual semantics, the user may not include all details needed for code generation, and the system's results are imperfect and require further refinement. To address these three challenges for NL-to-visualization, we propose a new approach and its supporting tool, named NL2VIZ, with three salient features: (1) leveraging not only the user's NL input but also the data and program context that the NL query operates on, (2) using hard/soft constraints to reflect different confidence levels in the constraints retrieved from the user input and the data/program context, and (3) providing support for result refinement and reuse.
We implement NL2VIZ in the Jupyter Notebook environment and evaluate it on a real-world visualization benchmark and a public dataset to show its effectiveness. We also conduct a user study involving 6 professional data scientists to demonstrate the usability of NL2VIZ, the readability of the generated code, and NL2VIZ's effectiveness in helping users generate desired visualizations effectively and efficiently.
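
For concreteness, the kind of program such a system synthesizes for a query like "show average price per category as a bar chart" looks roughly as follows; the DataFrame and column names are invented stand-ins for the data context NL2VIZ would recover.

    # Illustrative target program for an NL query over a pandas DataFrame.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({"category": ["a", "a", "b"], "price": [1.0, 3.0, 2.0]})
    df.groupby("category")["price"].mean().plot(kind="bar")  # synthesized visualization
    plt.ylabel("average price")
    plt.show()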

Publisher's Version

Online Presentations

AccessiText: Automated Detection of Text Accessibility Issues in Android Apps
Abdulaziz Alshayban and Sam Malek
(University of California at Irvine, USA)
For the 15% of the world population with disabilities, accessibility is arguably the most critical software quality attribute. The growing reliance of users with disabilities on mobile apps to complete their day-to-day tasks further stresses the need for accessible software. Mobile operating systems, such as iOS and Android, provide various integrated assistive services to help individuals with disabilities perform tasks that could otherwise be difficult or impossible. However, for these assistive services to work correctly, developers have to support them in their apps by following a set of best practices and accessibility guidelines. The Text Scaling Assistive Service (TSAS) is utilized by people with low vision to increase the text size and make apps accessible to them. However, using TSAS with incompatible apps can result in unexpected behavior that introduces accessibility barriers. This paper presents AccessiText, an automated testing technique for text accessibility issues arising from incompatibility between apps and TSAS. As a first step, we identify five different types of text accessibility issues by analyzing more than 600 candidate issues reported by users in (i) app reviews for Android and iOS, and (ii) Twitter data collected from public Twitter accounts. To automatically detect such issues, AccessiText utilizes UI screenshots and various metadata extracted using dynamic analysis, and then applies heuristics informed by the issue types identified earlier. An evaluation of AccessiText on 30 real-world Android apps corroborates its effectiveness, achieving 88.27% precision and 95.76% recall on average in detecting text accessibility issues.

Publisher's Version
Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems
Zeyan Li, Nengwen Zhao, Mingjie Li, Xianglin Lu, Lixin Wang, Dongdong Chang, Xiaohui Nie, Li Cao, Wenchi Zhang, Kaixin Sui, Yanhua Wang, Xu Du, Guoqiang Duan, and Dan Pei
(Tsinghua University, China; China Construction Bank, China; BizSeer, China)
Fault localization is challenging in an online service system due to the large volume and variety of its monitoring data and the complex dependencies across and within its components (e.g., services or databases). Furthermore, engineers require fault localization solutions to be actionable and interpretable, which existing research approaches cannot satisfy. Therefore, the common industry practice is that, for a specific online service system, its experienced engineers focus on localizing recurring failures based on the knowledge accumulated about the system and its historical failures. More specifically, 1) they can identify the underlying root causes and take mitigation actions once they pinpoint a group of indicative metrics on the faulty component; and 2) their diagnosis knowledge is roughly based on how one failure might affect the components of the whole system.
Although the above common practice is actionable and interpretable, it is largely manual, and thus slow and sometimes inaccurate. In this paper, we aim to automate this practice through machine learning. That is, we propose an actionable and interpretable fault localization approach, DejaVu, for recurring failures in online service systems. For a specific online service system, DejaVu takes historical failures and dependencies in the system as input and trains a localization model offline; for an incoming failure, the trained model recommends online where the failure occurs (i.e., the faulty components) and which kind of failure occurs (i.e., the indicative group of metrics), making it actionable; these recommendations are further interpreted both globally and locally, making it interpretable. Based on an evaluation on 601 failures from three production systems and one open-source benchmark, DejaVu ranks the ground truths, in less than one second, at position 1.66 to 5.03 on average among a long candidate list, outperforming baselines by 54.52%.

Publisher's Version
AUGER: Automatically Generating Review Comments with Pre-training Models
Lingwei Li, Li Yang, Huaxi Jiang, Jun Yan, Tiejian Luo, Zihan Hua, Geng Liang, and Chun Zuo
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Wuhan University, China; Sinosoft, China)
Code review is one of the best practices and a powerful safeguard for software quality. In practice, senior or highly skilled reviewers inspect source code and provide constructive comments, considering what authors may have overlooked, for example, special cases. This collaborative validation between contributors results in higher-quality code with less chance of bugs. However, since personal knowledge is limited and varies, the efficiency and effectiveness of code review practice are worthy of further improvement. In fact, delivering useful review comments still takes colossal, time-consuming effort. This paper explores a synergy of multiple practical review comments to enhance code review and proposes AUGER (AUtomatically GEnerating Review comments): a review comment generator built on pre-training models. We first collect empirical review data from 11 notable Java projects and construct a dataset of 10,882 code changes. By leveraging Text-to-Text Transfer Transformer (T5) models, the framework synthesizes valuable knowledge in the training stage and effectively outperforms baselines by 37.38% in ROUGE-L. According to criteria from prior studies, 29% of our automatic review comments are considered useful. Inference completes in just 20 seconds, and the model remains open to further training. Moreover, a thorough case-study analysis confirms these performance improvements.

Publisher's Version
Automatically Deriving JavaScript Static Analyzers from Specifications using Meta-level Static Analysis
Jihyeok Park, Seungmin An, and Sukyoung Ryu
(Oracle, Australia; KAIST, South Korea)
JavaScript is one of the most dominant programming languages. However, despite its popularity, it is a challenging task to correctly understand the behaviors of JavaScript programs because of their highly dynamic nature. Researchers have developed various static analyzers that strive to conform to ECMA-262, the standard specification of JavaScript. Unfortunately, all the existing JavaScript static analyzers require manual updates for new language features. This problem has become more critical since 2015 because the JavaScript language itself rapidly evolves with a yearly release cadence and open development process.
In this paper, we present JSAVER, the first tool that automatically derives JavaScript static analyzers from language specifications. The main idea of our approach is to extract a definitional interpreter from ECMA-262 and perform a meta-level static analysis with the extracted interpreter. Meta-level static analysis is a novel technique that indirectly analyzes programs by analyzing a definitional interpreter together with the programs. We also describe how to indirectly configure abstract domains and analysis sensitivities in a meta-level static analysis. For evaluation, we derived a static analyzer from the latest ECMA-262 (ES12, 2021) using JSAVER. The derived analyzer soundly analyzed all 18,556 applicable official conformance tests with 99.0% precision in 590 ms on average. In addition, we demonstrate the configurability and adaptability of JSAVER with several case studies.

Publisher's Version Artifacts Reusable
Automating Code Review Activities by Large-Scale Pre-training
Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundaresan
(Peking University, China; Microsoft Research, China; Sun Yat-sen University, China; LinkedIn, USA; Microsoft, USA)
Code review is an essential part of the software development lifecycle, since it aims to guarantee code quality. Modern code review activities require developers to view, understand, and even run the programs to assess logic, functionality, latency, style, and other factors. As a result, developers have to spend far too much time reviewing the code of their peers, and there is significant demand for automating the code review process. In this research, we focus on utilizing pre-training techniques for tasks in the code review scenario. We collect a large-scale dataset of real-world code changes and code reviews from open-source projects in nine of the most popular programming languages. To better understand code diffs and reviews, we propose CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically to the code review scenario. To evaluate our model, we focus on three key tasks related to code review activities: code change quality estimation, review comment generation, and code refinement. Furthermore, we establish a high-quality benchmark dataset based on our collected data for these three tasks and conduct comprehensive experiments on it. The experimental results demonstrate that our model outperforms previous state-of-the-art pre-training approaches on all tasks. Further analysis shows that our proposed pre-training tasks and the multilingual pre-training dataset benefit the model's understanding of code changes and reviews.

Publisher's Version
Corporate Dominance in Open Source Ecosystems: A Case Study of OpenStack
Yuxia Zhang, Klaas-Jan Stol, Hui Liu, and Minghui Zhou
(Beijing Institute of Technology, China; Lero, Ireland; University College Cork, Ireland; Peking University, China)
Corporate participation plays an increasing role in Open Source Software (OSS) development. Unlike volunteers in OSS projects, companies are driven by business objectives. To pursue corporate interests, companies may try to dominate the development direction of OSS projects. One company's domination in OSS may 'crowd out' other contributors, changing the nature of the project and jeopardizing the sustainability of the OSS ecosystem. Prior studies of corporate involvement in OSS have primarily focused on predominantly positive aspects such as business strategies, contribution models, and collaboration patterns. However, there is a scarcity of research on the potential drawbacks of corporate engagement. In this paper, we investigate corporate dominance in OSS ecosystems. Drawing on the field of Economics, we quantify company domination using a dominance measure, and we investigate the prevalence, patterns, and impact of domination throughout the evolution of the OpenStack ecosystem. We find evidence of company domination in over 73% of the repositories in OpenStack, and approximately 25% of companies dominate one or more repositories per version. We identify five patterns of corporate dominance: Early incubation, Full-time hosting, Growing domination, Occasional domination, and Last remaining. We find that domination has a significantly negative relationship with the survival probability of OSS projects. This study provides insights for building sustainable relationships between companies and the OSS ecosystems in which they seek to get involved.

Publisher's Version
Detecting Simulink Compiler Bugs via Controllable Zombie Blocks Mutation
Shikai Guo, He Jiang, Zhihao Xu, Xiaochen Li, Zhilei Ren, Zhide Zhou, and Rong Chen
(Dalian Maritime University, China; Dalian University of Technology, China)
As a popular Cyber-Physical System (CPS) development tool chain, MathWorks Simulink is widely used to prototype CPS models in safety-critical applications, e.g., aerospace and healthcare. It is crucial to ensure the correctness and reliability of the Simulink compiler (i.e., the compiler module of Simulink) in practice, since all CPS models depend on compilation. However, Simulink compiler testing is challenging due to the compiler's millions of lines of source code and the lack of a complete formal language specification. Although several methods have been proposed to automatically test the Simulink compiler, two challenges remain to be tackled: the limited variant space and insufficient mutation diversity. To address these challenges, we propose COMBAT, a new differential testing method for Simulink compiler testing. COMBAT includes an EMI (Equivalence Modulo Input) mutation component and a diverse variant generation component. The EMI mutation component inserts assertion statements (e.g., If/While blocks) at arbitrary points of the seed CPS model. These statements break each insertion point into true and false branches. COMBAT then feeds all data passing through the insertion point into the true branch, preserving the equivalence of CPS variants. In this way, the body of the false branch can be viewed as a new variant space, addressing the first challenge. The diverse variant generation component uses Markov chain Monte Carlo optimization to sample the seed CPS model and generate complex mutations consisting of long sequences of blocks in the variant space, addressing the second challenge. Experiments demonstrate that COMBAT significantly outperforms state-of-the-art approaches in Simulink compiler testing. Within five months, COMBAT reported 16 valid bugs for Simulink R2021b, of which 11 have been confirmed as new bugs by MathWorks Support.
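
The EMI idea transfers to any executable artifact: wrap the original computation in an always-taken true branch and treat the never-executed false branch as free space for mutations. A Python analogy of the concept (COMBAT applies it to Simulink blocks, not Python code):

    # Equivalence-modulo-inputs mutation: behavior on real inputs is unchanged
    # because the inserted guard always holds for the profiled inputs.
    def complicated_dead_code(x):
        return 0  # stands in for a generated mutation of arbitrary blocks

    def original(x):
        return x * x + 1

    def emi_variant(x):
        if x == x:                            # assertion that always holds: "true branch"
            return x * x + 1                  # original computation, preserved
        else:
            return complicated_dead_code(x)   # "variant space": never executed

    assert all(original(i) == emi_variant(i) for i in range(100))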

Publisher's Version
Diet Code Is Healthy: Simplifying Programs for Pre-trained Models of Code
Zhaowei Zhang, Hongyu Zhang, Beijun Shen, and Xiaodong Gu
(Shanghai Jiao Tong University, China; University of Newcastle, Australia)
Pre-trained code representation models such as CodeBERT have demonstrated superior performance in a variety of software engineering tasks, yet they are computationally heavy, with cost growing quadratically in the length of the input sequence. Our empirical analysis of CodeBERT's attention reveals that it pays more attention to certain types of tokens and statements, such as keywords and data-relevant statements. Based on these findings, we propose DietCode, which aims at lightweight use of large pre-trained models for source code. DietCode simplifies the input program of CodeBERT with three strategies, namely word dropout, frequency filtering, and an attention-based strategy that selects the statements and tokens receiving the most attention weight during pre-training. This yields a substantial reduction in computational cost without hampering model performance. Experimental results on two downstream tasks show that DietCode provides results comparable to CodeBERT with 40% less computational cost in fine-tuning and testing.
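
The attention-based strategy can be approximated as top-k selection over tokens by attention weight, as in the illustrative sketch below; DietCode additionally works at statement granularity and combines this with word dropout and frequency filtering.

    # Keep the tokens a pre-trained model attends to most; drop the rest.
    def prune_by_attention(tokens, attn_weights, keep_ratio=0.6):
        k = max(1, int(len(tokens) * keep_ratio))
        keep = sorted(range(len(tokens)), key=lambda i: attn_weights[i], reverse=True)[:k]
        return [tokens[i] for i in sorted(keep)]  # preserve original token order

    tokens = ["def", "f", "(", "x", ")", ":", "return", "x", "+", "1"]
    weights = [0.9, 0.4, 0.1, 0.6, 0.1, 0.1, 0.8, 0.6, 0.5, 0.5]
    print(prune_by_attention(tokens, weights))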

Publisher's Version
Do Bugs Lead to Unnaturalness of Source Code?
Yanjie Jiang, Hui Liu, Yuxia Zhang, Weixing Ji, Hao Zhong, and Lu Zhang
(Beijing Institute of Technology, China; Shanghai Jiao Tong University, China; Peking University, China)
Texts in natural languages are highly repetitive and predictable because of the naturalness of natural languages. Recent research validated that source code in programming languages is also repetitive and predictable, and that naturalness is an inherent property of source code. It was also reported that buggy code is significantly less natural than bug-free code, and that bug fixing substantially improves the naturalness of the involved source code. In this paper, we revisit the naturalness of buggy code and investigate the effect of bug fixing on the naturalness of source code. Unlike existing investigations, we leverage two large-scale and high-quality bug repositories in which bug-irrelevant changes in bug-fixing commits have been explicitly excluded. Our evaluation results confirm that buggy lines are often less natural than bug-free ones. However, fixing bugs does not significantly improve the naturalness of the involved code lines: fixed lines are, on average, as unnatural as buggy ones. Consequently, bugs are not the root cause of the unnaturalness of source code, and it can be inaccurate to identify buggy code lines solely by the naturalness of source code. Our evaluation results suggest that naturalness-based buggy line detection yields extremely low precision (less than one percent).
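
Naturalness is typically operationalized as the cross-entropy of a code line under a language model trained on a code corpus: the higher the cross-entropy, the less natural the line. A self-contained bigram sketch of this measurement (actual studies use stronger n-gram or neural models):

    import math
    from collections import Counter

    def bigram_cross_entropy(line_tokens, corpus_tokens):
        unigrams = Counter(corpus_tokens)
        bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
        vocab = len(unigrams)
        total = 0.0
        pairs = list(zip(line_tokens, line_tokens[1:]))
        for a, b in pairs:
            p = (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)  # add-one smoothed P(b | a)
            total -= math.log2(p)
        return total / max(1, len(pairs))

    corpus = "x = x + 1 ; y = y + 1 ;".split()
    print(bigram_cross_entropy("x = x + 1 ;".split(), corpus))   # relatively natural
    print(bigram_cross_entropy("x = y ++ 2 @".split(), corpus))  # less natural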

Publisher's Version
Generating Realistic Vulnerabilities via Neural Code Editing: An Empirical Study
Yu Nong, Yuzhe Ou, Michael Pradel, Feng Chen, and Haipeng Cai
(Washington State University, USA; University of Texas at Dallas, USA; University of Stuttgart, Germany)
The availability of large-scale, realistic vulnerability datasets is essential both for benchmarking existing techniques and for developing effective new data-driven approaches to software security. Yet such datasets are critically lacking. A promising solution is to generate such datasets by injecting vulnerabilities into real-world programs, which are richly available. Thus, in this paper, we explore the feasibility of vulnerability injection through neural code editing. With a synthetic dataset and a real-world one, we investigate the potential and gaps of three state-of-the-art neural code editors for vulnerability injection. We find that the studied editors have critical limitations on the real-world dataset, where the best accuracy is only 10.03%, versus 79.40% on the synthetic dataset. While the graph-based editors are more effective (successfully injecting vulnerabilities in up to 34.93% of real-world testing samples) than the sequence-based one (0 successes), they still struggle with complex code structures and fall short on long edits due to insufficient design of their preprocessing and deep learning (DL) models. We reveal the promise of neural code editing for generating realistic vulnerable samples, as they help boost the effectiveness of DL-based vulnerability detectors by up to 49.51% in terms of F1 score. We also provide insights into the gaps in current editors (e.g., they are good at deleting but not at replacing code) and actionable suggestions for addressing them (e.g., designing effective editing primitives).

Publisher's Version Artifacts Functional
Generic Sensitivity: Customizing Context-Sensitive Pointer Analysis for Generics
Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Yongheng Huang, Lian Li, and Lin Gao
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TianqiSoft, China)
Generic programming is extensively used in object-oriented languages such as Java. However, existing context-sensitive pointer analyses perform poorly when analyzing generics. This paper introduces generic sensitivity, a new context customization scheme targeting generics. We design our context customization scheme in such a way that generic instantiation sites, i.e., locations instantiating generic classes/methods with concrete types, are always preserved as key context elements. This is realized by augmenting contexts with a type variable lookup map, which is efficiently updated during the analysis in a context-sensitive manner.
We have implemented different variants of generic-sensitive analysis in Wala, and experimental results show that our context customization scheme can significantly improve the performance and precision of context-sensitive pointer analyses. For instance, generic context customization significantly improves the precision of a 1-object-sensitive analysis, with an average speedup of 1.8×. In addition, generic context customization enables a 1-object-sensitive analysis to achieve overall better precision than a 2-object-sensitive analysis, with an average speedup of 12.6× (62× for chart).

Publisher's Version
MAAT: A Novel Ensemble Approach to Addressing Fairness and Performance Bugs for Machine Learning Software
Zhenpeng Chen, Jie M. Zhang, Federica Sarro, and Mark Harman
(University College London, UK; King’s College London, UK)
Machine Learning (ML) software can lead to unfair and unethical decisions, making software fairness bugs an increasingly significant concern for software engineers. However, addressing fairness bugs often comes at the cost of introducing more ML performance (e.g., accuracy) bugs. In this paper, we propose MAAT, a novel ensemble approach to improving the fairness-performance trade-off for ML software. Conventional ensemble methods combine different models with identical learning objectives. MAAT, instead, combines models optimized for different objectives: fairness and ML performance. We conduct an extensive evaluation of MAAT with 5 state-of-the-art methods, 9 software decision tasks, and 15 fairness-performance measurements. The results show that MAAT significantly outperforms the state of the art. In particular, MAAT beats the trade-off baseline constructed by a recent benchmarking tool in 92.2% of the overall cases evaluated, 12.2 percentage points more than the best technique currently available. Moreover, the superiority of MAAT over the state of the art holds on all the tasks and measurements that we study. We have made the code and data of this work publicly available to allow for future replication and extension.
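
The ensemble idea can be sketched as score-level combination of two models trained for different objectives. The averaging rule below is an illustrative assumption (and assumes NumPy), not necessarily MAAT's exact combination strategy.

    import numpy as np

    # probs_perf: predictions of a model optimized for ML performance;
    # probs_fair: predictions of a model trained on debiased data for fairness.
    def maat_style_combine(probs_perf, probs_fair):
        return (np.asarray(probs_perf) + np.asarray(probs_fair)) / 2.0

    probs_perf = [0.92, 0.30, 0.65]   # P(positive) per individual
    probs_fair = [0.70, 0.45, 0.60]
    decisions = maat_style_combine(probs_perf, probs_fair) >= 0.5
    print(decisions)  # [ True False  True]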

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Minerva: Browser API Fuzzing with Dynamic Mod-Ref Analysis
Chijin Zhou, Quan Zhang, Mingzhe Wang, Lihua Guo, Jie Liang, Zhe Liu, Mathias Payer, and Yu Jiang
(Tsinghua University, China; Nanjing University of Aeronautics and Astronautics, China; EPFL, Switzerland)
Browser APIs are essential to the modern web experience. Due to their large number and complexity, they vastly expand the attack surface of browsers. To detect vulnerabilities in these APIs, fuzzers generate test cases with large numbers of random API invocations. However, the massive search space formed by arbitrary API combinations hinders their effectiveness: since randomly picked API invocations are unlikely to interfere with each other (i.e., compute on partially shared data), few interesting API interactions are explored. Consequently, reducing the search space by revealing inter-API relations is a major challenge in browser fuzzing.
We propose Minerva, an efficient browser fuzzer for browser API bug detection. The key idea is to leverage API interference relations to reduce redundancy and improve coverage. Minerva consists of two modules: dynamic mod-ref analysis and guided code generation. Before fuzzing starts, the dynamic mod-ref analysis module builds an API interference graph. It first automatically identifies individual browser APIs from the browser’s code base. Next, it instruments the browser to dynamically collect mod-ref relations between APIs. During fuzzing, the guided code generation module synthesizes highly relevant API invocations guided by the mod-ref relations. We evaluate Minerva on three mainstream browsers: Safari, Firefox, and Chromium. Compared to state-of-the-art fuzzers, Minerva improves edge coverage by 19.63% to 229.62% and finds 2x to 3x more unique bugs. Moreover, Minerva has discovered 35 previously unknown bugs, of which 20 have been fixed, with 5 CVEs assigned and acknowledged by browser vendors.
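
Guided generation can be sketched as biased sampling over the API interference graph: after emitting one API call, prefer APIs whose mod/ref sets overlap with it. The graph contents and API names below are invented for illustration.

    import random

    # interference[a] = APIs whose mod/ref sets overlap with a's (assumed, for illustration)
    interference = {
        "canvas.getContext": ["ctx.fillRect", "ctx.drawImage"],
        "ctx.fillRect": ["ctx.drawImage"],
        "ctx.drawImage": ["canvas.toDataURL"],
        "canvas.toDataURL": [],
    }

    def generate_sequence(start, length, bias=0.8):
        seq = [start]
        for _ in range(length - 1):
            related = interference.get(seq[-1], [])
            if related and random.random() < bias:
                seq.append(random.choice(related))              # follow an interference edge
            else:
                seq.append(random.choice(list(interference)))   # occasional random jump
        return seq

    random.seed(0)
    print(generate_sequence("canvas.getContext", 5))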

Publisher's Version
NMTSloth: Understanding and Testing Efficiency Degradation of Neural Machine Translation Systems
Simin Chen, Cong Liu, Mirazul Haque, Zihe Song, and Wei Yang
(University of Texas at Dallas, USA)
Neural Machine Translation (NMT) systems have received much recent attention due to their human-level accuracy. While existing work mostly focuses on either improving accuracy or testing accuracy robustness, the computational efficiency of NMT systems, which is of paramount importance due to often vast translation demands and real-time requirements, has surprisingly received little attention. In this paper, we make the first attempt to understand and test the potential computational-efficiency robustness of state-of-the-art NMT systems. By analyzing the working mechanisms and implementations of 1,455 publicly accessible NMT systems, we observe a fundamental property of NMT systems that can be manipulated in an adversarial manner to significantly reduce computational efficiency. Our key observation is that the output length, rather than the input, determines the computational efficiency of an NMT system, and that the output length depends on two factors: an often sufficiently large yet pessimistic pre-configured threshold controlling the maximum number of iterations, and a runtime-generated end-of-sentence (EOS) token. Our key idea is therefore to generate test inputs that sufficiently delay the generation of EOS, so that NMT systems have to go through enough iterations to hit the pre-configured threshold. We present NMTSloth, a gradient-guided technique that searches for a minimal and unnoticeable perturbation at the character, token, and structure levels which sufficiently delays the appearance of EOS and forces these inputs to reach the naturally unreachable threshold. To demonstrate the effectiveness of NMTSloth, we conduct a systematic evaluation on three publicly available NMT systems: Google T5, AllenAI WMT14, and Helsinki-NLP translators. Experimental results show that NMTSloth can increase NMT systems' response latency and energy consumption by 85% to 3153% and 86% to 3052%, respectively, by perturbing just one character or token in the input sentence. Our case study shows that inputs generated by NMTSloth significantly affect battery power on real-world mobile devices (i.e., they drain more than 30 times the battery power of normal inputs).

Publisher's Version
Putting Them under Microscope: A Fine-Grained Approach for Detecting Redundant Test Cases in Natural Language
Zhiyuan Chang, Mingyang Li, Junjie Wang, Qing Wang, and Shoubin Li
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Natural language (NL) documentation is the bridge between software managers and testers, and NL test cases are prevalent in system-level testing and other quality assurance activities. For reasons such as requirements redundancy, parallel testing, and tester turnover over a long evolution history, there are inevitably many redundant test cases, which significantly increase costs. Previous redundancy detection approaches typically treat the textual descriptions as a whole when comparing their similarity, and consequently suffer from low precision. Our observation is that a test case contains explicit test-oriented entities, such as the tested function Components, Constraints, etc., and that there are specific relations between these entities. This suggests an opportunity for more accurate redundancy detection. In this paper, we first define five test-oriented entity categories and four associated relation categories, and re-formulate NL test case redundancy detection as the comparison of detailed testing content guided by the test-oriented entities and relations. Following that, we propose Tscope, a fine-grained approach for redundant NL test case detection that dissects test cases into atomic test tuple(s) of entities restricted by their associated relations. To support this dissection, Tscope designs a context-aware model for automatic entity and relation extraction. An evaluation on 3,467 test cases from ten projects shows that Tscope achieves 91.8% precision, 74.8% recall, and 82.4% F1, significantly outperforming state-of-the-art approaches and commonly used classifiers. This new formulation of the NL test case redundancy detection problem can motivate follow-up studies to further improve this task and other related tasks involving NL descriptions.
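
Once test cases are dissected into atomic tuples, redundancy detection becomes a set comparison over (entity, relation, entity) triples rather than whole-text similarity. A minimal sketch with hand-made tuples (Tscope extracts such tuples automatically with its context-aware model):

    def tuple_overlap(tuples_a, tuples_b):
        a, b = set(tuples_a), set(tuples_b)
        return len(a & b) / len(a | b)  # Jaccard similarity over atomic test tuples

    t1 = {("login_button", "act_on", "login_page"), ("password", "constraint", "empty")}
    t2 = {("login_button", "act_on", "login_page"), ("password", "constraint", "too_long")}
    print(tuple_overlap(t1, t2))  # ~0.33: shared component/action, different constraint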

Publisher's Version
RULER: Discriminative and Iterative Adversarial Training for Deep Neural Network Fairness
Guanhong Tao, Weisong Sun, Tingxu Han, Chunrong Fang, and Xiangyu Zhang
(Purdue University, USA; Nanjing University, China)
Deep Neural Networks (DNNs) are becoming an integral part of many real-world applications, such as autonomous driving and financial management. While these models enable autonomy, there are, however, concerns regarding the ethics of their decision making. For instance, fairness is an aspect that requires particular attention. A number of fairness testing techniques have been proposed to address this issue, e.g., by generating test cases called individual discriminatory instances for repairing DNNs. Although they have demonstrated great potential, they tend to generate many test cases that are not directly effective in improving fairness and incur substantial computation overhead. We propose a new model repair technique, RULER, which distinguishes sensitive and non-sensitive attributes during test case generation for model repair. The generated cases are then used in training to improve DNN fairness. RULER balances the trade-off between accuracy and fairness by decomposing the training procedure into two phases and introducing a novel iterative adversarial training method for fairness. Compared to state-of-the-art techniques on four datasets, RULER generates 7-28 times more effective repair test cases, is 10-15 times faster in test generation, and achieves 26-43% more fairness improvement on average.

Publisher's Version
SamplingCA: Effective and Efficient Sampling-Based Pairwise Testing for Highly Configurable Software Systems
Chuan Luo, Qiyuan Zhao, Shaowei Cai, Hongyu Zhang, and Chunming Hu
(Beihang University, China; Shanghai Jiao Tong University, China; Institute of Software at Chinese Academy of Sciences, China; University of Newcastle, Australia)
Combinatorial interaction testing (CIT) is an effective paradigm for testing highly configurable systems, and its goal is to generate a t-wise covering array (CA) as a test suite, where t is the strength of testing. Pairwise testing (i.e., CIT with t=2) is recognized as the most common CIT technique and has high fault detection capability in practice. The problem of pairwise CA generation (PCAG), a core problem in pairwise testing, aims at generating a pairwise CA (i.e., 2-wise CA) of minimum size, subject to hard constraints. The PCAG problem is a hard combinatorial optimization problem, which urgently requires practical methods for generating pairwise CAs (PCAs) of small size. However, existing PCAG algorithms suffer from a severe scalability issue: when solving large-scale PCAG instances, state-of-the-art PCAG algorithms usually take a long time to generate large PCAs, which makes the testing of highly configurable systems both ineffective and inefficient. In this paper, we propose a novel and effective sampling-based approach dubbed SamplingCA for solving the PCAG problem. SamplingCA first utilizes sampling techniques to obtain a small test suite that covers as many valid pairwise tuples as possible, and then adds a few more test cases into the test suite to ensure that all valid pairwise tuples are covered. Extensive experiments on 125 public PCAG instances show that our approach can generate much smaller PCAs than its state-of-the-art competitors, indicating the effectiveness of SamplingCA. Also, our experiments show that SamplingCA runs one to two orders of magnitude faster than its competitors, demonstrating the efficiency of SamplingCA. Our results confirm that SamplingCA is able to address the scalability issue and considerably pushes forward the state of the art in PCAG solving.
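The two-stage idea can be illustrated with a toy sketch. The following is a minimal illustration, not SamplingCA's actual algorithm: hard constraints are omitted and candidate configurations are drawn uniformly at random; stage 1 samples configurations to cover most pairwise tuples cheaply, and stage 2 greedily completes coverage.

```python
import itertools, random

def pairs_of(config):
    """All pairwise tuples, i.e., (option-index, value) pairs fixed by a configuration."""
    return {((i, config[i]), (j, config[j]))
            for i, j in itertools.combinations(range(len(config)), 2)}

def build_pairwise_suite(n_options, values, n_samples=50, rng=random.Random(0)):
    candidates = [tuple(rng.choice(values) for _ in range(n_options))
                  for _ in range(2000)]
    all_pairs = set().union(*(pairs_of(c) for c in candidates))
    suite, covered = [], set()
    for _ in range(n_samples):                    # stage 1: sampling
        c = rng.choice(candidates)
        if pairs_of(c) - covered:
            suite.append(c)
            covered |= pairs_of(c)
    while covered != all_pairs:                   # stage 2: greedy completion
        c = max(candidates, key=lambda x: len(pairs_of(x) - covered))
        suite.append(c)
        covered |= pairs_of(c)
    return suite

suite = build_pairwise_suite(n_options=6, values=[0, 1])  # e.g., six boolean options
```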

Publisher's Version Artifacts Reusable
SPINE: A Scalable Log Parser with Feedback Guidance
Xuheng Wang, Xu Zhang, Liqun Li, Shilin He, Hongyu Zhang, Yudong Liu, Lingling Zheng, Yu Kang, Qingwei Lin, Yingnong Dang, Saravanakumar Rajmohan, and Dongmei Zhang
(Tsinghua University, China; Microsoft Research, China; University of Newcastle, Australia; Microsoft Azure, USA; Microsoft 365, USA)
Log parsing, which extracts log templates and parameters, is a critical prerequisite step for automated log analysis techniques. Though existing log parsers have achieved promising accuracy on public log datasets, they still face many challenges when applied in industry. Through studying the characteristics of real-world log data and analyzing the limitations of existing log parsers, we identify two problems. Firstly, it is non-trivial to scale a log parser to a vast number of logs, especially in real-world scenarios where the log data is extremely imbalanced. Secondly, existing log parsers overlook the importance of user feedback, which is imperative for parser fine-tuning under the continuous evolution of log data. To overcome these challenges, we propose SPINE, a highly scalable log parser with user feedback guidance. Based on our log parser equipped with initial grouping and progressive clustering, we propose a novel log data scheduling algorithm to improve the efficiency of parallelization on large-scale imbalanced log data. In addition, we introduce user feedback to make the parser adapt quickly to evolving logs. We evaluated SPINE on 16 public log datasets. SPINE achieves more than 0.90 parsing accuracy on average with the highest parsing efficiency, outperforming state-of-the-art log parsers. We also evaluated SPINE in the production environment of Microsoft, in which it can parse 30 million logs in less than 8 minutes with 16 executors, achieving near real-time performance. In addition, our evaluations show that SPINE can consistently achieve good accuracy under log evolution with a moderate amount of user feedback.
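The scheduling problem the paper addresses can be made concrete with a classic greedy sketch. The following is a hypothetical illustration, not SPINE's algorithm: it balances imbalanced log groups across parallel executors by always handing the next-largest group to the least-loaded executor (longest-processing-time-first).

```python
import heapq

def schedule(groups, n_executors):
    """groups: dict group-name -> number of logs; returns executor -> [group names]."""
    heap = [(0, i) for i in range(n_executors)]   # (current load, executor id)
    heapq.heapify(heap)
    assignment = {i: [] for i in range(n_executors)}
    for name, size in sorted(groups.items(), key=lambda kv: -kv[1]):
        load, ex = heapq.heappop(heap)            # least-loaded executor so far
        assignment[ex].append(name)
        heapq.heappush(heap, (load + size, ex))
    return assignment
```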

Publisher's Version
SymMC: Approximate Model Enumeration and Counting using Symmetry Information for Alloy Specifications
Wenxi Wang, Yang Hu, Kenneth L. McMillan, and Sarfraz Khurshid
(University of Texas at Austin, USA)
Specifying and analyzing critical properties of software systems plays an important role in the development of reliable systems. Alloy is a mature tool-set that provides a first-order relational logic for writing specifications, and a fully automatic powerful backend for analyzing the specifications. It has been widely applied in areas including verification, security, and synthesis.
Symmetry breaking is a useful approach for pruning the search space to efficiently check the satisfiability of combinatorial problems. As the backend solver of Alloy, Kodkod performs partial symmetry breaking (PaSB) for Alloy specifications. While full symmetry breaking remains challenging to scale, a recent study showed that Kodkod PaSB could significantly reduce model counting time, albeit at the cost of producing only partial model counts. However, the desired term is either the isomorphic count under no symmetry breaking, or the non-isomorphic models/count under full symmetry breaking. This paper presents an approach called SymMC, which utilizes symmetry information to compute all the desired terms for Alloy specifications. To make SymMC scalable, we propose approximate algorithms based on sampling to estimate the desired terms. We show that our proposed estimators have consistency and upper-bound properties. To our knowledge, SymMC is the first approach that automatically approximates non-isomorphic model enumeration/counting for Alloy specifications. Thanks to the non-isomorphic model counting, SymMC also provides the first automatic quantification measurement of the solution-space pruning ability of Kodkod PaSB. Furthermore, empirical evaluations show that SymMC provides a competitive isomorphic counting approach for Alloy specifications compared to state-of-the-art model counters.

Publisher's Version
TraceCRL: Contrastive Representation Learning for Microservice Trace Analysis
Chenxi Zhang, Xin Peng, Tong Zhou, Chaofeng Sha, Zhenghui Yan, Yiru Chen, and Hong Yang
(Fudan University, China)
Due to the large amount and high complexity of trace data, microservice trace analysis tasks such as anomaly detection, fault diagnosis, and tail-based sampling widely adopt machine learning technology. These trace analysis approaches usually use a preprocessing step to map structured features of traces to vector representations in an ad-hoc way, and may therefore lose important information such as topological dependencies between service operations. In this paper, we propose TraceCRL, a trace representation learning approach based on contrastive learning and graph neural networks that can incorporate graph-structured information in downstream trace analysis tasks. Given a trace, TraceCRL constructs an operation invocation graph where nodes represent service operations and edges represent operation invocations, together with predefined features for invocation status and related metrics. Based on the operation invocation graphs of traces, TraceCRL uses a contrastive learning method to train a graph neural network-based model for trace representation. In particular, TraceCRL employs six trace data augmentation strategies to alleviate the problems of class collision and uniformity of representation in contrastive learning. Our experimental studies show that TraceCRL can significantly improve the performance of trace anomaly detection and offline trace sampling. They also confirm the effectiveness of the trace augmentation strategies and the efficiency of TraceCRL.
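For readers unfamiliar with the training objective, the following minimal sketch shows a standard InfoNCE-style contrastive loss over already-encoded embeddings. TraceCRL's graph encoder and its six augmentation strategies are beyond this sketch, and pairing each anchor with an augmented view of the same trace graph is an assumption of the illustration.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.5):
    """anchors[i] and positives[i] are embeddings of two views of trace i."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # all-pairs similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # matching views sit on the diagonal
```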

Publisher's Version
You See What I Want You to See: Poisoning Vulnerabilities in Neural Code Search
Yao Wan, Shijie Zhang, Hongyu Zhang, Yulei Sui, Guandong Xu, Dezhong Yao, Hai Jin, and Lichao Sun
(Huazhong University of Science and Technology, China; University of Newcastle, Australia; University of Technology Sydney, Australia; Lehigh University, USA)
Searching and reusing code snippets from open-source software repositories based on natural-language queries can greatly improve programming productivity. Recently, deep-learning-based approaches have become increasingly popular for code search. Despite substantial progress in training accurate models of code search, the robustness of these models has received little attention so far.
In this paper, we aim to study and understand the security and robustness of code search models by answering the following question: Can we inject backdoors into deep-learning-based code search models? If so, can we detect poisoned data and remove these backdoors? This work studies and develops a series of backdoor attacks on the deep-learning-based models for code search, through data poisoning. We first show that existing models are vulnerable to data-poisoning-based backdoor attacks. We then introduce a simple yet effective attack on neural code search models by poisoning their corresponding training dataset.
Moreover, we demonstrate that attacks can also influence the ranking of the code search results by adding a few specially crafted source code files to the training corpus. We show that this type of backdoor attack is effective for several representative deep-learning-based code search systems, and can successfully manipulate the ranking list of search results. Taking the bidirectional RNN-based code search system as an example, the normalized ranking of the target candidate can be significantly raised from the top 50% to the top 4.43%, given a query containing an attacker-targeted word, e.g., "file". To defend a model against such attacks, we empirically examine a popular existing defense strategy and evaluate its performance. Our results show that the explored defense strategy is not yet effective against our proposed backdoor attack on code search systems.

Publisher's Version

Industry

Machine Learning

Nalanda: A Socio-technical Graph Platform for Building Software Analytics Tools at Enterprise Scale
Chandra Maddila, Suhas Shanbhogue, Apoorva Agrawal, Thomas Zimmermann, Chetan Bansal, Nicole Forsgren, Divyanshu Agrawal, Kim Herzig, and Arie van Deursen
(Microsoft Research, USA; Microsoft Research, India; Microsoft, USA; Delft University of Technology, Netherlands)
Software development is information-dense knowledge work that requires collaboration with other developers and awareness of artifacts such as work items, pull requests, and file changes. With the speed of development increasing, information overload and information discovery are challenges for people developing and maintaining these systems. Finding information about similar code changes and experts is difficult for software engineers, especially when they work in large software systems or have just recently joined a project. In this paper, we build a large-scale data platform named Nalanda to address the challenges of information overload and discovery. Nalanda contains two subsystems: (1) a large-scale socio-technical graph system, named the Nalanda graph system, and (2) a large-scale index system, named the Nalanda index system, that aims to satisfy the information needs of software developers. To show the versatility of the Nalanda platform, we built two applications: (1) a software analytics application with a news feed named MyNalanda, which has 290 Daily Active Users (DAU) and 590 Monthly Active Users (MAU), and (2) a recommendation system for related work items and pull requests that accomplished similar tasks (artifact recommendation) and a recommendation system for subject matter experts (expert recommendation), augmented by the Nalanda socio-technical graph. Initial studies of the two applications found that developers and engineering managers are favorable toward continued use of the news feed application for information discovery. The studies also found that developers agreed that a system like the Nalanda artifact and expert recommendation application could reduce the time spent and the number of places needed to visit to find information.

Publisher's Version Info
Uncertainty-Aware Transfer Learning to Evolve Digital Twins for Industrial Elevators
Qinghua Xu, Shaukat Ali, Tao Yue, and Maite Arratibel
(Simula Research Laboratory, Norway; University of Oslo, Norway; Orona, Spain)
Digital twins are increasingly developed to support the development, operation, and maintenance of cyber-physical systems such as industrial elevators. However, industrial elevators continuously evolve due to changes in physical installations, the introduction of new software features, updates to existing ones, and changes due to regulations (e.g., enforcing restricted elevator capacity due to COVID-19). Thus, digital twin functionalities (often built on neural network-based models) need to constantly evolve themselves to remain synchronized with the industrial elevators. Such evolution is preferably automated, as manual evolution is time-consuming and error-prone. Moreover, collecting sufficient data to re-train neural network models of digital twins could be expensive or even infeasible. To this end, we propose unceRtaInty-aware tranSfer lEarning enriched Digital Twins (LATTICE), a transfer-learning-based approach capable of transferring knowledge about the waiting-time prediction capability of a digital twin of an industrial elevator across different scenarios. LATTICE also leverages uncertainty quantification to further improve its effectiveness. To evaluate LATTICE, we conducted experiments with 10 versions of an elevator dispatching software from Orona, Spain, deployed in a Software in the Loop (SiL) environment. Experimental results show that LATTICE, on average, improves the Mean Squared Error by 13.131%, and that the use of uncertainty quantification further improves it by 2.71%.

Publisher's Version
All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs
Vitaliy Bibaev, Alexey Kalina, Vadim Lomshakov, Yaroslav Golubev, Alexander Bezzubov, Nikita Povarov, and Timofey Bryksin
(JetBrains, Serbia; JetBrains, Germany; JetBrains, Russia; JetBrains Research, Serbia; JetBrains, Netherlands; JetBrains Research, Cyprus)
In this work, we propose an approach for collecting completion usage logs from the users in an IDE and using them to train a machine-learning-based model for ranking completion candidates. We developed a set of features that describe completion candidates and their context, and deployed their anonymized collection in the Early Access Program of IntelliJ-based IDEs. We used the logs to collect a dataset of code completions from users, and employed it to train a ranking CatBoost model. Then, we evaluated it in two settings: on a held-out set of the collected completions and in a separate A/B test on two different groups of users in the IDE. Our evaluation shows that using a simple ranking model trained on past user behavior logs significantly improved the code completion experience. Compared to the default heuristics-based ranking, our model demonstrated a decrease in the number of typing actions necessary to perform the completion in the IDE from 2.073 to 1.832.
The approach adheres to privacy requirements and legal constraints, since it does not require collecting personal information, performing all the necessary anonymization on the client side. Importantly, it can be improved continuously: implementing new features, collecting new data, and evaluating new models; this way, we have been using it in production since the end of 2020.
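A hedged sketch of the ranking setup is shown below, using the open-source CatBoost ranking API; the features and numbers are invented for illustration and are not the ones collected from the IDE.

```python
# Hedged sketch with invented features; requires `pip install catboost`.
from catboost import CatBoostRanker, Pool

# One row per completion candidate; rows sharing a group_id belong to one
# completion session, and label 1 marks the candidate the user accepted.
X = [[0.9, 5, 1], [0.2, 12, 0], [0.7, 3, 1], [0.1, 8, 0]]
y = [1, 0, 0, 1]
group_id = [0, 0, 1, 1]

train = Pool(data=X, label=y, group_id=group_id)
model = CatBoostRanker(loss_function="YetiRank", iterations=100, verbose=False)
model.fit(train)
scores = model.predict(X)  # higher score -> ranked higher in the completion popup
```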

Publisher's Version
Testing of Machine Learning Models with Limited Samples: An Industrial Vacuum Pumping Application
Ayan Chatterjee, Bestoun S. Ahmed, Erik Hallin, and Anton Engman
(Karlstad University, Sweden; Uddeholms, Sweden)
There is often a scarcity of training data for machine learning (ML) classification and regression models in industrial production, especially for time-consuming or sparsely run manufacturing processes. Traditionally, a majority of the limited ground-truth data is used for training, while a handful of samples are left for testing. In that case, the number of test samples is inadequate to properly evaluate the robustness of the ML models under test (i.e., the system under test) for classification and regression. Furthermore, the output of these ML models may be inaccurate, or the models may even fail, if the input data differs from what is expected. This is the case for ML models used in the Electroslag Remelting (ESR) process in the refined steel industry to predict the pressure in a vacuum chamber. A vacuum pumping event that occurs once per workday generates only a few hundred samples per year for training and testing. In the absence of adequate training and test samples, this paper first presents a method to generate a fresh set of augmented samples based on vacuum pumping principles. Based on the generated augmented samples, three test scenarios and one test oracle are presented to assess the robustness of an ML model used for production on an industrial scale. Experiments are conducted with real industrial production data obtained from the Uddeholms AB steel company. The evaluations indicate that Ensemble and Neural Network models are the most robust when trained on augmented data using the proposed testing strategy. The evaluation also demonstrates the proposed method's effectiveness in checking and improving the robustness of ML algorithms in such situations. The work advances the state of the art of robustness testing in similar settings. Finally, the paper presents an MLOps implementation of the proposed approach for real-time ML model prediction and action on the edge node and automated continuous delivery of ML software from the cloud.

Publisher's Version
Improving ML-Based Information Retrieval Software with User-Driven Functional Testing and Defect Class Analysis
Junjie Zhu, Teng Long, Wei Wang, and Atif Memon
(Apple, USA)
Machine Learning (ML) has become the cornerstone of information retrieval (IR) software, as it can drive better user experience by leveraging information-rich data and complex models. However, evaluating the emergent behavior of ML-based IR software can be challenging with traditional software testing approaches: when developers modify the software, they often cannot extract useful information from individual test instances; rather, they seek to holistically verify whether—and where—their modifications caused significant regressions or improvements at scale. In this paper, we introduce not only such a holistic approach to evaluate the system-level behavior of the software, but also the concept of a defect class, which represents a partition of the input space on which the ML-based software does measurably worse for an existing feature or on which the ML task is more challenging for a new feature. We leverage large volumes of automatically obtained functional test cases to derive these defect classes, and propose new ways to improve the IR software from an end-user's perspective. Applying our approach to a real production Search-AutoComplete system that contains a query-interpretation ML component, we demonstrate that (1) our holistic metrics successfully identified two regressions and one improvement, all three of which were independently verified with retrospective A/B experiments, (2) the automatically obtained defect classes provided actionable insights during early-stage ML development, and (3) we also detected defect classes with significant regressions at the finer sub-component level, which we blocked prior to different releases.

Publisher's Version

Empirical

What Improves Developer Productivity at Google? Code Quality
Lan Cheng, Emerson Murphy-Hill, Mark Canning, Ciera Jaspan, Collin Green, Andrea Knight, Nan Zhang, and Elizabeth Kammer
(Google, USA)
Understanding what affects software developer productivity can help organizations choose wise investments in their technical and social environment. But the research literature either focuses on what correlates with developer productivity in ecologically valid settings or focuses on what causes developer productivity in highly constrained settings. In this paper, we bridge the gap by studying software developers at Google through two analyses. In the first analysis, we use panel data with 39 productivity factors, finding that code quality, technical debt, infrastructure tools and support, team communication, goals and priorities, and organizational change and process are all causally linked to self-reported developer productivity. In the second analysis, we use a lagged panel analysis to strengthen our causal claims. We find that increases in perceived code quality tend to be followed by increased perceived developer productivity, but not vice versa, providing the strongest evidence to date that code quality affects individual developer productivity.

Publisher's Version Archive submitted (330 kB)
Understanding Why We Cannot Model How Long a Code Review Will Take: An Industrial Case Study
Lawrence Chen, Peter C. Rigby, and Nachiappan Nagappan
(Meta, USA; Concordia University, Canada)
Code review is an effective practice for finding defects, but because it is manually intensive it can slow down the continuous integration of changes. Our goal was to understand the factors that influence the time a change (i.e., a diff at Meta) spends in review. A developer survey showed that diff reviews start to feel slow after they have been waiting for around 24 hours. We built a review time predictor model to identify potential factors that may be causing reviews to take longer, which we could use to predict the best time to nudge reviewers or to identify diff-related factors that we may need to address.
The strongest feature of the time-in-review model we built was the day of the week, because diffs submitted near the weekend may have to wait until Monday for review. After removing time on weekends, the remaining features, including the size of the diff and the number of meetings the reviewers have, did not provide substantial predictive power; as a result, we were unable to predict how long a code review would take.
We contributed to the effort to reduce stale diffs by suggesting that diffs be nudged near the start of the workday and that diffs published near the weekend be nudged sooner on Friday to avoid waiting the entire weekend. We use a nudging threshold rather than a model because we showed that TimeInReview cannot be accurately modelled. The NudgeBot has been rolled out to over 30k developers at Meta.

Publisher's Version
Leveraging Test Plan Quality to Improve Code Review Efficacy
Lawrence Chen, Rui Abreu, Tobi Akomolede, Peter C. Rigby, Satish Chandra, and Nachiappan Nagappan
(Meta Platforms, USA; Concordia University, Canada)
In modern code reviews, many artifacts play roles in knowledge sharing and documentation: summaries, test plans, comments, and so on. Improving developer tools and facilitating better code reviews require an understanding of the quality of pull requests and their artifacts. This is difficult to measure, however, because they are often free-form natural language and unstructured text data. In this paper, we focus on measuring the quality of test plans at Meta. Test plans are used as a communication mechanism between the author of a pull request and its reviewers, serving as walkthroughs to help confirm that the changed code is behaving as expected. We collected developer opinions on over 650 test plans from more than 500 Meta developers, then introduced a transformer-based model to leverage the success of natural language processing (NLP) techniques in the code review domain. In our study, we show that the learned model is able to capture the sentiment of developers and reflect a correlation of test plan quality with review engagement and reversions: compared to a decision tree model, our proposed transformer-based model achieves a 7% higher F1-score. Finally, we present a case study of how such a metric may be useful in experiments to inform improvements in developer tools and experiences.

Publisher's Version
Are Elevator Software Robust against Uncertainties? Results and Experiences from an Industrial Case Study
Liping Han, Tao Yue, Shaukat Ali, Aitor Arrieta, and Maite Arratibel
(Nanjing University of Aeronautics and Astronautics, China; Simula Research Laboratory, Norway; Mondragon University, Spain; Orona, Spain)
Industrial elevator systems are complex Cyber-Physical Systems operating in uncertain environments and experiencing uncertain passenger behaviors, hardware delays, and software errors. Identifying, understanding, and classifying such uncertainties are essential to enable system designers to reason about uncertainties and subsequently develop solutions for empowering elevator systems to deal with uncertainties systematically. To this end, we present a method, called RuCynefin, based on the Cynefin framework, to classify uncertainties in industrial elevator systems from our industrial partner (Orona, Spain), the results of which can then be used to assess their robustness. RuCynefin is equipped with a novel classification algorithm to identify the Cynefin contexts for a variety of uncertainties in industrial elevator systems, and a novel metric for measuring robustness using the uncertainty classification. We evaluated RuCynefin with an industrial case study of 90 dispatchers from Orona to assess their robustness against uncertainties. Results show that RuCynefin could effectively identify several situations for which certain dispatchers were not robust. Specifically, 93% of the dispatcher versions showed some degree of low robustness against uncertainties. We also provide insights on the potential practical usages of RuCynefin, which are useful for practitioners in this field.

Publisher's Version

Community

Achievement Unlocked: A Case Study on Gamifying DevOps Practices in Industry
Patrick Ayoup, Diego Elias Costa, and Emad Shihab
(Concordia University, Canada; Université du Québec à Montréal, Canada)
Gamification is the use of game elements such as points, leaderboards, and badges in a non-game context to encourage a desired behavior from individuals interacting with an environment. Recently, gamification has found its way into software engineering contexts as a means to promote certain activities to practitioners. Previous studies investigated the use of gamification to promote the adoption of a variety of tools and practices; however, these studies were performed either in educational environments or in small to medium-sized teams of developers in industry.
We performed a large-scale mixed-methods study on the effects of badge-based gamification in promoting the adoption of DevOps practices in a very large company and evaluated how practice adoption is associated with changes in key delivery, quality, and throughput metrics of 333 software projects. We observed an accelerated adoption of some gamified DevOps practices by at least 60%, with increased adoption rates up to 6x. We found mixed results when associating badge adoption and metric changes: teams that earned testing badges showed an increase in bug fixing commits but output fewer commits and pull requests; teams that earned code review and quality tooling badges exhibited faster delivery metrics. Finally, our empirical study was supplemented by a survey with 45 developers where 73% of respondents found badges to be helpful for learning about and adopting new standardized practices. Our results contribute to the rich knowledge on gamification with a unique and important perspective from real industry practitioners.

Publisher's Version

Software Evolution

Sometimes You Have to Treat the Symptoms: Tackling Model Drift in an Industrial Clone-and-Own Software Product Line
Christof Tinnes, Wolfgang Rössler, Uwe Hohenstein, Torsten Kühn, Andreas Biesdorf, and Sven Apel
(Siemens, Germany; Siemens Mobility, Germany; Saarland University, Germany)
Many industrial software product lines use a clone-and-own approach for reuse among software products. As a result, the different products in the product line may drift apart, which implies increased efforts for tasks such as change propagation, domain analysis, and quality assurance. While many solutions have been proposed in the literature, these are often difficult to apply in a real-world setting. We study this drift of products in a concrete large-scale industrial model-driven clone-and-own software product line in the railway domain at our industry partner. For this purpose, we conducted interviews and a survey, and we investigated the models in the model history of this project. We found that increased efforts are mainly caused by large model differences and increased communication efforts. We argue that, in the short-term, treating the symptoms (i.e., handling large model differences) can help to keep efforts for software product-line engineering acceptable — instead of employing sophisticated variability management. To treat the symptoms, we employ a solution based on semantic-lifting to simplify model differences. Using the interviews and the survey, we evaluate the feasibility of variability management approaches and the semantic-lifting approach in the context of this project.

Publisher's Version

Program Analysis

Input Splitting for Cloud-Based Static Application Security Testing Platforms
Maria Christakis, Thomas Cottenier, Antonio Filieri, Linghui Luo, Muhammad Numair Mansur, Lee Pike, Nicolás Rosner, Martin Schäf, Aritra Sengupta, and Willem Visser
(MPI-SWS, Germany; Amazon Web Services, USA; Amazon Web Services, Germany)
As software development teams adopt DevSecOps practices, application security is increasingly the responsibility of development teams, who are required to set up their own Static Application Security Testing (SAST) infrastructure.
Since development teams often do not have the necessary infrastructure and expertise to set up a custom SAST solution, there is an increased need for cloud-based SAST platforms that operate as a service and run a variety of static analyzers. Adding a new static analyzer to a cloud-based SAST platform can be challenging because static analyzers vary greatly in complexity, from linters that scale efficiently to interprocedural dataflow engines that use cubic or even more complex algorithms. Careful manual evaluation is needed to decide whether a new analyzer would slow down the overall response time of the platform or may time out too often.
We explore the question of whether this can be simplified by splitting the input to the analyzer into partitions and analyzing the partitions independently. Depending on the complexity of the static analyzer, the partition size can be adjusted to curtail the overall response time. We report on an experiment where we run different analysis tools with and without splitting the inputs. The experimental results show that simple splitting strategies can effectively reduce the running time and memory usage per partition without significantly affecting the findings produced by the tool.
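The splitting idea itself is simple; the following is a minimal sketch under the assumption that partitions are sized by lines of code, with the cap tuned to the analyzer's complexity.

```python
def split_input(files_with_loc, max_partition_loc):
    """files_with_loc: (path, lines-of-code) pairs; returns lists of paths."""
    partitions, current, current_loc = [], [], 0
    for path, loc in sorted(files_with_loc, key=lambda f: -f[1]):
        if current and current_loc + loc > max_partition_loc:
            partitions.append(current)
            current, current_loc = [], 0
        current.append(path)
        current_loc += loc
    if current:
        partitions.append(current)
    return partitions  # each partition is analyzed independently, possibly in parallel
```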

Publisher's Version

Debugging/Localization

Metadata-Based Retrieval for Resolution Recommendation in AIOps
Harshit Kumar, Ruchi Mahindru, and Debanjana Kar
(IBM Research, India; IBM Research, USA)
For a cloud service provider, the goal is to proactively identify signals that can help reduce outages and/or reduce the mean-time-to-detect and mean-time-to-resolve. After an incident is reported, Site Reliability Engineers diagnose the fault and search for a resolution by formulating a textual query to find similar historical incidents; this approach is called text-based retrieval. However, the formulated queries are often short and inadequate. An alternate approach, presented in this paper, integrates information spread across heterogeneous and siloed datasets into a ready-to-use knowledge base for metadata-based resolution retrieval. Additionally, it exploits historical problem context to build metadata prediction models, which are used at runtime to automatically formulate queries from log anomalies detected by the Log Anomaly Detection module. The query thus formed is run against the metadata-based index, rather than the text-based index used in text retrieval, resulting in superior performance in terms of the relevance of the retrieved resolution documents. Through experiments on web application server applications deployed on the cloud, we show the efficacy of metadata-based retrieval, which not only returns more targeted results than text-based retrieval but also ranks the relevant resolution document among the top 3 positions for 60% of the queries.

Publisher's Version

Collaboration

Workgraph: Personal Focus vs. Interruption for Engineers at Meta
Yifen Chen, Peter C. Rigby, Yulin Chen, Kun Jiang, Nader Dehghani, Qianying Huang, Peter Cottle, Clayton Andrews, Noah Lee, and Nachiappan Nagappan
(Meta, USA; Concordia University, Canada)
All engineers dislike interruptions because they take away from the deep focus time needed to write complex code. Our goal is to reduce unnecessary interruptions at Meta. We first describe our Workgraph platform that logs how engineers use our internal work tools at Meta. Using these anonymized logs, we create focus sessions. Focus sessions are defined in opposition to interruptions and are the amount of time until the engineer is interrupted by, for example, a work chat message.
We report descriptive statistics on how long engineers are able to focus. We find that at Meta, engineers have a total of 14.25 hours of personal-focus time per week. These numbers are comparable with those reported by other software firms.
We then create a Random Forest model to understand which factors influence the median daily personal-focus time. We find that the more time an engineer spends in the IDE, the longer their focus time. We also find that the more central an engineer is in the social work network, the shorter their personal-focus time. Other factors such as role and domain/pillar have little impact on personal-focus time at Meta.
To help engineers achieve longer blocks of personal-focus time and help them stay in flow, Meta developed the AutoFocus tool, which blocks work chat notifications when an engineer has been working on code for 12 minutes or longer. AutoFocus still allows the sender to force a work chat message through using “@notify”, ensuring that urgent messages get through while prompting the sender to reflect on the importance of the message. In a large experiment, we find that AutoFocus increases the amount of personal-focus time by 20.27%, and it has now been rolled out widely at Meta.
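As a rough illustration of both ideas, the sketch below derives focus sessions from an (assumed) anonymized event log and applies an AutoFocus-style blocking rule; the event schema and the minimum session length are hypothetical.

```python
from datetime import timedelta

# Hypothetical event schema: (timestamp, kind), sorted by time; kind is either
# "interrupt" (e.g., a chat message) or a work event such as "ide_edit".
def focus_sessions(events, min_focus=timedelta(minutes=12)):
    sessions, start = [], None
    for ts, kind in events:
        if kind == "interrupt":
            if start is not None and ts - start >= min_focus:
                sessions.append((start, ts))
            start = None
        elif start is None:
            start = ts
    return sessions

def should_block_notification(now, coding_since, threshold=timedelta(minutes=12)):
    # AutoFocus-style rule (sketch): hold chat notifications while the engineer
    # has been coding for at least `threshold`; "@notify" would bypass this.
    return coding_since is not None and now - coding_since >= threshold
```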

Publisher's Version
Understanding Automated Code Review Process and Developer Experience in Industry
Hyungjin Kim, Yonghwi Kwon, Sangwoo Joh, Hyukin Kwon, Yeonhee Ryou, and Taeksu Kim
(Samsung Research, South Korea)
Code Review Automation can reduce human effort during code review by automatically providing valuable information to reviewers. Nevertheless, it is a challenge to automate the process for large-scale companies, such as Samsung Electronics, due to their complexity: varied development environments, frequent review requests, a huge volume of software, and diverse processes among teams.
In this work, we show how we automated the code review process for those intricate environments, and share some lessons learned during two years of operation. Our unified code review automation system, Code Review Bot, is designed to process review requests holistically regardless of such environments, and checks various quality-assurance items such as potential defects in the code, coding style, test coverage, and open source license violations.
Some key findings include: 1) about 60% of issues found by Code Review Bot were reviewed and fixed in advance of product releases, 2) more than 70% of developers gave positive feedback about the system, 3) developers rapidly and actively responded to reviews, and 4) the automation did not substantially affect the amount or the frequency of human code reviews compared to the internal policy that encourages code review activities. Our findings provide practical evidence that automating code review helps assure software quality.

Publisher's Version

Dependability

Unite: An Adapter for Transforming Analysis Tools to Web Services via OSLC
Ondřej Vašíček, Jan Fiedor, Tomáš Kratochvíla, Bohuslav Křena, Aleš Smrčka, and Tomáš Vojnar
(Brno University of Technology, Czechia; Honeywell International, Czechia)
This paper describes Unite, a new tool intended as an adapter for transforming non-interactive command-line analysis tools into OSLC-compliant web services. Unite aims to make such tools easier to adopt and more convenient to use by allowing them to be accessed, both locally and remotely, in a unified way and to be easily integrated into various development environments. Open Services for Lifecycle Collaboration (OSLC) is an open standard for tool integration and was chosen for this task due to its robustness, extensibility, support for data from various domains, and growing popularity. The work is motivated by making existing analysis tools more widely usable, with a strong emphasis on widening their industrial adoption. We have implemented Unite and used it with multiple existing static as well as dynamic analysis and verification tools, and then successfully deployed it internationally in industry to automate verification tasks for development teams at Honeywell. We discuss Honeywell's experience with using Unite and with OSLC in general. Moreover, we also provide the Unite Client (UniC) for Eclipse to allow users to easily run various analysis tools directly from the Eclipse IDE.

Publisher's Version
Discovering Feature Flag Interdependencies in Microsoft Office
Michael Schröder, Katja Kevic, Dan Gopstein, Brendan Murphy, and Jennifer Beckmann
(TU Wien, Austria; Microsoft, UK; Microsoft, USA)
Feature flags are a popular method to control functionality in released code. They enable rapid development and deployment, but can also quickly accumulate technical debt. Complex interactions between feature flags can go unnoticed, especially if interdependent flags are located far apart in the code, and these unknown dependencies could become a source of serious bugs. Testing all possible combinations of feature flags is infeasible in large systems like Microsoft Office, which has about 12,000 active flags. The goal of our research is to aid product teams in improving system reliability by providing an approach to automatically discover feature flag interdependencies. We use probabilistic reasoning to infer causal relationships from feature flag query logs. Our approach is language-agnostic, scales easily to large heterogeneous codebases, and is robust against noise such as code drift or imperfect log data. We evaluated our approach on real-world query logs from Microsoft Office and are able to achieve over 90% precision while recalling non-trivial indirect feature flag relationships across different source files. We also investigated re-occurring patterns of relationships and describe applications for targeted testing, determining deployment velocity, error mitigation, and diagnostics.
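As a simplified illustration of mining such dependencies from query logs, the toy sketch below flags flag B as potentially dependent on flag A when B is almost only queried after A within the same request; the paper's actual approach uses probabilistic causal reasoning and is robust to noise in ways this sketch is not.

```python
from collections import Counter
from itertools import combinations

def candidate_dependencies(request_logs, threshold=0.9):
    """request_logs: one ordered sequence of queried flag names per request."""
    seen, together = Counter(), Counter()
    for seq in request_logs:
        flags = list(dict.fromkeys(seq))      # de-duplicate, keep query order
        seen.update(flags)
        for a, b in combinations(flags, 2):   # a was queried before b
            together[(a, b)] += 1
    return [(a, b, together[(a, b)] / seen[b])
            for (a, b) in together
            if together[(a, b)] / seen[b] >= threshold]
```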

Publisher's Version
What Did You Pack in My App? A Systematic Analysis of Commercial Android Packers
Zikan Dong, Hongxuan Liu, Liu Wang, Xiapu Luo, Yao Guo, Guoai Xu, Xusheng Xiao, and Haoyu Wang
(Beijing University of Posts and Telecommunications, China; Peking University, China; Hong Kong Polytechnic University, China; Case Western Reserve University, USA; Huazhong University of Science and Technology, China)
Commercial Android packers have been widely used by developers as a way to protect their apps from being tampered with. However, app packers are usually provided as online services developed by security vendors, and the packed apps are well protected. It is thus hard to know what exactly is packed in the app, and few existing studies in the community have systematically analyzed the behaviors of commercial app packers. In this paper, we propose PackDiff, a dynamic analysis system to inspect the fine-grained behaviors of commercial packers. By instrumenting the Android system, PackDiff records the runtime behaviors of Android apps (e.g., Linux system call invocations, Java API calls, and Binder interactions), which are further processed to pinpoint the additional sensitive behaviors introduced by packers. By applying PackDiff to roughly 200 apps protected by seven commercial packers, we observe disappointing facts about existing commercial packers: most introduce unnecessary behaviors (e.g., accessing sensitive data) and serious performance and compatibility issues, and they can even be abused to create evasive malware and repackaged apps, which contradicts their design purposes.

Publisher's Version

Repair/Synthesis

An Empirical Study of Deep Transfer Learning-Based Program Repair for Kotlin Projects
Misoo Kim, Youngkyoung Kim, Hohyeon Jeong, Jinseok Heo, Sungoh Kim, Hyunhee Chung, and Eunseok Lee
(Sungkyunkwan University, South Korea; Samsung Electronics, South Korea)
Deep learning-based automated program repair (DL-APR) can automatically fix software bugs and has received significant attention in the industry because of its potential to significantly reduce software development and maintenance costs. The Samsung Mobile eXperience (MX) team is currently switching from Java to Kotlin projects. This study reviews the application of DL-APR to automatically fix defects that arise during this switching process; however, the shortage of Kotlin defect-fixing datasets in the Samsung MX team precludes us from fully utilizing the power of deep learning. Therefore, strategies are needed to effectively reuse the pretrained DL-APR model. This demand can be met using Kotlin defect-fixing datasets constructed from industrial and open-source repositories, together with transfer learning. This study aims to validate the performance of the pretrained DL-APR model in fixing defects in Samsung Kotlin projects, and then improve its performance by applying transfer learning. We show that transfer learning with open-source and industrial Kotlin defect-fixing datasets can improve the defect-fixing performance of the existing DL-APR by 307%. Furthermore, we confirmed that the performance was improved by 532% compared with the baseline DL-APR model as a result of transferring the knowledge of an industrial (non-defect) bug-fixing dataset. We also discovered that the embedded vectors and overlapping code tokens of the code-change pairs are valuable features for selecting useful knowledge-transfer instances, improving the performance of APR models by up to 696%. Our study demonstrates the potential of transfer learning for practitioners considering the application of DL-APR to industrial software.
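The instance-selection idea (using overlapping code tokens to pick useful transfer instances) can be sketched as follows; the data shapes are assumptions, and the paper also uses embedded vectors, which this sketch omits.

```python
def jaccard(a, b):
    """Token-overlap similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def select_transfer_instances(source_pairs, target_tokens, k=1000):
    """source_pairs: (buggy_tokens, fixed_tokens) pairs from the source corpus."""
    scored = sorted(source_pairs,
                    key=lambda pair: jaccard(pair[0], target_tokens),
                    reverse=True)
    return scored[:k]  # the k pairs most lexically similar to the target project
```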

Publisher's Version

Online Presentations

An Empirical Investigation of Missing Data Handling in Cloud Node Failure Prediction
Minghua Ma, Yudong Liu, Yuang Tong, Haozhe Li, Pu Zhao, Yong Xu, Hongyu Zhang, Shilin He, Lu Wang, Yingnong Dang, Saravanakumar Rajmohan, and Qingwei Lin
(Microsoft Research, China; University of Newcastle, Australia; Microsoft, USA)
Cloud computing systems have become increasingly popular in recent years. A typical cloud system utilizes millions of computing nodes as its basic infrastructure. Node failure has been identified as one of the most prevalent causes of cloud system downtime. To improve the reliability of cloud systems, many previous studies collected monitoring metrics from nodes and built models to predict node failures before the failures happen. However, based on our experience with large-scale real-world cloud systems at Microsoft, we find that the task of predicting node failure is severely hampered by missing data: a large fraction of the data is missing, and the latest data used for online prediction is affected even more severely. As a result, the real-time performance of node failure prediction models is limited. In this paper, we first characterize the missing data problem for node failure prediction. Then, we evaluate several existing data interpolation approaches and find that node-dimension interpolation approaches outperform time-dimension ones, and that deep-learning-based interpolation is the best for early prediction. Our findings can help academics and engineers address the missing data problem in cloud node failure prediction and other data-driven software engineering scenarios.
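The contrast between the two interpolation axes can be sketched in a few lines; the following toy functions assume a node-by-time metric matrix with NaNs marking missing values, and are far simpler than the deep-learning-based interpolation the study finds best.

```python
import numpy as np

def fill_time(series):
    """Forward-fill one node's metric history (time-dimension interpolation)."""
    out = series.astype(float).copy()
    for t in range(1, len(out)):
        if np.isnan(out[t]):
            out[t] = out[t - 1]
    return out

def fill_node(matrix, t, node):
    """Fill a node's metric at time t from peer nodes (node-dimension interpolation)."""
    peers = np.delete(matrix[:, t], node)
    return np.nanmean(peers)  # simple mean; learned models can do better
```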

Publisher's Version
An Empirical Study of Log Analysis at Microsoft
Shilin He, Xu Zhang, Pinjia He, Yong Xu, Liqun Li, Yu Kang, Minghua Ma, Yining Wei, Yingnong Dang, Saravanakumar Rajmohan, and Qingwei Lin
(Microsoft Research, China; Chinese University of Hong Kong at Shenzhen, China; Microsoft Azure, China; Microsoft Azure, USA; Microsoft 365, USA)
Logs are crucial to the management and maintenance of software systems. In recent years, log analysis research has achieved notable progress on various topics such as log parsing and log-based anomaly detection. However, the real voices of front-line practitioners are seldom heard. For example, what are the pain points of log analysis in practice? In this work, we conduct a comprehensive survey study on log analysis at Microsoft. We collected feedback from 105 employees through a questionnaire of 13 questions and individual interviews with 12 employees. We summarize the format, scenario, method, tool, and pain points of log analysis. Additionally, by comparing industrial practices with academic research, we discuss the gaps between academia and industry, and future opportunities for log analysis, with four inspiring findings. In particular, we observe that a huge gap exists between log anomaly detection research and failure alerting practices regarding the goal, technique, efficiency, etc. Moreover, data-driven log parsing, which has been widely studied in recent research, can be alternatively achieved by simply logging template IDs during software development. We hope this paper can uncover the real needs of industrial practitioners and the unnoticed yet significant gap between industry and academia, and inspire interesting future directions that converge efforts from both sides.

Publisher's Version
AutoTSG: Learning and Synthesis for Incident Troubleshooting
Manish Shetty, Chetan Bansal, Sai Pramod Upadhyayula, Arjun Radhakrishna, and Anurag Gupta
(Microsoft Research, India; Microsoft, USA)
Incident management is a key aspect of operating large-scale cloud services. To aid faster and more efficient resolution of incidents, engineering teams document frequent troubleshooting steps in the form of Troubleshooting Guides (TSGs), to be used by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute the necessary steps. This results in a plethora of issues such as on-call fatigue, reduced productivity, and human errors. In this work, we conduct a large-scale empirical study of over 4,000 TSGs mapped to incidents and find that TSGs are widely used and help significantly reduce mitigation efforts. We then analyze feedback on TSGs provided by 400+ OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To alleviate these gaps, we investigate the automation of TSGs and propose AutoTSG, a novel framework for automating TSGs into executable workflows by combining machine learning and program synthesis. Our evaluation of AutoTSG on 50 TSGs shows its effectiveness in both identifying TSG statements (accuracy 0.89) and parsing them for execution (precision 0.94 and recall 0.91). Lastly, we survey ten Microsoft engineers and show the importance of TSG automation and the usefulness of AutoTSG.

Publisher's Version
Demystifying “Removed Reviews” in iOS App Store
Liu Wang, Haoyu Wang, Xiapu Luo, Tao Zhang, Shangguang Wang, and Xuanzhe Liu
(Beijing University of Posts and Telecommunications, China; Huazhong University of Science and Technology, China; Hong Kong Polytechnic University, China; Macau University of Science and Technology, China; Peking University, China)
The app markets enable users to submit feedback for downloaded apps in the form of star ratings and text reviews, which are meant to be helpful and trustworthy for the decision making of both developers and other users. App markets have released strict guidelines/policies for user review submissions. However, there is growing evidence of untrustworthy and poor-quality app reviews, making the app store review environment a shambles. Therefore, review removal is a common practice, and market maintainers have to remove undesired reviews from the market periodically in a reactive manner. Although some reports and news outlets have mentioned removed reviews, our research community still lacks a comprehensive understanding of the landscape of such reviews. To fill the void, in this paper, we present a large-scale and longitudinal study of removed reviews in the iOS App Store. We first collaborate with our industry partner to collect over 30 million removed reviews for 33,665 popular apps over the course of a full year in 2020. This comprehensive dataset enables us to characterize the overall landscape of removed reviews. We next investigate the practical reasons leading to the removal of policy-violating reviews, and summarize several interesting ones, including fake reviews and offensive reviews. More importantly, most of these misbehaviors are reflected in reviews’ basic information, including the posters, narrative content, and posting time. This motivates us to design an automated approach to flag policy-violating reviews, and our experiments on the labelled benchmark achieve good performance (F1 = 97%). We further apply our approach in a large-scale industrial setting, and the results suggest a promising industrial usage scenario. Our approach can act as a gatekeeper to pinpoint policy-violating reviews beforehand, which can substantially improve the app review maintenance process in industrial settings.

Publisher's Version
Exploring and Evaluating Personalized Models for Code Generation
Andrei Zlotchevski, Dawn Drain, Alexey Svyatkovskiy, Colin B. Clement, Neel Sundaresan, and Michele Tufano
(McGill University, Canada; Anthropic, USA; Microsoft, USA)
Large Transformer models have achieved state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain -- for example, question-answering on a given topic -- generalization remains an ongoing challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model's parameters, allowing tuning of the token embeddings and softmax layer only or the final layer alone; (iii) prefix tuning, which keeps model parameters frozen, but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off in total compute cost and predictive performance, which we evaluate by code and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios.
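As a concrete illustration of option (ii), the sketch below freezes all parameters of a PyTorch model except those whose names match given patterns; the parameter-name substrings are hypothetical and must be adapted to the actual architecture.

```python
import torch.nn as nn

def freeze_for_lightweight_tuning(model: nn.Module, trainable=("embed", "lm_head")):
    """Freeze every parameter whose name matches none of `trainable` (assumed patterns)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable)
    # Hand only the still-trainable parameters to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]
```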

Publisher's Version
FlakeRepro: Automated and Efficient Reproduction of Concurrency-Related Flaky Tests
Tanakorn Leesatapornwongsa, Xiang Ren, and Suman Nath
(Microsoft Research, USA; University of Toronto, Canada)
Flaky tests, which can non-deterministically pass or fail on the same code, impose a significant burden on developers by providing misleading signals during regression testing. Microsoft developers consider flaky tests one of the top two reasons for slowing down software development. In order to debug the root cause of a flaky behavior, a developer often needs to first reliably reproduce a failed execution. Unfortunately, this is non-trivial. For example, most of the flakiness in unit tests is caused by concurrency, and reproducing such failures requires a specific thread interleaving. To address this challenge, we introduce FlakeRepro, which helps developers reproduce a failed execution of a flaky test caused by concurrency. FlakeRepro combines static and dynamic analysis to quickly identify an interleaving that makes a flaky test fail with the same original error message. FlakeRepro is efficient: it can reproduce a failed execution after exploring a few tens of interleavings. FlakeRepro integrates well with existing systems: it automatically instruments test binaries that can run on existing and unmodified test pipelines.
We have implemented FlakeRepro for .NET and used it at Microsoft. In an experiment with 22 Microsoft projects, FlakeRepro could reproduce 26 of 31 concurrency-related flaky tests, after exploring fewer than 7 interleavings and within 6 minutes, on average.
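A naive way to search interleavings is to inject random delays at instrumented points and replay with different seeds; the sketch below illustrates that baseline idea only, whereas FlakeRepro uses static and dynamic analysis to steer the search toward promising schedules.

```python
import random, time

def run_with_schedule(test_fn, seed, max_delay=0.005):
    """Run the test once under a pseudo-random delay schedule."""
    rng = random.Random(seed)
    def delay_point():  # hook assumed to be instrumented before shared-state accesses
        time.sleep(rng.random() * max_delay)
    test_fn(delay_point)  # raises AssertionError if the flaky check fails

def reproduce(test_fn, expected_error, budget=50):
    for seed in range(budget):  # each seed corresponds to one schedule
        try:
            run_with_schedule(test_fn, seed)
        except AssertionError as err:
            if str(err) == expected_error:
                return seed  # replaying this seed re-triggers the same failure
    return None
```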

Publisher's Version
Group-Based Corpus Scheduling for Parallel Fuzzing
Taotao Gu, Xiang Li, Shuaibing Lu, Jianwen Tian, Yuanping Nie, Xiaohui Kuang, Zhechao Lin, Chenyifan Liu, Jie Liang, and Yu Jiang
(Academy of Military Sciences, China; National University of Defense Technology, China; Tsinghua University, China)
Parallel fuzzing relies on hardware resources to guarantee test throughput and efficiency. In industrial practice, it is well known that parallel fuzzing faces the challenge of task division, but most works neglect the important process of corpus allocation. In this paper, we propose a group-based corpus scheduling strategy to address these two issues, which has been accepted by the LLVM community, and we implement a parallel fuzzer based on this strategy called glibFuzzer. glibFuzzer first groups the global corpus into different subsets and then assigns an energy score and a difference score to each. The energy score is mainly determined by the seed size and the length of coverage information, while the difference score describes the degree of difference in the code covered by different subsets of seeds. In each round of local corpus construction, the master node selects high-quality seeds by combining the two scores to improve test efficiency and avoid task conflicts. To prove the effectiveness of the strategy, we conducted an extensive evaluation on real-world programs and FuzzBench. After 4×24 CPU-hours, glibFuzzer covered 22.02% more branches and executed 19.42 times more test cases than libFuzzer on 18 real-world programs. glibFuzzer showed average branch coverage increases of 73.02%, 55.02%, and 55.86% over AFL, PAFL, and UniFuzz, respectively. More importantly, glibFuzzer found over 100 unique vulnerabilities.
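The scoring idea can be sketched as follows; the seed representation and the exact combination of scores are assumptions, as the abstract only states that energy scores derive from seed size and coverage length and that difference scores capture coverage not shared across groups.

```python
def energy(seed):
    # Higher for small seeds with long coverage, per the abstract's description.
    return len(seed["coverage"]) / (1 + len(seed["bytes"]))

def group_coverage(group):
    return set().union(*(s["coverage"] for s in group)) if group else set()

def build_local_corpora(groups, per_worker=64):
    covs = [group_coverage(g) for g in groups]
    corpora = []
    for i, group in enumerate(groups):
        others = set().union(*(c for j, c in enumerate(covs) if j != i)) if len(groups) > 1 else set()
        ranked = sorted(
            group,
            key=lambda s: energy(s) + len(set(s["coverage"]) - others),
            reverse=True,
        )
        corpora.append(ranked[:per_worker])  # one local corpus per fuzzing worker
    return corpora
```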

Publisher's Version
Incorporating Domain Knowledge through Task Augmentation for Front-End JavaScript Code Generation
Sijie Shen, Xiang Zhu, Yihong Dong, Qizhi Guo, Yankun Zhen, and Ge Li
(Peking University, China; Alibaba Group, China)
Code generation aims to generate a code snippet automatically from natural language descriptions. Generally, mainstream code generation methods rely on a large amount of paired training data, including both the natural language descriptions and the code. However, in some domain-specific scenarios, building such a large paired corpus for code generation is difficult because there is no directly available pairing data, and much manual effort is required to write the code descriptions needed to construct a high-quality training dataset. With such limited training data, the generation model cannot be trained well and is likely to overfit, making its performance unsatisfactory for real-world use. To this end, in this paper we propose a task augmentation method that incorporates domain knowledge into code generation models through auxiliary tasks, and a Subtoken-TranX model that extends the original TranX model to support subtoken-level code generation. To verify our proposed approach, we collect a real-world code generation dataset and conduct experiments on it. Our experimental results demonstrate that the subtoken-level TranX model outperforms the original TranX model and the Transformer model on our dataset, and that the exact match accuracy of Subtoken-TranX improves significantly, by 12.75%, with the help of our task augmentation method. The model's performance on several code categories satisfies the requirements for application in industrial systems. Our proposed approach has been adopted by Alibaba's BizCook platform. To the best of our knowledge, this is the first domain code generation system adopted in industrial development environments.

Publisher's Version
Industry Experiences with Large-Scale Refactoring
James Ivers, Robert L. Nord, Ipek Ozkaya, Chris Seifried, Christopher S. Timperley, and Marouane Kessentini
(Carnegie Mellon University, USA; Oakland University, USA)
Software refactoring plays an important role in software engineering. Developers often turn to refactoring when they want to restructure software to improve its quality without changing its external behavior. Small-scale (floss) refactoring is common in industry and is often performed by a single developer in short sessions, even though developers do much of this work manually instead of using refactoring tools. However, some refactoring efforts are much larger in scale, requiring entire teams and months or years of effort, and the role of tools in these efforts is not as well studied. In this paper, we report on a survey we conducted with developers to understand large-scale refactoring and its tool support needs. Our results from 107 industry developers demonstrate that projects commonly go through multiple large-scale refactorings, each of which requires considerable effort. Our study finds that developers use several categories of tools to support large-scale refactoring and rely more heavily on general-purpose tools like IDEs than on tools designed specifically to support refactoring. Tool support varies across the different activities, with some particularly challenging activities seeing little use of tools in practice. Furthermore, our analysis suggests significant impact is possible through advances in tool support for comprehension and testing, as well as through support for the needs of business stakeholders.

Publisher's Version
Industry Practice of Configuration Auto-tuning for Cloud Applications and Services
Runzhe Wang, Qinglong Wang, Yuxi Hu, Heyuan Shi, Yuheng Shen, Yu Zhan, Ying Fu, Zheng Liu, Xiaohai Shi, and Yu Jiang
(Alibaba Group, China; Central South University, China; Tsinghua University, China; Ant Group, China; Zhejiang University, China)
Auto-tuning is attracting increasing attention in industry practice as a way to optimize the performance of systems with many configurable parameters. It is particularly useful for cloud applications and services, since they have complex system hierarchies and intricate knob correlations. However, existing tools and algorithms rarely consider practical problems such as workload pressure control, support for distributed deployment, and expensive time costs, which are critically important for enterprise cloud applications and services. In this work, we significantly extend an open-source tuning tool, KeenTune, to optimize several typical enterprise cloud applications and services. Our practice is a collaboration between enterprise users and tuning tool developers to address the aforementioned problems. Specifically, we highlight five key challenges from our experiences and provide a set of solutions accordingly. By applying the improved tuning tool to different application scenarios, we achieve 2%-14% performance improvements for MySQL, OceanBase, nginx, and ingress-nginx, and 5%-70% performance improvements for the ACK cloud container service.

Publisher's Version
Investigating and Improving Log Parsing in Practice
Ying Fu, Meng Yan, Jian Xu, Jianguo Li, Zhongxin Liu, Xiaohong Zhang, and Dan Yang
(Chongqing University, China; Ant Group, China; Zhejiang University, China)
Logs are widely used for system behavior diagnosis through automatic log mining. Log parsing is an important data preprocessing step that converts semi-structured log messages into structured data serving as the feature input for log mining. Currently, many studies are devoted to proposing new log parsers. However, to the best of our knowledge, no previous study comprehensively investigates the effectiveness of log parsers in industrial practice. To do so, in this paper we conduct an empirical study on the effectiveness of six state-of-the-art log parsers on 10 microservice applications of Ant Group. Our empirical results highlight two challenges for log parsing in practice: 1) Various separators. A log message can contain various separators, and the separators differ across event templates and applications. Current log parsers cannot perform well because they do not account for such varied separators. 2) Various lengths due to nested objects. Log messages belonging to the same event template may have various lengths due to nested objects. In 6 out of the 10 microservice applications at Ant Group, log messages have various lengths due to nested objects, and 4 out of the 6 state-of-the-art log parsers cannot deal with them. In this paper, we propose an improved log parser named Drain+ based on the state-of-the-art log parser Drain. Drain+ includes two innovative components to address the above two challenges: a statistics-based separator generation component, which automatically generates separators for log message splitting, and a candidate event template merging component, which merges candidate event templates using a template similarity method. We evaluate the effectiveness of Drain+ on 10 microservice applications of Ant Group and 16 public datasets. The results show that Drain+ outperforms the six state-of-the-art log parsers on both industrial applications and public datasets. Finally, we summarize our observations and the road ahead for log parsing to inspire other researchers and practitioners.
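The "various separators" challenge is easy to demonstrate (a simplified sketch, not Drain+'s actual component): a single whitespace split mangles messages that also use '=', ',', or '|', whereas splitting on a per-application separator set tokenizes them cleanly.

    import re

    logs = [
        "user=alice action=login status=ok",
        "cache|miss|key=42",
        "connected to 10.0.0.1, port 8080",
    ]

    # Hypothetical separator set; Drain+ derives one statistically.
    separators = r"[ =,|]+"
    for msg in logs:
        print(re.split(separators, msg))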

Publisher's Version
Towards Developer-Centered Automatic Program Repair: Findings from Bloomberg
Emily Rowan Winter, Vesna Nowack, David Bowes, Steve Counsell, Tracy Hall, Sæmundur Haraldsson, John Woodward, Serkan Kirbas, Etienne Windels, Olayori McBello, Abdurahman Atakishiyev, Kevin Kells, and Matthew Pagano
(Lancaster University, UK; Brunel University London, UK; University of Stirling, UK; Queen Mary University of London, UK; Bloomberg, UK)
This paper reports on qualitative research into automatic program repair (APR) at Bloomberg. Six focus groups were conducted with a total of seventeen participants (including both developers of the APR tool and developers using the tool) to consider: the development at Bloomberg of a prototype APR tool (Fixie); developers’ early experiences using the tool; and developers’ perspectives on how they would like to interact with the tool in future. APR is developing rapidly and it is important to understand in greater detail developers' experiences using this emerging technology. In this paper, we provide in-depth, qualitative data from an industrial setting. We found that the development of APR at Bloomberg had become increasingly user-centered, emphasising how fixes were presented to developers, as well as particular features, such as customisability. From the focus groups with developers who had used Fixie, we found particular concern with the pragmatic aspects of APR, such as how and when fixes were presented to them. Based on our findings, we make a series of recommendations to inform future APR development, highlighting how APR tools should 'start small', be customisable, and fit with developers' workflows. We also suggest that APR tools should capitalise on the promise of repair bots and draw on advances in explainable AI.

Publisher's Version
Trace Analysis Based Microservice Architecture Measurement
Xin Peng, Chenxi Zhang, Zhongyuan Zhao, Akasaka Isami, Xiaofeng Guo, and Yunna Cui
(Fudan University, China; Alibaba Group, China)
Microservice architecture design relies heavily on expert experience and may often result in improper service decomposition. Moreover, a microservice architecture is likely to degrade with the continuous evolution of services. Architecture measurement is thus important for the long-term evolution of microservice architectures. Due to the independent and dynamic nature of services, source code analysis based approaches cannot fully capture the interactions between services. In this paper, we propose a trace analysis based microservice architecture measurement approach. We define a trace data model for microservice architecture measurement, which enables fine-grained analysis of the execution processes of requests and the interactions between interfaces and services. Based on the data model, we define 14 architectural metrics to measure the service independence and invocation chain complexity of a microservice system. We implement the approach and conduct three case studies on a student course project, an open-source microservice benchmark system, and three industrial microservice systems. The results show that our approach can well characterize the independence and invocation chain complexity of microservice architectures and help developers identify architecture issues caused by improper service decomposition and architecture degradation.
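As a simplified illustration of what trace-based measurement enables (the metrics below are our toy examples, not the paper's 14 metrics), distinct services and invocation chain depth can be computed directly from caller/callee spans:

    # One trace as a list of (caller, callee) service invocations.
    trace = [("gateway", "orders"), ("orders", "inventory"),
             ("orders", "payment"), ("payment", "bank")]

    services = {s for edge in trace for s in edge}

    def chain_depth(service, edges):
        # Longest invocation chain starting at `service` (assumes no cycles).
        callees = [callee for caller, callee in edges if caller == service]
        return 1 + max((chain_depth(c, edges) for c in callees), default=0)

    print("distinct services:", len(services))                       # 5
    print("invocation chain depth:", chain_depth("gateway", trace))  # 4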

Publisher's Version

Ideas, Visions, and Reflections

Community

In War and Peace: The Impact of World Politics on Software Ecosystems
Raula Gaikovina Kula and Christoph Treude
(NAIST, Japan; University of Melbourne, Australia)
Reliance on third-party libraries is now commonplace in contemporary software engineering. Being open source in nature, these libraries should advocate for a world where the freedoms and opportunities of open source software can be enjoyed by all. Yet, there is a growing concern about maintainers using their influence to make political stances (referred to as protestware). In this paper, we reflect on the impact of world politics on software ecosystems, especially in the context of the ongoing War in Ukraine. We show three cases where world politics has had an impact on a software ecosystem, and how these incidents may result in either benign or malignant consequences. We further point to specific opportunities for research and conclude with a research agenda with ten research questions to guide future research directions.

Publisher's Version

Machine Learning

Discrepancies among Pre-trained Deep Neural Networks: A New Threat to Model Zoo Reliability
Diego Montes, Pongpatapee Peerapatanapokin, Jeff Schultz, Chengjun Guo, Wenxin Jiang, and James C. Davis
(Purdue University, USA)
Training deep neural networks (DNNs) takes significant time and resources. A common practice for expedited deployment is to use pre-trained deep neural networks (PTNNs), often from model zoos--collections of PTNNs; yet, the reliability of model zoos remains unexamined. In the absence of an industry standard for the implementation and performance of PTNNs, engineers cannot confidently incorporate them into production systems. As a first step, discovering potential discrepancies between PTNNs across model zoos would reveal a threat to model zoo reliability. Prior work has indicated variances in deep learning systems in terms of accuracy. However, broader measures of reliability for PTNNs from model zoos remain unexplored. This work measures notable discrepancies in the accuracy, latency, and architecture of 36 PTNNs across four model zoos. Among the top 10 discrepancies, we find differences of 1.23%-2.62% in accuracy and 9%-131% in latency. We also find mismatches in architecture for well-known DNN architectures (e.g., ResNet and AlexNet). Our findings call for future work on empirical validation, automated tools for measurement, and best practices for implementation.

Publisher's Version
Exploring the Under-Explored Terrain of Non-open Source Data for Software Engineering through the Lens of Federated Learning
Shriram Shanbhag and Sridhar Chimalakonda
(IIT Tirupati, India)
The availability of open source projects on platforms like GitHub has led to the wide use of artifacts from these projects in software engineering research. These publicly available artifacts have been used to train artificial intelligence models employed in various empirical studies and in the development of tools. However, these advancements have missed out on the artifacts of non-open source projects, because that data is unavailable. A major cause of this unavailability is data privacy. In this paper, we propose using federated learning to address the issue of data privacy and enable the use of data from non-open source projects to train AI models used in software engineering research. We believe this can enable industry to collaborate with software engineering researchers without concerns about privacy. We present a preliminary evaluation of using federated learning to train a classifier that labels bug-fix commits, a task taken from an existing study, to demonstrate feasibility. The federated approach achieved an F1 score of 0.83, compared to 0.84 with the centralized approach. We also present our vision of the potential implications of federated learning for software engineering research.
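A minimal sketch of the federated idea, assuming the common FedAvg scheme (the paper's exact protocol may differ): each client computes an update on its private data, and only model weights, never raw data, reach the server.

    import numpy as np

    def local_update(weights, private_data, lr=0.1):
        # Stand-in for local training on data that never leaves the client.
        gradient = np.mean(private_data, axis=0) - weights
        return weights + lr * gradient

    def fed_avg(client_weights):
        # The server aggregates by averaging; raw client data is never shared.
        return np.mean(client_weights, axis=0)

    global_w = np.zeros(3)
    clients = [np.random.rand(20, 3) for _ in range(4)]  # private datasets
    for _ in range(10):  # communication rounds
        updates = [local_update(global_w, data) for data in clients]
        global_w = fed_avg(updates)
    print(global_w)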

Publisher's Version

Debugging/Localization

Reflections on Software Failure Analysis
Paschal C. Amusuo, Aishwarya Sharma, Siddharth R. Rao, Abbey Vincent, and James C. Davis
(Purdue University, USA)
Failure studies are important in revealing the root causes, behaviors, and life cycle of defects in software systems. These studies focus either on the characteristics of defects in specific classes of systems, or on the characteristics of a specific type of defect in the systems in which it manifests. Failure studies have influenced various software engineering research directions, especially in the areas of software evolution, defect detection, and program repair.
In this paper, we reflect on the conduct of failure studies in software engineering. We reviewed a sample of 52 failure study papers and identified several recurring problems in these studies, some of which hinder the ability of the software engineering community to trust or replicate the results. Based on our findings, we suggest future research directions, including identifying and analyzing failure causal chains, standardizing the conduct of failure studies, and tool support for faster defect analysis.

Publisher's Version

Program Analysis

Language-Agnostic Dynamic Analysis of Multilingual Code: Promises, Pitfalls, and Prospects
Haoran Yang, Wen Li, and Haipeng Cai
(Washington State University, USA)
Analyzing multilingual code holistically is key to systematic quality assurance of real-world software, which is mostly developed in multiple computer languages. Toward such analyses, state-of-the-art approaches propose an almost-fully language-agnostic methodology and apply it to dynamic dependence analysis/slicing of multilingual code, showing great promise. We investigated this methodology through a technical analysis followed by a replication study applying it to 10 real-world multilingual projects of diverse language combinations. Our results revealed critical practicality challenges for the methodology (i.e., in achieving the levels of efficiency/scalability, precision, and extensibility to various language combinations required for practical use). Based on the results, we reflect on the underlying pitfalls of the language-agnostic design that lead to these challenges. Finally, looking forward to the prospects of dynamic analysis for multilingual code, we identify a new research direction towards better practicality and precision without sacrificing much extensibility, as supported by preliminary results. The key takeaway is that pursuing fully language-agnostic analysis may be both impractical and unnecessary, and striving for a better balance between language independence and practicality may be more fruitful.

Publisher's Version

Online Presentations

A Study on Identifying Code Author from Real Development
Siyi Gong and Hao Zhong
(Shanghai Jiao Tong University, China)
Identifying code authors is important in many research topics, and various approaches have been proposed. Although these approaches achieve promising results on their datasets, their true effectiveness is still in question. To the best of our knowledge, only one large-scale study has been conducted to explore the impacts of related factors (e.g., the temporal effect and the distribution of files per author). That study selected Google Code Jam programs as its subjects, but such programs are quite different from the source files that programmers write in daily development. To understand the approach's effectiveness and challenges, we replicate the study and use its approach to analyze source files retrieved from real projects. The prior study claims that the temporal effect and the distribution of files per author have only minor impacts on its trained models. In contrast, we find that in 85.48% of the pairs of training and testing sets, a trained model is less accurate when the temporal effect is considered, and in total, the average accuracy decreases by 0.4298. In addition, when we use the real distribution of files as input, the approach can accurately identify only one or two core code authors, although a project can have more than ten authors. By revealing the limitations of the prior approach, our study sheds light on where to make future improvements.

Publisher's Version
Paving the Way for Mature Secondary Research: The Seven Types of Literature Review
Paul Ralph and Sebastian Baltes
(Dalhousie University, Canada; University of Adelaide, Australia)
Confusion over different kinds of secondary research, and their divergent purposes, is undermining the effectiveness and usefulness of secondary studies in software engineering. This short paper therefore explains the differences between ad hoc review, case survey, critical review, meta-analysis (aka systematic literature review), meta-synthesis (aka thematic analysis), rapid review and scoping review (aka systematic mapping study). These definitions and associated guidelines help researchers better select and describe their literature reviews, while helping reviewers select more appropriate evaluation criteria.

Publisher's Version

Demonstrations

Community

iTiger: An Automatic Issue Title Generation Tool
Ting Zhang, Ivana Clairine Irsan, Ferdian Thung, DongGyun Han, David Lo, and Lingxiao Jiang
(Singapore Management University, Singapore; Royal Holloway University of London, UK)
In both commercial and open-source software, bug reports or issues are used to track bugs or feature requests. However, issue quality can vary widely. Prior research has found that bug reports of good quality tend to gain more attention than those of poor quality. As an essential component of an issue, the title is an important aspect of issue quality. Moreover, issues are usually presented in a list view, where only the issue title and some metadata are shown. In this case, a concise and accurate title is crucial for readers to grasp the general concept of the issue, and it facilitates issue triaging. Previous work formulated issue title generation as a one-sentence summarization task and employed a sequence-to-sequence model to solve it. However, that approach requires a large amount of domain-specific training data to attain good performance in issue title generation. Recently, pre-trained models, which learn knowledge from large-scale general corpora, have shown much success in software engineering tasks.
In this work, we make the first attempt to fine-tune BART, which has been pre-trained on English corpora, to generate issue titles. We implemented the fine-tuned BART as a web tool named iTiger, which can suggest an issue title based on the issue description. iTiger is fine-tuned on 267,094 GitHub issues. We compared iTiger with the state-of-the-art method, iTAPE, on 33,438 issues. The automatic evaluation shows that iTiger outperforms iTAPE by 29.7%.
Demo URL: https://youtu.be/-JMWR9-lR78
Source code and replication package URL: https://github.com/soarsmu/iTiger
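For readers unfamiliar with the underlying mechanics, this is roughly how a fine-tuned BART summarizer is invoked with the Hugging Face transformers library; the checkpoint name below is a generic placeholder, not iTiger's released model.

    from transformers import BartForConditionalGeneration, BartTokenizer

    # Placeholder checkpoint; iTiger fine-tunes its own BART on GitHub issues.
    name = "facebook/bart-base"
    tokenizer = BartTokenizer.from_pretrained(name)
    model = BartForConditionalGeneration.from_pretrained(name)

    issue_body = ("The app crashes with a NullPointerException when the user "
                  "opens the settings screen while offline.")
    inputs = tokenizer(issue_body, return_tensors="pt", truncation=True)
    ids = model.generate(inputs["input_ids"], num_beams=4, max_length=32)
    print(tokenizer.decode(ids[0], skip_special_tokens=True))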

Publisher's Version Video
CodeMatcher: A Tool for Large-Scale Code Search Based on Query Semantics Matching
Chao Liu, Xuanlin Bao, Xin Xia, Meng Yan, David Lo, and Ting Zhang
(Chongqing University, China; Huawei, China; Singapore Management University, Singapore)
Due to the emergence of large-scale codebases, such as GitHub and Gitee, searching for and reusing existing code can help developers substantially improve software development productivity. Over the years, many code search tools have been developed. Early tools leveraged information retrieval (IR) techniques to perform efficient code search over frequently changing large-scale codebases. However, their search accuracy was low due to the semantic mismatch between queries and code. In recent years, many tools have leveraged Deep Learning (DL) techniques to address this issue. However, DL-based tools are slow, and their search accuracy is unstable.
In this paper, we present an IR-based tool, CodeMatcher, which inherits the advantages of DL-based tools in query semantics matching. CodeMatcher first builds an index over a large-scale codebase to accelerate search response time. For a given search query, it handles irrelevant and noisy words in the query, retrieves candidate code from the indexed codebase via iterative fuzzy search, and finally reranks the candidates based on two designed measures of semantic matching between the query and the candidates. We implemented CodeMatcher as a search engine website. To verify the effectiveness of our tool, we evaluated CodeMatcher on 41k+ open-source Java repositories. Experimental results showed that CodeMatcher achieves an industrial-level response time (0.3s) on a commodity server with an Intel i7 CPU. On search accuracy, CodeMatcher significantly outperforms three state-of-the-art tools (DeepCS, UNIF, and CodeHow) and two online search engines (GitHub search and Google search).

Publisher's Version Video Info

Software Evolution

Context-Aware Code Recommendation in Intellij IDEA
Shamsa Abid, Hamid Abdul Basit, and Shafay Shamail
(Singapore Management University, Singapore; Prince Sultan University, Saudi Arabia; Lahore University of Management Sciences, Pakistan)
Developers spend a lot of time online, searching for code to help them implement their desired features. While code recommenders help improve developers' productivity, there is currently no support for context-aware code recommendation for opportunistic code reuse on-the-go. Typical code recommendation systems provide recommendations against a search query, whereas a code recommender that supports opportunistic reuse can recommend related code snippets that represent features the developer may want to implement next. In this paper, we present a novel Context-aware Feature-driven API usage-based Code Recommender (CA-FACER) tool, an Intellij IDEA plugin that leverages a developer's development context to recommend related code snippets. We consider the methods having API usages in a developer's active project as part of the development context. Our approach uses contextual data from a developer's active project to find similar projects and recommends code from popular features of those projects. The popular features are identified as frequently occurring API usage based Method Clone Classes. In our experimental evaluation on 120 Android Java projects from GitHub, we observe a 46% improvement in precision using our proposed context-aware approach over a baseline system. Our technique recommends related code examples with an average precision (P@5) of 94% and 83%, and a success rate of 90% and 95%, for the initial and evolved development stages respectively. A video demonstration of our tool is available at https://youtu.be/UjuM8WRc318.

Publisher's Version Video
Python-by-Contract Dataset
Jiyang Zhang, Marko Ristin, Phillip Schanely, Hans Wernher van de Venn, and Milos Gligoric
(University of Texas at Austin, USA; Zurich University of Applied Sciences, Switzerland)
Design-by-contract as a programming technique is becoming popular in the Python community, as various tools have been developed to automatically test code based on its contracts. However, there is no sufficiently large and representative Python code base with contracts on which to evaluate these different testing tools. We present the Python-by-contract dataset, containing 514 Python functions annotated with contracts using the icontract library. We show that our Python-by-contract dataset can be easily used by existing testing tools that take advantage of contracts. The demo video can be found at https://youtu.be/08wZN-xh6mY.
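An example of the annotation style the dataset collects, using the require/ensure decorators that icontract provides (our own toy function, not one from the dataset):

    import icontract

    @icontract.require(lambda n: n >= 0)
    @icontract.ensure(lambda n, result: result * result <= n < (result + 1) ** 2)
    def isqrt(n: int) -> int:
        # Integer square root by simple upward search.
        r = 0
        while (r + 1) ** 2 <= n:
            r += 1
        return r

    print(isqrt(10))  # 3
    # isqrt(-1) would raise icontract.ViolationError (precondition violated)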

Publisher's Version

Human/Computer Interaction

MultIPAs: Applying Program Transformations to Introductory Programming Assignments for Data Augmentation
Pedro Orvalho, Mikoláš Janota, and Vasco Manquinho
(INESC-ID, Portugal; University of Lisbon, Portugal; Czech Technical University in Prague, Czechia)
There has been growing interest, over the last few years, in automated program repair applied to fixing introductory programming assignments (IPAs). However, publicly available IPA datasets tend to be small and lack valuable annotations about the defects in each program. Small datasets are not very useful for program repair tools that rely on machine learning models. Furthermore, a large diversity of correct implementations allows computing a smaller set of repairs to fix a given incorrect program, rather than always using the same set of correct implementations for a given IPA. For these reasons, there is an increasing demand for augmenting IPA benchmarks.
This paper presents MultIPAs, a program transformation tool that can augment IPA benchmarks by: (1) applying six syntactic mutations that conserve a program's semantics, and (2) applying three semantic mutilations that introduce faults into the IPAs. Moreover, we demonstrate the usefulness of MultIPAs by augmenting two publicly available benchmarks of programs written in the C language with millions of programs, and by generating an extensive benchmark of semantically incorrect programs.
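MultIPAs targets C programs, but the flavor of a semantics-conserving syntactic mutation can be sketched in Python with the standard ast module (our illustration, not one of the tool's six mutations): rewriting 'a < b' as 'b > a' changes the syntax while preserving behavior.

    import ast

    class SwapComparison(ast.NodeTransformer):
        # Rewrite 'a < b' as 'b > a': a semantics-preserving mutation.
        def visit_Compare(self, node):
            self.generic_visit(node)
            if len(node.ops) == 1 and isinstance(node.ops[0], ast.Lt):
                return ast.Compare(left=node.comparators[0],
                                   ops=[ast.Gt()],
                                   comparators=[node.left])
            return node

    tree = ast.parse("if x < 10:\n    print(x)")
    mutated = ast.fix_missing_locations(SwapComparison().visit(tree))
    print(ast.unparse(mutated))  # prints the mutated source: if 10 > x: ...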

Publisher's Version Video Info
PolyFax: A Toolkit for Characterizing Multi-language Software
Wen Li, Li Li, and Haipeng Cai
(Washington State University, USA; Monash University, Australia)
Today’s software systems are mostly developed in multiple languages (i.e., multi-language software), yet tool support for understanding and assuring these systems is rare. To facilitate future research on multi-language software engineering, this paper presents PolyFax, a toolkit that offers automated means for dataset collection from GitHub and two analysis utilities--a vulnerability-fixing commit categorization tool (VCC) and a language interfacing mechanism identification/categorization tool (LIC). The VCC tool immediately assists with assessing the vulnerability proneness of a given multi-language project based on its version histories, while the LIC tool enables dissection of the most important aspect of the construction of multi-language systems. Application of PolyFax to 7,113 multi-language projects with 12.6 million commits showed its practical usefulness in terms of promising efficiency and accuracy for studying multi-language software.

Publisher's Version

Software Testing

CLIFuzzer: Mining Grammars for Command-Line Invocations
Abhilash Gupta, Rahul Gopinath, and Andreas Zeller
(CISPA Helmholtz Center for Information Security, Germany)
The behavior of command-line utilities is heavily influenced by command-line options and arguments—configuration settings that enable, disable, or otherwise influence the parts of the code to be executed. Hence, systematic testing of command-line utilities requires testing them with diverse configurations of the supported command-line options.
We introduce CLIFuzzer, a tool that takes an executable program and, using dynamic analysis to track input processing, automatically extracts the full set of its options, arguments, and argument types. This set forms a grammar that represents the valid sequences of options and arguments. By producing invocations from this grammar, we can fuzz the program with an endless list of random configurations, covering the associated code. This leads to increased coverage and new bugs compared to purely mutation-based fuzzers.
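A toy version of the idea with a hypothetical hand-written grammar (far simpler than what CLIFuzzer mines): once options and argument types are captured as a grammar, random derivations yield an endless stream of valid invocations.

    import random

    # Hypothetical mined grammar for a 'sort'-like utility.
    grammar = {
        "<invocation>": [["sort", "<options>", "<file>"]],
        "<options>": [[], ["<option>", "<options>"]],
        "<option>": [["-r"], ["-u"], ["-k", "<number>"]],
        "<number>": [[str(n)] for n in range(1, 5)],
        "<file>": [["data.txt"]],
    }

    def derive(symbol):
        if symbol not in grammar:
            return [symbol]  # terminal token
        expansion = random.choice(grammar[symbol])
        return [tok for sym in expansion for tok in derive(sym)]

    for _ in range(3):
        print(" ".join(derive("<invocation>")))  # e.g., sort -r -k 2 data.txt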

Publisher's Version
RecipeGen++: An Automated Trigger Action Programs Generator
Imam Nur Bani Yusuf, Diyanah Binte Abdul Jamal, Lingxiao Jiang, and David Lo
(Singapore Management University, Singapore)
Trigger Action Programs (TAPs) are event-driven rules that allow users to automate smart devices and internet services. Users can write TAPs by specifying triggers and actions from a set of predefined channels and functions. Despite its simplicity, composing TAPs can still be challenging for users due to the enormous search space of available triggers and actions. The growing popularity of TAPs has been followed by an increasing number of supported devices and services, resulting in a huge number of possible combinations between triggers and actions. Motivated by this, we improve on our prior work and propose RecipeGen++, a deep-learning-based approach that leverages the Transformer seq2seq (sequence-to-sequence) architecture to generate TAPs from natural language descriptions. RecipeGen++ can generate TAPs in the Interactive, One-Click, or Functionality Discovery modes. In the Interactive mode, users can provide feedback to guide the prediction of a trigger or action component. In contrast, the One-Click mode allows users to generate all TAP components directly. Additionally, RecipeGen++ also enables users to discover functionalities at the channel level through the Functionality Discovery mode. We have evaluated RecipeGen++ on real-world datasets in all modes. Our results demonstrate that RecipeGen++ outperforms the baseline by 2.2%-16.2% on the gold-standard benchmark and 5%-29.2% on the noisy benchmark.

Publisher's Version Video Info

Empirical

TAPHSIR: Towards AnaPHoric Ambiguity Detection and ReSolution In Requirements
Saad Ezzini, Sallam Abualhaija, Chetan Arora, and Mehrdad Sabetzadeh
(University of Luxembourg, Luxembourg; Deakin University, Australia; University of Ottawa, Canada)
We introduce TAPHSIR – a tool for anaphoric ambiguity detection and anaphora resolution in requirements. TAPHSIR facilitates reviewing the use of pronouns in a requirements specification and revising those pronouns that can lead to misunderstandings during the development process. To this end, TAPHSIR detects the requirements which have potential anaphoric ambiguity and further attempts to interpret anaphora occurrences automatically. TAPHSIR employs a hybrid solution composed of an ambiguity detection solution based on machine learning and an anaphora resolution solution based on a variant of the BERT language model. Given a requirements specification, TAPHSIR decides for each pronoun occurrence in the specification whether the pronoun is ambiguous or unambiguous, and further provides an automatic interpretation for it. The output generated by TAPHSIR can be easily reviewed and validated by requirements engineers. TAPHSIR is publicly available on Zenodo (https://doi.org/10.5281/zenodo.5902117).

Publisher's Version
COREQQA: A COmpliance REQuirements Understanding using Question Answering Tool
Sallam Abualhaija, Chetan Arora, and Lionel C. Briand
(University of Luxembourg, Luxembourg; Deakin University, Australia; University of Ottawa, Canada)
We introduce COREQQA, a tool for assisting requirements engineers in acquiring a better understanding of compliance requirements by means of automated Question Answering. Extracting compliance-related requirements by manually navigating through a legal document is both time-consuming and error-prone. COREQQA enables requirements engineers to pose questions in natural language about a compliance-related topic in a given legal document, e.g., asking about data breaches. The tool then automatically navigates through the legal document and returns to the requirements engineer a list of text passages containing possible answers to the input question. For better readability, the tool also highlights the likely answers in these passages. The engineer can then use this output for specifying compliance requirements. COREQQA is developed using advanced large-scale language models from the BERT family. COREQQA has been evaluated on four legal documents; the results of this evaluation are briefly presented in the paper. The tool is publicly available on Zenodo (https://doi.org/10.5281/zenodo.6653514).

Publisher's Version

Formal Methods

SolSEE: A Source-Level Symbolic Execution Engine for Solidity
Shang-Wei Lin, Palina Tolmach, Ye Liu, and Yi Li
(Nanyang Technological University, Singapore; IHPC at A*STAR, Singapore)
Most existing smart contract symbolic execution tools perform analysis on bytecode, which loses the high-level semantic information present in source code. This makes interactive analysis tasks—such as visualization and debugging—extremely challenging, and significantly limits tool usability. In this paper, we present SolSEE, a source-level symbolic execution engine for Solidity smart contracts. We describe the design of SolSEE, highlight its key features, and demonstrate its usage through a Web-based user interface. SolSEE demonstrates advantages over other existing source-level analysis tools in the advanced Solidity language features it supports and in its analysis flexibility. A demonstration video is available at: https://sites.google.com/view/solsee/.

Publisher's Version Video Info
MpBP: Verifying Robustness of Neural Networks with Multi-path Bound Propagation
Ye Zheng, Jiaxiang Liu, and Xiaomu Shi
(Shenzhen University, China)
The robustness of neural networks needs to be guaranteed in many safety-critical scenarios, such as autonomous driving and cyber-physical control. In this paper, we present MpBP, a tool for verifying the robustness of neural networks. MpBP is inspired by classical bound propagation methods for neural network verification and aims to improve their effectiveness by exploiting the notion of propagation paths. Specifically, MpBP extends classical bound propagation methods, including forward, backward, and forward+backward bound propagation, with multiple propagation paths. MpBP is built on the widely used PyTorch machine learning framework, providing efficient parallel verification on GPUs and user-friendly usage. We evaluate MpBP on neural networks trained on the standard datasets MNIST, CIFAR-10, and Tiny ImageNet. The results demonstrate the effectiveness advantage of MpBP over two state-of-the-art bound propagation tools, LiRPA and GPUPoly, with efficiency comparable to LiRPA and significantly higher than GPUPoly. A video demonstration that showcases the main features of MpBP can be found at https://youtu.be/3KyPMuPpfR8. Source code is available at https://github.com/formes20/MpBP and https://doi.org/10.5281/zenodo.7029261.
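For readers unfamiliar with the family of techniques MpBP extends, here is the simplest member, forward interval bound propagation through one linear layer, in numpy (a generic sketch, not MpBP's multi-path algorithm): splitting the weights by sign yields sound output bounds from input bounds.

    import numpy as np

    def linear_interval_bounds(W, b, lower, upper):
        # Sound output bounds of x -> W @ x + b for lower <= x <= upper.
        W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
        out_lower = W_pos @ lower + W_neg @ upper + b
        out_upper = W_pos @ upper + W_neg @ lower + b
        return out_lower, out_upper

    W = np.array([[1.0, -2.0], [0.5, 0.5]])
    b = np.array([0.0, 1.0])
    l, u = np.array([-0.1, -0.1]), np.array([0.1, 0.1])  # perturbation box
    print(linear_interval_bounds(W, b, l, u))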

Publisher's Version Video Info

Debugging

eGEN: An Energy-Saving Modeling Language and Code Generator for Location-Sensing of Mobile Apps
Kowndinya Boyalakuntla, Marimuthu Chinnakali, Sridhar Chimalakonda, and Chandrasekaran K
(IIT Tirupati, India; National Institute of Technology Karnataka, India)
Given the limited tool support for energy-saving strategies during the design phase of Android applications, developing battery-aware, location-based Android applications is a non-trivial task for developers. To this end, we propose eGEN, consisting of (1) a Domain-Specific Modeling Language (DSML) and (2) a code generator to specify and create native battery-aware, location-based mobile applications. We evaluated eGEN by instrumenting the generated battery-aware code in five location-based, open-source Android applications and comparing the energy consumption with non-eGEN versions. The experimental results show an average reduction of 188 mA (8.34% of battery per hour) in battery consumption while showing only 97 meters of degradation in location accuracy over three kilometers of a cycling path. Hence, we see this tool as a first step in helping developers write battery-aware code for location-based Android applications. The GitHub repository with source code and all artifacts is available at https://github.com/Kowndinya2000/egen, and the tool demo video at https://youtu.be/Iadfh4cCw8I.

Publisher's Version Video
SFLKit: A Workbench for Statistical Fault Localization
Marius Smytzek and Andreas Zeller
(CISPA Helmholtz Center for Information Security, Germany; Saarland University, Germany)
Statistical fault localization aims at detecting execution features that correlate with failures, such as whether individual lines are part of the execution. We introduce SFLKit, an out-of-the-box workbench for statistical fault localization. The framework provides straightforward access to the fundamental concepts of statistical fault localization. It supports five predicate types, four coverage-inspired spectra, like lines, and 44 similarity coefficients, e.g., TARANTULA or OCHIAI, for statistical program analysis.
SFLKit separates the execution of tests from the analysis of the results and is therefore independent of the testing framework used. It leverages program instrumentation to enable the logging of events and derives the predicates and spectra from these logs. This instrumentation design allows support for multiple programming languages and the extension of statistical fault localization with new concepts. Currently, SFLKit supports the instrumentation of Python programs. It is highly configurable, requiring the logging of only the required events.
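The two named coefficients follow directly from a line's pass/fail execution counts; a plain transcription of the standard formulas (illustrative only, independent of SFLKit's API):

    import math

    def tarantula(ef, ep, tf, tp):
        # ef/ep: failing/passing tests executing the line; tf/tp: totals.
        fail_ratio, pass_ratio = ef / tf, ep / tp
        total = fail_ratio + pass_ratio
        return fail_ratio / total if total else 0.0

    def ochiai(ef, ep, tf, tp):
        denom = math.sqrt(tf * (ef + ep))
        return ef / denom if denom else 0.0

    # A line hit by 3 of 4 failing and 1 of 6 passing tests is suspicious:
    print(tarantula(ef=3, ep=1, tf=4, tp=6))  # ~0.82
    print(ochiai(ef=3, ep=1, tf=4, tp=6))     # 0.75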

Publisher's Version Video Info

Mining Software Repositories

WikiDoMiner: Wikipedia Domain-Specific Miner
Saad Ezzini, Sallam Abualhaija, and Mehrdad Sabetzadeh
(University of Luxembourg, Luxembourg; University of Ottawa, Canada)
We introduce WikiDoMiner – a tool for automatically generating domain-specific corpora by crawling Wikipedia. WikiDoMiner helps requirements engineers create an external knowledge resource that is specific to the underlying domain of a given requirements specification (RS). Being able to build such a resource is important since domain-specific datasets are scarce. WikiDoMiner generates a corpus by first extracting a set of domain-specific keywords from a given RS, and then querying Wikipedia for these keywords. The output of WikiDoMiner is a set of Wikipedia articles relevant to the domain of the input RS. Mining Wikipedia for domain-specific knowledge can be beneficial for multiple requirements engineering tasks, e.g., ambiguity handling, requirements classification, and question answering. WikiDoMiner is publicly available on Zenodo under an open-source license (https://doi.org/10.5281/zenodo.6672682).

Publisher's Version
RegMiner: Mining Replicable Regression Dataset from Code Repositories
Xuezhi Song, Yun Lin, Yijian Wu, Yifan Zhang, Siang Hwee Ng, Xin Peng, Jin Song Dong, and Hong Mei
(Fudan University, China; Shanghai Jiao Tong University, China; National University of Singapore, Singapore; Peking University, China)
In this work, we introduce a tool, RegMiner, to automate the process of collecting replicable regression bugs from a set of Git repositories. In the code commit history, RegMiner searches for regressions where a test can pass a regression-fixing commit, fail a regression-inducing commit, and pass a previous working commit again. Technically, RegMiner (1) identifies potential regression-fixing commits from the code evolution history, (2) migrates the test and its code dependencies in the commit over the history, and (3) minimizes the compilation overhead during the regression search. Our experiments show that RegMiner successfully collected 1035 regressions over 147 projects in 8 weeks, creating, to the best of our knowledge, the largest replicable regression dataset within the shortest period. In addition, our experiments further show that (1) RegMiner constructs the regression dataset with very high precision and acceptable recall, and (2) the constructed regression dataset is of high authenticity and diversity. The source code of RegMiner is available at https://github.com/SongXueZhi/RegMiner, the mined regression dataset is available at https://regminer.github.io/, and the demonstration video is available at https://youtu.be/yzcM9Y4unok.
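The regression criterion itself is simple enough to state as executable logic; a simplified sketch with hypothetical commit hashes and test command (RegMiner additionally migrates tests across commits and minimizes compilation overhead, which this omits):

    import subprocess

    def test_passes(commit, test_cmd):
        subprocess.run(["git", "checkout", "-q", commit], check=True)
        return subprocess.run(test_cmd, shell=True).returncode == 0

    def is_replicable_regression(working, inducing, fixing, test_cmd):
        # Pass at the fix, fail at the inducing commit, pass at the
        # earlier working commit: the pattern RegMiner searches for.
        return (test_passes(fixing, test_cmd)
                and not test_passes(inducing, test_cmd)
                and test_passes(working, test_cmd))

    # e.g., is_replicable_regression("c0ffee1", "dead007", "beef042",
    #                                "mvn -q test -Dtest=FooTest")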

Publisher's Version Video Info

Program Analysis

FIM: Fault Injection and Mutation for Simulink
Ezio Bartocci, Leonardo Mariani, Dejan Ničković, and Drishti Yadav
(TU Wien, Austria; University of Milano-Bicocca, Italy; AIT, Austria)
We introduce FIM, an open-source toolkit for automated fault injection and mutant generation in Simulink models. FIM allows the injection of faults into specific parts, supporting common types of faults and mutation operators whose parameters can be customized to control the time of fault actuation and persistence. Additional flags allow the user to activate the individual fault blocks during testing to observe their effects on the overall system reliability. We provide insights into the design and architecture of FIM, and evaluate its performance on a case study from the avionics domain.

Publisher's Version
JSIMutate: Understanding Performance Results through Mutations
Thomas Laurent, Paolo Arcaini, Catia Trubiani, and Anthony Ventresque
(Lero, Ireland; University College Dublin, Ireland; National Institute of Informatics, Japan; Gran Sasso Science Institute, Italy)
Understanding the performance characteristics of software systems is particularly relevant when looking at design alternatives. However, it is a very challenging problem, due to the complexity of interpreting the role and incidence of the different system elements on performance metrics of interest, such as system response time or resource utilisation. This work introduces JSIMutate, a tool that makes use of queueing network performance models and enables the analysis of mutations of a model reflecting possible design changes, to support designers in identifying the model elements that contribute to improving or worsening the system's performance.

Publisher's Version

Security

VulCurator: A Vulnerability-Fixing Commit Detector
Truong Giang Nguyen, Thanh Le-Cong, Hong Jin Kang, Xuan-Bach D. Le, and David Lo
(Singapore Management University, Singapore; University of Melbourne, Australia)
The open-source software (OSS) vulnerability management process is important nowadays, as the number of discovered OSS vulnerabilities is increasing over time. Monitoring vulnerability-fixing commits is part of the standard process to prevent vulnerability exploitation. Manually detecting vulnerability-fixing commits is, however, time-consuming due to the possibly large number of commits to review. Recently, many techniques have been proposed to automatically detect vulnerability-fixing commits using machine learning. These solutions either (1) do not use deep learning, or (2) use deep learning on only limited sources of information. This paper proposes VulCurator, a tool that leverages deep learning on richer sources of information, including commit messages, code changes, and issue reports, for vulnerability-fixing commit classification. Our experimental results show that VulCurator outperforms the state-of-the-art baselines by up to 16.1% in terms of F1-score.
The VulCurator tool is publicly available at https://github.com/ntgiang71096/VFDetector and https://zenodo.org/record/7034132#.Yw3MN-xBzDI, with a demo video at https://youtu.be/uMlFmWSJYOE.

Publisher's Version
KVS: A Tool for Knowledge-Driven Vulnerability Searching
Xingqi Cheng, Xiaobing Sun, Lili Bo, and Ying Wei
(Yangzhou University, China; Heidelberg University, Germany; Nanjing University, China)
It is difficult to quickly locate and search for specific vulnerabilities and their solutions because vulnerability information is scattered across existing vulnerability management libraries. To alleviate this problem, we extract knowledge from vulnerability reports and organize the vulnerability information into a knowledge graph. We then implement KVS, a tool for knowledge-driven vulnerability searching. The tool mainly uses the BERT model to perform vulnerability named entity recognition and to construct the vulnerability knowledge graph (VulKG). Finally, vulnerabilities of interest can be searched based on VulKG. The URL of this tool is https://cinnqi.github.io/Neo4j-D3-VKG/. A video of our demo is available at https://youtu.be/FT1BaLUGPk0.

Publisher's Version Video Info
MANDO-GURU: Vulnerability Detection for Smart Contract Source Code by Heterogeneous Graph Embeddings
Hoang H. Nguyen, Nhat-Minh Nguyen, Hong-Phuc Doan, Zahra Ahmadi, Thanh-Nam Doan, and Lingxiao Jiang
(Leibniz Universität Hannover, Germany; Singapore Management University, Singapore; Hanoi University of Science and Technology, Vietnam)
Smart contracts are increasingly used with blockchain systems for high-value applications. It is highly desirable to ensure the quality of smart contract source code before deployment. This paper proposes a new deep learning-based tool, MANDO-GURU, that aims to accurately detect vulnerabilities in smart contracts at both the coarse-grained contract level and the fine-grained line level. Using a combination of control-flow graphs and call graphs of Solidity code, we design new heterogeneous graph attention neural networks to encode more structural and potentially semantic relations among different types of nodes and edges of such graphs, and we use the encoded embeddings of the graphs and nodes to detect vulnerabilities. Our validation on real-world smart contract datasets shows that MANDO-GURU can significantly improve over many other vulnerability detection techniques by up to 24% in terms of F1-score at the contract level, depending on vulnerability types. It is the first learning-based tool for Ethereum smart contracts that identifies vulnerabilities at the line level, and it significantly improves over traditional code-analysis-based techniques by up to 63.4%. Our tool is publicly available at https://github.com/MANDO-Project/ge-sc-machine. A test version is currently deployed at http://mandoguru.com, and a demo video of our tool is available at http://mandoguru.com/demo-video.

Publisher's Version Video Info
FastKLEE: Faster Symbolic Execution via Reducing Redundant Bound Checking of Type-Safe Pointers
Haoxin Tu, Lingxiao Jiang, Xuhua Ding, and He Jiang
(Singapore Management University, Singapore; Dalian University of Technology, China)
Symbolic execution (SE) has been widely adopted for automatic program analysis and software testing. Many SE engines (e.g., KLEE or Angr) need to interpret certain Intermediate Representations (IR) of code during execution, which can be slow and costly. Although many studies have proposed ways to accelerate SE, few of them consider optimizing the internal interpretation operations. In this paper, we propose FastKLEE, a faster SE engine that speeds up execution by reducing redundant bound checking of type-safe pointers during IR code interpretation. Specifically, in FastKLEE, a type inference system is first leveraged to classify pointer types (i.e., safe or unsafe) for the most frequently interpreted read/write instructions. Then, a customized memory operation is designed to perform bound checking only for the unsafe pointers and to omit redundant checking on safe pointers. We implement FastKLEE on top of the well-known SE engine KLEE and combine it with the notable type inference system CCured. Evaluation results demonstrate that FastKLEE reduces the time to explore the same number (i.e., 10k) of execution paths by up to 9.1% (5.6% on average) compared to the state-of-the-art engine KLEE. FastKLEE is open-sourced at https://github.com/haoxintu/FastKLEE. A video demo of FastKLEE is available at https://youtu.be/fjV_a3kt-mo.

Publisher's Version Video Info

Online Presentations

Clang __usercall: Towards Native Support for User Defined Calling Conventions
Jared Widberg, Sashank Narain, and Yimin Chen
(University of Massachusetts at Lowell, USA)
In reverse engineering, interfacing with C/C++ functions is of great interest because it provides much more flexibility for product development and security purposes. However, interfacing with functions that use user-defined calling conventions has been a great challenge due to the lack of sufficient and user-friendly tooling. In this work, we design and implement Clang __usercall, which aims to provide programmers with an elegant and familiar syntax for specifying user-defined calling conventions on functions in C/C++ source code. Our key novelties lie in mimicking the most popular existing syntax and adapting Clang for interfacing purposes. Our preliminary user study shows that our solution outperforms existing ones in multiple key aspects, including user experience and required lines of code. Clang __usercall has also been added to the Compiler Explorer website.

Publisher's Version
GFI-Bot: Automated Good First Issue Recommendation on GitHub
Hao He, Haonan Su, Wenxin Xiao, Runzhi He, and Minghui Zhou
(Peking University, China)
To facilitate newcomer onboarding, GitHub recommends the use of "good first issue" (GFI) labels to signal issues suitable for newcomers to resolve. However, previous research shows that manually labeled GFIs are scarce and inappropriate, indicating a need for automated recommendations. In this paper, we present GFI-Bot (accessible at https://gfibot.io), a proof-of-concept machine learning powered bot for automated GFI recommendation in practice. Project maintainers can configure GFI-Bot to discover and label possible GFIs so that newcomers can easily locate issues for making their first contributions. GFI-Bot also provides a high-quality, up-to-date dataset for advancing GFI recommendation research.

Publisher's Version
SemCluster: A Semi-supervised Clustering Tool for Crowdsourced Test Reports with Deep Image Understanding
Mingzhe Du, Shengcheng Yu, Chunrong Fang, Tongyu Li, Heyuan Zhang, and Zhenyu Chen
(Nanjing University, China)
Due to the openness of crowdsourced testing, mobile app crowdsourced testing suffers from duplicate reports. Previous research methods extract the textual features of crowdsourced test reports, combine them with shallow image analysis, and perform unsupervised clustering on the reports to identify and resolve duplication. However, these methods ignore the semantic connection between textual descriptions and screenshots, making the clustering results unsatisfactory and the deduplication less accurate.
This paper proposes a semi-supervised clustering tool for crowdsourced test reports with deep image understanding, named SemCluster, which makes the most of the semantic connection between textual descriptions and screenshots by constructing semantic binding rules and performing semi-supervised clustering. In our experiments, SemCluster improves six clustering metrics compared to the state-of-the-art method, verifying that it achieves a good deduplication effect. The demo can be found at: https://sites.google.com/view/semcluster-demo.

Publisher's Version
TSA: A Tool to Detect and Quantify Network Side-Channels
İsmet Burak Kadron and Tevfik Bultan
(University of California at Santa Barbara, USA)
Mobile applications, Internet of Things devices, and web services are pervasive, and they all encrypt the communications between servers and clients to prevent information leakage. While the network traffic is encrypted, packet sizes and timings are still visible to an eavesdropper, and these properties can leak information and compromise user privacy. We present TSA, a black-box network side-channel analysis tool that detects and quantifies side-channel information leakage. TSA provides users with the means to automate trace gathering through a framework in which users can write mutators for the inputs to the system under analysis. TSA can also take traces directly as input if the user prefers to gather them separately. TSA is open-source and available as a Python package and a command-line tool. The TSA demo, tool, and benchmarks are available at https://github.com/kadron/tsa-tool.
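One common way to quantify such leakage is Shannon entropy over the observable (a minimal sketch; TSA's estimators and features may differ): the drop from the prior to the conditional entropy of the secret given the observed packet size is the information leaked, in bits.

    import math
    from collections import Counter, defaultdict

    # (secret, observed packet size) pairs gathered from traces.
    samples = [("login", 120), ("login", 121), ("logout", 80),
               ("logout", 81), ("login", 120), ("logout", 80)]

    def entropy(counts):
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    prior = entropy(Counter(secret for secret, _ in samples))
    by_size = defaultdict(Counter)
    for secret, size in samples:
        by_size[size][secret] += 1
    n = len(samples)
    conditional = sum(sum(c.values()) / n * entropy(c) for c in by_size.values())
    print(f"leakage: {prior - conditional:.2f} bits")  # 1.00: fully leaked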

Publisher's Version

Doctoral Symposium

Session 1

Blackbox Adversarial Attacks and Explanations for Automatic Speech Recognition
Xiaoliang Wu
(University of Edinburgh, UK)
Automatic speech recognition (ASR) models are widely used in applications for voice navigation and voice control of domestic appliances. The computational core of ASRs are Deep Neural Networks (DNNs), which have been shown to be susceptible to adversarial perturbations and to exhibit unwanted biases and ethical issues. To assess the security of ASRs, we propose techniques that generate blackbox (agnostic to the DNN) adversarial attacks that are portable across ASRs. This is in contrast to existing work, which focuses on whitebox attacks that are time consuming and lack portability. Furthermore, to understand why ASRs (which are typically blackbox) are easily attacked, we provide explanation methods for ASRs that help increase our understanding of such systems and ultimately help build trust in them.

Publisher's Version

Session 2

This Is Your Cue! Assisting Search Behaviour with Resource Style Properties
Deeksha M. Arya
(McGill University, Canada)
When learning a software technology, programmers face a large variety of resources in different styles and catering to different requirements. Although search engines are helpful to filter relevant resources, programmers are still required to manually go through a number of resources before they find one pertinent to their needs. Prior work has largely concentrated on helping programmers find the precise location of relevant information within a resource. Our work focuses on helping programmers assess the pertinence of resources to differentiate between resources. We investigated how programmers find learning resources online via a diary and interview study, and observed that programmers use certain cues to determine whether to access a resource. Based on our findings, we investigate the extent to which we can support the cue-following process via a prototype tool. Our research supports programmers’ search behaviour for software technology learning resources to inform resource creators on important factors that programmers look for during their search.

Publisher's Version
Infrastructure as Code for Dynamic Deployments
Daniel Sokolowski
(University of St. Gallen, Switzerland)
Modern DevOps organizations require a high degree of automation to achieve software stability at frequent changes. Further, there is a need for flexible, timely reconfiguration of the infrastructure, e.g., to use pay-per-use infrastructure efficiently based on application load. Infrastructure as Code (IaC) is the DevOps tool to automate infrastructure. However, modern static IaC solutions only support infrastructures that are deployed and do not change afterward. To implement infrastructures that change dynamically over time, static IaC programs have to be (updated and) re-run, e.g., in a CI/CD pipeline, or configure an external orchestrator that implements the dynamic behavior, e.g., an autoscaler or Kubernetes operator. Both do not capture the dynamic behavior in the IaC program and prevent analyzing and testing the infrastructure configuration jointly with its dynamic behavior.
To fill this gap, we envision dynamic IaC, which augments static IaC with the ability to define dynamic behavior within the IaC program. In contrast to static IaC programs, dynamic IaC programs run continuously. They re-evaluate program parts that depend on external signals when these signals change and automatically adjust the infrastructure accordingly. We implement DIaC as the first dynamic IaC solution and demonstrate it in two realistic use cases of broader relevance. With dynamic IaC, ensuring a program’s correctness is even harder than for static IaC, because a program may define many target configurations instead of only a few; for the same reason, correctness is also more critical. To address this issue, we propose automated, specialized property-based testing for IaC programs and implement it in ProTI.
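
To give a flavor of the idea, a dynamic IaC program can be pictured as a continuously running loop that re-evaluates signal-dependent parts and reconciles the deployment; this is a hand-rolled conceptual sketch, not DIaC’s actual programming model, and all names are illustrative.

import time

def desired_replicas(load: float) -> int:
    # The signal-dependent part of the program: scale out under load.
    return max(1, min(10, round(load / 100)))

def reconcile(current: int, target: int, deploy) -> int:
    if current != target:
        deploy(target)   # adjust the infrastructure to the new target
    return target

def run(read_load, deploy):
    current = 0
    while True:          # dynamic IaC programs run continuously
        current = reconcile(current, desired_replicas(read_load()), deploy)
        time.sleep(5)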

Publisher's Version
Automated Capacity Analysis of Limitation-Aware Microservices Architectures
Rafael Fresno-Aranda
(University of Seville, Spain)
Over the last years, the API economy has fostered an ecosystem of public APIs used as business elements. These APIs offer various pricing plans, which allow developers to consume an API for a specific price and under certain conditions. These conditions include capacity limits, a.k.a. limitations, that restrict the usage of the API. Additionally, modern web applications are usually based on a microservices architecture (MSA), in which multiple services communicate with each other through public APIs following a standardized paradigm, commonly RESTful. When an MSA consumes external APIs with limitations, it is necessary to analyse the impact of these limitations on its capacity. Such MSAs are known as Limitation-Aware Microservices Architectures (LAMAs). This PhD dissertation aims to provide an automated framework to analyse the capacity of a LAMA given a formal description of its internal topology and the external pricing plans. The framework will be used to solve analysis operations, which extract useful information that helps developers build their LAMAs.
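
To illustrate the kind of arithmetic such an analysis performs, consider a deliberately simplified example (not the dissertation’s actual model): each user request to a hypothetical LAMA fans out into calls to two external APIs whose pricing plans cap calls per hour, so the system’s capacity is bounded by the most restrictive limitation.

# Quotas per pricing plan (calls/hour) and fan-out per user request.
quotas = {"geocoding": 10000, "payments": 3000}
calls_per_request = {"geocoding": 2, "payments": 1}

# Capacity = the tightest bound across all consumed external APIs.
capacity = min(quotas[api] // calls_per_request[api] for api in quotas)
print(f"Max user requests per hour: {capacity}")   # -> 3000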

Publisher's Version

Session 3

Change-Aware Mutation Testing for Evolving Systems
Miloš Ojdanić
(University of Luxembourg, Luxembourg)
Although among the strongest test criteria, traditional mutation testing has been shown not to scale with modern incremental development practices. In this work, we describe our proposal for commit-aware mutation testing and introduce the concept of commit-relevant mutants, suitable for evaluating the system’s behaviour after it is affected by regression changes. We show that commit-relevant mutants form a small but effective set that assesses the delta of behaviours between two consecutive software versions. Commit-aware mutation testing gives developers guidance to quantify the extent to which they have tested error-prone locations impacted by program changes. In this paper, we portray our efforts to make mutation criteria change-aware by studying the characteristics of commit-relevant mutants, striving to bring mutation testing closer to being worthwhile for evolving systems.
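
As a first approximation of the idea, one can restrict mutation to lines touched by a commit; note this is a naive syntactic sketch of ours, whereas the paper’s notion of commit relevance concerns the delta of behaviors, so diff overlap is at best a coarse pre-filter.

def changed_lines(diff_text: str) -> set:
    """Extract added-line numbers from a unified diff (simplified)."""
    lines, current = set(), 0
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            # Hunk header like "@@ -3,4 +7,6 @@": take the new-file start.
            current = int(line.split("+")[1].split(",")[0]) - 1
        elif line.startswith("+") and not line.startswith("+++"):
            current += 1
            lines.add(current)
        elif not line.startswith("-"):
            current += 1   # context lines advance the new-file counter
    return lines

def prefilter(mutants, diff_text):
    touched = changed_lines(diff_text)
    return [m for m in mutants if m["line"] in touched]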

Publisher's Version
Effective and Scalable Fault Injection using Bug Reports and Generative Language Models
Ahmed Khanfir
(University of Luxembourg, Luxembourg)
Previous research has shown that artificial faults can be useful in many software engineering tasks such as testing, fault-tolerance assessment, debugging, dependability evaluation, and risk analysis. However, such artificial-fault-based applications can be unreliable or inaccurate when the considered faults misrepresent real bugs. Since fault injection techniques (e.g., mutation testing) typically produce a large number of faults by “blindly” altering the code in arbitrary locations, they are unlikely to produce a few but relevant, realistic faults. In our work, we tackle this challenge by guiding the injection towards resembling bugs that were previously introduced by developers. For this purpose, we propose iBiR, the first fault injection approach that leverages information from bug reports to inject “realistic” faults. iBiR injects faults at the locations that are most likely to be related to a given bug report by applying appropriate inverted fix-patterns, which are manually or automatically crafted by automated-program-repair researchers. We assess our approach using bugs from the Defects4J dataset and show that iBiR significantly outperforms conventional mutation testing at injecting faults that semantically resemble and couple with real ones in the vast majority of cases. Similarly, the faults produced by iBiR give significantly better fault-tolerance estimates than conventional mutation testing in around 80% of the cases.
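
A rough sketch of the pipeline follows; the lexical ranking here is a crude stand-in for real IR-based fault localization, and all names are illustrative rather than iBiR’s actual API or pattern catalog.

# Rank code lines by word overlap with the bug report, then apply an
# inverted fix-pattern at the top-ranked location. As an example
# pattern, a fix that added a guard condition is inverted by disabling
# the guard.

def similarity(report_words: set, code_line: str) -> int:
    return len(report_words & set(code_line.lower().split()))

def inject(report: str, lines: list) -> list:
    words = set(report.lower().split())
    target = max(range(len(lines)), key=lambda i: similarity(words, lines[i]))
    mutated = list(lines)
    stripped = mutated[target].lstrip()
    if stripped.startswith("if "):
        indent = mutated[target][: len(mutated[target]) - len(stripped)]
        mutated[target] = indent + "if True:"   # inverted guard fix
    return mutated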

Publisher's Version
Explaining and Debugging Pathological Program Behavior
Martin Eberlein
(Humboldt University of Berlin, Germany)
Programs fail. But which part of the input is responsible for the failure? To resolve the issue, developers must first understand how and why the program behaves as it does, notably when it deviates from the expected outcome. A program’s behavior is essentially the set of all its executions. This set is usually diverse, unpredictable, and generally unbounded. A pathological program behavior occurs when the actual outcome does not match the expected behavior. Consequently, developers must fix these issues to ensure the built system is the desired software. In our upcoming research, we focus on providing developers with a detailed description of the root causes of a program’s unwanted behavior. Thus, we aim to automatically produce explanations that capture the circumstances of arbitrary program behavior by correlating individual input elements (features) with their corresponding execution outcome. To this end, we use the scientific method and combine generative and predictive models, allowing us (i) to learn the statistical relations between the features of the inputs and the program behavior and (ii) to generate new inputs to refine or refute our current explanatory prediction model.
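
A minimal sketch of such a generate-predict-refine loop might look as follows; this is illustrative only, assuming a simple decision tree as the predictive model, whereas the proposed work targets richer feature spaces and generators.

from sklearn.tree import DecisionTreeClassifier

def explain(program, generate_input, featurize, rounds=10, batch=100):
    inputs = [generate_input() for _ in range(batch)]
    model = DecisionTreeClassifier(max_depth=3)
    for _ in range(rounds):
        X = [featurize(i) for i in inputs]
        y = [program(i) for i in inputs]   # observed behavior per input
        model.fit(X, y)                    # current explanatory model
        inputs.extend(generate_input() for _ in range(batch))  # refine/refute
    return model   # an inspectable tree correlating features and outcomes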

Publisher's Version

Session 4

Sentiment in Software Engineering: Detection and Application
Nathan Cassee
(Eindhoven University of Technology, Netherlands)
In software engineering, human aspects play an important role, especially as developers indicate that they experience a wide range of emotions while developing software. Researchers have sought to understand the role emotions and sentiment play in the development of software by studying issues, pull requests, and commit messages. Sentiment is detected using automated tools, and in this doctoral thesis we plan to study these sentiment analysis tools: their applications, best practices for their usage, and the effect of non-natural language on their performance. In addition, we aim to study self-admitted technical debt and bots in software engineering, to understand why developers express sentiment and what they signal when they do. Through studying both the application of sentiment analysis tools and the role of sentiment in software engineering, we hope to provide practical recommendations for both researchers and developers.

Publisher's Version

Student Research Competition

Online Presentations

A Practical Call Graph Construction Method for Python
Jiacheng Zhong
(Nanjing University, China)
Python has become one of the most popular programming languages today, and the call graph is an essential data structure for many software engineering applications. However, the precision and recall of existing Python call graph construction methods are generally not high enough, which limits their use in practice. This paper proposes PyPt, a static call graph construction method for Python based on flow-insensitive, context-insensitive pointer analysis, which can deal with dynamic features such as the dynamic resolution of attributes and higher-order functions. This paper compares PyPt with the state-of-the-art call graph tool PyCG on a benchmark containing 99 manually constructed programs and six real-world open-source projects. The results show that PyPt is better than PyCG in both soundness and completeness.
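
To see why pointer analysis helps, consider a small snippet (our example, not from the paper) where the call target is only known by tracking which function objects flow into a variable:

# The callee of handler("{}") depends on which functions flow into the
# variable `handler`. A flow-insensitive analysis computes the points-to
# set {parse_json, parse_xml} for `handler` and adds both call edges.

def parse_json(data): ...
def parse_xml(data): ...

def pick(fmt):
    return parse_json if fmt == "json" else parse_xml

handler = pick("json")   # higher-order: a function is a value
handler("{}")            # static target set: {parse_json, parse_xml}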

Publisher's Version
Automated Generation of Test Oracles for RESTful APIs
Juan C. Alonso
(University of Seville, Spain)
Test case generation tools for RESTful APIs have proliferated in recent years. However, despite their promising results, they all share the same limitation: they can only detect crashes (i.e., server errors) and nonconformities with the API specification. In this paper, we present a technique for the automated generation of test oracles for RESTful APIs through the detection of invariants. In practice, our approach aims to learn the expected properties of the output by analysing previous API requests and their corresponding responses. For this, we extended Daikon, a popular tool for the dynamic detection of likely invariants. A preliminary evaluation conducted on a set of 8 operations from 6 industrial APIs reveals a total precision of 66.5% (reaching 100% in 2 operations). Moreover, our approach revealed 6 reproducible bugs in APIs with millions of users: Amadeus, GitHub, and OMDb.
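
The flavor of such invariant-based oracles can be sketched as follows; this is a simplified stand-in of ours, whereas the actual approach derives likely invariants with an extension of Daikon.

# Properties observed to hold over many recorded responses become
# assertions (oracles) for future responses.

def learn_invariants(responses):
    keys = set.intersection(*(set(r) for r in responses))
    return {
        "required_fields": keys,
        "non_negative": {k for k in keys
                         if all(isinstance(r[k], (int, float)) and r[k] >= 0
                                for r in responses)},
    }

def check(response, inv):
    assert inv["required_fields"] <= set(response), "missing field"
    for k in inv["non_negative"]:
        assert response[k] >= 0, f"{k} should be non-negative"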

Publisher's Version
CheapET-3: Cost-Efficient Use of Remote DNN Models
Michael Weiss
(USI Lugano, Switzerland)
On complex problems, state-of-the-art prediction accuracy of Deep Neural Networks (DNNs) can be achieved with very large-scale models consisting of billions of parameters. Such models can only be run on dedicated servers, typically provided by a third-party service, which incurs a substantial monetary cost for every prediction. We propose a new software architecture for client-side applications, where a small local DNN is used alongside a remote large-scale model, aiming to make easy predictions locally at negligible monetary cost while still leveraging the benefits of the large model for challenging inputs. In a proof of concept, we reduce prediction cost by up to 50% without negatively impacting system accuracy.
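
The architecture can be sketched as a confidence-gated fallback; the function names below are ours, not the tool’s, and the threshold is an assumed knob.

def predict(x, local_model, remote_predict, threshold=0.9):
    # The small local model runs on the client at negligible cost.
    probs = local_model(x)
    label, confidence = max(enumerate(probs), key=lambda p: p[1])
    if confidence >= threshold:
        return label             # easy input: free local prediction
    return remote_predict(x)     # hard input: pay for the large model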

Publisher's Version
Improving IDE Code Inspections with Tree Automata
Yunjeong Lee
(National University of Singapore, Singapore)
Integrated development environments (IDEs) are equipped with code inspections that warn developers of malformed or incorrect code by analyzing the code’s data flow. However, the data flow analysis performed by IDE code inspections may issue warnings in code where they are irrelevant. Existing methods to prevent such bogus warnings are either too conservative, suppressing the same type of warning in the entire codebase, or unsustainable, as they require adding ever more heuristics to the data flow analysis. We propose a programming-by-example (PBE) framework that synthesizes tree automata (TAs) to suppress bogus warnings in all code containing the same patterns as user-provided examples. Experiments with a prototype of the framework demonstrate that several real-world patterns heuristically suppressed in a production IDE can be mechanically translated to TAs in a systematic way. Furthermore, we briefly discuss how a TA-based solution can be useful in other IDE features such as code refactoring.
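
To convey the underlying formalism, here is a toy bottom-up tree automaton over expression trees, not one of the framework’s synthesized automata: transitions map a node label plus child states to a state, and a tree is accepted when its root reaches an accepting state. A TA of this kind can encode "suppress the warning for code matching this shape".

def run_ta(tree, transitions):
    label, children = tree
    child_states = tuple(run_ta(c, transitions) for c in children)
    return transitions[(label, child_states)]

# Accepts exactly trees of the shape deref(checked()):
transitions = {
    ("checked", ()): "q_ok",
    ("deref", ("q_ok",)): "q_accept",
}
tree = ("deref", [("checked", [])])
print(run_ta(tree, transitions) == "q_accept")   # True: suppress warning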

Publisher's Version
RESTInfer: Automated Inferring Parameter Constraints from Natural Language RESTful API Descriptions
Yi Liu
(Nanyang Technological University, Singapore)
RESTful APIs are used by many notable companies to provide cloud services, so the quality assurance of RESTful APIs is critical. Several automated RESTful API testing techniques have been proposed to address this need. By analyzing the crashes caused by each test case, developers can find potential bugs in cloud services. However, it is difficult for automated tools to randomly generate feasible parameters under complex constraints. Fortunately, RESTful API descriptions can be used to infer possible parameter constraints, and given such constraints, automated tools can further improve the efficiency of testing.
In this paper, we propose RESTInfer, a two-phase approach that infers parameter constraints via natural language processing. Preliminary evaluation results show that RESTInfer achieves high code coverage and finds bugs.
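
A crude sketch of the extraction idea follows; the keyword patterns are ours and merely illustrative, as RESTInfer’s NLP pipeline is considerably richer.

import re

def infer_constraints(description: str) -> dict:
    # Pattern-match common phrasings into machine-checkable constraints.
    c = {}
    if m := re.search(r"maximum (?:of |value (?:of )?)?(\d+)", description, re.I):
        c["max"] = int(m.group(1))
    if m := re.search(r"minimum (?:of |value (?:of )?)?(\d+)", description, re.I):
        c["min"] = int(m.group(1))
    if re.search(r"\brequired\b", description, re.I):
        c["required"] = True
    return c

print(infer_constraints("Required. Maximum of 50 results per page."))
# -> {'max': 50, 'required': True}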

Publisher's Version

Tutorials

Program Analysis using WALA (Tutorial)
Joanna C. S. Santos and Julian Dolby
(University of Notre Dame, USA; IBM Research, USA)
Static analysis is widely used in research and practice for multiple purposes such as fault localization, vulnerability detection, code clone identification, code refactoring, and optimization. Since implementing static analyzers is a non-trivial task, engineers often rely on existing frameworks to implement their techniques. The IBM T.J. Watson Libraries for Analysis (WALA) is one such framework; it supports the analysis of multiple environments, such as Java bytecode (and related languages), JavaScript, Android, and Python. In this tutorial, we walk through the process of using WALA for program analysis. First, the tutorial covers the background knowledge necessary to understand the technical implementation details of the explained algorithms and techniques. Subsequently, we provide a technical overview of the WALA framework and its support for analyzing code in multiple programming languages and frameworks. Then, we give several live demonstrations of using WALA to implement client analyses. We focus on two common uses of analysis: taint analysis, a form of security analysis, and the use of analysis graphs for machine learning on code.

Publisher's Version
Dynamic Data Race Prediction: Fundamentals, Theory, and Practice (Tutorial)
Umang Mathur and Andreas Pavlogiannis
(National University of Singapore, Singapore; Aarhus University, Denmark)
Data races are the most common concurrency bugs, and considerable effort is put into ensuring data-race-free (DRF) programs. The most popular approach is dynamic analysis, which soundly reports DRF violations by analyzing program executions. Recently, there has been a prevalent shift to predictive analysis techniques. Such techniques attempt to predict DRF violations even in unobserved program executions, while ensuring that the analysis remains sound (raises no false positives).
This tutorial presents the foundations of race prediction in a systematic manner and summarizes the latest advances in race prediction in a concise and unifying way. State-of-the-art predictive techniques are explained from first principles, followed by a comparison of the soundness, completeness, and complexity guarantees each provides. In addition, we highlight the use of specific data structures that make these techniques algorithmically efficient. We also touch on various notions of optimality and their suitability for online/offline prediction. On the theoretical side, we highlight some recent hard computational barriers inherent in race prediction, as well as ways to alleviate them in specific settings. We also touch upon other common concurrency bugs, such as deadlocks and atomicity violations, and highlight cases where techniques transfer between them. The tutorial includes a hands-on demonstration of two relevant tools, namely RAPID and M2. Finally, we end with some key open questions, with the aim of inspiring future research.
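
As background for the tutorial’s starting point, a classic vector-clock happens-before detector can be sketched as follows, simplified to acquire/release and read/write events over a fixed set of threads; predictive techniques improve on this baseline by also reasoning about reorderable executions.

# Two accesses to the same variable race if neither happens before the
# other and at least one is a write. Trace events: (thread, op, target).

def hb_races(trace, n_threads=2):
    clocks = [[0] * n_threads for _ in range(n_threads)]  # per-thread VC
    lock_release = {}   # lock -> vector clock at its last release
    last_access = {}    # var  -> list of (VC, thread, is_write)
    races = []
    for thread, op, target in trace:
        vc = clocks[thread]
        vc[thread] += 1
        if op == "acquire" and target in lock_release:
            rel = lock_release[target]
            clocks[thread] = vc = [max(a, b) for a, b in zip(vc, rel)]
        elif op == "release":
            lock_release[target] = list(vc)
        elif op in ("read", "write"):
            for prev_vc, prev_t, prev_w in last_access.get(target, []):
                concurrent = prev_vc[prev_t] > vc[prev_t]  # no HB order
                if concurrent and (prev_w or op == "write"):
                    races.append((target, prev_t, thread))
            last_access.setdefault(target, []).append(
                (list(vc), thread, op == "write"))
    return races

# Example: unsynchronized writes race; lock-ordered writes do not.
print(hb_races([(0, "write", "x"), (1, "write", "x")]))   # [('x', 0, 1)]
print(hb_races([(0, "write", "x"), (0, "release", "l"),
                (1, "acquire", "l"), (1, "write", "x")]))  # []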

Publisher's Version
Machine Learning and Natural Language Processing for Automating Software Testing (Tutorial)
Mauro Pezzè
(USI Lugano, Switzerland; Schaffhausen Institute of Technology, Switzerland)
In this tutorial, we show how natural language processing and machine learning can help address the open challenges of software testing. We overview the open challenges of testing autonomous and self-adaptive software systems, discuss the leading-edge technologies that can address the core issues, and survey the latest progress and future prospects of natural language processing and machine learning for coping with the core problems.
Automating test case and oracle generation is still a largely open problem. Autonomous and self-adaptive systems, such as self-driving cars, smart cities, and smart buildings, raise new issues that further toughen already challenging scenarios. In the tutorial, we discuss the growing importance of field testing to address failures that emerge in production, the role of dynamic analysis and deep learning in revealing failure-prone scenarios, the need for symbolic fuzzing to explore unexpected scenarios, and the potential of reinforcement learning and natural language processing to generate test cases and oracles. We examine in detail state-of-the-art approaches that exploit natural language processing to automatically generate executable test oracles, as well as semantic matching and deep and reinforcement learning to automatically generate test cases and reveal failure-prone scenarios in production.
The tutorial is designed both for researchers whose research roadmap focuses on software testing and on applications of natural language processing and machine learning to software engineering, and for practitioners who see important professional opportunities in autonomous and self-adaptive systems. It is particularly well suited to PhD students and postdoctoral researchers who aim to address new challenges with novel technologies. The tutorial is self-contained and is designed for a software engineering audience that may not have a specific background in natural language processing and machine learning.

Publisher's Version
Performing Large-Scale Mining Studies: From Start to Finish (Tutorial)
Robert Dyer and Samuel W. Flint
(University of Nebraska-Lincoln, USA)
Modern software engineering research often relies on mining open-source software repositories, either to motivate research problems or to evaluate proposed approaches. Mining ultra-large-scale software repositories is still a difficult task that requires substantial expertise and access to significant hardware. Tools such as Boa help researchers easily mine large numbers of open-source repositories. There has also recently been a push toward open science, with an emphasis on making replication packages available; building such packages incurs additional workload for researchers. In this tutorial, we teach how to use the Boa infrastructure for mining software repository data. We leverage Boa’s VS Code IDE extension to help write and submit Boa queries, and we use Boa’s study template to show how researchers can more easily analyze Boa’s output and automatically produce a suitable replication package that is published on Zenodo.

Publisher's Version
