30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022)

18th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2022), November 17, 2022, Singapore, Singapore

PROMISE 2022 – Proceedings


18th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2022)

Frontmatter

Title Page


Message from the Chairs
It is our pleasure to welcome you to the 18th ACM International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2022), to be held in hybrid mode (physically and virtually) on November 18th, 2022, co-located with the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022).

PROMISE is an annual forum for researchers and practitioners to present, discuss, and exchange ideas, results, expertise, and experiences in the construction and/or application of predictive models and data analytics in software engineering. Such models and analyses can be targeted at planning, design, implementation, testing, maintenance, quality assurance, evaluation, process improvement, management, decision making, and risk assessment in software and systems development.

This year PROMISE received a total of 18 paper submissions. The review process was double-blind, and each paper was reviewed by at least three members of the program committee, followed by an eight-day online discussion. Based on this procedure, we accepted a total of 10 full papers, which will be presented in 3 technical sessions. The acceptance criteria were based entirely on the quality of the papers, without imposing any constraint on the number of papers to be accepted.
We are delighted to announce an outstanding keynote, “Release Engineering in the AI World: How Can Analytics Help?”, by Prof. Bram Adams (Queen’s University, Canada).
We would like to thank all authors for submitting high-quality papers, and the program committee members for their timely and thorough reviews. Last but not least, we would like to thank the ESEC/FSE 2022 organizers for hosting PROMISE 2022 as a co-located event and for their logistical support in organizing the conference.
We hope you will enjoy PROMISE 2022. We certainly will!
Many thanks from Shane McIntosh (General Chair), Gema Rodriguez-Perez and Weiyi Shang (Program Chairs).

Keynote

Release Engineering in the AI World: How Can Analytics Help? (Keynote)
Bram Adams
(Queen’s University, Canada)
Over the last decade, the practices of continuous delivery and deployment have taken the software engineering world by storm. While applications used to be released in an ad hoc manner, breakthroughs in (among others) continuous integration, infrastructure-as-code, and log monitoring have turned the reliable release of cloud applications into a manageable achievement for most companies. However, the advent of AI models seems to have caused a "reset", pushing companies to reinvent the way in which they release high-quality products that now rely not only on source code, but also on data and models. This talk will first focus on the key ingredients of successful pre-AI release engineering practices, and then connect those to newly emerging, post-AI release engineering practices. After this talk, the audience will understand the major challenges software companies face in releasing their AI products multiple times a day, as well as the opportunities for predictive models and data analytics.


Papers

Improving the Performance of Code Vulnerability Prediction using Abstract Syntax Tree Information
Fahad Al Debeyan, Tracy Hall, and David Bowes
(Lancaster University, UK)
The recent emergence of the Log4Shell vulnerability demonstrates the importance of detecting code vulnerabilities in software systems. Software Vulnerability Prediction Models (VPMs) are a promising tool for vulnerability detection. Recent studies have focused on improving the performance of models that predict whether a piece of code is vulnerable or not (binary classification). However, such approaches are limited because they do not tell developers what type of vulnerability needs to be patched. We present a multiclass classification approach to improve the performance of vulnerability prediction models. Our approach uses abstract syntax tree (AST) n-grams to identify code clusters related to specific vulnerabilities. We evaluated our approach using real-world Java software vulnerability data. We report increased predictive performance compared to a variety of other models; for example, F-measure increases from 55% to 75% and MCC increases from 48% to 74%. Our results suggest that clustering software vulnerabilities using AST n-gram information is a promising way to improve vulnerability prediction and to provide specific information about the vulnerability type.
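The abstract describes the pipeline only at a high level. As a rough, hedged sketch of the general idea (not the authors' implementation), the following clusters Java code by AST node-type n-grams; the javalang parser, scikit-learn, and all parameter choices are assumptions.

```python
# Illustrative sketch only: cluster Java code by AST node-type n-grams.
# Assumes the javalang parser and scikit-learn; parameters are arbitrary.
import javalang
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

def ast_node_sequence(java_source: str) -> str:
    """Flatten a Java compilation unit into a sequence of AST node-type names."""
    ast_root = javalang.parse.parse(java_source)
    return " ".join(type(node).__name__ for _, node in ast_root.filter(javalang.tree.Node))

sources = [
    "class A { int f(int x) { return x + 1; } }",
    "class B { void g() { System.out.println(\"hi\"); } }",
]
docs = [ast_node_sequence(src) for src in sources]

# Bag of node-type n-grams (here 2- and 3-grams over node-type names).
vectorizer = CountVectorizer(ngram_range=(2, 3), token_pattern=r"\S+")
X = vectorizer.fit_transform(docs)

# Group structurally similar code; clusters could then be related to vulnerability types.
print(KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X))
```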

Measuring Design Compliance using Neural Language Models: An Automotive Case Study
Dhasarathy Parthasarathy, Cecilia Ekelin, Anjali Karri, Jiapeng Sun, and Panagiotis Moraitis
(Volvo, Sweden; Chalmers University of Technology, Sweden)
As the modern vehicle becomes more software-defined, avoiding serious regressions in software design is beginning to take significant effort. This is because automotive software architects rely largely upon manual code review to spot deviations from specified design principles, an approach that is both inefficient and prone to error. Recently, neural language models pre-trained on source code have begun to be used to automate a variety of programming tasks. In this work, we extend the application of such a Programming Language Model (PLM) to automate the assessment of design compliance. Using a PLM, we construct a system that assesses whether a set of query programs complies with Controller-Handler, a design pattern specified to ensure hardware abstraction in automotive control software. The assessment is based upon measuring whether the geometrical arrangement of query program embeddings, extracted from the PLM, aligns with that of a set of known implementations of the pattern. The level of alignment is then transformed into an interpretable measure of compliance. Using a controlled experiment, we demonstrate that our technique determines compliance with a precision of 92%. Also, using expert review to calibrate the automated assessment, we introduce a protocol to determine the nature of a violation, helping eventual refactoring. The results of this work indicate that neural language models can provide valuable assistance to human architects in assessing and fixing violations in automotive software design.
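The abstract does not spell out how alignment is computed. As a minimal, hypothetical sketch of the general idea, the snippet below scores a query program by the cosine similarity of its embedding to the centroid of embeddings of known pattern implementations; the embed function, the [0, 1] rescaling, and the CodeBERT mention are assumptions, not the paper's method.

```python
# Minimal sketch (not the paper's method): turn "how close is a query program's
# embedding to known Controller-Handler implementations?" into a compliance score.
import numpy as np

def embed(program_source: str) -> np.ndarray:
    """Placeholder for a real PLM embedding (e.g., pooled hidden states of a model like CodeBERT)."""
    rng = np.random.default_rng(abs(hash(program_source)) % (2**32))
    return rng.random(768)

def compliance_score(query_src: str, known_srcs: list[str]) -> float:
    known = np.stack([embed(s) for s in known_srcs])
    centroid = known.mean(axis=0)
    q = embed(query_src)
    # Cosine similarity in [-1, 1], rescaled to an interpretable [0, 1] score.
    cos = float(q @ centroid / (np.linalg.norm(q) * np.linalg.norm(centroid) + 1e-12))
    return (cos + 1.0) / 2.0

known = ["/* known Controller-Handler implementation 1 */",
         "/* known Controller-Handler implementation 2 */"]
print(compliance_score("/* query program */", known))
```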

Feature Sets in Just-in-Time Defect Prediction: An Empirical Evaluation
Peter Bludau and Alexander Pretschner
(fortiss, Germany; TU Munich, Germany)
Just-in-time defect prediction assigns a defect risk to each new change to a software repository in order to prioritize review and testing efforts. Over the last decades, different approaches have been proposed in the literature to craft more accurate prediction models. However, defect prediction is still not widely used in industry, because prediction performance varies considerably. In this study, we evaluate existing features on six open-source projects and propose two new feature sets not yet discussed in the literature. By combining all feature sets, we improve MCC by 21% on average, leading to the best performing models when compared to state-of-the-art approaches. We also evaluate effort-awareness and find that, on average, 14% more defects can be identified when inspecting 20% of changed lines.
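As a generic illustration of the usual just-in-time setup (train a classifier on change-level features, evaluate with MCC), not the authors' feature sets or pipeline, consider the following scikit-learn sketch on synthetic data.

```python
# Generic JIT defect prediction sketch; feature semantics and the model are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder change-level features (e.g., lines added, files touched, author experience).
X = rng.random((500, 14))
y = rng.integers(0, 2, 500)  # 1 = defect-inducing change

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("MCC:", matthews_corrcoef(y_te, model.predict(X_te)))
```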

Profiling Developers to Predict Vulnerable Code Changes
Tugce Coskun, Rusen Halepmollasi, Khadija Hanifi, Ramin Fadaei Fouladi, Pinar Comak De Cnudde, and Ayse Tosun
(Istanbul Technical University, Turkey; Ericsson Security Research, Turkey)
Software vulnerability prediction and management have recently caught the interest of researchers and practitioners. Various techniques, usually based on characteristics of the code artefacts, have been proposed to predict software vulnerabilities. While these studies achieve promising results, the role of developers in inducing vulnerabilities has not yet been studied. We aim to profile the vulnerability-inducing and vulnerability-fixing behaviors of developers in software projects using Heterogeneous Information Network (HIN) analysis. We also investigate the impact of developer profiles in predicting vulnerability-inducing commits and compare the findings against an approach based on code metrics. We adopt the Random Walk with Restart (RWR) algorithm on the HIN and the aggregation of code metrics to extract all input features. We utilize traditional machine learning algorithms, namely Naive Bayes (NB), Support Vector Machine (SVM), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost), to build the prediction models. We report our empirical analysis to predict vulnerability-inducing commits of four Apache projects. The technique based on code metrics achieves 90% success for the recall measure, whereas the technique based on profiling developer behavior achieves 71% success. When we use the feature sets obtained with the two techniques together, we achieve 89% success.
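Random Walk with Restart itself is not detailed in the abstract; a minimal, generic implementation on an adjacency matrix (the paper's heterogeneous network construction is not reproduced here, and the restart probability and tolerance are arbitrary) might look like this.

```python
# Generic Random Walk with Restart (RWR) on a graph given as an adjacency matrix.
import numpy as np

def rwr(adj: np.ndarray, seed: int, restart: float = 0.15, tol: float = 1e-8) -> np.ndarray:
    n = adj.shape[0]
    # Column-normalize so each column is a probability distribution over neighbors.
    col_sums = adj.sum(axis=0, keepdims=True)
    W = np.divide(adj, col_sums, out=np.zeros_like(adj, dtype=float), where=col_sums > 0)
    e = np.zeros(n)
    e[seed] = 1.0                            # restart vector: the node of interest (e.g., a developer)
    p = e.copy()
    while True:
        p_next = (1 - restart) * W @ p + restart * e
        if np.linalg.norm(p_next - p, 1) < tol:
            return p_next                    # steady-state relevance of every node to the seed
        p = p_next
```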

Predicting Build Outcomes in Continuous Integration using Textual Analysis of Source Code Commits
Khaled Al-Sabbagh, Miroslaw Staron, and Regina Hebig
(Chalmers University of Technology, Sweden; University of Gothenburg, Sweden)
Machine learning has been increasingly used to solve various software engineering tasks. One example is predicting the outcome of builds in continuous integration, where a classifier is built to predict whether new code commits will successfully compile. The aim of this study is to investigate the effectiveness of fifteen software metrics in building a classifier for build outcome prediction. Specifically, we conducted an experiment in which we compared the effectiveness of a line-level metric and fourteen other traditional software metrics on 49,040 build records that belong to 117 Java projects. We achieved an average precision of 91% and recall of 80% when using the line-level metric for training, compared to 90% precision and 76% recall for the next best traditional software metric. In contrast, using file-level metrics was found to yield higher predictive quality for the failed builds (average MCC for the best software metric = 68%) than the line-level metric (average MCC = 16%). We conclude that file-level metrics are better predictors of build outcomes for the failed builds, whereas the line-level metric is a slightly better predictor of passed builds.
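As a loose sketch of how a text-based build-outcome classifier can be wired up (not the authors' fifteen-metric experiment; the diffs, labels, vectorizer settings, and model are placeholders), consider:

```python
# Sketch: predict CI build outcome from the text of changed lines in a commit.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

commit_diffs = [
    "+ int retries = 3;\n+ client.connect(host, port);",
    "+ import com.example.missing.Dependency;",
]
build_passed = [1, 0]  # 1 = build succeeded, 0 = build failed

clf = make_pipeline(TfidfVectorizer(token_pattern=r"\S+"), LogisticRegression())
clf.fit(commit_diffs, build_passed)
print(clf.predict(["+ client.connect(host, port);"]))
```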

LOGI: An Empirical Model of Heat-Induced Disk Drive Data Loss and Its Implications for Data Recovery
Hammad Ahmad, Colton Holoday, Ian Bertram, Kevin Angstadt, Zohreh Sharafi, and Westley Weimer
(University of Michigan, USA; MathWorks, USA; St. Lawrence University, USA; Polytechnique Montréal, Canada)
Disk storage continues to be an important medium for data recording in software engineering, and recovering data from a failed storage disk can be expensive and time-consuming. Unfortunately, while physical damage instances are well documented, existing studies of data loss are limited, often only predicting times between failures. We present an empirical measurement of patterns of heat damage on indicative, low-cost commodity hard drives. Because damaged hard drives require many hours to read, we propose an efficient, accurate sampling algorithm. Using our empirical measurements, we develop LOGI, a formal mathematical model that, on average, predicts sector damage with precision, recall, F-measure, and accuracy values of over 0.95. We also present a case study on the usage of LOGI and discuss its implications for file carver software. We hope that this model is used by other researchers to simulate damage and bootstrap further study of disk failures, helping engineers make informed decisions about data storage for software systems.

Assessing the Quality of GitHub Copilot’s Code Generation
Burak Yetistiren, Isik Ozsoy, and Eray Tuzun
(Bilkent University, Turkey)
The introduction of GitHub’s new code generation tool, GitHub Copilot, seems to be the first well-established instance of an AI pair-programmer. GitHub Copilot has access to a large number of open-source projects, enabling it to utilize more extensive code in various programming languages than other code generation tools. Although the initial and informal assessments are promising, a systematic evaluation is needed to explore the limits and benefits of GitHub Copilot. The main objective of this study is to assess the quality of code generated by GitHub Copilot. We also aim to evaluate the impact of the quality and variety of input parameters fed to GitHub Copilot. To achieve this aim, we created an experimental setup for evaluating the generated code in terms of validity, correctness, and efficiency. Our results suggest that GitHub Copilot was able to generate valid code with a 91.5% success rate. In terms of code correctness, out of 164 problems, 47 (28.7%) were generated correctly, 84 (51.2%) partially correctly, and 33 (20.1%) incorrectly. Our empirical analysis shows that GitHub Copilot is a promising tool based on the results we obtained; however, a further and more comprehensive assessment is needed in the future.
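The abstract only names the evaluation criteria; a toy sketch of how validity (does the code parse?) and correctness (what fraction of tests pass?) might be checked for a generated Python function is shown below. The snippet and tests are placeholders, not the authors' harness.

```python
# Toy checks for generated code: "valid" = parses, "correct" = passes the problem's tests.
import ast

generated = "def add(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((0, 0), 0)]

def is_valid(src: str) -> bool:
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

def passed_fraction(src: str) -> float:
    ns: dict = {}
    exec(src, ns)                       # define the generated function
    fn = ns["add"]
    ok = sum(fn(*args) == expected for args, expected in tests)
    return ok / len(tests)              # 1.0 = correct, in between = partially correct

print(is_valid(generated), passed_fraction(generated))
```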

On the Effectiveness of Data Balancing Techniques in the Context of ML-Based Test Case Prioritization
Jediael Mendoza, Jason Mycroft, Lyam Milbury, Nafiseh Kahani, and Jason Jaskolka
(Carleton University, Canada)
Regression testing is the cornerstone of quality assurance for software systems. However, executing regression test cases can impose significant computational and operational costs. In this context, Machine Learning-based Test Case Prioritization (ML-based TCP) techniques rank the execution of regression tests based on their probability of failure and execution time, so that faults can be detected as early as possible during regression testing. Despite the recent progress of ML-based TCP, even the best reported ML-based TCP techniques reach 90% or higher effectiveness in terms of Cost-cognizant Average Percentage of Faults Detected (APFDc) in only 20% of studied subjects. We argue that the imbalanced nature of the training datasets, caused by the low failure rate of regression tests, is one of the main reasons for this shortcoming. This work conducts an empirical study on applying 19 state-of-the-art data balancing techniques for dealing with imbalanced datasets in the TCP context, based on the most comprehensive publicly available datasets. The results demonstrate that data balancing techniques can improve the effectiveness of the best-known ML-based TCP technique for most subjects, with an average improvement of 0.06 in terms of APFDc.
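As an illustration of plugging one balancing step into a TCP-style training set (SMOTE here stands in for any of the 19 techniques compared; the features, model, and data are assumptions), a sketch with imbalanced-learn might look like this.

```python
# Sketch: oversample rare failing test executions before training a prioritization model.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 10))                       # per-test features (history, duration, ...)
y = (rng.random(1000) < 0.05).astype(int)        # ~5% failures: heavily imbalanced

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
model = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)

# Rank tests by predicted failure probability (highest first) for prioritization.
ranking = np.argsort(-model.predict_proba(X)[:, 1])
print(ranking[:10])
```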

Identifying Security-Related Requirements in Regulatory Documents Based on Cross-Project Classification
Mazen Mohamad, Jan-Philipp Steghöfer, Alexander Åström, and Riccardo Scandariato
(Chalmers University of Technology, Sweden; University of Gothenburg, Sweden; Xitaso, Germany; Comentor, Sweden; Hamburg University of Technology, Germany)
Security is receiving substantial focus in many industries, especially safety-critical ones. When new regulations and standards, which can run to hundreds of pages, are introduced, it is necessary to identify the requirements in those documents that have an impact on security. Additionally, it is necessary to revisit the requirements of existing systems and identify the security-related ones. We investigate the feasibility of using a classifier for security-related requirements trained on requirement specifications available online. We base our investigation on 15 requirement documents, randomly selected and partially pre-labelled, with a total of 3,880 requirements. To validate the model, we run a cross-project prediction on the data in which each specification constitutes a group. We also test the model on three different United Nations (UN) regulations from the automotive domain with different magnitudes of security relevance. Our results indicate the feasibility of training a model from a heterogeneous dataset including specifications from multiple domains and in different styles. Additionally, we show the ability of such a classifier to identify security requirements in real-life regulations and discuss scenarios in which such a classification becomes useful to practitioners.
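The cross-project validation described above amounts to leave-one-group-out evaluation with each specification as a group; a minimal scikit-learn sketch (texts, labels, and model choice are placeholders) follows.

```python
# Sketch: leave-one-specification-out classification of security-related requirements.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline

texts  = ["The system shall encrypt stored credentials.",
          "The UI shall display speed in km/h.",
          "Access to diagnostic ports shall require authentication.",
          "The seat position shall be adjustable."]
labels = [1, 0, 1, 0]          # 1 = security-related
groups = [0, 0, 1, 1]          # one id per source specification

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(clf, texts, labels, groups=groups, cv=LeaveOneGroupOut())
print(scores)
```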

API + Code = Better Code Summary? Insights from an Exploratory Study
Prantik Parashar Sarmah and Sridhar Chimalakonda
(IIT Tirupati, India)
Automatic code summarization techniques aid program comprehension by generating a natural language summary from source code. Recent research in this area has seen significant developments, from basic Seq2Seq models to different flavors of transformer models that try to encode the structural components of the source code using various input representations. Apart from the source code itself, components used in source code, such as API knowledge, have previously been helpful for code summarization using recurrent neural networks (RNNs). So, in this article, along with source code and its structure, we explore the importance of APIs in improving the performance of code summarization models. Our model uses a transformer-based architecture containing two encoders for two input modules, source code and API sequences, and a joint decoder that generates summaries by combining the outputs of the two encoders. We experimented with our proposed model on a dataset of Java projects collected from GitHub containing around 87K <Java Method, API Sequence, Comment> triplets. The experiments show that our model outperforms most of the existing RNN-based approaches, but the overall performance does not improve compared with the state-of-the-art transformer-based approach. Thus, the results show that although API information is helpful for code summarization, there is immense scope for further research on improving models and leveraging additional API knowledge.
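The dual-encoder architecture is described only in outline; as a structural sketch (not the paper's model, with arbitrary sizes and no causal masking), a PyTorch version of "two encoders, one joint decoder" could look like this.

```python
# Architectural sketch only: two Transformer encoders (code tokens, API-call tokens)
# whose memories are concatenated as the joint context for a summary decoder.
import torch
import torch.nn as nn

class DualEncoderSummarizer(nn.Module):
    def __init__(self, vocab=10000, d_model=256, nhead=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.code_enc = nn.TransformerEncoder(enc_layer, layers)   # encoder 1: source code
        self.api_enc = nn.TransformerEncoder(enc_layer, layers)    # encoder 2: API sequence
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)    # joint decoder
        self.out = nn.Linear(d_model, vocab)

    def forward(self, code_ids, api_ids, summary_ids):
        code_mem = self.code_enc(self.embed(code_ids))
        api_mem = self.api_enc(self.embed(api_ids))
        memory = torch.cat([code_mem, api_mem], dim=1)   # combine both encoders' outputs
        dec = self.decoder(self.embed(summary_ids), memory)
        return self.out(dec)

model = DualEncoderSummarizer()
logits = model(torch.randint(0, 10000, (2, 50)),   # code tokens
               torch.randint(0, 10000, (2, 20)),   # API sequence tokens
               torch.randint(0, 10000, (2, 12)))   # summary tokens (teacher forcing)
print(logits.shape)
```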

