31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023)

19th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2023), December 8, 2023, San Francisco, CA, USA

PROMISE 2023 – Proceedings


19th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2023)

Frontmatter

Title Page


Welcome from the Chairs
It is our pleasure to welcome you to the 19th ACM International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2023), to be held in person on December 8, 2023, co-located with the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023).

PROMISE 2023 Organization


Invited Talk

Harnessing Predictive Modeling and Software Analytics in the Age of LLM-Powered Software Development (Invited Talk)
Foutse Khomh
(Polytechnique Montréal, Canada)
In the rapidly evolving landscape of software development, Large Language Models (LLMs) have emerged as powerful tools that can significantly impact the way software code is written, reviewed, and optimized, making them invaluable resources for programmers. They offer developers the ability to leverage pre-trained knowledge and tap into vast code repositories, enabling faster development cycles and reducing the time spent on repetitive or mundane coding tasks. However, while these models offer substantial benefits, their adoption also presents multiple challenges. For example, they might generate code snippets that are syntactically correct but functionally flawed, requiring human review and validation. Moreover, the ethical considerations surrounding these models, such as biases in the training data, should be carefully addressed to ensure fair and inclusive software development practices. This talk will provide an overview and reflection on some of these challenges, present some preliminary solutions, and discuss opportunities for predictive models and data analytics.

Publisher's Version

Papers

BuggIn: Automatic Intrinsic Bugs Classification Model using NLP and ML
Pragya Bhandari and Gema Rodríguez-Pérez
(University of British Columbia, Canada)
Recent studies have shown that bugs can be categorized into intrinsic and extrinsic types. Intrinsic bugs can be backtracked to specific changes in the version control system (VCS), while extrinsic bugs originate from external changes to the VCS and lack a direct bug-inducing change. Using only intrinsic bugs to train bug prediction models has been reported as beneficial to improve the performance of such models. However, there is currently no automated approach to identify intrinsic bugs. To bridge this gap, our study employs Natural Language Processing (NLP) techniques to automatically identify intrinsic bugs. Specifically, we utilize two embedding techniques, seBERT and TF-IDF, applied to the title and description text of bug reports. The resulting embeddings are fed into well-established machine learning algorithms such as Support Vector Machine, Logistic Regression, Decision Tree, Random Forest, and K-Nearest Neighbors. The primary objective of this paper is to assess the performance of various NLP and machine learning techniques in identifying intrinsic bugs using the textual information extracted from bug reports. The results demonstrate that both seBERT and TF-IDF can be effectively utilized for intrinsic bug identification. The highest performance scores were achieved by combining TF-IDF with the Decision Tree algorithm and utilizing the bug titles (yielding an F1 score of 78%). This was closely followed by seBERT, Support Vector Machine, and bug titles (with an F1 score of 77%). In summary, this paper introduces an innovative approach that automates the identification of intrinsic bugs using textual information derived from bug reports.
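The pipeline behind the best-reported result (TF-IDF features on bug titles fed to a Decision Tree) can be sketched as follows; the toy bug reports, column names, and hyperparameters are illustrative assumptions rather than the authors' exact setup.

```python
# Minimal sketch, assuming TF-IDF on bug titles + Decision Tree (the paper's
# best-reported combination). Data and hyperparameters are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

bugs = pd.DataFrame({
    "title": ["NullPointerException when saving user profile",
              "Build breaks after upstream API change",
              "Race condition corrupts cache on concurrent writes"],
    "is_intrinsic": [1, 0, 1],   # 1 = intrinsic, 0 = extrinsic
})

clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
clf.fit(bugs["title"], bugs["is_intrinsic"])
print(clf.predict(["Crash caused by off-by-one error in parser"]))
```

Swapping the vectorizer for sentence embeddings (e.g., from seBERT) and the tree for an SVM would reproduce the other configurations the abstract compares.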

Publisher's Version
Do Developers Fix Continuous Integration Smells?
Ayberk Yaşa, Ege Ergül, Hakan Erdogmus, and Eray Tüzün
(Bilkent University, Turkiye; Carnegie Mellon University, USA)
Continuous Integration (CI) is a common software engineering practice in which code changes are frequently merged into a software project repository after automated builds and tests have run successfully. CI enables developers to quickly detect bugs, enhance code quality, and shorten review times. However, developers may encounter obstacles in following the CI principles: they may be unaware of the principles, follow them only partially, or even act against them. These behaviors result in CI smells, which may in turn lessen the benefits of CI. Addressing CI smells rapidly allows software projects to fully reap the benefits of CI and increase its effectiveness. The main objective of this study is to investigate how frequently developers address CI smells. To achieve this objective, we first selected seven smells, implemented scripts for detecting these smells automatically, and then ran the scripts on eight open-source software projects using GitHub Actions. To assess the extent to which practitioners resolve CI smells, we calculated the occurrences and time-to-resolution (TTR) of each smell. Our results suggest that the Skipped Job smell is fixed slightly more often than other CI smells. The most frequently observed smell was Long Build, detected in an average of 19.03% of all CI builds. The Fake Success smell does not get resolved in the projects where it occurs. Our study reveals that practitioners do not fix CI smells in practice. Further studies are needed to explore the underlying reasons, in order to recommend more effective strategies for addressing these smells.
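The occurrence and time-to-resolution (TTR) bookkeeping described above might look roughly like the sketch below; the detection-log format and the per-streak resolution rule are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of per-smell occurrence counting and TTR computation from a
# hypothetical per-build detection log (ordered by build time).
from datetime import datetime, timedelta
from collections import defaultdict

builds = [
    {"time": datetime(2023, 1, 1), "smells": {"Long Build", "Skipped Job"}},
    {"time": datetime(2023, 1, 3), "smells": {"Long Build"}},
    {"time": datetime(2023, 1, 9), "smells": set()},  # both smells resolved here
]

occurrences = defaultdict(int)
open_since = {}           # smell -> time it first appeared in the current streak
ttrs = defaultdict(list)  # smell -> list of resolution durations

for build in builds:
    for smell in build["smells"]:
        occurrences[smell] += 1
        open_since.setdefault(smell, build["time"])
    for smell in list(open_since):
        if smell not in build["smells"]:      # smell no longer detected: resolved
            ttrs[smell].append(build["time"] - open_since.pop(smell))

for smell, durations in ttrs.items():
    avg = sum(durations, timedelta()) / len(durations)
    print(smell, "occurrences:", occurrences[smell], "avg TTR:", avg)
```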

Publisher's Version
Large Scale Study of Orphan Vulnerabilities in the Software Supply Chain
David Reid, Kristiina Rahkema, and James Walden
(University of Tennessee at Knoxville, USA; University of Tartu, Estonia; Northern Kentucky University, USA)
The security of the software supply chain has become a critical issue in an era where the majority of software projects use open source software dependencies, exposing them to vulnerabilities in those dependencies. Awareness of this issue has led to the creation of dependency tracking tools that can identify and remediate such vulnerabilities. These tools rely on package manager metadata to identify dependencies, but open source developers often copy dependencies into their repositories manually without the use of a package manager.
In order to understand the size and impact of this problem, we designed a large scale empirical study to investigate vulnerabilities propagated through copying of dependencies. Such vulnerabilities are called orphan vulnerabilities. We created a tool, VCAnalyzer, to find orphan vulnerabilities copied from an initial set of vulnerable files. Starting from an initial set of 3,615 vulnerable files from the CVEfixes dataset, we constructed a dataset of more than three million orphan vulnerabilities found in over seven hundred thousand open source projects.
We found that 83.4% of the vulnerable files from the CVEfixes dataset were copied at least once. A majority (59.3%) of copied vulnerable files contained C source code. Only 1.3% of orphan vulnerabilities were ever remediated. Remediation took 469 days on average, with half of vulnerabilities in active projects requiring more than three years to fix. Our findings demonstrate that the number of orphan vulnerabilities not trackable by dependency managers is large and point to a need for improving how software supply chain tools identify dependencies. We make our VCAnalyzer tool and our dataset publicly available.
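The core idea of detecting copied ("orphan") vulnerable files can be sketched by hashing known-vulnerable files and scanning other repositories for byte-identical copies; VCAnalyzer itself is more involved (it also tracks remediation over project history), and the paths and exact-hash matching below are illustrative assumptions.

```python
# Hedged sketch: flag files in a project that are byte-identical to files
# known to be vulnerable (e.g., drawn from the CVEfixes dataset).
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Hashes of known-vulnerable files (directory name is hypothetical).
vulnerable_hashes = {sha256_of(p): p.name
                     for p in Path("cvefixes_files").glob("*") if p.is_file()}

# Scan a checked-out project for exact copies of those files.
for candidate in Path("some_project").rglob("*.c"):
    digest = sha256_of(candidate)
    if digest in vulnerable_hashes:
        print(f"Possible orphan vulnerability: {candidate} "
              f"matches {vulnerable_hashes[digest]}")
```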

Publisher's Version
The FormAI Dataset: Generative AI in Software Security through the Lens of Formal Verification
Norbert Tihanyi, Tamas Bisztray, Ridhi Jain, Mohamed Amine Ferrag, Lucas C. Cordeiro, and Vasileios Mavroeidis
(Technology Innovation Institute, United Arab Emirates; University of Oslo, Norway; University of Manchester, UK)
This paper presents the FormAI dataset, a large collection of 112,000 AI-generated compilable and independent C programs with vulnerability classification. We introduce a dynamic zero-shot prompting technique constructed to spawn diverse programs utilizing Large Language Models (LLMs). The dataset is generated by GPT-3.5-turbo and comprises programs with varying levels of complexity. Some programs handle complicated tasks like network management, table games, or encryption, while others deal with simpler tasks like string manipulation. Every program is labeled with the vulnerabilities found within the source code, indicating the type, line number, and vulnerable function name. This is accomplished by employing a formal verification method using the Efficient SMT-based Bounded Model Checker (ESBMC), which uses model checking, abstract interpretation, constraint programming, and satisfiability modulo theories to reason over safety/security properties in programs. This approach definitively detects vulnerabilities and offers a formal model known as a counterexample, thus eliminating the possibility of generating false positive reports. We have associated the identified vulnerabilities with Common Weakness Enumeration (CWE) numbers. We make the source code available for the 112,000 programs, accompanied by a separate file containing the vulnerabilities detected in each program, making the dataset ideal for training LLMs and machine learning algorithms. Our study unveiled that according to ESBMC, 51.24% of the programs generated by GPT-3.5 contained vulnerabilities, thereby presenting considerable risks to software safety and security.
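The labeling step can be approximated by invoking ESBMC on each generated C file and inspecting its verdict. The flags and output markers in the sketch below are assumptions that may vary across ESBMC versions and are not the authors' exact invocation.

```python
# Hedged sketch: classify a generated C program by running ESBMC on it.
# The "--unwind" flag and the "VERIFICATION FAILED/SUCCESSFUL" markers are
# assumptions to be checked against the installed ESBMC version.
import subprocess

def check_with_esbmc(c_file: str, unwind: int = 5, timeout_s: int = 60) -> str:
    try:
        result = subprocess.run(
            ["esbmc", c_file, "--unwind", str(unwind)],
            capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "unknown (timeout)"
    output = result.stdout + result.stderr
    if "VERIFICATION FAILED" in output:
        return "vulnerable (counterexample found)"
    if "VERIFICATION SUCCESSFUL" in output:
        return "no violation found within the bound"
    return "unknown"

print(check_with_esbmc("generated_program_0001.c"))
```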

Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional
Comparing Word-Based and AST-Based Models for Design Pattern Recognition
Sivajeet Chand, Sushant Kumar Pandey, Jennifer Horkoff, Miroslaw Staron, Miroslaw Ochodek, and Darko Durisic
(Chalmers University of Technology, Sweden; University of Gothenburg, Sweden; Poznan University, Poland; Volvo Cars, Gothenburg, Sweden)
Design patterns (DPs) provide reusable and general solutions for frequently encountered problems. Patterns are important for maintaining the structure and quality of software products, in particular in large and distributed systems such as automotive software. Modern language models (such as Code2Vec or Word2Vec) capture rich representations of programs, which have been shown to help in tasks such as program repair and program comprehension, and therefore show promise for design pattern recognition (DPR) in industrial contexts. The models are trained in a self-supervised manner on a large unlabelled code base, which allows them to quantify such abstract concepts as programming styles, coding guidelines, and, to some extent, the semantics of programs. This study demonstrates how two language models, Code2Vec and Word2Vec, trained on two public automotive repositories, can separate programs containing specific DPs. The results show that Code2Vec and Word2Vec produce average F1-scores of 0.781 and 0.690 on open-source Java programs, showing promise for DPR in practice.
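The Word2Vec side of the comparison can be sketched as follows: train token embeddings on a code corpus, average them into a program vector, and classify design patterns on top. The tokenization, mean pooling, and logistic-regression head are illustrative assumptions rather than the paper's setup.

```python
# Hedged sketch: Word2Vec token embeddings + mean pooling for DPR.
from gensim.models import Word2Vec
import numpy as np
from sklearn.linear_model import LogisticRegression

programs = [  # toy corpus: each program as a token sequence
    ["class", "Singleton", "{", "private", "static", "Singleton", "instance", "}"],
    ["class", "ShapeFactory", "{", "public", "Shape", "create", "(", ")", "}"],
]
labels = ["Singleton", "Factory"]

w2v = Word2Vec(sentences=programs, vector_size=32, window=5, min_count=1, epochs=50)

def embed(tokens):
    # Program embedding = mean of its token vectors (an assumption here).
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0)

X = np.stack([embed(p) for p in programs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict([embed(programs[0])]))
```

Code2Vec would replace the token-averaging step with path-based embeddings extracted from the program's AST, which is the contrast the paper evaluates.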

Publisher's Version
On Effectiveness of Further Pre-training on BERT Models for Story Point Estimation
Sousuke Amasaki
(Okayama Prefectural University, Japan)
CONTEXT: Recent studies on story point estimation used deep learning-based language models. These language models were pre-trained on general corpora. However, using language models further pre-trained with specific corpora might be effective. OBJECTIVE: To examine the effectiveness of further pre-trained language models for the predictive performance of story point estimation. METHOD: Two types of further pre-trained language models, namely, domain-specific and repository-specific models, were compared with off-the-shelf models and Deep-SE. The estimation performance was evaluated on 16 project data sets. RESULTS: The effectiveness of domain-specific and repository-specific models was limited, though they performed better than the base models from which they were further pre-trained. CONCLUSION: The effect of further pre-training was small. Choosing large off-the-shelf models might be preferable.
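Further pre-training an off-the-shelf BERT on repository text before fine-tuning it for story point estimation might be sketched with Hugging Face Transformers as below; the model name, toy corpus, and hyperparameters are illustrative assumptions, not the study's configuration.

```python
# Hedged sketch: continue masked-language-model pre-training of BERT on
# (hypothetical) issue text, producing a checkpoint to fine-tune later.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical repository-specific corpus: issue titles and descriptions.
corpus = Dataset.from_dict({"text": [
    "Add retry logic to the payment gateway client",
    "Refactor session handling to reduce memory usage",
]})
tokenized = corpus.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                       batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-further-pretrained",
                           num_train_epochs=1, per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()  # the resulting checkpoint is then fine-tuned for estimation
```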

Publisher's Version
Automated Fairness Testing with Representative Sampling
Umutcan Karakas and Ayse Tosun
(Istanbul Technical University, Turkiye)
The issue of fairness testing in machine learning models has become popular due to rising concerns about potential bias and discrimination, as these models continue to permeate end-user applications. However, achieving an accurate and reliable measurement of the fairness performance of machine learning models remains a substantial challenge. Representative sampling plays a pivotal role in ensuring accurate fairness assessments and providing insight into the underlying dynamics of data, unlike biased or random sampling approaches. In our study, we introduce our approach, namely RSFair, which adopts the representative sampling method to comprehensively evaluate the fairness performance of a trained machine learning model. Our research findings on two datasets indicate that RSFair yields more accurate and reliable results, thus improving the efficiency of subsequent search steps, and ultimately the fairness performance of the model. Using the Orthogonal Matching Pursuit (OMP) and K-Singular Value Decomposition (K-SVD) algorithms for representative sampling, RSFair significantly improves the detection of discriminatory inputs by 76% and the fairness performance by 53% compared to other search-based approaches in the literature.
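A rough sketch of representative sampling via sparse coding is shown below. scikit-learn has no K-SVD implementation, so MiniBatchDictionaryLearning stands in for it here, and the one-representative-per-atom selection rule is an illustrative heuristic rather than RSFair's procedure.

```python
# Hedged sketch: learn a sparse dictionary over candidate test inputs and pick
# one representative per learned atom (K-SVD replaced by a scikit-learn
# dictionary learner; selection rule is an assumption).
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
candidates = rng.normal(size=(200, 10))   # hypothetical encoded test inputs

dico = MiniBatchDictionaryLearning(
    n_components=8, transform_algorithm="omp",
    transform_n_nonzero_coefs=3, random_state=0)
codes = dico.fit_transform(candidates)     # sparse codes via OMP

# For each atom, keep the candidate that activates it most strongly.
representative_idx = np.unique(np.abs(codes).argmax(axis=0))
representatives = candidates[representative_idx]
print("Selected", len(representatives), "representative inputs")
```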

Publisher's Version Info
Model Review: A PROMISEing Opportunity
Tim Menzies
(North Carolina State University, USA)
To make models more understandable and correctable, I propose that the PROMISE community pivots to the problem of model review. Over the years, there have been many reports that very simple models can perform exceptionally well. Yet, where are the researchers asking "say, does that mean that we could make software analytics simpler and more comprehensible?" This is an important question, since humans often have difficulty accurately assessing complex models (leading to unreliable and sometimes dangerous results).
Prior PROMISE results have shown that data mining can effectively summarize large models and data sets into simpler and smaller ones. Therefore, the PROMISE community has the skills and experience needed to redefine, simplify, and improve the relationship between humans and AI.

Publisher's Version
