MSR4P&S 2022 – Proceedings

Welcome from the Chairs
On behalf of the Program Committee, we are pleased to present the proceedings of the 1st International Workshop on Mining Software Repositories for Privacy and Security (MSR4P&S 2022). MSR4P&S is co-located with the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). This year, because of the Covid-19 pandemic, MSR4P&S (as part of ESEC/FSE) is held virtually with an adapted program that will bring together international researchers to exchange ideas, share experiences, investigate problems, and propose promising solutions concerning the application of Mining Software Repositories (MSR) to investigate the different stages of privacy and security. The workshop topics cover a wide range of MSR applications for cybersecurity research, including empirical and mixed-method approaches, as well as datasets and tools.

Keynote

Mining Software Repositories for Security: Data Quality Issues Lessons from Trenches (Keynote)
Muhammad Ali Babar

(University of Adelaide, Australia)
Software repositories are an attractive source of data for understanding the burning security issues challenging developers, anecdotal solutions, and building AI/ML-based models and tools. That is why there is exponential growth in the literature based on mining software repositories for software security. While the abundance of freely available data for research is a fortune, the data quality issues can make software repositories minefields capable of blowing any time and effort budget for a project. Our group has been active in this area for the last few years to develop knowledge, understanding, and tools for improving software security by mining repositories. Through a mix of successful and failed efforts, we have experienced firsthand what is called “garbage in, garbage out” due to poor data quality. Without fully appreciating the data quality issues, starting a data-driven software security project can be frustrating and disheartening for a research team. We believe engaging the relevant stakeholders in developing and sharing knowledge and technologies to improve software security data quality is crucial. To this end, we are not only systematically identifying and synthesizing the existing empirical literature on improving data quality but also devising innovative solutions for addressing the data quality challenges while mining software repositories for software security. This talk will draw lessons and recommendations from our efforts of systematically reviewing the state-of-the-art and developing solutions for improving data quality while building knowledge, understanding, and tools for supporting software security. The talk will use a selected set of our studies to demonstrate the concrete cases of the challenges faced and the used workarounds to successfully continue our journey of learning and improving in this line of research and practice.

Publisher's Version

Assessing Privacy

Mining Software Repositories for Patternizing Attack-and-Defense Co-Evolution
Samiha Shimmi

and Mona Rahimi
(Northern Illinois University, USA)
Several evidence indicates that malicious cyber actors learn, adapt, or, in other words, react to the defensive measures put into place by the cybersecurity community, as much as system defenders react to attacks. To this end, this research aims to mine the existing software repositories to document patterns of co-evolution, which appear between the cyber attacker and defender, as attack-and defend adaptations, for the purpose of determining the probability of attackers’ responsive actions.

Publisher's Version

Assessing Software Privacy using the Privacy Flow-Graph
Feiyang Tang

and Bjarte M. Østvold

(Norwegian Computing Center, Norway)
We increasingly rely on digital services and the conveniences they provide. Processing of personal data is integral to such services and thus privacy and data protection are a growing concern, and governments have responded with regulations such as the EU's GDPR. Following this, organisations that make software have legal obligations to document the privacy and data protection of their software. This work must involve both software developers that understand the code and the organisation's data protection officer or legal department that understands privacy and the requirements of a Data Protection and Impact Assessment (DPIA).
To help developers and non-technical people such as lawyers document the privacy and data protection behaviour of software, we have developed an automatic software analysis technique. This technique is based on static program analysis to characterise the flow of privacy-related data. The results of the analysis can be presented as a graph of privacy flows and operations---that is understandable also for non-technical people. We argue that our technique facilitates collaboration between technical and non-technical people in documenting the privacy behaviour of the software. We explain how to use the results produced by our technique to answer a series of privacy-relevant questions needed for a DPIA. To illustrate our work, we show both detailed and abstract analysis results from applying our analysis technique to the secure messaging service Signal and to the client of the cloud service NextCloud and show how their privacy flow-graphs inform the writing of a DPIA.

Publisher's Version

Vulnerabilities

An Exploratory Study on the Relationship of Smells and Design Issues with Software Vulnerabilities
Sahrima Jannat Oishwee

, Zadia Codabux

, and Natalia Stakhanova

(University of Saskatchewan, Canada)
Software vulnerabilities are one of the leading causes of the loss of confidential data resulting in financial damages in the industry. As a result, software companies strive to discover potential vulnerabilities before the software is deployed. While traditionally, software metrics have been widely used to uncover vulnerabilities, more recent studies have been looking at code smells to detect vulnerabilities. This preliminary study explores the relationship between smells, design issues, and software vulnerabilities. As smells and design issues are indicators of potential problems in the software, establishing a relationship with vulnerabilities can be helpful for vulnerability prediction. In this study, we analyzed 561 versions of nine open-source software by exploring the smells and design issues in the vulnerable and non-vulnerable classes. We found that some smells and design issues have a statistically significant relationship with the vulnerable classes. However, after a manual analysis of the code segments containing the vulnerabilities, we found no indication that smells or design issues induce the vulnerabilities. In fact, they were still present in those code segments even after the vulnerabilities were resolved.

Publisher's Version

Counterfeit Object-Oriented Programming Vulnerabilities: An Empirical Study in Java
Joanna C. S. Santos

, Xueling Zhang

, and Mehdi Mirakhorli

(University of Notre Dame, USA; Rochester Institute of Technology, USA)
Many modern applications rely on Object-Oriented (OO) design principles, where the basic system components are objects and classes. They share objects with other processes, store them in disk/files for future retrieval or transport them over network to other systems. Object-oriented programs leverage numerous dynamic features and design principles such as runtime dispatching and object-oriented callbacks which allow flexible software design. Although seemingly innocuous, these features can be abused by the attackers to hijack the program's control flow to an undesirable behavior. This is referred to as Counterfeit Object-Oriented Programming (COOP), in which attackers hijack objects in the program in order to create a sequence of method calls that introduce a malicious behavior. COOP is a type of code reuse attack in which a hacker hijacks objects (gadgets) in the program and use that to control the program execution flow via manipulating the sequence of methods and data being passed among these methods (gadget chains). In this paper, we describe a preliminary empirical investigation of COOP attacks in real software systems caused by untrusted object deserialization. In this preliminary study, we investigated the severity of these attacks, their consequences, and how they were mitigated by developers. Furthermore, we used the findings to create a dataset of vulnerable software projects and their fixes.

Publisher's Version

SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques
Mohammed Latif Siddiq

and Joanna C. S. Santos

(University of Notre Dame, USA)
Automated source code generation is currently a popular machine-learning-based task. It can be helpful for software developers to write functionally correct code from a given context. However, just like human developers, a code generation model can produce vulnerable code, which the developers can mistakenly use. For this reason, evaluating the security of a code generation model is a must. In this paper, we describe SecurityEval, an evaluation dataset to fulfill this purpose. It contains 130 samples for 75 vulnerability types, which are mapped to the Common Weakness Enumeration (CWE). We also demonstrate using our dataset to evaluate one open-source (i.e., InCoder) and one closed-source code generation model (i.e., GitHub Copilot).

Publisher's Version

MSR4P&S 2022 – Proceedings

1st International Workshop on Mining Software Repositories Applications for Privacy and Security (MSR4P&S 2022)

Frontmatter

Keynote

Assessing Privacy

Vulnerabilities