MSR 2014 – Proceedings

Is Mining Software Repositories Data Science? (Keynote)
Audris Mockus
(Avaya Labs Research, USA)
Trick question: what is Data Science? The collection and use of low-veracity data in software repositories and other operational support systems is exploding. It is, therefore, imperative to elucidate basic principles of how such data comes into being and what it means. Are there practices of constructing software data analysis tools that could raise the integrity of their results despite the problematic nature of the underlying data? The talk explores the basic nature of data in operational support systems and considers approaches to develop engineering practices for software mining tools.

Green Mining

Mining Energy-Greedy API Usage Patterns in Android Apps: An Empirical Study
Mario Linares-Vásquez, Gabriele Bavota, Carlos Bernal-Cárdenas, Rocco Oliveto, Massimiliano Di Penta, and Denys Poshyvanyk

(College of William and Mary, USA; University of Sannio, Italy; Universidad Nacional de Colombia, Colombia; University of Molise, Italy)
Energy consumption of mobile applications is nowadays a hot topic, given the widespread use of mobile devices. The high demand for features and improved user experience, given the available powerful hardware, tend to increase the apps’ energy consumption. However, excessive energy consumption in mobile apps could also be a consequence of energy greedy hardware, bad programming practices, or particular API usage patterns. We present the largest to date quantitative and qualitative empirical investigation into the categories of API calls and usage patterns that—in the context of the Android development framework—exhibit particularly high energy consumption profiles. By using a hardware power monitor, we measure energy consumption of method calls when executing typical usage scenarios in 55 mobile apps from different domains. Based on the collected data, we mine and analyze energy-greedy APIs and usage patterns. We zoom in and discuss the cases where either the anomalous energy consumption is unavoidable or where it is due to suboptimal usage or choice of APIs. Finally, we synthesize our findings into actionable knowledge and recipes for developers on how to reduce energy consumption while using certain categories of Android APIs and patterns

GreenMiner: A Hardware Based Mining Software Repositories Software Energy Consumption Framework
Abram Hindle, Alex Wilson, Kent Rasmussen, E. Jed Barlow, Joshua Charles Campbell, and Stephen Romansky
(University of Alberta, Canada)
Green Mining is a field of MSR that studies software energy consumption and relies on software performance data. Unfortunately there is a severe lack of publicly available software power use performance data. This means that green mining researchers must generate this data themselves by writing tests, building multiple revisions of a product, and then running these tests multiple times (10+) for each software revision while measuring power use. Then, they must aggregate these measurements to estimate the energy consumed by the tests for each software revision. This is time consuming and is made more difficult by the constraints of mobile devices and their OSes. In this paper we propose, implement, and demonstrate Green Miner: the first dedicated hardware mining software repositories testbed. The Green Miner physically measures the energy consumption of mobile devices (Android phones) and automates the testing of applications, and the reporting of measurements back to developers and researchers. The Green Miner has already produced valuable results for commercial Android application developers, and has been shown to replicate other power studies' results.

Mining Questions about Software Energy Consumption
Gustavo Pinto, Fernando Castor, and Yu David Liu
(Federal University of Pernambuco, Brazil; SUNY Binghamton, USA)
A growing number of software solutions have been proposed to address application-level energy consumption problems in the last few years. However, little is known about how much software developers are concerned about energy consumption, what aspects of energy consumption they consider important, and what solutions they have in mind for improving energy efficiency. In this paper we present the first empirical study on understanding the views of application programmers on software energy consumption problems. Using StackOverflow as our primary data source, we analyze a carefully curated sample of more than 300 questions and 550 answers from more than 800 users. With this data, we observed a number of interesting findings. Our study shows that practitioners are aware of the energy consumption problems: the questions they ask are not only diverse -- we found 5 main themes of questions -- but also often more interesting and challenging when compared to the control question set. Even though energy consumption-related questions are popular when considering a number of different popularity measures, the same cannot be said about the quality of their answers. In addition, we observed that some of these answers are often flawed or vague. We contrast the advice provided by these answers with the state-of-the-art research on energy consumption. Our summary of software energy consumption problems may help researchers focus on what matters the most to software developers and end users.

Code Clones and Origin Analysis

Prediction and Ranking of Co-change Candidates for Clones
Manishankar Mondal, Chanchal K. Roy, and Kevin A. Schneider
(University of Saskatchewan, Canada)
Code clones are identical or similar code fragments scattered in a code-base. A group of code fragments that are similar to one another form a clone group. Clones in a particular group often need to be changed together (i.e., co-changed) consistently. However, all clones in a group might not require consistent changes, because some clone fragments might evolve independently. Thus, while changing a particular clone fragment, it is important for a programmer to know which other clone fragments in the same group should be consistently co-changed with that particular clone fragment.
In this research work, we empirically investigate whether we can automatically predict and rank these other clone fragments (i.e., the co-change candidates) from a clone group while making changes to a particular clone fragment in this group. For prediction and ranking we automatically retrieve and infer evolutionary coupling among clones by mining the past clone evolution history. Our experimental result on six subject systems written in two different programming languages (C, and Java) considering both exact and near-miss clones implies that we can automatically predict and rank co-change candidates for clones by analyzing evolutionary coupling. Our ranking mechanism can help programmers pinpoint the likely co-change candidates while changing a particular clone fragment and thus, can help us to better manage software clones.

Incremental Origin Analysis of Source Code Files
Daniela Steidl, Benjamin Hummel, and Elmar Juergens
(CQSE, Germany)
The history of software systems tracked by version control systems is often incomplete because many file movements are not recorded. However, static code analyses that mine the file history, such as change frequency or code churn, produce precise results only if the complete history of a source code file is available. In this paper, we show that up to 38.9% of the files in open source systems have an incomplete history, and we propose an incremental, commit-based approach to reconstruct the history based on clone information and name similarity. With this approach, the history of a file can be reconstructed across repository boundaries and thus provides accurate information for any source code analysis. We evaluate the approach in terms of correctness, completeness, performance, and relevance with a case study among seven open source systems and a developer survey.

Oops! Where Did That Code Snippet Come From?
Lisong Guo, Julia Lawall, and Gilles Muller

(INRIA, France; LIP6, France; Sorbonne, France; UPMC, France)
A kernel oops is an error report that logs the status of the Linux kernel at the time of a crash. Such a report can provide valuable first-hand information for a Linux kernel maintainer to conduct postmortem debugging. Recently, a repository has been created that systematically collects kernel oopses from Linux users. However, debugging based on only the information in a kernel oops is difficult. We consider the initial problem of finding the offending line, i.e., the line of source code that incurs the crash. For this, we propose a novel algorithm based on approximate sequence matching, as used in bioinformatics, to automatically pinpoint the offending line based on information about nearby machine-code instructions, as found in a kernel oops. Our algorithm achieves 92% accuracy compared to 26% for the traditional approach of using only the oops instruction pointer.

Bug Characterizing

Works For Me! Characterizing Non-reproducible Bug Reports
Mona Erfani Joorabchi, Mehdi Mirzaaghaei, and Ali Mesbah
(University of British Columbia, Canada)
Bug repository systems have become an integral component of software development activities. Ideally, each bug report should help developers to find and fix a software fault. However, there is a subset of reported bugs that is not (easily) reproducible, on which developers spend considerable amounts of time and effort. We present an empirical analysis of non-reproducible bug reports to characterize their rate, nature, and root causes. We mine one industrial and five open-source bug repositories, resulting in 32K non-reproducible bug reports. We (1) compare properties of non-reproducible reports with their counterparts such as active time and number of authors, (2) investigate their life-cycle patterns, and (3) examine 120 Fixed non-reproducible reports. In addition, we qualitatively classify a set of randomly selected non-reproducible bug reports (1,643) into six common categories. Our results show that, on average, non-reproducible bug reports pertain to 17% of all bug reports, remain active three months longer than their counterparts, can be mainly (45%) classified as "Interbug Dependencies'', and 66% of Fixed non-reproducible reports were indeed reproduced and fixed.

Characterizing and Predicting Blocking Bugs in Open Source Projects
Harold Valdivia Garcia and Emad Shihab
(Rochester Institute of Technology, USA)
As software becomes increasingly important, its quality becomes an increasingly important issue. Therefore, prior work focused on software quality and proposed many prediction models to identify the location of software bugs, to estimate their fixing-time, etc. However, one special type of severe bugs is blocking bugs. Blocking bugs are software bugs that prevent other bugs from being fixed. These blocking bugs may increase maintenance costs, reduce overall quality and delay the release of the software systems.
In this paper, we study blocking-bugs in six open source projects and propose a model to predict them. Our goal is to help developers identify these blocking bugs early on. We collect the bug reports from the bug tracking systems of the projects, then we obtain 14 different factors related to, for example, the textual description of the bug, the location the bug is found in and the people involved with the bug. Based on these factors we build decision trees for each project to predict whether a bug will be a blocking bug or not. Then, we analyze these decision trees in order to determine which factors best indicate these blocking bugs. Our results show that our prediction models achieve F-measures of 15-42%, which is a two- to four-fold improvement over the baseline random predictors. We also find that the most important factors in determining blocking bugs are the comment text, comment size, the number of developers in the CC list of the bug report and the reporter's experience. Our analysis shows that our models reduce the median time to identify a blocking bug by 3-18 days.

An Empirical Study of Dormant Bugs
Tse-Hsun Chen, Meiyappan Nagappan, Emad Shihab, and Ahmed E. Hassan

(Queen's University, Canada; Rochester Institute of Technology, USA)
Over the past decade, several research efforts have studied the quality of software systems by looking at post-release bugs. However, these studies do not account for bugs that remain dormant (i.e., introduced in a version of the software system, but are not found until much later) for years and across many versions. Such dormant bugs skew our under- standing of the software quality. In this paper we study dormant bugs against non-dormant bugs using data from 20 different open-source Apache foundation software systems. We find that 33% of the bugs introduced in a version are not reported till much later (i.e., they are reported in future versions as dormant bugs). Moreover, we find that 18.9% of the reported bugs in a version are not even introduced in that version (i.e., they are dormant bugs from prior versions). In short, the use of reported bugs to judge the quality of a specific version might be misleading. Exploring the fix process for dormant bugs, we find that they are fixed faster (median fix time of 5 days) than non- dormant bugs (median fix time of 8 days), and are fixed by more experienced developers (median commit counts of developers who fix dormant bug is 169% higher). Our results highlight that dormant bugs are different from non-dormant bugs in many perspectives and that future research in software quality should carefully study and consider dormant bugs.

Mining Repos and QA Sites

The Promises and Perils of Mining GitHub
Eirini Kalliamvakou, Georgios Gousios

, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian
(University of Victoria, Canada; Delft University of Technology, Netherlands)
With over 10 million git repositories, GitHub is becoming one of the most important source of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub's main features---namely commits, pull requests, and issues. Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.

Info

Mining StackOverflow to Turn the IDE into a Self-Confident Programming Prompter
Luca Ponzanelli, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, and Michele Lanza

(University of Lugano, Switzerland; University of Sannio, Italy; University of Molise, Italy)
Developers often require knowledge beyond the one they possess, which often boils down to consulting sources of information like Application Programming Interfaces (API) documentation, forums, Q&A websites, etc. Knowing what to search for and how is non- trivial, and developers spend time and energy to formulate their problems as queries and to peruse and process the results. We propose a novel approach that, given a context in the IDE, automatically retrieves pertinent discussions from Stack Overflow, evaluates their relevance, and, if a given confidence threshold is surpassed, notifies the developer about the available help. We have implemented our approach in Prompter, an Eclipse plug-in. Prompter has been evaluated through two studies. The first was aimed at evaluating the devised ranking model, while the second was conducted to evaluate the usefulness of Prompter.

Mining Questions Asked by Web Developers
Kartik Bajaj, Karthik Pattabiraman

, and Ali Mesbah
(University of British Columbia, Canada)
Modern web applications consist of a significant amount of client- side code, written in JavaScript, HTML, and CSS. In this paper, we present a study of common challenges and misconceptions among web developers, by mining related questions asked on Stack Over- flow. We use unsupervised learning to categorize the mined questions and define a ranking algorithm to rank all the Stack Overflow questions based on their importance. We analyze the top 50 questions qualitatively. The results indicate that (1) the overall share of web development related discussions is increasing among developers, (2) browser related discussions are prevalent; however, this share is decreasing with time, (3) form validation and other DOM related discussions have been discussed consistently over time, (4) web related discussions are becoming more prevalent in mobile development, and (5) developers face implementation issues with new HTML5 features such as Canvas. We examine the implications of the results on the development, research, and standardization communities.

Process Mining Multiple Repositories for Software Defect Resolution from Control and Organizational Perspective
Monika Gupta, Ashish Sureka, and Srinivas Padmanabhuni
(IIIT Delhi, India; Infosys, India)
Issue reporting and resolution is a software engineering process supported by tools such as Issue Tracking System (ITS), Peer Code Review (PCR) system and Version Control System (VCS). Several open source software projects such as Google Chromium and Android follow process in which a defect or feature enhancement request is reported to an issue tracker followed by source-code change or patch review and patch commit using a version control system. We present an application of process mining three software repositories (ITS, PCR and VCS) from control flow and organizational perspective for effective process management. ITS, PCR and VCS are not explicitly linked so we implement regular expression based heuristics to integrate data from three repositories for Google Chromium project. We define activities such as bug reporting, bug fixing, bug verification, patch submission, patch review, and source code commit and create an event log of the bug resolution process. The extracted event log contains audit trail data such as caseID, timestamp, activity name and performer. We discover runtime process model for bug resolution process spanning three repositories using process mining tool, Disco, and conduct process performance and efficiency analysis. We identify bottlenecks, define and detect basic and composite anti-patterns. In addition to control flow analysis, we mine event log to perform organizational analysis and discover metrics such as handover of work, subcontracting, joint cases and joint activities.

Mining Applications

MUX: Algorithm Selection for Software Model Checkers
Varun Tulsian, Aditya Kanade, Rahul Kumar, Akash Lal

, and Aditya V. Nori
(Indian Institute of Science, India; Microsoft Research, India)
With the growing complexity of modern day software, software model checking has become a critical technology for ensuring correctness of software. As is true with any promising technology, there are a number of tools for software model checking. However, their respective performance trade-offs are difficult to characterize accurately – making it difficult for practitioners to select a suitable tool for the task at hand. This paper proposes a technique called MUX that addresses the problem of selecting the most suitable software model checker for a given input instance. MUX performs machine learning on a repository of software verification instances. The algorithm selector, synthesized through machine learning, uses structural features from an input instance, comprising a program-property pair, at runtime and determines which tool to use.
We have implemented MUX for Windows device drivers and evaluated it on a number of drivers and model checkers. Our results are promising in that the algorithm selector not only avoids a significant number of timeouts but also improves the total runtime by a large margin, compared to any individual model checker. It also outperforms a portfolio-based algorithm selector being used in Microsoft at present. Besides, MUX identifies structural features of programs that are key factors in determining performance of model checkers.

Improving the Effectiveness of Test Suite through Mining Historical Data
Jeff Anderson, Saeed Salem, and Hyunsook Do
(Microsoft, USA; North Dakota State University, USA)
Software regression testing is an integral part of most major software projects. As projects grow larger and the number of tests increases, performing regression testing becomes more costly. If software engineers can identify and run tests that are more likely to detect failures during regression testing, they may be able to better manage their regression testing activities. In this paper, to help identify such test cases, we developed techniques that utilizes various types of information in software repositories. To assess our techniques, we conducted an empirical study using an industrial software product, Microsoft Dynamics AX, which contains real faults. Our results show that the proposed techniques can be effective in identifying test cases that are likely to detect failures.

Finding Patterns in Static Analysis Alerts: Improving Actionable Alert Ranking
Quinn Hanam, Lin Tan, Reid Holmes, and Patrick Lam
(University of Waterloo, Canada)
Static analysis (SA) tools that find bugs by inferring programmer beliefs (e.g., FindBugs) are commonplace in today's software industry. While they find a large number of actual defects, they are often plagued by high rates of alerts that a developer would not act on (unactionable alerts) because they are incorrect, do not significantly affect program execution, etc. High rates of unactionable alerts decrease the utility of static analysis tools in practice.
We present a method for differentiating actionable and unactionable alerts by finding alerts with similar code patterns. To do so, we create a feature vector based on code characteristics at the site of each SA alert. With these feature vectors, we use machine learning techniques to build an actionable alert prediction model that is able to classify new SA alerts.
We evaluate our technique on three subject programs using the FindBugs static analysis tool and the Faultbench benchmark methodology. For a developer inspecting the top 5% of all alerts for three sample projects, our approach is able to identify 57 of 211 actionable alerts, which is 38 more than the FindBugs priority measure. Combined with previous actionable alert identification techniques, our method finds 75 actionable alerts in the top 5%, which is four more actionable alerts (a 6% improvement) than previous actionable alert identification techniques.

Impact Analysis of Change Requests on Source Code Based on Interaction and Commit Histories
Motahareh Bahrami Zanjani, George Swartzendruber, and Huzefa Kagdi
(Wichita State University, USA)
The paper presents an approach to perform impact analysis (IA) of an incoming change request on source code. The approach is based on a combination of interaction (e.g., Mylyn) and commit (e.g., CVS) histories. The source code entities (i.e., files and methods) that were interacted or changed in the resolution of past change requests (e.g., bug fixes) were used. Information retrieval, machine learning, and lightweight source code analysis techniques were employed to form a corpus from these source code entities. Additionally, the corpus was augmented with the textual descriptions of the previously resolved change requests and their associated commit messages. Given a textual description of a change request, this corpus is queried to obtain a ranked list of relevant source code entities that are most likely change prone. Such an approach that combines information from interactions and commits for IA at the change request level was not previously investigated. Furthermore, the approach requires only the entities that were interacted and/or committed in the past, which differs from the previous solutions that require indexing of a complete snapshot (e.g., a release).
An empirical study on 3272 interactions and 5093 commits from Mylyn, an open source task management tool, was conducted. The results show that the combined approach outperforms an individual approach based on commits. Moreover, it also outperformed an approach based on indexing a single, complete snapshot of a software system.

Defect Prediction

An Empirical Study of Just-in-Time Defect Prediction using Cross-Project Models
Takafumi Fukushima, Yasutaka Kamei, Shane McIntosh, Kazuhiro Yamashita, and Naoyasu Ubayashi
(Kyushu University, Japan; Queen's University, Canada)
Prior research suggests that predicting defect-inducing changes, i.e., Just-In-Time (JIT) defect prediction is a more practical alternative to traditional defect prediction techniques, providing immediate feedback while design decisions are still fresh in the minds of developers. Unfortunately, similar to traditional defect prediction models, JIT models require a large amount of training data, which is not available when projects are in initial development phases. To address this flaw in traditional defect prediction, prior work has proposed cross-project models, i.e., models learned from older projects with sufficient history. However, cross-project models have not yet been explored in the context of JIT prediction. Therefore, in this study, we empirically evaluate the performance of JIT cross-project models. Through a case study on 11 open source projects, we find that in a JIT cross-project context: (1) high performance within-project models rarely perform well; (2) models trained on projects that have similar correlations between predictor and dependent variables often perform well; and (3) ensemble learning techniques that leverage historical data from several other projects (e.g., voting experts) often perform well. Our findings empirically confirm that JIT cross-project models learned using other projects are a viable solution for projects with little historical data. However, JIT cross-project models perform best when the data used to learn them is carefully selected.

Towards Building a Universal Defect Prediction Model
Feng Zhang, Audris Mockus, Iman Keivanloo, and Ying Zou

(Queen's University, Canada; Avaya Labs Research, USA)
To predict files with defects, a suitable prediction model must be built for a software project from either itself (within-project) or other projects (cross-project). A universal defect prediction model that is built from the entire set of diverse projects would relieve the need for building models for an individual project. A universal model could also be interpreted as a basic relationship between software metrics and defects. However, the variations in the distribution of predictors pose a formidable obstacle to build a universal model. Such variations exist among projects with different context factors (e.g., size and programming language). To overcome this challenge, we propose context-aware rank transformations for predictors. We cluster projects based on the similarity of the distribution of 26 predictors, and derive the rank transformations using quantiles of predictors for a cluster. We then fit the universal model on the transformed data of 1,398 open source projects hosted on SourceForge and GoogleCode. Adding context factors to the universal model improves the predictive power. The universal model obtains prediction performance comparable to the within-project models and yields similar results when applied on five external projects (one Apache and four Eclipse projects). These results suggest that a universal defect prediction model may be an achievable goal.

Code Review and Code Search

The Impact of Code Review Coverage and Code Review Participation on Software Quality: A Case Study of the Qt, VTK, and ITK Projects
Shane McIntosh, Yasutaka Kamei, Bram Adams, and Ahmed E. Hassan

(Queen's University, Canada; Kyushu University, Japan; Polytechnique Montréal, Canada)
Software code review, i.e., the practice of having third-party team members critique changes to a software system, is a well-established best practice in both open source and proprietary software domains. Prior work has shown that the formal code inspections of the past tend to improve the quality of software delivered by students and small teams. However, the formal code inspection process mandates strict review criteria (e.g., in-person meetings and reviewer checklists) to ensure a base level of review quality, while the modern, lightweight code reviewing process does not. Although recent work explores the modern code review process qualitatively, little research quantitatively explores the relationship between properties of the modern code review process and software quality. Hence, in this paper, we study the relationship between software quality and: (1) code review coverage, i.e., the proportion of changes that have been code reviewed, and (2) code review participation, i.e., the degree of reviewer involvement in the code review process. Through a case study of the Qt, VTK, and ITK projects, we find that both code review coverage and participation share a significant link with software quality. Low code review coverage and participation are estimated to produce components with up to two and five additional post-release defects respectively. Our results empirically confirm the intuition that poorly reviewed code has a negative impact on software quality in large systems using modern reviewing tools.

Modern Code Reviews in Open-Source Projects: Which Problems Do They Fix?
Moritz Beller, Alberto Bacchelli, Andy Zaidman, and Elmar Juergens
(Delft University of Technology, Netherlands; CQSE, Germany)
Code review is the manual assessment of source code by humans, mainly intended to identify defects and quality problems. Modern Code Review (MCR), a lightweight variant of the code inspections investigated since the 1970s, prevails today both in industry and open-source software (OSS) systems. The objective of this paper is to increase our understanding of the practical benefits that the MCR process produces on reviewed source code. To that end, we empirically explore the problems fixed through MCR in OSS systems. We manually classified over 1,400 changes taking place in reviewed code from two OSS projects into a validated categorization scheme. Surprisingly, results show that the types of changes due to the MCR process in OSS are strikingly similar to those in the industry and academic systems from literature, featuring the similar 75:25 ratio of maintainability-related to functional problems. We also reveal that 7–35% of review comments are discarded and that 10–22% of the changes are not triggered by an explicit review comment. Patterns emerged in the review data; we investigated them revealing the technical factors that influence the number of changes due to the MCR process. We found that bug-fixing tasks lead to fewer changes and tasks with more altered files and a higher code churn have more changes. Contrary to intuition, the person of the reviewer had no impact on the number of changes.

Thesaurus-Based Automatic Query Expansion for Interface-Driven Code Search
Otávio A. L. Lemos, Adriano C. de Paula, Felipe C. Zanichelli, and Cristina V. Lopes
(Federal University of São Paulo, Brazil; University of California at Irvine, USA)
Software engineers often resort to code search practices to support software maintenance and evolution tasks, in particular code reuse. An issue that affects code search is the vocabulary mismatch problem: while searching for a particular function, users have to guess the exact words that were chosen by original developers to name code entities. In this paper we present an automatic query expansion (AQE) approach that uses word relations to increase the chances of finding relevant code. The approach is applied on top of Test-Driven Code Search (TDCS), a promising code retrieval technique that uses test cases as inputs to formulate the search query, but can also be used with other techniques that handle interface definitions to produce queries (interface-driven code search). Since these techniques rely on keywords and types, the vocabulary mismatch problem is also relevant. AQE is carried out by leveraging WordNet, a type thesaurus for expanding types, and another thesaurus containing only software-related word relations. Our approach is general but was specifically designed for non-native English speakers, who are frequently unaware of the most common terms used to name functions in software. Our evaluation with 36 non-native subjects - including developers and senior Computer Science students - provides evidence that our approach can improve the chances of finding relevant functions by 41% (recall improvement of 30%, on average), without hurting precision.

Effort Estimation and Reuse

Estimating Development Effort in Free/Open Source Software Projects by Mining Software Repositories: A Case Study of OpenStack
Gregorio Robles, Jesús M. González-Barahona, Carlos Cervigón, Andrea Capiluppi, and Daniel Izquierdo-Cortázar
(Universidad Rey Juan Carlos, Spain; Brunel University, UK; Bitergia, Spain)
Because of the distributed and collaborative nature of free / open source software (FOSS) projects, the development effort invested in a project is usually unknown, even after the software has been released. However, this information is becoming of major interest, especially ---but not only--- because of the growth in the number of companies for which FOSS has become relevant for their business strategy. In this paper we present a novel approach to estimate effort by considering data from source code management repositories. We apply our model to the OpenStack project, a FOSS project with more than 1,000 authors, in which several tens of companies cooperate. Based on data from its repositories and together with the input from a survey answered by more than 100 developers, we show that the model offers a simple, but sound way of obtaining software development estimations with bounded margins of error.

Info

An Industrial Case Study of Automatically Identifying Performance Regression-Causes
Thanh H. D. Nguyen, Meiyappan Nagappan, Ahmed E. Hassan

, Mohamed Nasser, and Parminder Flora
(Queen's University, Canada; BlackBerry, Canada)
Even the addition of a single extra field or control statement in the source code of a large-scale software system can lead to performance regressions. Such regressions can considerably degrade the user experience. Working closely with the members of a performance engineering team, we observe that they face a major challenge in identifying the cause of a performance regression given the large number of performance counters (e.g., memory and CPU usage) that must be analyzed. We propose the mining of a regression-causes repository (where the results of performance tests and causes of past regressions are stored) to assist the performance team in identifying the regression-cause of a newly-identified regression. We evaluate our approach on an open-source system, and a commercial system for which the team is responsible. The results show that our approach can accurately (up to 80% accuracy) identify performance regression-causes using a reasonably small number of historical test runs (sometimes as few as four test runs per regression-cause).

Revisiting Android Reuse Studies in the Context of Code Obfuscation and Library Usages
Mario Linares-Vásquez, Andrew Holtzhauer, Carlos Bernal-Cárdenas, and Denys Poshyvanyk

(College of William and Mary, USA; Universidad Nacional de Colombia, Colombia)
In the recent years, studies of design and programming practices in mobile development are gaining more attention from researchers. Several such empirical studies used Android applications (paid, free, and open source) to analyze factors such as size, quality, dependencies, reuse, and cloning. Most of the studies use executable files of the apps (APK files), instead of source code because of availability issues (most of free apps available at the Android official market are not open-source, but still can be downloaded and analyzed in APK format). However, using only APK files in empirical studies comes with some threats to the validity of the results. In this paper, we analyze some of these pertinent threats. In particular, we analyzed the impact of third-party libraries and code obfuscation practices on estimating the amount of reuse by class cloning in Android apps. When including and excluding third-party libraries from the analysis, we found statistically significant differences in the amount of class cloning 24,379 free Android apps. Also, we found some evidence that obfuscation is responsible for increasing a number of false positives when detecting class clones. Finally, based on our findings, we provide a list of actionable guidelines for mining and analyzing large repositories of Android applications and minimizing these threats to validity

Mining Mix

Syntax Errors Just Aren't Natural: Improving Error Reporting with Language Models
Joshua Charles Campbell, Abram Hindle, and José Nelson Amaral
(University of Alberta, Canada)
A frustrating aspect of software development is that compiler error messages often fail to locate the actual cause of a syntax error. An errant semicolon or brace can result in many errors reported throughout the file. We seek to find the actual source of these syntax errors by relying on the consistency of software: valid source code is usually repetitive and unsurprising. We exploit this consistency by constructing a simple N-gram language model of lexed source code tokens. We implemented an automatic Java syntax-error locator using the corpus of the project itself and evaluated its performance on mutated source code from several projects. Our tool, trained on the past versions of a project, can effectively augment the syntax error locations produced by the native compiler. Thus we provide a methodology and tool that exploits the naturalness of software source code to detect syntax errors alongside the parser.

Info

Do Developers Feel Emotions? An Exploratory Analysis of Emotions in Software Artifacts
Alessandro Murgia, Parastou Tourani, Bram Adams, and Marco Ortu
(University of Antwerp, Belgium; Polytechnique Montréal, Canada; University of Cagliari, Italy)
Software development is a collaborative activity in which developers interact to create and maintain a complex software system. Human collaboration inevitably evokes emotions like joy or sadness, which can affect the collaboration either positively or negatively, yet not much is known about the individual emotions and their role for software development stakeholders. In this study, we analyze whether development artifacts like issue reports carry any emotional information about software development. This is a first step towards verifying the feasibility of an automatic tool for emotion mining in software development artifacts: if humans cannot determine any emotion from a software artifact, neither can a tool. Analysis of the Apache Software Foundation issue tracking system shows that developers do express emotions (in particular gratitude, joy and sadness). However, the more context is provided about an issue report, the more human raters start to doubt and nuance their interpretation of emotions. More investigation is needed before building a fully automatic emotion mining tool.

How Does a Typical Tutorial for Mobile Development Look Like?
Rebecca Tiarks and Walid Maalej
(University of Hamburg, Germany)
We report on an exploratory study, which aims at understanding how development tutorials are structured, what types of tutorials exist, and how official tutorials differ from tutorials written by development communities. We analyzed over 1.200 tutorials for mobile application development provided by six different sources for the three major platforms: Android, Apple iOS, and Windows Phone. We found that a typical tutorial contains around 2700 words distributed over 4 pages and including a list of instructions with 18 items. Overall, 70% of the tutorials contain source code examples and a similar fraction contain images. On average, one tutorial has 6 images. When analyzing the images, we found that the studied iOS community posted the largest number of images, 14 images per tutorial, on average, from which 74% are plain images, i.e., mainly screenshots without stencils, diagrams, or highlights. In contrast, 36% of the images included in the official tutorials by Apple were diagrams or images with stencils. Community sites seem to follow a similar structure to the official sites but include items and images which are rather underrepresented in the official tutorials. From the analysis of the tutorials content by means of natural language processing combined with manual content analysis, we derived four categories for mobile development tutorials: infrastructure and design, application and services, distribution and maintenance, and development platform. Our categorization can help tutorial writers to better organize and evaluate the content of their tutorials and identify missing tutorials.

Unsupervised Discovery of Intentional Process Models from Event Logs
Ghazaleh Khodabandelou, Charlotte Hug, Rebecca Deneckère, and Camille Salinesi
(Sorbonne, France)
Research on guidance and method engineering has highlighted that many method engineering issues, such as lack of flexibility or adaptation, are solved more effectively when intentions are explicitly specified. However, software engineering process models are most often described in terms of sequences of activities. This paper presents a novel approach, so-called Map Miner Method (MMM), designed to automate the construction of intentional process models from process logs. To do so, MMM uses Hidden Markov Models to model users' activities logs in terms of users' strategies. MMM also infers users' intentions and constructs fine-grained and coarse-grained intentional process models with respect to the Map metamodel syntax (i.e., metamodel that specifies intentions and strategies of process actors). These models are obtained by optimizing a new precision-fitness metric. The result is a software engineering method process specification aligned with state of the art of method engineering approaches. As a case study, the MMM is used to mine the intentional process associated to the Eclipse platform usage. Observations show that the obtained intentional process model offers a new understanding of software processes, and could readily be used for recommender systems.

Short Research/Practice Papers

Tracing Dynamic Features in Python Programs
Beatrice Åkerblom

, Jonathan Stendahl, Mattias Tumlin, and Tobias Wrigstad

(Stockholm University, Sweden; Uppsala University, Sweden)
Recent years have seen a number of proposals for adding (retrofitting) static typing to dynamic programming languages, a natural consequence of their growing popularity for non-toy applications across a multitude of domains. These proposals often make assumptions about how programmers write code, and in many cases restrict the way the languages can be used. In the context of Python, this paper describes early results from trace-based collection of run-time data about the use of built-in language features which are inherently hard to type, such as dynamic code generation. The end goal of this work is to facilitate static validation tooling for Python, in particular retrofitting of type systems.

It's Not a Bug, It's a Feature: Does Misclassification Affect Bug Localization?
Pavneet Singh Kochhar, Tien-Duy B. Le, and David Lo

(Singapore Management University, Singapore)
Bug localization refers to the task of automatically processing bug reports to locate source code files that are responsible for the bugs. Many bug localization techniques have been proposed in the literature. These techniques are often evaluated on issue reports that are marked as bugs by their reporters in issue tracking systems. However, recent findings by Herzig et al. find that a substantial number of issue reports marked as bugs, are not bugs but other kinds of issues like refactorings, request for enhancement, documentation changes, test case creation, and so on. Herzig et al. report that these misclassifications affect bug prediction, namely the task of predicting which files are likely to be buggy in the future. In this work, we investigate whether these misclassifications also affect bug localization. To do so, we analyze issue reports that have been manually categorized by Herzig et al. and apply a bug localization technique to recover a ranked list of candidate buggy files for each issue report. We then evaluate whether the quality of ranked lists of reports reported as bugs is the same as that of real bug reports. Our findings shed light that there is a need for additional cleaning steps to be performed on issue reports before they are used to evaluate bug localization techniques.

Classifying Unstructured Data into Natural Language Text and Technical Information
Thorsten Merten, Bastian Mager, Simone Bürsner, and Barbara Paech
(Bonn-Rhein-Sieg University of Applied Sciences, Germany; University of Heidelberg, Germany)
Software repository data, for example in issue tracking systems, include natural language text and technical information, which includes anything from log files via code snippets to stack traces.
However, data mining is often only interested in one of the two types e.g. in natural language text when looking at text mining. Regardless of which type is being investigated, any techniques used have to deal with noise caused by fragments of the other type i.e. methods interested in natural language have to deal with technical fragments and vice versa.
This paper proposes an approach to classify unstructured data, e.g. development documents, into natural language text and technical information using a mixture of text heuristics and agglomerative hierarchical clustering.
The approach was evaluated using 225 manually annotated text passages from developer emails and issue tracker data. Using white space tokenization as a basis, the overall precision of the approach is 0.84 and the recall is 0.85.

Collaboration in Open-Source Projects: Myth or Reality?
Yuriy Tymchuk, Andrea Mocci, and Michele Lanza

(University of Lugano, Switzerland)
One of the fundamental principles of open-source projects is that they foster collaboration among developers, disregarding their geographical location or personal background. When it comes to software repositories collaboration is a rather ephemeral phenomenon which lacks a clear definition, and it must therefore be mined and modeled. This throws up the question whether what is mined actually maps to reality.
In this paper we investigate collaboration by modeling it using a number of diverse approaches that we then compare to a ground truth obtained by surveying a substantial set of developers of the Pharo open-source community. Our findings indicate that the notion of collaboration must be revisited, as it is undermined by a number of factors that are often tackled in imprecise ways or not taken into account at all.

Improving the Accuracy of Duplicate Bug Report Detection using Textual Similarity Measures
Alina Lazar, Sarah Ritchey, and Bonita Sharif
(Youngstown State University, USA)
The paper describes an improved method for automatic duplicate bug report detection based on new textual similarity features and binary classification. Using a set of new textual features, inspired from recent text similarity research, we train several binary classification models. A case study was conducted on three open source systems: Eclipse, Open Office, and Mozilla to determine the effectiveness of the improved method. A comparison is also made with current state-of-the-art approaches highlighting similarities and differences. Results indicate that the accuracy of the proposed method is better than previously reported research with respect to all three systems.

Undocumented and Unchecked: Exceptions That Spell Trouble
Maria Kechagia and Diomidis Spinellis

(Athens University of Economics and Business, Greece)
Modern programs rely on large application programming interfaces (APIs). The Android framework comprises 231 core APIs, and is used by countless developers. We examine a sample of 4,900 distinct crash stack traces from 1,800 different Android applications, looking for Android API methods with undocumented exceptions that are part of application crashes. For the purposes of this study, we take as a reference the version 15 of the Android API, which matches our stack traces. Our results show that a significant number of crashes (19%) might have been avoided if these methods had the corresponding exceptions documented as part of their interface.

Innovation Diffusion in Open Source Software: Preliminary Analysis of Dependency Changes in the Gentoo Portage Package Database
Remco Bloemen, Chintan Amrit, Stefan Kuhlmann, and Gonzalo Ordóñez–Matamoros
(University of Twente, Netherlands)
In this paper we make the case that software dependencies are a form of innovation adoption. We then test this on the time-evolution of the Gentoo package dependency graph. We find that the Bass model of innovation diffusion fits the growth of the number of packages depending on a given library. Interestingly, we also find that low-level packages have a primarily imitation driven adoption and multimedia libraries have primarily innovation driven growth.

A Dictionary to Translate Change Tasks to Source Code
Katja Kevic and Thomas Fritz

(University of Zurich, Switzerland)
At the beginning of a change task, software developers spend a substantial amount of their time searching and navigating to locate relevant parts in the source code. Current approaches to support developers in this initial code search predominantly use information retrieval techniques that leverage the similarity between task descriptions and the identifiers of code elements to recommend relevant elements. However, the vocabulary or language used in source code often differs from the one used for describing change tasks, especially since the people developing the code are not the same as the ones reporting bugs or defining new features to be implemented. In our work, we investigate the creation of a dictionary that maps the different vocabularies using information from change sets and interaction histories stored with previously completed tasks. In an empirical analysis on four open source projects, our approach substantially improved upon the results of traditional information retrieval techniques for recommending relevant code elements.

New Features for Duplicate Bug Detection
Nathan Klein, Christopher S. Corley, and Nicholas A. Kraft
(Oberlin College, USA; University of Alabama, USA)
Issue tracking software of large software projects receive a large volume of issue reports each day. Each of these issues is typically triaged by hand, a time consuming and error prone task. Additionally, issue reporters lack the necessary understanding to know whether their issue has previously been reported. This leads to issue trackers containing a lot of duplicate reports, adding complexity to the triaging task.
Duplicate bug report detection is designed to aid developers by automatically grouping bug reports concerning identical issues. Previous work by Alipour et al. has shown that the textual, categorical, and contextual information of an issue report are effective measures in duplicate bug report detection. In our work, we extend previous work by introducing a range of metrics based on the topic distribution of the issue reports, relying only on data taken directly from bug reports. In particular, we introduce a novel metric that measures the first shared topic between two topic-document distributions. This paper details the evaluation of this group of pair-based metrics with a range of machine learning classifiers, using the same issues used by Alipour et al. We demonstrate that the proposed metrics show a significant improvement over previous work, and conclude that the simple metrics we propose should be considered in future studies on bug report deduplication, as well as for more general natural language processing applications.

Mining Modern Repositories with Elasticsearch
Oleksii Kononenko, Olga Baysal, Reid Holmes, and Michael W. Godfrey

(University of Waterloo, Canada)
Organizations are generating, processing, and retaining data at a rate that often exceeds their ability to analyze it effectively; at the same time, the insights derived from these large data sets are often key to the success of the organizations, allowing them to better understand how to solve hard problems and thus gain competitive advantage. Because this data is so fast-moving and voluminous, it is increasingly impractical to analyze using traditional offline, read-only relational databases.
Recently, new "big data" technologies and architectures, including Hadoop and NoSQL databases, have evolved to better support the needs of organizations analyzing such data. In particular, Elasticsearch - a distributed full-text search engine - explicitly addresses issues of scalability, big data search, and performance that relational databases were simply never designed to support. In this paper, we reflect upon our own experience with Elasticsearch and highlight its strengths and weaknesses for performing modern mining software repositories research.

Mining Challenge

A Study of External Community Contribution to Open-Source Projects on GitHub
Rohan Padhye, Senthil Mani, and Vibha Singhal Sinha
(IBM Research, India)
Open-source software projects are primarily driven by community contribution. However, commit access to such projects' software repositories is often strictly controlled. These projects prefer to solicit external participation in the form of patches or pull requests. In this paper, we analyze a set of 89 top-starred GitHub projects and their forks in order to explore the nature and distribution of such community contribution. We first classify commits (and developers) into three categories: core, external and mutant, and study the relative sizes of each of these classes through a ring-based visualization. We observe that projects written in mainstream scripting languages such as JavaScript and Python tend to include more external participation than projects written in upcoming languages such as Scala. We also visualize the geographic spread of these communities via geocoding. Finally, we classify the types of pull requests submitted based on their labels and observe that bug fixes are more likely to be merged into the main projects as compared to feature enhancements.

Understanding "Watchers" on GitHub
Jyoti Sheoran, Kelly Blincoe, Eirini Kalliamvakou, Daniela Damian

, and Jordan Ell
(University of Victoria, Canada)
Users on GitHub can watch repositories to receive notifications about project activity. This introduces a new type of passive project membership. In this paper, we investigate the behavior of watchers and their contribution to the projects they watch. We find that a subset of project watchers begin contributing to the project and those contributors account for a significant percentage of contributors on the project. As contributors, watchers are more confident and contribute over a longer period of time in a more varied way than other contributors. This is likely attributable to the knowledge gained through project notifications.

Do Developers Discuss Design?
João Brunet, Gail C. Murphy, Ricardo Terra, Jorge Figueiredo, and Dalton Serey
(Federal University of Campina Grande, Brazil; University of British Columbia, Canada; Federal University of Lavras, Brazil)
Design is often raised in the literature as important to attaining various properties and characteristics in a software system. At least for open-source projects, it can be hard to find evidence of ongoing design work in the technical artifacts produced as part of the development. Although developers usually do not produce specific design documents, they do communicate about design in different ways. In this paper, we provide quantitative evidence that developers address design through discussions in commits, issues, and pull requests. To achieve this, we built a discussions' classifier and automatically labeled 102,122 discussions from 77 projects. Based on this data, we make four observations about the projects: i) on average, 25% of the discussions in a project are about design; ii) on average, 26% of developers contribute to at least one design discussion; iii) only 1% of the developers contribute to more than 15% of the discussions in a project; and iv) these few developers who contribute to a broad range of design discussions are also the top committers in a project.

Magnet or Sticky? An OSS Project-by-Project Typology
Kazuhiro Yamashita, Shane McIntosh, Yasutaka Kamei, and Naoyasu Ubayashi
(Kyushu University, Japan; Queen's University, Canada)
For Open Source Software (OSS) projects, retaining existing contributors and attracting new ones is a major concern. In this paper, we expand and adapt a pair of population migration metrics to analyze migration trends in a collection of open source projects. Namely, we study: (1) project stickiness, i.e., its tendency to retain existing contributors and (2) project magnetism, i.e., its tendency to attract new contributors. Using quadrant plots, we classify projects as attractive (highly magnetic and sticky), stagnant (highly sticky, weakly magnetic), fluctuating (highly magnetic, weakly sticky), or terminal (weakly magnetic and sticky). Through analysis of the MSR challenge dataset, we find that: (1) quadrant plots can effectively identify at-risk projects, (2) stickiness is often motivated by professional activity and (3) transitions among quadrants as a project ages often coincides with interesting events in the evolution history of a project.

Security and Emotion: Sentiment Analysis of Security Discussions on GitHub
Daniel Pletea, Bogdan Vasilescu, and Alexander Serebrenik
(Eindhoven University of Technology, Netherlands)
Application security is becoming increasingly prevalent during software and especially web application development. Consequently, countermeasures are continuously being discussed and built into applications, with the goal of reducing the risk that unauthorized code will be able to access, steal, modify, or delete sensitive data. In this paper we gauged the presence and atmosphere surrounding security-related discussions on GitHub, as mined from discussions around commits and pull requests. First, we found that security related discussions account for approximately 10% of all discussions on GitHub. Second, we found that more negative emotions are expressed in security-related discussions than in other discussions. These findings confirm the importance of properly training developers to address security concerns in their applications as well as the need to test applications thoroughly for security vulnerabilities in order to reduce frustration and improve overall project atmosphere.

Sentiment Analysis of Commit Comments in GitHub: An Empirical Study
Emitza Guzman, David Azócar, and Yang Li
(TU München, Germany)
Emotions have a high impact in productivity, task quality, creativity, group rapport and job satisfaction. In this work we use lexical sentiment analysis to study emotions expressed in commit comments of different open source projects and analyze their relationship with different factors such as used programming language, time and day of the week in which the commit was made, team distribution and project approval. Our results show that projects developed in Java tend to have more negative commit comments, and that projects that have more distributed teams tend to have a higher positive polarity in their emotional content. Additionally, we found that commit comments written on Mondays tend to a more negative emotion. While our results need to be confirmed by a more representative sample they are an initial step into the study of emotions and related factors in open source projects.

Analysing the 'Biodiversity' of Open Source Ecosystems: The GitHub Case
Nicholas Matragkas, James R. Williams, Dimitris S. Kolovos

, and Richard F. Paige
(University of York, UK)
In nature the diversity of species and genes in ecological communities affects the functioning of these communities. Biologists have found out that more diverse communities appear to be more productive than less diverse communities. Moreover such communities appear to be more stable in the face of perturbations. In this paper, we draw the analogy between ecological communities and Open Source Software (OSS) ecosystems, and we investigate the diversity and structure of OSS communities. To address this question we use the MSR 2014 challenge dataset, which includes data from the top-10 software projects for the top programming languages on GitHub. Our findings show that OSS communities on GitHub consist of 3 types of users (core developers, active users, passive users). Moreover, we show that the percentage of core developers and active users does not change as the project grows and that the majority of members of large projects are passive users.

Co-evolution of Project Documentation and Popularity within Github
Karan Aggarwal, Abram Hindle, and Eleni Stroulia
(University of Alberta, Canada)
Github is a very popular collaborative software-development platform that provides typical source-code management and issue tracking features augmented by strong social-networking features such as following developers and watching projects. These features help ``spread the word'' about individuals and projects, building the reputation of the former and increasing the popularity of the latter. In this paper, we investigate the relation between project popularity and regular, consistent documentation updates. We found strong indicators that consistently popular projects exhibited consistent documentation effort and that this effort tended to attract more documentation collaborators. We also found that frameworks required more documentation effort than libraries to achieve similar adoption success, especially in the initial phase.

An Insight into the Pull Requests of GitHub
Mohammad Masudur Rahman and Chanchal K. Roy
(University of Saskatchewan, Canada)
Given the increasing number of unsuccessful pull requests in GitHub projects, insights into the success and failure of these requests are essential for the developers. In this paper, we provide a comparative study between successful and unsuccessful pull requests made to 78 GitHub base projects by 20,142 developers from 103,192 forked projects. In the study, we analyze pull request discussion texts, project specific information (e.g., domain, maturity), and developer specific information (e.g., experience) in order to report useful insights, and use them to contrast between successful and unsuccessful pull requests. We believe our study will help developers overcome the issues with pull requests in GitHub, and project administrators with informed decision making.

Data Showcase

A Dataset for Pull-Based Development Research
Georgios Gousios

and Andy Zaidman
(Delft University of Technology, Netherlands)
Pull requests form a new method for collaborating in distributed software development. To study the pull request distributed development model, we constructed a dataset of almost 900 projects and 350,000 pull requests, including some of the largest users of pull requests on Github. In this paper, we describe how the project selection was done, we analyze the selected features and present a machine learning tool set for the R statistics environment.

The Bug Catalog of the Maven Ecosystem
Dimitris Mitropoulos, Vassilios Karakoidas, Panos Louridas, Georgios Gousios

, and Diomidis Spinellis

(Athens University of Economics and Business, Greece; Delft University of Technology, Netherlands)
Examining software ecosystems can provide the research community with data regarding artifacts, processes, and communities. We present a dataset obtained from the Maven central repository ecosystem (approximately 265GB of data) by statically analyzing the repository to detect potential software bugs. For our analysis we used FindBugs, a tool that examines Java bytecode to detect numerous types of bugs. The dataset contains the metrics results that FindBugs reports for every project version (a JAR) included in the ecosystem. For every version we also stored specific metadata such as the JAR's size, its dependencies and others. Our dataset can be used to produce interesting research results, as we show in specific examples.

A Dataset of Feature Additions and Feature Removals from the Linux Kernel
Leonardo Passos and Krzysztof Czarnecki
(University of Waterloo, Canada)
This paper describes a dataset of feature additions and removals in the Linux kernel evolution history, spanning over seven years of kernel development. Features, in this context, denote configurable system options that users select when creating customized kernel images. The provided dataset is the largest corpus we are aware of capturing feature additions and removals, allowing researchers to assess the kernel evolution from a feature-oriented point-of-view. Furthermore, the dataset can be used to better understand how features evolve over time, and how different artifacts change as a result. One particular use of the dataset is to provide a real-world case to assess existing support for feature traceability and evolution. In this paper, we detail the dataset extraction process, the underlying database schema, and example queries. The dataset is directly available at our Bitbucket repository: https://bitbucket.org/lpassos/kconfigdb

Kataribe: A Hosting Service of Historage Repositories
Kenji Fujiwara, Hideaki Hata, Erina Makihara, Yusuke Fujihara, Naoki Nakayama, Hajimu Iida, and Kenichi Matsumoto

(NAIST, Japan)
In the research of Mining Software Repositories, code repository is one of the core source since it contains the product of software development. Code repository stores the versions of files, and makes it possible to browse the histories of files, such as modification dates, authors, messages, etc. Although such rich information of file histories is easily available, extracting the histories of methods, which are elements of source code files, is not easy from general code repositories. To tackle this difficulty, we have developed Historage, a fine-grained version control system. Historage repository is a Git repository which is built upon original Git repository. Therefore, similar mining techniques for general Git repositories are applicable to Historage repositories. Kataribe is a hosting service of Historage repositories, which enables researchers and developers to browse method histories on the web and clone Historage repositories to local. The Kataribe project aims to maintain and expand the datasets and features.

Info

Lean GHTorrent: GitHub Data on Demand
Georgios Gousios

, Bogdan Vasilescu, Alexander Serebrenik, and Andy Zaidman
(Delft University of Technology, Netherlands; Eindhoven University of Technology, Netherlands)
In recent years, GitHub has become the largest code host in the world, with more than 5M developers collaborating across 10M repositories. Numerous popular open source projects (such as Ruby on Rails, Homebrew, Bootstrap, Django or jQuery) have chosen GitHub as their host and have migrated their code base to it. GitHub offers a tremendous research potential. For instance, it is a flagship for current open source development, a place for developers to showcase their expertise to peers or potential recruiters, and the platform where social coding features or pull requests emerged. However, GitHub data is, to date, largely underexplored. To facilitate studies of GitHub, we have created GHTorrent, a scalable, queriable, offline mirror of the data offered through the GitHub REST API. In this paper we present a novel feature of GHTorrent designed to offer customisable data dumps on demand. The new GHTorrent data-on-demand service offers users the possibility to request via a web form up-to-date GHTorrent data dumps for any collection of GitHub repositories. We hope that by offering customisable GHTorrent data dumps we will not only lower the "barrier for entry" even further for researchers interested in mining GitHub data (thus encourage researchers to intensify their mining efforts), but also enhance the replicability of GitHub studies (since a snapshot of the data on which the results were obtained can now easily accompany each study).

A Code Clone Oracle
Daniel E. Krutz and Wei Le
(Rochester Institute of Technology, USA)
Code clones are functionally equivalent code segments. De- tecting code clones is important for determining bugs, fixes and software reuse. Code clone detection is also essential for developing fast and precise code search algorithms. How- ever, the challenge of such research is to evaluate that the clones detected are indeed functionally equivalent, consider- ing the majority of clones are not textual or even syntacti- cally identical. The goal of this work is to generate a set of method level code clones with a high confidence to help to evaluate future code clone detection and code search tools to evaluate their techniques. We selected three open source programs, Apache, Python and PostgreSQL, and randomly sampled a total of 1536 function pairs. To confirm whether or not these function pairs indicate a clone and what types of clones they belong to, we recruited three experts who have experience in code clone research and four students who have experience in programming for manual inspection. For confi- dence of the data, the experts consulted multiple code clone detection tools to make the consensus. To assist manual inspection, we built a tool to automatically load function pairs of interest and record the manual inspection results. We found that none of the 66 pairs are textual identical type- 1 clones, and 9 pairs are type-4 clones. Our data is available at: http://phd.gccis.rit.edu/weile/data/cloneoracle/.

Generating Duplicate Bug Datasets
Alina Lazar, Sarah Ritchey, and Bonita Sharif
(Youngstown State University, USA)
Automatic identification of duplicate bug reports is an important research problem in the mining software repositories field. This paper presents a collection of bug datasets collected, cleaned and preprocessed for the duplicate bug report identification problem. The datasets were extracted from open-source systems that use Bugzilla as their bug tracking component and contain all the bugs ever submitted. The systems used are Eclipse, Open Office, NetBeans and Mozilla. For each dataset, we store both the initial data and the cleaned data in separate collections in a mongoDB document-oriented database. For each dataset, in addition to the bug data collections downloaded from bug repositories, the database includes a set of all pairs of duplicate bugs together with randomly selected pairs of non-duplicate bugs. Such a dataset is useful as input for classification models and forms a good base to support replications and comparisons by other researchers. We used a subset of this data to predict duplicate bug reports but the same data set may also be used to predict bug priorities and severity.

FLOSS 2013: A Survey Dataset about Free Software Contributors: Challenges for Curating, Sharing, and Combining
Gregorio Robles, Laura Arjona Reina, Alexander Serebrenik, Bogdan Vasilescu, and Jesús M. González-Barahona
(Universidad Rey Juan Carlos, Spain; Universidad Politécnica de Madrid, Spain; Eindhoven University of Technology, Netherlands)
In this data paper we describe a data set obtained by means of performing an on-line survey to over 2,000 Free Libre Open Source Software (FLOSS) contributors. The survey includes questions related to personal characteristics (gender, age, civil status, nationality, etc.), education and level of English, professional status, dedication to FLOSS projects, reasons and motivations, involvement and goals. We describe as well the possibilities and challenges of using private information from the survey when linked with other, publicly available data sources. In this regard, an example of data sharing will be presented and legal, ethical and technical issues will be discussed.

Info

A Green Miner's Dataset: Mining the Impact of Software Change on Energy Consumption
Chenlei Zhang and Abram Hindle
(University of Alberta, Canada)
With the advent of mobile computing, the responsibility of software developers to update and ship energy efficient applications has never been more pronounced. Green mining attempts to address this responsibility by examining the impact of software change on energy consumption. One problem with green mining is that power performance data is not readily available, unlike many other forms of MSR research. Green miners have to create tests and run them across numerous versions of a software project because power performance data was either missing or never existed for that particular project. In this paper we describe multiple open green mining datasets used in prior green mining work. The dataset includes numerous power traces and parallel system call and CPU/IO/Memory traces of multiple versions of multiple products. These datasets enable those more interested in data-mining and modeling to work on green mining problems as well.

Gentoo Package Dependencies over Time
Remco Bloemen, Chintan Amrit, Stefan Kuhlmann, and Gonzalo Ordóñez–Matamoros
(University of Twente, Netherlands)
Open source distributions such as Gentoo need to accurately track dependency relations between software packages in order to install working systems. To do this, Gentoo has a carefully authored database containing those relations. In this paper, we extract the Gentoo package dependency graph and its changes over time. The final dependency graph spans 15 thousand open source projects and 80 thousand dependency relations. Furthermore, the development of this graph is tracked over time from the beginning of the Gentoo project in 2000 to the first quarter of 2012, with monthly resolution. The resulting dataset provides many opportunities for research. In this paper we explore cluster analysis to reveals meaningful relations between packages and in a separate paper we analyze changes in the dependencies over time to get insights in the innovation dynamics of open source software.

Models of OSS Project Meta-Information: A Dataset of Three Forges
James R. Williams, Davide Di Ruscio, Nicholas Matragkas, Juri Di Rocco, and Dimitris S. Kolovos

(University of York, UK; University of L'Aquila, Italy)
The process of selecting open-source software (OSS) for adoption is not straightforward as it involves exploring various sources of information to determine the quality, maturity, activity, and user support of each project. In the context of the OSSMETER project, we have developed a forge-agnostic metamodel that captures the meta-information common to all OSS projects. We specialise this metamodel for popular OSS forges in order to capture forge-specific meta-information. In this paper we present a dataset conforming to these metamodels for over 500,000 OSS projects hosted on three popular OSS forges: Eclipse, SourceForge, and GitHub. The dataset enables different kinds of automatic analysis and supports objective comparisons of cross-forge OSS alternatives with respect to a user's needs and quality requirements.

A Dataset of Clone References with Gaps
Hiroaki Murakami, Yoshiki Higo

, and Shinji Kusumoto
(Osaka University, Japan)
This paper introduces a new dataset of clone references, which is a set of correct clones consisting of their locational information with their gapped lines. Bellon's dataset is one of widely used clone datasets. Bellon's dataset contains many clone references, thus the dataset is useful for comparing accuracies among clone detectors. However, Bellon's dataset does not have locational information of gapped lines. Thus, Bellon's benchmark does not evaluate some Type-3 clones correctly. In order to resolve the problem, we added locational information of gapped lines to Bellon's dataset. The new dataset is available at ``http://sdl.ist.osaka-u.ac.jp/~h-murakm/2014_clone_references_with_gaps/''.
This paper also shows some examples that the new dataset and Bellon's dataset yield different evaluation results. Moreover, we report an experimental result that compares Bellon's dataset and the new dataset by using three clone detectors that can detect Type-3 clones. Finally, we conclude that the new dataset can evaluate Type-3 clones more correctly than Bellon's dataset.

A Dataset for Maven Artifacts and Bug Patterns Found in Them
Vaibhav Saini, Hitesh Sajnani, Joel Ossher, and Cristina V. Lopes
(University of California at Irvine, USA)
In this paper, we present data downloaded from Maven, one of the most popular component repositories. The data includes the binaries of 186,392 components, along with source code for 161,025. We identify and organize these components into groups where each group contains all the versions of a library. In order to asses the quality of these components, we make available report generated by the FindBugs tool on 64,574 components. The information is also made available in the form of a database which stores total number, type, and priority of bug patterns found in each component, along with its defect density. We also describe how this dataset can be useful in software engineering research.

OpenHub: A Scalable Architecture for the Analysis of Software Quality Attributes
Gabriel Farah, Juan Sebastian Tejada, and Dario Correal
(Universidad de los Andes, Colombia)
There is currently a vast array of open source projects available on the web, and although they are searchable by name or description in the search engines, there is no way to search for projects by how well they perform on a given set of quality attributes (e.g. usability or maintainability). With OpenHub, we present a scalable and extensible architecture for the static and runtime analysis of open source repositories written in Python, presenting the architecture and pinpointing future possibilities with it.

Understanding Software Evolution: The Maisqual Ant Data Set
Boris Baldassari and Philippe Preux
(SQuORING Technologies, France; LIFL, France; CNRS, France; INRIA, France; University of Lille, France)
Software engineering is a maturing discipline which has seen many drastic advances in the last years. However, some studies still point to the lack of rigorous and mathematically grounded methods to raise the field to a new emerging science, with proper and reproducible foundations to build upon. Indeed, mathematicians and statisticians do not necessarily have software engineering knowledge, while software engineers and practitioners do not necessarily have a mathematical background.
The Maisqual research project intends to fill the gap between both fields by proposing a controlled and peer-reviewed data set series ready to use and study. These data sets feature metrics from different repositories, from source code to mail activity and configuration management meta data. Metrics are described and commented, and all the steps followed for their extraction and treatment are described with contextual information about the data and its meaning.
This article introduces the Apache Ant weekly data set, featuring 636 extracts of the project over 12 years at different levels of artefacts – application, files, functions. By associating community and process related information to code extracts, this data set unveils interesting perspectives on the evolution of one of the great success stories of open source.

MSR 2014 – Proceedings

Frontmatter

Keynote

Green Mining

Code Clones and Origin Analysis

Bug Characterizing

Mining Repos and QA Sites

Mining Applications

Defect Prediction

Code Review and Code Search

Effort Estimation and Reuse

Mining Mix

Short Research/Practice Papers

Mining Challenge

Data Showcase