SWAN 2016 – Proceedings

Message from the Chairs
Software practitioners make technical and business decisions based on their understanding of their software systems. This understanding is grounded in their own experiences, but can be augmented by studying various kinds of development artifacts, including source code, bug reports, version control metadata, test cases, and usage logs. Unfortunately, the information contained in these artifacts is typically not organized in a way that is immediately useful for stakeholders' everyday decision-making needs. To handle the large volumes of data, many practitioners and researchers have turned to analytics -- the use of analysis, data, and systematic reasoning for making decisions. Thus, software analytics is an emerging field of modern data mining and analysis.
The International Workshop on Software Analytics (SWAN) aims to provide a common venue for researchers and practitioners across the software engineering, data mining, and mining software repositories research domains to share new approaches and emerging results in developing and validating analytics-rich solutions, as well as in adopting analytics in software development and maintenance processes to better inform everyday decisions.

API Analytics and Security

Addressing Scalability in API Method Call Analytics
Ervina Cergani, Sebastian Proksch, Sarah Nadi, and Mira Mezini
(TU Darmstadt, Germany)
Intelligent code completion recommends relevant code to developers by comparing the editor content to code patterns extracted by analyzing large repositories. However, with the vast amount of data available in such repositories, scalability of the recommender system becomes an issue. We propose using Boolean Matrix Factorization (BMF) as a clustering technique for analyzing code in order to improve scalability of the underlying models. We compare model size, inference speed, and prediction quality of an intelligent method call completion engine built on top of canopy clustering versus one built on top of BMF. Our results show that BMF reduces model size by up to 80% and increases inference speed by up to 78%, without significant change in prediction quality.
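As a rough illustration of the Boolean product that BMF relies on, the following Python sketch multiplies two small binary factor matrices and counts mismatches against a binary usage matrix. The matrices, the row/column interpretation, and the function names are hypothetical and are not taken from the paper; no factorization algorithm is shown, only how a given pair of factors is evaluated.

```python
def boolean_product(a, b):
    """Boolean matrix product: entry (i, j) is 1 if any k has a[i][k] and b[k][j] set."""
    inner, cols = len(b), len(b[0])
    return [
        [int(any(row[k] and b[k][j] for k in range(inner))) for j in range(cols)]
        for row in a
    ]


def reconstruction_error(x, approx):
    """Number of cells in which the two binary matrices differ."""
    return sum(xv != av for xr, ar in zip(x, approx) for xv, av in zip(xr, ar))


# Hypothetical data: rows are observed object usages, columns are method calls.
usages = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]
# Hypothetical factors: which usage belongs to which pattern, and which
# methods each pattern contains.
membership = [[1, 0], [1, 0], [0, 1], [1, 1]]
patterns = [[1, 1, 0], [0, 1, 1]]

approx = boolean_product(membership, patterns)
print(reconstruction_error(usages, approx))
```

In the last row both patterns contain the middle method; under Boolean arithmetic the overlap still yields 1, whereas an ordinary matrix product would yield 2.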

Vulnerability Severity Scoring and Bounties: Why the Disconnect?
Nuthan Munaiah and Andrew Meneely
(Rochester Institute of Technology, USA)
The Common Vulnerability Scoring System (CVSS) is the de facto standard for vulnerability severity measurement today and is crucial in the analytics driving software fortification. The U.S. National Vulnerability Database requires CVSS, and over 75,000 vulnerabilities have been scored using it. We examine how CVSS correlates with another, closely related measure of security impact: bounties. Recent economic studies of vulnerability disclosure processes show a clear relationship between black market value and bounty payments. We analyzed the CVSS scores and bounties awarded for 703 vulnerabilities across 24 products. We found a weak (Spearman’s ρ = 0.34) correlation between CVSS scores and bounties, with CVSS being more likely to underestimate bounty. We believe such a negative result is a cause for concern. We investigated why these measurements were so discordant by (a) analyzing the individual questions of CVSS with respect to bounties and (b) conducting a qualitative study to find the similarities and differences between CVSS and the publicly available criteria for awarding bounties. Among our findings were that the bounty criteria were more explicit about code execution and privilege escalation, whereas CVSS makes no explicit mention of those. We also found that bounty valuations are evaluated solely by project maintainers, whereas CVSS has little provenance in practice.
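For readers unfamiliar with the statistic reported above, the following Python sketch computes Spearman's ρ as the Pearson correlation of average ranks. The score and bounty values are invented for illustration and are not data from the study; the function names are likewise assumptions.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical CVSS scores and bounty amounts (USD) for five vulnerabilities.
cvss_scores = [4.3, 7.5, 9.8, 5.0, 6.8]
bounties = [500, 1000, 3000, 2000, 1500]
print(round(spearman_rho(cvss_scores, bounties), 2))
```

Because the statistic depends only on ranks, it measures how consistently higher scores go with higher bounties, not whether the two are linearly related.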

Defects and Effort Estimation

A Replication Study: Mining a Proprietary Temporal Defect Dataset
Tamer Abdou, Atakan Erdem, Ayse Bener, and Adam Neal
(Ryerson University, Canada; PSL, Canada)
We conduct a replication study to define temporal patterns of activity sequences in a proprietary dataset, and compare them with an open-source dataset. Temporal bug repository data may give many insights in the context of root-cause analysis of defects. Observing activities based on temporal changes enables the formation of temporal activity sequences. We use datasets from an issue tracking system repository of a proprietary, enterprise-level software life-cycle management tool. We define the temporal patterns of activity sequences and compare them with Firefox data. On the basis of these analyses, we observe that some activities of the sequences are more critical than others in the context of proprietary projects. Similarities and differences of the relevant activities are highlighted and explained. Integrating results from various empirical studies helped us in gradually generalizing the evidence observed in the original study, and in identifying the consistencies between open-source projects and proprietary ones. Comprehending temporal activity sequences assists software quality teams in optimizing the allocation of their human resources as well as in managing project schedules more efficiently.

A Hybrid Model for Task Completion Effort Estimation
Ali Dehghan, Kelly Blincoe, and Daniela Damian
(University of Victoria, Canada; University of Auckland, New Zealand)
Predicting the time and effort of software task completion has been an active area of research for a long time. Previous studies have proposed predictive models based on either text data or metadata of software tasks to estimate either completion time or completion effort of software tasks, but the literature has paid little attention to integrating all sets of attributes to achieve better performing models. We first apply the previously proposed models to the datasets of two IBM commercial projects called RQM and RTC to find the best performing model in predicting task completion effort on each set of attributes. Then we propose an approach to create a hybrid model based on selected individual predictors to achieve more accurate and stable results in early prediction of task completion effort and to make sure the model is not bound to particular attributes and consequently is applicable to a larger number of tasks. Categorizing task completion effort values into Low and High labels based on their measured median value, we show that our hybrid model provides 3-8% higher accuracy in early prediction of task completion effort compared to the best individual predictors.
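The median-based Low/High categorization mentioned above can be sketched in Python as follows. The effort values are invented, and the choice to label values equal to the median as Low is an assumption; the abstract does not say how such ties are handled.

```python
from statistics import median


def label_by_median(efforts):
    """Label each effort value High if it exceeds the median, otherwise Low.

    Values equal to the median are labeled Low; this tie rule is an assumption.
    """
    cutoff = median(efforts)
    return ["High" if effort > cutoff else "Low" for effort in efforts]


# Hypothetical task completion efforts in hours.
efforts = [2, 8, 3, 13, 5, 21]
print(label_by_median(efforts))
```

Splitting at the median gives two classes of roughly equal size, which makes accuracy a more meaningful measure than it would be with heavily imbalanced labels.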

Crowdsourcing

Analyzing On-Boarding Time in Context of Crowdsourcing
Kumar Abhinav, Alpana Dubey, Gurdeep Virdi, and Alex Kass
(Accenture Technology Labs, India; Accenture Technology Labs, USA)
Crowdsourcing is an emerging area which leverages the collective intelligence of the crowd. Although crowdsourcing provides several benefits, it also brings uncertainty to project execution. The uncertainty may arise from the time taken in on-boarding workers and from a lack of confidence in workers. The On-boarding time becomes especially important when tasks are of short duration, as it is not worth spending too much time on-boarding a worker for a short task. In this paper, we empirically analyze data from 59,597 tasks on Upwork, an online marketplace, to understand the major factors that impact On-boarding time. We identified that certain factors, such as Feedback, Hiring rate, Total hours spent, and Length of requirement, affect the On-boarding time. We applied two predictive models to predict the On-boarding time. Our study provides insights for researchers, organizations, and others who are looking to accomplish their tasks through crowdsourcing and helps them to better understand the factors which influence the On-boarding time.

Software Crowdsourcing Reliability: An Empirical Study on Developers Behavior
Turki Alelyani and Ye Yang
(Stevens Institute of Technology, USA)
Crowdsourcing has become an emergent paradigm for software production in recent decades. Its open-call format attracts the participation of hundreds of thousands of developers. To ensure the success of software crowdsourcing, we must accurately measure and monitor the reliability of participating crowd workers, which, surprisingly, has rarely been done. To that end, this paper aims to examine the dependability of crowd workers in selecting tasks for software crowdsourcing. Empirical analysis of worker behaviors will investigate the following: (1) workers’ behavior when registering and carrying out the announced tasks; (2) the relationship between rewards and performance; (3) the effects of development type among different groups; and (4) the evolution of workers’ behavior according to the skills they have adopted. This study’s findings include: (1) On average, the most reliable crowdsourcing group responds to a task call within 10% of their allotted time and completes the assigned work in less than 5% of that time. (2) Crowd workers tend to focus on tasks according to specific ranges of rewards and types of challenges, based on their skill levels. (3) Crowd skills spread evenly across the entire range of groups. In summary, our results can guide future research into crowdsourcing service design and can inform ideas for crowdsourcing strategy conception according to time, reward, development type, and other aspects of crowdsourcing.

Design and Clones

FourD: Do Developers Discuss Design? Revisited
Abbas Shakiba, Robert Green, and Robert Dyer
(Bowling Green State University, USA)
Software repositories contain a variety of information that can be mined and utilized to enhance software engineering processes. Patterns stored in software repository metadata can provide useful information about different aspects of a project, particularly those that may not be obvious to developers. One such aspect is the role of software design in a project. The messages connected to each commit in the repository note not only what changes have been made to project files, but also, potentially, whether those changes have altered the design of the software.
In this paper, a sample of commit messages from a random sample of projects on GitHub and SourceForge is manually classified as "design" or "non-design" based on a survey. The resulting data is then used to train multiple machine learning algorithms in order to determine whether it is possible to predict if a single commit discusses software design. Our results show that the Random Forest classifier performed best on our combined data set, with a G-mean of 75.01.
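The abstract does not define the G-mean it reports. Under the common definition (the geometric mean of sensitivity and specificity, which is assumed here), it can be computed from a confusion matrix as in the Python sketch below. The counts are invented and do not come from the paper.

```python
from math import sqrt


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (recall on positives) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sqrt(sensitivity * specificity)


# Hypothetical confusion-matrix counts for a "design" vs. "non-design" classifier.
score = g_mean(tp=90, fn=10, tn=40, fp=60)
print(round(100 * score, 2))  # expressed as a percentage
```

Unlike plain accuracy, this measure drops sharply when a classifier does well on one class but poorly on the other, which is why it is often preferred for imbalanced data.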

Sampling Code Clones from Program Dependence Graphs with GRAPLE
Tim A. D. Henderson and Andy Podgurski
(Case Western Reserve University, USA)
We present GRAPLE, a method to generate a representative sample of recurring (frequent) subgraphs of any directed labeled graph(s). GRAPLE is based on frequent subgraph mining, absorbing Markov chains, and Horvitz-Thompson estimation. It can be used to sample any kind of graph representation for programs. One of many software engineering applications for finding recurring subgraphs is detecting duplicated code (code clones) from representations such as program dependence graphs (PDGs) and abstract syntax trees. To assess the usefulness of clones detected from PDGs, we conducted a case study on a 73 KLOC commercial Android application developed over 5 years. Nine of the application's developers participated. To our knowledge, it is the first study to have professional developers examine code clones detected from PDGs. We describe a new PDG generation tool, jpdg, for JVM languages, which was used to generate the dependence graphs used in the study.
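As background on one of the ingredients named above, the textbook Horvitz-Thompson estimator weights each sampled item by the inverse of its probability of being selected, so that items unlikely to be sampled count for more. The Python sketch below shows only that general estimator; how GRAPLE derives selection probabilities from its absorbing Markov chains is not described in the abstract, so the values and probabilities here are hypothetical.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total from a sample with unequal selection probabilities.

    Each sampled value is divided by the probability that its item was included.
    """
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have the same length")
    if any(p <= 0 or p > 1 for p in inclusion_probs):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    return sum(v / p for v, p in zip(values, inclusion_probs))


# Hypothetical sample: a measured quantity for three sampled subgraphs and the
# probability with which each one was selected.
sampled_values = [3, 5, 2]
selection_probs = [0.5, 0.25, 0.1]
print(horvitz_thompson_total(sampled_values, selection_probs))
```

The weighting corrects for the bias that arises when some items are much easier to reach than others during sampling.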