SoftwareMining 2017 – Proceedings

Message from the Chairs
Software systems play important roles in business, scientific research, and our everyday lives. It is critical to improve both software productivity and quality, which are major challenges to software engineering researchers and practitioners. In recent years, software mining has emerged as a promising means to address these challenges. It has been successfully applied to discover knowledge from software artifacts (e.g., specifications, source code, documentation, execution logs, and bug reports) to improve software quality and the development process (e.g., to gain insight into the causes of poor software quality, to help software engineers locate and identify problems quickly, and to help managers optimize resources for better productivity). Software mining has attracted much attention in both the software engineering and data mining communities.

Specification and Requirement Analysis

Mining Temporal Intervals from Real-Time System Traces
Sean Kauffman and Sebastian Fischmeister
(University of Waterloo, Canada)
We introduce a novel algorithm for mining temporal intervals from real-time system traces with linear complexity using passive, black-box learning. Our interest is in mining nfer specifications from spacecraft telemetry to improve human and machine comprehension. Nfer is a recently proposed formalism for inferring event stream abstractions with a rule notation based on Allen Logic. The problem of mining Allen’s relations from a multivariate interval series is well studied, but little attention has been paid to generating such a series from symbolic time sequences such as system traces. We propose a method to automatically generate an interval series from real-time system traces so that they may be used as inputs to existing algorithms to mine nfer rules. Our algorithm has linear runtime and constant space complexity in the length of the trace and can mine infrequent intervals of arbitrary length from incomplete traces. The paper includes results from case studies using logs from the Curiosity rover on Mars and two other realistic datasets.
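For readers unfamiliar with the Allen relations mentioned in this abstract, the following is a minimal sketch of how the relation between two intervals can be classified. It is not the authors' mining algorithm and not nfer itself; the function name `allen_relation` and the `(start, end)` tuple representation are assumptions made for illustration, and each interval is assumed to satisfy start < end.

```python
def allen_relation(a, b):
    """Return the name of the Allen relation holding from interval a to interval b.

    Illustrative only: intervals are hypothetical (start, end) tuples
    with start < end; this is not the algorithm from the paper.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    # Only the inverses of before, meets, and overlaps remain at this point.
    inverse = {"before": "after", "meets": "met-by", "overlaps": "overlapped-by"}
    return inverse[allen_relation(b, a)]


if __name__ == "__main__":
    print(allen_relation((0, 4), (3, 5)))
    print(allen_relation((6, 8), (3, 5)))
```

An interval series of the kind the abstract describes could be compared pairwise with such a function before any rule mining takes place.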

Mining Specifications using Nested Words
Apurva Narayan, Nirmal Benann, and Sebastian Fischmeister
(University of Waterloo, Canada)
Parameter mining of traces identifies formal properties that describe a program's dynamic behaviour. These properties are useful for developers to understand programs, to identify defects, and, in general, to reason about them. The dynamic behavior of programs typically follows a distinct pattern of calls and returns. Prior work uses general logic to identify properties from a given set of templates. Consequently, either the properties are inadequate since the logic is not expressive enough, or the approach fails to scale due to the generality of the logic. This paper uses nested words and nested word automata that are especially well suited for describing the dynamic behaviour of a program. Specifically, these nested words can describe pre/post conditions and inter-procedural data-flow and have constant memory requirements.
We propose a framework for mining properties that are in the form of nested words from system traces. As a part of the framework, we propose a novel scalable algorithm optimized for mining nested words. The framework is evaluated on traces from real world applications.
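As a rough illustration of the call/return structure that nested words capture, the sketch below pairs each return in a trace with its matching call. It is not the mining algorithm proposed in the paper; the function name `nesting_relation` and the `(kind, name)` event format are hypothetical, chosen only for this example.

```python
def nesting_relation(trace):
    """Compute the call/return matching of a trace.

    Illustrative only: `trace` is a hypothetical list of (kind, name)
    events, where kind is "call", "return", or anything else (internal).
    Returns (matched pairs, pending call positions, pending return positions).
    """
    open_calls = []       # positions of calls not yet matched
    pairs = []            # (call position, return position)
    pending_returns = []  # returns with no earlier unmatched call
    for position, (kind, _name) in enumerate(trace):
        if kind == "call":
            open_calls.append(position)
        elif kind == "return":
            if open_calls:
                pairs.append((open_calls.pop(), position))
            else:
                pending_returns.append(position)
        # Internal events do not affect the nesting structure.
    return pairs, open_calls, pending_returns


if __name__ == "__main__":
    trace = [("call", "f"), ("call", "g"), ("return", "g"),
             ("internal", "x"), ("return", "f"), ("call", "h")]
    print(nesting_relation(trace))
```

Calls and returns left unmatched are reported separately because a trace may begin or end in the middle of a procedure.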

Eliciting Big Data Requirement from Big Data Itself: A Task-Directed Approach
Peng Wang, Ke Tao, Chenxu Gao, Xi Ning, Shuang Gu, and Bo Deng
(Tsinghua University, China; Beijing Institute of System Engineering, China; Institute of Electronics at Chinese Academy of Sciences, China)
The characteristics of big data challenge not only the methods for processing large volumes of data, but also the way we make use of such semantically rich resources; in particular, how users plan to manipulate intermediate or final results needs careful consideration. This is especially challenging when building analytics systems, as big data is richer in semantics and its heterogeneous data modalities impose burdens on semantic fusion and visualization. While most current research focuses on how to mine semantic information from big data, we emphasize the role of users in requirement acquisition when building practical system functions for data analytics or management. This paper proposes a task-directed approach to requirement analysis for big data analytics, which enhances requirement elicitation with modern data mining approaches. Based on the entities and semantic relationships learned from big data, a user-in-the-loop, semi-automated requirement elicitation is carried out to generate a requirement repository that is affordable to maintain and modify further. State-of-the-art user modeling methods can be incorporated to refine the requirements iteratively according to users' task interactions with system functions. Taking lifelogging analytics as a case study, we further discuss the advantages and limitations of our perspective.

Task Routing

A Hybrid Approach to Code Reviewer Recommendation with Collaborative Filtering
Zhenglin Xia, Hailong Sun, Jing Jiang, Xu Wang, and Xudong Liu
(Beihang University, China)
Code review is known to be of paramount importance for software quality assurance. However, finding a reviewer for a given piece of code can be very challenging in a Modern Code Review environment because it is difficult to learn the expertise and availability of candidate reviewers. To tackle this problem, existing efforts mainly model a reviewer's expertise from the review history and make recommendations based on how well that expertise meets the requirements of a review task. Nonetheless, as both explicit and implicit relations in the data affect whether a reviewer is suitable for a given task, modeling review expertise with explicit relations alone often fails to achieve the expected recommendation accuracy. To that end, we propose a recommendation algorithm that takes implicit relations into account. Furthermore, we use a hybrid approach that combines latent factor models and neighborhood methods to capture implicit relations. Finally, we have conducted extensive experiments comparing our approach with state-of-the-art methods on data from 5 popular GitHub projects. The results demonstrate that our approach outperforms the compared methods for all top-k recommendations and achieves a 15.3% improvement in precision for top-1 recommendation.

Competition-Aware Task Routing for Contest Based Crowdsourced Software Development
Yang Fu, Hailong Sun, and Luting Ye
(Beihang University, China)
In crowdsourced software development, routing a task to the right developers is a critical issue that largely affects the productivity and quality of software. In particular, crowdsourced software development platforms (e.g., Topcoder and Kaggle) usually adopt the competition-based crowdsourcing model. Given an incoming task, most existing efforts focus on using historical data to learn the probability that a developer may take the task and recommending developers accordingly. However, existing work ignores the locality characteristics of the developer-task dataset and the competition among developers. In this work, we propose a novel recommendation approach for task routing in competitive crowdsourced software development. First, we cluster tasks on the basis of content similarity. Second, for a given task, we find the most similar task cluster and use machine-learning-based classification to recommend a list of candidate developers. Third, we consider the competitive relationship among developers and re-rank the candidates by incorporating the competition network among them. Experiments conducted on 3 datasets (7,481 tasks in total) crawled from Topcoder show that our approach delivers promising recommendation accuracy and outperforms the two compared methods by 5.5% and 25.4% on average, respectively.

Code Inspection and Refactoring

Evaluating Micro Patterns and Software Metrics in Vulnerability Prediction
Kazi Zakia Sultana and Byron J. Williams
(Mississippi State University, USA)
Software security is an important aspect of ensuring software quality. Early detection of vulnerable code during development is essential for developers to make software testing cost- and time-effective. Traditional software metrics are used for early detection of software vulnerabilities, but they are not directly related to code constructs and do not specify any particular granularity level. The goal of this study is to help developers evaluate software security using class-level traceable patterns called micro patterns to reduce security risks. The concept of micro patterns is similar to design patterns, but they can be automatically recognized and mined from source code. If micro patterns can better predict vulnerable classes compared to traditional software metrics, they can be used in developing a vulnerability prediction model. This study explores the performance of class-level patterns in vulnerability prediction and compares them with traditional class-level software metrics. We studied security vulnerabilities as reported for one major release of Apache Tomcat, Apache Camel and three stand-alone Java web applications. We used machine learning techniques for predicting vulnerabilities using micro patterns and class-level metrics as features. We found that micro patterns have higher recall in detecting vulnerable classes than the software metrics.

A Code Inspection Tool by Mining Recurring Changes in Evolving Software
Alex Fish, Thuy Linh Nguyen, and Myoungkyu Song
(University of Nebraska at Omaha, USA)
Mining software repositories has frequently been investigated in recent research. Software modifications in repositories are often recurring changes: similar but different changes across multiple locations. It is not easy for developers to find all the relevant locations to maintain such changes, including bug fixes, new feature additions, and refactorings. Performing recurring changes is tedious and error-prone, resulting in inconsistent and missing updates. To address this problem, we present CloneMap, a clone-aware code inspection tool that helps developers ensure correctness of recurring changes to multiple locations in evolving software. CloneMap allows developers to specify the old and new versions of a program. It then applies a clone detection technique to (1) mine repositories for extracting differences of recurring changes, (2) visualize the clone evolution, and (3) help developers focus their attention on potential anomalies, such as inconsistent and missing updates.

Data-Driven Usability Refactoring: Tools and Challenges
Alejandra Garrido, Sergio Firmenich, Julián Grigera, and Gustavo Rossi
(Universidad Nacional de La Plata, Argentina; CONICET, Argentina; CIC, Argentina)
Usability has long been recognized as an important software quality attribute and it has become essential in web application development and maintenance. However, it is still hard to integrate usability evaluation and improvement practices in the software development process. Moreover, these practices are usually unaffordable for small to medium-sized companies. In this position paper we propose an approach and tools to allow the crowd of web users to participate in the process of usability evaluation and repair. Since we use the refactoring technique for usability improvement, we introduce the notion of “data-driven refactoring”: using data from the mass of users to learn about refactoring opportunities and about refactoring effectiveness. This creates an improvement cycle where some refactorings may be discarded while others are introduced, depending on their evaluated success. The paper also discusses some of the challenges that we foresee ahead.

2017 6th IEEE/ACM International Workshop on Software Mining (SoftwareMining 2017)
