SoftwareMining 2016 – Proceedings

Message from the Chairs
Software systems have been playing important roles in business, scientific research, and our everyday lives. It is critical to improve both software productivity and quality, which are major challenges to software engineering researchers and practitioners. In recent years, software mining has emerged as a promising means to address these challenges. It has been successfully applied to discover knowledge from software artifacts (e.g., specifications, source code, documentation, execution logs, and bug reports) to improve software quality and the development process (e.g., to obtain insights into the causes of poor software quality, to help software engineers locate and identify problems quickly, and to help managers optimize resources for better productivity). Software mining has attracted much attention in both the software engineering and data mining communities.

Invited Papers

Mining Apps for Anomalies
Andreas Zeller
(Saarland University, Germany)
When interacting with mobile apps, do users always get what they expect? We have mined thousands of Android apps for common features such as descriptions, APIs used, data flows, and (recently) user interfaces and callbacks. Associating these with each other allows us to detect outliers: apps whose description does not fit their behavior; apps whose sensitive data flow is unusual; and user interface elements whose text or icon suggests one action, but which are actually tied to other actions. Such anomalies reveal not only bugs but also actual security issues, and there is a huge treasure trove of data to be mined, abstracted, and analyzed.

Code Migration with Statistical Machine Translation
Tien N. Nguyen
(University of Texas at Dallas, USA)
In modern software development, developers often need to migrate code written for one platform in a programming language to another language for a different platform. The migration process is often performed manually or semi-automatically, in which developers are required to manually define translation rules and API mappings between languages. This talk outlines our research plan and results in investigating Statistical Machine Translation (SMT) in supporting code migration. We will explain the challenges and our solutions to address them, as well as our vision along this direction.

Code Mining

Mining Timed Regular Expressions from System Traces
Greta Cutulenco, Yogi Joshi, Apurva Narayan, and Sebastian Fischmeister
(University of Waterloo, Canada)
Dynamic behavior of a program can be assessed through examination of events emitted by the program during execution. Temporal properties define the order of occurrence and timing constraints on event occurrence. Such specifications are important for safety-critical real-time systems for which a delayed response to an emitted event may lead to a fault in the system. Since temporal properties are rarely specified for programs and due to the complexity of the formalisms, it is desirable to suggest properties by extracting them from traces of program execution for testing, verification, anomaly detection, and debugging purposes.
We propose a framework for automatically mining properties that are in the form of timed regular expressions (TREs) from system traces. Using an abstract structure of the property, the framework constructs a finite state machine to serve as an acceptor. As part of the framework, we propose two novel algorithms optimized for mining general TREs and a fragment without negation. The framework is evaluated on industrial-strength safety-critical real-time applications (a deployed autonomous hexacopter system and a commercial vehicle in operation) using traces with more than 1 million entries. Our framework is open source and available online: https://bitbucket.org/sfischme/tre-mining
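As a rough illustration of what a timed property over a trace looks like, the sketch below checks one simple timed response pattern ("every trigger event is followed by a response event within a maximum delay") and counts how often it holds. It is not the TRE mining algorithm from the paper; the function name `response_within`, the `(timestamp, event)` trace format, and the event names are assumptions made for this example.

```python
def response_within(trace, trigger, response, max_delay):
    """Count how many `trigger` events are followed by `response` within `max_delay`.

    Assumption: `trace` is a list of (timestamp, event) pairs sorted by timestamp.
    Returns (satisfied, total), where total is the number of trigger events.
    """
    satisfied = 0
    total = 0
    for i, (t_start, event) in enumerate(trace):
        if event != trigger:
            continue
        total += 1
        # Scan forward until the time window closes or a response is seen.
        for t_next, later in trace[i + 1:]:
            if t_next - t_start > max_delay:
                break
            if later == response:
                satisfied += 1
                break
    return satisfied, total


# Hypothetical trace: three requests, one acknowledged too late for a 5-unit window.
trace = [(0, "req"), (3, "ack"), (10, "req"), (18, "ack"), (20, "req"), (22, "ack")]
satisfied, total = response_within(trace, "req", "ack", max_delay=5)
print(f"{satisfied} of {total} requests acknowledged in time")
```

A miner could run a check like this over many candidate event pairs and keep those whose satisfied-to-total ratio exceeds a chosen threshold; how candidates are enumerated and ranked is left open here.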

By the Power of SMT! Mining Function Contracts to Better Bounded Model Checking
Azat Abdullin and Marat Akhin
(St. Petersburg Polytechnic University, Russia; JetBrains Research, Russia)
Program analysis is rapidly changing the way we develop software; one of the more important problems is that of function contract creation, as these contracts can greatly increase the quality and performance of the analysis. However, the predominant way of creating function contracts is manual development by the end user. In this paper we present an approach that automatically collects function contracts for bounded model checking by software mining augmented with deep SMT solver integration. The prototype implementation in the Borealis bounded model checker has been evaluated on a number of programs and has proved its ability to find interesting contracts.

Automatic Prediction of Bug Fixing Effort Measured by Code Churn Size
Ferdian Thung
(Singapore Management University, Singapore)
During software maintenance, developers often receive many bug reports. Project managers often need to manage limited resources to resolve the many bugs that a project receives. To help project managers perform their job, past studies have proposed techniques that predict the amount of time that passes between a bug report being submitted and it being resolved. However, this time period might not be representative of the actual development effort, as developers might not work on the bug right away or all the time. In the open source development setting, developers are only volunteers and might not devote their full working hours to fixing a bug in a particular open source project. In the industrial setting, developers might be asked to perform various tasks aside from fixing a particular bug.
In this work, we estimate bug fixing effort in terms of code churn size. Code churn size is the number of lines of code that are added, deleted, or modified to fix the bug. Lines of code have traditionally been used to estimate effort. However, no past studies have proposed techniques to automatically predict code churn size. Using code churn size as an estimate of bug fixing effort, we propose a classification-based approach that predicts, given a bug report, whether the bug fixing effort will be high or low. We have evaluated our approach on 1,029 bug reports from hadoop-common and struts2. The result is promising: we achieve an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.612 when predicting bug fixing effort in terms of lines of code churned, which is a 22.4% improvement over a baseline.
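The definition of code churn size above can be made concrete with a small sketch. The abstract does not say how added, deleted, and modified lines are told apart, so the version below is one possible reading built on Python's standard `difflib`; the function name `code_churn` and the rule that pairs replaced lines as "modified" are assumptions, not the paper's procedure.

```python
from difflib import SequenceMatcher


def code_churn(before, after):
    """Count added, deleted, and modified lines between two versions of a file.

    `before` and `after` are lists of source lines.
    """
    added = deleted = modified = 0
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_len, new_len = i2 - i1, j2 - j1
        if tag == "insert":
            added += new_len
        elif tag == "delete":
            deleted += old_len
        elif tag == "replace":
            # Assumption: replaced lines are paired up as "modified";
            # any surplus on either side counts as added or deleted.
            paired = min(old_len, new_len)
            modified += paired
            added += new_len - paired
            deleted += old_len - paired
    return {
        "added": added,
        "deleted": deleted,
        "modified": modified,
        "churn": added + deleted + modified,
    }


# Hypothetical fix: one line changed and one line appended.
before = ["a", "b", "c", "d"]
after = ["a", "B", "c", "d", "e"]
print(code_churn(before, after))
```

A classifier of the kind described would then need a threshold on `churn` to label a fix as high or low effort; the abstract does not state which threshold was used.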

Text Mining

Duplicate Issue Detection for the Android Open Source Project
Kasthuri Jayarajah, Meera Radhakrishnan, and Camellia Zakaria
(Singapore Management University, Singapore)
The Android Open Source Project (AOSP) has seen tremendous traction over the past decade, and as such, its bug repository is growing in scale. With this growth, the effort required for project members to triage incoming reports and determine whether each one duplicates an issue that has already been addressed or is receiving attention is also on the rise. In this work, we create a dataset of issues from the Android issue tracker and use standard information retrieval (IR) techniques such as the vector space model (VSM) and latent Dirichlet allocation (LDA) to understand their capability in such similar-issue retrieval. Further, we combine VSM and LDA to evaluate the usefulness of the combination. We find that, overall, VSM performs better with this dataset.
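To illustrate VSM-style retrieval of similar issues, here is a minimal sketch using TF-IDF weights and cosine similarity with only the Python standard library. The tokenization, the smoothed IDF formula, the function names, and the sample reports are all assumptions for this example; the abstract does not specify the weighting or preprocessing the authors used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of text documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def rank_duplicates(new_report, existing):
    """Return (score, index) pairs for existing reports, most similar first."""
    vectors = tfidf_vectors(existing + [new_report])
    query = vectors[-1]
    scores = [(cosine(query, vec), i) for i, vec in enumerate(vectors[:-1])]
    return sorted(scores, reverse=True)


# Hypothetical issue summaries, not taken from the Android issue tracker.
existing = [
    "camera app crashes when taking photo",
    "wifi disconnects after sleep",
    "bluetooth pairing fails with headset",
]
ranking = rank_duplicates("camera crashes on photo capture", existing)
print(ranking[0])
```

The top-ranked entry is the candidate a triager would inspect first as a possible duplicate.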

Mining Testing Questions on Stack Overflow
Pavneet Singh Kochhar
(Singapore Management University, Singapore)
During software maintenance, testing is a crucial activity to ensure the quality of code as it evolves over time. With the increasing size and complexity of software, adequate software testing has become increasingly important. Developers often ask about problems they face during testing on Community Question Answering (CQA) websites such as Stack Overflow. These websites can serve as good repositories for understanding the common topics of discussion and the challenges faced by developers during testing.
In this paper, we present a study of common challenges and important topics of discussion, by mining testing-related questions asked on Stack Overflow. We use unsupervised learning to categorize the questions and rank all the Stack Overflow questions based on their importance. Our results show that topics such as test framework, database, and client-server are discussed more often than other topics. Also, there has been an uptrend in mobile development questions in testing-related discussions.

On the Feasibility of Detecting Cross-Platform Code Clones via Identifier Similarity
Xiao Cheng, Lingxiao Jiang, Hao Zhong, Haibo Yu, and Jianjun Zhao
(Shanghai Jiao Tong University, China; Singapore Management University, Singapore; Kyushu University, Japan)
More and more mobile applications run on multiple mobile operating systems to attract users of different platforms. Although versions on different platforms are implemented in different programming languages (e.g., Java and Objective-C), there must be many code snippets that implement similar business logic on different platforms. Such code snippets are called cross-platform clones. Detecting such clones is challenging but essential for software maintenance. Because developers usually use some common identifiers when implementing the same business logic on different platforms, in this paper we investigate the identifier similarity of the same mobile application on different platforms and provide insights into the feasibility of cross-platform clone detection via identifier similarity. In our experiment, we analyzed the source code of 18 open-source cross-platform applications implemented on Android, iOS, and Windows Phone, and found that the smaller the KL divergence of an application, the more accurate the clones detected by identifiers will be.
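The KL divergence mentioned above can be sketched for identifier frequency distributions as follows. This is only an illustration: the regular-expression tokenizer, the lowercasing, the add-one smoothing, the function names, and the code snippets are assumptions, and the abstract does not describe how the authors extracted identifiers or computed the measure.

```python
import math
import re
from collections import Counter

# Assumption: an identifier is any run of letters, digits, and underscores
# that does not start with a digit (language keywords are not filtered out).
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifier_counts(source):
    """Count lowercased identifier occurrences in a source string."""
    return Counter(tok.lower() for tok in IDENT.findall(source))


def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) over the joint vocabulary, with additive smoothing `alpha`."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for term in vocab:
        p = (p_counts[term] + alpha) / p_total
        q = (q_counts[term] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


# Hypothetical snippets: the same logic in Java and Objective-C, plus unrelated code.
java = "int totalPrice = cart.getTotal(); updateLabel(totalPrice);"
objc = "int totalPrice = [cart getTotal]; [self updateLabel:totalPrice];"
unrelated = "float angle = sensor.readGyro(); rotateView(angle);"

same_logic = kl_divergence(identifier_counts(java), identifier_counts(objc))
different_logic = kl_divergence(identifier_counts(java), identifier_counts(unrelated))
print(same_logic < different_logic)
```

Smoothing keeps the divergence finite when an identifier appears on only one platform; without it, any such identifier would make the value undefined.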

5th International Workshop on Software Mining (SoftwareMining 2016)
