ASE 2016
31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016)
Powered by
Conference Publishing Consulting

31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016), September 3–7, 2016, Singapore, Singapore

ASE 2016 – Proceedings

Contents - Abstracts - Authors


Title Page

Message from the Chairs
It is our great pleasure and honor to welcome everyone to the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016). The ASE conference series is the premier research forum for automated software engineering. Each year it brings together researchers and practitioners from academia and industry to discuss foundations, techniques, and tools for automated analysis, design, implementation, testing, and maintenance of software systems. This publication contains the proceedings of the conference, which was held between September 3­7, 2016 in Singapore. The first two days of the conference were allocated to workshops, tutorials, and a doctoral symposium, while the last three days were for the presentations of keynotes, technical papers, and tool demo papers. Entering its fourth decade after its inception, the ASE conference is getting even more relevant today. More and more individuals, businesses, and governments are relying on software systems, ranging from small apps that run on handheld devices to large mission critical systems that run on the cloud. ASE tools and techniques help in the construction and evolution of such systems to improve system reliability and developer productivity.

ASE 2016 Organization
Committee listings



Program Generation for Performance
Markus Püschel
(ETH Zurich, Switzerland)
It has become extraordinarily difficult to write software that performs close to optimally on complex modern microarchitectures. Particularly plagued are domains that require complex mathematical computations such as multimedia processing, communication, control, graphics, and machine learning. In these domains, performance-critical components are usually written in C (with possible extensions) and often even in assembly, carefully "tuned" to the platform's architecture and microarchitecture. The result is usually long, rather unreadable code that needs to be re-written or re-tuned with every platform upgrade. On the other hand, the performance penalty for relying on straightforward, non-tuned, "more elegant" implementations can be often a factor of 10, 100, or even more. The overall problem is one of productivity, maintainability, and quality (namely performance), i.e., software engineering. However, even though a large set of sophisticated software engineering theory and tools exist, it appears that to date this community has not focused much on mathematical computations nor performance in the detailed, close-to-optimal sense above. The reason for the latter may be that performance, unlike various aspects of correctness, is not syntactic in nature (and in reality is often even unpredictable and, well, messy). The aim of this talk is to draw attention to the performance/productivity problem for mathematical applications and to make the case for a more interdisciplinary attack. As a set of thoughts in this direction we offer some of the lessons we have learned in the last decade in our own research on Spiral (, a program generation framework for numerical kernels. Key techniques used in Spiral include staged declarative domain-specific languages to express algorithm knowledge and algorithm transformations, the use of platform-cognizant rewriting systems for parallelism and locality optimizations, and the use of search and machine learning techniques to navigate possible spaces of choices. Experimental results show that the codegenerated by Spiral competes with, and sometimes outperforms, the best available human-written code. Spiral has been used to generate part of Intel's commercial libraries IPP and MKL.

Publisher's Version Article Search
Changing Microsoft's Build: Revolution or Evolution
Wolfram Schulte
(Microsoft, USA)
Tens of thousands of Microsoft engineers build and test hundreds of software products several times a day. It is essential that this continuous integration scales, guarantees short feedback cycles, and functions reliably with minimal human intervention. During the past three years TSE's charter has been to shorten this cycle time. We went after this goal in two ways: Evolution via CloudBuild and Revolution via Concord. CloudBuild is a build service infrastructure, now being used by all major product groups in Microsoft, like Azure, Bing, Office, SQL except for Windows. CloudBuild addresses all aspects of a continuous integration workflow, like builds, test and code analysis, but also drops, package and symbol creation and storage. CloudBuild supports multiple build languages as long as they fulfill a coarse grained IO based contract. CloudBuild uses content based caching to run build-related tasks only when needed. Lastly, it builds on many machines in parallel. The speed ups of build and testing range from 1.2x to 10x. CloudBuild aims to rapidly onboard teams and hence has to support non-deterministic build tools and specification languages that under-declare dependencies. CloudBuild, being a reliable build service in the presence of unreliable components, currently achieves service availability better than 99%. Windows went a different path. Their past build exhaust was so massive that building Windows in the cloud and bringing the build results back for testing on corp.-net. was considered infeasible. So they decided to move to a new build language, codename Concord. By construction, Concord guarantees reliable builds, no over-build, and allows for efficient distribution. Adopting Concord has led to immense performance improvements, we have seen up to 100X speedup for Windows builds. But the path has been long and rocky, since it not only requires a substantial rewrite of existing build logic, but also all related developer and build lab processes have to change. Whether evolution or revolution is the right path forward -- the verdict is still out.

Publisher's Version Article Search
The Power of Probabilistic Thinking
David S. Rosenblum
(National University of Singapore, Singapore)
Traditionally, software engineering has dealt in absolutes. For instance, we talk about a system being "correct" or "incorrect", with the shades of grey in between occasionally acknowledged but rarely dealt with explicitly. And we typically employ logical, algebraic, relational and other representations and techniques that help us reason about software in such absolute terms. There of course have been notable exceptions to this, such as the use of statistical techniques in testing and debugging. But by and large, both researchers and practitioners have favored the relative comfort of an absolutist viewpoint in all aspects of development. In this talk, I will argue the benefits of taking a more thoroughly probabilistic approach in software engineering. Software engineering is rife with stochastic phenomena, and the vast majority of software systems operate in an environment of uncertain, random behavior, which suits an explicit probabilistic characterization. Furthermore, this uncertainty is becoming ever more pronounced in new software systems and platforms, such as the Internet of Things and autonomous vehicles, with their frequent imprecise outputs and heavy reliance on machine learning. To illustrate more deeply some of the considerations involved in taking a probabilistic approach, I will talk about some recent research I have been doing in probabilistic verification.

Publisher's Version Article Search

Main Research Papers

Test Evaluation

An Empirical Investigation into the Nature of Test Smells
Michele Tufano, Fabio Palomba, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, Andrea De Lucia, and Denys Poshyvanyk
(College of William and Mary, USA; University of Salerno, Italy; University of Lugano, Switzerland; University of Sannio, Italy; University of Molise, Italy)
Test smells have been defined as poorly designed tests and, as reported by recent empirical studies, their presence may negatively affect comprehension and maintenance of test suites. Despite this, there are no available automated tools to support identification and repair of test smells. In this paper, we firstly investigate developers' perception of test smells in a study with 19 participants. The results show that developers generally do not recognize (potentially harmful) test smells, highlighting that automated tools for identifying such smells are much needed. However, to build effective tools, deeper insights into the test smells phenomenon are required. To this aim, we conducted a large-scale empirical investigation aimed at analyzing (i) when test smells occur in source code, (ii) what their survivability is, and (iii) whether their presence is associated with the presence of design problems in production code (code smells). The results indicate that test smells are usually introduced when the corresponding test code is committed in the repository for the first time, and they tend to remain in a system for a long time. Moreover, we found various unexpected relationships between test and code smells. Finally, we show how the results of this study can be used to build effective automated tools for test smell detection and refactoring.

Publisher's Version Article Search Info
Evaluating Non-adequate Test-Case Reduction
Mohammad Amin Alipour, August Shi, Rahul Gopinath, Darko Marinov, and Alex Groce
(Oregon State University, USA; University of Illinois at Urbana-Champaign, USA)
Given two test cases, one larger and one smaller, the smaller test case is preferred for many purposes. A smaller test case usually runs faster, is easier to understand, and is more convenient for debugging. However, smaller test cases also tend to cover less code and detect fewer faults than larger test cases. Whereas traditional research focused on reducing test suites while preserving code coverage, recent work has introduced the idea of reducing individual test cases, rather than test suites, while still preserving code coverage. Other recent work has proposed non-adequately reducing test suites by not even preserving all the code coverage. This paper empirically evaluates a new combination of these two ideas, non-adequate reduction of test cases, which allows for a wide range of trade-offs between test case size and fault detection. Our study introduces and evaluates C%-coverage reduction (where a test case is reduced to retain at least C% of its original coverage) and N -mutant reduction (where a test case is reduced to kill at least N of the mutants it originally killed). We evaluate the reduction trade-offs with varying values of C% and N for four real-world C projects: Mozilla’s SpiderMonkey JavaScript engine, the YAFFS2 flash file system, Grep, and Gzip. The results show that it is possible to greatly reduce the size of many test cases while still preserving much of their fault-detection capability.

Publisher's Version Article Search
Optimizing Customized Program Coverage
Peter Ohmann, David Bingham Brown, Naveen Neelakandan, Jeff Linderoth, and Ben Liblit
(University of Wisconsin-Madison, USA)
Program coverage is used across many stages of software development. While common during testing, program coverage has also found use outside the test lab, in production software. However, production software has stricter requirements on run-time overheads, and may limit possible program instrumentation. Thus, optimizing the placement of probes to gather program coverage is important.
We introduce and study the problem of customized program coverage optimization. We generalize previous work that optimizes for complete coverage instrumentation with a system that adapts optimization to customizable program coverage requirements. Specifically, our system allows a user to specify desired coverage locations and to limit legal instrumentation locations. We prove that the problem of determining optimal coverage probes is NP-hard, and we present a solution based on mixed integer linear programming. Due to the computational complexity of the problem, we also provide two practical approximation approaches. We evaluate the effectiveness of our approximations across a diverse set of benchmarks, and show that our techniques can substantially reduce instrumentation while allowing the user immense freedom in defining coverage requirements. When naive instrumentation is dense or expensive, our optimizations succeed in lowering execution time overheads.

Publisher's Version Article Search Info
What Makes Killing a Mutant Hard
Willem Visser
(Stellenbosch University, South Africa)
Mutation operators have been studied at length to determine which ones are the ``best" at some metric (for example creates the least equivalent mutants, creates hard-to-kill mutants, etc.). These studies though have focused on specific test suites, where the test inputs and oracles are fixed, which leads to results that are strongly influenced by the test suites and thus makes the conclusions potentially less general. In this paper we consider all test inputs and we assume we have no prior knowledge about the likelihood of any specific inputs. We will also show how varying the strength of the oracle have a big impact on the results. We only consider a few mutation operators (mostly relational), only a handful of programs to mutate (amenable to probabilistic symbolic execution), and only consider how likely it is that a mutant is killed. A core finding is that the likelihood of reaching the source line where the mutation is applied, is an important contributor to the likelihood of killing the mutant and when we control for this we can see which operators create mutations that are too easy versus very hard to kill.

Publisher's Version Article Search
Test Case Permutation to Improve Execution Time
Panagiotis Stratis and Ajitha Rajan
(University of Edinburgh, UK)
With the growing complexity of software, the number of test cases needed for effective validation is extremely large. Executing these large test suites is expensive, both in terms of time and energy. Cache misses are known to be one of the main factors contributing to execution time of a software. Cache misses are reduced by increasing the locality of memory references. For a single program run, compiler optimisations help improve data locality and code layout optimisations help improve spatial locality of instructions. Nevertheless, cache locality optimisations have not been proposed and explored across several program runs, which is the case when we run several test cases.
In this paper, we propose and evaluate a novel approach to improve instruction locality across test case runs. Our approach measures the distance between test case runs (number of different instructions). We then permute the test cases for execution so that the distance between neighbouring test cases is minimised. We hypothesise that test cases executed in this new order for improved instruction locality will reduce time consumed.
We conduct a preliminary evaluation with four subject programs and test suites from the SIR repository to answer the following questions, 1.~Is execution time of a test suite affected by the order in which test cases are executed? and 2.~How does time consumed in executing our permutation compare to random test case permutations? We found that the order in which test cases are executed has a definite impact on execution time. The extent of impact varies, based on program characteristics and test cases. Our approach outperformed more than 97% of random test case permutations on 3 of the 4 subject programs and did better than 93% of the random orderings on the remaining subject program. Using the optimised permutation, we saw a maximum reduction of 7.4% over average random permutation execution time and 34.7% over the worst permutation.

Publisher's Version Article Search


Predicting Semantically Linkable Knowledge in Developer Online Forums via Convolutional Neural Network
Bowen Xu, Deheng Ye, Zhenchang Xing, Xin Xia, Guibin Chen, and Shanping Li
(Zhejiang University, China; Nanyang Technological University, Singapore)
Consider a question and its answers in Stack Overflow as a knowledge unit. Knowledge units often contain semantically relevant knowledge, and thus linkable for different purposes, such as duplicate questions, directly linkable for problem solving, indirectly linkable for related information. Recognizing different classes of linkable knowledge would support more targeted information needs when users search or explore the knowledge base. Existing methods focus on binary relatedness (i.e., related or not), and are not robust to recognize different classes of semantic relatedness when linkable knowledge units share few words in common (i.e., have lexical gap). In this paper, we formulate the problem of predicting semantically linkable knowledge units as a multiclass classification problem, and solve the problem using deep learning techniques. To overcome the lexical gap issue, we adopt neural language model (word embeddings) and convolutional neural network (CNN) to capture word- and document-level semantics of knowledge units. Instead of using human-engineered classifier features which are hard to design for informal user-generated content, we exploit large amounts of different types of user-created knowledge-unit links to train the CNN to learn the most informative word- and sentence-level features for the multiclass classification task. Our evaluation shows that our deep-learning based approach significantly and consistently outperforms traditional methods using traditional word representations and human-engineered classifier features.

Publisher's Version Article Search
Testing Advanced Driver Assistance Systems using Multi-objective Search and Neural Networks
Raja Ben Abdessalem, Shiva Nejati, Lionel C. Briand, and Thomas Stifter
(University of Luxembourg, Luxembourg; IEE, Luxembourg)
Recent years have seen a proliferation of complex Advanced Driver Assistance Systems (ADAS), in particular, for use in autonomous cars. These systems consist of sensors and cameras as well as image processing and decision support software components. They are meant to help drivers by providing proper warnings or by preventing dangerous situations. In this paper, we focus on the problem of design time testing of ADAS in a simulated environment. We provide a testing approach for ADAS by combining multi-objective search with surrogate models developed based on neural networks. We use multi-objective search to guide testing towards the most critical behaviors of ADAS. Surrogate modeling enables our testing approach to explore a larger part of the input search space within limited computational resources. We characterize the condition under which the multi-objective search algorithm behaves the same with and without surrogate modeling, thus showing the accuracy of our approach. We evaluate our approach by applying it to an industrial ADAS system. Our experiment shows that our approach automatically identifies test cases indicating critical ADAS behaviors. Further, we show that combining our search algorithm with surrogate modeling improves the quality of the generated test cases, especially under tight and realistic computational resources.

Publisher's Version Article Search
Privacy Preserving via Interval Covering Based Subclass Division and Manifold Learning Based Bi-directional Obfuscation for Effort Estimation
Fumin Qi, Xiao-Yuan Jing, Xiaoke Zhu, Fei Wu, and Li Cheng
(Wuhan University, China; Nanjing University of Posts and Telecommunications, China; Henan University, China)
When a company lacks local data in hand, engineers can build an effort model for the effort estimation of a new project by utilizing the training data shared by other companies. However, one of the most important obstacles for data sharing is the privacy concerns of software development organizations. In software engineering, most of existing privacy-preserving works mainly focus on the defect prediction, or debugging and testing, yet the privacy-preserving data sharing problem has not been well studied in effort estimation. In this paper, we aim to provide data owners with an effective approach of privatizing their data before release. We firstly design an Interval Covering based Subclass Division (ICSD) strategy. ICSD can divide the target data into several subclasses by digging a new attribute (i.e., class label) from the effort data. And the obtained class label is beneficial to maintaining the distribution of the target data after obfuscation. Then, we propose a manifold learning based bi-directional data obfuscation (MLBDO) algorithm, which uses two nearest neighbors, which are selected respectively from the previous and next subclasses by utilizing the manifold learning based nearest neighbor selector, as the disturbances to obfuscate the target sample. We call the entire approach as ICSD&MLBDO. Experimental results on seven public effort datasets show that: 1) ICSD&MLBDO can guarantee the privacy and maintain the utility of obfuscated data. 2) ICSD&MLBDO can achieve better privacy and utility than the compared privacy-preserving methods.

Publisher's Version Article Search
Deep Learning Code Fragments for Code Clone Detection
Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk
(College of William and Mary, USA)
Code clone detection is an important problem for software maintenance and evolution. Many approaches consider either structure or identifiers, but none of the existing detection techniques model both sources of information. These techniques also depend on generic, handcrafted features to represent code fragments. We introduce learning-based detection techniques where everything for representing terms and fragments in source code is mined from the repository. Our code analysis supports a framework, which relies on deep learning, for automatically linking patterns mined at the lexical level with patterns mined at the syntactic level. We evaluated our novel learning-based approach for code clone detection with respect to feasibility from the point of view of software maintainers. We sampled and manually evaluated 398 file- and 480 method-level pairs across eight real-world Java systems; 93% of the file- and method-level samples were evaluated to be true positives. Among the true positives, we found pairs mapping to all four clone types. We compared our approach to a traditional structure-oriented technique and found that our learning-based approach detected clones that were either undetected or suboptimally reported by the prominent tool Deckard. Our results affirm that our learning-based approach is suitable for clone detection and a tenable technique for researchers.

Publisher's Version Article Search Info

Recommendation and Automation

Automatically Recommending Code Reviewers Based on Their Expertise: An Empirical Comparison
Christoph Hannebauer, Michael Patalas, Sebastian Stünkel, and Volker Gruhn
(University of Duisburg-Essen, Germany)
Code reviews are an essential part of quality assurance in Free, Libre, and Open Source (FLOSS) projects. However, finding a suitable reviewer can be difficult, and delayed or forgotten reviews are the consequence. Automating reviewer selection with suitable algorithms can mitigate this problem. We compare empirically six algorithms based on modification expertise and two algorithms based on review expertise on four major FLOSS projects. Our results indicate that the algorithms based on review expertise yield better recommendations than those based on modification expertise. The algorithm Weighted Review Count (WRC) recommends at least one out of five reviewers correctly in 69 % to 75 % of all cases, which is one of the best results achieved in the comparison.

Publisher's Version Article Search
Evaluating the Evaluations of Code Recommender Systems: A Reality Check
Sebastian Proksch, Sven Amann, Sarah Nadi, and Mira Mezini
(TU Darmstadt, Germany)
While researchers develop many new exciting code recommender systems, such as method-call completion, code-snippet completion, or code search, an accurate evaluation of such systems is always a challenge. We analyzed the current literature and found that most of the current evaluations rely on artificial queries extracted from released code, which begs the question: Do such evaluations reflect real-life usages? To answer this question, we capture 6,189 fine-grained development histories from real IDE interactions. We use them as a ground truth and extract 7,157 real queries for a specific method-call recommender system. We compare the results of such real queries with different artificial evaluation strategies and check several assumptions that are repeatedly used in research, but never empirically evaluated. We find that an evolving context that is often observed in practice has a major effect on the prediction quality of recommender systems, but is not commonly reflected in artificial evaluations.

Publisher's Version Article Search Info
Too Much Automation? The Bellwether Effect and Its Implications for Transfer Learning
Rahul Krishna, Tim Menzies, and Wei Fu
(North Carolina State University, USA)
Transfer learning: is the process of translating quality predictors learned in one data set to another. Transfer learning has been the subject of much recent research. In practice, that research means changing models all the time as transfer learners continually exchange new models to the current project. This paper offers a very simple bellwether transfer learner. Given N data sets, we find which one produce the best predictions on all the others. This bellwether data set is then used for all subsequent predictions (or, until such time as its predictions start failing-- at which point it is wise to seek another bellwether). Bellwethers are interesting since they are very simple to find (just wrap a for-loop around standard data miners). Also, they simplify the task of making general policies in SE since as long as one bellwether remains useful, stable conclusions for N data sets can be achieved just by reasoning over that bellwether. From this, we conclude (1) this bellwether method is a useful (and very simple) transfer learning method; (2) bellwethers are a baseline method against which future transfer learners should be compared; (3) sometimes, when building increasingly complex automatic methods, researchers should pause and compare their supposedly more sophisticated method against simpler alternatives.

Publisher's Version Article Search
Automatic Microbenchmark Generation to Prevent Dead Code Elimination and Constant Folding
Marcelino Rodriguez-Cancio, Benoit Combemale, and Benoit Baudry
(University of Rennes 1, France; INRIA, France)
Microbenchmarking evaluates, in isolation, the execution time of small code segments that play a critical role in large applications. The accuracy of a microbenchmark depends on two critical tasks: wrap the code segment into a payload that faithfully recreates the execution conditions of the large application; build a scaffold that runs the payload a large number of times to get a statistical estimate of the execution time. While recent frameworks such as the Java Microbenchmark Harness (JMH) address the scaffold challenge, developers have very limited support to build a correct payload. This work focuses on the automatic generation of payloads, starting from a code segment selected in a large application. Our generative technique prevents two of the most common mistakes made in microbenchmarks: dead code elimination and constant folding. A microbenchmark is such a small program that can be “over-optimized” by the JIT and result in distorted time measures, if not designed carefully. Our technique automatically extracts the segment into a compilable payload and generates additional code to prevent the risks of “over-optimization”. The whole approach is embedded in a tool called AutoJMH, which generates payloads for JMH scaffolds. We validate the capabilities AutoJMH, showing that the tool is able to process a large percentage of segments in real programs. We also show that AutoJMH can match the quality of payloads handwritten by performance experts and outperform those written by professional Java developers without experience in microbenchmarking.

Publisher's Version Article Search Info

Model-Based Testing and Oracles

Visualization of Combinatorial Models and Test Plans
Rachel Tzoref-Brill, Paul Wojciak, and Shahar Maoz
(Tel Aviv University, Israel; IBM Research, Israel; IBM, USA)
Combinatorial test design (CTD) is an effective and widely used test design technique. CTD provides automatic test plan generation, but it requires a manual definition of the test space in the form of a combinatorial model. One challenge for successful application of CTD in practice relates to this manual model definition and maintenance process. Another challenge relates to the comprehension and use of the test plan generated by CTD for prioritization purposes.
In this work we introduce the use of visualizations as a means to address these challenges. We apply three different forms of visualization, matrices, graphs, and treemaps, to visualize the relationships between the different elements of the model, and to visualize the strength of each test in the test plan and the relationships between the different tests in terms of combinatorial coverage. We evaluate our visualizations via a user survey with 19 CTD practitioners, as well as via two industrial projects in which our visualization was used and allowed test designers to get vital insight into their models and into the coverage provided through CTD generated test plans.

Publisher's Version Article Search
Finding Access Control Bugs in Web Applications with CanCheck
Ivan Bocić and Tevfik Bultan
(University of California at Santa Barbara, USA)
Access control bugs in web applications can have dire consequences since many web applications store private and sensitive data. In this paper we present an automated verification technique for access control in Ruby on Rails (Rails) applications. Our technique starts by automatically extracting a model that captures 1) the ways the data is accessed and modified by the application, 2) the access control policy of the application, and 3) the authorization checks used for access control policy enforcement. Then, it automatically translates this model to first order logic and uses automated theorem provers to check whether the declared access control policy is correctly enforced by the implementation. We implemented our technique in a tool called CanCheck. Using CanCheck on open source Rails applications, we found numerous previously unknown exploitable access control bugs as well as several deficiencies in access control policies.

Publisher's Version Article Search
SOFIA: An Automated Security Oracle for Black-Box Testing of SQL-Injection Vulnerabilities
Mariano Ceccato, Cu D. Nguyen, Dennis Appelt, and Lionel C. Briand
(Fondazione Bruno Kessler, Italy; University of Luxembourg, Luxembourg)
Security testing is a pivotal activity in engineering secure software. It consists of two phases: generating attack inputs to test the system, and assessing whether test executions expose any vulnerabilities. The latter phase is known as the security oracle problem.
In this work, we present SOFIA, a Security Oracle for SQL-Injection Vulnerabilities. SOFIA is programming-language and source-code independent, and can be used with various attack generation tools. Moreover, because it does not rely on known attacks for learning, SOFIA is meant to also detect types of SQLi attacks that might be unknown at learning time. The oracle challenge is recast as a one-class classification problem where we learn to characterise legitimate SQL statements to accurately distinguish them from SQLi attack statements.
We have carried out an experimental validation on six applications, among which two are large and widely-used. SOFIA was used to detect real SQLi vulnerabilities with inputs generated by three attack generation tools. The obtained results show that SOFIA is computationally fast and achieves a recall rate of 100% (i.e., missing no attacks) with a low false positive rate (0.6%).

Publisher's Version Article Search
Supporting Oracle Construction via Static Analysis
Junjie Chen, Yanwei Bai, Dan Hao, Lingming Zhang, Lu Zhang, Bing Xie, and Hong Mei
(Peking University, China; University of Texas at Dallas, USA)
In software testing, the program under test is usually executed with test inputs and checked against a test oracle, which is a mechanism to verify whether the program behaves as expected. Selecting the right oracle data to observe is crucial in test oracle construction. In the literature, researchers have proposed two dynamic approaches to oracle data selection by analyzing test execution information (e.g., variables' values or interaction information). However, collecting such information during program execution may incur extra cost. In this paper, we present the first static approach to oracle data selection, SODS (Static Oracle Data Selection). In particular, SODS first identifies the substitution relationships between candidate oracle data by constructing a probabilistic substitution graph based on the definition-use chains of the program under test, then estimates the fault-observing capability of each candidate oracle data, and finally selects a subset of oracle data with strong fault-observing capability. For programs with analyzable test code, we further extend SODS via pruning the probabilistic substitution graph based on 0-1-CFA call graph analysis. The experimental study on 11 subject systems written in C or Java demonstrates that our static approach is more effective and much more efficient than state-of-the-art dynamic approaches in most cases.

Publisher's Version Article Search


Local-Based Active Classification of Test Report to Assist Crowdsourced Testing
Junjie Wang, Song Wang, Qiang Cui, and Qing Wang
(Institute of Software at Chinese Academy of Sciences, China; University at Chinese Academy of Sciences, China; University of Waterloo, Canada)
In crowdsourced testing, an important task is to identify the test reports that actually reveal fault - true fault, from the large number of test reports submitted by crowd workers. Most existing approaches towards this problem utilized supervised machine learning techniques, which often require users to manually label a large amount of training data. Such process is time-consuming and labor-intensive. Thus, reducing the onerous burden of manual labeling while still being able to achieve good performance is crucial. Active learning is one potential technique to address this challenge, which aims at training a good classifier with as few labeled data as possible. Nevertheless, our observation on real industrial data reveals that existing active learning approaches generate poor and unstable performances on crowdsourced testing data. We analyze the deep reason and find that the dataset has significant local biases. To address the above problems, we propose LOcal-based Active ClassiFication (LOAF) to classify true fault from crowdsourced test reports. LOAF recommends a small portion of instances which are most informative within local neighborhood, and asks user their labels, then learns classifiers based on local neighborhood. Our evaluation on 14,609 test reports of 34 commercial projects from one of the Chinese largest crowdsourced testing platforms shows that our proposed LOAF can generate promising results. In addition, its performance is even better than existing supervised learning approaches which built on large amounts of labelled historical data. Moreover, we also implement our approach and evaluate its usefulness using real-world case studies. The feedbacks from testers demonstrate its practical value.

Publisher's Version Article Search
Multi-objective Test Report Prioritization using Image Understanding
Yang Feng, James A. Jones, Zhenyu Chen, and Chunrong Fang
(University of California at Irvine, USA; Nanjing University, China)
In crowdsourced software testing, inspecting the large number of test reports is an overwhelming but inevitable software maintenance task. In recent years, to alleviate this task, many text-based test-report classification and prioritization techniques have been proposed. However in the mobile testing domain, test reports often consist of more screenshots and shorter descriptive text, and thus text-based techniques may be ineffective or inapplicable. The shortage and ambiguity of natural-language text information and the well defined screenshots of activity views within mobile applications motivate our novel technique based on using image understanding for multi-objective test-report prioritization. In this paper, by taking the similarity of screenshots into consideration, we present a multi-objective optimization-based prioritization technique to assist inspections of crowdsourced test reports. In our technique, we employ the Spatial Pyramid Matching (SPM) technique to measure the similarity of the screenshots, and apply the natural-language processing technique to measure the distance between the text of test reports. Furthermore, to validate our technique, an experiment with more than 600 test reports and 2500 images is conducted. The experimental results show that image-understanding techniques can provide benefit to test-report prioritization for most applications.

Publisher's Version Article Search
CrowdService: Serving the Individuals through Mobile Crowdsourcing and Service Composition
Xin Peng, Jingxiao Gu, Tian Huat Tan, Jun Sun, Yijun Yu, Bashar Nuseibeh, and Wenyun Zhao
(Fudan University, China; Singapore University of Technology and Design, Singapore; Open University, UK; University of Limerick, Ireland)
Some user needs in real life can only be accomplished by leveraging the intelligence and labor of other people via crowdsourcing tasks. For example, one may want to confirm the validity of the description of a secondhand laptop by asking someone else to inspect the laptop on site. To integrate these crowdsourcing tasks into user applications, it is required that crowd intelligence and labor be provided as easily accessible services (e.g., Web services), which can be called crowd services. In this paper, we develop a framework named CROWDSERVICE which supplies crowd intelligence and labor as publicly accessible crowd services via mobile crowdsourcing. We implement the proposed framework on the Android platform and evaluate the usability of the framework with a user study.

Publisher's Version Article Search
QUICKAR: Automatic Query Reformulation for Concept Location using Crowdsourced Knowledge
Mohammad Masudur Rahman and Chanchal K. Roy
(University of Saskatchewan, Canada)
During maintenance, software developers deal with numerous change requests made by the users of a software system. Studies show that the developers find it challenging to select appropriate search terms from a change request during concept location. In this paper, we propose a novel technique--QUICKAR--that automatically suggests helpful reformulations for a given query by leveraging the crowdsourced knowledge from Stack Overflow. It determines semantic similarity or relevance between any two terms by analyzing their adjacent word lists from the programming questions of Stack Overflow, and then suggests semantically relevant queries for concept location. Experiments using 510 queries from two software systems suggest that our technique can improve or preserve the quality of 76% of the initial queries on average which is promising. Comparison with one baseline technique validates our preliminary findings, and also demonstrates the potential of our technique.

Publisher's Version Article Search Info


Taming Android Fragmentation: Characterizing and Detecting Compatibility Issues for Android Apps
Lili Wei, Yepang Liu, and Shing-Chi Cheung
(Hong Kong University of Science and Technology, China)
Android ecosystem is heavily fragmented. The numerous combinations of different device models and operating system versions make it impossible for Android app developers to exhaustively test their apps. As a result, various compatibility issues arise, causing poor user experience. However, little is known on the characteristics of such fragmentation-induced compatibility issues and no mature tools exist to help developers quickly diagnose and fix these issues. To bridge the gap, we conducted an empirical study on 191 real-world compatibility issues collected from popular open-source Android apps. Our study characterized the symptoms and root causes of compatibility issues, and disclosed that the patches of these issues exhibit common patterns. With these findings, we propose a technique named FicFinder to automatically detect compatibility issues in Android apps. FicFinder performs static code analysis based on a model that captures Android APIs as well as their associated context by which compatibility issues are triggered. FicFinder reports actionable debugging information to developers when it detects potential issues. We evaluated FicFinder with 27 large-scale open-source Android apps. The results show that FicFinder can precisely detect compatibility issues in these apps and uncover previously-unknown issues.

Publisher's Version Article Search
Automated Model-Based Android GUI Testing using Multi-level GUI Comparison Criteria
Young-Min Baek and Doo-Hwan Bae
(KAIST, South Korea)
Automated Graphical User Interface (GUI) testing is one of the most widely used techniques to detect faults in mobile applications (apps) and to test functionality and usability. GUI testing exercises behaviors of an application under test (AUT) by executing events on GUIs and checking whether the app behaves correctly. In particular, because Android leads in market share of mobile OS platforms, a lot of research on automated Android GUI testing techniques has been performed. Among various techniques, we focus on model-based Android GUI testing that utilizes a GUI model for systematic test generation and effective debugging support. Since test inputs are generated based on the underlying model, accurate GUI modeling of an AUT is the most crucial factor in order to generate effective test inputs. However, most modern Android apps contain a number of dynamically constructed GUIs that make accurate behavior modeling more challenging. To address this problem, we propose a set of multi-level GUI Comparison Criteria (GUICC) that provides the selection of multiple abstraction levels for GUI model generation. By using multilevel GUICC, we conducted empirical experiments to identify the influence of GUICC on testing effectiveness. Results show that our approach, which performs model-based testing with multi-level GUICC, achieved higher effectiveness than activity-based GUI model generation. We also found that multi-level GUICC can alleviate the inherent state explosion problems of existing a single-level GUICC for behavior modeling of real-world Android apps by flexibly manipulating GUICC.

Publisher's Version Article Search Info
HybriDroid: Static Analysis Framework for Android Hybrid Applications
Sungho Lee, Julian Dolby, and Sukyoung Ryu
(KAIST, South Korea; IBM Research, USA)
Mobile applications (apps) have long invaded the realm of desktop apps, and hybrid apps become a promising solution for supporting multiple mobile platforms. Providing both platform-specific functionalities via native code like native apps and user interactions via JavaScript code like web apps, hybrid apps help developers build multiple apps for different platforms without much duplicated efforts. However, most hybrid apps are developed in multiple programming languages with different semantics, which may be vulnerable to programmer errors. Moreover, because untrusted JavaScript code may access device-specific features via native code, hybrid apps may be vulnerable to various security attacks. Unfortunately, no existing tools can help hybrid app developers by detecting errors or security holes. In this paper, we present HybriDroid, the first static analysis framework for Android hybrid apps. We investigate the semantics of Android hybrid apps especially for the interoperation mechanism of Android Java and JavaScript. Then, we design and implement a static analysis framework that analyzes inter-communication between Android Java and JavaScript. As example analyses supported by HybriDroid, we implement a bug detector that identifies programmer errors due to the hybrid semantics, and a taint analyzer that finds information leaks cross language boundaries. Our empirical evaluation shows that the tools are practically usable in that they found previously uncovered bugs in real-world Android hybrid apps and possible information leaks via a widely-used advertising platform.

Publisher's Version Article Search


Locus: Locating Bugs from Software Changes
Ming Wen, Rongxin Wu, and Shing-Chi Cheung
(Hong Kong University of Science and Technology, China)
Various information retrieval (IR) based techniques have been proposed recently to locate bugs automatically at the file level. However, their usefulness is often compromised by the coarse granularity of files and the lack of contextual information. To address this, we propose to locate bugs using software changes, which offer finer granularity than files and provide important contextual clues for bug-fixing. We observe that bug inducing changes can facilitate the bug fixing process. For example, it helps triage the bug fixing task to the developers who committed the bug inducing changes or enables developers to fix bugs by reverting these changes. Our study further identifies that change logs and the naturally small granularity of changes can help boost the performance of IR-based bug localization. Motivated by these observations, we propose an IR-based approach Locus to locate bugs from software changes, and evaluate it on six large open source projects. The results show that Locus outperforms existing techniques at the source file level localization significantly. MAP and MRR in particular have been improved, on average, by 20.1% and 20.5%, respectively. Locus is also capable of locating the inducing changes within top 5 for 41.0% of the bugs. The results show that Locus can significantly reduce the number of lines needing to be scanned to locate the bug compared with existing techniques.

Publisher's Version Article Search
Fine-Tuning Spectrum Based Fault Localisation with Frequent Method Item Sets
Gulsher Laghari, Alessandro Murgia, and Serge Demeyer
(University of Antwerp, Belgium)
Continuous integration is a best practice adopted in modern software development teams to identify potential faults immediately upon project build. Once a fault is detected it must be repaired immediately, hence continuous integration provides an ideal testbed for experimenting with the state of the art in fault localisation. In this paper we propose a variant of what is known as spectrum based fault localisation, which leverages patterns of method calls by means of frequent itemset mining. We compare our variant (we refer to it as patterned spectrum analysis) against the state of the art and demonstrate on 351 real bugs drawn from five representative open source java projects that patterned spectrum analysis is more effective in localising the fault. Based on anecdotal evidence from this comparison, we suggest avenues for further improvements. Keywords: Automated developer tests; Continuous Integration; Spectrum based fault localisation; Statistical debugging

Publisher's Version Article Search
Recommending Relevant Classes for Bug Reports using Multi-objective Search
Rafi Almhana, Wiem Mkaouer, Marouane Kessentini, and Ali Ouni
(University of Michigan, USA; Osaka University, Japan)
Developers may follow a tedious process to find the cause of a bug based on code reviews and reproducing the abnormal behavior. In this paper, we propose an automated approach to finding and ranking potential classes with the respect to the probability of containing a bug based on a bug report description. Our approach finds a good balance between minimizing the number of recommended classes and maximizing the relevance of the proposed solution using a multi-objective optimization algorithm. The relevance of the recommended classes (solution) is estimated based on the use of the history of changes and bug-fixing, and the lexical similarity between the bug report description and the API documentation. We evaluated our system on 6 open source Java projects, using the version of the project before fixing the bug of many bug reports. The experimental results show that the search-based approach significantly outperforms three state-of-the-art methods in recommending relevant files for bug reports. In particular, our multi-objective approach is able to successfully locate the true buggy methods within the top 10 recommendations for over 87% of the bug reports.

Publisher's Version Article Search
An Empirical Study on Dependence Clusters for Effort-Aware Fault-Proneness Prediction
Yibiao Yang, Mark Harman, Jens Krinke, Syed Islam, David Binkley, Yuming Zhou, and Baowen Xu
(Nanjing University, China; University College London, UK; University of East London, UK; Loyola University Maryland, USA)
A dependence cluster is a set of mutually inter-dependent program elements. Prior studies have found that large dependence clusters are prevalent in software systems. It has been suggested that dependence clusters have potentially harmful effects on software quality. However, little empirical evidence has been provided to support this claim. The study presented in this paper investigates the relationship between dependence clusters and software quality at the function-level with a focus on effort-aware fault-proneness prediction. The investigation first analyzes whether or not larger dependence clusters tend to be more fault-prone. Second, it investigates whether the proportion of faulty functions inside dependence clusters is significantly different from the proportion of faulty functions outside dependence clusters. Third, it examines whether or not functions inside dependence clusters playing a more important role than others are more fault-prone. Finally, based on two groups of functions (i.e., functions inside and outside dependence clusters), the investigation considers a segmented fault-proneness prediction model. Our experimental results, based on five well-known open-source systems, show that (1) larger dependence clusters tend to be more fault-prone; (2) the proportion of faulty functions inside dependence clusters is significantly larger than the proportion of faulty functions outside dependence clusters; (3) functions inside dependence clusters that play more important roles are more fault-prone; (4) our segmented prediction model can significantly improve the effectiveness of effort-aware fault-proneness prediction in both ranking and classification scenarios. These findings help us better understand how dependence clusters influence software quality.

Publisher's Version Article Search

Program Analysis

StraightTaint: Decoupled Offline Symbolic Taint Analysis
Jiang Ming, Dinghao Wu, Jun Wang, Gaoyao Xiao, and Peng Liu
(Pennsylvania State University, USA)
Taint analysis has been widely applied in ex post facto security applications, such as attack provenance investigation, computer forensic analysis, and reverse engineering. Unfortunately, the high runtime overhead imposed by dynamic taint analysis makes it impractical in many scenarios. The key obstacle is the strict coupling of program execution and taint tracking logic code. To alleviate this performance bottleneck, recent work seeks to offload taint analysis from program execution and run it on a spare core or a different CPU. However, since the taint analysis has heavy data and control dependencies on the program execution, the massive data in recording and transformation overshadow the benefit of decoupling. In this paper, we propose a novel technique to allow very lightweight logging, resulting in much lower execution slowdown, while still permitting us to perform full-featured offline taint analysis. We develop StraightTaint, a hybrid taint analysis tool that completely decouples the program execution and taint analysis. StraightTaint relies on very lightweight logging of the execution information to reconstruct a straight-line code, enabling an offline symbolic taint analysis without frequent data communication with the application. While StraightTaint does not log complete runtime or input values, it is able to precisely identify the causal relationships between sources and sinks, for example. Compared with traditional dynamic taint analysis tools, StraightTaint has much lower application runtime overhead.

Publisher's Version Article Search
IncA: A DSL for the Definition of Incremental Program Analyses
Tamás Szabó, Sebastian Erdweg, and Markus Voelter
(itemis, Germany; Delft University of Technology, Netherlands)
Program analyses support software developers, for example, through error detection, code-quality assurance, and by enabling compiler optimizations and refactorings. To provide real-time feedback to developers within IDEs, an analysis must run efficiently even if the analyzed code base is large.
To achieve this goal, we present a domain-specific language called IncA for the definition of efficient incremental program analyses that update their result as the program changes. IncA compiles analyses into graph patterns and relies on existing incremental matching algorithms. To scale IncA analyses to large programs, we describe optimizations that reduce caching and prune change propagation. Using IncA, we have developed incremental control flow and points-to analysis for C, well-formedness checks for DSLs, and 10 FindBugs checks for Java. Our evaluation demonstrates significant speedups for all analyses compared to their non-incremental counterparts.

Publisher's Version Article Search Info
What Developers Want and Need from Program Analysis: An Empirical Study
Maria Christakis and Christian Bird
(Microsoft Research, USA)
Program Analysis has been a rich and fruitful field of research for many decades, and countless high quality program analysis tools have been produced by academia. Though there are some well-known examples of tools that have found their way into routine use by practitioners, a common challenge faced by researchers is knowing how to achieve broad and lasting adoption of their tools. In an effort to understand what makes a program analyzer most attractive to developers, we mounted a multi-method investigation at Microsoft. Through interviews and surveys of developers as well as analysis of defect data, we provide insight and answers to four high level research questions that can help researchers design program analyzers meeting the needs of software developers.
First, we explore what barriers hinder the adoption of program analyzers, like poorly expressed warning messages. Second, we shed light on what functionality developers want from analyzers, including the types of code issues that developers care about. Next, we answer what non-functional characteristics an analyzer should have to be widely used, how the analyzer should fit into the development process, and how its results should be reported. Finally, we investigate defects in one of Microsoft's flagship software services, to understand what types of code issues are most important to minimize, potentially through program analysis.

Publisher's Version Article Search
DistIA: A Cost-Effective Dynamic Impact Analysis for Distributed Programs
Haipeng Cai and Douglas Thain
(Washington State University, USA; University of Notre Dame, USA)
Dynamic impact analysis is a fundamental technique for understanding the impact of specific program entities, or changes to them, on the rest of the program for concrete executions. However, existing techniques are either inapplicable or of very limited utility for distributed programs running in multiple concurrent processes. This paper presents DistIA, a dynamic analysis of distributed systems that predicts impacts propagated both within and across process boundaries by partially ordering distributed method-execution events, inferring causality from the ordered events, and exploiting message-passing semantics. We applied DistIA to large distributed systems of various architectures and sizes, for which it on average finishes the entire analysis within one minute and safely reduces impact-set sizes by over 43% relative to existing options with runtime overhead less than 8%. Moreover, two case studies initially demonstrate the precision of DistIA and its utility in distributed system understanding. While conservative thus subject to false positives, DistIA balances precision and efficiency to offer cost-effective options for evolving distributed programs.

Publisher's Version Article Search Info

Locks and Races

Radius Aware Probabilistic Testing of Deadlocks with Guarantees
Yan Cai and Zijiang Yang
(Institute of Software at Chinese Academy of Sciences, China; Western Michigan University, USA)
Concurrency bugs only occur under certain interleaving. Existing randomized techniques are usually ineffective. PCT innovatively generates scheduling, before executing a program, based on priori-ties and priority change points. Hence, it provides a probabilistic guarantee to trigger concurrency bugs. PCT randomly selects prior-ity change points among all events, which might be effective for non-deadlock concurrency bugs. However, deadlocks usually in-volve two or more threads and locks, and require more ordering constraints to be triggered. We interestingly observe that, every two events of a deadlock usually occur within a short range. We gener-ally formulate this range as the bug Radius, to denote the max dis-tance of every two events of a concurrency bug. Based on the bug radius, we propose RPro (Radius aware Probabilistic testing) for triggering deadlocks. Unlike PCT, RPro selects priority change points within the radius of the targeted deadlocks but not among all events. Hence, it guarantees larger probabilities to trigger dead-locks. We have implemented RPro and PCT and evaluated them on a set of real-world benchmarks containing 10 unique deadlocks. The experimental results show that RPro triggered all deadlocks with higher probabilities (i.e., >7.7x times larger on average) than that by PCT. We also evaluated RPro with radius varying from 1 to 150 (or 300). The result shows that the radius of a deadlock is much smaller (i.e., from 2 to 114 in our experiment) than the num-ber of all events. This further confirms our observation and makes RPro meaningful in practice.

Publisher's Version Article Search
LockPeeker: Detecting Latent Locks in Java APIs
Ziyi Lin, Hao Zhong, Yuting Chen, and Jianjun Zhao
(Shanghai Jiao Tong University, China; Kyushu University, Japan)
Detecting lock-related defects has long been a hot research topic in software engineering. Many efforts have been spent on detecting such deadlocks in concurrent software systems. However, latent locks may be hidden in application programming interface (API) methods whose source code may not be accessible to developers. Many APIs have latent locks. For example, our study has shown that J2SE alone can have 2,000+ latent locks. As latent locks are less known by developers, they can cause deadlocks that are hard to perceive or diagnose. Meanwhile, the state-of-the-art tools mostly handle API methods as black boxes, and cannot detect deadlocks that involve such latent locks. In this paper, we propose a novel black-box testing approach, called LockPeeker, that reveals latent locks in Java APIs. The essential idea of LockPeeker is that latent locks of a given API method can be revealed by testing the method and summarizing the locking effects during testing execution. We have evaluated LockPeeker on ten real-world Java projects. Our evaluation results show that (1) LockPeeker detects 74.9% of latent locks in API methods, and (2) it enables state-of-the-art tools to detect deadlocks that otherwise cannot be detected.

Publisher's Version Article Search
Sound Static Deadlock Analysis for C/Pthreads
Daniel Kroening, Daniel Poetzl, Peter Schrammel, and Björn Wachter
(University of Oxford, UK; University of Sussex, UK; SSW-Trading, Germany)
We present a static deadlock analysis approach for C/pthreads. The design of our method has been guided by the requirement to analyse real-world code. Our approach is sound (i.e., misses no deadlocks) for programs that have defined behaviour according to the C standard and the pthreads specification, and is precise enough to prove deadlock-freedom for a large number of such programs. The method consists of a pipeline of several analyses that build on a new context- and thread-sensitive abstract interpretation framework. We further present a lightweight dependency analysis to identify statements relevant to deadlock analysis and thus speed up the overall analysis. In our experimental evaluation, we succeeded to prove deadlock-freedom for 292 programs from the Debian GNU/Linux distribution with in total 2.3 MLOC in 4 hours.

Publisher's Version Article Search
Static Race Detection for Device Drivers: The Goblint Approach
Vesal Vojdani, Kalmer Apinis, Vootele Rõtov, Helmut Seidl, Varmo Vene, and Ralf Vogler
(University of Tartu, Estonia; TU Munich, Germany)
Device drivers rely on fine-grained locking to ensure safe access to shared data structures. For human testers, concurrency makes such code notoriously hard to debug; for automated reasoning, dynamically allocated memory and low-level pointer manipulation poses significant challenges. We present a flexible approach to data race analysis, implemented in the open source Goblint static analysis framework, that combines different pointer and value analyses in order to handle a wide range of locking idioms, including locks allocated dynamically as well as locks stored in arrays. To the best of our knowledge, this is the most ambitious effort, having lasted well over ten years, to create a fully automated static race detection tool that can deal with most of the intricate locking schemes found in Linux device drivers. Our evaluation shows that these analyses are sufficiently precise, but practical use of these techniques requires inferring environmental and domain-specific assumptions.

Publisher's Version Article Search

Empirical Studies and New Ideas

An Empirical Evaluation of Two User Interfaces of an Interactive Program Verifier
Martin Hentschel, Reiner Hähnle, and Richard Bubel
(TU Darmstadt, Germany)
Theorem provers have highly complex interfaces, but there are not many systematic studies of their usability and effectiveness. Specifically, for interactive theorem provers the ability to quickly comprehend intermediate proof situations is of pivotal importance. In this paper we present the (as far as we know) first empirical study that systematically compares the effectiveness of different user interfaces of an interactive theorem prover. We juxtapose two different user interfaces of the interactive verifier KeY: the traditional one which focuses on proof objects and a more recent one that provides a view akin to an interactive debugger. We carefully designed a controlled experiment where users were given various proof understanding tasks that had to be solved with alternating interfaces. We provide statistical evidence that the conjectured higher effectivity of the debugger-like interface is not just a hunch.

Publisher's Version Article Search
Traceability Maintenance: Factors and Guidelines
Salome Maro, Anthony Anjorin, Rebekka Wohlrab, and Jan-Philipp Steghöfer
(Chalmers University of Technology, Sweden; University of Gothenburg, Sweden; University of Paderborn, Germany)
Traceability is an important concern for numerous software engineering activities. Establishing traceability links is a challenging and cost-intensive task, which is uneconomical without suitable strategies for maintaining high link quality. Current approaches to Traceability Management (TM), however, often make important assumptions and choices without ensuring that the consequences and implications for traceability maintenance are feasible and desirable in practice.
In this paper, therefore, we identify a set of core factors that influence how the quality of traceability links can be maintained. For each factor, we discuss relevant challenges and provide guidelines on how best to ensure viable traceability maintenance in a practical TM approach. Our guidelines are meant to be used by tool developers and users to select the most appropriate TM approach for their needs.
Our results are based on and supported by data collected from interviews conducted with: (i) 9 of our industrial and academic project partners to elicit requirements for a TM tool, and (ii) 24 software development stakeholders from 15 industrial cases to provide a broader overview of the current state of the practice on TM.
To evaluate the feasibility of our guidelines, we investigate a set of existing TM approaches used in industry with respect to our guidelines.

Publisher's Version Article Search
Usage, Costs, and Benefits of Continuous Integration in Open-Source Projects
Michael Hilton, Timothy Tunnell, Kai Huang, Darko Marinov, and Danny Dig
(Oregon State University, USA; University of Illinois, USA)
Continuous integration (CI) systems automate the compilation, building, and testing of software. Despite CI rising as a big success story in automated software engineering, it has received almost no attention from the research community. For example, how widely is CI used in practice, and what are some costs and benefits associated with CI? Without answering such questions, developers, tool builders, and researchers make decisions based on folklore instead of data. In this paper, we use three complementary methods to study the usage of CI in open-source projects. To understand which CI systems developers use, we analyzed 34,544 open-source projects from GitHub. To understand how developers use CI, we analyzed 1,529,291 builds from the most commonly used CI system. To understand why projects use or do not use CI, we surveyed 442 developers. With this data, we answered several key questions related to the usage, costs, and benefits of CI. Among our results, we show evidence that supports the claim that CI helps projects release more often, that CI is widely adopted by the most popular projects, as well as finding that the overall percentage of projects using CI continues to grow, making it important and timely to focus more research on CI.

Publisher's Version Article Search Info
DSL-Maps: From Requirements to Design of Domain-Specific Languages
Ana Pescador and Juan de Lara
(Autonomous University of Madrid, Spain)
Domain-Specific Languages (DSLs) are central to Model-Driven Engineering, where they are used for creating models for particular domains. However, current research and tools for building DSLs focus on the design and implementation aspects of the DSL, while the requirements analysis phase, and its automated transition to design is largely neglected.
In order to alleviate this situation, we propose DSL-maps, a notation inspired by mind-maps, to represent requirements for DSLs. The notation is supported by a tool, which helps in the automated transition into an initial meta-model design, using a customizable transformation and recommendations from a catalogue of meta-model design patterns.

Publisher's Version Article Search Info
The IDE as a Scriptable Information System
Dimitar Asenov, Peter Müller, and Lukas Vogel
(ETH Zurich, Switzerland; Ergon Informatik, Switzerland)
Software engineering is extremely information-intensive. Every day developers work with source code, version repositories, issue trackers, documentation, web-based and other information resources. However, three key aspects of information work lack good support: (i) combining information from different sources; (ii) flexibly presenting collected information to enable easier comprehension; and (iii) automatically acting on collected information, for example to perform a refactoring. Poor support for these activities makes many common development tasks time-consuming and error-prone. We propose an approach that directly addresses these three issues by integrating a flexible query mechanism into the development environment. Our approach enables diverse ways to process and visualize information and can be extended via scripts. We demonstrate how an implementation of the approach can be used to rapidly write queries that meet a wide range of information needs.

Publisher's Version Article Search Video


Inferring Annotations for Device Drivers from Verification Histories
Zvonimir Pavlinovic, Akash Lal, and Rahul Sharma
(New York University, USA; Microsoft Research, India; Stanford University, USA)
This paper studies and optimizes automated program verification. Detailed reasoning about software behavior is often facilitated by program invariants that hold across all program executions. Finding program invariants is in fact an essential step in automated program verification. Automatic discovery of precise invariants, however, can be very difficult in practice. The problem can be simplified if one has access to a candidate set of assertions (or annotations) and the search for invariants is limited over the space defined by these annotations. Then, the main challenge is to automatically generate quality program annotations. We present an approach that infers program annotations automatically by leveraging the history of verifying related programs. Our algorithm extracts high-quality annotations from previous verification attempts, and then applies them for verifying new programs. We present a case study where we applied our algorithm to Microsoft’s Static Driver Verifier (SDV). SDV is an industrial-strength tool for verification of Windows device drivers that uses manually-tuned heuristics for obtaining a set of annotations. Our technique inferred program annotations comparable in performance to the existing annotations used in SDV that were devised manually by human experts over years. Additionally, the inferred annotations together with the existing ones improved the performance of SDV overall, proving correct 47% of drivers more while running 22% faster in our experiments.

Publisher's Version Article Search Info
Array Length Inference for C Library Bindings
Alisa J. Maas, Henrique Nazaré, and Ben Liblit
(University of Wisconsin-Madison, USA; Federal University of Minas Gerais, Brazil)
Simultaneous use of multiple programming languages (polyglot programming) assists in creating efficient, coherent, modern programs in the face of legacy code. However, manually creating bindings to low-level languages like C is tedious and error-prone. We offer relief in the form of an automated suite of analyses, designed to enhance the quality of automatically produced bindings. These analyses recover high-level array length information that is missing from C’s type system. We emit annotations in the style of GObject-Introspection, which produces bindings from annotations on function signatures. We annotate each array argument as terminated by a special sentinel value, fixed-length, or of length determined by another argument. These properties help produce more idiomatic, efficient bindings. We correctly annotate at least 70% of all arrays with these length types, and our results are comparable to those produced by human annotators, but take far less time to produce.

Publisher's Version Article Search
APEx: Automated Inference of Error Specifications for C APIs
Yuan Kang, Baishakhi Ray, and Suman Jana
(Columbia University, USA; University of Virginia, USA)
Although correct error handling is crucial to software robustness and security, developers often inadvertently introduce bugs in error handling code. Moreover, such bugs are hard to detect using existing bug-finding tools without correct error specifications. Creating error specifications manually is tedious and error-prone. In this paper, we present a new technique that automatically infers error specifications of API functions based on their usage patterns in C programs. Our key insight is that error-handling code tend to have fewer branching points and program statements than the code implementing regular functionality. Our scheme leverages this property to automatically identify error handling code at API call sites and infer the corresponding error constraints. We then use the error constraints from multiple call sites for robust inference of API error specifications. We evaluated our technique on 217 API functions from 6 different libraries across 28 projects written in C and found that it can identify error-handling paths with an average precision of 94% and recall of 66%. We also found that our technique can infer correct API error specifications with an average precision of 77% and recall of 47%. To further demonstrate the usefulness of the inferred error specifications, we used them to find 118 previously unknown potential bugs (including several security flaws that are currently being fixed by the corresponding developers) in the 28 tested projects.

Publisher's Version Article Search

Interactions, Deltas, Goals

On Essential Configuration Complexity: Measuring Interactions in Highly-Configurable Systems
Jens Meinicke, Chu-Pan Wong, Christian Kästner, Thomas Thüm, and Gunter Saake
(University of Magdeburg, Germany; Carnegie Mellon University, USA; TU Braunschweig, Germany)
Quality assurance for highly-configurable systems is challenging due to the exponentially growing configuration space. Interactions among multiple options can lead to surprising behaviors, bugs, and security vulnerabilities. Analyzing all configurations systematically might be possible though if most options do not interact or interactions follow specific patterns that can be exploited by analysis tools. To better understand interactions in practice, we analyze program traces to characterize and identify where interactions occur on control flow and data. To this end, we developed a dynamic analysis for Java based on variability-aware execution and monitor executions of multiple small to medium-sized programs. We find that the essential configuration complexity of these programs is indeed much lower than the combinatorial explosion of the configuration space indicates. However, we also discover that the interaction characteristics that allow scalable and complete analyses are more nuanced than what is exploited by existing state-of-the-art quality assurance strategies.

Publisher's Version Article Search Info
Precise Semantic History Slicing through Dynamic Delta Refinement
Yi Li, Chenguang Zhu, Julia Rubin, and Marsha Chechik
(University of Toronto, Canada; Massachusetts Institute of Technology, USA)
Semantic history slicing solves the problem of extracting changes related to a particular high-level functionality from the software version histories. State-of-the-art techniques combine static program analysis and dynamic execution tracing to infer an over-approximated set of changes that can preserve the functional behaviors captured by a test suite. However, due to the conservative nature of such techniques, the sliced histories may contain irrelevant changes. In this paper, we propose a divide-and-conquer-style partitioning approach enhanced by dynamic delta refinement to produce minimal semantic history slices. We utilize deltas in dynamic invariants generated from successive test executions to learn significance of changes with respect to the target functionality. Empirical results indicate that these measurements accurately rank changes according to their relevance to the desired test behaviors and thus partition history slices in an efficient and effective manner.

Publisher's Version Article Search
Goal-Conflict Detection Based on Temporal Satisfiability Checking
Renzo Degiovanni, Nicolas Ricci, Dalal Alrajeh, Pablo Castro, and Nazareno Aguirre
(Universidad Nacional de Río Cuarto, Argentina; CONICET, Argentina; Imperial College London, UK)
Goal-oriented requirements engineering approaches propose capturing how a system should behave through the specification of high-level goals, from which requirements can then be systematically derived. Goals may however admit subtle situations that make them diverge, i.e., not be satisfiable as a whole under specific circumstances feasible within the domain, called boundary conditions. While previous work allows one to identify boundary conditions for conflicting goals written in LTL, it does so through a pattern-based approach, that supports a limited set of patterns, and only produces pre-determined formulations of boundary conditions.
We present a novel automated approach to compute boundary conditions for general classes of conflicting goals expressed in LTL, using a tableaux-based LTL satisfiability procedure. A tableau for an LTL formula is a finite representation of all its satisfying models, which we process to produce boundary conditions that violate the formula, indicating divergence situations. We show that our technique can automatically produce boundary conditions that are more general than those obtainable through existing previous pattern-based approaches, and can also generate boundary conditions for goals that are not captured by these patterns.

Publisher's Version Article Search Info

Symbolic Execution

Symbolic Execution of Stored Procedures in Database Management Systems
Muhammad Suleman Mahmood, Maryam Abdul Ghafoor, and Junaid Haroon Siddiqui
(Lahore University of Management Sciences, Pakistan)
Stored procedures in database management systems are often used to implement complex business logic. Correctness of these procedures is critical for correct working of the system. However, testing them remains difficult due to many possible states of data and database constraints. This leads to mostly manual testing. Newer tools offer automated execution for unit testing of stored procedures but the test cases are still written manually.
In this paper, we propose a novel approach of using dynamic symbolic execution to automatically generate test cases and corresponding database states for stored procedures. We treat values in database tables as symbolic, model the constraints on data imposed by the schema and by the SQL statements executed by the stored procedure. We use an SMT solver to find values that will drive the stored procedure on a particular execution path.
We instrument the internal execution plans generated by PostgreSQL database management system to extract constraints and use the Z3 SMT solver to generate test cases consisting of table data and procedure inputs. Our evaluation using stored procedures from a large business application shows that this technique can uncover bugs that lead to schema constraint violations and user defined exceptions.

Publisher's Version Article Search Info
Conc-iSE: Incremental Symbolic Execution of Concurrent Software
Shengjian Guo, Markus Kusano, and Chao Wang
(Virginia Tech, USA; University of Southern California, USA)
Software updates often introduce new bugs to existing code bases. Prior regression testing tools focus mainly on test case selection and prioritization whereas symbolic execution tools only handle code changes in sequential software. In this paper, we propose the first incremental symbolic execution method for concurrent software to generate new tests by exploring only the executions affected by code changes between two program versions. Specifically, we develop an inter-thread and inter-procedural change-impact analysis to check if a statement is affected by the changes and then leverage the information to choose executions that need to be reexplored. We also check if execution summaries computed in the previous program can be used to avoid redundant explorations in the new program. We have implemented our method in an incremental symbolic execution tool called Conc-iSE and evaluated it on a large set of multithreaded C programs. Our experiments show that the new method can significantly reduce the overall symbolic execution time when compared with state-of-the-art symbolic execution tools such as KLEE.

Publisher's Version Article Search
Model-Based Whitebox Fuzzing for Program Binaries
Van-Thuan Pham, Marcel Böhme, and Abhik Roychoudhury
(National University of Singapore, Singapore)
Many real-world programs take highly structured and very complex inputs. The automated testing of such programs is non-trivial. If the test input does not adhere to a specific file format, the program returns a parser error. For symbolic execution-based whitebox fuzzing the corresponding error handling code becomes a significant time sink. Too much time is spent in the parser exploring too many paths leading to trivial parser errors. Naturally, the time is better spent exploring the functional part of the program where failure with valid input exposes deep and real bugs in the program. In this paper, we suggest to leverage information about the file format and the data chunks of existing, valid files to swiftly carry the exploration beyond the parser code. We call our approach Model-based Whitebox Fuzzing (MoWF) because the file format input model of blackbox fuzzers can be exploited as a constraint on the vast input space to rule out most invalid inputs during path exploration in symbolic execution. We evaluate on 13 vulnerabilities in 8 large program binaries with 6 separate file formats and found that MoWF exposes all vulnerabilities while both, traditional whitebox fuzzing and model-based blackbox fuzzing, expose only less than half, respectively. Our experiments also demonstrate that MoWF exposes 70% vulnerabilities without any seed inputs.

Publisher's Version Article Search Video
Symbolic Execution of Complex Program Driven by Machine Learning Based Constraint Solving
Xin Li, Yongjuan Liang, Hong Qian, Yi-Qi Hu, Lei Bu, Yang Yu, Xin Chen, and Xuandong Li
(Nanjing University, China)
Symbolic execution is a widely-used program analysis technique. It collects and solves path conditions to guide the program traversing. However, due to the limitation of the current constraint solvers, it is difficult to apply symbolic execution on programs with complex path conditions, like nonlinear constraints and function calls. In this paper, we propose a new symbolic execution tool MLB to handle such problem. Instead of relying on the classical constraint solving, in MLB, the feasibility problems of the path conditions are transformed into optimization problems, by minimizing some dissatisfaction degree. The optimization problems are then handled by the underlying optimization solver through machine learning guided sampling and validation. MLB is implemented on the basis of Symbolic PathFinder and encodes not only the simple linear path conditions, but also nonlinear arithmetic operations, and even black-box function calls of library methods, into symbolic path conditions. Experiment results show that MLB can achieve much better coverage on complex real-world programs.

Publisher's Version Article Search
Towards Bounded Model Checking using Nonlinear Programming Solver
Masataka Nishi
(Hitachi, Japan)
Due to their complexity, currently available bounded model checking techniques based on Boolean Satisfiability and Satisfiability Modulo Theories inadequately handle non-linear floating-point and integer arithmetic. Using a numerical approach, we reduce a bounded model checking problem to a constraint satisfaction problem. Currently available techniques attempt to solve the constraint problem but can guarantee neither global convergence nor correctness. Using the IPOPT and ANTIGONE non-linear programming (NLP) solvers, we transform the original constraint satisfaction problem from one having disjunctions of constraints into one having conjunctions of constraints with a few introduced auxiliary variables. The transformation lowers the computing cost and preserves the Boolean structure of the original problem while complying with limits of NLP solvers.

Publisher's Version Article Search

Design and Specs

Identifying Domain Elements from Textual Specifications
Jitendra Singh Thakur and Atul Gupta
(IIITDM Jabalpur, India; Jabalpur Engineering College, India)
Analysis modeling refers to the task of identifying domain objects, their attributes and operations, and the relationships between these objects from software requirements specifications which are usually written in some natural language. There have been a few efforts to automate this task, but they seem to be largely constrained by the language related issues as well as the lack of a systematic transformation process. In this paper, we propose a systematic, automated transformation approach which first interprets the specification sentences based on the Hornby's verb patterns, and then uses semantic relationships between the words in the sentences, obtained from Type Dependencies using Stanford NL Parser, to identify the domain elements from them. With the help of a controlled experiment, we show that the analysis class diagrams generated by the proposed approach are far more correct, far more complete and less redundant than those generated by the exiting automated approaches.

Publisher's Version Article Search
Continuous Detection of Design Flaws in Evolving Object-Oriented Programs using Incremental Multi-pattern Matching
Sven Peldszus, Géza Kulcsár, Malte Lochau, and Sandro Schulze
(University of Koblenz-Landau, Germany; TU Darmstadt, Germany; TU Hamburg, Germany)
Design flaws in object-oriented programs may seriously corrupt code quality thus increasing the risk for introducing subtle errors during software maintenance and evolution. Most recent approaches identify design flaws in an ad-hoc manner, either focusing on software metrics, locally restricted code smells, or on coarse-grained architectural anti-patterns. In this paper, we utilize an abstract program model capturing high-level object-oriented code entities, further augmented with qualitative and quantitative design-related information such as coupling/cohesion. Based on this model, we propose a comprehensive methodology for specifying object-oriented design flaws by means of compound rules integrating code metrics, code smells and anti-patterns in a modular way. This approach allows for efficient, automated design-flaw detection through incremental multi-pattern matching, by facilitating systematic information reuse among multiple detection rules as well as between subsequent detection runs on continuously evolving programs. Our tool implementation comprises well-known anti-patterns for Java programs. The results of our experimental evaluation show high detection precision, scalability to real-size programs, as well as a remarkable gain in efficiency due to information reuse.

Publisher's Version Article Search Info
Efficient Detection of Inconsistencies in a Multi-developer Engineering Environment
Andreas Demuth, Markus Riedl-Ehrenleitner, and Alexander Egyed
(JKU Linz, Austria)
Software developers work concurrently on different kinds of development artifacts such as requirements, architecture, design, or source code. To keep these development artifacts consistent, developers have a wide range of consistency checking approaches available. However, most existing consistency checkers work best in context of single tools and they are not well suited when development artifacts are distributed among different tools and are being modified concurrently by many developers.
This paper presents a novel, cloud-based approach to consistency checking in a multi-developer/-tool engineering environment. It allows instant consistency checking even if developers and their tools are distributed and even if they do not have access to all artifacts. It does this by systematically reusing consistency checking knowledge to keep the memory/CPU cost of consistency checking to a small constant overhead per developer. The feasibility and scalability of our approach is demonstrated through an empirical validation with 22 partly industrial system models. A prototype implementation implementation is available through the DesignSpace Engineering Cloud.

Publisher's Version Article Search
How Good Are the Specs? A Study of the Bug-Finding Effectiveness of Existing Java API Specifications
Owolabi Legunsen, Wajih Ul Hassan, Xinyue Xu, Grigore Roşu, and Darko Marinov
(University of Illinois at Urbana-Champaign, USA)
Runtime verification can be used to find bugs early, during software development, by monitoring test executions against formal specifications (specs). The quality of runtime verification depends on the quality of the specs. While previous research has produced many specs for the Java API, manually or through automatic mining, there has been no large-scale study of their bug-finding effectiveness.
We present the first in-depth study of the bug-finding effectiveness of previously proposed specs. We used JavaMOP to monitor 182 manually written and 17 automatically mined specs against more than 18K manually written and 2.1M automatically generated tests in 200 open-source projects. The average runtime overhead was under 4.3x. We inspected 652 violations of manually written specs and (randomly sampled) 200 violations of automatically mined specs. We reported 95 bugs, out of which developers already fixed 74. However, most violations, 82.81% of 652 and 97.89% of 200, were false alarms.
Our empirical results show that (1) runtime verification technology has matured enough to incur tolerable runtime overhead during testing, and (2) the existing API specifications can find many bugs that developers are willing to fix; however, (3) the false alarm rates are worrisome and suggest that substantial effort needs to be spent on engineering better specs and properly evaluating their effectiveness

Publisher's Version Article Search

Test Generation

Greedy Combinatorial Test Case Generation using Unsatisfiable Cores
Akihisa Yamada, Armin Biere, Cyrille Artho, Takashi Kitamura, and Eun-Hye Choi
(University of Innsbruck, Austria; JKU Linz, Austria; AIST, Japan)
Combinatorial testing aims at covering the interactions of parameters in a system under test, while some combinations may be forbidden by given constraints (forbidden tuples). In this paper, we illustrate that such forbidden tuples correspond to unsatisfiable cores, a widely understood notion in the SAT solving community. Based on this observation, we propose a technique to detect forbidden tuples lazily during a greedy test case generation, which significantly reduces the number of required SAT solving calls. We further reduce the amount of time spent in SAT solving by essentially ignoring constraints while constructing each test case, but then “amending” it to obtain a test case that satisfies the constraints, again using unsatisfiable cores. Finally, to complement a disturbance due to ignoring constraints, we implement an efficient approximative SAT checking function in the SAT solver Lingeling. Through experiments we verify that our approach significantly improves the efficiency of constraint handling in our greedy combinatorial testing algorithm.

Publisher's Version Article Search
Towards Automatically Generating Descriptive Names for Unit Tests
Benwen Zhang, Emily Hill, and James Clause
(University of Delaware, USA; Drew University, USA)
During maintenance, developers often need to understand the purpose of a test. One of the most potentially useful sources of information for understanding a test is its name. Ideally, test names are descriptive in that they accurately summarize both the scenario and the expected outcome of the test. Despite the benefits of being descriptive, test names often fall short of this goal. In this paper we present a new approach for automatically generating descriptive names for existing test bodies. Using a combination of natural-language program analysis and text generation, the technique creates names that summarize the test's scenario and the expected outcome. The results of our evaluation show that, (1) compared to alternative approaches, the names generated by our technique are significantly more similar to human-generated names and are nearly always preferred by developers, (2) the names generated by our technique are preferred over or are equivalent to the original test names in 83% of cases, and (3) our technique is several orders of magnitude faster than manually writing test names.

Publisher's Version Article Search
Applying Combinatorial Test Data Generation to Big Data Applications
Nan Li, Yu Lei, Haider Riaz Khan, Jingshu Liu, and Yun Guo
(Medidata Solutions, USA; University of Texas at Arlington, USA; George Mason University, USA)
Big data applications (e.g., Extract, Transform, and Load (ETL) applications) are designed to handle great volumes of data. However, processing such great volumes of data is time-consuming. There is a need to construct small yet effective test data sets during agile development of big data applications.
In this paper, we apply a combinatorial test data generation approach to two real-world ETL applications at Medidata. In our approach, we first create Input Domain Models (IDMs) automatically by analyzing the original data source and incorporating constraints manually derived from requirements. Next, the IDMs are used to create test data sets that achieve t-way coverage, which has shown to be very effective in detecting software faults. The generated test data sets also satisfy all the constraints identified in the first step. To avoid creating IDMs from scratch when there is a change to the original data source or constraints, our approach extends the original IDMs with additional information. The new IDMs, which we refer to as Adaptive IDMs (AIDMs), are updated by comparing the changes against the additional information, and are then used to generate new test data sets. We implement our approach in a tool, called comBinatorial bIg daTa Test dAta Generator (BIT-TAG).
Our experience shows that combinatorial testing can be effectively applied to big data applications. In particular, the test data sets created using our approach for the two ETL applications are only a small fraction of the original data source, but we were able to detect all the faults found with the original data source.

Publisher's Version Article Search
Generating Test Cases to Expose Concurrency Bugs in Android Applications
Hongyin Tang, Guoquan Wu, Jun Wei, and Hua Zhong
(Institute of Software at Chinese Academy of Sciences, China)
Mobile systems usually support an event-based model of concurrent programming. This model, although advantageous to maintain responsive user interfaces, may lead to subtle concurrency errors due to unforeseen threads interleaving coupled with non-deterministic reordering of asynchronous events. These bugs are very difficult to reproduce even by the same user action sequences that trigger them, due to the undetermined schedules of underlying events and threads. In this paper, we proposed RacerDroid, a novel technique that aims to expose concurrency bugs in android applications by actively controlling event schedule and thread interleaving, given the test cases that have potential data races. By exploring the state model of the application constructed dynamically, our technique starts first to generate a test case that has potential data races based on the results obtained from existing static or dynamic race detection technique. Then it reschedules test cases execution by actively controlling event dispatching and thread interleaving to determine whether such potential races really lead to thrown exceptions or assertion violations. Our preliminary experiments show that RacerDroid is effective, and it confirms real data races, while at the same time eliminates false warnings for Android apps found in the wild.

Publisher's Version Article Search
Automatic Test Image Generation using Procedural Noise
Matthew Patrick, Matthew D. Castle, Richard O. J. H. Stutt, and Christopher A. Gilligan
(University of Cambridge, UK)
It is difficult to test programs that input images, due to the large number of (pixel) values that must be chosen and the complex ways these values interact. Typically, such programs are tested manually, using images that have known results. However, this is a laborious process and limited in the range of tests that can be applied. We introduce a new approach for testing programs that input images automatically, using procedural noise and spatial statistics to create inputs that are both realistic and can easily be tuned to have specific properties. The effectiveness of our approach is illustrated on an epidemiological simulation of a recently introduced tree pest in Great Britain: Oriental Chestnut Gall Wasp. Our approach produces images that match the real landscapes more closely than other techniques and can be used (alongside metamorphic relations) to detect smaller (artificially introduced) errors with greater accuracy.

Publisher's Version Article Search

Code Comparison and Transformation

Move-Optimized Source Code Tree Differencing
Georg Dotzler and Michael Philippsen
(University of Erlangen-Nuremberg, Germany)
When it is necessary to express changes between two source code files as a list of edit actions (an edit script), modern tree differencing algorithms are superior to most text-based approaches because they take code movements into account and express source code changes more accurately. We present 5 general optimizations that can be added to state-of-the-art tree differencing algorithms to shorten the resulting edit scripts. Applied to Gumtree, RTED, JSync, and ChangeDistiller, they lead to shorter scripts for 18-98% of the changes in the histories of 9 open-source software repositories. These optimizations also are parts of our novel Move-optimized Tree DIFFerencing algorithm (MTDIFF) that has a higher accuracy in detecting moved code parts. MTDIFF (which is based on the ideas of ChangeDistiller) further shortens the edit script for another 20% of the changes in the repositories. MTDIFF and all the benchmarks are available under an open-source license.

Publisher's Version Article Search Info
Migrating Cascading Style Sheets to Preprocessors by Introducing Mixins
Davood Mazinanian and Nikolaos Tsantalis
(Concordia University, Canada)
Cascading Style Sheets (CSS) is the standard language for styling web documents and is extensively used in the industry. However, CSS lacks constructs that would allow code reuse (e.g., functions). Consequently, maintaining CSS code is often a cumbersome and error-prone task. Preprocessors (e.g., Less and Sass) have been introduced to fill this gap, by extending CSS with the missing constructs. Despite the clear maintainability benefits coming from the use of preprocessors, there is currently no support for migrating legacy CSS code to preprocessors. In this paper, we propose a technique for automatically detecting duplicated style declarations in CSS code that can be migrated to preprocessor functions (i.e., mixins). Our technique can parameterize differences in the style values of duplicated declarations, and ensure that the migration will not change the presentation semantics of the web documents. The evaluation has shown that our technique is able to detect 98% of the mixins that professional developers introduced in websites and Style Sheet libraries, and can safely migrate real CSS code.

Publisher's Version Article Search Info
Automatic Runtime Recovery via Error Handler Synthesis
Tianxiao Gu, Chengnian Sun, Xiaoxing Ma, Jian Lü, and Zhendong Su
(Nanjing University, China; University of California at Davis, USA)
Software systems are often subject to unexpected runtime errors. Automatic runtime recovery (ARR) techniques aim at recovering them from erroneous states and maintaining them functional in the field. This paper proposes Ares , a novel, practical approach to performing ARR. Our key insight is to leverage a system's already built-in error handling support to recover from unexpected errors. To this end, we synthesize error handlers via two methods: error transformation and early return. We also equip Ares with a lightweight in-vivo testing infrastructure to select the right synthesis methods and avoid potentially dangerous error handlers. Unlike existing ARR techniques based on heavyweight mechanisms (e.g., checkpoint-restart and runtime monitoring), our approach expands the intrinsic capability of runtime error resilience already existing in software systems to handle unexpected errors. Ares's lightweight mechanism makes it practical and easy to be integrated into production environments. We have implemented Ares on top of both the Java HotSpot VM and Android ART, and applied it to 52 real-world bugs. The results are promising — Ares successfully recovers from 39 of them and incurs low overhead.

Publisher's Version Article Search
Mining Revision Histories to Detect Cross-Language Clones without Intermediates
Xiao Cheng, Zhiming Peng, Lingxiao Jiang, Hao Zhong, Haibo Yu, and Jianjun Zhao
(Shanghai Jiao Tong University, China; Singapore Management University, Singapore; Kyushu University, Japan)
To attract more users on different platforms, many projects release their versions in multiple programming languages (e.g., Java and C#). They typically have many code snippets that implement similar functionalities, i.e., cross-language clones. Programmers often need to track and modify cross-language clones consistently to maintain similar functionalities across different language implementations. In literature, researchers have proposed approaches to detect cross- language clones, mostly for languages that share a common intermediate language (such as the .NET language family) so that techniques for detecting single-language clones can be applied. As a result, those approaches cannot detect cross-language clones for many projects that are not implemented in a .NET language. To overcome the limitation, in this paper, we propose a novel approach, CLCMiner, that detects cross-language clones automatically without the need of an intermediate language. Our approach mines such clones from revision histories, which reflect how programmers maintain cross-language clones in practice. We have implemented a prototype tool for our approach and conducted an evaluation on five open source projects that have versions in Java and C#. The results show that CLCMiner achieves high accuracy and point to promising future work.

Publisher's Version Article Search
Battery-Aware Transformations in Mobile Applications
Jürgen Cito, Julia Rubin, Phillip Stanley-Marbell, and Martin Rinard
(University of Zurich, Switzerland; Massachusetts Institute of Technology, USA)
We present an adaptive binary transformation system for reducing the energy impact of advertisements and analytics in mobile applications. Our approach accommodates both the needs of mobile app developers to obtain income from advertisements and the desire of mobile device users for longer battery life. Our technique automatically identifies recurrent advertisement and analytics requests and throttles these requests based on a mobile device's battery status. Of the Android applications we analyzed, 75% have at least one connection that exhibits such recurrent requests. Our automated detection scheme classifies these requests with 100% precision and 80.5% recall. Applying the proposed battery-aware transformations to a representative mobile application reduces the power consumption of the mobile device by 5.8%, without the negative effect of completely removing advertisements.

Publisher's Version Article Search


Bugram: Bug Detection with N-gram Language Models
Song Wang, Devin Chollak, Dana Movshovitz-Attias, and Lin Tan
(University of Waterloo, Canada; Carnegie Mellon University, USA)
To improve software reliability, many rule-based techniques have been proposed to infer programming rules and detect violations of these rules as bugs. These rule-based approaches often rely on the highly frequent appearances of certain patterns in a project to infer rules. It is known that if a pattern does not appear frequently enough, rules are not learned, thus missing many bugs.
In this paper, we propose a new approach—Bugram—that leverages n-gram language models instead of rules to detect bugs. Bugram models program tokens sequentially, using the n-gram language model. Token sequences from the program are then assessed according to their probability in the learned model, and low probability sequences are marked as potential bugs. The assumption is that low probability token sequences in a program are unusual, which may indicate bugs, bad practices, or unusual/special uses of code of which developers may want to be aware.
We evaluate Bugram in two ways. First, we apply Bugram on the latest versions of 16 open source Java projects. Results show that Bugram detects 59 bugs, 42 of which are manually verified as correct, 25 of which are true bugs and 17 are code snippets that should be refactored. Among the 25 true bugs, 23 cannot be detected by PR-Miner. We have reported these bugs to developers, 7 of which have already been confirmed by developers (4 of them have already been fixed), while the rest await confirmation. Second, we further compare Bugram with three additional graph- and rule-based bug detection tools, i.e., JADET, Tikanga, and GrouMiner. We apply Bugram on 14 Java projects evaluated in these three studies. Bugram detects 21 true bugs, at least 10 of which cannot be detected by these three tools. Our results suggest that Bugram is complementary to existing rule-based bug detection approaches.

Publisher's Version Article Search
Mining Input Grammars from Dynamic Taints
Matthias Höschele and Andreas Zeller
(Saarland University, Germany)
Knowing which part of a program processes which parts of an input can reveal the structure of the input as well as the structure of the program. In a URL, for instance, the protocol http, the host, and the path path would be handled by different functions and stored in different variables. Given a set of sample inputs, we use dynamic tainting to trace the data flow of each input character, and aggregate those input fragments that would be handled by the same function into lexical and syntactical entities. The result is a context-free grammar that reflects valid input structure. In its evaluation, our AUTOGRAM prototype automatically produced readable and structurally accurate grammars for inputs like URLs, spreadsheets or configuration files. The resulting grammars not only allow simple reverse engineering of input formats, but can also directly serve as input for test generators.

Publisher's Version Article Search
Phrase-Based Extraction of User Opinions in Mobile App Reviews
Phong Minh Vu, Hung Viet Pham, Tam The Nguyen, and Tung Thanh Nguyen
(Utah State University, USA)
Mobile app reviews often contain useful user opinions like bug reports or suggestions. However, looking for those opinions manually in thousands of reviews is inefective and time- consuming. In this paper, we propose PUMA, an automated, phrase-based approach to extract user opinions in app reviews. Our approach includes a technique to extract phrases in reviews using part-of-speech (PoS) templates; a technique to cluster phrases having similar meanings (each cluster is considered as a major user opinion); and a technique to monitor phrase clusters with negative sentiments for their outbreaks over time. We used PUMA to study two popular apps and found that it can reveal severe problems of those apps reported in their user reviews.

Publisher's Version Article Search

Mining and Retrieval

Practical Guidelines for Change Recommendation using Association Rule Mining
Leon Moonen, Stefano Di Alesio, David Binkley, and Thomas Rolfsnes
(Simula Research Laboratory, Norway; Loyola University Maryland, USA)
Association rule mining is an unsupervised learning technique that infers relationships among items in a data set. This technique has been successfully used to analyze a system's change history and uncover evolutionary coupling between system artifacts. Evolutionary coupling can, in turn, be used to recommend artifacts that are potentially affected by a given set of changes to the system. In general, the quality of such recommendations is affected by (1) the values selected for various parameters of the mining algorithm, (2) characteristics of the set of changes used to derive a recommendation, and (3) characteristics of the system's change history for which recommendations are generated.
In this paper, we empirically investigate the extent to which certain choices for these factors affect change recommendation. Specifically, we conduct a series of systematic experiments on the change histories of two large industrial systems and eight large open source systems, in which we control the size of the change set for which to derive a recommendation, the measure used to assess the strength of the evolutionary coupling, and the maximum size of historical changes taken into account when inferring these couplings. We use the results from our study to derive a number of practical guidelines for applying association rule mining for change recommendation.

Publisher's Version Article Search Info
Learning a Dual-Language Vector Space for Domain-Specific Cross-Lingual Question Retrieval
Guibin Chen, Chunyang Chen, Zhenchang Xing, and Bowen Xu
(Nanyang Technological University, Singapore; Zhejiang University, China)
The lingual barrier limits the ability of millions of non-English speaking developers to make effective use of the tremendous knowledge in Stack Overflow, which is archived in English. For cross-lingual question retrieval, one may use translation-based methods that first translate the non-English queries into English and then perform monolingual question retrieval in English. However, translation-based methods suffer from semantic deviation due to inappropriate translation, especially for domain-specific terms, and lexical gap between queries and questions that share few words in common. To overcome the above issues, we propose a novel cross-lingual question retrieval based on word embeddings and convolutional neural network (CNN) which are the state-of-the-art deep learning techniques to capture word- and sentence-level semantics. The CNN model is trained with large amounts of examples from Stack Overflow duplicate questions and their corresponding translation by machine, which guides the CNN to learn to capture informative word and sentence features to recognize and quantify semantic similarity in the presence of semantic deviations and lexical gaps. A uniqueness of our approach is that the trained CNN can map documents in two languages (e.g., Chinese queries and English questions) in a dual-language vector space, and thus reduce the cross-lingual question retrieval problem to a simple k-nearest neighbors search problem in the dual-language vector space, where no query or question translation is required. Our evaluation shows that our approach significantly outperforms the translation-based method, and can be extended to dual-language documents retrieval from different sources.

Publisher's Version Article Search


Mobile and Security

Reflection-Aware Static Analysis of Android Apps
Li Li, Tegawendé F. Bissyandé, Damien Octeau, and Jacques Klein
(University of Luxembourg, Luxembourg; Pennsylvania State University, USA)
We demonstrate the benefits of DroidRA, a tool for taming reflection in Android apps. DroidRA first statically extracts reflection-related object values from a given Android app. Then, it leverages the extracted values to boost the app in a way that reflective calls are no longer a challenge for existing static analyzers. This is achieved through a bytecode instrumentation approach, where reflective calls are supplemented with explicit traditional Java method calls which can be followed by state-of-the-art analyzers which do not handle reflection. Instrumented apps can thus be completely analyzed by existing static analyzers, which are no longer required to be modified to support reflection-aware analysis. The video demo of DroidRA can be found at

Publisher's Version Article Search
Relda2: An Effective Static Analysis Tool for Resource Leak Detection in Android Apps
Tianyong Wu, Jierui Liu, Xi Deng, Jun Yan, and Jian Zhang
(Institute of Software at Chinese Academy of Sciences, China; University at Chinese Academy of Sciences, China)
Resource leak is a common bug in Android applications (apps for short). In general, it is caused by missing release operations of the resources provided by Android (like Camera, Media Player and Sensors) that require programmers to explicitly release them. It might lead to several serious problems for the app and system, such as performance degradation and system crash.
This paper presents Relda2, a light-weight, scalable and practical static analysis tool, for detecting resource leaks in the byte-code of Android apps automatically. It supports two analysis techniques (flow-insensitive for quick scanning and flow-sensitive for accurate scanning), and performs inter-procedural analysis to get more precise bug reports. In addition, our tool is practical to analyze real-world apps, and has been applied to 103 Android apps, including industry applications and open source programs. Currently, we found 67 real resource leaks in these apps, which we have confirmed manually. A demo video of our tool can be found at the website:

Publisher's Version Article Search
An End-User Oriented Tool Suite for Development of Mobile Applications
Zhongyi Zhai, Bo Cheng, Meng Niu, Zhaoning Wang, Yimeng Feng, and Junliang Chen
(Beijing University of Posts and Telecommunications, China)
In this paper, we show an end-user oriented tool suite for mobile application development. The advantages of this tool suite are that the graphical user interface (GUI), as well as the application logic can both be developed in a rapid and simple way, and web-based services on the Internet can be integrated into our platform by end-users. This tool suite involves three sub-systems, namely ServiceAccess, EasyApp and LSCE. ServiceAccess takes charge of the registration and management of heterogeneous services, and can export different form of services according to the requirements of the other sub-systems. EasyApp is responsible for developing GUI in the form of mobile app. LSCE takes charge of creating the application logic that can be invoked by mobile app directly. Finally, a development case is presented to illustrate the development process using this tool suite. The URL of demo video:

Publisher's Version Article Search Video
Model Driven Design of Heterogeneous Synchronous Embedded Systems
Huafeng Zhang, Yu Jiang, Han Liu, Hehua Zhang, Ming Gu, and Jiaguang Sun
(Tsinghua University, China; University of Illinois at Urbana-Champaign, USA)
Synchronous embedded systems are becoming more and more complicated and are usually implemented with integrated hardware/software solutions. This implementation manner brings new challenges to the traditional model-driven design environments such as SCADE and STATEMATE, that supports pure hardware or software design. In this paper, we propose a co-design tool Tsmart-Edola to facilitate the system developers, and automatically generate the executable VHDL code and C code from the for- mal verified SyncBlock computation model. SyncBlock is a lightweight high-level system specification model with well defined syntax, simulation and formal semantics. Based on which, the graphical model editor, graphical simulator, verification translator, and code generator are implemented and seamlessly integrated into the Tsmart-Edola. For evaluation, we apply Tsmart-Edola to the design of a real-world train controller based on the international standard IEC 61375. Several critical ambiguousness or bugs in the standard are detected during formal verification of the constructed system model. Furthermore, the generated VHDL code and C code of Tsmart-Edola outperform that of the state-of-the-art tools in terms of synthesized gate array resource consumption and binary code size.

Publisher's Version Article Search Video
MACKE: Compositional Analysis of Low-Level Vulnerabilities with Symbolic Execution
Saahil Ognawala, Martín Ochoa, Alexander Pretschner, and Tobias Limmer
(TU Munich, Germany; Singapore University of Technology and Design, Singapore; Siemens, Germany)
Concolic (concrete+symbolic) execution has recently gained popularity as an effective means to uncover non-trivial vulnerabilities in software, such as subtle buffer overflows. However, symbolic execution tools that are designed to optimize statement coverage often fail to cover potentially vulnerable code because of complex system interactions and scalability issues of constraint solvers. In this paper, we present a tool (MACKE) that is based on the modular interactions inferred by static code analysis, which is combined with symbolic execution and directed inter-procedural path exploration. This provides an advantage in terms of statement coverage and ability to uncover more vulnerabilities. Our tool includes a novel feature in the form of interactive vulnerability report generation that helps developers prioritize bug fixing based on severity scores. A demo of our tool is available at

Publisher's Version Article Search Video Info
BovInspector: Automatic Inspection and Repair of Buffer Overflow Vulnerabilities
Fengjuan Gao, Linzhang Wang, and Xuandong Li
(Nanjing University, China)
Buffer overflow is one of the most common types of software vulnerabilities. Various static analysis and dynamic testing techniques have been proposed to detect buffer overflow vulnerabilities. With automatic tool support, static buffer overflow detection technique has been widely used in academia and industry. However, it tends to report too many false positives fundamentally due to the lack of software execution information. Currently, static warnings can only be validated by manual inspection, which significantly limits the practicality of the static analysis. In this paper, we present BovInspector, a tool framework for automatic static buffer overflow warnings inspection and validated bugs repair. Given the program source code and static buffer overflow vulnerability warnings, BovInspector first performs warning reachability analysis. Then, BovInspector executes the source code symbolically under the guidance of reachable warnings. Each reachable warning is validated and classified by checking whether all the path conditions and the buffer overflow constraints can be satisfied simultaneously. For each validated true warning, BovInspector fix it with three predefined strategies. BovInspector is complementary to prior static buffer overflow discovery schemes. Experimental results on real open source programs show that BovInspector can automatically inspect on average of 74.9% of total warnings, and false warnings account for about 25% to 100% (on average of 59.9%) of the total inspected warnings. In addition, the automatically generated patches fix all target vulnerabilities. Further information regarding the implementation and experimental results of BovInspector is available at And a short video for demonstrating the capabilities of BovInspector is now available at

Publisher's Version Article Search Video

Performance, Recommendation, and Analysis

CORRECT: Code Reviewer Recommendation at GitHub for Vendasta Technologies
Mohammad Masudur Rahman, Chanchal K. Roy, Jesse Redl, and Jason A. Collins
(University of Saskatchewan, Canada; Vendasta Technologies, Canada; Google, USA)
Peer code review locates common coding standard violations and simple logical errors in the early phases of software development, and thus, reduces overall cost. Unfortunately, at GitHub, identifying an appropriate code reviewer for a pull request is challenging given that reliable information for reviewer identification is often not readily available. In this paper, we propose a code reviewer recommendation tool-CORRECT-that considers not only the relevant cross-project work experience (e.g., external library experience) of a developer but also her experience in certain specialized technologies (e.g., Google App Engine) associated with a pull request for determining her expertise as a potential code reviewer. We design our tool using client-server architecture, and then package the solution as a Google Chrome plug-in. Once the developer initiates a new pull request at GitHub, our tool automatically analyzes the request, mines two relevant histories, and then returns a ranked list of appropriate code reviewers for the request within the browser's context. Demo:

Publisher's Version Article Search Video Info
ProcessPAIR: A Tool for Automated Performance Analysis and Improvement Recommendation in Software Development
Mushtaq Raza and João Pascoal Faria
(University of Porto, Portugal; INESC TEC, Portugal)
High-maturity software development processes can generate significant amounts of data that can be periodically analyzed to identify performance problems, determine their root causes and devise improvement actions. However, conducting that analysis manually is challenging because of the potentially large amount of data to analyze and the effort and expertise required. In this paper, we present ProcessPAIR, a novel tool designed to help developers analyze their performance data with less effort, by automatically identifying and ranking performance problems and potential root causes, so that subsequent manual analysis for the identification of deeper causes and improvement actions can be properly focused. The analysis is based on performance models defined manually by process experts and calibrated automatically from the performance data of many developers. We also show how ProcessPAIR was successfully applied for the Personal Software Process (PSP). A video about ProcessPAIR is available in

Publisher's Version Article Search
CVExplorer: Identifying Candidate Developers by Mining and Exploring Their Open Source Contributions
Gillian J. Greene and Bernd Fischer
(Stellenbosch University, South Africa)
Open source code contributions contain a large amount of technical skill information about developers, which can help to identify suitable candidates for a particular development job and therefore impact the success of a development team. We develop CVExplorer as a tool to extract, visualize, and explore relevant technical skills data from GitHub, such as languages and libraries used. It allows non-technical users to filter and identify developers according to technical skills demonstrated across all of their open source contributions, in order to support more accurate candidate identification. We demonstrate the usefulness of CVExplorer by using it to recommend candidates for open positions in two companies.

Publisher's Version Article Search
Lightweight Collection and Storage of Software Repository Data with DataRover
Thomas Kowark, Christoph Matthies, Matthias Uflacker, and Hasso Plattner
(HPI, Germany)
The ease of setting up collaboration infrastructures for software engineering projects creates a challenge for researchers that aim to analyze the resulting data. As teams can choose from various available software-as-a-service solutions and can configure them with a few clicks, researchers have to create and maintain multiple implementations for collecting and aggregating the collaboration data in order to perform their analyses across different setups. The DataRover system presented in this paper simplifies this task by only requiring custom source code for API authentication and querying. Data transformation and linkage is performed based on mappings, which users can define based on sample responses through a graphical front end. This allows storing the same input data in formats and databases most suitable for the intended analysis without requiring additional coding. Furthermore, API responses are continuously monitored to detect changes and allow users to update their mappings and data collectors accordingly. A screencast of the described use cases is available at

Publisher's Version Article Search Video Info
Visual Contract Extractor: A Tool for Reverse Engineering Visual Contracts using Dynamic Analysis
Abdullah Alshanqiti, Reiko Heckel, and Timo Kehrer
(University of Leicester, UK; Politecnico di Milano, Italy)
Visual contracts model the operations of classes, components or services by pre- and post-conditions formalised as graph transformation rules. They provide a precise but intuitive notation to test, document and analyse software systems. However, due to their detailed level of specification of data states and transformations, modelling a real application is a complex and error-prone process.
Rather than adopting a top-down modelling approach, we follow a dynamic bottom-up approach to reverse engineer visual contracts from object-oriented programs based on tracing the execution of operations. We developed the Visual Contract Extractor (VCE), a dynamic analysis tool which supports the reverse engineering of visual operation contracts from Java programs.
We explore the main features of the tool using two case studies and discuss usage scenarios ranging from traditional program understanding to novel applications in the field of model-based engineering. A screencast demonstrating the tool is provided at

Publisher's Version Article Search Video
SuperMod: Tool Support for Collaborative Filtered Model-Driven Software Product Line Engineering
Felix Schwägerl and Bernhard Westfechtel
(University of Bayreuth, Germany)
The increase in productivity implied by model-driven software product line engineering is weakened by the complexity exposed to the user having to manage a multi-variant model. Recently, a new paradigm has emerged: filtered software product line engineering transfers the established check-out/modify/commit workflow from version control to variability management, allowing to iteratively develop the multi-variant model in a single-variant view. This paper demonstrates SuperMod, a tool that supports collaborative filtered model-driven product line engineering, implemented for and with the Eclipse Modeling Framework. Concerning variability management, the tool offers capabilities for editing feature models and specifying feature configurations, both being well-known formalisms in product line engineering. Furthermore, collaborative editing of product lines is provided through distributed version control. The accompanying video shows that SuperMod seamlessly integrates into existing tool landscapes, reduces the complexity of multi-variant editing, automates a large part of variability management, and ensures consistency. A tool demonstration video is available here:

Publisher's Version Article Search Video
AnModeler: A Tool for Generating Domain Models from Textual Specifications
Jitendra Singh Thakur and Atul Gupta
(IIITDM Jabalpur, India)
This paper presents AnModeler, a tool for generating analysis models from software requirements specified using use cases. The tool uses the Stanford natural language parser to extract type dependencies (TDs) and parts of speech tags (POS-tags) of sentences from input Use Case Specification (UCS). Then, it identifies sentence structures using a set of rules framed based on Hornby's verb patterns. With the information of the TDs, POS tags, and the identified sentence structures, the tool discovers domain elements, viz.: domain objects (including their attributes and operations) and interactions between them; it consolidates the domain information as a class diagram (as well as a sequence diagram). An experiment conducted on 10 UCSs with two industry experts as subjects showed that the analysis class diagrams generated by AnModeler were remarkably close to those generated by the two industry experts. Being lightweight and easy to use, the tool can also be used to assist students and young developers in acquiring object-oriented domain modeling skills quickly.

Publisher's Version Article Search
SimilarTech: Automatically Recommend Analogical Libraries across Different Programming Languages
Chunyang Chen and Zhenchang Xing
(Nanyang Technological University, Singapore)
Third-party libraries are an integral part of many software projects. It often happens that developers need to find analogical libraries that can provide comparable features to the libraries they are already familiar with. Existing methods to find analogical libraries are limited by the community-curated list of libraries, blogs, or Q&A posts, which often contain overwhelming or out-of-date information. This paper presents our tool SimilarTech ( that makes it possible to automatically recommend analogical libraries by incorporating tag embeddings and domain-specific relational and categorical knowledge mined from Stack Overflow. SimilarTech currently supports recommendation of 6,715 libraries across 6 different programming languages. We release our SimilarTech website for public use. The SimilarTech website attracts more than 2,400 users in the past 6 months. We observe two typical usage patterns of our website in the website visit logs which can satisfy different information needs of developers. The demo video can be seen at

Publisher's Version Article Search Video Info

Testing, Validation, and Verification

TeeVML: Tool Support for Semi-automatic Integration Testing Environment Emulation
Jian Liu, John Grundy, Iman Avazpour, and Mohamed Abdelrazek
(Swinburne University of Technology, Australia; Deakin University, Australia)
Software environment emulation provides a means for simulating an operational environment of a system. This process involves approximation of systems’ external behaviors and their communications with a system to be tested in the environment. Development of such an environment is a tedious task and involves complex low level coding. Model driven engineering is an avenue to raise the level of abstraction beyond programming by specifying solution directly using problem domain concepts. In this paper we propose a novel domain-specific modeling tool to generate complex testing environments. Our tool employs a suite of domain-specific visual modeling languages for modeling emulation environment at a high level of abstraction. These high level specifications are then automatically transformed to runtime environment for application integration testing, boosting development productivity and ease of use. The tool demonstration video can be accessed here:

Publisher's Version Article Search Video Info
The Interactive Verification Debugger: Effective Understanding of Interactive Proof Attempts
Martin Hentschel, Reiner Hähnle, and Richard Bubel
(TU Darmstadt, Germany)
The Symbolic Execution Debugger (SED) is an extension of the Eclipse debug platform for interactive symbolic execution. Like a traditional debugger, the SED can be used to locate the origin of a defect and to increase program understanding. However, as it is based on symbolic execution, all execution paths are explored simultaneously. We demonstrate an extension of the SED called Interactive Verification Debugger (IVD) for inspection and understanding of formal verification attempts. By a number of novel views, the IVD allows to quickly comprehend interactive proof situations and to debug the reasons for a proof attempt that got stuck. It is possible to perform interactive proofs completely from within the IVD. It can be experimentally demonstrated that the IVD is more effective in understanding proof attempts than a conventional prover user interface. A screencast explaining proof attempt inspection with the IVD is available at

Publisher's Version Article Search
Verifying Simulink Stateflow Model: Timed Automata Approach
Yixiao Yang, Yu Jiang, Ming Gu, and Jiaguang Sun
(Tsinghua University, China; University of Illinois at Urbana-Champaign, USA)
Simulink Stateflow is widely used for the model-driven development of software. However, the increasing demand of rigorous verification for safety critical applications brings new challenge to the Simulink Stateflow because of the lack of formal semantics. In this paper, we present STU, a self-contained toolkit to bridge the Simulink Stateflow and a well-defined rigorous verification. The tool translates the Simulink Stateflow into the Uppaal timed automata for verification. Compared to existing work, more advanced and complex modeling features in Stateflow such as the event stack, conditional action and timer are supported. Then, with the strong verification power of Uppaal, we can not only find design defects that are missed by the Simulink Design Verifier, but also check more important temporal properties. The evaluation on artificial examples and real industrial applications demonstrates the effectiveness.

Publisher's Version Article Search
GUICat: GUI Testing as a Service
Lin Cheng, Jialiang Chang, Zijiang Yang, and Chao Wang
(Western Michigan University, USA; University of Southern California, USA)
GUIs are event-driven applications where the flow of the program is determined by user actions such as mouse clicks and key presses. GUI testing is a challenging task not only because of the combinatorial explosion in the number of event sequences, but also because of the difficulty to cover the large number of data values. We propose GUICat, the first cloud based GUI testing framework that simultaneously generates event sequences and data values. It is a white-box GUI testing tool that augments traditional sequence generation techniques with concolic execution. We also propose a cloudbased parallel algorithm for mitigating both event sequence explosion and data value explosion, by distributing the concolic execution tasks over public clouds such as Amazon EC2. We have evaluated the tool on standard GUI testing benchmarks and showed that GUICat significantly outperforms state-of-the-art GUI testing tools. The video demo URL is

Publisher's Version Article Search
An Automated Collaborative Requirements Engineering Tool for Better Validation of Requirements
Nor Aiza Moketar, Massila Kamalrudin, Safiah Sidek, Mark Robinson, and John Grundy
(Technical University of Malaysia Malacca, Malaysia; Fulgent, USA; Deakin University, Australia)
This demo introduces an automated collaborative requirements engineering tool, called TestMEReq, which is used to promote effective communication and collaboration between client-stakeholders and requirements engineers for better requirements validation. Our tool is augmented with real time communication and collaboration support to allow multiple stakeholders to collaboratively validate the same set of requirements. We have conducted a user study focusing on validating requirements using TestMEReq with a few groups of requirements engineers and client stakeholders. The study shows that our automated tool support is able to assist requirements engineers to effectively communicate with client-stakeholders to better validate the requirements virtually in real time. (Demo video:

Publisher's Version Article Search Video
An Extensible Framework for Variable-Precision Data-Flow Analyses in MPS
Tamás Szabó, Simon Alperovich, Markus Voelter, and Sebastian Erdweg
(itemis, Germany; Delft University of Technology, Netherlands; JetBrains, Czechia)
Data-flow analyses are used as part of many software engineering tasks: they are the foundations of program under- standing, refactorings and optimized code generation. Similar to general-purpose languages (GPLs), state-of-the-art domain-specific languages (DSLs) also require sophisticated data-flow analyses. However, as a consequence of the different economies of DSL development and their typically relatively fast evolution, the effort for developing and evolving such analyses must be lowered compared to GPLs. This tension can be resolved with dedicated support for data-flow analyses in language workbenches.
In this tool paper we present MPS-DF, which is the component in the MPS language workbench that supports the definition of data-flow analyses for DSLs. Language developers can define data-flow graph builders declaratively as part of a language definition and compute analysis results efficiently based on these data-flow graphs. MPS-DF is extensible such that it does not compromise the support for language composition in MPS. Additionally, clients of MPS-DF analyses can run the analyses with variable precision thus trading off precision for performance. This allows clients to tailor an analysis to a particular use case.

Publisher's Version Article Search Video Info

Doctoral Symposium

Towards Efficient and Effective Automatic Program Repair
Xuan-Bach D. Le
(Singapore Management University, Singapore)
Automatic Program Repair (APR) has recently been an emerging research area, addressing an important challenge in software engi- neering. APR techniques, if effective and efficient , can greatly help software debugging and maintenance. Recently proposed APR tech- niques can be generally classified into two families, namely search- based and semantics-based APR methods. To produce repairs, search- based APR techniques generate huge populations of possible re- pairs, i.e., search space, and lazily search for the best one among the search space. Semantics-based APR techniques utilize con- straint solving and program synthesis to make search space more tractable, and find those repairs that conform to semantics con- straints extracted via symbolic execution. Despite recent advances in APR, search-based APR still suffers from search space explo- sion problem, while the semantics-based APR could be hindered by limited capability of constraint solving and program synthesis. Furthermore, both APR families may be subject to overfitting, in which generated repairs do not generalize to other test sets. This thesis works towards enhancing both effectiveness and ef- ficiency in order for APR to be practically adopted in foreseeable future. To achieve this goal, other than using test cases as the pri- mary criteria for traversing the search space, we designed a new feature used for a new search-based APR technique to effectively traverse the search space, wherein bug fix history is used to evaluate the quality of repair candidates. We also developed a deductive- reasoning-based repair technique that combines search-based and semantics-based approaches to enhance the repair capability, while ensuring the soundness of generated repairs. We also leveraged machine-learning techniques to build a predictive model that pre- dicts whether an APR technique is effective in fixing particular bugs. In the future, we plan to synergize many existing APR tech- niques, improve our predictive model, and adopt the advances of other fields such as test case generation and program synthesis for APR.

Publisher's Version Article Search Info
Automated Testing and Notification of Mobile App Privacy Leak-Cause Behaviours
Joseph Chan Joo Keng
(Singapore Management University, Singapore)
I describe the design, implementation and evaluation of a novel hybrid static/dynamic analysis system for automatically uncovering and testing for the user-triggered causes and paths of privacy leaks in Android applications (privacy 'leak-causes'). I describe how I plan to further evaluate and demonstrate improvements in accuracy, coverage and testing speed of my hybrid testing approach against other currently available systems. I also show how user privacy is improved by the presentation of information on leak-causes in a field study as privacy notices. I present plans to investigate which of the commonly utilized mobile notification mechanisms is best suited to the presentation of leak-causes, as well as how users may set better privacy control policies with the information provided.

Publisher's Version Article Search
Factoring Requirement Dependencies in Software Requirement Selection using Graphs and Integer Programming
Davoud Mougouei
(Flinders University, Australia)
Software requirement selection is to find a subset of requirements (so-called optimal set) that gives the highest customer value for a release of software while keeping the cost within the budget. Several industrial studies however, have demonstrated that requirements of software projects are intricately interdependent and these interdependencies impact the values of requirements. Furthermore, the strengths of dependency relations among requirements vary in the context of real-world projects. For instance, requirements can be strongly or weakly interdependent. Therefore, it is important to consider both the existence and the strengths of dependency relations during requirement selection. The existing selection models however, have ignored either requirement dependencies altogether or the strengths of those dependencies. This research proposes an Integer programming model for requirement selection which considers both the existence and strengths of requirement dependencies. We further contribute a graph-based dependency modeling technique for capturing requirement dependencies and the their corresponding strengths. Automated/semi-automated techniques will also be devised to identify requirement dependencies and the strengths of those dependencies.

Publisher's Version Article Search
Statistical Analysis of Large Sets of Models
Önder Babur
(Eindhoven University of Technology, Netherlands)
Many applications in Model-Driven Engineering involve processing multiple models, e.g. for comparing and merging of model variants into a common domain model. Despite many sophisticated techniques for model comparison, little attention has been given to the initial data analysis and filtering activities. These are hard to ignore especially in the case of a large dataset, possibly with outliers and sub-groupings. We would like to develop a generic approach for model comparison and analysis for large datasets; using techniques from information retrieval, natural language processing and machine learning. We are implementing our approach as an open framework and have so far evaluated it on public datasets involving domain analysis, repository management and model searching scenarios.

Publisher's Version Article Search
Developer Targeted Analytics: Supporting Software Development Decisions with Runtime Information
Jürgen Cito
(University of Zurich, Switzerland)
Runtime information of deployed software has been used by business and operations units to make informed decisions under the term ``analytics". However, decisions made by software engineers in the course of evolving software have, for the most part, been based on personal belief and gut-feeling. This could be attributed to software development being, for the longest time, viewed as an activity that is detached from the notion of operating software in a production environment. In recent years, this view has been challenged with the emergence of the DevOps movement, which aim is to promote cross-functional capabilities of development and operations activities within teams. This shift in process and mindset requires analytics tools that specifically target software developers. In this research, I investigate approaches to support developers in their decision-making by incorporating runtime information in source code and provide live feedback in IDEs by predicting the impact of code changes.

Publisher's Version Article Search
API Recommendation System for Software Development
Ferdian Thung
(Singapore Management University, Singapore)
Nowadays, software developers often utilize existing third party libraries and make use of Application Programming Interface (API) to develop a software. However, it is not always obvious which library to use or whether the chosen library will play well with other libraries in the system. Furthermore, developers need to spend some time to understand the API to the point that they can freely use the API methods and putting the right parameters inside them. In this work, I plan to automatically recommend relevant APIs to developers. This API recommendation can be divided into multiple stages. First, we can recommend relevant libraries provided a given task to complete. Second, we can recommend relevant API methods that developer can use to program the required task. Third, we can recommend correct parameters for a given method according to its context. Last but not least, we can recommend how different API methods can be combined to achieve a given task.
In effort to realize this API recommendation system, I have published two related papers. The first one deals with recommending additional relevant API libraries given known useful API libraries for the target program. This system can achieve recall rate@5 of 0.852 and recall rate@10 of 0.894 in recommending additional relevant libraries. The second one deals with recommending relevant API methods a given target API and a textual description of the task. This system can achieve recall-rate@5 of 0.690 and recallrate@10 of 0.779. The results for both system indicate that the systems are useful and capable in recommending the right API/library reasonably well. Currently, I am working on another system which can recommend web APIs (i.e., libraries) given a description of the task. I am also working on a system that recommends correct parameters given an API method. In the future, I also plan to realize API composition recommendation for the given task

Publisher's Version Article Search

proc time: 3.22