FSE 2016
24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE 2016)

24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE 2016), November 13–18, 2016, Seattle, WA, USA

FSE 2016 – Proceedings

Twitter: https://twitter.com/FSEconf

Frontmatter

Title Page


Message from the Chairs
On behalf of the entire conference committee, it is our great pleasure to welcome you to Seattle for the 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE 2016). ACM SIGSOFT FSE is one of the premier forums that bring together researchers, practitioners, and educators to present and discuss the most recent innovations, trends, visions, experiences, and challenges in software engineering.

FSE 2016 Summary of Co-located Workshops
The 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, FSE 2016, hosted eight international workshops. The intent of the workshops was to provide opportunities for exchanging views, advancing ideas, and discussing preliminary results in various areas of software engineering research and applications. The large number of workshop proposals, and the resulting large number of co-located workshops, are testaments to the energy and excitement within the field of software engineering: to the growth of new ideas, to collaboration between researchers and industrial practitioners, and to the education of new contributors.

FSE 2016 Organization
Committee Listings

FSE 2016 Sponsors and Supporters
Sponsors and Supporters


Keynotes

"Womenomics" and Gender-Inclusive Software: What Software Engineers Need to Know (Invited Talk)
Margaret Burnett
(Oregon State University, USA)
This short paper is a summary of my keynote at FSE’16, with accompanying references for follow-up.

Building a Socio-Technical Theory of Coordination: Why and How (Outstanding Research Award)
James Herbsleb
(Carnegie Mellon University, USA)
Research aimed at understanding and addressing coordination breakdowns experienced in global software development (GSD) projects at Lucent Technologies took a path from open-ended qualitative exploratory studies to quantitative studies with a tight focus on a key problem – delay – and its causes. Rather than being directly associated with delay, multi-site work items involved more people than comparable same-site work items, and the number of people was a powerful predictor of delay. To counteract this, we developed and deployed tools and practices to support more effective communication and expertise location. After conducting two case studies of open source development, an extreme form of GSD, we realized that many tools and practices could be effective for multi-site work, but none seemed to work under all conditions. To achieve deeper insight, we developed and tested our Socio-Technical Theory of Coordination (STTC) in which the dependencies among engineering decisions are seen as defining a constraint satisfaction problem that the organization can solve in a variety of ways. I conclude by explaining how we applied these ideas to transparent development environments, then sketch important open research questions.

Correct or Usable? The Limits of Traditional Verification (Impact Paper Award)
Daniel Jackson and Mandana Vaziri
(Massachusetts Institute of Technology, USA; IBM, USA)
Since our work on verification sixteen years ago, our views of the role of verification, and the centrality of correctness, have evolved. In our presentation, we’ll talk about some of our concerns about the limitations of this kind of technology, including: usability as a key factor; the unknowable properties of the environment; and the inadequacy of specifications as a means of capturing users’ desires. We’ll describe two approaches we’re currently working on to mitigate these concerns — (1) moving to higher level abstractions with correctness by construction and (2) focusing on the conceptual structure of applications — and will argue that, combined with traditional verification tools, these offer the possibility of applications that are both usable and correct.


Showcases
Wed, Nov 16, 08:30 - 10:30, Emerald Ballroom (Chair: Jo Atlee, Gail Murphy)

Continuous Deployment of Mobile Software at Facebook (Showcase)
Chuck Rossi, Elisa Shibley, Shi Su, Kent Beck, Tony Savor, and Michael Stumm
(Facebook, USA; University of Michigan, USA; Carnegie Mellon University, USA; University of Toronto, Canada)
Continuous deployment is the practice of releasing software updates to production as soon as they are ready, and it is seeing increasing adoption in industry. The frequency of updates of mobile software has traditionally lagged the state of practice for cloud-based services for a number of reasons. Mobile versions can only be released periodically. Users can choose when and if to upgrade, which means that several different releases coexist in production. There are hundreds of Android hardware variants, which increases the risk of errors in the deployed software.
Facebook has made significant progress in increasing the frequency of its mobile deployments. Over a period of 4 years, the Android release has gone from a deployment every 8 weeks to a deployment every week. In this paper, we describe in detail the mobile deployment process at FB. We present our findings from an extensive analysis of software engineering metrics based on data collected over a period of 7 years. A key finding is that the frequency of deployment does not directly affect developer productivity or software quality. We argue that this finding is due to the fact that increasing the frequency of continuous deployment forces improved release and deployment automation, which in turn reduces developer workload. Additionally, the data we present shows that dog-fooding and obtaining feedback from alpha and beta customers is critical to maintaining release quality.

Model, Execute, and Deploy: Answering the Hard Questions in End-User Programming (Showcase)
Shan Shan Huang
(LogicBlox, USA)
End-user programming, a frequently recurring dream, has thus far eluded large-scale, complex applications. Very real, hard questions stand in the way of its realization. How can its languages and tools support: (1) The development of applications with large data sets and sophisticated computation? (2) The co-development by end-users and professional developers when the complexity of an application demands it? (3) Beyond development, the maintenance, distribution, monitoring, and integration with other applications and services? We discuss our approach to these questions, as implemented in the LogicBlox Modeler. We discuss its use in developing applications for governments, major financial institutions, and large global retailers. We highlight the essential synergies between Programming Languages, Software Engineering, and Database research to achieve self-service at scale, and present open questions to which we look to the FSE community for inspiration and solutions.

Making Invisible Things Visible: Tracking Down Known Vulnerabilities at 3000 Companies (Showcase)
Gazi Mahmud
(Sonatype, USA)
This year, software development teams around the world are consuming billions of open source and third-party components. The good news: they are accelerating time to market. The bad news: 1 in 17 components they use includes known security vulnerabilities. In this talk, I will describe what Sonatype, the company behind The Central Repository that supports Apache Maven, has learned from analyzing how thousands of applications use open source components. I will also discuss how organizations like Mayo Clinic, Exxon, Capital One, the U.S. FDA and Intuit are utilizing the principles of software supply chain automation to improve application security and how organizations can balance the need for speed with quality and security early in the development cycle.

Developer Workflow at Google (Showcase)
Caitlin Sadowski
(Google, USA)
This talk describes the developer workflow at Google, and our use of program analysis, testing, metrics, and tooling to reduce errors when creating and committing changes to source code. Software development at Google has several unique characteristics such as our monolithic codebase and distributed hermetic build system. Changes are vetted both manually, via our internal code review tool, and automatically, via sources such as the Tricorder program analysis platform and our automated testing infrastructure.


Research Papers

Session 1: Specification
Tue, Nov 15, 11:00 - 12:30, Emerald 1 (Chair: Mike Whalen)

Titanium: Efficient Analysis of Evolving Alloy Specifications
Hamid Bagheri and Sam Malek
(University of Nebraska-Lincoln, USA; University of California at Irvine, USA)
The Alloy specification language, and the corresponding Alloy Analyzer, have received much attention in the last two decades with applications in many areas of software engineering. Increasingly, formal analyses enabled by Alloy are desired for use in an on-line mode, where the specifications are automatically kept in sync with the running, possibly changing, software system. However, given Alloy Analyzer's reliance on computationally expensive SAT solvers, an important challenge is the time it takes for such analyses to execute at runtime. The fact that in an on-line mode, the analyses are often repeated on slightly revised versions of a given specification, presents us with an opportunity to tackle this challenge. We present Titanium, an extension of Alloy for formal analysis of evolving specifications. By leveraging the results from previous analyses, Titanium narrows the state space of the revised specification, thereby greatly reducing the required computational effort. We describe the semantic basis of Titanium in terms of models specified in relational logic. We show how the approach can be realized atop an existing relational logic model finder. Our experimental results show Titanium achieves a significant speed-up over Alloy Analyzer when applied to the analysis of evolving specifications.

Mining Performance Specifications
Marc Brünink and David S. Rosenblum
(National University of Singapore, Singapore)
Functional testing is widespread and supported by a multitude of tools, including tools to mine functional specifications. In contrast, non-functional attributes like performance are often less well understood and tested. While many profiling tools are available to gather raw performance data, interpreting this raw data requires expert knowledge and a thorough understanding of the underlying software and hardware infrastructure. In this work we present an approach that mines performance specifications from running systems autonomously. The tool creates performance models during runtime. The mined models are analyzed further to create compact and comprehensive performance assertions. The resulting assertions can be used as an evidence-based performance specification for performance regression testing, performance monitoring, or as a foundation for more formal performance specifications.
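To make the idea of a mined performance assertion concrete, here is a small illustrative sketch (the percentile threshold, slack factor, and function names are assumptions for illustration, not the paper's mining algorithm): latency samples collected at runtime are summarized into a bound that can later be checked in regression tests.

```python
# Illustrative sketch: turn raw latency samples collected at runtime into a
# compact, checkable performance assertion ("99% of calls finish within X ms").
# The percentile, slack, and function names are assumptions, not the paper's algorithm.

def mine_assertion(samples_ms, percentile=0.99, slack=1.2):
    """Derive an upper-bound assertion from observed latencies."""
    ordered = sorted(samples_ms)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] * slack  # slack tolerates measurement noise

def holds(samples_ms, bound_ms, allowed_violations=0.01):
    """Use the mined bound for performance regression testing."""
    violations = sum(1 for s in samples_ms if s > bound_ms)
    return violations / len(samples_ms) <= allowed_violations

baseline = [12.0, 13.5, 11.8, 12.2, 30.1, 12.9, 13.0, 12.4]   # mined from a baseline run
bound = mine_assertion(baseline)
print(f"assert: 99% of calls complete within {bound:.1f} ms")
print("new run satisfies assertion:", holds([12.5, 13.1, 12.8, 14.0], bound))
```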

Designing Minimal Effective Normative Systems with the Help of Lightweight Formal Methods
Jianye Hao, Eunsuk Kang, Jun Sun, and Daniel Jackson
(Tianjin University, China; University of California at Berkeley, USA; Singapore University of Technology and Design, Singapore; Massachusetts Institute of Technology, USA)
Normative systems (i.e., a set of rules) are an important approach to achieving effective coordination among (often an arbitrary number of) agents in multiagent systems. A normative system should be effective in ensuring the satisfaction of a desirable system property, and minimal (i.e., not containing norms that unnecessarily over-constrain the behaviors of agents). Designing or even automatically synthesizing minimal effective normative systems is highly non-trivial. Previous attempts at synthesizing such systems through simulations often fail to generate normative systems which are both minimal and effective. In this work, we propose a framework that facilitates the design of minimal effective normative systems using lightweight formal methods. Since a minimal effective normative system that coordinates many agents must also be minimal and effective for a small number of agents, we start by automatically synthesizing such a system for a few agents. We then increase the number of agents so as to check whether the same design remains minimal and effective. If it is, we manually establish an induction proof so as to lift the design to an arbitrary number of agents.

Proteus: Computing Disjunctive Loop Summary via Path Dependency Analysis
Xiaofei Xie, Bihuan Chen, Yang Liu, Wei Le, and Xiaohong Li
(Tianjin University, China; Nanyang Technological University, Singapore; Iowa State University, USA)
Loops are challenging structures for program analysis, especially when loops contain multiple paths with complex interleaving executions among these paths. In this paper, we first propose a classification of multi-path loops to understand the complexity of the loop execution, which is based on the variable updates on the loop conditions and the execution order of the loop paths. Secondly, we propose a loop analysis framework, named Proteus, which takes a loop program and a set of variables of interest as inputs and summarizes path-sensitive loop effects on the variables. The key contribution is to use a path dependency automaton (PDA) to capture the execution dependency between the paths. A DFS-based algorithm is proposed to traverse the PDA to summarize the effect for all feasible executions in the loop. The experimental results show that Proteus is effective in three applications: Proteus can 1) compute a more precise bound than the existing loop bound analysis techniques; 2) significantly outperform state-of-the-art tools for loop verification; and 3) generate test cases for deep loops within one second, while KLEE and Pex either need much more time or fail.

Session 2: HCI and Process
Tue, Nov 15, 11:00 - 12:30, Emerald 2 (Chair: Peri Tarr)

A Cross-Tool Communication Study on Program Analysis Tool Notifications
Brittany Johnson, Rahul Pandita, Justin Smith, Denae Ford, Sarah Elder, Emerson Murphy-Hill, Sarah Heckman, and Caitlin Sadowski
(North Carolina State University, USA; Google, USA)
Program analysis tools use notifications to communicate with developers, but previous research suggests that developers encounter challenges that impede this communication. This paper describes a qualitative study that identifies 10 kinds of challenges that cause notifications to miscommunicate with developers. Our resulting notification communication theory reveals that many challenges span multiple tools and multiple levels of developer experience. Our results suggest that, for example, future tools that model developer experience could improve communication and help developers build more accurate mental models.

Factors Influencing Code Review Processes in Industry
Tobias Baum, Olga Liskin, Kai Niklas, and Kurt Schneider
(Leibniz Universität Hannover, Germany)
Code review is known to be an efficient quality assurance technique. Many software companies today use it, usually with a process similar to the patch review process in open source software development. However, there is still a large fraction of companies performing almost no code reviews at all. And the companies that do code reviews have a lot of variation in the details of their processes. For researchers trying to improve the use of code reviews in industry, it is important to know the reasons for these process variations. We have performed a grounded theory study to clarify process variations and their rationales. The study is based on interviews with software development professionals from 19 companies. These interviews provided insights into the reasons and influencing factors behind the adoption or non-adoption of code reviews as a whole as well as for different process variations. We have condensed these findings into seven hypotheses and a classification of the influencing factors. Our results show the importance of cultural and social issues for review adoption. They trace many process variations to differences in development context and in desired review effects.

Foraging and Navigations, Fundamentally: Developers' Predictions of Value and Cost
David Piorkowski, Austin Z. Henley, Tahmid Nabi, Scott D. Fleming, Christopher Scaffidi, and Margaret Burnett
(Oregon State University, USA; University of Memphis, USA)
Empirical studies have revealed that software developers spend 35%–50% of their time navigating through source code during development activities, yet fundamental questions remain: Are these percentages too high, or simply inherent in the nature of software development? Are there factors that somehow determine a lower bound on how effectively developers can navigate a given information space? Answering questions like these requires a theory that captures the core of developers' navigation decisions. Therefore, we use the central proposition of Information Foraging Theory to investigate developers' ability to predict the value and cost of their navigation decisions. Our results showed that over 50% of developers' navigation choices produced less value than they had predicted and nearly 40% cost more than they had predicted. We used those results to guide a literature analysis, to investigate the extent to which these challenges are met by current research efforts, revealing a new area of inquiry with a rich and crosscutting set of research challenges and open problems.

How to Break an API: Cost Negotiation and Community Values in Three Software Ecosystems
Christopher Bogart, Christian Kästner, James Herbsleb, and Ferdian Thung
(Carnegie Mellon University, USA; Singapore Management University, Singapore)
Change introduces conflict into software ecosystems: breaking changes may ripple through the ecosystem and trigger rework for users of a package, but often developers can invest additional effort or accept opportunity costs to alleviate or delay downstream costs. We performed a multiple case study of three software ecosystems with different tooling and philosophies toward change, Eclipse, R/CRAN, and Node.js/npm, to understand how developers make decisions about change and change-related costs and what practices, tooling, and policies are used. We found that all three ecosystems differ substantially in their practices and expectations toward change and that those differences can be explained largely by different community values in each ecosystem. Our results illustrate that there is a large design space in how to build an ecosystem, its policies and its supporting infrastructure; and there is value in making community values and accepted tradeoffs explicit and transparent in order to resolve conflicts and negotiate change-related costs.


Session 3: Bug Detection and Debugging
Tue, Nov 15, 11:00 - 12:30, Emerald 3 (Chair: Tingting Yu)

Python Predictive Analysis for Bug Detection
Zhaogui Xu, Peng Liu, Xiangyu Zhang, and Baowen Xu
(Nanjing University, China; Purdue University, USA)
Python is a popular dynamic language that allows quick software development. However, Python program analysis engines are largely lacking. In this paper, we present a Python predictive analysis. It first collects the trace of an execution, and then encodes the trace and unexecuted branches as symbolic constraints. Symbolic variables are introduced to denote input values, their dynamic types, and attribute sets, to reason about their variations. Solving the constraints identifies bugs and their triggering inputs. Our evaluation shows that the technique is highly effective in analyzing real-world complex programs with many dynamic features and external library calls, due to its sophisticated trace-based encoding design. It identifies 46 bugs in 11 real-world projects, 16 of them new. All reported bugs are true positives.

Crash Consistency Validation Made Easy
Yanyan Jiang, Haicheng Chen, Feng Qin, Chang Xu, Xiaoxing Ma, and Jian Lu
(Nanjing University, China; Ohio State University, USA)
Software should behave correctly even in adverse conditions. In particular, we study the problem of automated validation of crash consistency, i.e., file system data safety when systems crash. Existing work requires non-trivial manual effort to specify checking scripts and workloads, which is an obstacle for software developers. Therefore, we propose C3, a novel approach that makes crash consistency validation as easy as pressing a single button. With a program and an input, C3 automatically reports inconsistent crash sites. C3 not only exempts developers from the need to write crash-site checking scripts (by an algorithm that computes the editing distance between file system snapshots) but also reduces the reliance on dedicated workloads (by test amplification). We implemented C3 as an open-source tool. With C3, we found 14 bugs in open-source software that have severe consequences upon a crash; 11 of them were previously unknown to the developers, including bugs in highly mature software (e.g., GNU zip and GNU coreutils sort) and in popular software under active development (e.g., Adobe Brackets and TeXstudio).
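A minimal sketch of the snapshot-distance idea follows (the snapshot representation, distance function, and tolerance are assumptions; C3's actual algorithm and checking logic are more involved):

```python
# Minimal sketch of comparing file-system snapshots by an editing distance.
# A snapshot is modeled as {path: content}; the distance counts added, removed,
# and modified files (with a cheap line-level diff for modifications).
import difflib

def snapshot_distance(before, after):
    paths = set(before) | set(after)
    dist = 0.0
    for p in paths:
        if p not in before or p not in after:
            dist += 1  # file created or lost
        elif before[p] != after[p]:
            ratio = difflib.SequenceMatcher(None, before[p], after[p]).ratio()
            dist += 1 - ratio  # partial credit for similar content
    return dist

def looks_inconsistent(crash_state, expected_states, tolerance=0.5):
    """Flag a crash site if its state is far from every acceptable state."""
    return min(snapshot_distance(crash_state, s) for s in expected_states) > tolerance

pre  = {"data.txt": "old"}
post = {"data.txt": "new", "data.txt.tmp": "new"}   # e.g., rename not yet applied
crash = {"data.txt": ""}                            # truncated file after crash
print(looks_inconsistent(crash, [pre, post]))       # True: matches neither state
```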

Discovering Bug Patterns in JavaScript
Quinn Hanam, Fernando S. de M. Brito, and Ali Mesbah
(University of British Columbia, Canada; Federal University of Paraíba, Brazil)
JavaScript has become the most popular language used by developers for client and server side programming. The language, however, still lacks proper support in the form of warnings about potential bugs in the code. Most bug finding tools in use today cover bug patterns that are discovered by reading best practices or through developer intuition and anecdotal observation. As such, it is still unclear which bugs happen frequently in practice and which are important for developers to fix. We propose a novel semi-automatic technique, called BugAID, for discovering the most prevalent and detectable bug patterns. BugAID is based on unsupervised machine learning using language-construct-based changes distilled from AST differencing of bug fixes in the code. We present a large-scale study of common bug patterns by mining 105K commits from 134 server-side JavaScript projects. We discover 219 bug fixing change types and discuss 13 pervasive bug patterns that occur across multiple projects and can likely be prevented with better tool support. Our findings are useful for improving tools and techniques to prevent common bugs in JavaScript, guiding tool integration for IDEs, and making developers aware of common mistakes involved with programming in JavaScript.

Effort-Aware Just-in-Time Defect Prediction: Simple Unsupervised Models Could Be Better Than Supervised Models
Yibiao Yang, Yuming Zhou, Jinping Liu, Yangyang Zhao, Hongmin Lu, Lei Xu, Baowen Xu, and Hareton Leung
(Nanjing University, China; Hong Kong Polytechnic University, China)
Unsupervised models do not require defect data to build the prediction models and hence have a low building cost and a wide application range. Consequently, it would be more desirable for practitioners to apply unsupervised models in effort-aware just-in-time (JIT) defect prediction if they can predict defect-inducing changes well. However, little is currently known about their prediction effectiveness in this context. We aim to investigate the predictive power of simple unsupervised models in effort-aware JIT defect prediction, especially compared with the state-of-the-art supervised models in the recent literature. We first use the most commonly used change metrics to build simple unsupervised models. Then, we compare these unsupervised models with the state-of-the-art supervised models under cross-validation, time-wise cross-validation, and across-project prediction settings to determine whether they are of practical value. The experimental results, from open-source software systems, show that many simple unsupervised models perform better than the state-of-the-art supervised models in effort-aware JIT defect prediction.
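As a rough illustration of what a "simple unsupervised model" can look like in effort-aware JIT defect prediction (the metric choice, data layout, and 20% effort cutoff are assumptions, not the paper's exact setup): changes are ranked by a single change metric, without any defect labels, and evaluated by how many defect-inducing changes fall within a fixed inspection effort.

```python
# Sketch: an unsupervised effort-aware ranking for just-in-time defect prediction.
# Changes are ranked by ascending size (smaller changes first), so that more
# changes can be inspected for the same effort; no defect labels are needed
# to build the ranking, only to evaluate it afterwards.

def rank_by_size(changes):
    # changes: list of dicts with 'loc' (lines touched) and 'buggy' (ground truth)
    return sorted(changes, key=lambda c: c["loc"])

def recall_at_effort(ranked, effort_fraction=0.2):
    total_loc = sum(c["loc"] for c in ranked)
    total_buggy = sum(c["buggy"] for c in ranked)
    budget, spent, found = effort_fraction * total_loc, 0, 0
    for c in ranked:
        if spent + c["loc"] > budget:
            break
        spent += c["loc"]
        found += c["buggy"]
    return found / total_buggy if total_buggy else 0.0

changes = [
    {"loc": 10, "buggy": 1}, {"loc": 500, "buggy": 1},
    {"loc": 25, "buggy": 0}, {"loc": 40, "buggy": 1}, {"loc": 300, "buggy": 0},
]
print(recall_at_effort(rank_by_size(changes)))  # defects found within 20% of effort
```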

Session 4: Security and Privacy
Tue, Nov 15, 14:00 - 15:30, Emerald 1 (Chair: Diomidis Spinellis)

Detecting Sensitive Data Disclosure via Bi-directional Text Correlation Analysis
Jianjun Huang, Xiangyu Zhang, and Lin Tan
(Purdue University, USA; University of Waterloo, Canada)
Traditional sensitive data disclosure analysis faces two challenges: to identify sensitive data that is not generated by specific API calls, and to report potential disclosures when the disclosed data is recognized as sensitive only after the sink operations. We address these issues by developing BidText, a novel static technique to detect sensitive data disclosures. BidText formulates the problem as a type system, in which variables are typed with the text labels that they encounter (e.g., during key-value pair operations). The type system features a novel bi-directional propagation technique that propagates the variable label sets through forward and backward data-flow. A data disclosure is reported if a parameter at a sink point is typed with a sensitive text label. BidText is evaluated on 10,000 Android apps. It reports 4,406 apps that have sensitive data disclosures, with 4,263 apps having log-based disclosures and 1,688 having disclosures due to other sinks such as HTTP requests. Existing techniques can only report 64.0% of what BidText reports, and manual inspection shows that the false positive rate for BidText is 10%.
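The bi-directional propagation idea can be sketched in miniature (the graph encoding, worklist, and label names below are illustrative assumptions, not BidText's type system): text labels attached to some variables flow both forward and backward along data-flow edges until a fixpoint, and a sink is flagged if a sensitive label reaches it.

```python
# Toy bi-directional label propagation over a data-flow graph.
# Nodes are variables; an edge (a, b) means b is derived from a. Labels propagate
# forward (source to derived values) and backward (a label learned downstream,
# e.g. from the key of a key-value pair, also flows back to the origin).

def propagate(edges, initial_labels):
    labels = {n: set(ls) for n, ls in initial_labels.items()}
    fwd = edges
    bwd = [(b, a) for a, b in edges]
    changed = True
    while changed:
        changed = False
        for a, b in fwd + bwd:
            new = labels.get(a, set()) - labels.get(b, set())
            if new:
                labels.setdefault(b, set()).update(new)
                changed = True
    return labels

edges = [("imei", "value"), ("value", "request_param")]
# The label "device id" is only discovered downstream of the original variable,
# e.g. at the key string that 'value' is stored with.
labels = propagate(edges, {"value": {"device id"}})
SENSITIVE = {"device id", "password", "location"}
print("disclosure at sink:", bool(labels["request_param"] & SENSITIVE))  # True
```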

Multi-representational Security Analysis
Eunsuk Kang, Aleksandar Milicevic, and Daniel Jackson
(University of California at Berkeley, USA; Microsoft, USA; Massachusetts Institute of Technology, USA)
Security attacks often exploit flaws that are not anticipated in an abstract design, but are introduced inadvertently when high-level interactions in the design are mapped to low-level behaviors in the supporting platform. This paper proposes a multi-representational approach to security analysis, where models capturing distinct (but possibly overlapping) views of a system are automatically composed in order to enable an end-to-end analysis. This approach allows the designer to incrementally explore the impact of design decisions on security, and discover attacks that span multiple layers of the system. This paper describes Poirot, a prototype implementation of the approach, and reports on our experience on applying Poirot to detect previously unknown security flaws in publicly deployed systems.

String Analysis for Side Channels with Segmented Oracles
Lucas Bang, Abdulbaki Aydin, Quoc-Sang Phan, Corina S. Păsăreanu, and Tevfik Bultan
(University of California at Santa Barbara, USA; Carnegie Mellon Silicon Valley, USA; NASA Ames Research Center, USA)
We present an automated approach for detecting and quantifying side channels in Java programs, which uses symbolic execution, string analysis and model counting to compute information leakage for a single run of a program. We further extend this approach to compute information leakage for multiple runs for a type of side channels called segmented oracles, where the attacker is able to explore each segment of a secret (for example each character of a password) independently. We present an efficient technique for segmented oracles that computes information leakage for multiple runs using only the path constraints generated from a single run symbolic execution. Our implementation uses the symbolic execution tool Symbolic PathFinder (SPF), SMT solver Z3, and two model counting constraint solvers LattE and ABC. Although LattE has been used before for analyzing numeric constraints, in this paper, we present an approach for using LattE for analyzing string constraints. We also extend the string constraint solver ABC for analysis of both numeric and string constraints, and we integrate ABC in SPF, enabling quantitative symbolic string analysis.
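To see why segmented oracles are so damaging, a back-of-the-envelope comparison suffices (this is only an illustration of the threat model; the paper's approach uses symbolic execution and model counting rather than the closed-form estimate below):

```python
# Illustration of why segmented oracles leak so much: if a side channel reveals
# the length of the correct prefix, each segment of the secret can be attacked
# independently, so the attacker's work grows linearly rather than exponentially.
import math

def leakage_estimate(alphabet_size, secret_length, segmented):
    total_entropy = secret_length * math.log2(alphabet_size)
    if segmented:
        # each segment is recovered with at most `alphabet_size` trials
        worst_case_trials = secret_length * alphabet_size
    else:
        worst_case_trials = alphabet_size ** secret_length
    return total_entropy, worst_case_trials

entropy, trials = leakage_estimate(alphabet_size=26, secret_length=8, segmented=True)
print(f"secret entropy: {entropy:.1f} bits, worst-case trials (segmented): {trials}")
_, trials_flat = leakage_estimate(26, 8, segmented=False)
print(f"worst-case trials (non-segmented oracle): {trials_flat}")
```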

WebRanz: Web Page Randomization for Better Advertisement Delivery and Web-Bot Prevention
Weihang Wang, Yunhui Zheng, Xinyu Xing, Yonghwi Kwon, Xiangyu Zhang, and Patrick Eugster
(Purdue University, USA; IBM Research, USA; Pennsylvania State University, USA; TU Darmstadt, Germany)
Nowadays, a rapidly increasing number of web users are using Ad-blockers to block online advertisements. Ad-blockers are browser-based software that can block most Ads on websites, speeding up web browsers and saving bandwidth. Despite these benefits to end users, Ad-blockers could be catastrophic for the economic structure underlying the web, especially considering the rise of Ad-blocking as well as the number of technologies and services that rely exclusively on Ads to cover their costs. In this paper, we introduce WebRanz, which utilizes a randomization mechanism to circumvent Ad-blocking. Using WebRanz, content publishers can constantly mutate the internal HTML elements and element attributes of their web pages, without affecting their visual appearance and functionality. Randomization invalidates the pre-defined patterns that Ad-blockers use to filter out Ads. Though the design of WebRanz is motivated by evading Ad-blockers, WebRanz also benefits the defense against bot scripts. We evaluate the effectiveness of WebRanz and its overhead using 221 randomly sampled top Alexa web pages and 8 representative bot scripts.
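A much-simplified sketch of the randomization idea (the helper functions and regular-expression matching are assumptions, not the WebRanz implementation): id and class attributes that filter lists match on are consistently renamed, and the same mapping is applied to the CSS so the page renders unchanged.

```python
# Simplified sketch of HTML attribute randomization: rewrite id/class names with
# fresh random tokens, applying the same mapping to the CSS selectors so the page
# still renders identically. Real pages also need JavaScript references rewritten.
import random, re, string

def fresh_name(mapping, old):
    if old not in mapping:
        mapping[old] = "r" + "".join(random.choices(string.ascii_lowercase, k=8))
    return mapping[old]

def randomize(html, css):
    mapping = {}
    def repl(match):
        return f'{match.group(1)}="{fresh_name(mapping, match.group(2))}"'
    html = re.sub(r'\b(id|class)="([^"]+)"', repl, html)
    for old, new in mapping.items():
        css = css.replace("#" + old, "#" + new).replace("." + old, "." + new)
    return html, css

html = '<div id="ad-banner" class="sponsored">...</div>'
css = '#ad-banner { width: 728px } .sponsored { border: none }'
print(randomize(html, css))  # the "ad-banner"/"sponsored" patterns no longer match
```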


Session 5: Adaptation and Change
Tue, Nov 15, 14:00 - 15:30, Emerald 2 (Chair: Harald Gall)

A Discrete-Time Feedback Controller for Containerized Cloud Applications
Luciano Baresi, Sam Guinea, Alberto Leva, and Giovanni Quattrocchi
(Politecnico di Milano, Italy)
Modern Web applications exploit Cloud infrastructures to scale their resources and cope with sudden changes in the workload. While the state of practice is to focus on dynamically adding and removing virtual machines, we advocate that there are strong benefits in containerizing the applications and in scaling the containers.
In this paper we present an autoscaling technique that allows containerized applications to scale their resources both at the VM level and at the container level. Furthermore, applications can combine this infrastructural adaptation with platform-level adaptation. The autoscaling is made possible by our planner, which consists of a grey-box discrete-time feedback controller.
The work has been validated using two application benchmarks deployed to Amazon EC2. Our experiments show that our planner outperforms Amazon's AutoScaling by 78% on average without containers; and that the introduction of containers allows us to improve by yet another 46% on average.
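For intuition, a discrete-time feedback loop for container autoscaling might look like the following sketch (a plain proportional-integral controller with made-up gains and a toy application model, not the paper's grey-box planner):

```python
# Minimal discrete-time PI controller: at each control period it compares the
# measured response time with the setpoint and adjusts the number of containers.
# Gains, setpoint, and the simulated application model are illustrative only.

class PIController:
    def __init__(self, setpoint_ms, kp=0.02, ki=0.01, min_c=1, max_c=20):
        self.setpoint, self.kp, self.ki = setpoint_ms, kp, ki
        self.min_c, self.max_c = min_c, max_c
        self.integral = 0.0

    def next_allocation(self, measured_ms, current_containers):
        error = measured_ms - self.setpoint          # positive: too slow
        self.integral += error
        delta = self.kp * error + self.ki * self.integral
        target = round(current_containers + delta)
        return max(self.min_c, min(self.max_c, target))

# Tiny simulation: response time drops as containers are added.
def app_response_ms(load_rps, containers):
    return 40 + load_rps / containers

ctrl, containers = PIController(setpoint_ms=100), 2
for step, load in enumerate([200, 400, 800, 800, 800]):
    rt = app_response_ms(load, containers)
    containers = ctrl.next_allocation(rt, containers)
    print(f"step {step}: load={load} rps, rt={rt:.0f} ms, containers -> {containers}")
```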

Keep It SIMPLEX: Satisfying Multiple Goals with Guarantees in Control-Based Self-Adaptive Systems
Stepan Shevtsov and Danny Weyns
(Linnaeus University, Sweden; KU Leuven, Belgium)
An increasingly important concern of software engineers is handling uncertainties at design time, such as environment dynamics that may be difficult to predict or requirements that may change during operation. The idea of self-adaptation is to handle such uncertainties at runtime, when the knowledge becomes available. As more systems with strict requirements require self-adaptation, providing guarantees for adaptation has become a high priority. Providing such guarantees with traditional architecture-based approaches has proven challenging. In response, researchers have studied the application of control theory to realize self-adaptation. However, existing control-theoretic approaches applied to adapt software systems have primarily focused on satisfying only a single adaptation goal at a time, which is often too restrictive for real applications. In this paper, we present Simplex Control Adaptation, SimCA, a new approach to self-adaptation that satisfies multiple goals, while being optimal with respect to an additional goal. SimCA offers robustness to measurement inaccuracy and environmental disturbances, and provides guarantees. We evaluate SimCA for two systems with strict requirements that have to deal with uncertainties: an underwater vehicle system used for oceanic surveillance, and a tele-assistance system for health care support.

Automated Change Impact Analysis between SysML Models of Requirements and Design
Shiva Nejati, Mehrdad Sabetzadeh, Chetan Arora, Lionel C. Briand, and Felix Mandoux
(University of Luxembourg, Luxembourg; Delphi Automotive Systems, Luxembourg)
An important activity in systems engineering is analyzing how a change in requirements will impact the design of a system. Performing this analysis manually is expensive, particularly for complex systems. In this paper, we propose an approach to automatically identify the impact of requirements changes on system design, when the requirements and design elements are expressed using models. We ground our approach on the Systems Modeling Language (SysML) due to SysML’s increasing use in industrial applications.
Our approach has two steps: For a given change, we first apply a static slicing algorithm to extract an estimated set of impacted model elements. Next, we rank the elements of the resulting set according to a quantitative measure designed to predict how likely it is for each element to be impacted. The measure is computed using Natural Language Processing (NLP) applied to the textual content of the elements. Engineers can then inspect the ranked list of elements and identify those that are actually impacted. We evaluate our approach on an industrial case study with 16 real-world requirements changes. Our results suggest that, using our approach, engineers need to inspect on average only 4.8% of the entire design in order to identify the actually-impacted elements. We further show that our results consistently improve when the analysis takes into account behavioral diagrams in addition to structural ones, and when it considers the natural-language content of the diagrams alongside their structural and behavioral content.
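A toy version of the ranking step is sketched below (the tokenization and cosine similarity are assumptions standing in for the paper's NLP measure): elements in the statically sliced impact set are ordered by textual similarity to the requirements change, so engineers inspect the most likely hits first.

```python
# Toy ranking of candidate impacted design elements by textual similarity to a
# requirements change, using bag-of-words cosine similarity.
import math
from collections import Counter

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    common = set(va) & set(vb)
    dot = sum(va[w] * vb[w] for w in common)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def rank_impact_set(change_text, sliced_elements):
    # sliced_elements: {element name: textual content} from the slicing step
    scored = [(cosine(change_text, text), name) for name, text in sliced_elements.items()]
    return [name for score, name in sorted(scored, reverse=True)]

change = "increase the timeout of the airbag deployment signal"
candidates = {
    "AirbagController.deploy": "deploy airbag when crash signal received within timeout",
    "SeatbeltSensor.read": "read seatbelt sensor state",
    "SignalTimer.configure": "configure timeout for deployment signal",
}
print(rank_impact_set(change, candidates))
```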

Session 6: API Mining and Usage
Tue, Nov 15, 14:00 - 15:30, Emerald 3 (Chair: Tao Xie)

Parameter-Free Probabilistic API Mining across GitHub
Jaroslav Fowkes and Charles Sutton
(University of Edinburgh, UK)
Existing API mining algorithms can be difficult to use as they require expensive parameter tuning and the returned set of API calls can be large, highly redundant and difficult to understand. To address this, we present PAM (Probabilistic API Miner), a near parameter-free probabilistic algorithm for mining the most interesting API call patterns. We show that PAM significantly outperforms both MAPO and UPMiner, achieving 69% test-set precision, at retrieving relevant API call sequences from GitHub. Moreover, we focus on libraries for which the developers have explicitly provided code examples, yielding over 300,000 LOC of hand-written API example code from the 967 client projects in the data set. This evaluation suggests that the hand-written examples actually have limited coverage of real API usages.

API Deprecation: A Retrospective Analysis and Detection Method for Code Examples on the Web
Jing Zhou and Robert J. Walker
(University of Calgary, Canada)
Deprecation allows the developers of application programming interfaces (APIs) to signal to other developers that a given API item ought to be avoided. But little is known about deprecation practices beyond anecdotes. We examine how API deprecation has been used in 26 open source Java frameworks and libraries, finding that the classic deprecate–replace–remove cycle is often not followed, as many APIs were removed without prior deprecation, many deprecated APIs were subsequently un-deprecated, and removed APIs are even resurrected with surprising frequency. Furthermore, we identify several problems in the information commonly (not) provided to help API consumers transition their dependent code.
As a consequence of deprecation, coding examples on the web --- an increasingly important source of information for developers --- can easily become outdated. Code examples that demonstrate how to use deprecated APIs can be difficult to disregard and a waste of time for developers. We propose a lightweight, version-sensitive framework to detect deprecated API usages in source code examples on the web so developers can be informed of such usages before they invest time and energy into studying them. We reify the framework as a prototype tool (Deprecation Watcher). Our evaluation on detecting deprecated Android API usages in code examples on Stack Overflow shows our tool obtains a precision of 100% and a recall of 86% in a random sample of 200 questions.
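The detection idea can be sketched roughly as follows (the deprecation table, API levels, and regex-based call matching are illustrative assumptions, not the Deprecation Watcher implementation):

```python
# Rough sketch of version-sensitive deprecated-API detection for a code example.
# Given the API level a snippet targets and a table of (deprecated_in, removed_in)
# versions, flag calls that are deprecated or already gone at that level.
import re

# Hypothetical deprecation table; a real one would be mined from SDK history.
DEPRECATIONS = {
    "getDrawable": {"deprecated_in": 22, "removed_in": None},
    "requestAudioFocus": {"deprecated_in": 26, "removed_in": None},
    "FloatMath.sqrt": {"deprecated_in": 22, "removed_in": 23},
}

def check_snippet(code, target_api_level):
    findings = []
    for api, info in DEPRECATIONS.items():
        if re.search(re.escape(api) + r"\s*\(", code):
            if info["removed_in"] is not None and target_api_level >= info["removed_in"]:
                findings.append(f"{api}: removed in API {info['removed_in']}")
            elif target_api_level >= info["deprecated_in"]:
                findings.append(f"{api}: deprecated since API {info['deprecated_in']}")
    return findings

snippet = "Drawable d = res.getDrawable(R.drawable.icon); float r = FloatMath.sqrt(x);"
print(check_snippet(snippet, target_api_level=25))
```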

When Should Internal Interfaces Be Promoted to Public?
André Hora, Marco Tulio Valente, Romain Robbes, and Nicolas Anquetil
(Federal University of Minas Gerais, Brazil; Federal University of Mato Grosso do Sul, Brazil; University of Chile, Chile; University of Lille, France)
Commonly, software systems have public (and stable) interfaces, and internal (and possibly unstable) interfaces. Despite being discouraged, client developers often use internal interfaces, which may cause their systems to fail when the interfaces evolve. To overcome this problem, API producers may promote internal interfaces to public ones. In practice, however, API producers have no assistance in identifying public interface candidates. In this paper, we study the transition from internal to public interfaces. We aim to help API producers deliver a better product and API clients benefit sooner from public interfaces. Our empirical investigation of five widely adopted Java systems presents the following observations. First, we identified 195 promotions from 2,722 internal interfaces. Second, we found that promoted internal interfaces have more clients. Third, we predicted internal interface promotion with a precision of 50%-80%, recall of 26%-82%, and AUC of 74%-85%. Finally, by applying our predictor to the latest version of the analyzed systems, we automatically detected 382 public interface candidates.

POLLUX: Safely Upgrading Dependent Application Libraries
Sukrit Kalra, Ayush Goel, Dhriti Khanna, Mohan Dhawan, Subodh Sharma, and Rahul Purandare
(IIIT Delhi, India; IBM Research, India; IIT Delhi, India)
Software evolution in third-party libraries across version upgrades can result in the addition of new functionality or changes to existing APIs. As a result, there is a real danger of impairing backward compatibility. Application developers, therefore, must keep constant vigil over library enhancements to ensure application consistency, i.e., that the application retains its semantic behavior across library upgrades. In this paper, we present the design and implementation of POLLUX, a framework to detect application-affecting changes across two versions of the same dependent, non-adversarial library binary, and to provide feedback on whether the application developer should link to the newer version or not. POLLUX leverages relevant application test cases to drive execution through both versions of the concerned library binary, records all concrete effects on the environment, and compares them to determine semantic similarity across the same API invocation for the two library versions. Our evaluation with 16 popular, open-source library binaries shows that POLLUX is accurate, with no false positives, and works across compiler optimizations.

Session 7: Verification
Tue, Nov 15, 16:30 - 18:00, Emerald 1 (Chair: Abhik Roychoudhury)

Extracting Instruction Semantics via Symbolic Execution of Code Generators
Niranjan Hasabnis and R. Sekar
(Intel, USA; Stony Brook University, USA)
Binary analysis and instrumentation form the basis of many tools and frameworks for software debugging, security hardening, and monitoring. Accurate modeling of instruction semantics is paramount in this regard, as errors can lead to program crashes, or worse, bypassing of security checks. Semantic modeling is a daunting task for modern processors such as x86 and ARM that support over a thousand instructions, many of them with complex semantics. This paper describes a new approach to automate this semantic modeling task. Our approach leverages instruction semantics knowledge that is already encoded into today’s production compilers such as GCC and LLVM. Such an approach can greatly reduce manual effort, and more importantly, avoid errors introduced by manual modeling. Furthermore, it is applicable to any of the numerous architectures already supported by the compiler. In this paper, we develop a new symbolic execution technique to extract instruction semantics from a compiler’s source code. Unlike previous applications of symbolic execution that were focused on identifying a single program path that violates a property, our approach addresses the all paths problem, extracting the entire input/output behavior of the code generator. We have applied it successfully to the 120K lines of C-code used in GCC’s code generator to extract x86 instruction semantics. To demonstrate architecture-neutrality, we have also applied it to AVR, a processor used in the popular Arduino platform.

Efficient Generation of Inductive Validity Cores for Safety Properties
Elaheh Ghassabani, Andrew Gacek, and Michael W. Whalen
(University of Minnesota, USA; Rockwell Collins, USA)
Symbolic model checkers can construct proofs of properties over very complex models. However, the results reported by the tool when a proof succeeds do not generally provide much insight to the user. It is often useful for users to have traceability information related to the proof: which portions of the model were necessary to construct it. This traceability information can be used to diagnose a variety of modeling problems such as overconstrained axioms and underconstrained properties, and can also be used to measure completeness of a set of requirements over a model. In this paper, we present a new algorithm to efficiently compute the inductive validity core (IVC) within a model necessary for inductive proofs of safety properties for sequential systems. The algorithm is based on the UNSAT core support built into current SMT solvers and a novel encoding of the inductive problem to try to generate a minimal inductive validity core. We prove our algorithm is correct, and describe its implementation in the JKind model checker for Lustre models. We then present an experiment in which we benchmark the algorithm in terms of speed, diversity of produced cores, and minimality, with promising results.

Correctness Witnesses: Exchanging Verification Results between Verifiers
Dirk Beyer, Matthias Dangl, Daniel Dietsch, and Matthias Heizmann
(LMU Munich, Germany; University of Passau, Germany; University of Freiburg, Germany)
Standard verification tools provide a counterexample to witness a specification violation, and, for a few years now, such a witness can be validated by an independent validator using an exchangeable witness format. This way, information about the violation can be shared across verification tools and the user can use standard tools to visualize and explore witnesses. This technique is not yet established for the correctness case, where a program fulfills a specification. Even for simple programs, it is often difficult for users to comprehend why a given program is correct, and there is no way to independently check the verification result. We close this gap by complementing our earlier work on violation witnesses with correctness witnesses. While we use an extension of the established common exchange format for violation witnesses to represent correctness witnesses, the techniques for producing and validating correctness witnesses are different. The overall goal to make proofs available to engineers is probably as old as programming itself, and proof-carrying code was proposed two decades ago --- our goal is to make it practical: We consider witnesses as first-class exchangeable objects, stored independently from the source code and checked independently from the verifier that produced them, respecting the important principle of separation of concerns. At any time, the invariants from the correctness witness can be used to reconstruct a correctness proof to establish trust. We extended two state-of-the-art verifiers, CPAchecker and Ultimate Automizer, to produce and validate witnesses, and report that the approach is promising on a large set of verification tasks.

SMT-Based Verification of Parameterized Systems
Arie Gurfinkel, Sharon Shoham, and Yuri Meshman
(Software Engineering Institute, USA; University of Waterloo, Canada; Tel Aviv University, Israel; Technion, Israel)
It is well known that verification of safety properties of sequential programs is reducible to satisfiability modulo theory of a first-order logic formula, called a verification condition (VC). The reduction is used both in deductive and automated verification, the difference is only in whether the user or the solver provides candidates for inductive invariants. In this paper, we extend the reduction to parameterized systems consisting of arbitrary many copies of a user-specified process, and whose transition relation is definable in first-order logic modulo theory of linear arithmetic and arrays. We show that deciding whether a parameterized system has a universally quantified inductive invariant is reducible to satisfiability of (non-linear) Constraint Horn Clauses (CHC). As a consequence of our reduction, we obtain a new automated procedure for verifying parameterized systems using existing PDR and CHC engines. While the new procedure is applicable to a wide variety of systems, we show that it is a decision procedure for several decidable fragments.

Session 8: Requirements and Models
Tue, Nov 15, 16:30 - 18:00, Emerald 2 (Chair: Jo Atlee)

On-the-Fly Decomposition of Specifications in Software Model Checking
Sven Apel, Dirk Beyer, Vitaly Mordan, Vadim Mutilin, and Andreas Stahlbauer
(University of Passau, Germany; LMU Munich, Germany; Russian Academy of Sciences, Russia)
Major breakthroughs have increased the efficiency and effectiveness of software model checking considerably, such that this technology is now applicable to industrial-scale software. However, verifying the full formal specification of a software system is still considered too complex, and in practice, sets of properties are verified one by one in isolation. We propose an approach that takes the full formal specification as input and first tries to verify all properties simultaneously in one verification run. Our verification algorithm monitors itself and detects situations for which the full set of properties is too complex. In such cases, we perform an automatic decomposition of the full set of properties into smaller sets, and continue the verification seamlessly. To avoid state-space explosion for large sets of properties, we introduce on-the-fly property weaving: properties get weaved into the program's transition system on the fly, during the analysis; which properties to weave and verify is determined dynamically during the verification process. We perform an extensive evaluation based on verification tasks that were derived from 4336 Linux kernel modules, and a set of properties that define the correct usage of the Linux API. Checking several properties simultaneously can lead to a significant performance gain, due to the fact that abstract models share many parts among different properties.

On Well-Separation of GR(1) Specifications
Shahar Maoz and Jan Oliver Ringert
(Tel Aviv University, Israel)
Specifications for reactive synthesis, an automated procedure to obtain a correct-by-construction reactive system, consist of assumptions and guarantees. One way a controller may satisfy the specification is by preventing the environment from satisfying the assumptions, without satisfying the guarantees. Although valid, this solution is usually undesired, and specifications that allow it are called non-well-separated.
In this work we investigate non-well-separation in the context of GR(1), an expressive fragment of LTL that enables efficient synthesis. We distinguish different cases of non-well-separation, and compute strategies showing how the environment can be forced to violate its assumptions. Moreover, we show how to find a core, a minimal set of assumptions that lead to non-well-separation, and further extend our work to support past-time LTL and patterns.
We implemented our work and evaluated it on 79 specifications. The evaluation shows that non-well-separation is a common problem in specifications and that our tools can be efficiently applied to identify it and its causes.

Lightweight Specification and Analysis of Dynamic Systems with Rich Configurations
Nuno Macedo, Julien Brunel, David Chemouil, Alcino Cunha, and Denis Kuperberg
(INESC TEC, Portugal; University of Minho, Portugal; University of Toulouse, France; ONERA, France; TU Munich, Germany)
Model-checking is increasingly popular in the early phases of the software development process. To establish the correctness of a software design one must usually verify both structural and behavioral (or temporal) properties. Unfortunately, most specification languages, and accompanying model-checkers, excel only in analyzing either one or the other kind. This limits their ability to verify dynamic systems with rich configurations: systems whose state space is characterized by rich structural properties, but whose evolution is also expected to satisfy certain temporal properties.
To address this problem, we first propose Electrum, an extension of the Alloy specification language with temporal logic operators, where both rich configurations and expressive temporal properties can easily be defined. Two alternative model-checking techniques are then proposed, one bounded and the other unbounded, to verify systems expressed in this language, namely to verify that every desirable temporal property holds for every possible configuration.

Gray Links in the Use of Requirements Traceability
Nan Niu, Wentao Wang, and Arushi Gupta
(University of Cincinnati, USA)
The value of traceability is in its use. How do different software engineering tasks affect the tracing of the same requirement? In this paper, we answer the question via an empirical study where we explicitly assign the participants into 3 trace-usage groups of one requirement: finding its implementation for verification and validation purpose, changing it within the original software system, and reusing it toward another application. The results uncover what we call "gray links"--around 20% of the total traces are voted to be true links with respect to only one task but not the others. We provide a mechanism to identify such gray links and discuss how they can be leveraged to advance the research and practice of value-based requirements traceability.

Session 9: Android
Tue, Nov 15, 16:30 - 18:00, Emerald 3 (Chair: Lingxiao Jiang)

Understanding and Detecting Wake Lock Misuses for Android Applications
Yepang Liu, Chang Xu, Shing-Chi Cheung, and Valerio Terragni
(Hong Kong University of Science and Technology, China; Nanjing University, China)
Wake locks are widely used in Android apps to protect critical computations from being disrupted by device sleeping. Inappropriate use of wake locks often seriously impacts user experience. However, little is known about how wake locks are used in real-world Android apps and the impact of their misuse. To bridge this gap, we conducted a large-scale empirical study on 44,736 commercial and 31 open-source Android apps. By automated program analysis and manual investigation, we observed (1) common program points where wake locks are acquired and released, (2) 13 types of critical computational tasks that are often protected by wake locks, and (3) eight patterns of wake lock misuses that commonly cause functional and non-functional issues, only three of which had been studied by existing work. Based on our findings, we designed a static analysis technique, Elite, to detect the two most common patterns of wake lock misuses. Our experiments on real-world subjects showed that Elite is effective and can outperform two state-of-the-art techniques.

DiagDroid: Android Performance Diagnosis via Anatomizing Asynchronous Executions
Yu Kang, Yangfan Zhou, Hui Xu, and Michael R. Lyu
(Chinese University of Hong Kong, China; Fudan University, China)
Rapid UI responsiveness is a key consideration for Android app developers. However, the complicated concurrency model of Android makes it hard for developers to understand and diagnose UI performance. This paper presents DiagDroid, a tool specifically designed for Android UI performance diagnosis. The key notion of DiagDroid is that UI-triggered asynchronous executions contribute to the UI performance, and hence their performance and their runtime dependencies should be properly captured to facilitate performance diagnosis. However, there are numerous ways to start asynchronous executions, posing a great challenge to profiling such executions and their runtime dependencies. To this end, we abstract five categories of asynchronous executions as the building basis. As a result, they can be tracked and profiled based on the specifics of each category with a dynamic instrumentation approach carefully tailored for Android. DiagDroid can then profile the asynchronous executions at task granularity, giving it low overhead and high compatibility. The tool has been successfully applied in diagnosing 33 real-world open-source apps, and we found that 14 of them contain 27 performance issues, showing the effectiveness of our tool in Android UI performance diagnosis. The tool is released online as open source.

Minimizing GUI Event Traces
Lazaro Clapp, Osbert Bastani, Saswat Anand, and Alex Aiken
(Stanford University, USA)
GUI input generation tools for Android apps, such as Android's Monkey, are useful for automatically producing test inputs, but these tests are generally orders of magnitude larger than necessary, making them difficult for humans to understand. We present a technique for minimizing the output of such tools. Our technique accounts for the non-deterministic behavior of mobile apps, producing small event traces that reach a desired activity with high probability.
We propose a variant of delta debugging, augmented to handle non-determinism, to solve the problem of trace minimization. We evaluate our algorithm on two sets of commercial and open-source Android applications, showing that we can minimize large event traces reaching a particular application activity, producing traces that are, on average, less than 2% the size of the original traces.
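The core idea can be sketched as a ddmin-style loop in which every candidate trace is replayed several times and kept only if it still reaches the target activity with sufficiently high probability (the replay interface, thresholds, and toy app below are assumptions, not the paper's algorithm):

```python
# Sketch of delta debugging adapted to flaky, non-deterministic replay:
# a candidate sub-trace "passes" only if enough of its replays reach the target.
import random

def reaches_target(trace, replays, threshold, replay_fn):
    hits = sum(replay_fn(trace) for _ in range(replays))
    return hits / replays >= threshold

def minimize(trace, replay_fn, replays=5, threshold=0.8):
    n = 2
    while len(trace) >= 2:
        chunk = max(1, len(trace) // n)
        shrunk = False
        for i in range(0, len(trace), chunk):
            candidate = trace[:i] + trace[i + chunk:]   # drop one chunk of events
            if candidate and reaches_target(candidate, replays, threshold, replay_fn):
                trace, n, shrunk = candidate, max(2, n - 1), True
                break
        if not shrunk:
            if n >= len(trace):
                break
            n = min(len(trace), n * 2)   # retry with finer-grained chunks
    return trace

# Toy "app": only events 3 and 7 matter; everything else is noise, and replay
# fails spuriously 10% of the time to mimic non-determinism.
def replay(trace):
    return {3, 7} <= set(trace) and random.random() > 0.1

print(minimize(list(range(10)), replay))   # typically prints a short trace containing 3 and 7
```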

Causal Impact Analysis for App Releases in Google Play
William Martin, Federica Sarro, and Mark Harman
(University College London, UK)
App developers would like to understand the impact of their own and their competitors’ software releases. To address this we introduce Causal Impact Release Analysis for app stores, and our tool, CIRA, that implements this analysis. We mined 38,858 popular Google Play apps, over a period of 12 months. For these apps, we identified 26,339 releases for which there was adequate prior and posterior time series data to facilitate causal impact analysis. We found that 33% of these releases caused a statistically significant change in user ratings. We use our approach to reveal important characteristics that distinguish causal significance in Google Play. To explore the actionability of causal impact analysis, we elicited the opinions of app developers: 56 companies responded, 78% concurred with the causal assessment, of which 33% claimed that their company would consider changing its app release strategy as a result of our findings.


Session 10: Static Analysis
Wed, Nov 16, 11:00 - 12:30, Emerald 1 (Chair: Mark Marron)

Static DOM Event Dependency Analysis for Testing Web Applications
Chungha Sung, Markus Kusano, Nishant Sinha, and Chao Wang
(Virginia Tech, USA; IBM Research, India; University of Southern California, USA)
The number and complexity of JavaScript-based web applications are rapidly increasing, but methods and tools for automatically testing them are lagging behind, primarily due to the difficulty in analyzing the subtle interactions between the applications and the event-driven execution environment. Although static analysis techniques have been routinely used on software written in traditional programming languages, such as Java and C++, adapting them to handle JavaScript code and the HTML DOM is difficult. In this work, we propose the first constraint-based declarative program analysis procedure for computing dependencies over program variables as well as event-handler functions of the various DOM elements, which is crucial for analyzing the behavior of a client-side web application. We implemented the method in a software tool named JSDEP and evaluated it in ARTEMIS, a platform for automated web application testing. Our experiments on a large set of web applications show the new method can significantly reduce the number of redundant test sequences and significantly increase test coverage with minimal overhead.

aec-badge-fse16-ae
On-Demand Strong Update Analysis via Value-Flow Refinement
Yulei Sui and Jingling Xue
(UNSW, Australia)
We present a new Strong UPdate Analysis for C programs, called Supa, that enables computing points-to information on-demand via value-flow refinement, in environments with small time and memory budgets such as IDEs. We formulate Supa by solving a graph-reachability problem on a value-flow graph representation of the program, so that strong updates are performed where needed, as long as the total analysis budget is not exhausted. Supa facilitates efficiency and precision tradeoffs by allowing different pointer analyses to be applied in a hybrid multi-stage analysis framework.
We have implemented Supa in LLVM with its artifact available at [1]. We evaluate Supa by choosing uninitialized pointer detection as a major client on 12 open-source C programs. As the analysis budget increases, Supa achieves improved precision, with its single-stage flow-sensitive analysis reaching 97% of that achieved by whole-program flow-sensitive analysis, while consuming about 0.19 seconds and 36KB of memory per query, on average (with a budget of at most 10000 value-flow edges per query).

aec-badge-fse16-ae
Call Graph Construction for Java Libraries
Michael Reif, Michael Eichberg, Ben Hermann, Johannes Lerch, and Mira Mezini
(TU Darmstadt, Germany)
Today, every application uses software libraries. Yet, while a lot of research exists w.r.t. analyzing applications, research that targets the analysis of libraries independent of any application is scarce. This is unfortunate, because, for developers of libraries, such as the Java Development Kit (JDK), it is crucial to ensure that the library behaves as intended regardless of how it is used. To fill this gap, we discuss the construction of call graphs for libraries that abstract over all potential library usages. Call graphs are particularly relevant as they are a precursor of many advanced analyses, such as inter-procedural data-flow analyses.
We show that the current practice of using call graph algorithms designed for applications to analyze libraries leads to call graphs that, at the same time, lack relevant call edges and contain unnecessary edges. This motivates the need for call graph construction algorithms dedicated to libraries. Unlike algorithms for applications, call graph construction algorithms for libraries must take into consideration the goals of subsequent analyses. Specifically, we show that it is essential to distinguish the scenario of an analysis for potentially exploitable vulnerabilities from that of an analysis for general software quality attributes, e.g., dead methods or unused fields. This distinction affects the decision about what constitutes the library-private implementation, which therefore needs special treatment. Thus, building one call graph that satisfies all needs is not sensible. Overall, we observed that the proposed call graph algorithms reduce the number of call edges by up to 30% when compared to existing approaches.
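
A hypothetical Java fragment illustrating why a library needs different call-graph treatment than an application; the class and methods are invented, not taken from the paper's evaluation.

```java
// Hypothetical library code: because the library is analyzed without any
// concrete application, the call graph must account for clients it never sees.
public class Exporter {
    // Public and overridable: an unknown client may subclass Exporter and
    // override format(), so a sound library call graph needs an edge from the
    // call below to potential external overrides, not just to format() here.
    public String export(Object o) {
        return header() + format(o);
    }

    protected String format(Object o) {
        return String.valueOf(o);
    }

    // Package-private helper: whether unknown code can reach this directly
    // depends on the analysis scenario. A vulnerability analysis may have to
    // assume a malicious class is placed in the same package, while an
    // analysis for dead methods can treat it as library-private.
    String header() {
        return "export: ";
    }
}
```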

aec-badge-fse16-ae
Revamping JavaScript Static Analysis via Localization and Remediation of Root Causes of Imprecision
Shiyi Wei, Omer Tripp, Barbara G. Ryder, and Julian Dolby
(University of Maryland, USA; Google, USA; Virginia Tech, USA; IBM Research, USA)
Static analysis is challenged by the dynamic language constructs of JavaScript, which often lead to unacceptable performance and/or precision results. We describe an approach that focuses on improving the practicality and accuracy of points-to analysis and call graph construction for JavaScript programs. The approach first identifies program constructs which are sources of imprecision (i.e., root causes) by monitoring the static analysis process. We then examine these constructs and suggest specific context-sensitive analyses to apply. Our technique is able to find that the root causes comprise less than 2% of the functions in JavaScript library applications. Moreover, the specialized analysis derived by our approach finishes within a few seconds, even on programs which cannot complete within 10 minutes with the original analysis.

Session 11: Recommendation
Wed, Nov 16, 11:00 - 12:30, Emerald 2 (Chair: Chris Bird)

What Would Users Change in My App? Summarizing App Reviews for Recommending Software Changes
Andrea Di Sorbo, Sebastiano Panichella, Carol V. Alexandru, Junji Shimagaki, Corrado A. Visaggio, Gerardo Canfora, and Harald C. Gall
(University of Sannio, Italy; University of Zurich, Switzerland; Sony Mobile Communications, Japan)
Mobile app developers constantly monitor feedback in user reviews with the goal of improving their mobile apps and better meeting user expectations. Thus, automated approaches have been proposed in the literature to reduce the effort required for analyzing feedback contained in user reviews via automatic classification/prioritization according to specific topics. In this paper, we introduce SURF (Summarizer of User Reviews Feedback), a novel approach to condense the enormous amount of information that developers of popular apps have to manage due to user feedback received on a daily basis. SURF relies on a conceptual model for capturing user needs that is useful for developers performing maintenance and evolution tasks. It then uses summarization techniques to condense thousands of reviews and generate an interactive, structured, and condensed agenda of recommended software changes. We performed an end-to-end evaluation of SURF on user reviews of 17 mobile apps (5 of them developed by Sony Mobile), involving 23 developers and researchers in total. Results demonstrate high accuracy of SURF in summarizing reviews and the usefulness of the recommended changes. In evaluating our approach we found that SURF helps developers better understand user needs, substantially reducing the time required compared to manually analyzing user (change) requests and planning future software changes.

API Code Recommendation using Statistical Learning from Fine-Grained Changes
Anh Tuan Nguyen, Michael Hilton, Mihai Codoban, Hoan Anh Nguyen, Lily Mast, Eli Rademacher, Tien N. Nguyen, and Danny Dig
(Iowa State University, USA; Oregon State University, USA; Microsoft, USA; University of Evansville, USA)
Learning and remembering how to use APIs is difficult. While code-completion tools can recommend API methods, browsing a long list of API method names and their documentation is tedious. Moreover, users can easily be overwhelmed with too much information. We present a novel API recommendation approach that taps into the predictive power of repetitive code changes to provide relevant API recommendations for developers. Our approach and tool, APIREC, is based on statistical learning from fine-grained code changes and from the context in which those changes were made. Our empirical evaluation shows that APIREC correctly recommends an API call in the first position 59% of the time, and it recommends the correct API call in the top five positions 77% of the time. This improves over state-of-the-art approaches by 30-160% in top-1 accuracy and by 10-30% in top-5 accuracy. Our result shows that APIREC performs well even with a one-time, minimal training dataset of 50 publicly available projects.

TIPMerge: Recommending Experts for Integrating Changes across Branches
Catarina Costa, Jair Figueiredo, Leonardo Murta, and Anita Sarma
(Federal University of Acre, Brazil; Federal Fluminense University, Brazil; Oregon State University, USA)
Parallel development in branches is a common software practice. However, past work has found that integration of changes across branches is not easy, and often leads to failures. Thus far, there has been little work to recommend developers who have the right expertise to perform a branch integration. We propose TIPMerge, a novel tool that recommends developers who are best suited to perform merges, by taking into consideration developers' past experience in the project, their changes in the branches, and dependencies among modified files in the branches. We evaluated TIPMerge on 28 projects, which included up to 15,584 merges with at least two developers, and potentially conflicting changes. On average, 85% of the top-3 recommendations by TIPMerge correctly included the developer who performed the merge; the best recommendation accuracy was 98%. Our interviews with developers of two projects reveal that in cases where the TIPMerge recommendation did not match the actual merge developer, the recommended developer had the expertise to perform the merge, or was involved in a collaborative merge session.

Video Info
Interactive and Guided Architectural Refactoring with Search-Based Recommendation
Yun Lin, Xin Peng, Yuanfang Cai, Danny Dig, Diwen Zheng, and Wenyun Zhao
(Fudan University, China; Drexel University, USA; Oregon State University, USA)
Architectural refactorings can contain hundreds of steps, and experienced developers may carry them out over several weeks. Moreover, developers need to explore a correct sequence of refactoring steps among many more incorrect alternatives. Thus, carrying out architectural refactorings is costly, risky, and challenging. In this paper, we present Refactoring Navigator: a tool-supported and interactive recommendation approach for aiding architectural refactoring. Our approach takes a given implementation as the starting point, a desired high-level design as the target, and iteratively recommends a series of refactoring steps. Moreover, our approach allows the user to accept, reject, or ignore a recommended refactoring step, and uses the user's feedback in further refactoring recommendations. We evaluated the effectiveness of our approach and tool using a controlled experiment and an industrial case study. The controlled experiment shows that the participants who used Refactoring Navigator accomplished their tasks in 77.4% less time and manually edited 98.3% fewer lines than the control group. The industrial case study suggests that Refactoring Navigator has the potential to help with architectural refactorings in practice.

Video

Session 12: Test Coverage
Wed, Nov 16, 11:00 - 12:30, Emerald 3 (Chair: Willem Visser)

Can Testedness be Effectively Measured?
Iftekhar Ahmed, Rahul Gopinath, Caius Brindescu, Alex Groce, and Carlos Jensen
(Oregon State University, USA)
Among the major questions that a practicing tester faces are deciding where to focus additional testing effort, and deciding when to stop testing. Test the least-tested code, and stop when all code is well-tested, is a reasonable answer. Many measures of "testedness" have been proposed; unfortunately, we do not know whether these are truly effective. In this paper we propose a novel evaluation of two of the most important and widely-used measures of test suite quality. The first measure is statement coverage, the simplest and best-known code coverage measure. The second measure is mutation score, a supposedly more powerful, though expensive, measure.
We evaluate these measures using the actual criterion of interest: if a program element is (by these measures) well tested at a given point in time, it should require fewer future bug-fixes than a "poorly tested" element. If not, then it seems likely that we are not effectively measuring testedness. Using a large number of open source Java programs from GitHub and Apache, we show that both statement coverage and mutation score have only a weak negative correlation with bug-fixes. Despite the lack of strong correlation, there are statistically and practically significant differences between program elements for various binary criteria. Program elements (other than classes) covered by any test case see about half as many bug-fixes as those not covered, and a similar line can be drawn for mutation score thresholds. Our results have important implications for both software engineering practice and research evaluation.
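
A tiny, invented illustration of the gap between the two measures: the test below achieves full statement coverage of the method yet fails to kill an obvious mutant.

```java
// Invented example (not from the paper's subjects). Both calls in main() pass,
// every statement of max() is executed, so statement coverage is 100%; yet a
// mutant that replaces ">" with ">=" also passes both checks, because no test
// uses a == b, the only inputs that distinguish the two versions.
public class MaxExample {
    static int max(int a, int b) {
        if (a > b) {        // mutant "a >= b" survives the checks below
            return a;
        }
        return b;
    }

    public static void main(String[] args) {
        // Run with `java -ea MaxExample` so the assertions are enabled.
        assert max(5, 3) == 5;  // covers the true branch
        assert max(3, 5) == 5;  // covers the false branch
        System.out.println("tests passed");
    }
}
```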

A Large-Scale Empirical Comparison of Static and Dynamic Test Case Prioritization Techniques
Qi Luo, Kevin Moran, and Denys Poshyvanyk
(College of William and Mary, USA)
The large body of existing research in Test Case Prioritization (TCP) techniques can be broadly classified into two categories: dynamic techniques (that rely on run-time execution information) and static techniques (that operate directly on source and test code). Absent from this current body of work is a comprehensive study aimed at understanding and evaluating the static approaches and comparing them to dynamic approaches on a large set of projects.
In this work, we perform the first extensive study aimed at empirically evaluating four static TCP techniques, comparing them with state-of-research dynamic TCP techniques at different test-case granularities (e.g., method and class-level) in terms of effectiveness, efficiency, and similarity of faults detected. This study was performed on 30 real-world Java programs encompassing 431 KLoC. In terms of effectiveness, we find that the static call-graph-based technique outperforms the other static techniques at test-class level, but the topic-model-based technique performs better at test-method level. In terms of efficiency, the static call-graph-based technique is also the most efficient when compared to other static techniques. When examining the similarity of faults detected for the four static techniques compared to the four dynamic ones, we find that on average, the faults uncovered by these two groups of techniques are quite dissimilar, with the top 10% of test cases agreeing on only ≈ 25% - 30% of detected faults. This prompts further research into the severity/importance of faults uncovered by these techniques, and into the potential for combining static and dynamic information for more effective approaches.

Analyzing the Validity of Selective Mutation with Dominator Mutants
Bob Kurtz, Paul Ammann, Jeff Offutt, Márcio E. Delamaro, Mariet Kurtz, and Nida Gökçe
(George Mason University, USA; University of São Paulo, Brazil; MITRE, USA; Muğla University, Turkey)
Various forms of selective mutation testing have long been accepted as valid approximations to full mutation testing. This paper presents counterevidence to traditional selective mutation. The recent development of dominator mutants and minimal mutation analysis lets us analyze selective mutation without the noise introduced by the redundancy inherent in traditional mutation. We then exhaustively evaluate all small sets of mutation operators for the Proteum mutation system and determine dominator mutation scores and required work for each of these sets on an empirical test bed. The results show that all possible selective mutation approaches have poor dominator mutation scores on at least some of these programs. This suggests that to achieve high performance with respect to full mutation analysis, selective approaches will have to become more sophisticated, possibly by choosing mutants based on the specifics of the artifact under test, that is, specialized selective mutation.

An Extensive Study of Static Regression Test Selection in Modern Software Evolution
Owolabi Legunsen, Farah Hariri, August Shi, Yafeng Lu, Lingming Zhang, and Darko Marinov
(University of Illinois at Urbana-Champaign, USA; University of Texas at Dallas, USA)
Regression test selection (RTS) aims to reduce regression testing time by only re-running the tests affected by code changes. Prior research on RTS can be broadly split into dynamic and static techniques. A recently developed dynamic RTS technique called Ekstazi is gaining some adoption in practice, and its evaluation shows that selecting tests at a coarser, class-level granularity provides better results than selecting tests at a finer, method-level granularity. As dynamic RTS is gaining adoption, it is timely to also evaluate static RTS techniques, some of which were proposed over three decades ago but not extensively evaluated on modern software projects.
This paper presents the first extensive study that evaluates the performance benefits of static RTS techniques and their safety; a technique is safe if it selects to run all tests that may be affected by code changes. We implemented two static RTS techniques, one class-level and one method-level, and compare several variants of these techniques. We also compare these static RTS techniques against Ekstazi, a state-of-the-art, class-level, dynamic RTS technique. The experimental results on 985 revisions of 22 open-source projects show that the class-level static RTS technique is comparable to Ekstazi, with similar performance benefits, but at the risk of being unsafe sometimes. In contrast, the method-level static RTS technique performs rather poorly.

Session 13: Program Analysis
Wed, Nov 16, 14:00 - 15:30, Emerald 1 (Chair: Santosh Nagarakatte)

PerfGuard: Binary-Centric Application Performance Monitoring in Production Environments
Chung Hwan Kim, Junghwan Rhee, Kyu Hyung Lee, Xiangyu Zhang, and Dongyan Xu
(Purdue University, USA; NEC Labs, USA; University of Georgia, USA)
Diagnosis of performance problems is an essential part of software development and maintenance. It is particularly challenging in production environments, where only program binaries are available and there is limited or no knowledge of the source code. The problem is compounded by the large amount of third-party software integrated into most large-scale applications. Existing approaches either require source code to embed manually constructed logic to identify performance problems or support a limited scope of applications with prior manual analysis. This paper proposes an automated approach, PerfGuard, that analyzes application binaries and transparently instruments the binary code to inject and apply performance assertions on application transactions. Our evaluation on a set of large-scale application binaries, without access to source code, automatically discovered 10 publicly known real-world performance bugs and shows that PerfGuard introduces very low overhead (less than 3% on Apache and MySQL server) in production systems.

Python Probabilistic Type Inference with Natural Language Support
Zhaogui Xu, Xiangyu Zhang, Lin Chen, Kexin Pei, and Baowen Xu
(Nanjing University, China; Purdue University, USA)
We propose a novel type inference technique for Python programs. Type inference is difficult for Python programs due to their heavy dependence on external APIs and the language's dynamic features. We observe that Python source code often contains many type hints, such as attribute accesses and variable names. However, such type hints are not reliable. We hence propose to use probabilistic inference to allow the beliefs of individual type hints to be propagated, aggregated, and eventually converge on probabilities of variable types. Our results show that our technique substantially outperforms a state-of-the-art Python type inference engine based on abstract interpretation.

aec-badge-fse16-ae
Detecting and Fixing Precision-Specific Operations for Measuring Floating-Point Errors
Ran Wang, Daming Zou, Xinrui He, Yingfei Xiong, Lu Zhang, and Gang Huang
(Peking University, China)
The accuracy of floating-point calculation is critical to many applications, and different methods have been proposed to address floating-point accuracy, such as detecting errors in a program, verifying its accuracy, and optimizing it to produce more accurate results. These approaches need a specification of the program to understand the ideal calculation it performs, which is usually obtained by interpreting the program in a precision-unspecific way.
However, many operations programmed in existing code are inherently precision-specific, which cannot be easily interpreted in a precision-unspecific way. In fact, the semantics used in existing approaches usually fail to interpret precision-specific operations correctly.
In this paper, we present a systematic study of precision-specific operations. First, we propose a detection approach to detect precision-specific operations. Second, we propose a fixing approach to enable the tuning of precisions in the presence of precision-specific operations. Third, we studied the precision-specific operations in the GNU C standard math library based on our detection and fixing approaches. Our results show that (1) a significant number of code fragments in the standard C math library are precision-specific operations, and some large inaccuracies reported in existing studies are false positives or potential false positives due to precision-specific operations; (2) our detection approach has high precision and recall; and (3) our fixing approach leads to overall more accurate results.
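
A hedged illustration of a precision-specific operation. The paper's subjects are in the GNU C math library; this Java sketch of Veltkamp splitting only conveys the idea that some code is written for one specific precision (here, the constant 2^27 + 1 is tied to the 53-bit double format) and is misread by precision-unspecific semantics.

```java
// Illustrative only (not taken from the paper's case studies). Veltkamp
// splitting decomposes a double into a high part with at most 27 significant
// bits and an exact low-order remainder. The magic constant 134217729 = 2^27+1
// is correct only for IEEE-754 double precision; interpreting the code at a
// different (e.g., arbitrary) precision changes its meaning entirely.
public class VeltkampSplit {
    public static void main(String[] args) {
        double x = 0.1;
        double c = 134217729.0 * x;  // 2^27 + 1, specific to the double format
        double hi = c - (c - x);     // high part: at most 27 significant bits
        double lo = x - hi;          // low part: exact remainder
        System.out.println("x  = " + x);
        System.out.println("hi = " + hi + ", lo = " + lo);
        System.out.println("hi + lo == x ? " + (hi + lo == x));  // exact split
    }
}
```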

aec-badge-fse16-ae
Deep API Learning
Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim
(Hong Kong University of Science and Technology, China; Microsoft Research, China)
Developers often wonder how to implement a certain functionality (e.g., how to parse XML files) using APIs. Obtaining an API usage sequence based on an API-related natural language query is very helpful in this regard. Given a query, existing approaches utilize information retrieval models to search for matching API sequences. These approaches treat queries and APIs as bags-of-words and lack a deep understanding of the semantics of the query. We propose DeepAPI, a deep learning based approach to generate API usage sequences for a given natural language query. Instead of a bag-of-words assumption, it learns the sequence of words in a query and the sequence of associated APIs. DeepAPI adapts a neural language model named RNN Encoder-Decoder. It encodes a word sequence (user query) into a fixed-length context vector, and generates an API sequence based on the context vector. We also augment the RNN Encoder-Decoder by considering the importance of individual APIs. We empirically evaluate our approach with more than 7 million annotated code snippets collected from GitHub. The results show that our approach generates largely accurate API sequences and outperforms the related approaches.

Session 14: Build and Configuration
Wed, Nov 16, 14:00 - 15:30, Emerald 2 (Chair: John Penix)

Build System with Lazy Retrieval for Java Projects
Ahmet Celik, Alex Knaust, Aleksandar Milicevic, and Milos Gligoric
(University of Texas at Austin, USA; Microsoft, USA)
In modern-day development, projects use Continuous Integration Services (CISs) to execute the build for every change in the source code. To ensure that the project remains correct and deployable, a CIS performs a clean build each time. In a clean environment, a build system needs to retrieve the project's dependencies (e.g., guava.jar). The retrieval, however, can be costly due to dependency bloat: despite a project using only a few files from each library, existing build systems still eagerly retrieve all the libraries at the beginning of the build.
This paper presents a novel build system, Molly, which lazily retrieves parts of libraries (i.e., files) that are needed during the execution of a build target. For example, the compilation target needs only public interfaces of classes within the libraries and the test target needs only implementation of the classes that are being invoked by the tests. Additionally, Molly generates a transfer script that retrieves parts of libraries based on prior builds. Molly's design requires that we ignore the boundaries set by the library developers and look at the files within the libraries. We implemented Molly for Java and evaluated it on 17 popular open-source projects. We show that test targets (on average) depend on only 9.97% of files in libraries. A variant of Molly speeds up retrieval by 44.28%. Furthermore, the scripts generated by Molly retrieve dependencies, on average, 93.81% faster than the Maven build system.

iGen: Dynamic Interaction Inference for Configurable Software
ThanhVu Nguyen, Ugur Koc, Javran Cheng, Jeffrey S. Foster, and Adam A. Porter
(University of Maryland at College Park, USA)
To develop, analyze, and evolve today's highly configurable software systems, developers need deep knowledge of a system's configuration options, e.g., how options need to be set to reach certain locations, what configurations to use for testing, etc. Today, acquiring this detailed information requires manual effort that is difficult, expensive, and error prone. In this paper, we propose iGen, a novel, lightweight dynamic analysis technique that automatically discovers a program's interactions---expressive logical formulae that give developers rich and detailed information about how a system's configuration option settings map to particular code coverage. iGen employs an iterative algorithm that runs a system under a small set of configurations, capturing coverage data; processes the coverage data to infer potential interactions; and then generates new configurations to further refine interactions in the next iteration. We evaluated iGen on 29 programs spanning five languages; the breadth of this study would be unachievable using prior interaction inference tools. Our results show that iGen finds precise interactions based on a very small fraction of the number of possible configurations. Moreover, iGen's results confirm several earlier hypotheses about typical interaction distributions and structures.

CacheOptimizer: Helping Developers Configure Caching Frameworks for Hibernate-Based Database-Centric Web Applications
Tse-Hsun Chen, Weiyi Shang, Ahmed E. Hassan, Mohamed Nasser, and Parminder Flora
(Queen's University, Canada; Concordia University, Canada; BlackBerry, Canada)
To help improve the performance of database-centric cloud-based web applications, developers usually use caching frameworks to speed up database accesses. Such caching frameworks require extensive knowledge of the application to operate effectively. However, all too often developers have limited knowledge about the intricate details of their own application. Hence, most developers find configuring caching frameworks a challenging and time-consuming task that requires extensive and scattered code changes. Furthermore, developers may also need to frequently change such configurations to accommodate the ever-changing workload.
In this paper, we propose CacheOptimizer, a lightweight approach that helps developers optimize the configuration of caching frameworks for web applications that are implemented using Hibernate. CacheOptimizer leverages readily-available web logs to create mappings between a workload and database accesses. Given the mappings, CacheOptimizer discovers the optimal cache configuration using coloured Petri nets, and automatically adds the appropriate cache configurations to the application. We evaluate CacheOptimizer on three open-source web applications. We find that i) CacheOptimizer improves the throughput by 27--138%; and ii) after considering both the memory cost and throughput improvement, CacheOptimizer still brings statistically significant gains (with mostly large effect sizes) in comparison to the application's default cache configuration and to blindly enabling all possible caches.
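
A minimal sketch, assuming standard Hibernate/JPA annotations, of the kind of entity-level second-level-cache setting such a tool could add automatically; which entities and queries are worth caching is the part CacheOptimizer infers from the web logs, and the entity below is invented.

```java
import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// Hypothetical entity: if the web logs show it is read far more often than it
// is written, enabling the second-level cache for it is the kind of scattered,
// per-entity configuration change that is tedious to make and maintain by hand.
@Entity
@Cacheable
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class Product {
    @Id
    private Long id;

    private String name;

    protected Product() { }   // no-arg constructor required by JPA

    public Long getId() { return id; }
    public String getName() { return name; }
}
```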

Session 15: Code Search and Similarity
Wed, Nov 16, 14:00 - 15:30, Emerald 3 (Chair: Mehdi Mirakhorli)

BinGo: Cross-Architecture Cross-OS Binary Search
Mahinthan Chandramohan, Yinxing Xue, Zhengzi Xu, Yang Liu, Chia Yuan Cho, and Hee Beng Kuan Tan
(Nanyang Technological University, Singapore; DSO National Laboratories, Singapore)
Binary code search has received much attention recently due to its impactful applications, e.g., plagiarism detection, malware detection, and software vulnerability auditing. However, developing an effective binary code search tool is challenging due to the gigantic syntactic and structural differences in binaries that result from different compilers, architectures, and OSs. In this paper, we propose BINGO, a scalable and robust binary search engine supporting various architectures and OSs. The key contribution is a selective inlining technique to capture the complete function semantics by inlining relevant library and user-defined functions. In addition, architecture- and OS-neutral function filtering is proposed to dramatically reduce the irrelevant target functions. We also introduce length-variant partial traces to model binary functions in a program-structure-agnostic fashion. The experimental results show that BINGO can find semantically similar functions across architecture and OS boundaries, even in the presence of program structure distortion, in a scalable manner. Using BINGO, we also discovered a zero-day vulnerability in Adobe PDF Reader, a COTS binary.

Relationship-Aware Code Search for JavaScript Frameworks
Xuan Li, Zerui Wang, Qianxiang Wang, Shoumeng Yan, Tao Xie, and Hong Mei
(Peking University, China; Intel Research, China; University of Illinois at Urbana-Champaign, USA)
JavaScript frameworks, such as jQuery, are widely used for developing web applications. To facilitate using these JavaScript frameworks to implement a feature (e.g., functionality), a large number of programmers often search for code snippets that implement the same or similar feature. However, existing code search approaches tend to be ineffective, without taking into account the fact that JavaScript code snippets often implement a feature based on various relationships (e.g., sequencing, condition, and callback relationships) among the invoked framework API methods. To address this issue, we present a novel Relationship-Aware Code Search (RACS) approach for finding code snippets that use JavaScript frameworks to implement a specific feature. In advance, RACS collects a large number of code snippets that use some JavaScript frameworks, mines API usage patterns from the collected code snippets, and represents the mined patterns with method call relationship (MCR) graphs, which capture framework API methods’ signatures and their relationships. Given a natural language (NL) search query issued by a programmer, RACS conducts NL processing to automatically extract an action relationship (AR) graph, which consists of actions and their relationships inferred from the query. In this way, RACS reduces code search to the problem of graph search: finding similar MCR graphs for a given AR graph. We conduct evaluations against representative real-world jQuery questions posted on Stack Overflow, based on 308,294 code snippets collected from over 81,540 files on the Internet. The evaluation results show the effectiveness of RACS: the top 1 snippet produced by RACS matches the target code snippet for 46% questions, compared to only 4% achieved by a relationship-oblivious approach.

Code Relatives: Detecting Similarly Behaving Software
Fang-Hsiang Su, Jonathan Bell, Kenneth Harvey, Simha Sethumadhavan, Gail Kaiser, and Tony Jebara
(Columbia University, USA)
Detecting “similar code” is useful for many software engineering tasks. Current tools can help detect code with statically similar syntactic and/or semantic features (code clones) and with dynamically similar functional input/output (simions). Unfortunately, some code fragments that behave similarly at the finer granularity of their execution traces may be ignored. In this paper, we propose the term “code relatives” to refer to code with similar execution behavior. We define code relatives and then present DyCLINK, our approach to detecting code relatives within and across codebases. DyCLINK records instruction-level traces from sample executions, organizes the traces into instruction-level dynamic dependence graphs, and employs our specialized subgraph matching algorithm to efficiently compare the executions of candidate code relatives. In our experiments, DyCLINK analyzed 422+ million prospective subgraph matches in only 43 minutes. We compared DyCLINK to one static code clone detector from the community and to our implementation of a dynamic simion detector. The results show that DyCLINK effectively detects code relatives with a reasonable analysis time.

aec-badge-fse16-ae

Session 16: Program Repair
Thu, Nov 17, 11:00 - 12:30, Emerald 1 (Chair: Tien Nguyen)

Understanding and Generating High Quality Patches for Concurrency Bugs
Haopeng Liu, Yuxi Chen, and Shan Lu
(University of Chicago, USA)
Concurrency bugs are time-consuming for developers to fix correctly and a severe threat to software reliability. Although many auto-fixing techniques have been proposed recently for concurrency bugs, there is still a big gap between the quality of automatically generated patches and manually designed ones. This paper first conducts an in-depth study of manual patches for 77 real-world concurrency bugs, which provides both assessments of existing techniques and actionable suggestions for future research. Guided by this study, we design a new tool, HFix, that can automatically generate patches matching the quality of manual patches for many concurrency bugs.

Anti-patterns in Search-Based Program Repair
Shin Hwei Tan, Hiroaki Yoshida, Mukul R. Prasad, and Abhik Roychoudhury
(National University of Singapore, Singapore; Fujitsu Labs, USA)
Search-based program repair automatically searches for a program fix within a given repair space. This may be accomplished by retrofitting a generic search algorithm for program repair, as evidenced by the GenProg tool, or by building a customized search algorithm for program repair, as in SPR. Unfortunately, automated program repair approaches may produce patches that programmers reject, which is why past work has suggested deriving templates from human-written patches to guide program repair. In this work, we take the position that templates should not guide the repair search, because they may unduly restrict the repair space and overfit the repairs to one of the provided templates. Instead, we suggest the use of a set of anti-patterns: generic forbidden transformations that can be enforced on top of any search-based repair tool. We show that by enforcing our anti-patterns, we obtain repairs that localize the correct lines or functions, involve less deletion of program functionality, and are mostly obtained more efficiently. Since our set of anti-patterns is generic, we have integrated them into existing search-based repair tools, including GenProg and SPR, thereby allowing us to obtain higher quality program patches with minimal effort.
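
An invented example (not from the paper's benchmarks) of the kind of degenerate, functionality-deleting patch that anti-patterns are meant to forbid.

```java
// Invented scenario: suppose the only failing test checks that apply() does not
// throw on a null order. An unguided search can "fix" the bug with a patch that
// passes the suite while deleting the discount computation altogether.
public class Discount {
    static class Order {
        final double total;
        Order(double total) { this.total = total; }
        double total() { return total; }
    }

    // Buggy original: throws NullPointerException when order is null.
    static double apply(Order order) {
        return order.total() * 0.9;
    }

    // Degenerate candidate patch: the crashing test now passes, but the method
    // no longer computes anything. A forbidden-transformation rule against
    // wholesale deletion of functionality rejects patches of this shape,
    // steering the search toward a fix that keeps the computation.
    static double applyPatched(Order order) {
        return 0.0;
    }
}
```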

Info
Semi-supervised Verified Feedback Generation
Shalini Kaleeswaran, Anirudh Santhiar, Aditya Kanade, and Sumit Gulwani
(Indian Institute of Science, India; Microsoft Research, USA)
Students have enthusiastically taken to online programming lessons and contests. Unfortunately, they tend to struggle due to a lack of personalized feedback. There is an urgent need for program analysis and repair techniques capable of handling both the scale and the variation in student submissions, while ensuring the quality of feedback.
Towards this goal, we present a novel methodology called semi-supervised verified feedback generation. We cluster submissions by solution strategy and ask the instructor to identify or add a correct submission in each cluster. We then verify every submission in a cluster against the instructor-validated submission in the same cluster. If faults are detected in the submission then feedback suggesting fixes to them is generated. Clustering reduces the burden on the instructor and also the variations that have to be handled during feedback generation. The verified feedback generation ensures that only correct feedback is generated.
We implemented a tool, named CoderAssist, based on this approach and evaluated it on dynamic programming assignments. We have designed a novel counter-example guided feedback generation algorithm capable of suggesting fixes to all faults in a submission. In an evaluation on 2226 submissions to 4 problems, CoderAssist could generate verified feedback for 1911 (85%) submissions, taking 1.6s each on average. It does a good job of reducing the burden on the instructor: only one submission had to be manually validated or added for every 16 submissions.

WATERFALL: An Incremental Approach for Repairing Record-Replay Tests of Web Applications
Mouna Hammoudi, Gregg Rothermel, and Andrea Stocco
(University of Nebraska-Lincoln, USA; University of Genoa, Italy)
Software engineers use record/replay tools to capture use case scenarios that can serve as regression tests for web applications. Such tests, however, can be brittle in the face of code changes. Thus, researchers have sought automated approaches for repairing broken record/replay tests. To date, such approaches have operated by directly analyzing differences between the releases of web applications. Often, however, intermediate versions or commits exist between releases, and these represent finer-grained sequences of changes by which new releases evolve. In this paper, we present WATERFALL, an incremental test repair approach that applies test repair techniques iteratively across a sequence of fine-grained versions of a web application. The results of an empirical study on seven web applications show that our approach is substantially more effective than a coarse-grained approach (209% overall), while maintaining an acceptable level of overhead.

Session 17: Development Environments
Thu, Nov 17, 11:00 - 12:30, Emerald 2 (Chair: Dongmei Zhang)

Efficiency of Projectional Editing: A Controlled Experiment
Thorsten Berger, Markus Völter, Hans Peter Jensen, Taweesap Dangprasert, and Janet Siegmund
(Chalmers University of Technology, Sweden; University of Gothenburg, Sweden; itemis, Germany; IT University of Copenhagen, Denmark; University of Passau, Germany)
Projectional editors are editors where a user's editing actions directly change the abstract syntax tree without using a parser. They promise essentially unrestricted language composition as well as flexible notations, which supports aligning languages with their respective domain and constitutes an essential ingredient of model-driven development. Such editors have existed since the 1980s and gained widespread attention with the Intentional Programming paradigm, which used projectional editing at its core. However, despite the benefits, programming still mainly relies on editing textual code, where projectional editors imply a very different -- typically perceived as worse -- editing experience, often seen as the main challenge prohibiting their widespread adoption. We present an experiment on code-editing activities in a projectional editor, conducted with 19 graduate computer-science students and industrial developers. We investigate the effects of projectional editing on editing efficiency, editing strategies, and error rates -- each of which we also compare to conventional, parser-based editing. We observe that editing is efficient for basic-editing tasks, but that editing strategies and typical errors differ. More complex tasks require substantial experience and a better understanding of the abstract-syntax-tree structure -- then, projectional editing is also efficient. We also witness a tradeoff between fewer typing mistakes and an increased complexity of code editing.

ECHO: Instantaneous In Situ Race Detection in the IDE
Sheng Zhan and Jeff Huang
(Texas A&M University, USA)
We present ECHO, a new technique that detects data races instantaneously in the IDE while developers code. ECHO is the first technique of its kind for incremental race detection supporting both code addition and deletion in the IDE. Unlike conventional static race detectors, ECHO warns developers of potential data races immediately as they are introduced into the program. At the core of ECHO is a set of new change-aware static analyses based on a novel static happens-before graph that, given a program change, efficiently compute the change-relevant information without re-analyzing the whole program. Our evaluation within a Java environment on both popular benchmarks and real-world applications shows promising results: for each code addition or deletion, ECHO can instantly pinpoint all the races in a few milliseconds on average, three to four orders of magnitude faster than a conventional whole-program race detector with the same precision.
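
A minimal, invented example of the kind of race such an in-IDE detector would flag as soon as the second unsynchronized write is typed.

```java
// Invented example: both threads increment the shared field without a common
// lock or any other happens-before ordering, so every counter++ is involved in
// a data race. An incremental detector can report this the moment the second
// unsynchronized access is added, rather than after a whole-program re-analysis.
public class CounterRace {
    static int counter = 0;   // shared, unguarded

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (int i = 0; i < 10_000; i++) counter++; });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 10_000; i++) counter++; });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter);   // often less than 20000 because of the race
    }
}
```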

Info
Detecting Table Clones and Smells in Spreadsheets
Wensheng Dou, Shing-Chi Cheung, Chushu Gao, Chang Xu, Liang Xu, and Jun Wei
(Institute of Software at Chinese Academy of Sciences, China; Hong Kong University of Science and Technology, China; Nanjing University, China)
Spreadsheets are widely used by end users for various business tasks, such as data analysis and financial reporting. End users may perform similar tasks by cloning a block of cells (a table) in their spreadsheets. The corresponding cells in these cloned tables are supposed to keep the same or similar computational semantics. However, when spreadsheets evolve, cloned tables can become inconsistent due to ad-hoc modifications, and as a result suffer from smells. In this paper, we propose TableCheck to detect table clones and related smells due to inconsistency among them. We observe that two tables with the same header information at their corresponding cells are likely to be table clones. Inspired by existing fingerprint-based code clone detection techniques, we developed a detection algorithm to detect this kind of table clone. We further detected outliers among corresponding cells as smells in the detected table clones. We implemented our idea in TableCheck, and applied it to real-world spreadsheets from the EUSES corpus. Experimental results show that table clones commonly exist (21.8%), and 25.6% of the spreadsheets with table clones suffer from smells due to inconsistency among these clones. TableCheck detected table clones and their smells with a precision of 92.2% and 85.5%, respectively, while existing techniques detected no more than 35.6% of the true smells that TableCheck could detect.

Session 18: Concurrency
Thu, Nov 17, 14:30 - 16:00, Emerald 1 (Chair: Jeff Huang)

Flow-Sensitive Composition of Thread-Modular Abstract Interpretation
Markus Kusano and Chao Wang
(Virginia Tech, USA; University of Southern California, USA)
We propose a constraint-based flow-sensitive static analysis for concurrent programs by iteratively composing thread-modular abstract interpreters via the use of a system of lightweight constraints. Our method is compositional in that it first applies sequential abstract interpreters to individual threads and then composes their results. It is flow-sensitive in that the causality ordering of interferences (flow of data from global writes to reads) is modeled by a system of constraints. These interference constraints are lightweight since they only refer to the execution order of program statements as opposed to their numerical properties: they can be decided efficiently using an off-the-shelf Datalog engine. Our new method has the advantage of being more accurate than existing, flow-insensitive, static analyzers while remaining scalable and providing the expected soundness and termination guarantees even for programs with unbounded data. We implemented our method and evaluated it on a large number of benchmarks, demonstrating its effectiveness at increasing the accuracy of thread-modular abstract interpretation.

A Deployable Sampling Strategy for Data Race Detection
Yan Cai, Jian Zhang, Lingwei Cao, and Jian Liu
(Institute of Software at Chinese Academy of Sciences, China; Institute of Information Engineering at Chinese Academy of Sciences, China)
Dynamic data race detection incurs heavy runtime overheads. Recently, many sampling techniques have been proposed to detect data races. However, some sampling techniques (e.g., Pacer) are based on the traditional happens-before relation and incur a large basic overhead. Others utilize hardware to reduce their sampling overhead (e.g., DataCollider); however, they detect a race only when the race actually occurs, by delaying program executions. In this paper, we study the limitations of existing techniques and propose a new data race definition, named Clock Races, for low-overhead sampling. The innovation of clock races is that detecting them does not rely on concrete locks and also avoids the heavy basic overhead of tracking the happens-before relation. We further propose CRSampler (Clock Race Sampler) to detect clock races via hardware-based sampling without directly delaying program executions, to further reduce runtime overhead. We evaluated CRSampler on the Dacapo benchmarks. The results show that CRSampler incurred less than 5% overhead on average at a 1% sampling rate, whereas Pacer and DataCollider incurred more than 25% and 96% overhead, respectively. Moreover, at the same sampling rate, CRSampler detected significantly more data races than Pacer and DataCollider.

Online Shared Memory Dependence Reduction via Bisectional Coordination
Yanyan Jiang, Chang Xu, Du Li, Xiaoxing Ma, and Jian Lu
(Nanjing University, China; Carnegie Mellon University, USA)
Order of shared memory accesses, known as the shared memory dependence, is the cornerstone of dynamic analyses of concurrent programs. In this paper, we study the problem of reducing shared memory dependences. We present the first online software-only algorithm to reduce shared memory dependences without vector clock maintenance, opening a new direction to a broad range of applications (e.g., deterministic replay and data race detection). Our algorithm exploits a simple yet effective observation, that adaptive variable grouping can recognize and match spatial locality in shared memory accesses, to reduce shared memory dependences. We designed and implemented the bisectional coordination protocol, which dynamically maintains a partition of the program's address space without its prior knowledge, such that shared variables in each partitioned interval have consistent thread and spatial locality properties. Evaluation on a set of real-world programs showed that by paying a 0--54.7% (median 21%) slowdown, bisectional coordination reduced 0.95--97% (median 55%) and 16--99.99% (median 99%) shared memory dependences compared with RWTrace and LEAP, respectively.

Parallel Data Race Detection for Task Parallel Programs with Locks
Adarsh Yoga, Santosh Nagarakatte, and Aarti Gupta
(Rutgers University, USA; Princeton University, USA)
Programming with tasks is a promising approach to writing performance-portable parallel code. In this model, the programmer explicitly specifies tasks and the task parallel runtime employs work stealing to distribute tasks among threads. Similar to multithreaded programs, task parallel programs can also exhibit data races. Unfortunately, prior data race detectors for task parallel programs either run the program serially, do not handle locks, and/or detect races only in the schedule observed by the analysis. This paper proposes PTRacer, a parallel on-the-fly data race detector for task parallel programs that use locks. PTRacer detects data races not only in the observed schedule but also those that can happen in other schedules (which are permutations of the memory operations in the observed schedule) for a given input. It accomplishes this by leveraging the dynamic execution graph of a task parallel execution to determine whether two accesses can happen in parallel, and by maintaining a constant amount of access history metadata for each distinct set of locks held at each shared memory location. To detect data races (beyond the observed schedule) in programs with branches sensitive to scheduling decisions, we propose static compiler instrumentation that records the memory accesses that would be executed on the other path of simple branches. PTRacer has performance overheads similar to the state-of-the-art race detector for task parallel programs, SPD3, while detecting more races in programs with locks.

aec-badge-fse16-ae

Session 19: Open Source
Thu, Nov 17, 14:30 - 16:00, Emerald 2 (Chair: Mei Nagappan)

Paradise Unplugged: Identifying Barriers for Female Participation on Stack Overflow
Denae Ford, Justin Smith, Philip J. Guo, and Chris Parnin
(North Carolina State University, USA; University of California at San Diego, USA)
It is no secret that females engage less in programming fields than males. However, in online communities, such as Stack Overflow, this gender gap is even more extreme: only 5.8% of contributors are female. In this paper, we use a mixed-methods approach to identify contribution barriers females face in online communities. Through 22 semi-structured interviews with a spectrum of female users ranging from non-contributors to a top 100 ranked user of all time, we identified 14 barriers preventing them from contributing to Stack Overflow. We then conducted a survey with 1470 female and male developers to confirm which barriers are gender related or general problems for everyone. Females ranked five barriers significantly higher than males. A few of these include doubts in the level of expertise needed to contribute, feeling overwhelmed when competing with a large number of users, and limited awareness of site features. Still, there were other barriers that equally impacted all Stack Overflow users or affected particular groups, such as industry programmers. Finally, we describe several implications that may encourage increased participation in the Stack Overflow community across genders and other demographics.

Info
Why We Refactor? Confessions of GitHub Contributors
Danilo Silva, Nikolaos Tsantalis, and Marco Tulio Valente
(Federal University of Minas Gerais, Brazil; Concordia University, Canada)
Refactoring is a widespread practice that helps developers to improve the maintainability and readability of their code. However, there is a limited number of studies empirically investigating the actual motivations behind specific refactoring operations applied by developers. To fill this gap, we monitored Java projects hosted on GitHub to detect recently applied refactorings, and asked the developers to explain the reasons behind their decision to refactor the code. By applying thematic analysis on the collected responses, we compiled a catalogue of 44 distinct motivations for 12 well-known refactoring types. We found that refactoring activity is mainly driven by changes in the requirements and much less by code smells. Extract Method is the most versatile refactoring operation serving 11 different purposes. Finally, we found evidence that the IDE used by the developers affects the adoption of automated refactoring tools.

Info aec-badge-fse16-ae
Effectiveness of Code Contribution: From Patch-Based to Pull-Request-Based Tools
Jiaxin Zhu, Minghui Zhou, and Audris Mockus
(Peking University, China; University of Tennessee, USA)
Code contributions in Free/Libre and Open Source Software projects are controlled to maintain the high quality of the software. Alternatives to patch-based code contribution tools, such as mailing lists and issue trackers, have been developed, with pull request systems being the most visible and widely available on GitHub. Is the code contribution process more effective with pull request systems? To answer that, we quantify effectiveness via the rates at which contributions are accepted and ignored, via the time until the first response and final resolution, and via the numbers of contributions. To control for the latent variables, our study includes a project that migrated from an issue tracker to the GitHub pull request system and a comparison between projects using mailing lists and pull request systems. Our results show pull request systems to be associated with reduced review times and larger numbers of contributions. However, not all the comparisons indicate substantially better accept or ignore rates in pull request systems. These variations may be most simply explained by the differences in contribution practices the projects employ and may be less affected by the type of tool. Our results clarify the importance of understanding the role of tools in effective management of the broad network of potential contributors and may lead to strategies and practices making code contribution more satisfying and efficient from both contributors' and maintainers' perspectives.

Session 20: Test Generation
Thu, Nov 17, 14:30 - 16:00, Emerald 3 (Chair: Myra Cohen)

Isomorphic Regression Testing: Executing Uncovered Branches without Test Augmentation
Jie Zhang, Yiling Lou, Lingming Zhang, Dan Hao, Lu Zhang, and Hong Mei
(Peking University, China; University of Texas at Dallas, USA)
In software testing, it is very hard to achieve high coverage of the program under test, leaving many behaviors unexplored. To alleviate this problem, various automated test generation and augmentation approaches have been proposed, among which symbolic execution and search-based techniques are the most competitive, while each has key challenges to be solved. Different from prior work, we present a new methodology for regression testing, Isomorphic Regression Testing, which explores the behaviors of the program under test by creating its variants (i.e., modified programs) instead of generating tests. In this paper, we make the first implementation of isomorphic regression testing through an approach named ISON, which creates program variants by negating branch conditions. The results show that ISON is able to additionally execute 5.3% to 80.0% of branches that were originally uncovered. Furthermore, ISON also detects a number of faults not detected by a popular automated test generation tool (i.e., EvoSuite) under the scenario of regression testing.
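
A small, invented sketch of the core idea: a variant with one negated branch condition lets an existing regression test drive execution into the originally uncovered branch.

```java
// Invented example, not from the paper's evaluation. Suppose an existing
// regression test only calls withdraw(100, 40), so the error branch is never
// executed. Running the same test against the variant below (same method with
// the condition negated) executes the originally uncovered error-handling
// code without writing any new test.
public class WithdrawExample {
    // Original method: the test exercises only the "sufficient funds" branch.
    static double withdraw(double balance, double amount) {
        if (amount <= balance) {
            return balance - amount;
        }
        throw new IllegalArgumentException("insufficient funds");
    }

    // Variant created by negating the branch condition.
    static double withdrawVariant(double balance, double amount) {
        if (!(amount <= balance)) {
            return balance - amount;
        }
        throw new IllegalArgumentException("insufficient funds");
    }
}
```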

Directed Test Generation to Detect Loop Inefficiencies
Monika Dhok and Murali Krishna Ramanathan
(Indian Institute of Science, India)
Redundant traversal of loops in the context of other loops has been recently identified as a source of performance bugs in many Java libraries. This has resulted in the design of static and dynamic analysis techniques to detect these performance bugs automatically. However, while the effectiveness of dynamic analyses is dependent on the analyzed input tests, static analyses are less effective in automatically validating the presence of these problems, validating the fixes and avoiding regressions in future versions. This necessitates the design of an approach to automatically generate tests for exposing redundant traversal of loops.
In this paper, we design a novel, scalable and automatic approach that addresses this goal. Our approach takes a library and an initial set of coverage-driven randomly generated tests as input and generates tests which enable detection of redundant traversal of loops. Our approach is broadly composed of three phases – analysis of the execution of random tests to generate method summaries, identification of methods with potential nested loops along with the appropriate context to expose the problem, and test generation to invoke the identified methods with the appropriate parameters. The generated tests can be analyzed by existing dynamic tools to detect possible performance issues.
We have implemented our approach on top of the SOOT bytecode analysis framework and validated it on many open-source Java libraries. Our experiments demonstrate the effectiveness of our approach: it generates 224 tests that reveal 46 bugs across seven libraries, including 34 previously unknown bugs. The tests generated using our approach significantly outperform the randomly generated tests in their ability to expose the inefficiencies, demonstrating the usefulness of our design. The implementation of our tool, named Glider, is available at http://drona.csa.iisc.ac.in/~sss/tools/glider.

Field-Exhaustive Testing
Pablo Ponzio ORCID logo, Nazareno AguirreORCID logo, Marcelo F. Frias ORCID logo, and Willem Visser ORCID logo
(National University of Río Cuarto, Argentina; CONICET, Argentina; Buenos Aires Institute of Technology, Argentina; Stellenbosch University, South Africa)
We present a testing approach for object-oriented programs, which encompasses a testing criterion and an automated test generation technique. The criterion, which we call field-exhaustive testing, requires a user-provided limit n on the size of data domains, and is based on the idea of considering enough inputs so as to exhaustively cover the extension of class fields, within the limit n. Intuitively, the extension of a field f is the binary relation established between objects and their corresponding values for field f, in valid instances. Thus, a suite S is field-exhaustive if whenever a field f relates an object o with a value v (i.e., o.f = v) within a valid instance I of size bounded by n, then S contains at least one input I′ covering such a relationship, i.e., o must also be part of I′, and o.f = v must hold in I′. Our test generation technique uses incremental SAT solving to produce small field-exhaustive suites: field-exhaustiveness can be achieved with a suite containing at most #F × n² inputs, where #F is the number of fields in the class under test.
We perform an experimental evaluation on two different testing domains drawn from the literature: implementations of data structures, and of a refactoring engine. The experiments show that field-exhaustive suites can be computed efficiently, and retain similar levels of code coverage and mutation killing as significantly larger bounded exhaustive and random suites, thus consuming a fraction of the cost of test execution compared to these automated testing approaches.
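To make the criterion concrete, here is a minimal Java sketch (our own illustration, not taken from the paper) of what covering the extension of a single field means for a tiny linked-list class with bound n = 2:

```java
// Class under test with a single field 'next' whose extension must be covered.
class Node {
    Node next;
}

// With bound n = 2, valid instances contain at most two nodes, so the
// (object, value) pairs for 'next' that occur in valid instances are roughly:
//   (n1, null)  -- a single node with no successor
//   (n1, n2)    -- a node pointing to a second node
//   (n2, null)  -- the second node terminating the list
// A field-exhaustive suite needs at least one input exhibiting each pair;
// the two inputs below suffice, far fewer than a bounded-exhaustive suite,
// which would also enumerate every combination of all other fields.
class FieldExhaustiveExample {
    static Node singleNode() {
        return new Node();              // covers (n1, null)
    }
    static Node twoNodeList() {
        Node n1 = new Node();
        Node n2 = new Node();
        n1.next = n2;                   // covers (n1, n2) and (n2, null)
        return n1;
    }
}
```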


Visions and Reflections

Sustainable Software Design
Martin P. RobillardORCID logo
(McGill University, Canada)
Although design plays a central role in software development, the information produced in this activity is often left to progressively evaporate as the result of software evolution, loss of artifacts, or the fading of related knowledge held by the development team. This paper introduces the concept of sustainability for software design, and calls for its integration into the existing catalog of design quality attributes. Applied to software design, sustainability conveys the idea that a particular set of design decisions and their rationale can be succinctly reflected in the host technology and/or described in documentation in a way that is checkable for conformance with the code and generally resistant to evaporation. The paper discusses the relation between sustainability and existing research areas in software engineering, and highlights future research challenges related to sustainable software design.

Designing for Dystopia: Software Engineering Research for the Post-apocalypse
Titus Barik, Rahul Pandita, Justin Middleton, and Emerson Murphy-Hill
(North Carolina State University, USA)
Software engineering researchers have a tendency to be optimistic about the future. Though useful, optimism bias bolsters unrealistic expectations towards desirable outcomes. We argue that explicitly framing software engineering research through pessimistic futures, or dystopias, will mitigate optimism bias and engender more diverse and thought-provoking research directions. We demonstrate through three pop culture dystopias, Battlestar Galactica, Fallout 3, and Children of Men, how reflecting on dystopian scenarios provides research opportunities as well as implications, such as making research accessible to non-experts, that are relevant to our present.

Disrupting Developer Productivity One Bot at a Time
Margaret-Anne Storey and Alexey Zagalsky
(University of Victoria, Canada)
Bots are used to support different software development activities, from automating repetitive tasks to bridging knowledge and communication gaps in software teams. We anticipate the use of Bots will increase and lead to improvements in software quality and developer and team productivity, but what if the disruptive effect is not what we expect?
Our goal in this paper is to provoke and inspire researchers to study the impact (positive and negative) of Bots on software development. We outline the modern Bot landscape and use examples to describe the common roles Bots occupy in software teams. We propose a preliminary cognitive support framework that can be used to understand these roles and to reflect on the impact of Bots in software development on productivity. Finally, we consider challenges that Bots may bring and propose some directions for future research.

Training the Future Workforce through Task Curation in an OSS Ecosystem
Anita SarmaORCID logo, Marco Aurélio Gerosa, Igor Steinmacher, and Rafael Leano
(Oregon State University, USA; University of São Paulo, Brazil; Federal University of Technology Paraná, Brazil)
Volunteers to Open Source Software (OSS) projects contribute not only to help create software that they use, but also to gain skills and enrich their expertise and resumes. However, newcomers to OSS face several challenges when joining a project. In particular, they do not know where to start, or which tasks they can be successful at. Here, we describe our vision towards BugExchange, a system that curates tasks from OSS projects and helps train newcomers. While evaluating and executing these tasks, newcomers can gain an understanding of the project, its technology, and its concepts. There are many challenges in designing such a system, for example, identifying the information needs of newcomers, creating task recommendations that match newcomers’ skills and career goals, and providing mentoring and networking support. We plan to leverage our previous work to conceive and prototype our system, which will include multiple research lines. BugExchange has the potential to improve newcomer learning experiences, reduce dropouts, and foster community building.

Reaching the Masses: A New Subdiscipline of App Programmer Education
Charles Weir, Awais Rashid, and James NobleORCID logo
(Security Lancaster, UK; Victoria University of Wellington, New Zealand)
Programmers’ lack of knowledge and interest in secure development threatens everyone who uses mobile apps. The rise of apps has engaged millions of independent app developers, who rarely encounter any but low-level security techniques. But what if software security were presented as a game, or a story, or a discussion? What if learning app security techniques could be fun as well as empowering? Only by introducing the powerful motivating techniques developed for other disciplines can we hope to upskill independent app developers, and achieve the security that we’ll need in 2025 to safeguard our identities and our data.

Studying Developer Gaze to Empower Software Engineering Research and Practice
Bonita Sharif, Benjamin Clark, and Jonathan I. Maletic
(Youngstown State University, USA; Kent State University, USA)
A new research paradigm is proposed that leverages developer eye gaze to improve the state of the art in software engineering research and practice. The vision of this new paradigm for use on software engineering tasks such as code summarization, code recommendations, prediction, and continuous traceability is described. Based on this new paradigm, it is foreseen that new benchmarks will emerge based on developer gaze. The research borrows from cognitive psychology, artificial intelligence, information retrieval, and data mining. It is hypothesized that new algorithms will be discovered that work with eye gaze data to help improve current IDEs, thus improving developer productivity. Conducting empirical studies using an eye tracker will lead to inventing, evaluating, and applying innovative methods and tools that use eye gaze to support the developer. The implications and challenges of this paradigm for future software engineering research are discussed.

DeepSoft: A Vision for a Deep Model of Software
Hoa Khanh Dam, Truyen Tran, John Grundy, and Aditya Ghose
(University of Wollongong, Australia; Deakin University, Australia)
Although software analytics has experienced rapid growth as a research area, it has not yet reached its full potential for wide industrial adoption. Most of the existing work in software analytics still relies heavily on costly manual feature engineering processes, and it mainly addresses traditional classification problems, as opposed to predicting future events. We present a vision for DeepSoft, an end-to-end generic framework for modeling software and its development process to predict future risks and recommend interventions. DeepSoft, partly inspired by human memory, is built upon the powerful deep learning-based Long Short Term Memory architecture that is capable of learning long-term temporal dependencies that occur in software evolution. Such deep learned patterns of software can be used to address a range of challenging problems such as code and task recommendation and prediction. DeepSoft provides a new approach for research into modeling of source code, risk prediction and mitigation, developer modeling, and automatically generating code patches from bug reports.

Budgeted Testing through an Algorithmic Lens
Myra B. Cohen, A. Pavan, and N. V. Vinodchandran
(University of Nebraska-Lincoln, USA; Iowa State University, USA)
Automated testing has been a focus of research for a long time, and as a result we tend to think about it in a coverage-centric manner. Testing budgets have also driven research such as prioritization and test selection, but as a secondary concern. As our systems get larger, become more dynamic, and impact more people with each change, we argue that we should switch from a coverage-centric view to a budgeted-testing-centric view. Researchers in other fields have designed approximation algorithms for such budgeted scenarios, and these are often simple to implement and run. In this paper we present an exemplar study on combinatorial interaction testing (CIT) to show that a budgeted greedy algorithm, when adapted to our problem for various budgets, does almost as well coverage-wise as a state-of-the-art greedy CIT algorithm, better in some cases than a state-of-the-art simulated annealing algorithm, and always improves over random. This suggests that we might benefit from switching our focus in large systems from coverage to budgets.
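To illustrate the general flavor of a budgeted greedy algorithm (a generic sketch of our own, not the paper's CIT algorithm), the following Java snippet picks at most `budget` candidate tests, each time taking the one that covers the most still-uncovered requirements (e.g., interactions):

```java
import java.util.*;

// Generic budget-constrained greedy coverage: select up to 'budget' candidate
// tests, always choosing the one with the largest marginal coverage gain.
class BudgetedGreedy {
    static <T> List<Set<T>> select(List<Set<T>> candidates, int budget) {
        Set<T> covered = new HashSet<>();
        List<Set<T>> suite = new ArrayList<>();
        for (int i = 0; i < budget; i++) {
            Set<T> best = null;
            int bestGain = 0;
            for (Set<T> c : candidates) {
                Set<T> gain = new HashSet<>(c);
                gain.removeAll(covered);       // marginal coverage of c
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = c;
                }
            }
            if (best == null) break;           // nothing new can be covered
            suite.add(best);
            covered.addAll(best);
        }
        return suite;
    }
}
```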

Reasoning with Imprecise Privacy Preferences
Inah Omoronyia
(University of Glasgow, UK)
User reluctance and context-dependent factors during information disclosure imply that people cannot always be counted on to indicate their appropriate privacy preference. This phenomenon is the well-known 'privacy paradox', which shows that users of modern technologies are constantly concerned about their privacy, but do not apply these concerns to their usage behaviour accordingly. The problem is that this mismatch between privacy concerns and the indicated privacy preference in software is not considered when reasoning about the satisfaction of privacy requirements.
This paper is a research vision that draws connections between the imprecision in user privacy preferences and reasoning about the satisfaction of privacy requirements. We outline the close relationship between privacy and user beliefs and uncertainties. We then propose a multi-agent framework that leverages this relationship when reasoning about the satisfaction of privacy requirements. We anticipate that this vision will help reduce the gap between an increasingly complex information age and the software techniques needed to protect user privacy.


Industrial Papers

Bing Developer Assistant: Improving Developer Productivity by Recommending Sample Code
Hongyu ZhangORCID logo, Anuj Jain, Gaurav Khandelwal, Chandrashekhar Kaushik, Scott Ge, and Wenxiang Hu
(Microsoft Research, China; Microsoft, India; Microsoft, USA; Microsoft, China)
In programming practice, developers often need sample code in order to learn how to solve a programming-related problem, for example, how to reuse an Application Programming Interface (API) of a large-scale software library or how to implement a certain functionality. We believe that previously written code can help developers understand how others addressed similar problems and can help them write new programs. We develop a tool called Bing Developer Assistant (BDA), which improves developer productivity by recommending sample code mined from public software repositories (such as GitHub) and web pages (such as Stack Overflow). BDA can automatically mine code snippets that implement an API or answer a code search query. It has been implemented as a freely downloadable extension of Microsoft Visual Studio and has received more than 670K downloads since its initial release in December 2014. BDA is publicly available at: http://aka.ms/devassistant.

Cluster-Based Test Suite Functional Analysis
Marcel Zalmanovici, Orna Raz, and Rachel Tzoref-BrillORCID logo
(IBM Research, Israel)
A common industrial challenge is that of analyzing large legacy free text test suites in order to comprehend their functional content. The analysis results are used for different purposes, such as dividing the test suite into disjoint functional parts for automation and management purposes, identifying redundant test cases, and extracting models for combinatorial test generation while reusing the legacy test suite. Currently the analysis is performed manually, which hinders the ability to analyze many such large test suites due to time and resource constraints.
We report on our practical experience in automated analysis of real-world free text test suites from six different industrial companies. Our novel, cluster-based approach provides significant time savings for the analysis of the test suites, varying from a reduction of 35% to 97% compared to the human time required, thus enabling functional analysis in many cases where manual analysis is infeasible in practice.

A Portable Interface for Runtime Energy Monitoring
Connor ImesORCID logo, Lars Bergstrom, and Henry HoffmannORCID logo
(University of Chicago, USA; Mozilla Research, USA)
As energy consumption becomes a first-class concern for computing systems, there is an increasing need for application-level access to runtime power/energy measurements. To support this need, a growing number of power and energy monitors are being developed, each with its own interface. In fact, the approaches are extremely diverse, and porting energy-aware code to new platforms with new hardware can involve significant rewriting effort. To reduce this effort and support portable, application-level energy monitoring, a common interface is needed. In this paper, we propose EnergyMon, a portable application interface that is independent of underlying power/energy data sources. We demonstrate EnergyMon's flexibility with two case studies -- energy-aware profiling and self-adaptive systems -- each of which requires monitoring energy across a range of hardware from different manufacturers. We release the EnergyMon interface, implementations, utilities, and Java and Rust bindings and abstractions as open source.
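The following Java sketch is purely hypothetical (the interface and method names are our assumptions, not the actual EnergyMon bindings); it only illustrates the general idea of application code depending on a portable energy-reading interface while each platform supplies its own implementation:

```java
// Hypothetical portable energy-reading interface (names are illustrative
// assumptions, not the real EnergyMon API).
interface EnergyReader {
    long readEnergyMicroJoules();   // cumulative energy since initialization
    long intervalMicroSeconds();    // refresh interval of the underlying source
    String source();                // e.g. a platform-specific sensor name
}

class EnergyProfiler {
    private final EnergyReader reader;
    EnergyProfiler(EnergyReader reader) { this.reader = reader; }

    // Measure the energy consumed by a task, independent of the data source.
    long measure(Runnable task) {
        long before = reader.readEnergyMicroJoules();
        task.run();
        return reader.readEnergyMicroJoules() - before;
    }
}
```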

Learning for Test Prioritization: An Industrial Case Study
Benjamin Busjaeger and Tao Xie
(Salesforce.com, USA; University of Illinois at Urbana-Champaign, USA)
Modern cloud-software providers, such as Salesforce.com, increasingly adopt large-scale continuous integration environments. In such environments, assuring high developer productivity is strongly dependent on conducting testing efficiently and effectively. Specifically, to shorten feedback cycles, test prioritization is popularly used as an optimization mechanism for ranking tests to run by their likelihood of revealing failures. To apply test prioritization in industrial environments, we present a novel approach (tailored for practical applicability) that integrates multiple existing techniques via a systematic framework of machine learning to rank. Our initial empirical evaluation on a large real-world dataset from Salesforce.com shows that our approach significantly outperforms existing individual techniques.

Combinatorial Generation of Structurally Complex Test Inputs for Commercial Software Applications
Hua Zhong, Lingming Zhang, and Sarfraz Khurshid
(Google, USA; University of Texas at Austin, USA; University of Texas at Dallas, USA)
Despite recent progress in automated test generation research, significant challenges remain for applying these techniques on large-scale software systems. These systems under test often require structurally complex test inputs within a large input domain. It is challenging to automatically generate a reasonable number of tests that are both legal and behaviorally-diverse to exercise these systems. Constraint-based test generation is an effective approach for generating structurally complex inputs for systematic testing. While this approach can typically generate large numbers of tests, it has limited scalability – tests generated are usually only up to a small bound on input size. Combinatorial test generation, e.g., pair-wise testing, is a more scalable approach but is challenging to apply on commercial software systems that require complex input structures that cannot be formed by using arbitrary combinations. This paper introduces comKorat, which unifies constraint-based generation of structurally complex tests with combinatorial testing. Specifically, comKorat integrates Korat and ACTS test generators to generate test suites for large scale software systems with structurally complex test inputs. We have successfully applied comKorat on four software applications developed at eBay and Yahoo!. The experimental results show that comKorat outperforms existing solutions in execution time and test coverage. Furthermore, comKorat found a total of 59 previously unknown bugs in the four applications.

Automated Test Input Generation for Android: Are We Really There Yet in an Industrial Case?
Xia Zeng, Dengfeng Li, Wujie Zheng, Fan Xia, Yuetang Deng ORCID logo, Wing Lam, Wei Yang, and Tao Xie
(Tencent, China; University of Illinois at Urbana-Champaign, USA)
Given the ever increasing number of research tools to automatically generate inputs to test Android applications (or simply apps), researchers recently asked the question "Are we there yet?" (in terms of the practicality of the tools). By conducting an empirical study of the various tools, the researchers found that Monkey (the most widely used tool of this category in industrial practices) outperformed all of the research tools that they studied. In this paper, we present two significant extensions of that study. First, we conduct the first industrial case study of applying Monkey against WeChat, a popular messenger app with over 762 million monthly active users, and report the empirical findings on Monkey's limitations in an industrial setting. Second, we develop a new approach to address major limitations of Monkey and accomplish substantial code-coverage improvements over Monkey, along with empirical insights for future enhancements to both Monkey and our approach.


Tool Demonstrations
Wed, Nov 16, 15:30 - 16:30, Foyer 3rd/4th Floor

NonDex: A Tool for Detecting and Debugging Wrong Assumptions on Java API Specifications
Alex Gyori, Ben Lambeth, August Shi, Owolabi Legunsen, and Darko MarinovORCID logo
(University of Illinois at Urbana-Champaign, USA)
We present NonDex, a tool for detecting and debugging wrong assumptions on Java APIs. Some APIs have underdetermined specifications to allow implementations to achieve different goals, e.g., to optimize performance. When clients of such APIs assume stronger-than-specified guarantees, the resulting client code can fail. For example, HashSet’s iteration order is underdetermined, and code assuming some implementation-specific iteration order can fail. NonDex helps to proactively detect and debug such wrong assumptions. NonDex performs detection by randomly exploring different behaviors of underdetermined APIs during test execution. When a test fails during exploration, NonDex searches for the invocation instance of the API that caused the failure. NonDex is open source, well-integrated with Maven, and also runs from the command line. During our experiments with the NonDex Maven plugin, we detected 21 new bugs in eight Java projects from GitHub, and, using the debugging feature of NonDex, we identified the underlying wrong assumptions for these 21 new bugs and 54 previously detected bugs. We opened 13 pull requests; developers already accepted 12, and one project changed the continuous-integration configuration to run NonDex on every push. The demo video is at: https://youtu.be/h3a9ONkC59c
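A minimal example (our own, not from the paper) of the kind of wrong assumption NonDex targets: HashSet's iteration order is unspecified, so a test that encodes one particular order may pass on one JVM and fail when other permitted orders are explored.

```java
import java.util.*;

// Brittle code: the assertion encodes a specific HashSet iteration order.
// It may pass on a particular JVM but fail when an underdetermined-API
// explorer such as NonDex randomly exercises other legal orders.
class IterationOrderAssumption {
    static String join(Set<String> items) {
        StringBuilder sb = new StringBuilder();
        for (String s : items) {
            sb.append(s).append(',');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> set = new HashSet<>(Arrays.asList("a", "b", "c"));
        // Wrong assumption: HashSet does not guarantee this order.
        assert join(set).equals("a,b,c,") : "order-dependent assumption failed";
    }
}
```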

TIPMerge: Recommending Developers for Merging Branches
Catarina Costa, Jair Figueiredo, Anita SarmaORCID logo, and Leonardo Murta
(Federal University of Acre, Brazil; Federal Fluminense University, Brazil; Oregon State University, USA)
Development in large projects often involves branches, where changes are performed in parallel and merged periodically. This merge process often combines two independent and long sequences of commits that may have been performed by multiple, different developers. It is nontrivial to identify the right developer to perform the merge, as the developer must have enough understanding of changes in both branches to ensure that the merged changes comply with the objective of both lines of work (branches), which may have been active for several months. We designed and developed TIPMerge, a novel tool that recommends developers who are best suited to perform the merge between two given branches. TIPMerge does so by taking into consideration developers’ past experience in the project, their changes in the branches, and the dependencies among modified files in the branches. In this paper we demonstrate TIPMerge over a real merge case from the Voldemort project.

Time-Travel Debugging for JavaScript/Node.js
Earl T. Barr, Mark Marron, Ed Maurer, Dan Moseley, and Gaurav Seth
(University College London, UK; Microsoft Research, USA; Microsoft, USA)
Time-traveling in the execution history of a program during debugging enables a developer to precisely track and understand the sequence of statements and program values leading to an error. To provide this functionality to real world developers, we embarked on a two year journey to create a production quality time-traveling debugger in Microsoft's open-source ChakraCore JavaScript engine and the popular Node.js application framework.

PUMConf: A Tool to Configure Product Specific Use Case and Domain Models in a Product Line
Ines Hajri, Arda Goknil, Lionel C. BriandORCID logo, and Thierry Stephany
(University of Luxembourg, Luxembourg; IEE, Luxembourg)
We present PUMConf, a tool for supporting configuration that currently focuses on requirements and enables effective product line management in the context of use case-driven development. By design, it relies exclusively on variability modeling for artifacts that are commonly used in such contexts (i.e., use case diagram, specifications and domain model). For given Product Line (PL) use case and domain models, PUMConf checks the consistency of the models, interactively receives configuration decisions from analysts, automatically checks decision consistency, and generates Product Specific (PS) use case and domain models from the PL models and decisions. It has been evaluated on an industrial case study in the automotive domain.

T2API: Synthesizing API Code Usage Templates from English Texts with Statistical Translation
Thanh Nguyen, Peter C. Rigby ORCID logo, Anh Tuan Nguyen, Mark Karanfil, and Tien N. Nguyen
(Iowa State University, USA; Concordia University, Canada; University of Texas at Dallas, USA)
In this work, we develop T2API, a statistical machine translation-based tool that takes a given English description of a programming task as a query, and synthesizes the API usage template for the task by learning from training data. T2API works in two steps. First, it derives the API elements relevant to the task described in the input by statistically learning from a StackOverflow corpus of text descriptions and corresponding code. To infer those API elements, it also considers the context of the words in the textual input and the context of API elements that often go together in the corpus. The inferred API elements with their relevance scores are ensembled into an API usage by our novel API usage synthesis algorithm that learns the API usages from a large code corpus via a graph-based language model. Importantly, T2API is capable of generating new API usages from smaller, previously-seen usages.

JBSE: A Symbolic Executor for Java Programs with Complex Heap Inputs
Pietro Braione, Giovanni Denaro, and Mauro PezzèORCID logo
(University of Milano-Bicocca, Italy; University of Lugano, Switzerland)
We present the Java Bytecode Symbolic Executor (JBSE), a symbolic executor for Java programs that operates on complex heap inputs. JBSE implements both the novel Heap EXploration Logic (HEX), a symbolic execution approach to deal with heap inputs, and the main state-of-the-art approaches that handle data structure constraints expressed as either executable programs (repOk methods) or declarative specifications. JBSE is the first symbolic executor specifically designed to deal with programs that operate on complex heap inputs, to experiment with the main state-of-the-art approaches, and to combine different decision procedures to explore possible synergies among approaches for handling symbolic data structures.

ARdoc: App Reviews Development Oriented Classifier
Sebastiano Panichella, Andrea Di Sorbo, Emitza Guzman, Corrado A. Visaggio, Gerardo Canfora, and Harald C. GallORCID logo
(University of Zurich, Switzerland; University of Sannio, Italy)
Google Play, Apple App Store and Windows Phone Store are well-known distribution platforms where users can download mobile apps, rate them and write review comments about the apps they are using. Previous research studies demonstrated that these reviews contain important information to help developers improve their apps. However, analyzing reviews is challenging due to the large number of reviews posted every day, the unstructured nature of reviews, and their varying quality.
In this demo we present ARdoc, a tool which combines three techniques: (1) Natural Language Parsing, (2) Text Analysis and (3) Sentiment Analysis to automatically classify feedback contained in app reviews that is useful for performing software maintenance and evolution tasks. Our quantitative and qualitative analysis (involving professional mobile developers) demonstrates that ARdoc correctly classifies feedback useful from a maintenance perspective in user reviews with high precision (ranging between 84% and 89%), recall (ranging between 84% and 89%), and F-measure (ranging between 84% and 89%). While evaluating our tool, the developers in our study confirmed the usefulness of ARdoc in extracting important maintenance tasks for their mobile applications.
Demo URL: https://youtu.be/Baf18V6sN8E
Demo Web Page: http://www.ifi.uzh.ch/seal/people/panichella/tools/ARdoc.html

Hunter: Next-Generation Code Reuse for Java
Yuepeng Wang, Yu Feng, Ruben Martins, Arati Kaushik, Isil Dillig ORCID logo, and Steven P. Reiss
(University of Texas at Austin, USA; Brown University, USA)
In many common scenarios, programmers need to implement functionality that is already provided by some third party library. This paper presents a tool called Hunter that facilitates code reuse by finding relevant methods in large code bases and automatically synthesizing any necessary wrapper code. Since Hunter internally uses advanced program synthesis technology, it can automatically reuse existing methods even when code adaptation is necessary. We have implemented Hunter as an Eclipse plug-in and evaluate it by (a) comparing it against S6, a state-of-the-art code reuse tool, and (b) performing a user study. Our evaluation shows that Hunter compares favorably with S6 and increases programmer productivity.

BigDebug: Interactive Debugger for Big Data Analytics in Apache Spark
Muhammad Ali Gulzar, Matteo Interlandi, Tyson Condie, and Miryung Kim ORCID logo
(University of California at Los Angeles, USA)
To process massive quantities of data, developers leverage data-intensive scalable computing (DISC) systems in the cloud, such as Google's MapReduce, Apache Hadoop, and Apache Spark. In terms of debugging, DISC systems support post-mortem log analysis but do not provide interactive debugging features in real time. This tool demonstration paper showcases a set of concrete use cases on how BigDebug can help debug Big Data applications by providing interactive, real-time debug primitives. To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints to enable a user to inspect a program without actually pausing the entire computation. To minimize unnecessary communication and data transfer, BigDebug provides on-demand watchpoints that enable a user to retrieve intermediate data using a guard and transfer the selected data on demand. To support systematic and efficient trial-and-error debugging, BigDebug also enables users to change program logic in response to an error at runtime and replay the execution from that step. BigDebug is available for download at http://web.cs.ucla.edu/~miryung/software.html

Visualizing Code and Coverage Changes for Code Review
Sebastiaan Oosterwaal, Arie van Deursen ORCID logo, Roberta Coelho, Anand Ashok Sawant, and Alberto Bacchelli
(Delft University of Technology, Netherlands; Federal University of Rio Grande do Norte, Brazil)
One of the tasks of reviewers is to verify that code modifications are well tested. However, current tools offer little support in understanding precisely how changes to the code relate to changes to the tests. In particular, it is hard to see whether (modified) test code covers the changed code. To mitigate this problem, we developed Operias, a tool that provides a combined visualization of fine-grained source code differences and coverage impact. Operias works both as a stand-alone tool on specific project versions and as a service hooked to GitHub. In the latter case, it provides automated reports for each new pull request, which reviewers can use to assess the code contribution. Operias works for any Java project that builds with Maven and its standard Cobertura coverage plugin. We present how Operias could be used to identify test-related problems in real-world pull requests. Operias is open source and available on GitHub with a demo video: https://github.com/SERG-Delft/operias

End-to-End Memory Behavior Profiling with DINAMITE
Svetozar Miucin, Conor Brady, and Alexandra Fedorova
(University of British Columbia, Canada; Simon Fraser University, Canada)
Performance bottlenecks related to a program's memory behavior are common, yet very hard to debug. Tools that attempt to aid software engineers in diagnosing these bugs are typically designed to handle specific use cases; they do not provide information to comprehensively explore memory problems and to find solutions. Detailed traces of memory accesses would enable developers to ask various questions about the program's memory behaviour, but these traces quickly become very large even for short executions. We present DINAMITE: a toolkit for Dynamic INstrumentation and Analysis for MassIve Trace Exploration. DINAMITE instruments every memory access with rich debug information and provides a suite of extensible analysis tools to aid programmers in pinpointing memory bottlenecks.

Validate Your SPDX Files for Open Source License Violations
Demetris Paschalides and Georgia M. Kapitsaki
(University of Cyprus, Cyprus)
Licensing decisions for new Open Source Software are not always straightforward. However, the license that accompanies the software is important as it largely affects its subsequent distribution and reuse. License information for software products is captured - among other data - in the Software Package Data Exchange (SPDX) files. The SPDX specification is gaining popularity in the software industry and has been adopted by many organizations internally. In this demonstration paper, we present our tool for the validation of SPDX files regarding proper license use. Software packages described in SPDX format are examined in order to detect license violations that may occur when a product combines different software sources that carry different and potentially contradicting licenses. The SPDX License Validation Tool (SLVT) gives the opportunity to check the compatibility of one or more SPDX files. The evaluation performed on a number of software packages demonstrates its usefulness for drawing conclusions on license use, revealing violations in some of the test projects.

FSX: A Tool for Fine-Grained Incremental Unit Test Generation for C/C++ Programs
Hiroaki Yoshida, Susumu Tokumoto, Mukul R. Prasad, Indradeep Ghosh, and Tadahiro Uehara
(Fujitsu Labs, USA; Fujitsu Labs, Japan)
Automated unit test generation bears the promise of significantly reducing test cost and hence improving software quality. However, the maintenance cost of the automatically generated tests presents a significant barrier to adoption of this technology. To address this challenge, in previous work, we proposed a novel technique for automated and fine-grained incremental generation of unit tests through minimal augmentation of an existing test suite. In this paper we describe a tool FSX, implementing this technique. We describe the architecture, user-interface, and salient features of FSX, and specific practical use-cases of its technology. We also report on a real, large-scale deployment of FSX, as a practical validation of the underlying research contribution and of automated test generation research in general.


Doctoral Symposium
Mon, Nov 14, 09:00 - 18:00, Seattle 3 (Chair: Felienne Hermans, Emerson Murphy-Hill)

Refactoring and Migration of Cascading Style Sheets: Towards Optimization and Improved Maintainability
Davood Mazinanian
(Concordia University, Canada)
Cascading Style Sheets (CSS) is the standard styling language and is extensively used for defining the presentation of web, mobile and desktop applications. Despite its popularity, the language's design shortcomings have made CSS development and maintenance challenging. This thesis aims at developing techniques for safely transforming CSS code (through refactoring, or migration to a preprocessor language), with the goals of optimization and improved maintainability.

Developing a Reusable Control-Based Approach to Build Self-Adaptive Software Systems with Formal Guarantees
Stepan Shevtsov
(Linnaeus University, Sweden)
An increasingly important concern of software engineers is handling uncertainty at runtime. Over the last decade researchers have applied architecture-based self-adaptation approaches to address this concern. However, providing guarantees required by current software systems has shown to be challenging with these approaches. To tackle this challenge, we study the application of control theory to realize self-adaptation and develop novel control-based adaptation mechanisms that guarantee desired system properties. Results are validated on systems with strict requirements.

Automating Repetitive Code Changes using Examples
Reudismam Rolim
(Federal University of Campina Grande, Brazil)
While adding features, fixing bugs, or refactoring the code, developers may perform repetitive code edits. Although Integrated Development Environments (IDEs) automate some transformations such as renaming, many repetitive edits are performed manually, which is error-prone and time-consuming. To help developers apply these edits, we propose a technique to perform repetitive edits using examples. The technique receives as input the source code before and after the developer edits some target locations of the change and produces as output the top-ranked program transformation that can be applied to edit the remaining target locations in the codebase. The technique uses a state-of-the-art program synthesis methodology and has three main components: a) a DSL for describing program transformations; b) synthesis algorithms to learn program transformations in this DSL; c) ranking algorithms to select the program transformation with the highest probability of performing the desired repetitive edit. In our preliminary evaluation, on a dataset of 59 repetitive edit cases taken from real C# source code repositories, the technique performed the intended transformation in 83% of the cases, using only 2.8 examples on average.

Understanding and Improving Continuous Integration
Michael Hilton
(Oregon State University, USA)
Continuous Integration (CI) has been widely adopted in the software development industry. However, the usage of CI in practice has been ignored for far too long by the research community. We propose to fill this blind spot by doing in-depth research into CI usage in practice. We will answer "how" questions using quantitative methods, such as investigating open source data that is publicly available. We will answer "why" questions using qualitative methods, such as semi-structured interviews and large-scale surveys. In the course of our research, we plan to identify barriers that developers face when using CI. We will develop techniques to overcome those barriers via automation. This work is advised by Professor Danny Dig.

Guided Code Synthesis using Deep Neural Networks
Carol V. Alexandru
(University of Zurich, Switzerland)
Can we teach computers how to program? Recent advances in neural network research reveal that certain neural networks are able not only to learn the syntax, grammar and semantics of arbitrary character sequences, but also to synthesize new samples "in the style of" the original training data. We explore the adaptation of these techniques to code classification, comprehension and completion.

Generating Interactive Web Pages from Storyboards
Pavel Panchekha
(University of Washington, USA)
Web design is a visual art, but web designs are code. Designers work visually and must then manually translate their designs to code. We propose using synthesis to automatically translate the storyboards designers already produce into CSS stylesheets and JavaScript code. To build a synthesis tool for this complex domain, we will use a novel composition mechanism that allows splitting the synthesis task among domain-specific solvers.
We have built a domain-specific solver for CSS stylesheets; solvers for DOM actions and JavaScript code can be built with similar techniques. To compose the three domain-specific solvers, we propose using partial counterexamples to exchange information between different domains. Early results suggest that this composition mechanism is fast and allows specializing each solver to its domain.

Data Structure Synthesis
Calvin Loncaric
(University of Washington, USA)
All mainstream languages ship with libraries implementing lists, maps, sets, trees, and other common data structures. These libraries are sufficient for some use cases, but other applications need specialized data structures with different operations. For such applications, the standard libraries are not enough.
I propose to develop techniques to automatically synthesize data structure implementations from high-level specifications. My initial results on a large class of collection data structures demonstrate that this is possible and lend hope to the prospect of general data structure synthesis. Synthesized implementations can save programmer time and improve correctness while matching the performance of handwritten code.

Understanding Behavioural Patterns in JavaScript
Saba Alimadadi
(University of British Columbia, Canada)
JavaScript is one of the most popular programming languages. However, understanding the dynamic behaviour of JavaScript apps is challenging in practice. There are many factors that hinder JavaScript comprehension, such as its dynamic, asynchronous, and event-driven nature, the dynamic interplay between JavaScript and the Document Object Model, and the asynchronous communication between client and server. In this research work, we have already proposed methods for understanding event-based and asynchronous JavaScript behaviour. To enhance the scalability of our methods, we propose a new technique that adopts bio-informatics algorithms to extract sequences of actions from execution traces that form higher-level patterns.

Regression Testing of Web Applications using Record/Replay Tools
Mouna Hammoudi
(University of Nebraska-Lincoln, USA)
Software engineers often use record/replay tools to enable the automated testing of web applications. Tests created in this manner can then be used to regression test new versions of the web applications as they evolve. Web application tests recorded by record/replay tools, however, can be quite brittle; they can easily break as applications change. For this reason, researchers have begun to seek approaches for automatically repairing record/replay tests. This research investigates different aspects in relation to testing web applications using record/replay tools. The areas that we are interested in include taxonomizing the causes behind breakages and developing automated techniques to repair breakages, creating prevention techniques to stop the occurrence of breakages, and developing automated frameworks for root cause analysis. Finally, we intend to evaluate all of these activities via controlled studies involving software engineers and real web application tests.

Supporting Change in Product Lines within the Context of Use Case-Driven Development and Testing
Ines Hajri
(University of Luxembourg, Luxembourg)
Product Line Engineering (PLE) is becoming a common practice in industry to enhance product quality, to reduce development costs, and to improve time-to-market. At the same time, many development contexts are use case-driven and this strongly influences their requirements engineering and system testing practices. In this PhD project, we aim to achieve automated and effective change management in a product family within the context of use case-driven development and system testing. To this end, we first provide a modeling method for capturing variability information explicitly in Product Line (PL) use case and domain models. Then, we propose an automated configuration approach to automatically generate Product Specific (PS) use case and domain models from PL models. In addition, we plan to provide a change impact analysis approach for PL use case and domain models and automated regression test selection for system test cases derived from PL use case models.

Input-Sensitive Performance Testing
Qi Luo
(College of William and Mary, USA)
One goal of performance testing is to find specific test input data that exposes performance bottlenecks and to identify the methods responsible for these bottlenecks. A big and important challenge of performance testing is how to deeply understand the performance behaviors of a nontrivial software system in terms of test input data, in order to properly select the specific test input values for finding the problematic methods. Thus, we propose this research program to automatically analyze performance behaviors in software and link these behaviors with test input data, selecting the specific inputs that can expose performance bottlenecks. In addition, this research further examines the corresponding execution traces of the selected inputs to target the problematic methods.

On the Utility of Dominator Mutants for Mutation Testing
Bob Kurtz
(George Mason University, USA)
Mutation testing has been shown to support the generation of test sets that are highly effective at detecting faults. However, practitioner adoption of mutation testing has been minimal in part because of problems that arise from the huge numbers of redundant and equivalent mutants that are generated. The research described here examines the relationship between mutants and attempts to reduce the number of redundant and equivalent mutants in order to make mutation testing more practical for the software tester.


Student Research Competition
Tue, Nov 15, 15:30 - 16:30, Foyer 3rd/4th Floor

Graduate Submissions

Effective Assignment and Assistance to Software Developers and Reviewers
Motahareh Bahrami Zanjani
(Wichita State University, USA)
Human reliance and dominance are ubiquitous in sustaining a high-quality large software system. Automatically assigning the right solution providers to the maintenance task at hand is arguably as important as providing the right tool support for it, especially in the far too commonly found state of inadequate or obsolete documentation of large-scale software systems. Two maintenance tasks related to assignment and assistance to software developers and reviewers are addressed, and solutions are proposed. The key insight behind these proposed solutions is the analysis and use of micro-levels of human-to-code and human-to-human interactions (e.g., code review). We analyzed code reviews that are managed by Gerrit and found different markers of developer expertise associated with the source code changes and their acceptance, timeline, and the human roles and feedback involved in the reviews. We formed a developer-expertise model from these markers and showed its application in bug triaging. Specifically, we derived a developer recommendation approach for an incoming change request, named rDevX, from this expertise model. Additionally, we present an approach, named cHRev, to automatically recommend reviewers who are best suited to participate in a given review, based on their historical contributions as demonstrated in their prior reviews. Furthermore, a comparative study against previous approaches for developer recommendation and reviewer recommendation was performed. The metrics recall and MRR were used to measure their quantitative effectiveness. Results show that the proposed approaches outperform the compared approaches with statistical significance.

RABIEF: Range Analysis Based Integer Error Fixing
Xi Cheng
(Tsinghua University, China)
The C language has complicated semantics for integers. Integer errors lead to serious software failures or exploitable vulnerabilities, and they are harbored in various real-world programs. It is labor-intensive and error-prone to manually address integer errors. The usability of existing automated techniques is generally poor, as they rely heavily on external specifications or simply transform bugs into crashes. We propose RABIEF, a novel and fully automatic approach to fix C integer errors based on range analysis. RABIEF is inspired by the following insights: (1) fixes for various integer errors have typical patterns, including sanitization, explicit cast, and declared type alteration; (2) range analysis provides a sound basis for error detection and guides fix generation. We implemented RABIEF in a tool named Argyi. Its effectiveness and efficiency are substantiated by the facts that: (1) Argyi succeeds in fixing 93.9% of 5414 integer bugs from the Juliet test suite, scaling to 600 KLOC within 5500 seconds; (2) Argyi is confirmed to correctly fix 20 errors from 4 real-world programs within only 240 seconds.
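As a rough illustration of the sanitization fix pattern mentioned above (our own Java rendering of the idea; the approach itself targets C programs), an overflow-prone addition can be guarded before it is performed:

```java
// Sanitization pattern: guard an arithmetic operation whose result could
// overflow before executing it. A range analysis would supply the bounds
// that make such a guard necessary.
class SanitizedAdd {
    static int add(int a, int b) {
        if ((b > 0 && a > Integer.MAX_VALUE - b) ||
            (b < 0 && a < Integer.MIN_VALUE - b)) {
            throw new ArithmeticException("integer overflow in add");
        }
        return a + b;
    }
}
```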

Fine-Grained Binary Code Authorship Identification
Xiaozhu Meng
(University of Wisconsin-Madison, USA)
Binary code authorship identification is the task of determining the authors of a piece of binary code from a set of known authors. Modern software often contains code from multiple authors. However, existing techniques assume that each program binary is written by a single author. We present a new finer-grained technique to the tougher problem of determining the author of each basic block. Our evaluation shows that our new technique can discriminate the author of a basic block with 52% accuracy among 282 authors, as opposed to 0.4% accuracy by random guess, and it provides a practical solution for identifying multiple authors in software.

Identifying Participants for Collaborative Merge
Catarina Costa
(Federal Fluminense University, Brazil)
Software development is typically a collaborative activity. Development in large projects often involves branches, where changes are performed in parallel and merged periodically. While there is no consensus on who should perform the merge, team members typically try to find someone with enough knowledge about the changes in the branches. This task can be difficult in cases where many different developers have made significant changes. My research proposes an approach, TIPMerge, to help select the most appropriate developers to participate in a collaborative merge session, such that we maximize the knowledge spread across the changes. The goal of this research is to select a specified number of developers with the highest joint coverage. We use an optimization algorithm to find which developers together form the best team to deal with a specific merge case. We have implemented all the steps of our approach and have evaluated parts of them.

Cozy: Synthesizing Collection Data Structures
Calvin Loncaric
(University of Washington, USA)
Many applications require specialized data structures not found in standard libraries. Implementing new data structures by hand is tedious and error-prone. To alleviate this difficulty, we built a tool called Cozy that synthesizes data structures using counter-example guided inductive synthesis. We evaluate Cozy by showing how its synthesized implementations compare to handwritten implementations in terms of correctness and performance across four real-world programs. Cozy's data structures match the performance of the handwritten implementations while avoiding human error.

Constraint-Based Event Trace Reduction
Jie Wang
(University of Chinese Academy of Sciences, China)
Various record-replay techniques have been developed to facilitate web application debugging. However, it is time-consuming to inspect all recorded events that reveal a failure. To reduce the cost of debugging, delta-debugging and program slicing are used to remove failure-irrelevant events. However, delta-debugging does not scale well for long traces, and program slicing fails to remove irrelevant events that the failure has a program dependence on. In this paper, we propose an effective and efficient approach to remove failure-irrelevant events from the event trace. Our approach builds constraints among events and the failure (e.g., a variable can read any of its earlier type-compatible values) to search for a minimal event trace that satisfies these constraints. Our evaluation on 10 real-world web applications shows that our approach can further remove 70% of the events in the trace reduced by dynamic slicing, and needs 80% fewer iterations and 86% less time than delta-debugging.

Automatic Trigger Generation for End User Written Rules for Home Automation
Chandrakana NandiORCID logo
(University of Washington, USA)
To customize the behavior of a smart home, an end user writes rules. When an external event satisfies a rule's trigger, the rule's action executes; for example, when the temperature is above a certain threshold, then window awnings might be extended. End users often write incorrect rules. This paper's technique prevents a certain category of errors in the rules: errors due to too few triggers. It statically analyzes a rule's actions to automatically determine a set of necessary and sufficient triggers.
We implemented the technique in a tool called TrigGen and tested it on 96 end-user written rules for openHAB, an open-source home automation platform. It identified that 80% of the rules had fewer triggers than required for correct behavior. The missing triggers could lead to unexpected behavior and security vulnerabilities in a smart home.
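To illustrate the "too few triggers" problem in a simplified form (a toy Java model of our own, not TrigGen's implementation or the openHAB rule syntax), consider a rule whose action reads two items but is triggered by only one of them:

```java
import java.util.*;

// Toy model of a home-automation rule: the action reads both Temperature and
// WindSpeed, but the rule is only triggered by Temperature changes, so a
// change in wind speed alone never re-evaluates the rule.
class AwningRule {
    Set<String> triggers = new HashSet<>(Arrays.asList("Temperature"));
    Set<String> itemsReadByAction =
            new HashSet<>(Arrays.asList("Temperature", "WindSpeed"));

    // A simple static check in the spirit of the approach: every item the
    // action reads should also appear as a trigger.
    Set<String> missingTriggers() {
        Set<String> missing = new HashSet<>(itemsReadByAction);
        missing.removeAll(triggers);
        return missing;              // here: {"WindSpeed"}
    }
}
```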

Hotspot Symbolic Execution of Floating-Point Programs
Minghui Quan
(National University of Defense Technology, China)
This paper presents hotspot symbolic execution (HSE) to scale the symbolic execution of floating-point programs. The essential idea of HSE is to (1) explore the paths of some functions (called hotspot functions) in priority, and (2) divide the paths of a hotspot function into different equivalence classes and explore as few paths as possible inside the function while ensuring the coverage of all the classes. We have implemented HSE on KLEE and carried out extensive experiments on all 5528 functions in the GNU Scientific Library (GSL). The experimental results demonstrate the effectiveness and efficiency of HSE. Compared with the baseline, HSE detects more than 12 times as many exceptions within 30 minutes.

Evaluation of Fault Localization Techniques
Spencer Pearson
(University of Washington, USA)
Fault localization (FL) takes as input a faulty program and produces as output a list of code locations ranked by probability of being defective. A programmer doing debugging, or a program repair tool, could save time by focusing on the most suspicious locations.
Researchers evaluate new FL techniques on programs with known faults, and score a technique based on where in its list the actual defect appears. This enables comparison of multiple FL techniques to determine which one is best.
Previous research has primarily evaluated FL techniques using artificial faults, generated either by hand or automatically. Other prior work has shown that artificial faults have both similarities to and differences from real faults; given this, it is not obvious that the techniques that perform best on artificial faults will also perform best on real faults.
This work compares 7 previously-studied FL techniques, both on artificial faults (as a replication study) and on real faults (to validate the assumption that artificial faults are useful proxies for real faults for comparisons of FL techniques). Our replication largely agreed with prior work, but artificial faults were not useful for predicting which FL techniques perform best on real faults.
We also studied which characteristics make FL techniques perform well on real faults. We identified a design space that includes those 7 previously-studied FL techniques as well as 149 new ones, and determined which decisions were most important in designing a new technique.

How Should Static Analysis Tools Explain Anomalies to Developers?
Titus Barik
(North Carolina State University, USA)
Despite the advanced static analysis tools available within modern integrated development environments (IDEs), the error messages these tools produce remain perplexing for developers to comprehend. This research postulates that tools can computationally expose their internal reasoning processes to generate assistive error explanations that more closely align with how developers explain errors to themselves.

Repairing Test Dependence
Wing Lam
(University of Illinois at Urbana-Champaign, USA)
In a test suite, all the tests should be independent: no test should affect another test’s result, and running the tests in any order should yield the same test results. The assumption of such test independence is important so that tests behave consistently as designed. However, this critical assumption often does not hold in practice due to test dependence.
Test dependence causes two serious problems: a dependent test may spuriously fail even when the software is correct (a false positive alarm), or it may spuriously pass even when a bug exists in the software (a false negative). Existing approaches to cope with test dependence require tests to be executed in a given order or each test to be executed in a separate virtual machine. This paper presents an approach that can automatically repair test dependence so that each test in a suite yields the same result regardless of the execution order. At compile time, the approach refactors the code under test and the test code to eliminate test dependence and prevent spurious test successes or failures.
We develop a prototype of our approach to handle one of the most common causes of test dependence and evaluate the prototype on five subject programs. In our experimental evaluation, our prototype is capable of eliminating up to 12.5% of the test dependence in the subject programs.
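A hypothetical example of the kind of dependence the prototype targets (not drawn from the paper's subject programs): tests coupled through shared mutable static state, repaired by re-establishing that state before each test so the result no longer depends on execution order.

    import static org.junit.Assert.assertEquals;
    import org.junit.Before;
    import org.junit.Test;

    // Hypothetical illustration: mutable state shared across tests makes each test's
    // result depend on which tests ran before it.
    public class CounterTest {
        static int counter = 0;                 // shared mutable static state

        @Test public void firstIncrement() {
            counter++;
            assertEquals(1, counter);           // fails if secondIncrement already ran
        }

        @Test public void secondIncrement() {
            counter++;
            assertEquals(1, counter);           // fails if firstIncrement already ran
        }

        // One possible repair: re-establish the shared state before every test, so each
        // test yields the same result regardless of execution order.
        @Before public void resetSharedState() {
            counter = 0;
        }
    }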

Combining Bug Detection and Test Case Generation
Martin Kellogg
(University of Washington, USA)
Detecting bugs in software is an important software engineering activity. Static bug finding tools can assist in detecting bugs automatically, but they suffer from high false positive rates. Automatic test generation tools can generate test cases which can find bugs, but they suffer from an oracle problem. We present N-Prog, a hybrid of the two approaches. N-Prog iteratively presents the developer an interesting, real input/output pair. The developer either classifies it as a bug (when the output is incorrect) or adds it to the regression test suite (when the output is correct). N-Prog selects input/output pairs whose input produces a different output on a mutated version of the program that still passes the test suite of the original. In initial experiments, N-Prog detected bugs and rediscovered test cases that had been removed from a test suite.
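A minimal sketch of the selection criterion, with hypothetical types and names (not N-Prog's actual implementation): among mutants that pass the original test suite, find an input on which a mutant disagrees with the original program, and present that input together with the original output to the developer.

    import java.util.*;
    import java.util.function.Function;

    // Hypothetical sketch of the selection step described above.
    class NProgSketch {
        static <I, O> Optional<Map.Entry<I, O>> selectCandidate(
                Function<I, O> original,
                List<Function<I, O>> mutantsPassingSuite,   // mutants that pass the original test suite
                List<I> candidateInputs) {
            for (Function<I, O> mutant : mutantsPassingSuite) {
                for (I input : candidateInputs) {
                    O expected = original.apply(input);
                    if (!Objects.equals(expected, mutant.apply(input))) {
                        // The developer either reports a bug or keeps (input, expected) as a regression test.
                        return Optional.of(Map.entry(input, expected));
                    }
                }
            }
            return Optional.empty();
        }
    }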

SmartDebug: An Interactive Debug Assistant for Java
Xinrui Guo
(Tsinghua University, China)
Debugging has long been recognized as one of the most labour- and time-consuming activities in software development. Recent research on automated debugging tries to facilitate this process by automatically generating patches for buggy programs so that they pass a predefined test suite. Despite the promising experimental results, several major obstacles emerge when we apply these techniques during active coding. Inadequate test cases, multiple errors in one program, and bug categories overlooked by existing fix-generation search spaces prevent these techniques from performing at their best.
To overcome these obstacles, we designed an interactive usage paradigm that allows a developer to characterize his or her judgments of the program's running state and uses this information to guide the fix-generation process. We implemented a prototype of this design, an Eclipse plugin called SmartDebug, as a debug assistant for Java programs. Experimental results show that SmartDebug helped debug 15 out of 25 buggy programs successfully. All programs contain fewer than 3 test cases. In 14 programs it accelerated the debugging process compared to pure human debugging; one of these programs contained 2 buggy statements. This indicates that the proposed usage paradigm is capable of facilitating the debugging process during active coding.

Static Loop Analysis and Its Applications
Xie Xiaofei
(Tianjin University, China)
Loops are challenging structures in program analysis, and effective loop analysis is crucial in applications such as symbolic execution and program verification. In this research, we first perform a deep analysis of loops and propose a classification according to their complexity. We then propose techniques for analyzing and summarizing the different classes of loops. Finally, we apply these techniques in multiple applications.

Social Health Cues Developers Use when Choosing Open Source Packages
Andrew Head
(University of California at Berkeley, USA)
Developers choose open source packages from many alternatives. One increasingly important factor when choosing a package is its "social health", or a developer’s ability to get help on communication channels. We conduct a study to understand how developers learn about the social health of open source packages before using them. We offer preliminary results on the cues developers use.

Finding and Breaking Test Dependencies to Speed Up Test Execution
Sebastian Kappler
(Saarland University, Germany)
Software testing takes up the major part of the build time, which hinders developers' ability to promptly identify and fix problems.
Test parallelization is an effective means to speed up test executions, hence improving software development. Effective and sound test parallelization requires that tests are independent or that test dependencies are known in advance. However, current techniques to detect test dependencies are either precise but slow, or fast but inaccurate. Further, available algorithms for test parallelization either over-constrain test executions, which reduces their level of parallelism, or re-execute the same tests multiple times, which increases the execution effort.
This research addresses both sides of the problem of speeding up test execution: it aims to devise a practical test dependence detection technique that suitably balances efficiency and accuracy, and to develop a novel technique for breaking test dependencies that allows both sound and efficient test executions.

Automatic Performance Testing using Input-Sensitive Profiling
Qi Luo
(College of William and Mary, USA)
During performance testing, software engineers commonly profile an application's execution traces under different inputs to understand its performance behaviors, such as time and space consumption. However, a non-trivial application commonly has a very large input space, and identifying the specific inputs that lead to performance bottlenecks is largely a manual process. Thus, the challenge is to automate application profiling and find these specific inputs. To solve these problems, we propose novel approaches, namely FOREPOST, GA-Prof and PerfImpact, which automatically profile applications to find the specific combinations of inputs that trigger performance bottlenecks, and further analyze the corresponding execution traces to identify problematic methods. Specifically, our approaches work in two different types of real-world performance-testing scenarios: i) a single-version scenario, in which performance bottlenecks are detected in a single software release, and ii) a two-version scenario, in which code changes responsible for performance regressions are detected by considering two consecutive software releases.
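As a simplified sketch of the underlying search idea (a hill-climbing stand-in for the genetic search used by GA-Prof; the names, mutation operator, and fitness function are illustrative, not the paper's):

    import java.util.*;
    import java.util.function.Consumer;

    // Hypothetical sketch: repeatedly mutate candidate inputs and keep those that
    // maximize the measured execution time of the workload under test.
    class SlowInputSearch {
        static long measureMillis(Consumer<int[]> workload, int[] input) {
            long start = System.nanoTime();
            workload.accept(input);
            return (System.nanoTime() - start) / 1_000_000;
        }

        static int[] search(Consumer<int[]> workload, int[] seed, int generations, Random rnd) {
            int[] best = seed;
            long bestTime = measureMillis(workload, best);
            for (int g = 0; g < generations; g++) {
                int[] candidate = best.clone();
                candidate[rnd.nextInt(candidate.length)] = rnd.nextInt(1_000_000);  // mutate one field
                long time = measureMillis(workload, candidate);
                if (time > bestTime) {                                              // keep the slower input
                    best = candidate;
                    bestTime = time;
                }
            }
            return best;
        }
    }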

Undergraduate Submissions

Enforcing Correct Array Indexes with a Type System
Joseph Santino
(University of Washington, USA)
We have built the Index Checker, a type checker that issues warnings about array, list, and string accesses that are potentially unsafe. An example is shown in Figure 1. As with any sound tool, some of its warnings may be false positives. If the Index Checker issues no warning, then the programmer is guaranteed that no array access will cause an IndexOutOfBoundsException at run time (modulo suppressed warnings and unchecked code). The Index Checker ships with knowledge of Java APIs. The developer can optionally write a few type annotations in the program to make the Index Checker more precise. Our system includes five new type qualifiers, defined in Figure 2, that can be applied to integral types such as Java int. These are dependent types that indicate the relationship between the int and given arrays. Figures 3 and 4 show the relationship among these type qualifiers. The type system also contains a type qualifier for arrays, @MinLen, which gives a lower bound on an array's length and permits the use of integer literals to access the array or to construct a new array. The Index Checker is built upon the Checker Framework (http://CheckerFramework.org/).
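An illustrative use of such annotations (the qualifier names other than @MinLen follow the Checker Framework's Index Checker and may differ from those in Figure 2):

    import org.checkerframework.checker.index.qual.IndexFor;
    import org.checkerframework.checker.index.qual.MinLen;

    // The checker verifies at compile time that i is a valid index for a, so the
    // accesses below cannot throw IndexOutOfBoundsException at run time.
    class SafeAccess {
        static int get(int @MinLen(1) [] a, @IndexFor("a") int i) {
            return a[i];          // i is known to be in bounds for a
        }

        static int first(int @MinLen(1) [] a) {
            return a[0];          // legal because a has at least one element
        }
    }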

Discovering Additional Violations of Java API Invariants
Waylon Huang
(University of Washington, USA)
In the absence of formal specifications or test oracles, automating testing is made possible by the fact that a program must satisfy certain requirements set down by the programming language. This work describes Randoop, an automatic unit test generator which checks for invariants specified by the Java API. Randoop is able to detect violations of invariants specified by the Java API and create error tests that reveal related bugs. Randoop is also able to produce regression tests, meant to be added to regression test suites, that capture expected behavior. We discuss additional extensions that we have made to Randoop which expand its capability to detect violations of specified invariants. We also examine an optimization and a heuristic for making the invariant checking process more efficient.
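An illustrative, hand-written example of the kind of error-revealing test Randoop can emit when a Java API contract such as equals reflexivity is violated (the class under test and the exact test shape are hypothetical):

    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    // Illustrative only: java.lang.Object#equals requires that an object equals itself.
    public class ErrorTest {
        static class BrokenPoint {
            int x;
            @Override public boolean equals(Object o) { return false; }   // violates reflexivity
        }

        @Test public void reflexivityViolation() {
            BrokenPoint p = new BrokenPoint();
            assertTrue("equals is not reflexive", p.equals(p));            // fails, revealing the bug
        }
    }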

Preventing Signedness Errors in Numerical Computations in Java
Christopher A. Mackie
(University of Washington, USA)
We have developed and implemented a type system, the Signedness Type System, that captures usage of signed and unsigned integers in Java programs. This type system enables developers to detect errors regarding unsigned integers at compile time, and guarantees that such errors cannot occur at run time. In a case study, our type system proved easy to use and detected a previously unknown bug. Our type system is implemented as the Signedness Checker and will be available with the Checker Framework (http://CheckerFramework.org/).
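An illustrative example, assuming the @Unsigned qualifier provided by the Checker Framework's Signedness Checker (the case-study code is not shown in this abstract): applying signed arithmetic such as division to an unsigned value is rejected at compile time, and the JDK's unsigned helpers are used instead.

    import org.checkerframework.checker.signedness.qual.Unsigned;

    class UnsignedDemo {
        static String show(@Unsigned int value) {
            // return Integer.toString(value / 2);    // would be flagged: signed division on an unsigned value
            return Integer.toUnsignedString(Integer.divideUnsigned(value, 2));   // correct unsigned handling
        }
    }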

Bounded Model Checking of State-Space Digital Systems: The Impact of Finite Word-Length Effects on the Implementation of Fixed-Point Digital Controllers Based on State-Space Modeling
Felipe R. Monteiro
(Federal University of Amazonas, Brazil)
The extensive use of digital controllers demands a growing effort to prevent design errors that appear due to finite word-length (FWL) effects. However, there is still a gap regarding verification tools and methodologies for checking implementation aspects of control systems. Thus, the present paper describes an approach, which employs bounded model checking (BMC) techniques, to verify fixed-point digital controllers represented by state-space equations. The experimental results demonstrate the sensitivity of such systems to FWL effects and the effectiveness of the proposed approach to detect them. To the best of my knowledge, this is the first contribution tackling formal verification through BMC of fixed-point state-space digital controllers.
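For reference, the standard discrete-time state-space form that such controllers follow (a sketch; the paper's concrete models and properties are not reproduced here):

    \begin{aligned}
      x(k+1) &= A\,x(k) + B\,u(k) \\
      y(k)   &= C\,x(k) + D\,u(k)
    \end{aligned}

In a fixed-point implementation the coefficient matrices and signals are quantized (e.g., \hat{A} = Q(A)), and the BMC engine checks, up to a given unwinding bound, whether FWL-induced errors such as overflow become reachable.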

Atlas: An Intelligent, Performant Framework for Web-Based Grid Computing
Sachith Gullapalli
(Yale University, USA)
This paper presents Atlas, a distributed computing system that allows for collaboration over Internet browsers. It allows users to donate the wasted processing power from their Internet-connected machines to help researchers and companies compute difficult tasks. The platform aims to maintain speeds similar to those of available cloud computing services while running these tasks at a lower cost. To do so, Atlas minimizes the amount of time needed per computation and intelligently distributes jobs by utilizing user browsing patterns. Benchmarks demonstrate that Atlas may be a viable alternative to traditional parallel platforms.
