34th International Conference on Software Engineering (ICSE 2012), June 2–9, 2012, Zurich, Switzerland
Technical Research
Fault Handling
Wed, Jun 6, 10:45 - 12:45
A Systematic Study of Automated Program Repair: Fixing 55 out of 105 Bugs for $8 Each
Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer
(University of Virginia, USA; University of New Mexico, USA)
@InProceedings{ICSE12p3,
author = {Claire Le Goues and Michael Dewey-Vogt and Stephanie Forrest and Westley Weimer},
title = {A Systematic Study of Automated Program Repair: Fixing 55 out of 105 Bugs for $8 Each},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {3--13},
doi = {},
year = {2012},
}
Where Should the Bugs Be Fixed? - More Accurate Information Retrieval-Based Bug Localization Based on Bug Reports
Jian Zhou, Hongyu Zhang, and David Lo
(Tsinghua University, China; Singapore Management University, Singapore)
For a large and evolving software system, the project team could receive a large number of bug reports. Locating the source code files that need to be changed in order to fix the bugs is a challenging task. Once a bug report is received, it is desirable to automatically point out to the files that developers should change in order to fix the bug. In this paper, we propose BugLocator, an information retrieval based method for locating the relevant files for fixing a bug. BugLocator ranks all files based on the textual similarity between the initial bug report and the source code using a revised Vector Space Model (rVSM), taking into consideration information about similar bugs that have been fixed before. We perform large-scale experiments on four open source projects to localize more than 3,000 bugs. The results show that BugLocator can effectively locate the files where the bugs should be fixed. For example, relevant buggy files for 62.60% Eclipse 3.1 bugs are ranked in the top ten among 12,863 files. Our experiments also show that BugLocator outperforms existing state-of-the-art bug localization methods.
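To make the ranking idea concrete, here is a minimal Java sketch, not the authors' implementation: plain cosine similarity over term-frequency vectors stands in for the revised VSM, and a hypothetical alpha weight blends in scores derived from similar fixed bugs.

import java.util.*;

// Sketch only (assumed simplifications): cosine similarity between raw
// term-frequency vectors stands in for rVSM, and a fixed weight alpha
// blends in scores derived from similar already-fixed bugs.
public class BugRanker {

    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+"))
            if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
        return tf;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) nb += (double) v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // files: file name -> source text; simiScore: file name -> score
    // derived from similar fixed bug reports (hypothetical input here).
    static List<String> rank(String bugReport, Map<String, String> files,
                             Map<String, Double> simiScore, double alpha) {
        Map<String, Integer> query = termFreq(bugReport);
        List<String> ranked = new ArrayList<>(files.keySet());
        ranked.sort(Comparator.comparingDouble((String f) ->
            -((1 - alpha) * cosine(query, termFreq(files.get(f)))
              + alpha * simiScore.getOrDefault(f, 0.0))));
        return ranked;
    }
}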
@InProceedings{ICSE12p14,
author = {Jian Zhou and Hongyu Zhang and David Lo},
title = {Where Should the Bugs Be Fixed? - More Accurate Information Retrieval-Based Bug Localization Based on Bug Reports},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {14--24},
doi = {},
year = {2012},
}
Developer Prioritization in Bug Repositories
Jifeng Xuan, He Jiang, Zhilei Ren, and Weiqin Zou
(Dalian University of Technology, China)
Developers build all the software artifacts in development. Existing work has studied the social behavior in software repositories. In one of the most important software repositories, a bug repository, developers create and update bug reports to support software development and maintenance. However, no prior work has considered the priorities of developers in bug repositories. In this paper, we address the problem of the developer prioritization, which aims to rank the contributions of developers. We mainly explore two aspects, namely modeling the developer prioritization in a bug repository and assisting predictive tasks with our model. First, we model how to assign the priorities of developers based on a social network technique. Three problems are investigated, including the developer rankings in products, the evolution over time, and the tolerance of noisy comments. Second, we consider leveraging the developer prioritization to improve three predicted tasks in bug repositories, i.e., bug triage, severity identification, and reopened bug prediction. We empirically investigate the performance of our model and its applications in bug repositories of Eclipse and Mozilla. The results indicate that the developer prioritization can provide the knowledge of developer priorities to assist software tasks, especially the task of bug triage.
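As an illustration of ranking contributors with a social network technique, the following sketch runs a PageRank-style iteration over a developer comment network; the edge semantics and damping factor are assumptions, not necessarily the paper's exact model.

import java.util.*;

// PageRank-style iteration over a developer comment network; the edge
// meaning (i commented on j's reports) and damping are assumptions.
public class DeveloperRank {

    static double[] rank(List<List<Integer>> outLinks, int iterations,
                         double damping) {
        int n = outLinks.size();
        double[] score = new double[n];
        Arrays.fill(score, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - damping) / n);
            for (int i = 0; i < n; i++) {
                if (outLinks.get(i).isEmpty()) continue;
                double share = damping * score[i] / outLinks.get(i).size();
                for (int j : outLinks.get(i)) next[j] += share;
            }
            score = next;
        }
        return score; // higher score = higher developer priority
    }
}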
@InProceedings{ICSE12p25,
author = {Jifeng Xuan and He Jiang and Zhilei Ren and Weiqin Zou},
title = {Developer Prioritization in Bug Repositories},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {25--35},
doi = {},
year = {2012},
}
WhoseFault: Automatic Developer-to-Fault Assignment through Fault Localization
Francisco Servant and James A. Jones
(UC Irvine, USA)
This paper describes a new technique, which automatically selects the most appropriate developers for fixing the fault represented by a failing test case, and provides a diagnosis of where to look for the fault. This technique works by incorporating three key components: (1) fault localization to inform locations whose execution correlate with failure, (2) history mining to inform which developers edited each line of code and when, and (3) expertise assignment to map locations to developers. To our knowledge, the technique is the first to assign developers to execution failures, without the need for textual bug reports. We implement this technique in our tool, WhoseFault, and describe an experiment where we utilize a large, open-source project to determine the frequency in which our tool suggests an assignment to the actual developer who fixed the fault. Our results show that 81% of the time, WhoseFault produced the same developer that actually fixed the fault within the top three suggestions. We also show that our technique improved by a difference between 4% and 40% the results of a baseline technique. Finally, we explore the influence of each of the three components of our technique over its results, and compare our expertise algorithm against an existing expertise assessment technique and find that our algorithm provides greater accuracy, by up to 37%.
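A minimal sketch of the last two steps, with hypothetical inputs: per-line suspiciousness scores from fault localization are folded into per-developer expertise scores using line authorship recovered from history mining. WhoseFault's actual expertise algorithm is richer than this sum.

import java.util.*;

// Combines per-line suspiciousness (fault localization) with per-line
// authorship (history mining) into per-developer scores. All names and
// inputs here are hypothetical.
public class DeveloperAssigner {

    static Map<String, Double> assign(Map<Integer, Double> suspiciousness,
                                      Map<Integer, String> lineAuthor) {
        Map<String, Double> expertise = new HashMap<>();
        for (Map.Entry<Integer, Double> e : suspiciousness.entrySet()) {
            String dev = lineAuthor.get(e.getKey());
            if (dev != null) expertise.merge(dev, e.getValue(), Double::sum);
        }
        return expertise; // highest-scoring developer = suggested assignee
    }
}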
@InProceedings{ICSE12p36,
author = {Francisco Servant and James A. Jones},
title = {WhoseFault: Automatic Developer-to-Fault Assignment through Fault Localization},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {36--46},
doi = {},
year = {2012},
}
Code Generation and Recovery
Wed, Jun 6, 10:45 - 12:45
Recovering Traceability Links between an API and Its Learning Resources
Barthélémy Dagenais and Martin P. Robillard
(McGill University, Canada)
Large frameworks and libraries require extensive developer learning resources, such as documentation and mailing lists, to be useful. Maintaining these learning resources is challenging partly because they are not explicitly linked to the frameworks' API, and changes in the API are not reflected in the learning resources. Automatically recovering traceability links between an API and learning resources is notoriously difficult due to the inherent ambiguity of unstructured natural language. Code elements mentioned in documents are rarely fully qualified, so readers need to understand the context in which a code element is mentioned. We propose a technique that identifies code-like terms in documents and links these terms to specific code elements in an API, such as methods. In an evaluation study with four open source systems, we found that our technique had an average recall and precision of 96%.
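The first step, spotting code-like terms in free text, might look like the sketch below; the camelCase, call, and dotted-name heuristics are assumptions, and the paper's actual classifier and linking step are considerably richer.

import java.util.*;
import java.util.regex.*;

// Heuristic spotting of code-like terms in free text; the patterns are
// illustrative placeholders, not the paper's technique.
public class CodeTermFinder {

    private static final Pattern CODE_LIKE = Pattern.compile(
        "\\b[a-z]+[A-Z]\\w*\\b"          // camelCase identifiers
        + "|\\b\\w+\\(\\)"                // calls like foo()
        + "|\\b\\w+(\\.\\w+)+\\b");       // dotted names like java.util.List

    static List<String> findTerms(String text) {
        List<String> terms = new ArrayList<>();
        Matcher m = CODE_LIKE.matcher(text);
        while (m.find()) terms.add(m.group());
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(findTerms(
            "Call getWidth() on the Image before using java.awt.Graphics."));
    }
}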
@InProceedings{ICSE12p47,
author = {Barthélémy Dagenais and Martin P. Robillard},
title = {Recovering Traceability Links between an API and Its Learning Resources},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {47--57},
doi = {},
year = {2012},
}
Generating Range Fixes for Software Configuration
Yingfei Xiong, Arnaud Hubaux, Steven She, and Krzysztof Czarnecki
(University of Waterloo, Canada; University of Namur, Belgium)
To prevent ill-formed configurations, highly configurable software often allows defining constraints over the available options. As these constraints can be complex, fixing a configuration that violates one or more constraints can be challenging. Although several fix-generation approaches exist, their applicability is limited because (1) they typically generate only one fix, failing to cover the solution that the user wants; and (2) they do not fully support non-Boolean constraints, which contain arithmetic, inequality, and string operators.
This paper proposes a novel concept, range fix, for software configuration. A range fix specifies the options to change and the ranges of values for these options. We also design an algorithm that automatically generates range fixes for a violated constraint. We have evaluated our approach with three different strategies for handling constraint interactions, on data from five open source projects. Our evaluation shows that, even with the most complex strategy, our approach generates complete fix lists that are mostly short and concise, in a fraction of a second.
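For intuition, a hand-rolled sketch of the simplest case, a single linear constraint over one numeric option, is shown below; real range fixes cover several options, non-numeric domains, and interacting constraints, and the authors' algorithm works on general constraints rather than this closed form.

// For one violated constraint a*x + b <= 0 over a single numeric option
// x, the range fix is just the interval of satisfying x values.
public class RangeFix {

    static String fixFor(double a, double b) {
        if (a == 0) return b <= 0 ? "any value" : "no fix: unsatisfiable";
        double bound = -b / a;
        return a > 0 ? "x <= " + bound : "x >= " + bound;
    }

    public static void main(String[] args) {
        // Violated constraint: 2*x - 10 <= 0 with current x = 7
        System.out.println("change x to " + fixFor(2, -10)); // x <= 5.0
    }
}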
@InProceedings{ICSE12p58,
author = {Yingfei Xiong and Arnaud Hubaux and Steven She and Krzysztof Czarnecki},
title = {Generating Range Fixes for Software Configuration},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {58--68},
doi = {},
year = {2012},
}
Graph-Based Pattern-Oriented, Context-Sensitive Source Code Completion
Anh Tuan Nguyen, Tung Thanh Nguyen, Hoan Anh Nguyen, Ahmed Tamrawi, Hung Viet Nguyen, Jafar Al-Kofahi, and Tien N. Nguyen
(Iowa State University, USA)
Code completion helps improve programming productivity. However, current support for code completion is limited to context-free code templates or a single method call of the variable on focus. Using libraries for development, developers often repeat API usages for certain tasks. Therefore, in this paper, we introduce GraPacc, a graph-based pattern-oriented, context-sensitive code completion approach that is based on a database of API usage patterns. GraPacc manages and represents the API usage patterns of multiple variables, methods, and control structures via graph-based models. It extracts the context-sensitive features from the code, e.g. the API elements on focus or under modification, and their relations to other elements. The features are used to search and rank the patterns that are most fitted with the current code. When a pattern is selected, the current code will be completed via our novel graph-based code completion algorithm. Empirical evaluation on several real-world systems and human subjects shows that GraPacc has a high level of accuracy and a better level of usefulness than existing tools.
@InProceedings{ICSE12p69,
author = {Anh Tuan Nguyen and Tung Thanh Nguyen and Hoan Anh Nguyen and Ahmed Tamrawi and Hung Viet Nguyen and Jafar Al-Kofahi and Tien N. Nguyen},
title = {Graph-Based Pattern-Oriented, Context-Sensitive Source Code Completion},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {69--79},
doi = {},
year = {2012},
}
Automatic Input Rectification
Fan Long, Vijay Ganesh, Michael Carbin, Stelios Sidiroglou, and Martin Rinard
(MIT, USA)
We present a novel technique, automatic input rectification, and a prototype implementation, SOAP. SOAP learns a set of constraints characterizing typical inputs that an application is highly likely to process correctly. When given an atypical input that does not satisfy these constraints, SOAP automatically rectifies the input (i.e., changes the input so that it satisfies the learned constraints). The goal is to automatically convert potentially dangerous inputs into typical inputs that the program is highly likely to process correctly. Our experimental results show that, for a set of benchmark applications (namely, Google Picasa, ImageMagick, VLC, Swfdec, and Dillo), this approach effectively converts malicious inputs (which successfully exploit vulnerabilities in the application) into benign inputs that the application processes correctly. Moreover, a manual code analysis shows that, if an input does satisfy the learned constraints, it is incapable of exploiting these vulnerabilities. We also present the results of a user study designed to evaluate the subjective perceptual quality of outputs from benign but atypical inputs that have been automatically rectified by SOAP to conform to the learned constraints. Specifically, we obtained benign inputs that violate learned constraints, used our input rectifier to obtain rectified inputs, then paid Amazon Mechanical Turk users to provide their subjective qualitative perception of the difference between the outputs from the original and rectified inputs. The results indicate that rectification can often preserve much, and in many cases all, of the desirable data in the original input.
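A deliberately tiny analogue of rectification, assuming a single numeric length field: learn an upper bound from typical inputs, then clamp atypical inputs to it. SOAP itself learns much richer constraints over structured input formats.

import java.util.*;

// Learn a bound from typical inputs, then enforce it on new inputs.
// The single length field is an illustrative assumption.
public class InputRectifier {

    private int learnedMax = 0;

    void train(Collection<Integer> typicalLengths) {
        for (int len : typicalLengths) learnedMax = Math.max(learnedMax, len);
    }

    int rectify(int length) {
        return Math.min(length, learnedMax); // enforce learned constraint
    }

    public static void main(String[] args) {
        InputRectifier r = new InputRectifier();
        r.train(Arrays.asList(120, 64, 300));
        System.out.println(r.rectify(65536)); // suspicious field -> 300
    }
}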
@InProceedings{ICSE12p80,
author = {Fan Long and Vijay Ganesh and Michael Carbin and Stelios Sidiroglou and Martin Rinard},
title = {Automatic Input Rectification},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {80--90},
doi = {},
year = {2012},
}
Empirical Studies of Development
Wed, Jun 6, 10:45 - 12:45
Overcoming the Challenges in Cost Estimation for Distributed Software Projects
Narayan Ramasubbu and Rajesh Krishna Balan
(Singapore Management University, Singapore)
In this paper, we describe how we studied, in situ, the operational processes of three large, high-process-maturity distributed software development companies and discovered three common problems they faced with respect to early-stage project cost estimation. We found that project managers faced significant challenges in accurately estimating project costs because the standard metrics-based estimation tools they used (a) did not effectively incorporate diverse distributed project configurations and characteristics, (b) required comprehensive data that was not fully available for all starting projects, and (c) required significant experience to derive accurate estimates. To address these problems, we collaborated with practitioners at all three firms and developed a new learning-oriented, semi-automated early-stage cost estimation solution designed specifically for globally distributed software projects. The key idea of our solution was to augment existing metrics-driven estimation methods with a case repository that stratified past incidents related to project effort estimation issues, drawn from the firms' historical project databases, into several generalizable categories. This repository allowed project managers to quickly and effectively “benchmark” their new projects against all past projects across the firms, and thereby learn from them. We deployed our solution at each of our three research sites for real-world field testing over a period of six months. Project managers of 219 new large globally distributed projects used both our method and the established metrics-based estimation approaches they were accustomed to when estimating the cost of their projects. Our approach reduced estimation errors significantly (by up to 60%), resulting in more than 20% net cost savings per project on average – a massive total cost savings across all projects at the three firms.
@InProceedings{ICSE12p91,
author = {Narayan Ramasubbu and Rajesh Krishna Balan},
title = {Overcoming the Challenges in Cost Estimation for Distributed Software Projects},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {91--101},
doi = {},
year = {2012},
}
Characterizing Logging Practices in Open-Source Software
Ding Yuan, Soyeon Park, and Yuanyuan Zhou
(University of Illinois at Urbana-Champaign, USA; UC San Diego, USA)
Software logging is a conventional programming practice. Although its efficacy is often important for users and developers to understand what has happened in a production run, software logging is often done in an arbitrary manner. So far, there has been little study of logging practices in real-world software. This paper makes the first attempt (to the best of our knowledge) to provide a quantitative characterization of the log messages within four pieces of large open-source software. First, we quantitatively show that software logging is pervasive. By examining developers' own modifications to logging code in revision history, we find that they often do not get the log messages right on their first attempt, and thus need to spend significant effort modifying the log messages as afterthoughts. Our study further provides several interesting findings on where developers spend most of their effort in modifying log messages, which can give insights for programmers, tool developers, and language and compiler designers to improve current logging practice. To demonstrate the benefit of our study, we built a simple checker based on one of our findings and effectively detected 138 new pieces of problematic logging code in the studied software (24 of them already confirmed and fixed by developers).
@InProceedings{ICSE12p102,
author = {Ding Yuan and Soyeon Park and Yuanyuan Zhou},
title = {Characterizing Logging Practices in Open-Source Software},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {102--112},
doi = {},
year = {2012},
}
The Impacts of Software Process Improvement on Developers: A Systematic Review
Mathieu Lavallée and Pierre N. Robillard
(École Polytechnique de Montréal, Canada)
This paper presents the results of a systematic review on the impacts of Software Process Improvement (SPI) on developers. The review selected 26 studies from the highest-quality journals, conferences, and workshops in the field. The results were compiled and organized following the grounded theory approach, and further categorized using the Ishikawa (or fishbone) diagram. The Ishikawa diagram models all the factors potentially impacting software developers, and shows both the positive and negative impacts. Positive impacts include a reduction in the number of crises, an increase in team communication and morale, and better requirements and documentation. Negative impacts include increased overhead on developers through the need to collect data and compile documentation, an undue focus on technical approaches, and the fact that SPI is oriented toward management and process quality rather than toward developers and product quality. This systematic review should support future practice through the identification of important obstacles and opportunities for achieving SPI success. Future research should also benefit from the problems and advantages of SPI identified by developers.
@InProceedings{ICSE12p113,
author = {Mathieu Lavallée and Pierre N. Robillard},
title = {The Impacts of Software Process Improvement on Developers: A Systematic Review},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {113--122},
doi = {},
year = {2012},
}
Combining Functional and Imperative Programming for Multicore Software: An Empirical Study Evaluating Scala and Java
Victor Pankratius, Felix Schmidt, and Gilda Garretón
(KIT, Germany; Oracle Labs, USA)
Recent multi-paradigm programming languages combine functional and imperative programming styles to make software development easier. Given today's proliferation of multicore processors, parallel programmers are supposed to benefit from this combination, as many difficult problems can be expressed more easily in a functional style while others match an imperative style. Due to a lack of empirical evidence from controlled studies, however, important software engineering questions are largely unanswered. Our paper is the first to provide thorough empirical results by using Scala and Java as a vehicle in a controlled comparative study on multicore software development. Scala combines functional and imperative programming while Java focuses on imperative shared-memory programming. We study thirteen programmers who worked on three projects, including an industrial application, in both Scala and Java. In addition to the resulting 39 Scala programs and 39 Java programs, we obtain data from an industry software engineer who worked on the same project in Scala. We analyze key issues such as effort, code, language usage, performance, and programmer satisfaction. Contrary to popular belief, the functional style does not lead to bad performance. Average Scala run-times are comparable to Java, lowest run-times are sometimes better, but Java scales better on parallel hardware. We confirm with statistical significance Scala's claim that Scala code is more compact than Java code, but clearly refute other claims of Scala on lower programming effort and lower debugging effort. Our study also provides explanations for these observations and shows directions on how to improve multi-paradigm languages in the future.
@InProceedings{ICSE12p123,
author = {Victor Pankratius and Felix Schmidt and Gilda Garretón},
title = {Combining Functional and Imperative Programming for Multicore Software: An Empirical Study Evaluating Scala and Java},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {123--133},
doi = {},
year = {2012},
}
Performance Analysis
Wed, Jun 6, 10:45 - 12:45
Uncovering Performance Problems in Java Applications with Reference Propagation Profiling
Dacong Yan, Guoqing Xu, and Atanas Rountev
(Ohio State University, USA; UC Irvine, USA)
Many applications suffer from run-time bloat: excessive memory usage and work to accomplish simple tasks. Bloat significantly affects scalability and performance, and exposing it requires good diagnostic tools. We present a novel analysis that profiles the run-time execution to help programmers uncover potential performance problems. The key idea of the proposed approach is to track object references, starting from object creation statements, through assignment statements, and eventually statements that perform useful operations. This propagation is abstracted by a representation we refer to as a reference propagation graph. This graph provides path information specific to reference producers and their run-time contexts. Several client analyses demonstrate the use of reference propagation profiling to uncover run-time inefficiencies. We also present a study of the properties of reference propagation graphs produced by profiling 36 Java programs. Several case studies discuss the inefficiencies identified in some of the analyzed programs, as well as the significant improvements obtained after code optimizations.
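The graph itself can be pictured as a small data structure like the following sketch; the node naming and the wastefulness heuristic are illustrative assumptions, and the profiling instrumentation that populates the graph is omitted.

import java.util.*;

// Nodes are allocation sites or statements; edges record how object
// references flow through assignments to consuming statements.
public class RefPropGraph {

    final Map<String, Set<String>> edges = new HashMap<>();
    final Map<String, Long> useCount = new HashMap<>();

    void addFlow(String from, String to) {
        edges.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    void recordUse(String node) {
        useCount.merge(node, 1L, Long::sum);
    }

    // An allocation site whose references flow onward but are never used
    // downstream hints at bloat: objects created and copied for nothing.
    boolean looksWasteful(String allocSite) {
        Set<String> targets =
            edges.getOrDefault(allocSite, Collections.emptySet());
        return !targets.isEmpty() && targets.stream()
            .allMatch(n -> useCount.getOrDefault(n, 0L) == 0L);
    }
}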
@InProceedings{ICSE12p134,
author = {Dacong Yan and Guoqing Xu and Atanas Rountev},
title = {Uncovering Performance Problems in Java Applications with Reference Propagation Profiling},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {134--144},
doi = {},
year = {2012},
}
Performance Debugging in the Large via Mining Millions of Stack Traces
Shi Han, Yingnong Dang, Song Ge, Dongmei Zhang, and Tao Xie
Given limited resources and time before software release, development-site testing and debugging become more and more insufficient to ensure satisfactory software performance. As a counterpart to debugging in the large, pioneered by the Microsoft Windows Error Reporting (WER) system and focusing on crashing/hanging bugs, performance debugging in the large has emerged thanks to infrastructure support for collecting execution traces with performance issues from a huge number of users at deployment sites. However, performance debugging against these numerous and complex traces remains a significant challenge for performance analysts. In this paper, to enable performance debugging in the large in practice, we propose a novel approach, called StackMine, that mines callstack traces to help performance analysts effectively discover highly impactful performance bugs (e.g., bugs impacting many users with long response delay). As a successful technology-transfer effort, StackMine has been applied in performance-debugging activities at a Microsoft team for performance analysis since December 2010, especially for a large number of execution traces. Based on real-adoption experiences of StackMine in practice, we conducted an evaluation of StackMine on performance debugging in the large for Microsoft Windows 7. We also conducted another evaluation on a third-party application. The results highlight substantial benefits offered by StackMine in performance debugging in the large for large-scale software systems.
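A greatly simplified sketch of one ingredient, bucketing: group callstack traces by a normalized frame signature and rank buckets by accumulated delay so the costliest recurring patterns surface first. StackMine's actual mining of costly stack patterns is far more sophisticated.

import java.util.*;

// Bucket traces by frame signature, rank buckets by total delay.
public class StackBucketer {

    static LinkedHashMap<String, Long> rank(List<List<String>> traces,
                                            List<Long> delaysMs) {
        Map<String, Long> cost = new HashMap<>();
        for (int i = 0; i < traces.size(); i++) {
            String signature = String.join(";", traces.get(i));
            cost.merge(signature, delaysMs.get(i), Long::sum);
        }
        LinkedHashMap<String, Long> out = new LinkedHashMap<>();
        cost.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .forEach(e -> out.put(e.getKey(), e.getValue()));
        return out; // costliest recurring stack pattern first
    }
}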
@InProceedings{ICSE12p145,
author = {Shi Han and Yingnong Dang and Song Ge and Dongmei Zhang and Tao Xie},
title = {Performance Debugging in the Large via Mining Millions of Stack Traces},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {145--155},
doi = {},
year = {2012},
}
Automatically Finding Performance Problems with Feedback-Directed Learning Software Testing
Mark Grechanik, Chen Fu, and Qing Xie
(Accenture Technology Labs, USA; University of Illinois at Chicago, USA)
A goal of performance testing is to find situations in which applications unexpectedly exhibit worsened characteristics for certain combinations of input values. A fundamental question of performance testing is how to select a manageable subset of the input data that finds performance problems in applications automatically and quickly.
We offer a novel solution for finding performance problems in applications automatically using black-box software testing. Our solution is an adaptive, feedback-directed learning testing system that learns rules from execution traces of applications and then uses these rules to select test input data automatically for these applications to find more performance problems when compared with exploratory random testing. We have implemented our solution and applied it to a medium-size application at a major insurance company and to an open-source application. Performance problems were found automatically and confirmed by experienced testers and developers.
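The feedback loop might be sketched as below, with an invented cost model standing in for the instrumented application; the real system learns rules from execution traces rather than from a single input-size feature.

import java.util.*;
import java.util.function.*;

// The "application" is an invented quadratic cost model; the learned
// "rule" is a single size threshold. Both are assumptions for brevity.
public class FeedbackTester {

    public static void main(String[] args) {
        Random rnd = new Random(42);
        Function<Integer, Long> appCost = n -> (long) n * n;
        long worstCost = 0;
        int threshold = 0; // learned: inputs above this tend to be slow
        for (int round = 0; round < 5; round++) {
            for (int i = 0; i < 100; i++) {
                // after round 0, bias half the inputs toward the rule
                int input = (round > 0 && i % 2 == 0)
                        ? threshold + rnd.nextInt(1000)
                        : rnd.nextInt(100_000);
                long cost = appCost.apply(input);
                if (cost > worstCost) { worstCost = cost; threshold = input; }
            }
            System.out.println("round " + round + ": worst cost " + worstCost);
        }
    }
}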
@InProceedings{ICSE12p156,
author = {Mark Grechanik and Chen Fu and Qing Xie},
title = {Automatically Finding Performance Problems with Feedback-Directed Learning Software Testing},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {156--166},
doi = {},
year = {2012},
}
Predicting Performance via Automated Feature-Interaction Detection
Norbert Siegmund, Sergiy S. Kolesnikov, Christian Kästner, Sven Apel, Don Batory, Marko Rosenmüller, and Gunter Saake
(University of Magdeburg, Germany; University of Passau, Germany; Philipps University of Marburg, Germany; University of Texas at Austin, USA)
Customizable programs and program families provide user-selectable features to allow users to tailor a program to an application scenario. Knowing in advance which feature selection yields the best performance is difficult because a direct measurement of all possible feature combinations is infeasible. Our work aims at predicting program performance based on selected features. However, when features interact, accurate predictions are challenging. An interaction occurs when a particular feature combination has an unexpected influence on performance.
We present a method that automatically detects performance-relevant feature interactions to improve prediction accuracy. To this end, we propose three heuristics to reduce the number of measurements required to detect interactions. Our evaluation consists of six real-world case studies from varying domains (e.g., databases, encoding libraries, and web servers) using different configuration techniques (e.g., configuration files and preprocessor flags).
Results show an average prediction accuracy of 95%.
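The underlying prediction model can be pictured as a base time plus per-feature deltas plus pairwise interaction corrections, as in this sketch with invented numbers.

import java.util.*;

// Base time + per-feature deltas + pairwise interaction corrections.
// All delta values here are invented for illustration.
public class PerfPredictor {

    static double predict(Set<String> selected, double base,
                          Map<String, Double> featureDelta,
                          Map<String, Double> pairDelta) {
        double time = base;
        for (String f : selected) time += featureDelta.getOrDefault(f, 0.0);
        for (String f : selected)
            for (String g : selected)
                if (f.compareTo(g) < 0) // count each unordered pair once
                    time += pairDelta.getOrDefault(f + "+" + g, 0.0);
        return time;
    }

    public static void main(String[] args) {
        Map<String, Double> fd = Map.of("encryption", 5.0, "compression", 3.0);
        Map<String, Double> pd = Map.of("compression+encryption", 4.0);
        System.out.println(predict(Set.of("encryption", "compression"),
                                   10.0, fd, pd)); // 10 + 5 + 3 + 4 = 22.0
    }
}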
@InProceedings{ICSE12p167,
author = {Norbert Siegmund and Sergiy S. Kolesnikov and Christian Kästner and Sven Apel and Don Batory and Marko Rosenmüller and Gunter Saake},
title = {Predicting Performance via Automated Feature-Interaction Detection},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {167--177},
doi = {},
year = {2012},
}
Defect Prediction
Wed, Jun 6, 14:00 - 15:30
Sound Empirical Evidence in Software Testing
Gordon Fraser and Andrea Arcuri
(Saarland University, Germany; Simula Research Laboratory, Norway)
Several promising techniques have been proposed to automate different tasks in software testing, such as test data generation for object-oriented software. However, reported studies in the literature only show the feasibility of the proposed techniques, because the choice of the employed artifacts in the case studies (e.g., software applications) is usually done in a non-systematic way. The chosen case study might be biased, and so it might not be a valid representative of the addressed type of software (e.g., internet applications and embedded systems). The common trend seems to be to accept this fact and get over it by simply discussing it in a threats to validity section. In this paper, we evaluate search-based software testing (in particular the EvoSuite tool) when applied to test data generation for open source projects. To achieve sound empirical results, we randomly selected 100 Java projects from SourceForge, which is the most popular open source repository (more than 300,000 projects with more than two million registered users). The resulting case study not only is very large (8,784 public classes for a total of 291,639 bytecode level branches), but more importantly it is statistically sound and representative for open source projects. Results show that while high coverage on commonly used types of classes is achievable, in practice environmental dependencies prohibit such high coverage, which clearly points out essential future research directions. To support this future research, our SF100 case study can serve as a much needed corpus of classes for test generation.
@InProceedings{ICSE12p178,
author = {Gordon Fraser and Andrea Arcuri},
title = {Sound Empirical Evidence in Software Testing},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {178--188},
doi = {},
year = {2012},
}
Privacy and Utility for Defect Prediction: Experiments with MORPH
Fayola Peters and Tim Menzies
(West Virginia University, USA)
Ideally, we can learn lessons from software projects across multiple organizations. However, a major impediment to such knowledge sharing is the privacy concerns of software development organizations.
This paper aims to provide defect data-set owners with an effective means of privatizing their data prior to release. We explore MORPH, which maintains class boundaries in a data-set. MORPH is a data mutator that moves the data a random distance, taking care not to cross class boundaries. The value of training on this MORPHed data is tested via a 10-way within-learning study and a cross-learning study using Random Forests, Naive Bayes, and Logistic Regression on ten object-oriented defect data-sets from the PROMISE data repository. Measured in terms of exposure of sensitive attributes, the MORPHed data was four times more private than the unMORPHed data. Also, in terms of f-measures, there was little difference between the MORPHed and unMORPHed data (original data and data privatized by data-swapping) for both the cross and within studies. We conclude that, at least for the kinds of OO defect data studied in this project, data can be privatized without concerns for inference efficacy.
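The core mutation is easy to state: each instance is moved a random fraction of its distance to its nearest instance of a different class, keeping it on its own side of the class boundary. A sketch, with an assumed range for the random factor:

import java.util.*;

// y = x + r * (x - z), where z is x's nearest unlike neighbor;
// the r-range below is a common choice and an assumption here.
public class Morph {

    static double[] morph(double[] x, double[] nearestUnlikeNeighbor,
                          Random rnd) {
        double r = 0.15 + 0.2 * rnd.nextDouble(); // r in [0.15, 0.35)
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++)
            y[i] = x[i] + r * (x[i] - nearestUnlikeNeighbor[i]);
        return y; // mutated instance, pushed away from the class boundary
    }
}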
@InProceedings{ICSE12p189,
author = {Fayola Peters and Tim Menzies},
title = {Privacy and Utility for Defect Prediction: Experiments with MORPH},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {189--199},
doi = {},
year = {2012},
}
Bug Prediction Based on Fine-Grained Module Histories
Hideaki Hata, Osamu Mizuno, and Tohru Kikuno
(Osaka University, Japan; Kyoto Institute of Technology, Japan)
There have been many bug prediction models built with historical metrics, which are mined from version histories of software modules. Many studies have reported the effectiveness of these historical metrics. For prediction levels, most studies have targeted the package and file levels. Prediction at a fine-grained level, that is, the method level, is desirable because it may yield interesting results compared to coarse-grained (package- and file-level) prediction. These results include good performance when quality assurance efforts are considered, and new findings about the correlations between bugs and histories. However, fine-grained prediction has been a challenge because obtaining method histories from existing version control systems is difficult. To tackle this problem, we have developed Historage, a fine-grained version control system for Java. With this system, we target Java software and conduct fine-grained prediction with well-known historical metrics. The results indicate that fine-grained (method-level) prediction outperforms coarse-grained (package- and file-level) prediction when the effort necessary to find bugs is taken into account. Using a correlation analysis, we show that past bug information does not contribute to method-level bug prediction.
@InProceedings{ICSE12p200,
author = {Hideaki Hata and Osamu Mizuno and Tohru Kikuno},
title = {Bug Prediction Based on Fine-Grained Module Histories},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {200--210},
doi = {},
year = {2012},
}
Refactoring
Wed, Jun 6, 14:00 - 15:30
Reconciling Manual and Automatic Refactoring
Xi Ge, Quinton L. DuBose, and Emerson Murphy-Hill
(North Carolina State University, USA)
Although useful and widely available, refactoring tools are underused. One cause of this underuse is that a developer sometimes fails to recognize that she is going to refactor before she begins manually refactoring. To address this issue, we conducted a formative study of developers’ manual refactoring process, suggesting that developers’ reliance on “chasing error messages” when manually refactoring is an error-prone manual refactoring strategy. Additionally, our study distilled a set of manual refactoring workflow patterns. Using these patterns, we designed a novel refactoring tool called BeneFactor. BeneFactor detects a developer’s manual refactoring, reminds her that automatic refactoring is available, and can complete her refactoring automatically. By alleviating the burden of recognizing manual refactoring, BeneFactor is designed to help solve the refactoring tool underuse problem.
@InProceedings{ICSE12p211,
author = {Xi Ge and Quinton L. DuBose and Emerson Murphy-Hill},
title = {Reconciling Manual and Automatic Refactoring},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {211--221},
doi = {},
year = {2012},
}
WitchDoctor: IDE Support for Real-Time Auto-Completion of Refactorings
Stephen R. Foster, William G. Griswold, and Sorin Lerner
(UC San Diego, USA)
Integrated Development Environments (IDEs) have come to perform a wide variety of tasks on behalf of the programmer, refactoring being a classic example. These operations have undeniable benefits, yet their large (and growing) number poses a cognitive scalability problem. Our main contribution is WitchDoctor -- a system that can detect, on the fly, when a programmer is hand-coding a refactoring. The system can then complete the refactoring in the background and propose it to the user long before the user can complete it. This implies a number of technical challenges. The algorithm must 1) be highly efficient, 2) handle unparseable programs, 3) tolerate the variety of ways programmers may perform a given refactoring, 4) use the IDE's proven and familiar refactoring engine to perform the refactoring, even though the refactoring has already begun, and 5) support the wide range of refactorings present in modern IDEs. Our techniques for overcoming these challenges are the technical contributions of this paper. We evaluate WitchDoctor's design and implementation by simulating over 5,000 refactoring operations across three open-source projects. The simulated user is faster and more efficient than an average human user, yet WitchDoctor can detect more than 90% of refactoring operations as they are being performed -- and can complete over a third of refactorings before the simulated user does. All the while, WitchDoctor remains robust in the face of non-parseable programs and unpredictable refactoring scenarios. We also show that WitchDoctor is efficient enough to perform computation on a keystroke-by-keystroke basis, adding an average overhead of only 15 milliseconds per keystroke.
@InProceedings{ICSE12p222,
author = {Stephen R. Foster and William G. Griswold and Sorin Lerner},
title = {WitchDoctor: IDE Support for Real-Time Auto-Completion of Refactorings},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {222--232},
doi = {},
year = {2012},
}
Use, Disuse, and Misuse of Automated Refactorings
Mohsen Vakilian, Nicholas Chen, Stas Negara, Balaji Ambresh Rajkumar, Brian P. Bailey, and Ralph E. Johnson
(University of Illinois at Urbana-Champaign, USA)
Though refactoring tools have been available for more than a decade, research has shown that programmers underutilize such tools. However, little is known about why programmers do not take advantage of these tools. We have conducted a field study of programmers in their natural settings working on their code. As a result, we collected a set of interaction data from about 1,268 hours of programming using our minimally intrusive data collectors. Our quantitative data show that programmers prefer lightweight methods of invoking refactorings, usually perform small changes using the refactoring tool, proceed with an automated refactoring even when it may change the behavior of the program, and rarely preview the automated refactorings. We also interviewed nine of our participants to gain deeper insight into the patterns that we observed in the behavioral data. We found that programmers use predictable automated refactorings even if they have rare bugs or change the behavior of the program. This paper reports some of the factors that affect the use of automated refactorings, such as invocation method, awareness, naming, trust, and predictability, and the major mismatches between programmers' expectations and automated refactorings. The results of this work contribute to producing more effective tools for refactoring complex software.
@InProceedings{ICSE12p233,
author = {Mohsen Vakilian and Nicholas Chen and Stas Negara and Balaji Ambresh Rajkumar and Brian P. Bailey and Ralph E. Johnson},
title = {Use, Disuse, and Misuse of Automated Refactorings},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {233--243},
doi = {},
year = {2012},
}
Human Aspects of Development
Wed, Jun 6, 14:00 - 15:30
Test Confessions: A Study of Testing Practices for Plug-In Systems
Michaela Greiler, Arie van Deursen, and Margaret-Anne Storey
(TU Delft, Netherlands; University of Victoria, Canada)
Testing plug-in based systems is challenging due to complex interactions among many different plug-ins, and variations in version and configuration. The objective of this paper is to find out how developers address this test challenge. To that end, we conduct a qualitative (grounded theory) study, in which we interview 25 senior practitioners about how they test plug-ins and applications built on top of the Eclipse plug-in framework. The outcome is an overview of the testing practices currently used, a set of identified barriers limiting the adoption of test practices, and an explanation of how limited testing is compensated by self-hosting of projects and by involving the community. These results are supported by a structured survey of more than 150 professionals. The study reveals that unit testing plays a key role, whereas plug-in specific integration problems are identified and resolved by the community. Based on our findings, we propose a series of recommendations and areas for future research.
@InProceedings{ICSE12p244,
author = {Michaela Greiler and Arie van Deursen and Margaret-Anne Storey},
title = {Test Confessions: A Study of Testing Practices for Plug-In Systems},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {244--254},
doi = {},
year = {2012},
}
How Do Professional Developers Comprehend Software?
Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej
(TU Munich, Germany; University of Bremen, Germany)
Research in program comprehension has considerably evolved over the past two decades. However, only little is known about how developers practice program comprehension under time and project pressure, and which methods and tools proposed by researchers are used in industry. This paper reports on an observational study of 28 professional developers from seven companies, investigating how developers comprehend software. In particular we focus on the strategies followed, information needed, and tools used.
We found that developers put themselves in the role of end users by inspecting user interfaces. They try to avoid program comprehension, and employ recurring, structured comprehension strategies depending on work context. Further, we found that standards and experience facilitate comprehension. Program comprehension was considered a subtask of other maintenance tasks rather than a task by itself. We also found that face-to-face communication is preferred to documentation. Overall, our results show a gap between program comprehension research and practice as we did not observe any use of state of the art comprehension tools and developers seem to be unaware of them. Our findings call for further careful analysis and for reconsidering research agendas.
@InProceedings{ICSE12p255,
author = {Tobias Roehm and Rebecca Tiarks and Rainer Koschke and Walid Maalej},
title = {How Do Professional Developers Comprehend Software?},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {255--265},
doi = {},
year = {2012},
}
Asking and Answering Questions about Unfamiliar APIs: An Exploratory Study
Ekwa Duala-Ekoko and Martin P. Robillard
(McGill University, Canada)
The increasing size of APIs and the increase in the number of APIs available imply that developers must frequently learn how to use unfamiliar APIs. To identify the types of questions developers want answered when working with unfamiliar APIs, and to understand the difficulty they may encounter answering those questions, we conducted a study involving twenty programmers working on different programming tasks using unfamiliar APIs. Based on the screen-captured videos and the verbalization of the participants, we identified twenty different types of questions programmers ask when working with unfamiliar APIs, and provide new insights into the causes of the difficulties programmers encounter when answering questions about the use of APIs. The questions we have identified and the difficulties we observed can be used for evaluating tools aimed at improving API learning, and in identifying areas of the API learning process where tool support is missing or could be improved.
@InProceedings{ICSE12p266,
author = {Ekwa Duala-Ekoko and Martin P. Robillard},
title = {Asking and Answering Questions about Unfamiliar APIs: An Exploratory Study},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {266--276},
doi = {},
year = {2012},
}
Bug Detection
Wed, Jun 6, 16:00 - 18:00
Automated Repair of HTML Generation Errors in PHP Applications Using String Constraint Solving
Hesam Samimi, Max Schäfer, Shay Artzi, Todd Millstein, Frank Tip, and Laurie Hendren
(UC Los Angeles, USA; IBM Research, USA; McGill University, Canada)
PHP web applications routinely generate invalid HTML. Modern browsers silently correct HTML errors, but sometimes malformed pages render inconsistently, cause browser crashes, or expose security vulnerabilities. Fixing errors in generated pages is usually straightforward, but repairing the generating PHP program can be much harder. We observe that malformed HTML is often produced by incorrect "constant prints", i.e., statements that print string literals, and present two tools for automatically repairing such HTML generation errors. PHPQuickFix repairs simple bugs by statically analyzing individual prints. PHPRepair handles more general repairs using a dynamic approach. Based on a test suite, the property that all tests should produce their expected output is encoded as a string constraint over variables representing constant prints. Solving this constraint describes how constant prints must be modified to make all tests pass. Both tools were implemented as an Eclipse plugin and evaluated on PHP programs containing hundreds of HTML generation errors, most of which our tools were able to repair automatically.
@InProceedings{ICSE12p277,
author = {Hesam Samimi and Max Schäfer and Shay Artzi and Todd Millstein and Frank Tip and Laurie Hendren},
title = {Automated Repair of HTML Generation Errors in PHP Applications Using String Constraint Solving},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {277--287},
doi = {},
year = {2012},
}
Leveraging Test Generation and Specification Mining for Automated Bug Detection without False Positives
Michael Pradel and Thomas R. Gross
(ETH Zurich, Switzerland)
Mining specifications and using them for bug detection is a promising way to reveal bugs in programs. Existing approaches suffer from two problems. First, dynamic specification miners require input that drives a program to generate common usage patterns. Second, existing approaches report false positives, that is, spurious warnings that mislead developers and reduce the practicability of the approach. We present a novel technique for dynamically mining and checking specifications without relying on existing input to drive a program and without reporting false positives. Our technique leverages automatically generated tests in two ways: passing tests drive the program during specification mining, and failing test executions are checked against the mined specifications. The output is a set of warnings that show, with concrete test cases, how the program violates commonly accepted specifications. Our implementation reports no false positives and 54 true positives in ten well-tested Java programs.
@InProceedings{ICSE12p288,
author = {Michael Pradel and Thomas R. Gross},
title = {Leveraging Test Generation and Specification Mining for Automated Bug Detection without False Positives},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {288--298},
doi = {},
year = {2012},
}
Axis: Automatically Fixing Atomicity Violations through Solving Control Constraints
Peng Liu and Charles Zhang
(Hong Kong University of Science and Technology, China)
Atomicity, a general correctness criterion for concurrent programs, is often violated in real-world applications. Worse, violations are difficult for developers to fix, which makes automatic bug-fixing techniques attractive. The state of the art aims at automating the manual fixing process and cannot provide any theoretical reasoning or guarantees. We provide an automatic approach that applies well-studied control theory to guarantee maximal preservation of the concurrency degree of the original code and the introduction of no deadlocks. Under the hood, we reduce the problem of violation fixing to a constraint-solving problem using the Petri net model. Our evaluation on 13 subjects shows that the slowdown incurred by our patches is only 40% of that of the state of the art. With the deadlock-free guarantee, our patches incur moderate overhead (around 10%), which is a worthwhile cost for safety.
@InProceedings{ICSE12p299,
author = {Peng Liu and Charles Zhang},
title = {Axis: Automatically Fixing Atomicity Violations through Solving Control Constraints},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {299--309},
doi = {},
year = {2012},
}
CBCD: Cloned Buggy Code Detector
Jingyue Li and Michael D. Ernst
(DNV Research and Innovation, Norway; University of Washington, USA)
Developers often copy, or clone, code in order to reuse or modify functionality. When they do so, they also clone any bugs in the original code. Or, different developers may independently make the same mistake. As one example of a bug, multiple products in a product line may use a component in a similar wrong way. This paper makes two contributions. First, it presents an empirical study of cloned buggy code. In a large industrial product line, about 4% of the bugs are duplicated across more than one product or file. In three open source projects (the Linux kernel, the Git version control system, and the PostgreSQL database) we found 282, 33, and 33 duplicated bugs, respectively. Second, this paper presents a tool, CBCD, that searches for code that is semantically identical to given buggy code. CBCD tests graph isomorphism over the Program Dependency Graph (PDG) representation and uses four optimizations. We evaluated CBCD by searching for known clones of buggy code segments in the three projects and compared the results with text-based, token-based, and AST-based code clone detectors, namely Simian, CCFinder, Deckard, and CloneDR. The evaluation shows that CBCD is fast when searching for possible clones of the buggy code in a large system, and it is more precise for this purpose than the other code clone detectors.
@InProceedings{ICSE12p310,
author = {Jingyue Li and Michael D. Ernst},
title = {CBCD: Cloned Buggy Code Detector},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {310--320},
doi = {},
year = {2012},
}
Multiversion Software
Wed, Jun 6, 16:00 - 18:00
Crosscutting Revision Control System
Sagi Ifrah and David H. Lorenz
(Open University, Israel)
Large and medium scale software projects often require a source code revision control (RC) system. Unfortunately, RC systems do not handle well the obliviousness and quantification found in aspect-oriented code. When classes are oblivious to aspects, so is the RC system, and the crosscutting effect of aspects is not tracked. In this work, we study this problem in the context of using AspectJ (a standard AOP language) with Subversion (a standard RC system). We describe scenarios where the crosscutting effect of aspects, combined with the concurrent changes that RC supports, can lead to inconsistent states of the code. The work contributes a mechanism that checks in, along with the source code, versions of crosscutting metadata for tracking the effect of aspects. Another contribution of this work is the implementation of a supporting Eclipse plug-in (named XRC) that extends the JDT, AJDT, and SVN plug-ins for Eclipse to provide crosscutting revision control (XRC) for aspect-oriented programming.
@InProceedings{ICSE12p321,
author = {Sagi Ifrah and David H. Lorenz},
title = {Crosscutting Revision Control System},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {321--330},
doi = {},
year = {2012},
}
Where Does This Code Come from and Where Does It Go? - Integrated Code History Tracker for Open Source Systems -
Katsuro Inoue, Yusuke Sasaki, Pei Xia, and Yuki Manabe
(Osaka University, Japan)
When we reuse a code fragment in an open source system, it is very important to know the history of the code, such as its origin and evolution. In this paper, we propose an integrated approach to code history tracking for open source repositories. This approach takes a query code fragment as its input, and returns the code fragments containing code clones of the query code. It utilizes publicly available code search engines as external resources. Based on this model, we have designed and implemented a prototype system named Ichi Tracker. Using Ichi Tracker, we have conducted three case studies. These case studies show the ancestors and descendants of the code, allowing us to recognize its evolution history.
@InProceedings{ICSE12p331,
author = {Katsuro Inoue and Yusuke Sasaki and Pei Xia and Yuki Manabe},
title = {Where Does This Code Come from and Where Does It Go? - Integrated Code History Tracker for Open Source Systems -},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {331--341},
doi = {},
year = {2012},
}
Improving Early Detection of Software Merge Conflicts
Mário Luís Guimarães and António Rito Silva
(Technical University of Lisbon, Portugal)
Merge conflicts cause software defects which, if detected late, may require expensive resolution. This is especially true when developers work too long without integrating concurrent changes, which in practice is common, as integration generally occurs at check-in. Awareness of others' activities was proposed to help developers detect conflicts earlier. However, it requires developers to detect conflicts by themselves and may overload them with notifications, thus making detection harder. This paper presents a novel solution that continuously merges uncommitted and committed changes to create a background system that is analyzed, compiled, and tested to precisely and accurately detect conflicts on behalf of developers, before check-in. An empirical study confirms that our solution avoids overloading developers and improves early detection of conflicts over existing approaches. Similarly to what happened with continuous compilation, this introduces the case for continuous merging inside the IDE.
@InProceedings{ICSE12p342,
author = {Mário Luís Guimarães and António Rito Silva},
title = {Improving Early Detection of Software Merge Conflicts},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {342--352},
doi = {},
year = {2012},
}
A History-Based Matching Approach to Identification of Framework Evolution
Sichen Meng, Xiaoyin Wang, Lu Zhang, and Hong Mei
(Key Laboratory of High Confidence Software Technologies, China; Peking University, China)
In practice, it is common for a framework and its client programs to evolve simultaneously. Thus, developers of client programs may need to migrate their programs to a new release when the framework evolves. As framework developers can hardly always guarantee backward compatibility during the evolution of a framework, migration of its client programs is often time-consuming and error-prone. To facilitate this migration, researchers have proposed two categories of approaches to identifying framework evolution: operation-based approaches and matching-based approaches. To overcome the main limitations of the two categories, we propose a novel approach named HiMa, which is based on matching each pair of consecutive revisions recorded in the evolution history of the framework and aggregating revision-level rules to obtain framework-evolution rules. We implemented HiMa as an Eclipse plug-in targeting frameworks written in Java and using SVN as the version-control system. We further performed an experimental study of HiMa together with a state-of-the-art approach named AURA, using six tasks based on three subject Java frameworks. Our experimental results demonstrate that HiMa achieves higher precision and higher recall than AURA in most circumstances and is never inferior to AURA in precision or recall, although HiMa is computationally more costly than AURA.
@InProceedings{ICSE12p353,
author = {Sichen Meng and Xiaoyin Wang and Lu Zhang and Hong Mei},
title = {A History-Based Matching Approach to Identification of Framework Evolution},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {353--363},
doi = {},
year = {2012},
}
Similarity and Classification
Wed, Jun 6, 16:00 - 18:00
Detecting Similar Software Applications
Collin McMillan, Mark Grechanik, and Denys Poshyvanyk
(College of William and Mary, USA; Accenture Technology Labs, USA; University of Illinois at Chicago, USA)
Although popular text search engines allow users to retrieve similar web pages, source code search engines do not have this feature. Detecting similar applications is a notoriously difficult problem, since it implies that similar high-level requirements and their low-level implementations can be detected and matched automatically for different applications. We created a novel approach for automatically detecting Closely reLated ApplicatioNs (CLAN) that helps users detect similar applications for a given Java application. Our main contributions are an extension to a framework of relevance and a novel algorithm that computes a similarity index between Java applications using the notion of semantic layers that correspond to packages and class hierarchies. We have built CLAN and we conducted an experiment with 33 participants to evaluate CLAN and compare it with the closest competitive approach, MUDABlue. The results show with strong statistical significance that CLAN automatically detects similar applications from a large repository of 8,310 Java applications with a higher precision than MUDABlue.
@InProceedings{ICSE12p364,
author = {Collin McMillan and Mark Grechanik and Denys Poshyvanyk},
title = {Detecting Similar Software Applications},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {364--374},
doi = {},
year = {2012},
}
Content Classification of Development Emails
Alberto Bacchelli, Tommaso Dal Sasso, Marco D'Ambros, and Michele Lanza
(University of Lugano, Switzerland)
Emails related to the development of a software system contain information about design choices and issues encountered during the development process. Exploiting the knowledge embedded in emails with automatic tools is challenging, due to the unstructured, noisy, and mixed-language nature of this communication medium. Natural language text is often not well-formed and is interleaved with fragments in other syntaxes, such as code or stack traces.
We present an approach to classify email content at line level. Our technique classifies email lines in five categories (i.e., text, junk, code, patch, and stack trace) to allow one to subsequently apply ad hoc analysis techniques for each category. We evaluated our approach on a statistically significant set of emails gathered from mailing lists of four unrelated open source systems.
@InProceedings{ICSE12p375,
author = {Alberto Bacchelli and Tommaso Dal Sasso and Marco D'Ambros and Michele Lanza},
title = {Content Classification of Development Emails},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {375--385},
doi = {},
year = {2012},
}
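A line-level classifier of the kind the paper describes can be approximated with per-category patterns. The regexes below are illustrative assumptions chosen to handle the toy email, not the authors' actual technique, which is considerably more robust.

# Sketch: classify email lines into the paper's five categories.
import re

PATTERNS = [
    ("patch",       re.compile(r"^(\+\+\+|---|@@ )|^[+-](?![+-])")),
    ("stack_trace", re.compile(r"^\s+at [\w$.]+\([\w$.]*:?\d*\)|^Caused by:")),
    ("code",        re.compile(r"[;{}]\s*$|^\s*(public|private|def|class|import)\b")),
    ("junk",        re.compile(r"^(>|On .* wrote:|--\s*$|Sent from)")),
]

def classify_line(line):
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return "text"

email = """On Mon, Jan 9, Alice wrote:
> did you see the crash?
Caused by: java.lang.NullPointerException
    at org.example.Parser.parse(Parser.java:42)
I think the fix is:
+        if (input == null) return;
public void parse(String input) {
"""
for line in email.splitlines():
    print(f"{classify_line(line):11s} | {line}")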
Identifying Linux Bug Fixing Patches
Yuan Tian, Julia Lawall, and David Lo
(Singapore Management University, Singapore; INRIA/LIP6, France)
In the evolution of an operating system there is a continuing tension between the need to develop and test new features, and the need to provide a stable and secure execution environment to users. A compromise, adopted by the developers of the Linux kernel, is to release new versions, including bug fixes and new features, frequently, while maintaining some older "longterm" versions. This strategy raises the problem of how to identify bug fixing patches that are submitted to the current version but should be applied to the longterm versions as well. The current approach is to rely on the individual subsystem maintainers to forward patches that seem relevant to the maintainers of the longterm kernels. The reactivity and diligence of the maintainers, however, vary, and thus many important patches could be missed by this approach.
In this paper, we propose an approach that automatically identifies bug fixing patches based on the changes and commit messages recorded in code repositories. We compare our approach with the keyword-based approach for identifying bug-fixing patches used in the literature, in the context of the Linux kernel. The results show that our approach can achieve a 53.19% improvement in recall as compared to keyword-based approaches, with similar precision.
@InProceedings{ICSE12p386,
author = {Yuan Tian and Julia Lawall and David Lo},
title = {Identifying Linux Bug Fixing Patches},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {386--396},
doi = {},
year = {2012},
}
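The contrast the paper draws can be illustrated as follows: the keyword baseline looks only at the commit message, while a learning-based approach also draws features from the code change itself. The feature set below is an invented illustration, not the authors' learned model, and the example commit is fabricated.

# Sketch: keyword baseline vs. commit features for bug-fix identification.
import re

BUG_KEYWORDS = re.compile(r"\b(fix(es|ed)?|bug|crash|oops|leak|overflow)\b", re.I)

def keyword_baseline(commit_msg):
    return bool(BUG_KEYWORDS.search(commit_msg))

def extract_features(commit_msg, diff):
    lines = diff.splitlines()
    added = [l for l in lines if l.startswith("+") and not l.startswith("+++")]
    removed = [l for l in lines if l.startswith("-") and not l.startswith("---")]
    return {
        "has_bug_keyword": keyword_baseline(commit_msg),
        "small_change": len(added) + len(removed) <= 10,  # fixes tend to be small
        "touches_error_path": any("return -E" in l or "goto err" in l for l in added),
        "adds_null_check": any(re.search(r"if\s*\(.*==\s*NULL", l) for l in added),
    }

msg = "ath9k: fix a NULL pointer dereference on resume"
diff = """--- a/drivers/net/wireless/ath/ath9k/init.c
+++ b/drivers/net/wireless/ath/ath9k/init.c
+	if (sc == NULL)
+		return -EINVAL;
"""
print(keyword_baseline(msg))       # True
print(extract_features(msg, diff))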
Active Refinement of Clone Anomaly Reports
Lucia, David Lo, Lingxiao Jiang, and Aditya Budi
(Singapore Management University, Singapore)
Software clones have been widely studied in the recent literature and shown to be useful for finding bugs, because inconsistent changes among clones in a clone group may indicate potential bugs. However, many inconsistent clone groups are not real bugs. The excessive number of false positives could easily impede broad adoption of clone-based bug detection approaches.
In this work, we aim to improve the usability of clone-based bug detection tools by increasing the rate of true positives found when a developer analyzes anomaly reports. Our idea is to control the number of anomaly reports a user sees at a time and to actively incorporate incremental user feedback to continually refine the anomaly reports. Our system first presents the top few anomaly reports from the list generated by a tool in its default ordering. Users then either accept or reject each of the reports. Based on the feedback, our system automatically and iteratively refines a classification model for anomalies and re-sorts the remaining reports. Our goal is to present the true positives to users earlier than the default ordering would. The rationale is our observation that false positives among the inconsistent clone groups often share common features (in terms of code structure, programming patterns, etc.), and these features can be learned from the incremental user feedback.
We evaluate our refinement process on three sets of clone-based anomaly reports, extracted by a clone-based anomaly detection tool from three large real programs: the Linux kernel (C), Eclipse, and ArgoUML (Java). The results show that, compared to the original ordering of reports, we can improve the rate at which true positives are found (i.e., true positives are found faster) by 11%, 87%, and 86% for the Linux kernel, Eclipse, and ArgoUML, respectively.
@InProceedings{ICSE12p397,
author = { Lucia and David Lo and Lingxiao Jiang and Aditya Budi},
title = {Active Refinement of Clone Anomaly Reports},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {397--407},
doi = {},
year = {2012},
}
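The refinement loop described above can be sketched as: show a small batch of top-ranked reports, take accept/reject feedback, update a model over report features, and re-sort the rest. The binary features and the perceptron-style update below are illustrative assumptions; the paper's classifier is more refined.

# Sketch: active refinement of anomaly reports from incremental feedback.
def refine(reports, features, oracle, batch=2, lr=1.0):
    """reports: list of report ids; features: id -> set of feature names;
    oracle: id -> True if the user accepts the report as a true positive."""
    weights = {}
    remaining = list(reports)
    accepted = []
    while remaining:
        # Re-sort by the current model score, then inspect the top batch.
        remaining.sort(key=lambda r: -sum(weights.get(f, 0.0)
                                          for f in features[r]))
        batch_ids, remaining = remaining[:batch], remaining[batch:]
        for r in batch_ids:
            label = 1.0 if oracle(r) else -1.0
            if label > 0:
                accepted.append(r)
            for f in features[r]:          # feedback moves the weights
                weights[f] = weights.get(f, 0.0) + lr * label
    return accepted, weights

features = {
    "r1": {"diff_in_identifier"}, "r2": {"diff_in_comment"},
    "r3": {"diff_in_identifier"}, "r4": {"diff_in_comment"},
}
truth = {"r1": True, "r2": False, "r3": True, "r4": False}
print(refine(["r2", "r4", "r1", "r3"], features, truth.get))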
Analysis for Evolution
Thu, Jun 7, 10:45 - 12:45
Automated Analysis of CSS Rules to Support Style Maintenance
Ali Mesbah and Shabnam Mirshokraie
(University of British Columbia, Canada)
CSS is a widely used language for describing the presentation semantics of HTML elements on the web. The language has a number of characteristics, such as inheritance and cascading order, that make maintaining CSS code a challenging task for web developers. As a result, it is common for unused rules to accumulate over time. Despite these challenges, CSS analysis has not received much attention from the research community. We propose an automated technique to support styling code maintenance, which (1) analyzes the runtime relationship between the CSS rules and DOM elements of a given web application and (2) detects unmatched and ineffective selectors, overridden declaration properties, and undefined class values. Our technique, implemented in an open source tool called CILLA, has a high precision and recall rate. The results of our case study, conducted on fifteen open source and industrial web-based systems, show an average of 60% unused CSS selectors in deployed applications, which points to the ubiquity of the problem.
@InProceedings{ICSE12p408,
author = {Ali Mesbah and Shabnam Mirshokraie},
title = {Automated Analysis of CSS Rules to Support Style Maintenance},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {408--418},
doi = {},
year = {2012},
}
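CILLA's core check, matching selectors against the DOM observed at runtime, can be sketched as below. Only bare tag, .class, and #id selectors are handled, and the flattened DOM is invented for illustration; the real tool covers the full selector syntax against a live page.

# Sketch: flag stylesheet selectors that match no element in the DOM.
DOM = [  # a flattened stand-in for the elements observed at runtime
    {"tag": "div", "classes": {"menu"}, "id": "header"},
    {"tag": "a",   "classes": {"menu", "active"}, "id": None},
    {"tag": "p",   "classes": set(), "id": None},
]

def matches(selector, el):
    if selector.startswith("."):
        return selector[1:] in el["classes"]
    if selector.startswith("#"):
        return selector[1:] == el["id"]
    return selector == el["tag"]

def unused_selectors(stylesheet):
    return [s for s in stylesheet
            if not any(matches(s, el) for el in DOM)]

rules = ["div", ".menu", ".sidebar", "#header", "#footer", "span"]
print(unused_selectors(rules))   # ['.sidebar', '#footer', 'span']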
Graph-Based Analysis and Prediction for Software Evolution
Pamela Bhattacharya, Marios Iliofotou, Iulian Neamtiu, and Michalis Faloutsos
(UC Riverside, USA)
We exploit recent advances in the analysis of graph topology to better understand software evolution, and to construct predictors that facilitate software development and maintenance. Managing an evolving, collaborative software system is a complex and expensive process, which still cannot ensure software reliability. Emerging techniques in graph mining have revolutionized the modeling of many complex systems and processes. We show how a graph-based characterization of a software system can capture its evolution and facilitate development, by helping us estimate bug severity, prioritize refactoring efforts, and predict defect-prone releases. Our work consists of three main thrusts. First, we construct graphs that capture software structure at two different levels: (a) the product, i.e., source code and module level, and (b) the process, i.e., developer collaboration level. We identify a set of graph metrics that capture interesting properties of these graphs. Second, we study the evolution of eleven open source programs, including Firefox, Eclipse, and MySQL, over their lifespans, typically a decade or more. Third, we show how our graph metrics can be used to construct predictors for bug severity, high-maintenance software parts, and failure-prone releases. Our work strongly suggests that graph topology analysis can open many actionable avenues in software engineering research and practice.
@InProceedings{ICSE12p419,
author = {Pamela Bhattacharya and Marios Iliofotou and Iulian Neamtiu and Michalis Faloutsos},
title = {Graph-Based Analysis and Prediction for Software Evolution},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {419--429},
doi = {},
year = {2012},
}
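The first thrust, building a graph over software entities and computing topology metrics that feed predictors, is easy to illustrate. The tiny call graph below is invented; the study builds its product and developer-collaboration graphs from real project data, and its metric set goes beyond these three.

# Sketch: graph metrics over a product-level structure graph.
import networkx as nx

# Edges: function A calls function B.
G = nx.DiGraph([
    ("main", "parse"), ("main", "render"), ("parse", "lex"),
    ("render", "layout"), ("layout", "parse"),
])

metrics = {
    "degree_centrality": nx.degree_centrality(G),
    "pagerank": nx.pagerank(G),                # highlights heavily-depended-on code
    "clustering": nx.clustering(G.to_undirected()),
}
for name, values in metrics.items():
    top = max(values, key=values.get)
    print(f"{name:18s} highest for {top!r}: {values[top]:.3f}")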
Integrated Impact Analysis for Managing Software Changes
Malcom Gethers, Bogdan Dit, Huzefa Kagdi, and Denys Poshyvanyk
(College of William and Mary, USA; Wichita State University, USA)
The paper presents an adaptive approach to performing impact analysis from a given change request to source code. Given a textual change request (e.g., a bug report), a single snapshot (release) of source code, indexed using Latent Semantic Indexing, is used to estimate the impact set. Should additional contextual information be available, the approach configures the best-fit combination to produce an improved impact set. Contextual information includes the execution trace and an initial source code entity verified for change. Combinations of information retrieval, dynamic analysis, and data mining of past source code commits are considered. The research hypothesis is that these combinations help counter the precision or recall deficit of individual techniques and improve the overall accuracy. The tandem operation of the three techniques sets the approach apart from related solutions. It automates impact analysis while effectively utilizing two key sources of developer knowledge that are often overlooked at the change request level.
To validate our approach, we conducted an empirical evaluation on four open source software systems. A benchmark consisting of a number of maintenance issues, such as feature requests and bug fixes, and their associated source code changes was established by manual examination of these systems and their change history. Our results indicate that there are combinations formed from the augmented developer contextual information that show statistically significant improvement over stand-alone approaches.
@InProceedings{ICSE12p430,
author = {Malcom Gethers and Bogdan Dit and Huzefa Kagdi and Denys Poshyvanyk},
title = {Integrated Impact Analysis for Managing Software Changes},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {430--440},
doi = {},
year = {2012},
}
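One way to picture the combination of evidence sources is simple rank fusion over the entity rankings each technique produces. The reciprocal-rank scheme and the file names below are assumptions made for illustration; the paper configures best-fit combinations of the techniques rather than averaging ranks.

# Sketch: fusing IR, dynamic, and mining-based impact rankings.
def fuse(rankings, weights):
    """rankings: list of ordered entity lists (best first)."""
    scores = {}
    for ranking, w in zip(rankings, weights):
        for pos, entity in enumerate(ranking):
            # reciprocal-rank contribution, weighted per technique
            scores[entity] = scores.get(entity, 0.0) + w / (pos + 1)
    return sorted(scores, key=scores.get, reverse=True)

ir_rank      = ["Parser.java", "Cache.java", "Render.java"]
dynamic_rank = ["Cache.java", "Parser.java"]
mining_rank  = ["Cache.java", "Config.java"]

print(fuse([ir_rank, dynamic_rank, mining_rank], [1.0, 1.0, 0.5]))
# Cache.java rises to the top because all three sources implicate it.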
Detecting and Visualizing Inter-worksheet Smells in Spreadsheets
Felienne Hermans, Martin Pinzger, and Arie van Deursen
(TU Delft, Netherlands)
Spreadsheets are often used in business, for simple tasks as well as for mission-critical tasks such as finance or forecasting. Similar to software, some spreadsheets are of better quality than others, for instance with respect to usability, maintainability or reliability. In contrast with software, however, spreadsheets are rarely checked, tested or certified. In this paper, we aim at developing an approach for detecting smells that indicate weak points in a spreadsheet's design. To that end, we first study code smells and transform these code smells into their spreadsheet counterparts. We then present an approach to detect the smells and to communicate located smells to spreadsheet users with data flow diagrams. We analyzed occurrences of these smells in the EUSES corpus. Furthermore, we conducted ten case studies in an industrial setting. The results of the evaluation indicate that smells can indeed reveal weaknesses in a spreadsheet's design, and that data flow diagrams are an appropriate way to show those weaknesses.
@InProceedings{ICSE12p441,
author = {Felienne Hermans and Martin Pinzger and Arie van Deursen},
title = {Detecting and Visualizing Inter-worksheet Smells in Spreadsheets},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {441--451},
doi = {},
year = {2012},
}
Debugging
Thu, Jun 7, 10:45 - 12:45
An Empirical Study about the Effectiveness of Debugging When Random Test Cases Are Used
Mariano Ceccato, Alessandro Marchetto, Leonardo Mariani, Cu D. Nguyen, and Paolo Tonella
(Fondazione Bruno Kessler, Italy; University of Milano-Bicocca, Italy)
Automatically generated test cases are usually evaluated in terms of their fault revealing or coverage capability. Besides these two aspects, test cases are also a major source of information for fault localization and fixing. The impact of automatically generated test cases on the debugging activity, compared to the use of manually written test cases, has never been studied before.
In this paper we report the results obtained from two controlled experiments with human subjects performing debugging tasks using automatically generated or manually written test cases. We investigate whether the features of the former type of test cases, which make them less readable and understandable (e.g., unclear test scenarios, meaningless identifiers), have an impact on accuracy and efficiency of debugging. The empirical study is aimed at investigating whether, despite the lack of readability in automatically generated test cases, subjects can still take advantage of them during debugging.
@InProceedings{ICSE12p452,
author = {Mariano Ceccato and Alessandro Marchetto and Leonardo Mariani and Cu D. Nguyen and Paolo Tonella},
title = {An Empirical Study about the Effectiveness of Debugging When Random Test Cases Are Used},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {452--462},
doi = {},
year = {2012},
}
Reducing Confounding Bias in Predicate-Level Statistical Debugging Metrics
Ross Gore and Paul F. Reynolds, Jr.
(University of Virginia, USA)
Statistical debuggers use data collected during test case execution to automatically identify the location of faults within software. Recent work has applied causal inference to eliminate or reduce control and data flow dependence confounding bias in statement-level statistical debuggers. The result is improved effectiveness. This is encouraging but motivates two novel questions: (1) how can causal inference be applied in predicate-level statistical debuggers and (2) what other biases can be eliminated or reduced. Here we address both questions by providing a model that eliminates or reduces control flow dependence and failure flow confounding bias within predicate-level statistical debuggers. We present empirical results demonstrating that our model significantly improves the effectiveness of a variety of predicate-level statistical debuggers, including those that eliminate or reduce only a single source of confounding bias.
@InProceedings{ICSE12p463,
author = {Ross Gore and Paul F. Reynolds, Jr.},
title = {Reducing Confounding Bias in Predicate-Level Statistical Debugging Metrics},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {463--473},
doi = {},
year = {2012},
}
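For context, the predicate-level scoring that such statistical debuggers start from is the classic Increase metric of cooperative bug isolation: how much more likely a run is to fail when a predicate is observed true than when it is merely observed. The paper's contribution is a causal model that adjusts such scores for control flow and failure flow confounding; that adjustment is not reproduced in this baseline sketch.

# Sketch: the baseline Increase metric for a predicate P.
def increase(obs_true_fail, obs_true, obs_fail, obs):
    """Score a predicate P from execution counts:
    obs_true_fail: runs where P was observed true and the run failed
    obs_true:      runs where P was observed true
    obs_fail:      runs where P was observed (at all) and the run failed
    obs:           runs where P was observed"""
    failure_given_true = obs_true_fail / obs_true
    failure_given_observed = obs_fail / obs
    return failure_given_true - failure_given_observed

# A predicate true almost exclusively in failing runs scores high.
print(increase(obs_true_fail=9, obs_true=10, obs_fail=12, obs=100))  # 0.78
print(increase(obs_true_fail=3, obs_true=30, obs_fail=12, obs=100))  # -0.02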
BugRedux: Reproducing Field Failures for In-House Debugging
Wei Jin and Alessandro Orso
(Georgia Tech, USA)
A recent survey conducted among developers of the Apache, Eclipse, and Mozilla projects showed that the ability to recreate field failures is considered of fundamental importance when investigating bug reports. Unfortunately, the information typically contained in a bug report, such as memory dumps or call stacks, is usually insufficient for recreating the problem. Even more advanced approaches for gathering field data and supporting in-house debugging tend to collect either too little information, and be ineffective, or too much information, and be inefficient. To address these issues, we present BugRedux, a novel general approach for in-house debugging of field failures. BugRedux aims to synthesize, using execution data collected in the field, executions that mimic the observed field failures. We define several instances of BugRedux that collect different types of execution data and perform, through an empirical study, a cost-benefit analysis of the approach and its variations. In the study, we apply BugRedux to 16 failures of 14 real-world programs. Our results are promising in that they show that it is possible to synthesize in-house executions that reproduce failures observed in the field using a suitable set of execution data.
@InProceedings{ICSE12p474,
author = {Wei Jin and Alessandro Orso},
title = {BugRedux: Reproducing Field Failures for In-House Debugging},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {474--484},
doi = {},
year = {2012},
}
Object-Centric Debugging
Jorge Ressia, Alexandre Bergel, and Oscar Nierstrasz
(University of Bern, Switzerland; University of Chile, Chile)
During the process of developing and maintaining a complex software system, developers pose detailed questions about the runtime behavior of the system. Source code views offer strictly limited insights, so developers often turn to tools like debuggers to inspect and interact with the running system. Unfortunately, traditional debuggers focus on the runtime stack as the key abstraction to support debugging operations, though the questions developers pose often have more to do with objects and their interactions. We propose object-centric debugging as an alternative approach to interacting with a running software system. We show how, by focusing on objects as the key abstraction, natural debugging operations can be defined to answer developer questions related to runtime behavior. We present a running prototype of an object-centric debugger, and demonstrate, with the help of a series of examples, how object-centric debugging offers more effective support for many typical developer tasks than a traditional stack-oriented debugger.
@InProceedings{ICSE12p485,
author = {Jorge Ressia and Alexandre Bergel and Oscar Nierstrasz},
title = {Object-Centric Debugging},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {485--495},
doi = {},
year = {2012},
}
Human Aspects of Process
Thu, Jun 7, 10:45 - 12:45
Disengagement in Pair Programming: Does It Matter?
Laura Plonka, Helen Sharp, and Janet van der Linden
(Open University, UK)
Pair Programming (PP) requires close collaboration and mutual engagement. Most existing empirical studies of PP do not focus on developers’ behaviour during PP sessions, and focus instead on the effects of PP such as productivity. However, disengagement, where a developer is not focusing on solving the task or understanding the problem and allows their partner to work by themselves, can hinder collaboration between developers and have a negative effect on their performance. This paper reports on an empirical study that investigates disengagement. Twenty-one industrial pair programming sessions were video and audio recorded and qualitatively analysed to investigate circumstances that lead to disengagement. We identified five reasons for disengagement: interruptions during the collaboration, the way the work is divided, the simplicity of the task involved, social pressure on inexperienced pair programmers, and time pressure. Our findings suggest that disengagement is sometimes acceptable and agreed upon between the developers in order to speed up problem solving. However, we also found episodes of disengagement where developers “drop out” of their PP sessions and are not able to follow their partner’s work nor contribute to the task at hand, thus losing the expected benefits of pairing. Analysis of sessions conducted under similar circumstances but where mutual engagement was sustained identified three behaviours that help to maintain engagement: encouraging the novice to drive, verbalisation and feedback, and asking for clarification.
@InProceedings{ICSE12p496,
author = {Laura Plonka and Helen Sharp and Janet van der Linden},
title = {Disengagement in Pair Programming: Does It Matter?},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {496--506},
doi = {},
year = {2012},
}
Ambient Awareness of Build Status in Collocated Software Teams
John Downs, Beryl Plimmer, and John G. Hosking
(University of Melbourne, Australia; University of Auckland, New Zealand; Australian National University, Australia)
We describe the evaluation of a build awareness system that assists agile software development teams in understanding current build status and who is responsible for any build breakages. The system uses ambient awareness technologies, providing a separate, easily perceived communication channel distinct from the standard team workflow. Multiple system configurations and behaviours were evaluated. The evaluation showed that, while there was no significant change in the proportion of build breakages, the overall number of builds increased substantially and the duration of broken builds decreased. Team members also reported an increased sense of awareness of, and responsibility for, broken builds, and some noted that the system dramatically changed their perception of the build process, making them more cognisant of broken builds.
@InProceedings{ICSE12p507,
author = {John Downs and Beryl Plimmer and John G. Hosking},
title = {Ambient Awareness of Build Status in Collocated Software Teams},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {507--517},
doi = {},
year = {2012},
}
What Make Long Term Contributors: Willingness and Opportunity in OSS Community
Minghui Zhou and Audris Mockus
(Peking University, China; Key Laboratory of High Confidence Software Technologies, China; Avaya Labs Research, USA)
To survive and succeed, software projects need to attract and retain contributors. We model an individual's chances of becoming a valuable contributor through their capacity, willingness, and opportunity to contribute at the time of joining. Using issue tracking data of Mozilla and Gnome, we find that the probability of a new joiner becoming a Long Term Contributor (LTC) is associated with her willingness and environment. Specifically, during their first month, future LTCs tend to be more active and show a more community-oriented attitude than other joiners. Joiners who start by commenting on instead of reporting an issue, or who succeed in getting at least one reported issue fixed, more than double their odds of becoming an LTC. A macro-climate with high project relative sociality and a micro-climate with a large, productive, and clustered peer group increase the odds. On the contrary, a macro-climate with high project popularity and a micro-climate with low attention from peers reduce the odds. This implies that the interaction between an individual's attitude and the project's climate is associated with the odds that the individual will become a valuable contributor or disengage from the project. Our findings may provide a basis for empirical approaches to designing a better community architecture and improving the experience of contributors.
@InProceedings{ICSE12p518,
author = {Minghui Zhou and Audris Mockus},
title = {What Make Long Term Contributors: Willingness and Opportunity in OSS Community},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {518--528},
doi = {},
year = {2012},
}
Development of Auxiliary Functions: Should You Be Agile? An Empirical Assessment of Pair Programming and Test-First Programming
Otávio Augusto Lazzarini Lemos, Fabiano Cutigi Ferrari, Fábio Fagundes Silveira, and Alessandro Garcia
(UNIFESP, Brazil; UFSCar, Brazil; PUC-Rio, Brazil)
A considerable part of software systems consists of functions that support the main modules, such as array or string manipulation and basic math computation. These auxiliary functions are usually considered less complex, and thus tend to receive less attention from developers. However, failures in these functions might propagate to more critical modules, thereby affecting the system's overall reliability. Given the complementary role of auxiliary functions, a question that arises is whether agile practices, such as pair programming and test-first programming, can improve their correctness without affecting time-to-market. This paper presents an empirical assessment comparing the application of these agile practices with more traditional approaches. Our study comprises independent experiments of pair versus solo programming, and test-first versus test-last programming. The first study involved 85 novice programmers who applied both traditional and agile approaches in the development of six auxiliary functions within three different domains. Our results suggest that the agile practices might bring benefits in this context. In particular, pair programmers delivered correct implementations much more often, and test-first programming encouraged the production of larger and higher coverage test sets. On the downside, the main experiment showed that both practices significantly increase total development time. A replication of the test-first experiment with professional developers shows similar results.
@InProceedings{ICSE12p529,
author = {Otávio Augusto Lazzarini Lemos and Fabiano Cutigi Ferrari and Fábio Fagundes Silveira and Alessandro Garcia},
title = {Development of Auxiliary Functions: Should You Be Agile? An Empirical Assessment of Pair Programming and Test-First Programming},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {529--539},
doi = {},
year = {2012},
}
Models
Thu, Jun 7, 10:45 - 12:45
Maintaining Invariant Traceability through Bidirectional Transformations
Yijun Yu, Yu Lin, Zhenjiang Hu, Soichiro Hidaka, Hiroyuki Kato, and Lionel Montrieux
(Open University, UK; University of Illinois at Urbana-Champaign, USA; National Institute of Informatics, Japan)
Following the "convention over configuration" paradigm, model-driven development (MDD) generates code to implement the "default" behaviour specified by a template separate from the input model, reducing the decision effort of developers. For flexibility, users of MDD are allowed to customise the model and the generated code in parallel. Synchronisation of a changed model or code is maintained by reflecting the change on the other end of the code generation, as long as the traceability is unchanged. However, such invariant traceability between corresponding model and code elements can be violated either when (a) users of MDD protect custom changes from regeneration, or when (b) developers of MDD change the template that generates the default behaviour.
A mismatch between user and template code is inevitable as they evolve for their own purposes. In this paper, we propose a two-layered invariant traceability framework that reduces the number of mismatches through bidirectional transformations. On top of existing vertical (model<->code) synchronisations between a model and the template code, a horizontal (code<->code) synchronisation between user and template code is supported, aligning the changes in both directions. Our blinkit tool is evaluated using the data set available from the CVS repositories of an MDD project: Eclipse MDT/GMF.
@InProceedings{ICSE12p540,
author = {Yijun Yu and Yu Lin and Zhenjiang Hu and Soichiro Hidaka and Hiroyuki Kato and Lionel Montrieux},
title = {Maintaining Invariant Traceability through Bidirectional Transformations},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {540--550},
doi = {},
year = {2012},
}
Slicing MATLAB Simulink Models
Robert Reicherdt and Sabine Glesner
(TU Berlin, Germany)
MATLAB Simulink is the most widely used industrial tool for developing complex embedded systems in the automotive sector. The resulting Simulink models often consist of more than ten thousand blocks and a large number of hierarchy levels. To ensure the quality of such models, automated static analyses and slicing are necessary to cope with this complexity. In particular, static analyses are required that operate directly on the models. In this article, we present an approach for slicing Simulink Models using dependence graphs and demonstrate its efficiency using case studies from the automotive and avionics domain. With slicing, the complexity of a model can be reduced for a given point of interest by removing unrelated model elements, thus paving the way for subsequent static quality assurance methods.
@InProceedings{ICSE12p551,
author = {Robert Reicherdt and Sabine Glesner},
title = {Slicing MATLAB Simulink Models},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {551--561},
doi = {},
year = {2012},
}
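The essence of dependence-graph slicing, here for a block diagram, is reverse reachability from a block of interest. The toy model below is invented for illustration; slicing real Simulink models, as in the paper, must also handle control dependences, subsystems, and implicit connections such as Goto/From blocks.

# Sketch: backward slice over a block-level dependence graph.
def backward_slice(deps, criterion):
    """deps: block -> set of blocks whose outputs it reads."""
    keep, worklist = set(), [criterion]
    while worklist:
        block = worklist.pop()
        if block not in keep:
            keep.add(block)
            worklist.extend(deps.get(block, ()))
    return keep

model = {                        # a toy model: signal flows left to right
    "Gain":       {"Inport"},
    "Integrator": {"Gain"},
    "Scope":      {"Integrator"},
    "Display":    {"Constant"},  # unrelated to the Scope
}
print(backward_slice(model, "Scope"))
# {'Scope', 'Integrator', 'Gain', 'Inport'}; 'Display'/'Constant' are removed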
Partial Evaluation of Model Transformations
Ali Razavi and Kostas Kontogiannis
(University of Waterloo, Canada; National Technical University of Athens, Greece)
Model transformation is considered an important enabling factor for model-driven development. Transformations can be applied not only for the generation of new models from existing ones, but also for the consistent co-evolution of software artifacts that pertain to various phases of the software lifecycle, such as requirement models, design documents, and source code. Furthermore, in practical scenarios it is common to apply such transformations repeatedly and frequently; an activity that can take a significant amount of time and resources, especially when the affected models are complex and highly interdependent. In this paper, we discuss a novel approach for deriving incremental model transformations by the partial evaluation of original model transformation programs. Partial evaluation involves pre-computing parts of the transformation program based on known model dependencies and the type of the applied model change. Such pre-evaluation allows for a significant reduction of transformation time in large and complex model repositories. To evaluate the approach, we have implemented QvtMix, a prototype partial evaluator for the Query, View and Transformation Operational Mappings (QVT-OM) language. The experiments indicate that the proposed technique can significantly improve the performance of repetitive applications of model transformations.
@InProceedings{ICSE12p562,
author = {Ali Razavi and Kostas Kontogiannis},
title = {Partial Evaluation of Model Transformations},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {562--572},
doi = {},
year = {2012},
}
Partial Models: Towards Modeling and Reasoning with Uncertainty
Michalis Famelis, Rick Salay, and Marsha Chechik
(University of Toronto, Canada)
Models are good at expressing information about software but not as good at expressing modelers' uncertainty about it. The highly incremental and iterative nature of software development nonetheless requires the ability to express uncertainty and reason with models containing it. In this paper, we build on our earlier work on expressing uncertainty using partial models, by elaborating an approach to reasoning with such models. We evaluate our approach by experimentally comparing it to traditional strategies for dealing with uncertainty as well as by conducting a case study using open source software. We conclude that we are able to reap the benefits of well-managed uncertainty while incurring minimal additional cost.
@InProceedings{ICSE12p573,
author = {Michalis Famelis and Rick Salay and Marsha Chechik},
title = {Partial Models: Towards Modeling and Reasoning with Uncertainty},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {573--583},
doi = {},
year = {2012},
}
Concurrency and Exceptions
Thu, Jun 7, 16:00 - 17:30
Static Detection of Resource Contention Problems in Server-Side Scripts
Yunhui Zheng and
Xiangyu Zhang
(Purdue University, USA)
With modern multi-core architectures, web applications are usually configured to serve multiple requests simultaneously by spawning multiple instances. These instances may access the same external resources, such as database tables and files. Such contentions may become severe during peak time, leading to atomicity violations in business logic. In this paper, we propose a novel static analysis that detects atomicity violations of external operations for server-side scripts. The analysis differs from traditional atomicity violation detection techniques by focusing on external resources instead of shared memory. It consists of three components. The first is an interprocedural and path-sensitive resource identity analysis that determines whether multiple operations access the same external resource, which is critical to identifying contentions. The second component infers pairs of external operations that should be executed atomically. Finally, violations are detected by reasoning about the serializability of interleaved atomic pairs. Experimental results show that the analysis is highly effective in detecting atomicity violations in real-world web apps.
@InProceedings{ICSE12p584,
author = {Yunhui Zheng and Xiangyu Zhang},
title = {Static Detection of Resource Contention Problems in Server-Side Scripts},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {584--594},
doi = {},
year = {2012},
}
Amplifying Tests to Validate Exception Handling Code
Pingyu Zhang and Sebastian Elbaum
(University of Nebraska-Lincoln, USA)
Validating code that handles exceptional behavior is difficult, particularly when dealing with external resources that may be noisy and unreliable, as it requires: 1) the systematic exploration of the space of exceptions that may be thrown by the external resources, and 2) the setup of the context to trigger specific patterns of exceptions. In this work we present an approach that addresses those difficulties by performing an exhaustive amplification of the space of exceptional behavior associated with an external resource that is exercised by a test suite. Each amplification attempts to expose a program's exception handling constructs to new behavior by mocking an external resource so that it returns normally or throws an exception following a predefined pattern. Our assessment of the approach indicates that it can be fully automated, is powerful enough to detect 65% of the faults of this kind reported in bug reports, and is precise enough that 77% of the detected anomalies correspond to faults fixed by the developers.
@InProceedings{ICSE12p595,
author = {Pingyu Zhang and Sebastian Elbaum},
title = {Amplifying Tests to Validate Exception Handling Code},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {595--605},
doi = {},
year = {2012},
}
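A single amplification of the kind described above can be sketched with standard mocking: re-run an existing test while the external resource throws at a chosen call, so the exception-handling path is exercised. The code under test and the injected pattern are invented for illustration; unittest.mock is standard Python.

# Sketch: amplify a happy-path test by injecting a resource failure.
from unittest import mock

def save_report(data, path):
    try:
        with open(path, "w") as f:
            f.write(data)
        return "saved"
    except OSError:
        return "fell back to in-memory buffer"   # the handler under scrutiny

# Original test exercises only the happy path:
assert save_report("x", "/tmp/r.txt") == "saved"

# Amplified variant: the external resource (the filesystem) now fails.
with mock.patch("builtins.open", side_effect=OSError("disk full")):
    assert save_report("x", "/tmp/r.txt") == "fell back to in-memory buffer"
print("exception-handling path covered")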
MagicFuzzer: Scalable Deadlock Detection for Large-Scale Applications
Yan Cai and W. K. Chan
(City University of Hong Kong, China)
We present MagicFuzzer, a novel dynamic deadlock detection technique. Unlike existing techniques, which locate potential deadlock cycles from an execution, it iteratively prunes lock dependencies that have no incoming or outgoing edge. Combined with a novel thread-specific strategy, this dramatically shrinks the size of the lock dependency set used for cycle detection, significantly improving the efficiency and scalability of the detection. In the real-deadlock confirmation phase, it uses a new strategy to actively schedule threads of an execution against the whole set of potential deadlock cycles. We have implemented a prototype and evaluated it on large-scale C/C++ programs. The experimental results confirm that our technique is significantly more effective and efficient than existing techniques.
@InProceedings{ICSE12p606,
author = {Yan Cai and W. K. Chan},
title = {MagicFuzzer: Scalable Deadlock Detection for Large-Scale Applications},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {606--616},
doi = {},
year = {2012},
}
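The pruning idea is simple to state on a lock-order graph: a lock dependency can only take part in a deadlock cycle if its source also has an incoming edge and its target also has an outgoing edge, so everything else can be deleted iteratively. The dependency set below is invented; the tool works on lock dependencies observed in a real execution and then actively schedules threads to confirm cycles.

# Sketch: iterative pruning of a lock dependency set before cycle search.
def prune(edges):
    edges = set(edges)
    changed = True
    while changed:
        sources = {a for a, b in edges}
        targets = {b for a, b in edges}
        keep = {(a, b) for a, b in edges
                if a in targets and b in sources}  # both ends stay cyclic-capable
        changed = keep != edges
        edges = keep
    return edges

# Lock A held while acquiring B, etc.; (C, D) is a dead-end dependency.
deps = [("A", "B"), ("B", "A"), ("C", "D")]
print(prune(deps))   # {('A', 'B'), ('B', 'A')}: the potential deadlock cycle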
Software Architecture
Thu, Jun 7, 16:00 - 17:30
Does Organizing Security Patterns Focus Architectural Choices?
Koen Yskout, Riccardo Scandariato, and Wouter Joosen
(KU Leuven, Belgium)
Security patterns can be a valuable vehicle for designing secure software. Several proposals have been advanced to improve the usability of security patterns; they often describe extra annotations to be included in the pattern documentation. This paper presents an empirical study that validates whether those proposals provide any real benefit for software architects. A controlled experiment was executed with 90 master students, who performed several design tasks involving the hardening of a software architecture via security patterns. The results show that annotations produce benefits in terms of a reduced number of alternatives that need to be considered during the selection of a suitable pattern. However, they do not reduce the time spent in the selection process.
@InProceedings{ICSE12p617,
author = {Koen Yskout and Riccardo Scandariato and Wouter Joosen},
title = {Does Organizing Security Patterns Focus Architectural Choices?},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {617--627},
doi = {},
year = {2012},
}
Enhancing Architecture-Implementation Conformance with Change Management and Support for Behavioral Mapping
Yongjie Zheng and Richard N. Taylor
(UC Irvine, USA)
It is essential for software architecture to remain consistent with the implementation during software development. Existing architecture-implementation mapping approaches are insufficient for a variety of reasons, including lack of support for change management and for mapping behavioral architecture specifications. A new approach called 1.x-way architecture-implementation mapping is presented in this paper to address these issues. Its contributions include deep separation of generated and non-generated code, an architecture change model, architecture-based code regeneration, and architecture change notification. The approach is implemented in ArchStudio 4, an Eclipse-based architecture development environment. To evaluate its utility, we refactored the code of ArchStudio and replayed changes that had been made to ArchStudio in two research projects by redoing them with the developed tool.
@InProceedings{ICSE12p628,
author = {Yongjie Zheng and Richard N. Taylor},
title = {Enhancing Architecture-Implementation Conformance with Change Management and Support for Behavioral Mapping},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {628--638},
doi = {},
year = {2012},
}
A Tactic-Centric Approach for Automating Traceability of Quality Concerns
Mehdi Mirakhorli, Yonghee Shin, Jane Cleland-Huang, and Murat Cinar
(DePaul University, USA)
The software architectures of business, mission, or safety critical systems must be carefully designed to balance an exacting set of quality concerns describing characteristics such as security, reliability, and performance. Unfortunately, software architectures tend to degrade over time as maintainers modify the system without understanding the underlying architectural decisions. Although this problem can be mitigated by manually tracing architectural decisions into the code, the cost and effort required to do this can be prohibitively expensive. In this paper we therefore present a novel approach for automating the construction of traceability links for architectural tactics. Our approach utilizes machine learning methods and lightweight structural analysis to detect tactic-related classes. The detected tactic-related classes are then mapped to a Tactic Traceability Information Model. We train our trace algorithm using code extracted from fifteen performance-centric and safety-critical open source software systems and then evaluate it against the Apache Hadoop framework. Our results show that automatically generated traceability links can support software maintenance activities while preserving architectural qualities.
@InProceedings{ICSE12p639,
author = {Mehdi Mirakhorli and Yonghee Shin and Jane Cleland-Huang and Murat Cinar},
title = {A Tactic-Centric Approach for Automating Traceability of Quality Concerns},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {639--649},
doi = {},
year = {2012},
}
Formal Verification
Thu, Jun 7, 16:00 - 17:30
Build Code Analysis with Symbolic Evaluation
Ahmed Tamrawi,
Hoan Anh Nguyen, Hung Viet Nguyen, and Tien N. Nguyen
(Iowa State University, USA)
The build process is crucial in software development. However, analysis support for build code is still limited. In this paper, we present SYMake, an infrastructure and tool for the analysis of build code in make. Due to the dynamic nature of the make language, it is challenging to understand and maintain complex Makefiles. SYMake provides a symbolic evaluation algorithm that processes Makefiles and produces a symbolic dependency graph (SDG), which represents the build dependencies (i.e., rules) among files via commands. During the symbolic evaluation, for each resulting string value in an SDG that represents part of a file name or a command in a rule, SYMake also provides an acyclic graph (called a T-model) to represent its symbolic evaluation trace. We have used SYMake to develop algorithms and a tool (1) to detect several types of code smells and errors in Makefiles, and (2) to support build code refactoring, e.g., renaming a variable or target even if its name is fragmented and built from multiple substrings. Our empirical evaluation of SYMake's renaming on several real-world systems showed its high accuracy in entity renaming. Our controlled experiment showed that with SYMake, developers were able to understand Makefiles better, to detect more code smells, and to perform refactoring more accurately.
@InProceedings{ICSE12p650,
author = {Ahmed Tamrawi and Hoan Anh Nguyen and Hung Viet Nguyen and Tien N. Nguyen},
title = {Build Code Analysis with Symbolic Evaluation},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {650--660},
doi = {},
year = {2012},
}
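The kind of structure an SDG captures, build dependencies among files via commands, can be sketched for the trivial case of literal rules. The toy parser below handles only "target: prerequisites" lines and tab-indented commands; SYMake's symbolic evaluation additionally resolves variables, functions, and conditionals, which this sketch does not attempt.

# Sketch: extract a build dependency graph from literal Makefile rules.
def parse_rules(makefile_text):
    rules, current = {}, None
    for line in makefile_text.splitlines():
        if line.startswith("\t") and current:
            rules[current]["commands"].append(line.strip())
        elif ":" in line and not line.lstrip().startswith("#"):
            target, _, prereqs = line.partition(":")
            current = target.strip()
            rules[current] = {"deps": prereqs.split(), "commands": []}
    return rules

makefile = """app: main.o util.o
\tcc -o app main.o util.o
main.o: main.c
\tcc -c main.c
util.o: util.c
\tcc -c util.c
"""
for target, rule in parse_rules(makefile).items():
    print(f"{target} <- {rule['deps']} via {rule['commands']}")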
An Automated Approach to Generating Efficient Constraint Solvers
Dharini Balasubramaniam, Christopher Jefferson, Lars Kotthoff, Ian Miguel, and Peter Nightingale
(University of St. Andrews, UK)
Combinatorial problems appear in numerous settings, from timetabling to industrial design. Constraint solving aims to find solutions to such problems efficiently and automatically. Current constraint solvers are monolithic in design, accepting a broad range of problems. The cost of this convenience is a complex architecture, inhibiting efficiency, extensibility and scalability. Solver components are also tightly coupled with complex restrictions on their configuration, making automated generation of solvers difficult.
We describe a novel, automated, model-driven approach to generating efficient solvers tailored to individual problems and present some results from applying the approach. The main contribution of this work is a solver generation framework called Dominion, which analyses a problem and, based on its characteristics, generates a solver using components chosen from a library. The key benefit of this approach is the ability to solve larger and more difficult problems as a result of applying finer-grained optimisations and using specialised techniques as required.
@InProceedings{ICSE12p661,
author = {Dharini Balasubramaniam and Christopher Jefferson and Lars Kotthoff and Ian Miguel and Peter Nightingale},
title = {An Automated Approach to Generating Efficient Constraint Solvers},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {661--671},
doi = {},
year = {2012},
}
Simulation-Based Abstractions for Software Product-Line Model Checking
Maxime Cordy, Andreas Classen, Gilles Perrouin, Pierre-Yves Schobbens, Patrick Heymans, and Axel Legay
(University of Namur, Belgium; INRIA, France; LIFL–CNRS, France; IRISA, France; Aalborg University, Denmark; University of Liège, Belgium)
Software Product Line (SPL) engineering is a software engineering paradigm that exploits the commonality between similar software products to reduce life cycle costs and time-to-market. Many SPLs are critical and would benefit from efficient verification through model checking. Model checking SPLs is more difficult than for single systems, since the number of different products is potentially huge. In previous work, we introduced Featured Transition Systems (FTS), a formal, compact representation of SPL behaviour, and provided efficient algorithms to verify FTS. Yet, we still face the state explosion problem, like any model checking-based verification. Model abstraction is the most relevant answer to state explosion. In this paper, we define a novel simulation relation for FTS and provide an algorithm to compute it. We extend well-known simulation preservation properties to FTS and thus lay the theoretical foundations for abstraction-based model checking of SPLs. We evaluate our approach by comparing the cost of FTS-based simulation and abstraction with respect to product-by-product methods. Our results show that FTS are a solid foundation for simulation-based model checking of SPL.
@InProceedings{ICSE12p672,
author = {Maxime Cordy and Andreas Classen and Gilles Perrouin and Pierre-Yves Schobbens and Patrick Heymans and Axel Legay},
title = {Simulation-Based Abstractions for Software Product-Line Model Checking},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {672--682},
doi = {},
year = {2012},
}
Invariant Generation
Fri, Jun 8, 08:45 - 10:15
Using Dynamic Analysis to Discover Polynomial and Array Invariants
ThanhVu Nguyen, Deepak Kapur, Westley Weimer, and Stephanie Forrest
(University of New Mexico, USA; University of Virginia, USA)
Dynamic invariant analysis identifies likely properties over variables from observed program traces. These properties can aid programmers in refactoring, documenting, and debugging tasks by making dynamic patterns visible statically. Two useful forms of invariants involve relations among polynomials over program variables and relations among array variables. Current dynamic analysis methods support such invariants in only very limited forms. We combine mathematical techniques that have not previously been applied to this problem, namely equation solving, polyhedra construction, and SMT solving, to bring new capabilities to dynamic invariant detection. Using these methods, we show how to find equalities and inequalities among nonlinear polynomials over program variables, and linear relations among array variables of multiple dimensions. Preliminary experiments on 24 mathematical algorithms and an implementation of AES encryption provide evidence that the approach is effective at finding these invariants.
@InProceedings{ICSE12p683,
author = {ThanhVu Nguyen and Deepak Kapur and Westley Weimer and Stephanie Forrest},
title = {Using Dynamic Analysis to Discover Polynomial and Array Invariants},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {683--693},
doi = {},
year = {2012},
}
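The equation-solving ingredient of the approach can be illustrated as follows: build a matrix of candidate monomial values observed at a program point across traces, and read candidate equality invariants off its null space. The traced program point and terms are invented; the paper's method also generates inequalities (via polyhedra construction) and array relations, which this sketch omits.

# Sketch: discover a polynomial equality invariant from trace data.
import numpy as np

# Traces of (x, y) at a program point where, unknown to us, y = x*x holds.
xs = np.array([1, 2, 3, 5, 8])
ys = xs * xs

# Columns are candidate terms 1, x, y, x^2; coefficients c with A @ c = 0
# correspond to invariants  c0 + c1*x + c2*y + c3*x^2 = 0.
A = np.column_stack([np.ones_like(xs), xs, ys, xs * xs]).astype(float)
_, s, vt = np.linalg.svd(A)
null_space = vt[len(s[s > 1e-9]):]   # rows spanning the null space of A

print(null_space.round(3))
# A basis vector proportional to [0, 0, 1, -1]: y - x^2 = 0, i.e. y = x*x.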
Metadata Invariants: Checking and Inferring Metadata Coding Conventions
Myoungkyu Song and Eli Tilevich
(Virginia Tech, USA)
As the prevailing programming model of enterprise applications is becoming more declarative, programmers are spending an increasing amount of their time and efforts writing and maintaining metadata, such as XML or annotations. Although metadata is a cornerstone of modern software, automatic bug finding tools cannot ensure that metadata maintains its correctness during refactoring and enhancement. To address this shortcoming, this paper presents metadata invariants, a new abstraction that codifies various naming and typing relationships between metadata and the main source code of a program. We reify this abstraction as a domain-specific language. We also introduce algorithms to infer likely metadata invariants and to apply them to check metadata correctness in the presence of program evolution. We demonstrate how metadata invariant checking can help ensure that metadata remains consistent and correct during program evolution; it finds metadata-related inconsistencies and recommends how they should be corrected. Similar to static bug finding tools, a metadata invariant checker identifies metadata-related bugs as a program is being refactored and enhanced. Because metadata is omnipresent in modern software applications, our approach can help ensure the overall consistency and correctness of software as it evolves.
@InProceedings{ICSE12p694,
author = {Myoungkyu Song and Eli Tilevich},
title = {Metadata Invariants: Checking and Inferring Metadata Coding Conventions},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {694--704},
doi = {},
year = {2012},
}
Generating Obstacle Conditions for Requirements Completeness
Dalal Alrajeh, Jeff Kramer, Axel van Lamsweerde, Alessandra Russo, and Sebastián Uchitel
(Imperial College London, UK; Université Catholique de Louvain, Belgium)
Missing requirements are known to be among the major causes of software failure. They often result from a natural inclination to conceive over-ideal systems where the software-to-be and its environment always behave as expected. Obstacle analysis is a goal-anchored form of risk analysis whereby exceptional conditions that may obstruct system goals are identified, assessed and resolved to produce complete requirements. Various techniques have been proposed for identifying obstacle conditions systematically. Among these, the formal ones have limited applicability or are costly to automate. This paper describes a tool-supported technique for generating a set of obstacle conditions guaranteed to be complete and consistent with respect to the known domain properties. The approach relies on a novel combination of model checking and learning technologies. Obstacles are iteratively learned from counterexample and witness traces produced by model checking against a goal and converted into positive and negative examples, respectively. A comparative evaluation is provided with respect to published results on the manual derivation of obstacles in a real safety-critical system for which failures have been reported.
@InProceedings{ICSE12p705,
author = {Dalal Alrajeh and Jeff Kramer and Axel van Lamsweerde and Alessandra Russo and Sebastián Uchitel},
title = {Generating Obstacle Conditions for Requirements Completeness},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {705--715},
doi = {},
year = {2012},
}
Regression Testing
Fri, Jun 8, 08:45 - 10:15
make test-zesti: A Symbolic Execution Solution for Improving Regression Testing
Paul Dan Marinescu and Cristian Cadar
(Imperial College London, UK)
Software testing is an expensive and time-consuming process, often involving the manual creation of comprehensive regression test suites. However, current testing methodologies do not take full advantage of these tests. In this paper, we present a technique for amplifying the effect of existing test suites using a lightweight symbolic execution mechanism, which thoroughly checks all sensitive operations (e.g., pointer dereferences) executed by the test suite for errors, and explores additional paths around sensitive operations. We implemented this technique in a prototype system called ZESTI (Zero-Effort Symbolic Test Improvement), and applied it to three open-source code bases—GNU Coreutils, libdwarf and readelf—where it found 52 previously unknown bugs, many of which are out of reach of standard symbolic execution. Our technique works transparently to the tester, requiring no additional human effort or changes to source code or tests.
@InProceedings{ICSE12p716,
author = {Paul Dan Marinescu and Cristian Cadar},
title = {make test-zesti: A Symbolic Execution Solution for Improving Regression Testing},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {716--726},
doi = {},
year = {2012},
}
BALLERINA: Automatic Generation and Clustering of Efficient Random Unit Tests for Multithreaded Code
Adrian Nistor, Qingzhou Luo, Michael Pradel, Thomas R. Gross, and Darko Marinov
(University of Illinois at Urbana-Champaign, USA; ETH Zurich, Switzerland)
Testing multithreaded code is hard and expensive. Each multithreaded unit test creates two or more threads, each executing one or more methods on shared objects of the class under test. Such unit tests can be generated at random, but basic generation produces tests that are either slow or do not trigger concurrency bugs. Worse, such tests have many false alarms, which require human effort to filter out. We present BALLERINA, a novel technique for automatic generation of efficient multithreaded random tests that effectively trigger concurrency bugs. BALLERINA makes tests efficient by having only two threads, each executing a single, randomly selected method. BALLERINA increases chances that such a simple parallel code finds bugs by appending it to more complex, randomly generated sequential code. We also propose a clustering technique to reduce the manual effort in inspecting failures of automatically generated multithreaded tests. We evaluate BALLERINA on 14 real-world bugs from 6 popular codebases: Groovy, Java JDK, jFreeChart, Log4j, Lucene, and Pool. The experiments show that tests generated by BALLERINA can find bugs on average 2X-10X faster than various configurations of basic random generation, and our clustering technique reduces the number of inspected failures on average 4X-8X. Using BALLERINA, we found three previously unknown bugs in Apache Pool and Log4j, one of which was already confirmed and fixed.
@InProceedings{ICSE12p727,
author = {Adrian Nistor and Qingzhou Luo and Michael Pradel and Thomas R. Gross and Darko Marinov},
title = {BALLERINA: Automatic Generation and Clustering of Efficient Random Unit Tests for Multithreaded Code},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {727--737},
doi = {},
year = {2012},
}
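The test shape BALLERINA generates, a random sequential prefix that builds object state followed by exactly two threads each invoking one randomly chosen method, can be sketched as below. The deliberately racy counter class is invented; BALLERINA targets real Java classes and additionally clusters the resulting failures.

# Sketch: a BALLERINA-style two-thread random test.
import random
import threading

class Counter:                      # hypothetical class under test
    def __init__(self):
        self.value = 0
    def increment(self):
        v = self.value              # read-modify-write: not atomic
        self.value = v + 1
    def reset(self):
        self.value = 0

def random_test(seed):
    rng = random.Random(seed)
    obj = Counter()
    for _ in range(rng.randint(0, 3)):                 # sequential prefix
        rng.choice([obj.increment, obj.reset])()
    calls = [rng.choice([obj.increment, obj.reset]) for _ in range(2)]
    threads = [threading.Thread(target=c) for c in calls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return obj, calls

obj, calls = random_test(seed=42)
print([c.__name__ for c in calls], obj.value)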
On-Demand Test Suite Reduction
Dan Hao, Lu Zhang, Xingxia Wu, Hong Mei, and Gregg Rothermel
(Peking University, China; Key Laboratory of High Confidence Software Technologies, China; University of Nebraska, USA)
Most test suite reduction techniques aim to select, from a given test suite, a minimal representative subset of test cases that retains the same code coverage as the suite. Empirical studies have shown, however, that test suites reduced in this manner may lose fault detection capability. Techniques have been proposed to retain certain redundant test cases in the reduced test suite so as to reduce the loss in fault-detection capability, but these still do concede some degree of loss. Thus, these techniques may be applicable only in cases where loose demands are placed on the upper limit of loss in fault-detection capability. In this work we present an on-demand test suite reduction approach, which attempts to select a representative subset satisfying the same test requirements as an initial test suite conceding at most l% loss in fault-detection capability for at least c% of the instances in which it is applied. Our technique collects statistics about loss in fault-detection capability at the level of individual statements and models the problem of test suite reduction as an integer linear programming problem. We have evaluated our approach in the contexts of three scenarios in which it might be used. Our results show that most test suites reduced by our approach satisfy given fault detection capability demands, and that the approach compares favorably with an existing test suite reduction approach.
@InProceedings{ICSE12p738,
author = {Dan Hao and Lu Zhang and Xingxia Wu and Hong Mei and Gregg Rothermel},
title = {On-Demand Test Suite Reduction},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {738--748},
doi = {},
year = {2012},
}
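Classic coverage-preserving reduction, the starting point the paper improves on, is a set-cover problem. The greedy heuristic below preserves coverage only; the paper goes further and encodes the fault-detection-loss demand (at most l% loss in at least c% of applications) as an integer linear program, which this simplification does not model.

# Sketch: greedy coverage-preserving test suite reduction.
def reduce_suite(coverage):
    """coverage: test name -> set of covered requirements (e.g., statements)."""
    required = set().union(*coverage.values())
    selected, covered = [], set()
    while covered != required:
        # pick the test adding the most uncovered requirements
        best = max(coverage, key=lambda t: len(coverage[t] - covered))
        selected.append(best)
        covered |= coverage[best]
    return selected

suite = {
    "t1": {1, 2, 3},
    "t2": {3, 4},
    "t3": {1, 2},    # redundant: t1 covers it
    "t4": {4, 5},
}
print(reduce_suite(suite))   # ['t1', 't4']; t2 and t3 are redundant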
Software Vulnerability
Fri, Jun 8, 08:45 - 10:15
Automated Detection of Client-State Manipulation Vulnerabilities
Anders Møller and Mathias Schwarz
(Aarhus University, Denmark)
Web application programmers must be aware of a wide range of potential security risks. Although the most common pitfalls are well described and categorized in the literature, it remains a challenging task to ensure that all guidelines are followed. For this reason, it is desirable to construct automated tools that can assist the programmers in the application development process by detecting weaknesses. Many vulnerabilities are related to web application code that stores references to application state in the generated HTML documents to work around the statelessness of the HTTP protocol. In this paper, we show that such client-state manipulation vulnerabilities are amenable to tool supported detection. We present a static analysis for the widely used frameworks Java Servlets, JSP, and Struts. Given a web application archive as input, the analysis identifies occurrences of client state and infers the information flow between the client state and the shared application state on the server. This makes it possible to check how client-state manipulation performed by malicious users may affect the shared application state and cause leakage or modifications of sensitive information. The warnings produced by the tool help the application programmer identify vulnerabilities. Moreover, the inferred information can be applied to configure a security filter that automatically guards against attacks. Experiments on a collection of open source web applications indicate that the static analysis is able to effectively help the programmer prevent client-state manipulation vulnerabilities.
@InProceedings{ICSE12p749,
author = {Anders Møller and Mathias Schwarz},
title = {Automated Detection of Client-State Manipulation Vulnerabilities},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {749--759},
doi = {},
year = {2012},
}
Understanding Integer Overflow in C/C++
Will Dietz, Peng Li, John Regehr, and Vikram Adve
(University of Illinois at Urbana-Champaign, USA; University of Utah, USA)
Integer overflow bugs in C and C++ programs are difficult to track down and may lead to fatal errors or exploitable vulnerabilities. Although a number of tools for finding these bugs exist, the situation is complicated because not all overflows are bugs. Better tools need to be constructed---but a thorough understanding of the issues behind these errors does not yet exist. We developed IOC, a dynamic checking tool for integer overflows, and used it to conduct the first detailed empirical study of the prevalence and patterns of occurrence of integer overflows in C and C++ code. Our results show that intentional uses of wraparound behaviors are more common than is widely believed; for example, there are over 200 distinct locations in the SPEC CINT2000 benchmarks where overflow occurs. Although many overflows are intentional, a large number of accidental overflows also occur. Orthogonal to programmers' intent, overflows are found in both well-defined and undefined flavors. Applications executing undefined operations can be, and have been, broken by improvements in compiler optimizations. Looking beyond SPEC, we found and reported undefined integer overflows in SQLite, PostgreSQL, SafeInt, GNU MPC and GMP, Firefox, GCC, LLVM, Python, BIND, and OpenSSL; many of these have since been fixed. Our results show that integer overflow issues in C and C++ are subtle and complex, that they are common even in mature, widely used programs, and that they are widely misunderstood by developers.
@InProceedings{ICSE12p760,
author = {Will Dietz and Peng Li and John Regehr and Vikram Adve},
title = {Understanding Integer Overflow in C/C++},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {760--770},
doi = {},
year = {2012},
}
A Large Scale Exploratory Analysis of Software Vulnerability Life Cycles
Muhammad Shahzad, Muhammad Zubair Shafiq, and Alex X. Liu
(Michigan State University, USA)
Software systems inherently contain vulnerabilities, and their exploitation in the past has resulted in significant revenue losses. The study of vulnerability life cycles can help in the development, deployment, and maintenance of software systems. It can also help in designing future security policies and conducting audits of past incidents. Furthermore, such an analysis can help customers assess the security risks associated with software products of different vendors.
In this paper, we conduct an exploratory measurement study of a large software vulnerability data set containing 46,310 vulnerabilities disclosed between 1988 and 2011. We investigate vulnerabilities along the following seven dimensions: (1) phases in the life cycle of vulnerabilities, (2) evolution of vulnerabilities over the years, (3) functionality of vulnerabilities, (4) access requirement for exploitation of vulnerabilities, (5) risk level of vulnerabilities, (6) software vendors, and (7) software products. Our exploratory analysis uncovers several statistically significant findings that have important implications for software development and deployment.
@InProceedings{ICSE12p771,
author = {Muhammad Shahzad and Muhammad Zubair Shafiq and Alex X. Liu},
title = {A Large Scale Exploratory Analysis of Software Vulnerability Life Cycles},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {771--781},
doi = {},
year = {2012},
}
API Learning
Fri, Jun 8, 10:45 - 12:45
Synthesizing API Usage Examples
Raymond P. L. Buse and Westley Weimer
(University of Virginia, USA)
Key program interfaces are sometimes documented with usage examples: concrete code snippets that characterize common use cases for a particular data type. While such documentation is known to be of great utility, it is burdensome to create and can be incomplete, out of date, or not representative of actual practice. We present an automatic technique for mining and synthesizing succinct and representative human-readable documentation of program interfaces. Our algorithm is based on a combination of path-sensitive dataflow analysis, clustering, and pattern abstraction. It produces output in the form of well-typed program snippets which document initialization, method calls, assignments, looping constructs, and exception handling. In a human study involving over 150 participants, 82% of our generated examples were found to be at least as good as human-written instances and 94% were strictly preferred to state-of-the-art code search.
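As a rough illustration of the pattern-abstraction step (a simplification under assumed conventions, not the paper's path-sensitive algorithm), concrete snippets can be canonicalized and the most frequent abstraction chosen as a representative example:

# Abstract concrete snippets by canonicalizing variable names, then count
# identical abstractions so the most common usage can serve as documentation.

import re
from collections import Counter

def abstract(snippet):
    # Replace identifiers bound by assignment with numbered placeholders.
    names = re.findall(r"^(\w+)\s*=", snippet, flags=re.M)
    for i, name in enumerate(names):
        snippet = re.sub(rf"\b{name}\b", f"v{i}", snippet)
    return snippet

corpus = [
    "f = open('a.txt')\ndata = f.read()\nf.close()",
    "g = open('b.txt')\ntext = g.read()\ng.close()",
    "h = open('c.txt')\nh.write('x')\nh.close()",
]

patterns = Counter(abstract(s) for s in corpus)
print(patterns.most_common(1)[0])  # the read-then-close pattern appears twice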
@InProceedings{ICSE12p782,
author = {Raymond P. L. Buse and Westley Weimer},
title = {Synthesizing API Usage Examples},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {782--792},
doi = {},
year = {2012},
}
Semi-automatically Extracting FAQs to Improve Accessibility of Software Development Knowledge
Stefan Henß, Martin Monperrus, and Mira Mezini
(TU Darmstadt, Germany; University of Lille, France; INRIA, France)
@InProceedings{ICSE12p793,
author = {Stefan Henß and Martin Monperrus and Mira Mezini},
title = {Semi-automatically Extracting FAQs to Improve Accessibility of Software Development Knowledge},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {793--803},
doi = {},
year = {2012},
}
Temporal Analysis of API Usage Concepts
Gias Uddin, Barthélémy Dagenais, and Martin P. Robillard
(McGill University, Canada)
Software reuse through Application Programming Interfaces (APIs) is an integral part of software development. The functionality offered by an API is not always accessed uniformly throughout the lifetime of a client program. We propose Temporal API Usage Pattern Mining to detect API usage patterns in terms of their time of introduction into client programs. We detect concepts as distinct groups of API functionality from the change history of a client program. We locate those concepts in the client change history and detect temporal usage patterns, where a pattern contains a set of concepts that were added into the client program in a specific temporal order. We investigated the properties of temporal API usage patterns through a multiple-case study of three APIs and their use in up to 19 client software projects. Our technique was able to detect a number of valuable patterns in two of the three APIs investigated. Further investigation showed some patterns to be relatively consistent between clients, produced by multiple developers, and not trivially derivable from program structure or API documentation.
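The essence of a temporal usage pattern can be sketched as follows (hypothetical data and a deliberately simplified mining step):

# Given the revision at which each API concept first appears in each client,
# keep the pairwise orderings that hold consistently across all clients.

from itertools import combinations

clients = [  # hypothetical: concept -> first revision it appears in
    {"init": 1, "config": 3, "query": 7},
    {"init": 2, "config": 2, "query": 9},
    {"init": 1, "config": 5, "query": 6},
]

concepts = clients[0].keys()
patterns = [
    (a, b) for a, b in combinations(concepts, 2)
    if all(c[a] < c[b] for c in clients)
]
print(patterns)  # [('init', 'query'), ('config', 'query')]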
@InProceedings{ICSE12p804,
author = {Gias Uddin and Barthélémy Dagenais and Martin P. Robillard},
title = {Temporal Analysis of API Usage Concepts},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {804--814},
doi = {},
year = {2012},
}
Inferring Method Specifications from Natural Language API Descriptions
Rahul Pandita, Xusheng Xiao, Hao Zhong, Tao Xie, Stephen Oney, and Amit Paradkar
(North Carolina State University, USA; Chinese Academy of Sciences, China; CMU, USA; IBM Research, USA)
Application Programming Interface (API) documents are a typical way of describing legal usage of reusable software libraries, thus facilitating software reuse. However, even with such documents, developers often overlook some documents and build software systems that are inconsistent with the legal usage of those libraries. Existing software verification tools require formal specifications (such as code contracts), and therefore cannot directly verify the legal usage described in natural language text of API documents against the code using that library. However, in practice, most libraries do not come with formal specifications, thus hindering tool-based verification. To address this issue, we propose a novel approach to infer formal specifications from natural language text of API documents. Our evaluation results show that our approach achieves an average of 92% precision and 93% recall in identifying sentences that describe code contracts from more than 2500 sentences of API documents. Furthermore, our results show that our approach has an average 83% accuracy in inferring specifications from over 1600 sentences describing code contracts.
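The flavor of the inference, reduced to shallow pattern matching (the paper's approach uses richer natural-language analysis), might look like this:

# Map contract-describing sentences to formal checks via simple rules.

import re

RULES = [
    (re.compile(r"(\w+) must not be null", re.I),
     lambda m: f"requires: {m.group(1)} != null"),
    (re.compile(r"returns null if (\w+) is empty", re.I),
     lambda m: f"ensures: {m.group(1)}.isEmpty() ==> result == null"),
]

def infer(sentence):
    for pattern, build in RULES:
        m = pattern.search(sentence)
        if m:
            return build(m)
    return None  # sentence does not describe a recognized contract

print(infer("The parameter name must not be null."))
print(infer("Returns null if input is empty."))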
@InProceedings{ICSE12p815,
author = {Rahul Pandita and Xusheng Xiao and Hao Zhong and Tao Xie and Stephen Oney and Amit Paradkar},
title = {Inferring Method Specifications from Natural Language API Descriptions},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {815--825},
doi = {},
year = {2012},
}
Code Recommenders
Fri, Jun 8, 10:45 - 12:45
Automatic Parameter Recommendation for Practical API Usage
Cheng Zhang, Juyuan Yang, Yi Zhang, Jing Fan, Xin Zhang, Jianjun Zhao, and Peizhao Ou
(Shanghai Jiao Tong University, China)
Programmers extensively use application programming interfaces (APIs) to leverage existing libraries and frameworks. However, correctly and efficiently choosing and using APIs from unfamiliar libraries and frameworks is still a non-trivial task. Programmers often need to ruminate on API documentation (which is often incomplete) or inspect code examples (which are often absent) to learn API usage patterns. Recently, various techniques have been proposed to alleviate this problem by creating API summarizations, mining code examples, or showing common API call sequences. However, few techniques focus on recommending API parameters.
In this paper, we propose an automated technique, called Precise, to address this problem. Differing from common code completion systems, Precise mines existing code bases, uses an abstract usage instance representation for each API usage example, and then builds a parameter usage database. Upon a request, Precise queries the database for abstract usage instances in similar contexts and generates parameter candidates by concretizing the instances adaptively.
The experimental results show that our technique is more general and applicable than existing code completion systems: specifically, 64% of the parameter recommendations are useful and 53% of the recommendations are exactly the same as the actual parameters needed. We have also performed a user study showing that our technique is useful in practice.
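A bare-bones sketch of the underlying recommendation idea (an illustrative simplification with hypothetical data, not the Precise implementation):

# Index observed parameter usages by an abstract context, then recommend the
# most frequent candidates for a request arising in a similar context.

from collections import Counter, defaultdict

usages = [  # hypothetical mined data: (context, API call) -> actual argument
    (("handler", "Timer.schedule"), "DELAY_MS"),
    (("handler", "Timer.schedule"), "DELAY_MS"),
    (("test", "Timer.schedule"), "0"),
]

db = defaultdict(Counter)
for context, arg in usages:
    db[context][arg] += 1

def recommend(context, k=3):
    # rank candidate parameters seen in the same abstract context
    return [arg for arg, _ in db[context].most_common(k)]

print(recommend(("handler", "Timer.schedule")))  # ['DELAY_MS']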
@InProceedings{ICSE12p826,
author = {Cheng Zhang and Juyuan Yang and Yi Zhang and Jing Fan and Xin Zhang and Jianjun Zhao and Peizhao Ou},
title = {Automatic Parameter Recommendation for Practical API Usage},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {826--836},
doi = {},
year = {2012},
}
On the Naturalness of Software
Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu
(UC Davis, USA; University of Texas at Dallas, USA)
Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations---and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's completion capability. We conclude the paper by laying out a vision for future research in this area.
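The paper's core tool is standard: an n-gram language model over token sequences. A minimal bigram sketch shows why repetitive code is predictable (toy corpus; the paper trains on large real code bases):

from collections import Counter, defaultdict

corpus = [
    "for ( int i = 0 ; i < n ; i ++ )".split(),
    "for ( int j = 0 ; j < m ; j ++ )".split(),
]

# count how often each token follows each other token
bigrams = defaultdict(Counter)
for tokens in corpus:
    for prev, nxt in zip(tokens, tokens[1:]):
        bigrams[prev][nxt] += 1

def suggest(prev_token, k=2):
    # most likely continuations given the previous token
    return [t for t, _ in bigrams[prev_token].most_common(k)]

print(suggest("int"))  # ['i', 'j'] -- code is repetitive enough to predict
print(suggest("("))    # ['int']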
@InProceedings{ICSE12p837,
author = {Abram Hindle and Earl T. Barr and Zhendong Su and Mark Gabel and Premkumar Devanbu},
title = {On the Naturalness of Software},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {837--847},
doi = {},
year = {2012},
}
Recommending Source Code for Use in Rapid Software Prototypes
Collin McMillan, Negar Hariri, Denys Poshyvanyk, Jane Cleland-Huang, and Bamshad Mobasher
(College of William and Mary, USA; DePaul University, USA)
Rapid prototypes are often developed early in the software development process in order to help project stakeholders explore ideas for possible features, and to discover, analyze, and specify requirements for the project. As prototypes are typically thrown away following the initial analysis phase, it is imperative for them to be created quickly with little cost and effort. Tool support for finding and reusing components from open-source repositories offers a major opportunity to reduce this manual effort. In this paper, we present a system for rapid prototyping that facilitates software reuse by mining feature descriptions and source code from open-source repositories. Our system identifies and recommends features and associated source code modules that are relevant to the software product under development. The modules are selected such that they implement as many of the desired features as possible while exhibiting the lowest possible levels of external coupling. We conducted a user study to evaluate our approach; the results indicated that it returned packages that implemented more features and were considered more relevant than those returned by the state-of-the-art approach.
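The selection objective can be approximated greedily; the sketch below (hypothetical modules and scores, not the paper's optimization) prefers modules that cover many desired features per unit of external coupling:

modules = {  # hypothetical mined modules: features implemented, coupling score
    "auth_lib": ({"login", "signup"}, 2),
    "mega_app": ({"login", "signup", "chat"}, 9),
    "chat_lib": ({"chat"}, 1),
}

def select(desired):
    chosen, covered = [], set()
    while covered < desired:
        # benefit: newly covered features per unit of coupling
        name, (feats, coupling) = max(
            modules.items(),
            key=lambda kv: len(kv[1][0] - covered) / (1 + kv[1][1]),
        )
        if not feats - covered:
            break  # no module adds anything new
        chosen.append(name)
        covered |= feats
    return chosen

print(select({"login", "signup", "chat"}))  # ['auth_lib', 'chat_lib']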
@InProceedings{ICSE12p848,
author = {Collin McMillan and Negar Hariri and Denys Poshyvanyk and Jane Cleland-Huang and Bamshad Mobasher},
title = {Recommending Source Code for Use in Rapid Software Prototypes},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {848--858},
doi = {},
year = {2012},
}
Active Code Completion
Cyrus Omar, YoungSeok Yoon, Thomas D. LaToza, and Brad A. Myers
(CMU, USA)
Code completion menus have replaced standalone API browsers for most developers because they are more tightly integrated into the development workflow. Refinements to the code completion menu that incorporate additional sources of information have similarly been shown to be valuable, even relative to standalone counterparts offering similar functionality. In this paper, we describe active code completion, an architecture that allows library developers to introduce interactive and highly-specialized code generation interfaces, called palettes, directly into the editor. Using several empirical methods, we examine the contexts in which such a system could be useful, describe the design constraints governing the system architecture as well as particular code completion interfaces, and design one such system, named Graphite, for the Eclipse Java development environment. Using Graphite, we implement a palette for writing regular expressions as our primary example and conduct a small pilot study. In addition to showing the feasibility of this approach, it provides further evidence in support of the claim that integrating specialized code completion interfaces directly into the editor is valuable to professional developers.
@InProceedings{ICSE12p859,
author = {Cyrus Omar and YoungSeok Yoon and Thomas D. LaToza and Brad A. Myers},
title = {Active Code Completion},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {859--869},
doi = {},
year = {2012},
}
Test Automation
Fri, Jun 8, 10:45 - 12:45
Automated Oracle Creation Support, or: How I Learned to Stop Worrying about Fault Propagation and Love Mutation Testing
Matt Staats, Gregory Gay, and Mats P. E. Heimdahl
(KAIST, South Korea; University of Minnesota, USA)
In testing, the test oracle is the artifact that determines whether an application under test executes correctly. The choice of test oracle can significantly impact the effectiveness of the testing process. However, despite the prevalence of tools that support the selection of test inputs, little work exists for supporting oracle creation.
In this work, we propose a method of supporting test oracle creation. This method automatically selects the oracle data (the set of variables monitored during testing) for expected-value test oracles. The approach uses mutation analysis to rank variables in terms of fault-finding effectiveness, thus automating the selection of the oracle data. Experiments over four industrial examples demonstrate that our method may be a cost-effective approach for producing small, effective oracle data, with observed fault-finding improvements of up to 145.8% over current industrial best practice.
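The selection step can be sketched as a greedy ranking over a mutant-detection matrix (made-up data; the paper ranks variables via mutation analysis over industrial examples):

# hypothetical detection matrix: variable -> set of mutant ids it kills
kills = {
    "out_pressure": {1, 2, 5, 7},
    "alarm_state":  {2, 3, 7},
    "temp_buffer":  {4},
}

def pick_oracle_data(budget):
    # greedy: repeatedly take the variable killing the most unkilled mutants
    chosen, killed = [], set()
    for _ in range(budget):
        var = max(kills, key=lambda v: len(kills[v] - killed))
        if not kills[var] - killed:
            break
        chosen.append(var)
        killed |= kills[var]
    return chosen, killed

print(pick_oracle_data(2))  # (['out_pressure', 'alarm_state'], {1, 2, 3, 5, 7})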
@InProceedings{ICSE12p870,
author = {Matt Staats and Gregory Gay and Mats P. E. Heimdahl},
title = {Automated Oracle Creation Support, or: How I Learned to Stop Worrying about Fault Propagation and Love Mutation Testing},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {870--880},
doi = {},
year = {2012},
}
Automating Test Automation
Suresh Thummalapenta, Saurabh Sinha, Nimit Singhania, and Satish Chandra
(IBM Research, India; IBM Research, USA)
Mention a test case, and it conjures up the image of a script or a program that exercises a system under test. In industrial practice, however, test cases often start out as steps described in natural language. These are essentially directions a human tester needs to follow to interact with an application, exercising a given scenario. Since tests need to be executed repeatedly, such manual tests then have to go through test automation to create scripts or programs out of them. Test automation can be expensive in programmer time. We describe a technique to automate test automation. The input to our technique is a sequence of steps written in natural language, and the output is a sequence of procedure calls with accompanying parameters that can drive the application without human intervention. The technique is based on looking at the natural language test steps as consisting of segments that describe actions on targets, except that there can be ambiguity in the action itself, in the order in which segments occur, and in the specification of the target of the action. The technique resolves this ambiguity by backtracking until it can synthesize a successful sequence of calls. We present an evaluation of our technique on professionally created manual test cases for two open-source web applications as well as a proprietary enterprise application. Our technique could automate over 82% of the steps contained in these test cases with no human intervention, indicating that the technique can reduce the cost of test automation quite effectively.
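One way to picture the backtracking search (an assumed, stripped-down rendering, not IBM's system):

def interpret(steps, state, candidates, plan=()):
    """Return a sequence of calls that executes successfully, or None."""
    if not steps:
        return plan
    for call in candidates(steps[0], state):
        new_state = call(state)   # try one interpretation of the step
        if new_state is not None: # the application accepted it
            result = interpret(steps[1:], new_state, candidates,
                               plan + (call.__name__,))
            if result is not None:
                return result     # success downstream
        # otherwise backtrack and try the next candidate
    return None

# hypothetical candidate actions for the ambiguous step "enter name"
def type_into_name_field(state):
    return state | {"name_filled"} if "form_open" in state else None

def click_name_menu(state):
    return None  # always rejected by the application

def candidates(step, state):
    return [click_name_menu, type_into_name_field]  # ambiguous ordering

print(interpret(["enter name"], {"form_open"}, candidates))
# ('type_into_name_field',) -- found after backtracking past the bad candidate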
@InProceedings{ICSE12p881,
author = {Suresh Thummalapenta and Saurabh Sinha and Nimit Singhania and Satish Chandra},
title = {Automating Test Automation},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {881--891},
doi = {},
year = {2012},
}
Stride: Search-Based Deterministic Replay in Polynomial Time via Bounded Linkage
Jinguo Zhou, Xiao Xiao, and Charles Zhang
(Hong Kong University of Science and Technology, China)
Deterministic replay remains one of the most effective ways to comprehend concurrent bugs. Existing approaches either maintain the exact shared read-write linkages with a large runtime overhead or use exponential off-line algorithms to search for a feasible interleaved execution. In this paper, we propose Stride, a hybrid solution that records the bounded shared memory access linkages at runtime and infers an equivalent interleaving in polynomial time, under the sequential consistency assumption. The recording scheme eliminates the need for synchronizing the shared read operations, which results in a significant overhead reduction. Compared to the previous state-of-the-art approach to deterministic replay, Stride reduces runtime overhead by a factor of 2.5 on average and produces logs that are, on average, 3.88 times smaller.
@InProceedings{ICSE12p892,
author = {Jinguo Zhou and Xiao Xiao and Charles Zhang},
title = {Stride: Search-Based Deterministic Replay in Polynomial Time via Bounded Linkage},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {892--902},
doi = {},
year = {2012},
}
iTree: Efficiently Discovering High-Coverage Configurations Using Interaction Trees
Charles Song, Adam Porter, and Jeffrey S. Foster
(University of Maryland, USA)
Software configurability has many benefits, but it also makes programs much harder to test, as in the worst case the program must be tested under every possible configuration. One potential remedy to this problem is combinatorial interaction testing (CIT), in which typically the developer selects a strength t and then computes a covering array containing all t-way configuration option combinations. However, in a prior study we showed that several programs have important high-strength interactions (combinations of a subset of configuration options) that CIT is highly unlikely to generate in practice. In this paper, we propose a new algorithm called interaction tree discovery (iTree) that aims to identify sets of configurations to test that are smaller than those generated by CIT, while also including important high-strength interactions missed by practical applications of CIT. On each iteration of iTree, we first use low-strength CIT to test the program under a set of configurations, and then apply machine learning techniques to discover new interactions that are potentially responsible for any new coverage seen. By repeating this process, iTree builds up a set of configurations likely to contain key high-strength interactions. We evaluated iTree by comparing the coverage it achieves versus covering arrays and randomly generated configuration sets. Our results strongly suggest that iTree can identify high-coverage sets of configurations more effectively than traditional CIT or random sampling.
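Structurally, the iteration can be sketched as follows (stub stand-ins replace the real coverage measurement and the machine-learned interaction discovery):

def itree(initial_configs, run_and_measure, learn_interactions, rounds=3):
    tested, total_coverage = [], set()
    frontier = list(initial_configs)
    for _ in range(rounds):
        new_coverage = set()
        for config in frontier:
            cov = run_and_measure(config)  # execute tests, collect coverage
            new_coverage |= cov - total_coverage
            tested.append(config)
        total_coverage |= new_coverage
        if not new_coverage:
            break  # converged: no new behavior seen this round
        # propose configurations containing interactions suspected of
        # explaining the newly seen coverage
        frontier = learn_interactions(tested, new_coverage)
    return tested, total_coverage

# toy stand-ins for demonstration only
def run_and_measure(config):
    return {line for line in range(10) if line % (config + 1) == 0}

def learn_interactions(tested, new_cov):
    return [max(tested) + 1]  # pretend the learner proposes one new config

print(itree([0, 1], run_and_measure, learn_interactions))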
@InProceedings{ICSE12p903,
author = {Charles Song and Adam Porter and Jeffrey S. Foster},
title = {iTree: Efficiently Discovering High-Coverage Configurations Using Interaction Trees},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {903--913},
doi = {},
year = {2012},
}
Validation of Specification
Fri, Jun 8, 10:45 - 12:45
Inferring Class Level Specifications for Distributed Systems
Sandeep Kumar, Siau-Cheng Khoo, Abhik Roychoudhury, and David Lo
(National University of Singapore, Singapore; Singapore Management University, Singapore)
Distributed systems often contain many behaviorally similar processes, which are conveniently grouped into classes. In system modeling, it is common to specify such systems by describing the class level behavior, instead of object level behavior. While there have been techniques that mine specifications of such distributed systems from their execution traces, these methods only mine object-level specifications involving concrete process objects. This leads to specifications which are large, hard to comprehend, and sensitive to simple changes in the system (such as the number of objects).
In this paper, we develop a class level specification mining framework for distributed systems. A specification that describes interaction snippets between various processes in a distributed system forms a natural and intuitive way to document their behavior. Our mining method groups together such interactions between behaviorally similar processes, and presents a mined specification involving "symbolic" Message Sequence Charts. Our experiments indicate that our mined symbolic specifications are significantly smaller than mined concrete specifications, while at the same time achieving better precision and recall.
@InProceedings{ICSE12p914,
author = {Sandeep Kumar and Siau-Cheng Khoo and Abhik Roychoudhury and David Lo},
title = {Inferring Class Level Specifications for Distributed Systems},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {914--924},
doi = {},
year = {2012},
}
Statically Checking API Protocol Conformance with Mined Multi-Object Specifications
Michael Pradel, Ciera Jaspan, Jonathan Aldrich, and Thomas R. Gross
(ETH Zurich, Switzerland; CMU, USA)
Programmers using an API often must follow protocols that specify when it is legal to call particular methods. Several techniques have been proposed to find violations of such protocols based on mined specifications. However, existing techniques either focus on single-object protocols or on particular kinds of bugs, such as missing method calls. There is no practical technique to find multi-object protocol bugs without a priori known specifications. In this paper, we combine a dynamic analysis that infers multi-object protocols and a static checker of API usage constraints into a fully automatic protocol conformance checker. The combined system statically detects illegal uses of an API without human-written specifications. Our approach finds 41 bugs and code smells in mature, real-world Java programs with a true positive rate of 51%. Furthermore, we show that the analysis reveals bugs not found by state of the art approaches.
@InProceedings{ICSE12p925,
author = {Michael Pradel and Ciera Jaspan and Jonathan Aldrich and Thomas R. Gross},
title = {Statically Checking API Protocol Conformance with Mined Multi-Object Specifications},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {925--935},
doi = {},
year = {2012},
}
Behavioral Validation of JFSL Specifications through Model Synthesis
Carlo Ghezzi and Andrea Mocci
(Politecnico di Milano, Italy)
Contracts are a popular declarative specification technique to describe the behavior of stateful components in terms of pre/post conditions and invariants. Since each operation is specified separately in terms of an abstract implementation, it may be hard to understand and validate the resulting component behavior from contracts in terms of method interactions. In particular, properties expressed through algebraic axioms, which specify the effect of sequences of operations, require complex theorem proving techniques to be validated.
In this paper, we propose an automatic small-scope-based approach to synthesize incomplete behavioral abstractions for contracts expressed in the JFSL notation. The proposed abstraction technique makes it possible to check that the contract behavior is coherent with behavioral properties expressed as axioms of an algebraic specification. We assess the applicability of our approach by showing how the synthesis methodology can be applied to classes of contract-based artifacts such as specifications of data abstractions and requirements engineering models.
@InProceedings{ICSE12p936,
author = {Carlo Ghezzi and Andrea Mocci},
title = {Behavioral Validation of JFSL Specifications through Model Synthesis},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {936--946},
doi = {},
year = {2012},
}
Verifying Client-Side Input Validation Functions Using String Analysis
Muath Alkhalaf, Tevfik Bultan, and Jose L. Gallegos
(UC Santa Barbara, USA)
Client-side computation in web applications is becoming increasingly common due to the popularity of powerful client-side programming languages such as JavaScript. Client-side computation is commonly used to improve an application’s responsiveness by validating user inputs before they are sent to the server. In this paper, we present an analysis technique for checking if a client-side input validation function conforms to a given policy. In our approach, input validation policies are expressed using two regular expressions, one specifying the maximum policy (the upper bound for the set of inputs that should be allowed) and the other specifying the minimum policy (the lower bound for the set of inputs that should be allowed). Using our analysis we can identify two types of errors: (1) the input validation function accepts an input that is not permitted by the maximum policy, or (2) the input validation function rejects an input that is permitted by the minimum policy. We implemented our analysis using dynamic slicing to automatically extract the input validation functions from web applications and using automata-based string analysis to analyze the extracted functions. Our experiments demonstrate that our approach is effective in finding errors in input validation functions that we collected from real-world applications and from tutorials and books for teaching JavaScript.
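A testing-based approximation of the policy check is easy to sketch (the paper's automata-based string analysis reasons about all inputs, whereas this probe only samples a few):

import re

max_policy = re.compile(r"[0-9]{1,5}")  # upper bound: up to five digits
min_policy = re.compile(r"[0-9]{1,3}")  # lower bound: up to three digits

def validator(s):
    # the (buggy, hypothetical) client-side check under analysis
    return s.isdigit() and len(s) <= 6

for sample in ["42", "123456", ""]:
    accepted = validator(sample)
    if accepted and not max_policy.fullmatch(sample):
        print(f"{sample!r}: accepted but outside the maximum policy")
    if not accepted and min_policy.fullmatch(sample):
        print(f"{sample!r}: rejected although the minimum policy allows it")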
@InProceedings{ICSE12p947,
author = {Muath Alkhalaf and Tevfik Bultan and Jose L. Gallegos},
title = {Verifying Client-Side Input Validation Functions Using String Analysis},
booktitle = {Proc.\ ICSE},
publisher = {IEEE},
pages = {947--957},
doi = {},
year = {2012},
}