Powered by
Conference Publishing Consulting

22nd ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE 2014), November 16–21, 2014, Hong Kong, China

FSE 2014 – Proceedings

Contents - Abstracts - Authors
Online Calendar - iCal File
Twitter: https://twitter.com/FSEconf

Technical Research

Helping and Understanding Developers

Developers’ Code Context Models for Change Tasks
Thomas Fritz, David C. Shepherd, Katja Kevic, Will Snipes, and Christoph Bräunlich
(University of Zurich, Switzerland; ABB Research, USA)
To complete a change task, software developers spend a substantial amount of time navigating code to understand the relevant parts. During this investigation phase, they implicitly build context models of the elements and relations that are relevant to the task. Through an exploratory study with twelve developers completing change tasks in three open source systems, we identified important characteristics of these context models and how they are created. In a second empirical analysis, we further examined our findings on data collected from eighty developers working on a variety of change tasks on open and closed source projects. Our studies uncovered, amongst other results, that code context models are highly connected, structurally and lexically, that developers start tasks using a combination of search and navigation and that code navigation varies substantially across developers. Based on these findings we identify and discuss design requirements to better support developers in the initial creation of code context models. We believe this work represents a substantial step in better understanding developers' code navigation and providing better tool support that will reduce time and effort needed for change tasks.
Publisher's Version Article Search
Software Developers’ Perceptions of Productivity
André N. Meyer, Thomas Fritz, Gail C. Murphy, and Thomas Zimmermann
(University of Zurich, Switzerland; University of British Columbia, Canada; Microsoft Research, USA)
The better the software development community becomes at creating software, the more software the world seems to demand. Although there is a large body of research about measuring and investigating productivity from an organizational point of view, there is a paucity of research about how software developers, those at the front-line of software construction, think about, assess and try to improve their productivity. To investigate software developers' perceptions of software development productivity, we conducted two studies: a survey with 379 professional software developers to help elicit themes and an observational study with 11 professional software developers to investigate emergent themes in more detail. In both studies, we found that developers perceive their days as productive when they complete many or big tasks without significant interruptions or context switches. Yet, the observational data we collected shows our participants performed significant task and activity switching while still feeling productive. We analyze such apparent contradictions in our findings and use the analysis to propose ways to better support software developers in a retrospection and improvement of their productivity through the development of new tools and the sharing of best practices.
Publisher's Version Article Search Info
Enablers, Inhibitors, and Perceptions of Testing in Novice Software Teams
Raphael Pham, Stephan Kiesling, Olga Liskin, Leif Singer, and Kurt Schneider
(Leibniz Universität Hannover, Germany; University of Victoria, Canada)
There are many different approaches to testing software, with different benefits for software quality and the development process. Yet, it is not well understood what developers struggle with when getting started with testing - and why some do not test at all or not as much as would be good for their project. This missing understanding keeps us from improving processes and tools to help novices adopt proper testing practices. We conducted a qualitative study with 97 computer science students. Through interviews, we explored their experiences and attitudes regarding testing in a collaborative software project. We found enabling and inhibiting factors for testing activities, the different testing strategies they used, and novices’ perceptions and attitudes of testing. Students push test automation to the end of the project, thus robbing themselves from the advantages of having a test suite during implementation. Students were not convinced of the return of investment in automated tests and opted for laborious manual tests - which they often regretted in the end. Understanding such challenges and opportunities novices face when confronted with adopting testing can help us improve testing processes, company policies, and tools. Our findings provide recommendations that can enable organizations to facilitate the adoption of testing practices by their members.
Publisher's Version Article Search
Feedback Generation for Performance Problems in Introductory Programming Assignments
Sumit Gulwani, Ivan Radiček, and Florian Zuleger
(Microsoft Research, USA; Vienna University of Technology, Austria)
Providing feedback on programming assignments manually is a tedious, error prone, and time-consuming task. In this paper, we motivate and address the problem of generating feedback on performance aspects in introductory programming assignments. We studied a large number of functionally correct student solutions to introductory programming assignments and observed: (1) There are different algorithmic strategies, with varying levels of efficiency, for solving a given problem. These different strategies merit different feedback. (2) The same algorithmic strategy can be implemented in countless different ways, which are not relevant for reporting feedback on the student program. We propose a light-weight programming language extension that allows a teacher to define an algorithmic strategy by specifying certain key values that should occur during the execution of an implementation. We describe a dynamic analysis based approach to test whether a student's program matches a teacher's specification. Our experimental results illustrate the effectiveness of both our specification language and our dynamic analysis. On one of our benchmarks consisting of 2316 functionally correct implementations to 3 programming problems, we identified 16 strategies that we were able to describe using our specification language (in 95 minutes after inspecting 66, i.e., around 3%, implementations). Our dynamic analysis correctly matched each implementation with its corresponding specification, thereby automatically producing the intended feedback.
Publisher's Version Article Search Info

Debugging and Refactoring

Test Case Purification for Improving Fault Localization
Jifeng Xuan and Martin Monperrus
(INRIA, France; University of Lille, France)
Finding and fixing bugs are time-consuming activities in software development. Spectrum-based fault localization aims to identify the faulty position in source code based on the execution trace of test cases. Failing test cases and their assertions form test oracles for the failing behavior of the system under analysis. In this paper, we propose a novel concept of spectrum driven test case purification for improving fault localization. The goal of test case purification is to separate existing test cases into small fractions (called purified test cases) and to enhance the test oracles to further localize faults. Combining with an original fault localization technique (e.g., Tarantula), test case purification results in better ranking the program statements. Our experiments on 1800 faults in six open-source Java programs show that test case purification can effectively improve existing fault localization techniques.
Publisher's Version Article Search
Automatically Generated Patches as Debugging Aids: A Human Study
Yida Tao, Jindae Kim, Sunghun Kim, and Chang Xu
(Hong Kong University of Science and Technology, China; Nanjing University, China)
Recent research has made significant progress in automatic patch generation, an approach to repair programs with less or no manual intervention. However, direct deployment of auto-generated patches remains difficult, for reasons such as patch quality variations and developers' intrinsic resistance. In this study, we take one step back and investigate a more feasible application scenario of automatic patch generation, that is, using generated patches as debugging aids. We recruited 95 participants for a controlled experiment, in which they performed debugging tasks with the aid of either buggy locations (i.e., the control group), or generated patches of varied qualities. We observe that: a) high-quality patches significantly improve debugging correctness; b) such improvements are more obvious for difficult bugs; c) when using low-quality patches, participants' debugging correctness drops to an even lower point than that of the control group; d) debugging time is significantly affected not by debugging aids, but by participant type and the specific bug to fix. These results highlight that the benefits of using generated patches as debugging aids are contingent upon the quality of the patches. Our qualitative analysis of participants' feedback further sheds light on how generated patches can be improved and better utilized as debugging aids.
Publisher's Version Article Search
A Foundation for Refactoring C with Macros
Jeffrey L. Overbey, Farnaz Behrang, and Munawar Hafiz
(Auburn University, USA)
This paper establishes the concept of "preprocessor dependences" as a foundation for building automated refactoring tools that transform source code containing lexical macros and conditional compilation directives, such as those provided by the C preprocessor. We define a preprocessor dependence graph (PPDG) that models the relationships among macro definitions, macro invocations, and conditional compilation directives in a file--the relationships that must be maintained for the semantics of the C preprocessor to be preserved. For many refactorings, a tool can construct a PPDG from the code before and after it is transformed, then perform a linear-time comparison of the two graphs to determine whether the refactoring will operate correctly in the presence of macros and conditional compilation directives. The proposed technique was implemented in OpenRefactory/C and tested by applying refactorings to GNU Coreutils version 8.21. Empirical results indicate that the technique is effective; it successfully handled refactoring scenarios in which Eclipse CDT, Visual Assist X, and XRefactory all refactored code incorrectly.
Publisher's Version Article Search
Vector Abstraction and Concretization for Scalable Detection of Refactorings
Narcisa Andreea Milea, Lingxiao Jiang, and Siau-Cheng Khoo
(National University of Singapore, Singapore; Singapore Management University, Singapore)
Automated techniques have been proposed to either identify refactoring opportunities (i.e., code fragments that can be but have not yet been restructured in a program), or reconstruct historical refactorings (i.e., code restructuring operations that have happened between different versions of a program). In this paper, we propose a new technique that can detect both refactoring opportunities and historical refactorings in large code bases. The key of our technique is the design of vector abstraction and concretization operations that can encode code changes induced by certain refactorings as characteristic vectors. Thus, the problem of identifying refactorings can be reduced to the problem of identifying matching vectors, which can be solved efficiently. We have implemented our technique for Java. The prototype is applied to 200 bundle projects from the Eclipse ecosystem containing 4.5 million lines of code, and reports in total more than 32K instances of 17 types of refactoring opportunities, taking 25 minutes on average for each type. The prototype is also applied to 14 versions of 3 smaller programs (JMeter, Ant, XML-Security), and detects (1) more than 2.8K refactoring opportunities within individual versions with a precision of about 87%, and (2) more than 190 historical refactorings across consecutive versions of the programs with a precision of about 92%.
Publisher's Version Article Search

Static Analysis

FlowTwist: Efficient Context-Sensitive Inside-Out Taint Analysis for Large Codebases
Johannes Lerch, Ben Hermann, Eric Bodden, and Mira Mezini
(TU Darmstadt, Germany; Fraunhofer SIT, Germany)
Over the past years, widely used platforms such as the Java Class Library have been under constant attack through vulnerabilities that involve a combination of two taint-analysis problems: an integrity problem allowing attackers to trigger sensitive operations within the platform, and a confidentiality problem allowing the attacker to retrieve sensitive information or pointers from the results of those operations. While existing static taint analyses are good at solving either of those problems, we show that they scale prohibitively badly when being applied to situations that require the exploitation of both an integrity and confidentiality problem in combination. The main problem is the huge attack surface of libraries such as the Java Class Library, which exposes thousands of methods potentially controllable by an attacker. In this work we thus present FlowTwist, a novel taint-analysis approach that works inside-out, i.e., tracks data flows from potentially vulnerable calls to the outer level of the API which the attacker might control. This inside-out analysis requires a careful, context-sensitive coordination of both a backward and a forward taint analysis. In this work, we expose a design of the analysis approach based on the IFDS algorithm, and explain several extensions to IFDS that enable not only this coordination but also a helpful reporting of error situations to security analysts. Experiments with the Java Class Library show that, while a simple forward taint-analysis approach does not scale even with much machine power, FlowTwist's algorithm is able to fully analyze the library within 10 minutes.
Publisher's Version Article Search
ORBS: Language-Independent Program Slicing
David Binkley, Nicolas Gold, Mark Harman, Syed Islam, Jens Krinke, and Shin Yoo
(Loyola University Maryland, USA; University College London, UK)
Current slicing techniques cannot handle systems written in multiple programming languages. Observation-Based Slicing (ORBS) is a language-independent slicing technique capable of slicing multi-language systems, including systems which contain (third party) binary components. A potential slice obtained through repeated statement deletion is validated by observing the behaviour of the program: if the slice and original program behave the same under the slicing criterion, the deletion is accepted. The resulting slice is similar to a dynamic slice. We evaluate five variants of ORBS on ten programs of different sizes and languages showing that it is less expensive than similar existing techniques. We also evaluate it on bash and four other systems to demonstrate feasible large-scale operation in which a parallelised ORBS needs up to 82% less time when using four threads. The results show that an ORBS slicer is simple to construct, effective at slicing, and able to handle systems written in multiple languages without specialist analysis tools.
Publisher's Version Article Search Info
JSAI: A Static Analysis Platform for JavaScript
Vineeth Kashyap, Kyle Dewey, Ethan A. Kuefner, John Wagner, Kevin Gibbons, John Sarracino, Ben Wiedermann, and Ben Hardekopf
(University of California at Santa Barbara, USA; Harvey Mudd College, USA)
JavaScript is used everywhere from the browser to the server, including desktops and mobile devices. However, the current state of the art in JavaScript static analysis lags far behind that of other languages such as C and Java. Our goal is to help remedy this lack. We describe JSAI, a formally specified, robust abstract interpreter for JavaScript. JSAI uses novel abstract domains to compute a reduced product of type inference, pointer analysis, control-flow analysis, string analysis, and integer and boolean constant propagation. Part of JSAI's novelty is user-configurable analysis sensitivity, i.e., context-, path-, and heap-sensitivity. JSAI is designed to be provably sound with respect to a specific concrete semantics for JavaScript, which has been extensively tested against a commercial JavaScript implementation. We provide a comprehensive evaluation of JSAI's performance and precision using an extensive benchmark suite, including real-world JavaScript applications, machine generated JavaScript code via Emscripten, and browser addons. We use JSAI's configurability to evaluate a large number of analysis sensitivities (some well-known, some novel) and observe some surprising results that go against common wisdom. These results highlight the usefulness of a configurable analysis platform such as JSAI.
Publisher's Version Article Search
A Path-Sensitively Sliced Control Flow Graph
Joxan Jaffar and Vijayaraghavan Murali
(National University of Singapore, Singapore)
We present a new graph representation of programs with specified target variables. These programs are intended to be processed by third-party applications querying target variables such as testers and verifiers. The representation embodies two concepts. First, it is path-sensitive in the sense that multiple nodes representing one program point may exist so that infeasible paths can be excluded. Second, and more importantly, it is sliced with respect to the target variables. This key step is founded on a novel idea introduced in this paper, called ``Tree Slicing'', and on the fact that slicing is more effective when there is path sensitivity. Compared to the traditional Control Flow Graph (CFG), the new graph may be bigger (due to path-sensitivity) or smaller (due to slicing). We show that it is not much bigger in practice, if at all. The main result however concerns its quality: third-party testers and verifiers perform substantially better on the new graph compared to the original CFG.
Publisher's Version Article Search

Mining Software Repositories

Let's Talk About It: Evaluating Contributions through Discussion in GitHub
Jason Tsay, Laura Dabbish, and James Herbsleb
(Carnegie Mellon University, USA)
Open source software projects often rely on code contributions from a wide variety of developers to extend the capabilities of their software. Project members evaluate these contributions and often engage in extended discussions to decide whether to integrate changes. These discussions have important implications for project management regarding new contributors and evolution of project requirements and direction. We present a study of how developers in open work environments evaluate and discuss pull requests, a primary method of contribution in GitHub, analyzing a sample of extended discussions around pull requests and interviews with GitHub developers. We found that developers raised issues around contributions over both the appropriateness of the problem that the submitter attempted to solve and the correctness of the implemented solution. Both core project members and third-party stakeholders discussed and sometimes implemented alternative solutions to address these issues. Different stakeholders also influenced the outcome of the evaluation by eliciting support from different communities such as dependent projects or even companies. We also found that evaluation outcomes may be more complex than simply acceptance or rejection. In some cases, although a submitter's contribution was rejected, the core team fulfilled the submitter's technical goals by implementing an alternative solution. We found that the level of a submitter's prior interaction on a project changed how politely developers discussed the contribution and the nature of proposed alternative solutions.
Publisher's Version Article Search
A Large Scale Study of Programming Languages and Code Quality in Github
Baishakhi Ray, Daryl Posnett, Vladimir Filkov, and Premkumar Devanbu
(University of California at Davis, USA)
What is the effect of programming languages on software quality? This question has been a topic of much debate for a very long time. In this study, we gather a very large data set from GitHub (729 projects, 80 Million SLOC, 29,000 authors, 1.5 million commits, in 17 languages) in an attempt to shed some empirical light on this question. This reasonably large sample size allows us to use a mixed-methods approach, combining multiple regression modeling with visualization and text analytics, to study the effect of language features such as static v.s. dynamic typing, strong v.s. weak typing on software quality. By triangulating findings from different methods, and controlling for confounding effects such as team size, project size, and project history, we report that language design does have a significant, but modest effect on software quality. Most notably, it does appear that strong typing is modestly better than weak typing, and among functional languages, static typing is also somewhat better than dynamic typing. We also find that functional languages are somewhat better than procedural languages. It is worth noting that these modest effects arising from language design are overwhelmingly dominated by the process factors such as project size, team size, and commit size. However, we hasten to caution the reader that even these modest effects might quite possibly be due to other, intangible process factors, e.g., the preference of certain personality types for functional, static and strongly typed languages.
Publisher's Version Article Search
Mining Preconditions of APIs in Large-Scale Code Corpus
Hoan Anh Nguyen, Robert Dyer, Tien N. Nguyen, and Hridesh Rajan
(Iowa State University, USA)
Modern software relies on existing application programming interfaces (APIs) from libraries. Formal specifications for the APIs enable many software engineering tasks as well as help developers correctly use them. In this work, we mine large-scale repositories of existing open-source software to derive potential preconditions for API methods. Our key idea is that APIs’ preconditions would appear frequently in an ultra-large code corpus with a large number of API usages, while project-specific conditions will occur less frequently. First, we find all client methods invoking APIs. We then compute a control dependence relation from each call site and mine the potential conditions used to reach those call sites. We use these guard conditions as a starting point to automatically infer the preconditions for each API. We analyzed almost 120 million lines of code from SourceForge and Apache projects to infer preconditions for the standard Java Development Kit (JDK) library. The results show that our technique can achieve high accuracy with recall from 75–80% and precision from 82–84%. We also found 5 preconditions missing from human written specifications. They were all confirmed by a specification expert. In a user study, participants found 82% of the mined preconditions as a good starting point for writing specifications. Using our mining result, we also built a benchmark of more than 4,000 precondition-related bugs.
Publisher's Version Article Search
Automatic Mining of Specifications from Invocation Traces and Method Invariants
Ivo Krka, Yuriy Brun, and Nenad Medvidovic
(Google, Switzerland; University of Massachusetts, USA; University of Southern California, USA)
Software library documentation often describes individual methods' APIs, but not the intended protocols and method interactions. This can lead to library misuse, and restrict runtime detection of protocol violations and automated verification of software that uses the library. Specification mining, if accurate, can help mitigate these issues, which has led to significant research into new model-inference techniques that produce FSM-based models from program invariants and execution traces. However, there is currently a lack of empirical studies that, in a principled way, measure the impact of the inference strategies on model quality. To this end, we identify four such strategies and systematically study the quality of the models they produce for nine off-the-shelf libraries. We find that (1) using invariants to infer an initial model significantly improves model quality, increasing precision by 4% and recall by 41%, on average; (2) effective invariant filtering is crucial for quality and scalability of strategies that use invariants; and (3) using traces in combination with invariants greatly improves robustness to input noise. We present our empirical evaluation, implement new and extend existing model-inference techniques, and make public our implementations, ground-truth models, and experimental data. Our work can lead to higher-quality model inference, and directly improve the techniques and tools that rely on model inference.
Publisher's Version Article Search

Formal Methods and Verification

Counterexample Guided Abstraction Refinement of Product-Line Behavioural Models
Maxime Cordy, Patrick Heymans, Axel Legay, Pierre-Yves Schobbens, Bruno Dawagne, and Martin Leucker
(University of Namur, Belgium; INRIA, France; University of Lübeck, Germany)
The model-checking problem for Software Products Lines (SPLs) is harder than for single systems: variability constitutes a new source of complexity that exacerbates the state-explosion problem. Abstraction techniques have successfully alleviated state explosion in single-system models. However, they need to be adapted to SPLs, to take into account the set of variants that produce a counterexample. In this paper, we apply CEGAR (Counterexample-Guided Abstraction Refinement) and we design new forms of abstraction specifically for SPLs. We carry out experiments to evaluate the efficiency of our new abstractions. The results show that our abstractions, combined with an appropriate refinement strategy, hold the potential to achieve large reductions in verification time, although they sometimes perform worse. We discuss in which cases a given abstraction should be used.
Publisher's Version Article Search
Powering the Static Driver Verifier using Corral
Akash Lal and Shaz Qadeer
(Microsoft Research, India; Microsoft Research, USA)
The application of software-verification technology towards building realistic bug-finding tools requires working through several precision-scalability tradeoffs. For instance, a critical aspect while dealing with C programs is to formally define the treatment of pointers and the heap. A machine-level modeling is often intractable, whereas one that leverages high-level information (such as types) can be inaccurate. Another tradeoff is modeling integer arithmetic. Ideally, all arithmetic should be performed over bitvector representations whereas the current practice in most tools is to use mathematical integers for scalability. A third tradeoff, in the context of bounded program exploration, is to choose a bound that ensures high coverage without overwhelming the analysis. This paper works through these three tradeoffs when we applied Corral, an SMT-based verifier, inside Microsoft's Static Driver Verifier (SDV). Our decisions were guided by experimentation on a large set of drivers; the total verification time exceeded well over a month. We justify that each of our decisions were crucial in getting value out of Corral and led to Corral being accepted as the engine that powers SDV in the Windows 8.1 release, replacing the SLAM engine that had been used inside SDV for the past decade.
Publisher's Version Article Search
Verifying CTL-Live Properties of Infinite State Models using an SMT Solver
Amirhossein Vakili and Nancy A. Day
(University of Waterloo, Canada)
The ability to create and analyze abstract models is an important step in conquering software complexity. In this paper, we show that it is practical to verify dynamic properties of infinite state models expressed in a subset of CTL directly using an SMT solver without iteration, abstraction, or human intervention. We call this subset CTL-Live and it consists of the operators of CTL expressible using the least fixed point operator of the mu-calculus, which are commonly considered liveness properties (e.g., AF, AU). We show that using this method the verification of an infinite state model can sometimes complete more quickly than verifying a finite version of the model. We also examine modelling techniques to represent abstract models in first-order logic that facilitate this form of model checking.
Publisher's Version Article Search
Efficient Runtime-Enforcement Techniques for Policy Weaving
Richard Joiner, Thomas Reps, Somesh Jha, Mohan Dhawan, and Vinod Ganapathy
(University of Wisconsin-Madison, USA; GrammaTech, USA; IBM Research, India; Rutgers University, USA)
Policy weaving is a program-transformation technique that rewrites a program so that it is guaranteed to be safe with respect to a stateful security policy. It utilizes (i) static analysis to identify points in the program at which policy violations might occur, and (ii) runtime checks inserted at such points to monitor policy state and prevent violations from occurring. The promise of policy weaving stems from the possibility of blending the best aspects of static and dynamic analysis components. Therefore, a successful instantiation of policy weaving requires a careful balance and coordination between the two. In this paper, we examine the strategy of using a combination of transactional introspection and statement indirection to implement runtime enforcement in a policy-weaving system. Transactional introspection allows the state resulting from the execution of a statement to be examined and, if the policy would be violated, suppressed. Statement indirection serves as a light-weight runtime analysis that can recognize and instrument dynamically generated code that is not available to the static analysis. These techniques can be implemented via static rewriting so that all possible program executions are protected against policy violations. We describe our implementation of transactional introspection and statement indirection for policy weaving, and report experimental results that show the viability of the approach in the context of real-world JavaScript programs executing in a browser.
Publisher's Version Article Search

Regression Testing

Techniques for Improving Regression Testing in Continuous Integration Development Environments
Sebastian Elbaum, Gregg Rothermel, and John Penix
(University of Nebraska-Lincoln, USA; Google, USA)
In continuous integration development environments, software engineers frequently integrate new or changed code with the mainline codebase. This can reduce the amount of code rework that is needed as systems evolve and speed up development time. While continuous integration processes traditionally require that extensive testing be performed following the actual submission of code to the codebase, it is also important to ensure that enough testing is performed prior to code submission to avoid breaking builds and delaying the fast feedback that makes continuous integration desirable. In this work, we present algorithms that make continuous integration processes more cost-effective. In an initial pre-submit phase of testing, developers specify modules to be tested, and we use regression test selection techniques to select a subset of the test suites for those modules that render that phase more cost-effective. In a subsequent post-submit phase of testing, where dependent modules as well as changed modules are tested, we use test case prioritization techniques to ensure that failures are reported more quickly. In both cases, the techniques we utilize are novel, involving algorithms that are relatively inexpensive and do not rely on code coverage information -- two requirements for conducting testing cost-effectively in this context. To evaluate our approach, we conducted an empirical study on a large data set from Google that we make publicly available. The results of our study show that our selection and prioritization techniques can each lead to cost-effectiveness improvements in the continuous integration process.
Publisher's Version Article Search Info
Balancing Trade-Offs in Test-Suite Reduction
August Shi, Alex Gyori, Milos Gligoric, Andrey Zaytsev, and Darko Marinov
(University of Illinois at Urbana-Champaign, USA)
Regression testing is an important activity but can get expensive for large test suites. Test-suite reduction speeds up regression testing by identifying and removing redundant tests based on a given set of requirements. Traditional research on test-suite reduction is rather diverse but most commonly shares three properties: (1) requirements are defined by a coverage criterion such as statement coverage; (2) the reduced test suite has to satisfy all the requirements as the original test suite; and (3) the quality of the reduced test suites is measured on the software version on which the reduction is performed. These properties make it hard for test engineers to decide how to use reduced test suites. We address all three properties of traditional test-suite reduction: (1) we evaluate test-suite reduction with requirements defined by killed mutants; (2) we evaluate inadequate reduction that does not require reduced test suites to satisfy all the requirements; and (3) we propose evolution-aware metrics that evaluate the quality of the reduced test suites across multiple software versions. Our evaluations allow a more thorough exploration of trade-offs in test-suite reduction, and our evolution-aware metrics show how the quality of reduced test suites can change after the version where the reduction is performed. We compare the trade-offs among various reductions on 18 projects with a total of 261,235 tests over 3,590 commits and a cumulative history spanning 35 years of development. Our results help test engineers make a more informed decision about balancing size, coverage, and fault-detection loss of reduced test suites.
Publisher's Version Article Search
Identifying the Characteristics of Vulnerable Code Changes: An Empirical Study
Amiangshu Bosu, Jeffrey C. Carver, Munawar Hafiz, Patrick Hilley, and Derek Janni
(University of Alabama, USA; Auburn University, USA; Providence College, USA; Lewis & Clark College, USA)
To focus the efforts of security experts, the goals of this empirical study are to analyze which security vulnerabilities can be discovered by code review, identify characteristics of vulnerable code changes, and identify characteristics of developers likely to introduce vulnerabilities. Using a three-stage manual and automated process, we analyzed 267,046 code review requests from 10 open source projects and identified 413 Vulnerable Code Changes (VCC). Some key results include: (1) code review can identify common types of vulnerabilities; (2) while more experienced contributors authored the majority of the VCCs, the less experienced contributors' changes were 1.8 to 24 times more likely to be vulnerable; (3) the likelihood of a vulnerability increases with the number of lines changed, and (4) modified files are more likely to contain vulnerabilities than new files. Knowing which code changes are more prone to contain vulnerabilities may allow a security expert to concentrate on a smaller subset of submitted code changes. Moreover, we recommend that projects should: (a) create or adapt secure coding guidelines, (b) create a dedicated security review team, (c) ensure detailed comments during review to help knowledge dissemination, and (d) encourage developers to make small, incremental changes rather than large changes.
Publisher's Version Article Search

Improving Recommender Systems

On the Localness of Software
Zhaopeng Tu, Zhendong Su, and Premkumar Devanbu
(University of California at Davis, USA)
The n-gram language model, which has its roots in statistical natural language processing, has been shown to successfully capture the repetitive and predictable regularities (“naturalness") of source code, and help with tasks such as code suggestion, porting, and designing assistive coding devices. However, we show in this paper that this natural-language-based model fails to exploit a special property of source code: localness. We find that human-written programs are localized: they have useful local regularities that can be captured and exploited. We introduce a novel cache language model that consists of both an n-gram and an added “cache" component to exploit localness. We show empirically that the additional cache component greatly improves the n-gram approach by capturing the localness of software, as measured by both cross-entropy and suggestion accuracy. Our model’s suggestion accuracy is actually comparable to a state-of-the-art, semantically augmented language model; but it is simpler and easier to implement. Our cache language model requires nothing beyond lexicalization, and thus is applicable to all programming languages.
Publisher's Version Article Search
Learning Natural Coding Conventions
Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton
(University of Edinburgh, UK; University College London, UK; Microsoft Research, USA)
Every programmer has a characteristic style, ranging from preferences about identifier naming to preferences about object relationships and design patterns. Coding conventions define a consistent syntactic style, fostering readability and hence maintainability. When collaborating, programmers strive to obey a project’s coding conventions. However, one third of reviews of changes contain feedback about coding conventions, indicating that programmers do not always follow them and that project members care deeply about adherence. Unfortunately, programmers are often unaware of coding conventions because inferring them requires a global view, one that aggregates the many local decisions programmers make and identifies emergent consensus on style. We present NATURALIZE, a framework that learns the style of a codebase, and suggests revisions to improve stylistic consistency. NATURALIZE builds on recent work in applying statistical natural language processing to source code. We apply NATURALIZE to suggest natural identifier names and formatting conventions. We present four tools focused on ensuring natural code during development and release management, including code review. NATURALIZE achieves 94 % accuracy in its top suggestions for identifier names. We used NATURALIZE to generate 18 patches for 5 open source projects: 14 were accepted.
Publisher's Version Article Search Info
How Should We Measure Functional Sameness from Program Source Code? An Exploratory Study on Java Methods
Yoshiki Higo and Shinji Kusumoto
(Osaka University, Japan)
Program source code is one of the main targets of software engineering research. A wide variety of research has been conducted on source code, and many studies have leveraged structural, vocabulary, and method signature similarities to measure the functional sameness of source code. In this research, we conducted an empirical study to ascertain how we should use three similarities to measure functional sameness. We used two large datasets and measured the three similarities between all the method pairs in the datasets, each of which included approximately 15 million Java method pairs. The relationships between the three similarities were analyzed to determine how we should use each to detect functionally similar code. The results of our study revealed the following. (1) Method names are not always useful for detecting functionally similar code. Only if there are a small number of methods having a given name, the methods are likely to include functionally similar code. (2) Existing file-level, method-level, and block-level clone detection techniques often miss functionally similar code generated by copy-and-paste operations between different projects. (3) In the cases we use structural similarity for detecting functionally similar code, we obtained many false positives. However, we can avoid detecting most false positives by using a vocabulary similarity in addition to a structural one. (4) Using a vocabulary similarity to detect functionally similar code is not suitable for method pairs in the same file because such method pairs use many of the same program elements such as private methods or private fields.
Publisher's Version Article Search Info
The Plastic Surgery Hypothesis
Earl T. Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro
(University College London, UK; University of Massachusetts, USA; University of California at Davis, USA)
Recent work on genetic-programming-based approaches to automatic program patching have relied on the insight that the content of new code can often be assembled out of fragments of code that already exist in the code base. This insight has been dubbed the plastic surgery hypothesis; successful, well-known automatic repair tools such as GenProg rest on this hypothesis, but it has never been validated. We formalize and validate the plastic surgery hypothesis and empirically measure the extent to which raw material for changes actually already exists in projects. In this paper, we mount a large-scale study of several large Java projects, and examine a history of 15,723 commits to determine the extent to which these commits are graftable, i.e., can be reconstituted from existing code, and find an encouraging degree of graftability, surprisingly independent of commit size and type of commit. For example, we find that changes are 43% graftable from the exact version of the software being changed. With a view to investigating the difficulty of finding these grafts, we study the abundance of such grafts in three possible sources: the immediately previous version, prior history, and other projects. We also examine the contiguity or chunking of these grafts, and the degree to which grafts can be found in the same file. Our results are quite promising and suggest an optimistic future for automatic program patching methods that search for raw material in already extant code in the project being patched.
Publisher's Version Article Search

Concurrency and Parallelism

Grail: Context-Aware Fixing of Concurrency Bugs
Peng Liu, Omer Tripp, and Charles Zhang
(Wuhan University, China; IBM Research, USA; Hong Kong University of Science and Technology, China)
Writing efficient synchronization for multithreaded programs is notoriously hard. The resulting code often contains subtle concurrency bugs. Even worse, many bug fixes introduce new bugs. A classic example, seen widely in practice, is deadlocks resulting from fixing of an atomicity violation. These complexities have motivated the development of automated fixing techniques. Current techniques generate fixes that are typically conservative, giving up on available parallelism. Moreover, some of the techniques cannot guarantee the correctness of a fix, and may introduce deadlocks similarly to manual fix, whereas techniques that ensure correctness do so at the expense of even greater performance loss. We present Grail, a novel fixing algorithm that departs from previous techniques by simultaneously providing both correctness and optimality guarantees. Grail synthesizes bug-free yet optimal lock-based synchronization. To achieve this, Grail builds an analysis model of the buggy code that is both contextual, distinguishing different aliasing contexts to ensure efficiency, and global, accounting for the entire synchronization behavior of the involved threads to ensure correctness. Evaluation of Grail on 12 bugs from popular codebases confirms its practical advantages, especially compared with existing techniques: Grail patches are, in general, >=40% more efficient than the patches produced by other techniques, and incur only 2% overhead.
Publisher's Version Article Search
AI: A Lightweight System for Tolerating Concurrency Bugs
Mingxing Zhang, Yongwei Wu, Shan Lu, Shanxiang Qi, Jinglei Ren, and Weimin Zheng
(Tsinghua University, China; University of Wisconsin-Madison, USA; University of Illinois at Urbana-Champaign, USA)
Concurrency bugs are notoriously difficult to eradicate during software testing because of their non-deterministic nature. Moreover, fixing concurrency bugs is time-consuming and error-prone. Thus, tolerating concurrency bugs during production runs is an attractive complementary approach to bug detection and testing. Unfortunately, existing bug-tolerating tools are usually either 1) constrained in types of bugs they can handle or 2) requiring roll-back mechanism, which can hitherto not be fully achieved efficiently without hardware supports. This paper presents a novel program invariant, called Anticipating Invariant (AI), which can help anticipate bugs before any irreversible changes are made. Benefiting from this ability of anticipating bugs beforehand, our software-only system is able to forestall the failures with a simple thread stalling technique, which does not rely on execution roll-back and hence has good performance Experiments with 35 real-world concurrency bugs demonstrate that AI is capable of detecting and tolerating most types of concurrency bugs, including both atomicity and order violations. Two new bugs have been detected and confirmed by the corresponding developers. Performance evaluation with 6 representative parallel programs shows that AI incurs negligible overhead (<1%) for many nontrivial desktop and server applications.
Publisher's Version Article Search Info
Retrofitting Concurrency for Android Applications through Refactoring
Yu Lin, Cosmin Radoi, and Danny Dig
(University of Illinois at Urbana-Champaign, USA; Oregon State University, USA)
Running compute-intensive or blocking I/O operations in the UI event thread of smartphone apps can severely degrade responsiveness. Despite the fact that Android supports writing concurrent code via AsyncTask, we know little about how developers use AsyncTask to improve responsiveness. To understand how AsyncTask is used/underused/misused in practice, we rst conduct a formative study using a corpus of top 104 most popular open-source Android apps comprising 1.34M SLOC. Our study shows that even though half of the apps use AsyncTask, there are hundreds of places where they missed opportunities to encapsulate long-running operations in AsyncTask. Second, 46% of the usages are manually refactored. However, the refactored code contains concurrency bugs (such as data races) and performance bugs (concurrent code still executes sequentially). Inspired by these ndings, we designed, developed, and evaluated Asynchronizer, an automated refactoring tool that enables developers to extract long-running operations into AsyncTask. Asynchronizer uses a points-to static analysis to determine the safety of the transformation. Our empirical evaluation shows that Asynchronizer is (i) highly applicable, (ii) accurate, (iii) safer than manual refactoring (iv) it saves development eort, (v) its results have been accepted by the open-source developers. This shows that Asynchronizer is useful.
Publisher's Version Article Search Info
Sherlock: Scalable Deadlock Detection for Concurrent Programs
Mahdi Eslamimehr and Jens Palsberg
(University of California at Los Angeles, USA)
We present a new technique to find real deadlocks in concurrent programs that use locks. For 4.5 million lines of Java, our technique found almost twice as many real deadlocks as four previous techniques combined. Among those, 33 deadlocks happened after more than one million computation steps, including 27 new deadlocks. We first use a known technique to find 1275 deadlock candidates and then we determine that 146 of them are real deadlocks. Our technique combines previous work on concolic execution with a new constraint-based approach that iteratively drives an execution towards a deadlock candidate.
Publisher's Version Article Search

Self Adaptation and Repair / Program Analysis Applications

Search-Based Synthesis of Equivalent Method Sequences
Alberto Goffi, Alessandra Gorla, Andrea Mattavelli, Mauro Pezzè, and Paolo Tonella
(University of Lugano, Switzerland; Saarland University, Germany; Fondazione Bruno Kessler, Italy)
Software components are usually redundant, since their interface offers different operations that are equivalent in their functional behavior. Several reliability techniques exploit this redundancy to either detect or tolerate faults in software. Metamorphic testing, for instance, executes pairs of sequences of operations that are expected to produce equivalent results, and identifies faults in case of mismatching outcomes. Some popular fault tolerance and self-healing techniques execute redundant operations in an attempt to avoid failures at runtime. The common assumption of these techniques, though, is that such redundancy is known a priori. This means that the set of operations that are supposed to be equivalent in a given component should be available in the specifications. Unfortunately, inferring this information manually can be expensive and error prone. This paper proposes a search-based technique to synthesize sequences of method invocations that are equivalent to a target method within a finite set of execution scenarios. The experimental results obtained on 47 methods from 7 classes show that the proposed approach correctly identifies equivalent method sequences in the majority of the cases where redundancy was known to exist, with very few false positives.
Publisher's Version Article Search
Beyond the Rainbow: Self-Adaptive Failure Avoidance in Configurable Systems
Jacob Swanson, Myra B. Cohen, Matthew B. Dwyer, Brady J. Garvin, and Justin Firestone
(University of Nebraska-Lincoln, USA)
Self-adaptive software systems monitor their state and then adapt when certain conditions are met, guided by a global utility function. In prior work we developed algorithms and conducted a post-hoc analysis demonstrating the possibility of adapting to software failures by judiciously changing configurations. In this paper we present the REFRACT framework that realizes this idea in practice by building on the self-adaptive Rainbow architecture. REFRACT extends Rainbow with new components and algorithms targeting failure avoidance. We use REFRACT in a case study running four independently executing Firefox clients with 36 passing test cases and 7 seeded faults. The study show that workarounds for all but one of the seeded faults are found and the one that is not found never fails -- it is guarded from failing by a related workaround. Moreover, REFRACT finds workarounds for eight configuration-related unseeded failures from tests that were expected to pass (and did under the default configuration). Finally, the data show that when a failure and its workaround are found, configuration guards prevent the failure from appearing again. In a simulation lasting 24 hours we see over 150 guard activations and no failures with workarounds remaining beyond 16 hours.
Publisher's Version Article Search
Semantics-Based Obfuscation-Resilient Binary Code Similarity Comparison with Applications to Software Plagiarism Detection
Lannan Luo, Jiang Ming, Dinghao Wu, Peng Liu, and Sencun Zhu
(Pennsylvania State University, USA)
Existing code similarity comparison methods, whether source or binary code based, are mostly not resilient to obfuscations. In the case of software plagiarism, emerging obfuscation techniques have made automated detection increasingly difficult. In this paper, we propose a binary-oriented, obfuscation-resilient method based on a new concept, longest common subsequence of semantically equivalent basic blocks, which combines rigorous program semantics with longest common subsequence based fuzzy matching. We model the semantics of a basic block by a set of symbolic formulas representing the input-output relations of the block. This way, the semantics equivalence (and similarity) of two blocks can be checked by a theorem prover. We then model the semantics similarity of two paths using the longest common subsequence with basic blocks as elements. This novel combination has resulted in strong resiliency to code obfuscation. We have developed a prototype and our experimental results show that our method is effective and practical when applied to real-world software.
Publisher's Version Article Search
Focus-Shifting Patterns of OSS Developers and Their Congruence with Call Graphs
Qi Xuan, Aaron Okano, Premkumar Devanbu, and Vladimir Filkov
(University of California at Davis, USA; Zhejiang University of Technology, China)
Developers in complex, self-organized open-source projects often work on many different files, and over time switch focus between them. Shifting focus can have impact on the software quality and productivity, and is thus an important topic of investigation. In this paper, we study focus shifting patterns (FSPs) of developers by comparing trace data from a dozen open source software (OSS) projects of their longitudinal commit activities and file dependencies from the projects call graphs. Using information theoretic measures of network structure, we find that fairly complex focus-shifting patterns emerge, and FSPs in the same project are more similar to each other. We show that developers tend to shift focus along with, rather than away from, software dependency links described by the call graphs. This tendency becomes weaker as either the interval between successive commits, or the organizational distance between committed files (i.e. directory distance), gets larger. Interestingly, this tendency appears stronger with more productive developers. We hope our study will initiate interest in further understanding of FSPs, which can ultimately help to (1) improve current recommender systems to predict the next focus of developers, and (2) provide insight into better call graph design, so as to facilitate developers' work.
Publisher's Version Article Search

Symbolic Execution

How We Get There: A Context-Guided Search Strategy in Concolic Testing
Hyunmin Seo and Sunghun Kim
(Hong Kong University of Science and Technology, China)
One of the biggest challenges in concolic testing, an automatic test generation technique, is its huge search space. Concolic testing generates next inputs by selecting branches from previous execution paths. However, a large number of candidate branches makes a simple exhaustive search infeasible, which often leads to poor test coverage. Several search strategies have been proposed to explore high-priority branches only. Each strategy applies different criteria to the branch selection process but most do not consider context, how we got to the branch, in the selection process. In this paper, we introduce a context-guided search (CGS) strategy. CGS looks at preceding branches in execution paths and selects a branch in a new context for the next input. We evaluate CGS with two publicly available concolic testing tools, CREST and CarFast, on six C subjects and six Java subjects. The experimental results show that CGS achieves the highest coverage of all twelve subjects and reaches a target coverage with a much smaller number of iterations on most subjects than other strategies.
Publisher's Version Article Search
Solving Complex Path Conditions through Heuristic Search on Induced Polytopes
Peter Dinges and Gul Agha
(University of Illinois at Urbana-Champaign, USA)
Test input generators using symbolic and concolic execution must solve path conditions to systematically explore a program and generate high coverage tests. However, path conditions may contain complicated arithmetic constraints that are infeasible to solve: a solver may be unavailable, solving may be computationally intractable, or the constraints may be undecidable. Existing test generators either simplify such constraints with concrete values to make them decidable, or rely on strong but incomplete constraint solvers. Unfortunately, simplification yields coarse approximations whose solutions rarely satisfy the original constraint. Moreover, constraint solvers cannot handle calls to native library methods. We show how a simple combination of linear constraint solving and heuristic search can overcome these limitations. We call this technique Concolic Walk. On a corpus of 11 programs, an instance of our Concolic Walk algorithm using tabu search generates tests with two- to three-times higher coverage than simplification-based tools while being up to five-times as efficient. Furthermore, our algorithm improves the coverage of two state-of-the-art test generators by 21% and 32%. Other concolic and symbolic testing tools could integrate our algorithm to solve complex path conditions without having to sacrifice any of their own capabilities, leading to higher overall coverage.
Publisher's Version Article Search Info
Statistical Symbolic Execution with Informed Sampling
Antonio Filieri, Corina S. Păsăreanu, Willem Visser, and Jaco Geldenhuys
(University of Stuttgart, Germany; Carnegie Mellon University, USA; NASA Ames Research Center, USA; Stellenbosch University, South Africa)
Symbolic execution techniques have been proposed recently for the probabilistic analysis of programs. These techniques seek to quantify the likelihood of reaching program events of interest, e.g., assert violations. They have many promising applications but have scalability issues due to high computational demand. To address this challenge, we propose a statistical symbolic execution technique that performs Monte Carlo sampling of the symbolic program paths and uses the obtained information for Bayesian estimation and hypothesis testing with respect to the probability of reaching the target events. To speed up the convergence of the statistical analysis, we propose Informed Sampling, an iterative symbolic execution that first explores the paths that have high statistical significance, prunes them from the state space and guides the execution towards less likely paths. The technique combines Bayesian estimation with a partial exact analysis for the pruned paths leading to provably improved convergence of the statistical analysis. We have implemented statistical symbolic execution with informed sampling in the Symbolic PathFinder tool. We show experimentally that the informed sampling obtains more precise results and converges faster than a purely statistical analysis and may also be more efficient than an exact symbolic analysis. When the latter does not terminate symbolic execution with informed sampling can give meaningful results under the same time and memory limits.
Publisher's Version Article Search
SymJS: Automatic Symbolic Testing of JavaScript Web Applications
Guodong Li, Esben Andreasen, and Indradeep Ghosh
(Fujitsu Labs, USA; Aarhus University, Denmark)
We present SymJS, a comprehensive framework for automatic testing of client-side JavaScript Web applications. The tool contains a symbolic execution engine for JavaScript, and an automatic event explorer for Web pages. Without any user intervention, SymJS can automatically discover and explore Web events, symbolically execute the associated JavaScript code, refine the execution based on dynamic feedbacks, and produce test cases with high coverage. The symbolic engine contains a symbolic virtual machine, a string-numeric solver, and a symbolic executable DOM model. SymJS's innovations include a novel symbolic virtual machine for JavaScript Web, symbolic+dynamic feedback directed event space exploration, and dynamic taint analysis for enhancing event sequence construction. We illustrate the effectiveness of SymJS on standard JavaScript benchmarks and various real-life Web applications. On average SymJS achieves over 90% line coverage for the benchmark programs, significantly outperforming existing methods.
Publisher's Version Article Search

Software Documentation

Selection and Presentation Practices for Code Example Summarization
Annie T. T. Ying and Martin P. Robillard
(McGill University, Canada)
Code examples are an important source for answering questions about software libraries and applications. Many usage contexts for code examples require them to be distilled to their essence: e.g., when serving as cues to longer documents, or for reminding developers of a previously known idiom. We conducted a study to discover how code can be summarized and why. As part of the study, we collected 156 pairs of code examples and their summaries from 16 participants, along with over 26 hours of think-aloud verbalizations detailing the decisions of the participants during their summarization activities. Based on a qualitative analysis of this data we elicited a list of practices followed by the participants to summarize code examples and propose empirically-supported hypotheses justifying the use of specific practices. One main finding was that none of the participants exclusively extracted code verbatim for the summaries, motivating abstractive summarization. The results provide a grounded basis for the development of code example summarization and presentation technology.
Publisher's Version Article Search
Mining Idioms from Source Code
Miltiadis Allamanis and Charles Sutton
(University of Edinburgh, UK)
We present the first method for automatically mining code idioms from a corpus of previously written, idiomatic software projects. We take the view that a code idiom is a syntactic fragment that recurs across projects and has a single semantic purpose. Idioms may have metavariables, such as the body of a for loop. Modern IDEs commonly provide facilities for manually defining idioms and inserting them on demand, but this does not help programmers to write idiomatic code in languages or using libraries with which they are unfamiliar. We present Haggis, a system for mining code idioms that builds on recent advanced techniques from statistical natural language processing, namely, nonparametric Bayesian probabilistic tree substitution grammars. We apply Haggis to several of the most popular open source projects from GitHub. We present a wide range of evidence that the resulting idioms are semantically meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site. Manual examination of the most common idioms indicate that they describe important program concepts, including object creation, exception handling, and resource management.
Publisher's Version Article Search
Automatic Generation of Release Notes
Laura Moreno, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, Andrian Marcus, and Gerardo Canfora
(University of Texas at Dallas, USA; University of Sannio, Italy; University of Molise, Italy)
This paper introduces ARENA (Automatic RElease Notes generAtor), an approach for the automatic generation of release notes. ARENA extracts changes from the source code, summarizes them, and integrates them with information from versioning systems and issue trackers. It was designed based on the manual analysis of 1,000 existing release notes. To evaluate the quality of the ARENA release notes, we performed three empirical studies involving a total of 53 participants (45 professional developers and 8 students). The results indicate that the ARENA release notes are very good approximations of those produced by developers and often include important information that is missing in the manually produced release notes.
Publisher's Version Article Search

Web Apps

Discovering Refactoring Opportunities in Cascading Style Sheets
Davood Mazinanian, Nikolaos Tsantalis, and Ali Mesbah
(Concordia University, Canada; University of British Columbia, Canada)
Cascading Style Sheets (CSS) is a language used for describing the look and formatting of HTML documents. CSS has been widely adopted in web and mobile development practice, since it enables a clean separation of content from presentation. The language exhibits complex features, such as inheritance, cascading and specificity, which make CSS code hard to maintain. Therefore, it is important to find ways to improve the maintainability of CSS code. In this paper, we propose an automated approach to remove duplication in CSS code. More specifically, we have developed a technique that detects three types of CSS declaration duplication and recommends refactoring opportunities to eliminate those duplications. Our approach uses preconditions that ensure the application of a refactoring will preserve the original document styling. We evaluate our technique on 38 real-world web systems and 91 CSS files, in total. Our findings show that duplication in CSS code is widely prevalent. Additionally, there is a significant number of presentation-preserving refactoring opportunities that can reduce the size of the CSS files and increase the maintainability of the code.
Publisher's Version Article Search
SAFEWAPI: Web API Misuse Detector for Web Applications
SungGyeong Bae, Hyunghun Cho, Inho Lim, and Sukyoung Ryu
(KAIST, South Korea; Samsung Electronics, South Korea)
The evolution of Web 2.0 technologies makes web applications prevalent in various platforms including mobile devices and smart TVs. While one of the driving technologies of web applications is JavaScript, the extremely dynamic features of JavaScript make it very difficult to define and detect errors in JavaScript applications. The problem becomes more important and complicated for JavaScript web applications which may lead to severe security vulnerabilities. To help developers write safe JavaScript web applications using vendor-specific Web APIs, vendors specify their APIs often in Web IDL, which enables both API writers and users to communicate better by understanding the expected behaviors of the Web APIs. In this paper, we present SAFEWAPI, a tool to analyze Web APIs and JavaScript web applications that use the Web APIs and to detect possible misuses of Web APIs by the web applications. Even though the JavaScript language semantics allows to call a function defined with some parameters without any arguments, platform developers may require application writers to provide the exact number of arguments. Because the library functions in Web APIs expose their intended semantics clearly to web application developers unlike pure JavaScript functions, we can detect wrong uses of Web APIs precisely. For representative misuses of Web APIs defined by software quality assurance engineers, our SAFEWAPI detects such misuses in real-world JavaScript web applications.
Publisher's Version Article Search
Building Call Graphs for Embedded Client-Side Code in Dynamic Web Applications
Hung Viet Nguyen, Christian Kästner, and Tien N. Nguyen
(Iowa State University, USA; Carnegie Mellon University, USA)
When developing and maintaining a software system, programmers often rely on IDEs to provide editor services such as syntax highlighting, auto-completion, and "jump to declaration". In dynamic web applications, such tool support is currently limited to either the server-side code or to hand-written or generated client-side code. Our goal is to build a call graph for providing editor services on client-side code while it is still embedded as string literals within server-side code. First, we symbolically execute the server-side code to identify all possible client-side code variations. Subsequently, we parse the generated client-side code with all its variations into a VarDOM that compactly represents all DOM variations for further analysis. Based on the VarDOM, we build conditional call graphs for embedded HTML, CSS, and JS. Our empirical evaluation on real-world web applications show that our analysis achieves 100% precision in identifying call-graph edges. 62% of the edges cross PHP strings, and 17% of them cross files - in both situations, navigation without tool support is tedious and error prone.
Publisher's Version Article Search

Architecture and Design

Sketches and Diagrams in Practice
Sebastian Baltes and Stephan Diehl
(University of Trier, Germany)
Sketches and diagrams play an important role in the daily work of software developers. In this paper, we investigate the use of sketches and diagrams in software engineering practice. To this end, we used both quantitative and qualitative methods. We present the results of an exploratory study in three companies and an online survey with 394 participants. Our participants included software developers, software architects, project managers, consultants, as well as researchers. They worked in different countries and on projects from a wide range of application areas. Most questions in the survey were related to the last sketch or diagram that the participants had created. Contrary to our expectations and previous work, the majority of sketches and diagrams contained at least some UML elements. However, most of them were informal. The most common purposes for creating sketches and diagrams were designing, explaining, and understanding, but analyzing requirements was also named often. More than half of the sketches and diagrams were created on analog media like paper or whiteboards and have been revised after creation. Most of them were used for more than a week and were archived. We found that the majority of participants related their sketches to methods, classes, or packages, but not to source code artifacts with a lower level of abstraction.
Publisher's Version Article Search
Architecture Challenges for Internal Software Ecosystems: A Large-Scale Industry Case Study
Klaus-Benedikt Schultis, Christoph Elsner, and Daniel Lohmann
(Siemens, Germany; University of Erlangen-Nuremberg, Germany)
The idea of software ecosystems encourages organizations to open software projects for external businesses, governing the cross-organizational development by architectural and other measures. Even within a single organization, this paradigm can be of high value for large-scale decentralized software projects that involve various internal, yet self-contained organizational units. However, this intra-organizational decentralization causes architecture challenges that must be understood to reason about suitable architectural measures. We present an in-depth case study on collaboration and architecture challenges in two of these large-scale software projects at Siemens. We performed a total of 46 hours of semi-structured interviews with 17 leading software architects from all involved organizational units. Our major findings are: (1) three collaboration models on a continuum that ranges from high to low coupling, (2) a classification of architecture challenges, together with (3) a qualitative and quantitative exposure of the identified recurring issues along each collaboration model. Our study results provide valuable insights for both industry and academia: Practitioners that find themselves in one of the collaboration models can use empirical evidence on challenges to make informed decisions about counteractive measures. Researchers can focus their attention on challenges faced by practitioners to make software engineering more effective.
Publisher's Version Article Search
Variable-Specific Resolutions for Feature Interactions
Cecylia Bocovich and Joanne M. Atlee
(University of Waterloo, Canada)
Systems assembled from independently developed features suffer from feature interactions, in which features affect one another's behaviour in surprising ways. The feature-interaction problem states that the number of potential interactions is exponential in the number of features in a system. Resolution strategies offer general strategies that resolve entire classes of interactions, thereby reducing the work of the developer who is charged with the task of resolving interactions. In this paper, we focus on resolving interactions due to conflict. We present an approach, language, and implementation based on resolution modules in which the developer can specify an appropriate resolution for each variable under conflict. We performed a case study involving 24 automotive features, and found that the number of resolutions to be specified was much smaller than the number of possible feature interactions (6 resolutions for 24 features), that what constitutes an appropriate resolution strategy is different for different variables, and that the subset of situation calculus we used was sufficient to construct nontrivial resolution strategies for six distinct output variables.
Publisher's Version Article Search
An Empirical Study on Program Comprehension with Reactive Programming
Guido Salvaneschi, Sven Amann, Sebastian Proksch, and Mira Mezini
(TU Darmstadt, Germany; Lancaster University, UK)
Starting from the first investigations with strictly functional languages, reactive programming has been proposed as THE programming paradigm for reactive applications. The advantages of designs based on this style over designs based on the Observer design pattern have been studied for a long time. Over the years, researchers have enriched reactive languages with more powerful abstractions, embedded these abstractions into mainstream languages – including object-oriented languages – and applied reactive programming to several domains, like GUIs, animations, Web applications, robotics, and sensor networks. However, an important assumption behind this line of research – that, beside other advantages, reactive programming makes a wide class of otherwise cumbersome applications more comprehensible – has never been evaluated. In this paper, we present the design and the results of the first empirical study that evaluates the effect of reactive programming on comprehensibility compared to the traditional object-oriented style with the Observer design pattern. Results confirm the conjecture that comprehensibility is enhanced by reactive programming. In the experiment, the reactive programming group significantly outperforms the other group.
Publisher's Version Article Search

Mobile Apps

Apposcopy: Semantics-Based Detection of Android Malware through Static Analysis
Yu Feng, Saswat Anand, Isil Dillig, and Alex Aiken
(University of Texas at Austin, USA; Stanford University, USA)
We present Apposcopy, a new semantics-based approach for identifying a prevalent class of Android malware that steals private user information. Apposcopy incorporates (i) a high-level language for specifying signatures that describe semantic characteristics of malware families and (ii) a static analysis for deciding if a given application matches a malware signature. The signature matching algorithm of Apposcopy uses a combination of static taint analysis and a new form of program representation called Inter-Component Call Graph to efficiently detect Android applications that have certain control- and data-flow properties. We have evaluated Apposcopy on a corpus of real-world Android applications and show that it can effectively and reliably pinpoint malicious applications that belong to certain malware families.
Publisher's Version Article Search
Detecting Energy Bugs and Hotspots in Mobile Apps
Abhijeet Banerjee, Lee Kee Chong, Sudipta Chattopadhyay, and Abhik Roychoudhury
(National University of Singapore, Singapore; Linköping University, Sweden)
Over the recent years, the popularity of smartphones has increased dramatically. This has lead to a widespread availability of smartphone applications. Since smartphones operate on a limited amount of battery power, it is important to develop tools and techniques that aid in energy-efficient application development. Energy inefficiencies in smartphone applications can broadly be categorized into energy hotspots and energy bugs. An energy hotspot can be described as a scenario where executing an application causes the smartphone to consume abnormally high amount of battery power, even though the utilization of its hardware resources is low. In contrast, an energy bug can be described as a scenario where a malfunctioning application prevents the smartphone from becoming idle, even after it has completed execution and there is no user activity. In this paper, we present an automated test generation framework that detects energy hotspots/bugs in Android applications. Our framework systematically generates test inputs that are likely to capture energy hotspots/bugs. Each test input captures a sequence of user interactions (e.g. touches or taps on the smartphone screen) that leads to an energy hotspot/bug in the application. Evaluation with 30 freely-available Android applications from Google Play/F-Droid shows the efficacy of our framework in finding hotspots/bugs. Manual validation of the experimental results shows that our framework reports reasonably low number of false positives. Finally, we show the usage of the generated results by improving the energy-efficiency of some Android applications.
Publisher's Version Article Search
EvoDroid: Segmented Evolutionary Testing of Android Apps
Riyadh Mahmood, Nariman Mirzaei, and Sam Malek
(George Mason University, USA)
Proliferation of Android devices and apps has created a demand for applicable automated software testing techniques. Prior research has primarily focused on either unit or GUI testing of Android apps, but not their end-to-end system testing in a systematic manner. We present EvoDroid, an evolutionary approach for system testing of Android apps. EvoDroid overcomes a key shortcoming of using evolutionary techniques for system testing, i.e., the inability to pass on genetic makeup of good individuals in the search. To that end, EvoDroid combines two novel techniques: (1) an Android-specific program analysis technique that identifies the segments of the code amenable to be searched independently, and (2) an evolutionary algorithm that given information of such segments performs a step-wise search for test cases reaching deep into the code. Our experiments have corroborated EvoDroid’s ability to achieve significantly higher code coverage than existing Android testing tools.
Publisher's Version Article Search
Prioritizing the Devices to Test Your App on: A Case Study of Android Game Apps
Hammad Khalid, Meiyappan Nagappan, Emad Shihab, and Ahmed E. Hassan
(Queen's University, Canada; Rochester Institute of Technology, USA; Concordia University, Canada)
Star ratings that are given by the users of mobile apps directly impact the revenue of its developers. At the same time, for popular platforms like Android, these apps must run on hundreds of devices increasing the chance for device-specific problems. Device-specific problems could impact the rating assigned to an app, given the varying capabilities of devices (e.g., hardware and software). To fix device-specific problems developers must test their apps on a large number of Android devices, which is costly and inefficient. Therefore, to help developers pick which devices to test their apps on, we propose using the devices that are mentioned in user reviews. We mine the user reviews of 99 free game apps and find that, apps receive user reviews from a large number of devices: between 38 to 132 unique devices. However, most of the reviews (80%) originate from a small subset of devices (on average, 33%). Furthermore, we find that developers of new game apps with no reviews can use the review data of similar game apps to select the devices that they should focus on first. Finally, among the set of devices that generate the most reviews for an app, we find that some devices tend to generate worse ratings than others. Our findings indicate that focusing on the devices with the most reviews (in particular the ones with negative ratings), developers can effectively prioritize their limited Quality Assurance (QA) efforts, since these devices have the greatest impact on ratings.
Publisher's Version Article Search Info

Testing and Oracles

Improving Oracle Quality by Detecting Brittle Assertions and Unused Inputs in Tests
Chen Huo and James Clause
(University of Delaware, USA)
Writing oracles is challenging. As a result, developers often create oracles that check too little, resulting in tests that are unable to detect failures, or check too much, resulting in tests that are brittle and difficult to maintain. In this paper we present a new technique for automatically analyzing test oracles. The technique is based on dynamic tainting and detects both brittle assertions—assertions that depend on values that are derived from uncontrolled inputs—and unused inputs—inputs provided by the test that are not checked by an assertion. We also presented OraclePolish, an implementation of the technique that can analyze tests that are written in Java and use the JUnit testing framework. Using OraclePolish, we conducted an empirical evaluation of more than 4000 real test cases. The results of the evaluation show that OraclePolish is effective; it detected 164 tests that contain brittle assertions and 1618 tests that have unused inputs. In addition, the results also demonstrate that the costs associated with using the technique are reasonable.
Publisher's Version Article Search
On the Efficiency of Automated Testing
Marcel Böhme and Soumya Paul
(Saarland University, Germany; National University of Singapore, Singapore)
The aim of automated program testing is to gain confidence about a program's correctness by sampling its input space. The sampling process can be either systematic or random. For every systematic testing technique the sampling is informed by the analysis of some program artefacts, like the specification, the source code (e.g., to achieve coverage), or even faulty versions of the program (e.g., mutation testing). This analysis incurs some cost. In contrast, random testing is unsystematic and does not sustain any analysis cost. In this paper, we investigate the theoretical efficiency of systematic versus random testing. First, we mathematically model the most effective systematic testing technique S_0 in which every sampled test input strictly increases the "degree of confidence" and is subject to the analysis cost c. Note that the efficiency of S_0 depends on c. Specifically, if we increase c, we also increase the time it takes S_0 to establish the same degree of confidence. So, there exists a maximum analysis cost beyond which R is generally more efficient than S_0. Given that we require the confidence that the program works correctly for x% of its input, we prove an upper bound on c of S_0, beyond which R is more efficient on the average. We also show that this bound depends asymptotically only on x. For instance, let R take 10ms time to sample one test input; to establish that the program works correctly for 90% of its input, S_0 must take less than 41ms to sample one test input. Otherwise, R is expected to establish the 90%-degree of confidence earlier. We prove similar bounds on the cost if the software tester is interested in revealing as many errors as possible in a given time span.
Publisher's Version Article Search
An Empirical Analysis of Flaky Tests
Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov
(University of Illinois at Urbana-Champaign, USA)
Regression testing is a crucial part of software development. It checks that software changes do not break existing functionality. An important assumption of regression testing is that test outcomes are deterministic: an unmodified test is expected to either always pass or always fail for the same code under test. Unfortunately, in practice, some tests often called flaky tests—have non-deterministic outcomes. Such tests undermine the regression testing as they make it difficult to rely on test results. We present the first extensive study of flaky tests. We study in detail a total of 201 commits that likely fix flaky tests in 51 open-source projects. We classify the most common root causes of flaky tests, identify approaches that could manifest flaky behavior, and describe common strategies that developers use to fix flaky tests. We believe that our insights and implications can help guide future research on the important topic of (avoiding) flaky tests.
Publisher's Version Article Search
Are Mutants a Valid Substitute for Real Faults in Software Testing?
René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser
(University of Washington, USA; University of Waterloo, Canada; University of Sheffield, UK)
A good test suite is one that detects real faults. Because the set of faults in a program is usually unknowable, this definition is not useful to practitioners who are creating test suites, nor to researchers who are creating and evaluating tools that generate test suites. In place of real faults, testing research often uses mutants, which are artificial faults -- each one a simple syntactic variation -- that are systematically seeded throughout the program under test. Mutation analysis is appealing because large numbers of mutants can be automatically-generated and used to compensate for low quantities or the absence of known real faults. Unfortunately, there is little experimental evidence to support the use of mutants as a replacement for real faults. This paper investigates whether mutants are indeed a valid substitute for real faults, i.e., whether a test suite’s ability to detect mutants is correlated with its ability to detect real faults that developers have fixed. Unlike prior studies, these investigations also explicitly consider the conflating effects of code coverage on the mutant detection rate. Our experiments used 357 real faults in 5 open-source applications that comprise a total of 321,000 lines of code. Furthermore, our experiments used both developer-written and automatically-generated test suites. The results show a statistically significant correlation between mutant detection and real fault detection, independently of code coverage. The results also give concrete suggestions on how to improve mutation analysis and reveal some inherent limitations.
Publisher's Version Article Search

Evolution and Maintenance

No Issue Left Behind: Reducing Information Overload in Issue Tracking
Olga Baysal, Reid Holmes, and Michael W. Godfrey
(Université de Montréal, Canada; University of Waterloo, Canada)
Modern software development tools such as issue trackers are often complex and multi-purpose tools that provide access to an immense amount of raw information. Unfortunately, developers sometimes feel frustrated when they cannot easily obtain the particular information they need for a given task; furthermore, the constant influx of new data — the vast majority of which is irrelevant to their task at hand — may result in issues being "dropped on the floor". In this paper, we present a developer-centric approach to issue tracking that aims to reduce information overload and improve developers' situational awareness. Our approach is motivated by a grounded theory study of developer comments, which suggests that customized views of a project's repositories that are tailored to developer-specific tasks can help developers better track their progress and understand the surrounding technical context. From the qualitative study, we uncovered a model of the kinds of information elements that are essential for developers in completing their daily tasks, and from this model we built a tool organized around customized issue-tracking dashboards. Further quantitative and qualitative evaluation demonstrated that this dashboard-like approach to issue tracking can reduce the volume of irrelevant emails by over 99% and also improve support for specific issue-tracking tasks.
Publisher's Version Article Search
Panning Requirement Nuggets in Stream of Software Maintenance Tickets
Senthil Mani, Karthik Sankaranarayanan, Vibha Singhal Sinha, and Premkumar Devanbu
(IBM Research, India; University of California at Davis, USA)
There is an increasing trend to outsource maintenance of large applications and application portfolios of a business to third parties, specialising in application maintenance, who are incented to deliver the best possible maintenance at the lowest cost. To do so, they need to identify repeat problem areas, which cause more maintenance grief, and seek a unified remedy to avoid the costs spent on fixing these individually. These repeat areas, in a sense, represent major, evolving areas of need, or requirements, for the customer. The information about the repeating problem is typically embedded in the unstructured text of multiple tickets, waiting to be found and addressed. Currently, repeat problems are found by manual analysis; effective solutions depend on the collective experience of the team solving them. In this paper, we propose an approach to automatically analyze problem tickets to discover groups of problems being reported in them and provide meaningful, descriptive labels to help interpret these groups. Our approach incorporates a cleansing phase to handle the high level of noise observed in problem tickets and a method to incorporate multiple text clustering techniques and merge their results in a meaningful manner. We provide detailed experiments to quantitatively and qualitatively evaluate our approach
Publisher's Version Article Search
Learning to Rank Relevant Files for Bug Reports using Domain Knowledge
Xin Ye, Razvan Bunescu, and Chang Liu
(Ohio University, USA)
When a new bug report is received, developers usually need to reproduce the bug and perform code reviews to find the cause, a process that can be tedious and time consuming. A tool for ranking all the source files of a project with respect to how likely they are to contain the cause of the bug would enable developers to narrow down their search and potentially could lead to a substantial increase in productivity. This paper introduces an adaptive ranking approach that leverages domain knowledge through functional decompositions of source code files into methods, API descriptions of library components used in the code, the bug-fixing history, and the code change history. Given a bug report, the ranking score of each source file is computed as a weighted combination of an array of features encoding domain knowledge, where the weights are trained automatically on previously solved bug reports using a learning-to-rank technique. We evaluated our system on six large scale open source Java projects, using the before-fix version of the project for every bug report. The experimental results show that the newly introduced learning-to-rank approach significantly outperforms two recent state-of-the-art methods in recommending relevant files for bug reports. In particular, our method makes correct recommendations within the top 10 ranked source files for over 70% of the bug reports in the Eclipse Platform and Tomcat projects.
Publisher's Version Article Search
Querying Sequential Software Engineering Data
Chengnian Sun, Haidong Zhang, Jian-Guang Lou, Hongyu Zhang, Qiang Wang, Dongmei Zhang, and Siau-Cheng Khoo
(University of California at Davis, USA; Microsoft Research, China; National University of Singapore, Singapore)
We propose a pattern-based approach to effectively and efficiently analyzing sequential software engineering (SE) data. Different from other types of SE data, sequential SE data preserves unique temporal properties, which cannot be easily analyzed without much programming effort. In order to facilitate the analysis of sequential SE data, we design a sequential pattern query language (SPQL), which specifies the temporal properties based on regular expressions, and is enhanced with variables and statements to store and manipulate matching states. We also propose a query engine to effectively process the SPQL queries. We have applied our approach to analyze two types of SE data, namely bug report history and source code change history. We experiment with 181,213 Eclipse bug reports and 323,989 code revisions of Android. SPQL enables us to explore interesting temporal properties underneath these sequential data with a few lines of query code and low matching overhead. The analysis results can help better under- stand a software process and identify process violations.
Publisher's Version Article Search

proc time: 0.41