ISSTA 2016
25th International Symposium on Software Testing and Analysis (ISSTA)
Powered by
Conference Publishing Consulting

25th International Symposium on Software Testing and Analysis (ISSTA), July 18–20, 2016, Saarbrücken, Germany

ISSTA 2016 – Proceedings

Contents - Abstracts - Authors


Title Page

Message from the Chairs
It is with great pleasure that we welcome you to Saarbrücken, Germany for the 25th International Symposium on Software Testing and Analysis (ISSTA) 2016. ISSTA 2016 is being held from July 17 – 21 2016, with several pre- and post-conference events.

ISSTA 2016 Conference Organization
Conference Organization

Message from Workshop Chairs
It is our great pleasure to introduce the four workshops accompanying the 2016 International Symposium on Software Testing and Analysis. These work- shops present a broad and vibrant picture of the testing and analysis research and industrial communities.

Summary of the Workshop DECAF 2014
Designing a code-analysis framework is not an easy task. Design decisions that framework builders took more than a decade ago are still affecting the way many researchers implement their static analyses today. However, modern software systems are often heterogeneous and gigantic in size, employing many programming languages and APIs. Further, modern program analyses tend to be user-driven and interactive, as opposed to traditional program analyses that were more targeted towards whole-program optimizations. As analysis framework authors, we have recently been discussing the various strengths and weaknesses of our systems regarding the needs of modern analyses and analyzed software systems. One idea expressed was perhaps to start over with a new analysis framework that could incorporate into its design all the lessons we have learnt from current frameworks.

Sponsors and Supporters
Sponsors and Supporters

Research Papers

The Web

DEKANT: A Static Analysis Tool That Learns to Detect Web Application Vulnerabilities
Ibéria Medeiros, Nuno Neves, and Miguel Correia
(University of Lisbon, Portugal; INESC-ID, Portugal; LaSIGE, Portugal)
The state of web security remains troubling as web applications continue to be favorite targets of hackers. Static analysis tools are important mechanisms for programmers to deal with this problem as they search for vulnerabilities automatically in the application source code, allowing programmers to remove them. However, developing these tools requires explicitly coding knowledge about how to discover each kind of vulnerability. This paper presents a new approach in which static analysis tools learn to detect vulnerabilities automatically using machine learning. The approach uses a sequence model to learn to characterize vulnerabilities based on a set of annotated source code slices. This model takes into consideration the order in which the code elements appear and are executed in the slices. The model created can then be used as a static analysis tool to discover and identify vulnerabilities in source code. The approach was implemented in the DEKANT tool and evaluated experimentally with a set of open source PHP applications and WordPress plugins, finding 16 zero-day vulnerabilities.

Publisher's Version Article Search
Automated and Effective Testing of Web Services for XML Injection Attacks
Sadeeq Jan, Cu D. Nguyen, and Lionel C. Briand
(University of Luxembourg, Luxembourg)
XML is extensively used in web services for integration and data exchange. Its popularity and wide adoption make it an attractive target for attackers and a number of XML-based attack types have been reported recently. This raises the need for cost-effective, automated testing of web services to detect XML-related vulnerabilities, which is the focus of this paper. We discuss a taxonomy of the types of XML injection attacks and use it to derive four different ways to mutate XML messages, turning them into attacks (tests) automatically. Further, we consider domain constraints and attack grammars, and use a constraint solver to generate XML messages that are both malicious and valid, thus making it more difficult for any protection mechanism to recognise them. As a result, such messages have a better chance to detect vulnerabilities. Our evaluation on an industrial case study has shown that a large proportion (78.86%) of the attacks generated using our approach could circumvent the first layer of security protection, an XML gateway (firewall), a result that is much better than what a state-of-the-art tool based on fuzz testing could achieve.

Publisher's Version Article Search

Static Analysis

Binary Code Is Not Easy
Xiaozhu Meng and Barton P. Miller
(University of Wisconsin-Madison, USA)
Binary code analysis is an enabling technique for many applications. Modern compilers and run-time libraries have introduced significant complexities to binary code, which negatively affect the capabilities of binary analysis tool kits to analyze binary code, and may cause tools to report inaccurate information about binary code. Analysts may hence be confused and applications based on these tool kits may have degrading quality. We examine the problem of constructing control flow graphs from binary code and labeling the graphs with accurate function boundary annotations. We identified several challenging code constructs that represent hard-to-analyze aspects of binary code, and show code examples for each code construct. As part of this discussion, we present new code parsing algorithms in our open source Dyninst tool kit that support these constructs, including a new model for describing jump tables that improves our ability to precisely determine the control flow targets, a new interprocedural analysis to determine when a function is non-returning, and techniques for handling tail calls. We evaluated how various tool kits fare when handling these code constructs with real software as well as test binaries patterned after each challenging code construct we found in real software.

Publisher's Version Article Search
Specification of Concretization and Symbolization Policies in Symbolic Execution
Robin David, Sébastien Bardin, Josselin Feist, Laurent Mounier, Marie-Laure Potet, Thanh Dinh Ta, and Jean-Yves Marion
(CEA LIST, France; VERIMAG, France; University of Lorraine, France; CNRS, France; LORIA, France)
Symbolic Execution (SE) is a popular and profitable approach to automatic code-based software testing. Concretization and symbolization (C/S) is a crucial part of modern SE tools, since it directly impacts the trade-offs between correctness, completeness and efficiency of the approach. Yet, C/S policies have been barely studied. We intend to remedy to this situation and to establish C/S policies on a firm ground. To this end, we propose a clear separation of concerns between C/S specification on one side, through the new rule-based description language CSml, and the algorithmic core of SE on the other side, revisited to take C/S policies into account. This view is implemented on top of an existing SE tool, demonstrating the feasibility and the benefits of the method. This work paves the way for more flexible SE tools with well-documented and reusable C/S policies, as well as for a systematic study of C/S policies.

Publisher's Version Article Search
EagerMerge: An Optimistic Technique for Efficient Points-To Analysis
Sudhir Samrit and Rupesh Nasre
(IIT Madras, India)
We present an information-merging technique for efficient computation of points-to information for C programs. Invalid use of pointers can lead to hard-to-find bugs and may expose security vulnerabilities. Thus, analyzing them is critical for software analysis as well as optimization. Pointer analysis is a key step during compilation, and the computed points-to information is useful for client analyses from varied domains such as bug finding, data-flow analysis, identifying security vulnerabilities, and parallelization, to name a few. Former research on pointer analysis has indicated that the main bottleneck towards scalability is large propagation of points-to information in the constraint graph. To reduce the propagation cost, we present a technique based on temporal similarity of points-to sets. The method tracks pointers whose dynamically changing points-to information remains equal for a while. Based on the optimism gained by observing the points-to sets over time, the analysis decides to merge the corresponding nodes. Using the notion of merge and split, we build a family of points-to analyses, and compare their relative precisions in the context of existing analyses. We illustrate the effectiveness of our approach using a suite of sixteen SPEC 2000 benchmarks and three large open-source programs, and show that the technique improves the analysis time over BDD and bitmap based Hybrid Cycle Detection, well-known Andersen's analysis, and Deep Propagation, affecting minimal precision (precision is 96.4% on an average). Specifically, it is faster than Deep Propagation by 45%.

Publisher's Version Article Search aec-badge-issta
IPA: Improving Predictive Analysis with Pointer Analysis
Peng Liu, Omer Tripp, and Xiangyu Zhang
(Purdue University, USA; IBM Research, USA; Google, USA)
Predictive analysis, recently proposed for race detection, guarantees to report no false positives and achieves good coverage. Predictive analysis starts with the trace of an execution and mutates the schedule order of the trace to ``predict'' the executions that expose the hidden races. Ideally, the predictive analysis should allow the schedule mutation to change the memory location accessed by the field access, which helps meet the ``same memory location'' requirement of the data race. However, existing predictive approaches, including causality-preserving approaches and symbolic approaches, lack this capability. We propose the first predictive analysis that allows changing the accessed locations. The key challenge is that modeling of the field accesses relies on the location, which may however become unknown due to schedule mutation. We solve this challenge through a novel combination of predictive analysis and pointer analysis. Furthermore, unlike previous work, our analysis applies a hybrid encoding scheme to increase practical applicability. We have implemented our approach as a prototype IPA, and compared it against the most recent predictive analysis over a set of popular Java applications. Our experimental evaluation confirms the effectiveness of our approach:IPA is able to find close to 2X as many races as previous approaches.

Publisher's Version Article Search

Test Generation

Generating Focused Random Tests using Directed Swarm Testing
Mohammad Amin Alipour, Alex Groce, Rahul Gopinath, and Arpit Christi
(Oregon State University, USA)
Random testing can be a powerful and scalable method for finding faults in software. However, sophisticated random testers usually test a whole program, not individual components. Writing random testers for individual components of complex programs may require unreasonable effort. In this paper we present a novel method, directed swarm testing, that uses statistics and a variation of random testing to produce random tests that focus on only part of a program, increasing the frequency with which tests cover the targeted code. We demonstrate the effectiveness of this technique using real-world programs and test systems (the YAFFS2 file system, GCC, and Mozilla's SpiderMonkey JavaScript engine), and discuss various strategies for directed swarm testing. The best strategies can improve coverage frequency for targeted code by a factor ranging from 1.1-4.5x on average, and from nearly 3x to nearly 9x in the best case. For YAFFS2, directed swarm testing never decreased coverage, and for GCC and SpiderMonkey coverage increased for over 99% and 73% of targets, respectively, using the best strategies. Directed swarm testing improves detection rates for real SpiderMonkey faults, when the code in the introducing commit is targeted. This lightweight technique is applicable to existing industrial-strength random testers.

Publisher's Version Article Search
Monkey See, Monkey Do: Effective Generation of GUI Tests with Inferred Macro Events
Markus Ermuth and Michael Pradel
(TU Darmstadt, Germany)
Automated testing is an important part of validating the behavior of software with complex graphical user interfaces, such as web, mobile, and desktop applications. Despite recent advances in UI-level test generation, existing approaches often fail to create complex sequences of events that represent realistic user interactions. As a result, these approaches cannot reach particular parts of the application under test, which then remain untested. This paper presents a UI-level test generation approach that exploits execution traces of human users to automatically create complex sequences of events that go beyond the recorded traces. The key idea is to infer so-called macro events, i.e., sequences of low-level UI events that correspond to a single logical step of interaction, such as choosing an item of a drop-down menu or filling and submitting a form. The approach builds upon and adapts well-known data mining techniques, in particular frequent subsequence mining and inference of finite state machines. We implement the approach for client-side web applications and apply it to four real-world applications. Our results show that macro-based test generation reaches more pages, exercises more usage scenarios, and covers more code within a fixed testing budget than a purely random test generator.

Publisher's Version Article Search
Sapienz: Multi-objective Automated Testing for Android Applications
Ke Mao, Mark Harman, and Yue Jia
(University College London, UK)
We introduce Sapienz, an approach to Android testing that uses multi-objective search-based testing to automatically explore and optimise test sequences, minimising length, while simultaneously maximising coverage and fault revelation. Sapienz combines random fuzzing, systematic and search-based exploration, exploiting seeding and multi-level instrumentation. Sapienz significantly outperforms (with large effect size) both the state-of-the-art technique Dynodroid and the widely-used tool, Android Monkey, in 7/10 experiments for coverage, 7/10 for fault detection and 10/10 for fault-revealing sequence length. When applied to the top 1,000 Google Play apps, Sapienz found 558 unique, previously unknown crashes. So far we have managed to make contact with the developers of 27 crashing apps. Of these, 14 have confirmed that the crashes are caused by real faults. Of those 14, six already have developer-confirmed fixes.

Publisher's Version Article Search Video Info
FSX: Fine-Grained Incremental Unit Test Generation for C/C++ Programs
Hiroaki Yoshida, Susumu Tokumoto, Mukul R. Prasad, Indradeep Ghosh, and Tadahiro Uehara
(Fujitsu Labs, USA; Fujitsu Labs, Japan)
Automated unit test generation bears the promise of significantly reducing test cost and hence improving software quality. However, the maintenance cost of the automatically generated tests presents a significant barrier to adoption of this technology. To address this challenge, we propose a novel technique for automated and fine-grained incremental generation of unit tests through minimal augmentation of an existing test suite. The technique uses iterative, incremental refinement of test-drivers and symbolic execution, guided by a diagnostics engine. The diagnostics engine works off a novel precise and efficient byte-level dynamic dependence analysis built using Reduced Ordered Binary Decision Diagrams (ROBDDs). We present a tool FSX implementing this technique and evaluate it under two practical use-cases of incremental unit test generation, on five revisions of the open-source software iPerf, as well as on 3 large subjects, comprising more than 60 thousand lines of code, from in-house commercial network products. The evaluation shows that FSX can generate high-quality unit tests on large industrial software while minimizing the maintenance cost of the overall test-suite.

Publisher's Version Article Search

Testing Processes

CSNIPPEX: Automated Synthesis of Compilable Code Snippets from Q&A Sites
Valerio Terragni, Yepang Liu, and Shing-Chi Cheung
(Hong Kong University of Science and Technology, China)
Popular Q&A sites like StackOverflow have collected numerous code snippets. However, many of them do not have complete type information, making them uncompilable and inapplicable to various software engineering tasks. This paper analyzes this problem, and proposes a technique CSNIPPEX to automatically convert code snippets into compilable Java source code files by resolving external dependencies, generating import declarations, and fixing syntactic errors. We implemented CSNIPPEX as a plug-in for Eclipse and evaluated it with 242,175 StackOverflow posts that contain code snippets. CSNIPPEX successfully synthesized compilable Java files for 40,410 of them. It was also able to effectively recover import declarations for each post with a precision of 91.04% in a couple of seconds.

Publisher's Version Article Search aec-badge-issta
Automatic Test Case Generation: What If Test Code Quality Matters?
Fabio Palomba, Annibale Panichella, Andy Zaidman, Rocco Oliveto, and Andrea De Lucia
(University of Salerno, Italy; Delft University of Technology, Netherlands; University of Molise, Italy)
Test case generation tools that optimize code coverage have been extensively investigated. Recently, researchers have suggested to add other non-coverage criteria, such as memory consumption or readability, to increase the practical usefulness of generated tests. In this paper, we observe that test code quality metrics, and test cohesion and coupling in particular, are valuable candidates as additional criteria. Indeed, tests with low cohesion and/or high coupling have been shown to have a negative impact on future maintenance activities. In an exploratory investigation we show that most generated tests are indeed affected by poor test code quality. For this reason, we incorporate cohesion and coupling metrics into the main loop of search-based algorithm for test case generation. Through an empirical study we show that our approach is not only able to generate tests that are more cohesive and less coupled, but can (i) increase branch coverage up to 10% when enough time is given to the search and (ii) result in statistically shorter tests.

Publisher's Version Article Search
Analyzing Test Completeness for Dynamic Languages
Christoffer Quist Adamsen, Gianluca Mezzetti, and Anders Møller
(Aarhus University, Denmark)
In dynamically typed programming languages, type errors can occur at runtime. Executing the test suites that often accompany programs may provide some confidence about absence of such errors, but generally without any guarantee. We present a program analysis that can check whether a test suite has sufficient coverage to prove a given type-related property, which is particularly challenging for program code with overloading and value dependent types. The analysis achieves a synergy between scalable static analysis and dynamic analysis that goes beyond what can be accomplished by the static analysis alone. Additionally, the analysis provides a new coverage adequacy metric for the completeness of a test suite regarding a family of type-related properties. Based on an implementation for Dart, we demonstrate how such a hybrid static/dynamic program analysis can be used for measuring the quality of a test suite with respect to showing absence of type errors and inferring sound call graph information, specifically for program code that is difficult to handle by traditional static analysis techniques.

Publisher's Version Article Search Info aec-badge-issta
Unveiling Anomalies and Their Impact on Software Quality in Model-Based Automotive Software Revisions with Software Metrics and Domain Experts
Jan Schroeder, Christian Berger, Miroslaw Staron, Thomas Herpel, and Alessia Knauss
(University of Gothenburg, Sweden; Automotive Safety Technologies, Germany; Chalmers University of Technology, Sweden)
The validation of simulation models (e.g., of electronic control units for vehicles) in industry is becoming increasingly challenging due to their growing complexity. To systematically assess the quality of such models, software metrics seem to be promising. In this paper we explore the use of software metrics and outlier analysis as a means to assess the quality of model-based software. More specifically, we investigate how results from regression analysis applied to measurement data received from size and complexity metrics can be mapped to software quality. Using the moving averages approach, models were fit to data received from over 65,000 software revisions for 71 simulation models that represent different electronic control units of real premium vehicles. Consecutive investigations using studentized deleted residuals and Cook’s Distance revealed outliers among the measurements. From these outliers we identified a subset, which provides meaningful information (anomalies) by comparing outlier scores with expert opinions. Eight engineers were interviewed separately for outlier impact on software quality. Findings were validated in consecutive workshops. The results show correlations between outliers and their impact on four of the considered quality characteristics. They also demonstrate the applicability of this approach in industry.

Publisher's Version Article Search

Debugging and Repair

Practitioners' Expectations on Automated Fault Localization
Pavneet Singh Kochhar, Xin Xia, David Lo, and Shanping Li
(Singapore Management University, Singapore; Zhejiang University, China)
Software engineering practitioners often spend significant amount of time and effort to debug. To help practitioners perform this crucial task, hundreds of papers have proposed various fault localization techniques. Fault localization helps practitioners to find the location of a defect given its symptoms (e.g., program failures). These localization techniques have pinpointed the locations of bugs of various systems of diverse sizes, with varying degrees of success, and for various usage scenarios. Unfortunately, it is unclear whether practitioners appreciate this line of research. To fill this gap, we performed an empirical study by surveying 386 practitioners from more than 30 countries across 5 continents about their expectations of research in fault localization. In particular, we investigated a number of factors that impact practitioners' willingness to adopt a fault localization technique. We then compared what practitioners need and the current state-of-research by performing a literature review of papers on fault localization techniques published in ICSE, FSE, ESEC-FSE, ISSTA, TSE, and TOSEM in the last 5 years (2011-2015). From this comparison, we highlight the directions where researchers need to put effort to develop fault localization techniques that matter to practitioners.

Publisher's Version Article Search
A Learning-to-Rank Based Fault Localization Approach using Likely Invariants
Tien-Duy B. Le, David Lo, Claire Le Goues, and Lars Grunske
(Singapore Management University, Singapore; Carnegie Mellon University, USA; Humboldt University of Berlin, Germany)
Debugging is a costly process that consumes much of developer time and energy. To help reduce debugging effort, many studies have proposed various fault localization approaches. These approaches take as input a set of test cases (some failing, some passing) and produce a ranked list of program elements that are likely to be the root cause of the failures (i.e., failing test cases). In this work, we propose Savant, a new fault localization approach that employs a learning-to-rank strategy, using likely invariant diffs and suspiciousness scores as features, to rank methods based on their likelihood to be a root cause of a failure. Savant has four steps: method clustering & test case selection, invariant mining, feature extraction, and method ranking. At the end of these four steps, Savant produces a short ranked list of potentially buggy methods. We have evaluated Savant on 357 real-life bugs from 5 programs from the Defects4J benchmark. Out of these bugs, averaging over 100 repeated trials with different seeds to randomly break ties, we find that on average Savant can identify correct buggy methods for 63.03, 101.72, and 122 bugs at top 1, 3, and 5 positions in the ranked lists that Savant produces. We have compared Savant against several state-of-the-art fault localization baselines that work on program spectra. We show that Savant can successfully locate 57.73%, 56.69%, and 43.13% more bugs at top 1, top 3, and top 5 positions than the best performing baseline, respectively.

Publisher's Version Article Search
Optimal Sanitization Synthesis for Web Application Vulnerability Repair
Fang Yu, Ching-Yuan Shueh, Chun-Han Lin, Yu-Fang Chen, Bow-Yaw Wang, and Tevfik Bultan
(National Chengchi University, Taiwan; Academia Sinica, Taiwan; University of California at Santa Barbara, USA)
We present a code- and input-sensitive sanitization synthesis approach for repairing string vulnerabilities that are common in web applications. The synthesized sanitization patch modifies the user input in an optimal way while guaranteeing that the repaired web application is not vulnerable. Given a web application, an input pattern and an attack pattern, we use automata-based static string analysis techniques to compute a sanitization signature that characterizes safe input values that obey the given input pattern and are safe with respect to the given attack pattern. Using the sanitization signature, we synthesize an optimal sanitization patch that converts malicious user inputs to benign ones with minimal editing. When the generated patch is added to the web application, it is guaranteed that the repaired web application is no longer vulnerable. We present refinements to previous sanitization synthesis algorithms that reduce the runtime sanitization cost significantly. We evaluate our approach on open source web applications using common input and attack patterns, demonstrating the effectiveness of our approach.

Publisher's Version Article Search
ARROW: Automated Repair of Races on Client-Side Web Pages
Weihang Wang, Yunhui Zheng, Peng Liu, Lei Xu, Xiangyu Zhang, and Patrick Eugster
(Purdue University, USA; IBM Research, USA; Nanjing University, China)
Modern browsers have a highly concurrent page rendering process in order to be more responsive. However, such a concurrent execution model leads to various race issues. In this paper, we present ARROW, a static technique that can automatically, safely, and cost effectively patch certain race issues on client side pages. It works by statically modeling a web page as a causal graph denoting happens-before relations between page elements, according to the rendering process in browsers. Races are detected by identifying inconsistencies between the graph and the dependence relations intended by the developer. Detected races are fixed by leveraging a constraint solver to add a set of edges with the minimum cost to the causal graph so that it is consistent with the intended dependences. The input page is then transformed to respect the repair edges. ARROW has fixed 151 races from 20 real world commercial web sites.

Publisher's Version Article Search


Automatic Generation of Oracles for Exceptional Behaviors
Alberto Goffi, Alessandra Gorla, Michael D. Ernst, and Mauro Pezzè
(University of Lugano, Switzerland; IMDEA Software Institute, Spain; University of Washington, USA)
Test suites should test exceptional behavior to detect faults in error-handling code. However, manually-written test suites tend to neglect exceptional behavior. Automatically-generated test suites, on the other hand, lack test oracles that verify whether runtime exceptions are the expected behavior of the code under test.
This paper proposes a technique that automatically creates test oracles for exceptional behaviors from Javadoc comments. The technique uses a combination of natural language processing and run-time instrumentation. Our implementation, Toradocu, can be combined with a test input generation tool. Our experimental evaluation shows that Toradocu improves the fault-finding effectiveness of EvoSuite and Randoop test suites by 8% and 16% respectively, and reduces EvoSuite’s false positives by 33%.

Publisher's Version Article Search aec-badge-issta
Verdict Machinery: On the Need to Automatically Make Sense of Test Results
Mikael Fagerström, Emre Emir Ismail, Grischa Liebel, Rohit Guliani, Fredrik Larsson, Karin Nordling, Eric Knauss, and Patrizio Pelliccione
(Chalmers University of Technology, Sweden; University of Gothenburg, Sweden; Ericsson, Sweden)
Along with technological developments and increasing competition there is a major incentive for companies to produce and market high quality products before their competitors. In order to conquer a bigger portion of the market share, companies have to ensure the quality of the product in a shorter time frame. To accomplish this task companies try to automate their test processes as much as possible. It is critical to investigate and understand the problems that occur during different stages of test automation processes. In this paper we report on a case study on automatic analysis of non-functional test results. We discuss challenges in the face of continuous integration and deployment and provide improvement suggestions based on interviews at a large company in Sweden. The key contributions of this work are filling the knowledge gap in research about performance regression test analysis automation and providing warning signs and a road map for the industry.

Publisher's Version Article Search
Testing Stochastic Software using Pseudo-Oracles
Matthew Patrick, Andrew P. Craig, Nik J. Cunniffe, Matthew Parry, and Christopher A. Gilligan
(University of Cambridge, UK; University of Otago, New Zealand)
Stochastic models can be difficult to test due to their complexity and randomness, yet their predictions are often used to make important decisions, so they need to be correct. We introduce a new search-based technique for testing implementations of stochastic models by maximising the differences between the implementation and a pseudo-oracle. Our technique reduces testing effort and enables discrepancies to be found that might otherwise be overlooked. We show the technique can identify differences challenging for humans to observe, and use it to help a new user understand implementation differences in a real model of a citrus disease (Huanglongbing) used to inform policy and research.

Publisher's Version Article Search
Test Oracle Assessment and Improvement
Gunel Jahangirova, David Clark, Mark Harman, and Paolo Tonella
(Fondazione Bruno Kessler, Italy; University College London, UK)
We introduce a technique for assessing and improving test oracles by reducing the incidence of both false positives and false negatives. We prove that our approach can always result in an increase in the mutual information between the actual and perfect oracles. Our technique combines test case generation to reveal false positives and mutation testing to reveal false negatives. We applied the decision support tool that implements our oracle improvement technique to five real-world subjects. The experimental results show that the fault detection rate of the oracles after improvement increases, on average, by 48.6% (86% over the implicit oracle). Three actual, exposed faults in the studied systems were subsequently confirmed and fixed by the developers.

Publisher's Version Article Search

Program Understanding

DSI: An Evidence-Based Approach to Identify Dynamic Data Structures in C Programs
David H. White, Thomas Rupprecht, and Gerald Lüttgen
(University of Bamberg, Germany)
Comprehension of C programs containing pointer-based dynamic data structures can be a challenging task. To tackle this challenge we present Data Structure Investigator (DSI), a new dynamic analysis for automated data structure identification that targets C source code. Our technique first applies a novel abstraction on the evolving memory structures observed at runtime to discover data structure building blocks. By analyzing the interconnections between building blocks we are then able to identify, e.g., binary trees, doubly-linked lists, skip lists, and relationships between these such as nesting. Since the true shape of a data structure may be temporarily obscured by manipulation operations, we ensure robustness by first discovering and then reinforcing evidence for data structure observations. We show the utility of our DSI prototype implementation by applying it to both synthetic and real world examples. DSI outputs summarizations of the identified data structures, which will benefit software developers when maintaining (legacy) code and inform other applications such as memory visualization and program verification.

Publisher's Version Article Search
Documenting Database Usages and Schema Constraints in Database-Centric Applications
Mario Linares-Vásquez, Boyang Li, Christopher Vendome, and Denys Poshyvanyk
(College of William and Mary, USA; Universidad de los Andes, Colombia)
Database-centric applications (DCAs) usually rely on database operations over a large number of tables and attributes. Understanding how database tables and attributes are used to implement features in DCAs along with the constraints related to these usages is an important component of any DCA’s maintenance. However, manually documenting database related operations and their asynchronously evolving constraints in constantly changing source code is a hard and time-consuming problem. In this paper, we present a novel approach, namely DBScribe, aimed at automatically generating always up-to-date natural language descriptions of database operations and schema constraints in source code methods. DBScribe statically analyzes the code and database schema to detect database usages and then prop- agates these usages and schema constraints through the call-chains implementing database-related features. Finally, each method in these call-chains is automatically documented based on the underlying database usages and constraints.
We evaluated DBScribe in a study with 52 participants analyzing generated documentation for database-related methods in five open-source DCAs. Additionally, we evaluated the descriptions generated by DBScribe on two commercial DCAs involving original developers. The results for the studies involving open-source and commercial DCAs demonstrate that generated descriptions are accurate and useful while understanding database usages and constraints, in particular during maintenance tasks.

Publisher's Version Article Search Info
Exploring Regular Expression Usage and Context in Python
Carl Chapman and Kathryn T. Stolee
(Iowa State University, USA; North Carolina State University, USA)
Due to the popularity and pervasive use of regular expressions, researchers have created tools to support their creation, validation, and use. However, little is known about the context in which regular expressions are used, the features that are most common, and how behaviorally similar regular expressions are to one another.
In this paper, we explore the context in which regular expressions are used through a combination of developer surveys and repository analysis. We survey 18 professional developers about their regular expression usage and pain points. Then, we analyze nearly 4,000 open source Python projects from GitHub and extract nearly 14,000 unique regular expression patterns. We map the most common features used in regular expressions to those features supported by four major regex research efforts from industry and academia: brics, Hampi, RE2, and Rex. Using similarity analysis of regular expressions across projects, we identify six common behavioral clusters that describe how regular expressions are often used in practice. This is the first rigorous examination of regex usage and it provides empirical evidence to support design decisions by regex tool builders. It also points to areas of needed future work, such as refactoring regular expressions to increase regex understandability and context-specific tool support for common regex usages.

Publisher's Version Article Search
Toward Understanding Compiler Bugs in GCC and LLVM
Chengnian Sun, Vu Le, Qirun Zhang, and Zhendong Su
(University of California at Davis, USA)
Compilers are critical, widely-used complex software. Bugs in them have significant impact, and can cause serious damage when they silently miscompile a safety-critical application. An in-depth understanding of compiler bugs can help detect and fix them. To this end, we conduct the first empirical study on the characteristics of the bugs in two main-stream compilers, GCC and LLVM. Our study is significant in scale — it exhaustively examines about 50K bugs and 30K bug fix revisions over more than a decade’s span. This paper details our systematic study. Summary findings include: (1) In both compilers, C++ is the most buggy component, accounting for around 20% of the total bugs and twice as many as the second most buggy component; (2) the bug revealing test cases are typically small, with 80% having fewer than 45 lines of code; (3) most of the bug fixes touch a single source file with small modifications (43 lines for GCC and 38 for LLVM on average); (4) the average lifetime of GCC bugs is 200 days, and 111 days for LLVM; and (5) high priority tends to be assigned to optimizer bugs, most notably 30% of the bugs in GCC’s inter-procedural analysis component are labeled P1 (the highest priority). This study deepens our understanding of compiler bugs. For application developers, it shows that even mature production compilers still have many bugs, which may affect development. For researchers and compiler developers, it sheds light on interesting characteristics of compiler bugs, and highlights challenges and opportunities to more effectively test and debug compilers.

Publisher's Version Article Search aec-badge-issta


Semantic Modelling of Android Malware for Effective Malware Comprehension, Detection, and Classification
Guozhu Meng, Yinxing Xue, Zhengzi Xu, Yang Liu, Jie Zhang, and Annamalai Narayanan
(Nanyang Technological University, Singapore)
Malware has posed a major threat to the Android ecosystem. Existing malware detection tools mainly rely on signature- or feature- based approaches, failing to provide detailed information beyond the mere detection. In this work, we propose a precise semantic model of Android malware based on Deterministic Symbolic Automaton (DSA) for the purpose of malware comprehension, detection and classification. It shows that DSA can capture the common malicious behaviors of a malware family, as well as the malware variants. Based on DSA, we develop an automatic analysis framework, named SMART, which learns DSA by detecting and summarizing semantic clones from malware families, and then extracts semantic features from the learned DSA to classify malware according to the attack patterns. We conduct the experiments in both malware benchmark and 223,170 real-world apps. The results show that SMART builds meaningful semantic models and outperforms both state-of-the-art approaches and anti-virus tools in malware detection. SMART identifies 4583 new malware in real-world apps that are missed by most anti-virus tools. The classification step further identifies new malware variants and unknown families.

Publisher's Version Article Search
DroidRA: Taming Reflection to Support Whole-Program Analysis of Android Apps
Li Li, Tegawendé F. Bissyandé, Damien Octeau, and Jacques Klein
(University of Luxembourg, Luxembourg; Pennsylvania State University, USA)
Android developers heavily use reflection in their apps for legitimate reasons, but also significantly for hiding malicious actions. Unfortunately, current state-of-the-art static analysis tools for Android are challenged by the presence of reflective calls which they usually ignore. Thus, the results of their security analysis, e.g., for private data leaks, are inconsistent given the measures taken by malware writers to elude static detection. We propose the DroidRA instrumentation-based approach to address this issue in a non-invasive way. With DroidRA, we reduce the resolution of reflective calls to a composite constant propagation problem. We leverage the COAL solver to infer the values of reflection targets and app, and we eventually instrument this app to include the corresponding traditional Java call for each reflective call. Our approach allows to boost an app so that it can be immediately analyzable, including by such static analyzers that were not reflection-aware. We evaluate DroidRA on benchmark apps as well as on real-world apps, and demonstrate that it can allow state-of-the-art tools to provide more sound and complete analysis results.

Publisher's Version Article Search Info

Mutation Testing

Mutation-Aware Fault Prediction
David Bowes, Tracy Hall, Mark Harman, Yue Jia, Federica Sarro, and Fan Wu
(University of Hertfordshire, UK; Brunel University London, UK; University College London, UK)
We introduce mutation-aware fault prediction, which leverages additional guidance from metrics constructed in terms of mutants and the test cases that cover and detect them. We report the results of 12 sets of experiments, applying 4 different predictive modelling techniques to 3 large real-world systems (both open and closed source). The results show that our proposal can significantly (p ≤ 0.05) improve fault prediction performance. Moreover, mutation-based metrics lie in the top 5% most frequently relied upon fault predictors in 10 of the 12 sets of experiments, and provide the majority of the top ten fault predictors in 9 of the 12 sets of experiments.

Publisher's Version Article Search
Predictive Mutation Testing
Jie Zhang, Ziyi Wang, Lingming Zhang, Dan Hao, Lei Zang, Shiyang Cheng, and Lu Zhang
(Peking University, China; University of Texas at Dallas, USA)
Mutation testing is a powerful methodology for evaluating test suite quality. In mutation testing, a large number of mutants are generated and executed against the test suite to check the ratio of killed mutants. Therefore, mutation testing is widely believed to be a computationally expensive technique. To alleviate the efficiency concern of mutation testing, in this paper, we propose predictive mutation testing (PMT), the first approach to predicting mutation testing results without mutant execution. In particular, the proposed approach constructs a classification model based on a series of features related to mutants and tests, and uses the classification model to predict whether a mutant is killed or survived without executing it. PMT has been evaluated on 163 real-world projects under two application scenarios (i.e., cross-version and cross-project). The experimental results demonstrate that PMT improves the efficiency of mutation testing by up to 151.4X while incurring only a small accuracy loss when predicting mutant execution results, indicating a good tradeoff between efficiency and effectiveness of mutation testing.

Publisher's Version Article Search
Threats to the Validity of Mutation-Based Test Assessment
Mike Papadakis, Christopher Henard, Mark Harman, Yue Jia, and Yves Le Traon
(University of Luxembourg, Luxembourg; University College London, UK)
Much research on software testing and test techniques relies on experimental studies based on mutation testing. In this paper we reveal that such studies are vulnerable to a potential threat to validity, leading to possible Type I errors; incorrectly rejecting the Null Hypothesis. Our findings indicate that Type I errors occur, for arbitrary experiments that fail to take countermeasures, approximately 62% of the time. Clearly, a Type I error would potentially compromise any scientific conclusion. We show that the problem derives from such studies’ combined use of both subsuming and subsumed mutants. We collected articles published in the last two years at three leading software engineering conferences. Of those that use mutation-based test assessment, we found that 68% are vulnerable to this threat to validity.

Publisher's Version Article Search


Efficient Race Detection in the Presence of Programmatic Event Loops
Anirudh Santhiar, Shalini Kaleeswaran, and Aditya Kanade
(Indian Institute of Science, India)
An event loop is the basic scheduling mechanism for programs that respond to asynchronous events. In some frameworks, only the runtime can spin event loops, while in others, these can also be spun programmatically by event handlers. The latter provides more flexibility and helps improve responsiveness in cases where an event handler must wait for some input, for example, from the user or network. It can do so while spinning an event loop.
In this paper, we consider the scheduling scheme of programmatic event loops. Programs which follow this scheme are prone to interference between a handler that is spinning an event loop and another handler that runs inside the loop. We present a happens-before based race detection technique for such programs. We exploit the structure and semantics of executions of these programs to design a sparse representation of the happens-before relation. It relates only a few pairs of operations explicitly in such a way that the ordering between any pair of operations can be inferred from the sparse representation in constant time.
We have implemented our technique in an offline race detector for C/C++ programs, called SparseRacer. We discovered 13 new and harmful race conditions in 9 open-source applications using SparseRacer. So far, developers have confirmed 8 as valid bugs, and have fixed 3. These bugs arise from unintended interference due to programmatic event loops. Our sparse representation improved efficiency and gave an average speedup of 5x in race detection time.

Publisher's Version Article Search
Automatically Verifying and Reproducing Event-Based Races in Android Apps
Yongjian Hu, Iulian Neamtiu, and Arash Alavi
(University of California at Riverside, USA; New Jersey Institute of Technology, USA)
Concurrency has been a perpetual problem in Android apps, mainly due to event-based races. Several event-based race detectors have been proposed, but they produce false positives, cannot reproduce races, and cannot distinguish be- tween benign and harmful races. To address these issues, we introduce a race verification and reproduction approach named ERVA. Given a race report produced by a race detector, ERVA uses event dependency graphs, event flipping, and replay to verify the race and determine whether it is a false positive, or a true positive; for true positives, ERVA uses state comparison to distinguish benign races from harmful races. ERVA automatically produces an event schedule that can be used to deterministically reproduce the race, so developers can fix it. Experiments on 16 apps indicate that only 3% of the races reported by race detectors are harmful, and that ERVA can verify an app in 20 minutes on average.

Publisher's Version Article Search
SyncProf: Detecting, Localizing, and Optimizing Synchronization Bottlenecks
Tingting Yu and Michael Pradel
(University of Kentucky, USA; TU Darmstadt, Germany)
Writing concurrent programs is a challenge because developers must consider both functional correctness and performance requirements. Numerous program analyses and testing techniques have been proposed to detect functional faults, e.g., caused by incorrect synchronization. However, little work has been done to help developers address performance problems in concurrent programs, e.g., because of inefficient synchronization. This paper presents SyncProf, a concurrency-focused profiling approach that helps in detecting, localizing, and optimizing synchronization bottlenecks. In contrast to traditional profilers, SyncProf repeatedly executes a program with various inputs and summarizes the observed performance behavior. A key novelty is a graph-based representation of relations between critical sections, which is the basis for computing the performance impact of critical sections, for identifying the root cause of a bottleneck, and for suggesting optimization strategies to the developer. We evaluate SyncProf on 19 versions of eight C/C++ projects with both known and previously unknown synchronization bottlenecks. The results show that SyncProf effectively localizes the root causes of these bottlenecks with higher precision than a state of the art lock contention profiler and that it suggests valuable strategies to avoid the bottlenecks.

Publisher's Version Article Search


Zero-Overhead Profiling via EM Emanations
Robert Callan, Farnaz Behrang, Alenka Zajic, Milos Prvulovic, and Alessandro Orso
(Georgia Tech, USA)
This paper presents an approach for zero-overhead profiling (ZOP). ZOP accomplishes accurate program profiling with no modification to the program or system during profiling and no dedicated hardware features. To do so, ZOP records the electromagnetic (EM) emanations generated by computing systems during program execution and analyzes the recorded emanations to track a program’s execution path and generate profiling information. Our approach consists of two main phases. In the training phase, ZOP instruments the program and runs it against a set of inputs to collect path timing information while simultaneously collecting waveforms for the EM emanations generated by the program. In the profiling phase, ZOP runs the original (i.e., uninstrumented and unmodified) program against inputs whose executions need to be profiled, records the waveforms produced by the program, and matches these waveforms with those collected during training to predict which parts of the code were exercised by the inputs and how often. We evaluated an implementation of ZOP on several benchmarks and our results show that ZOP can predict path profiling information for these benchmarks with greater than 94% accuracy on average.

Publisher's Version Article Search
Efficient Flow Profiling for Detecting Performance Bugs
Rashmi Mudduluru and Murali Krishna Ramanathan
(Indian Institute of Science, India)
Performance issues in large applications arise only in particular scenarios under heavy load conditions. It is therefore difficult to catch them during testing and they easily escape into production. This necessitates the design of a common and efficient instrumentation strategy that profiles the flow of objects during an execution. Designing such a strategy which enables profile generation precisely with low overhead is non-trivial due to the number of objects created, accessed and paths traversed by them in an execution. In this paper, we design and implement an efficient instrumentation technique that efficiently generates object flow profiles for Java programs, without requiring any modifications to the underlying virtual machine. We achieve this by applying Ball-Larus numbering on a specialized hybrid flow graph (hfg). The hfg path profiles that are collected during runtime are post-processed offline to derive the object flow profiles. We implemented the profiler and validated its efficacy by applying it on Java programs. The results demonstrate the scalability of our approach, which handles 0.2M to 0.55B object accesses with an average runtime overhead of 8x. We also demonstrate the effectiveness of the generated profiles by implementing a client analysis that consumes the profiles to detect performance bugs. The analysis detects 38 performance bugs which when refactored result in significant performance gains (up to 30%) in running times.

Publisher's Version Article Search aec-badge-issta
Energy-Aware Test-Suite Minimization for Android Apps
Reyhaneh Jabbarvand, Alireza Sadeghi, Hamid Bagheri, and Sam Malek
(University of California at Irvine, USA)
The rising popularity of mobile apps deployed on battery-constrained devices has motivated the need for effective energy-aware testing techniques. Energy testing is generally more labor intensive and expensive than functional testing, as tests need to be executed in the deployment environment and specialized equipment needs to be used to collect energy measurements. Currently, there is a dearth of automatic mobile testing techniques that consider energy as a program property of interest. This paper presents an energy-aware test-suite minimization approach to significantly reduce the number of tests needed to effectively test the energy properties of an Android app. It relies on an energy-aware coverage criterion that indicates the degree to which energy-greedy segments of a program are tested. We describe and evaluate two complementary algorithms for test-suite minimization. Experiments over test suites provided for real-world apps have corroborated our ability to reduce the test suite size by 84% on average, while maintaining the effectiveness of test suite in revealing the great majority of energy bugs.

Publisher's Version Article Search aec-badge-issta

Demonstration Papers

COSTOTest: A Tool for Building and Running Test Harness for Service-Based Component Models (Demo)
Pascal André, Jean-Marie Mottu, and Gerson Sunyé
(LINA, France; University of Nantes, France; Inria, France; Mines Nantes, France)
Early testing reduces the cost of detecting faults and improves the system reliability. In particular, testing component or service based systems during modeling frees the tests from implementation details, especially those related to the middleware. COSTOTest is a tool that helps the tester during the process of designing tests at the model level. It suggests the possibilities and the lacks when (s)he builds test cases. Building executable tests is achieved thanks to model transformations.

Publisher's Version Article Search Info
ASTOR: A Program Repair Library for Java (Demo)
Matias Martinez and Martin Monperrus
(University of Lugano, Switzerland; University of Lille, France; Inria, France)
During the last years, the software engineering research community has proposed approaches for automatically repairing software bugs. Unfortunately, many software artifacts born from this research are not available for repairing Java programs. To-reimplement those approaches from scratch is costly. To facilitate experimental replications and comparative evaluations, we present Astor, a publicly available program repair library that includes the implementation of three notable repair approaches (jGenProg, jKali and jMutRepair). We envision that the research community will use Astor for setting up comparative evaluations and explore the design space of automatic repair for Java. Astor offers researchers ways to implement new repair approaches or to modify existing ones. Astor repairs in total 33 real bugs from four large open source projects.

Publisher's Version Article Search Info
Jolinar: Analysing the Energy Footprint of Software Applications (Demo)
Adel Noureddine, Syed Islam, and Rabih Bashroush
(University of East London, UK)
Monitoring energy consumption of applications is crucial for energy optimisation and improvements in software systems. With the recent emphasis on energy efficiency, it is vital that software engineers have an understanding of the energy consumed by the code they write. In this paper, we present Jolinar, a tool that bridges the gap between energy measurements and accessibility to software engineers and even end-users. The tool builds on top of recent energy models to provide an accurate, light and easy-to-use interface for energy measurements. The target audience of Jolinar is both software engineers and non-technical end-users who want to monitor their applications' energy footprint. We show that end-users can use Jolinar's GUI to determine the energy consumed by the software they are using, and software engineers can use the tool to analyse energy consumption of systems to make energy-conscious decisions.

Publisher's Version Article Search
PIT: A Practical Mutation Testing Tool for Java (Demo)
Henry Coles, Thomas Laurent, Christopher Henard, Mike Papadakis, and Anthony Ventresque
(NCR, UK; Lero, Ireland; University College Dublin, Ireland; École Centrale de Nantes, France; University of Luxembourg, Luxembourg)
Mutation testing introduces artificial defects to measure the adequacy of testing. In case candidate tests can distinguish the behaviour of mutants from that of the original program, they are considered of good quality – otherwise developers need to design new tests. While, this method has been shown to be effective, industry-scale code challenges its applicability due to the sheer number of mutants and test executions it requires. In this paper we present PIT, a practical mutation testing tool for Java, applicable on real-world codebases. PIT is fast since it operates on bytecode and optimises mutant executions. It is also robust and well integrated with development tools, as it can be invoked through a command line interface, Ant or Maven. PIT is also open source and hence, publicly available at

Publisher's Version Article Search Info

proc time: 2.54