ISSTA 2013 – Proceedings

Message from the Chairs
It is our great pleasure to welcome you to Lugano for ISSTA 2013, the 22nd International Symposium on Software Testing and Analysis. ISSTA is the leading research conference in software testing and analysis. It brings together researchers and practitioners to exchange new ideas, problems, and experience on how to analyse and test software systems. The ISSTA 2013 program includes keynotes, a doctoral symposium, technical papers and workshops and has been co-ordinated to be also close to the TAROT summer school on testing in both time and location.

Technical Research

Contracts, Invariants, and Stability

Expressing and Checking Intended Changes via Software Change Contracts
Jooyong Yi, Dawei Qi, Shin Hwei Tan, and Abhik Roychoudhury

(National University of Singapore, Singapore)
Software errors often originate from incorrect changes, including incorrect program fixes, incorrect feature updates and so on. Capturing the intended program behavior explicitly via contracts is thus an attractive proposition. In our recent work, we had espoused the notion of "change contracts" to express the intended program behavior changes across program versions. Change contracts differ from program contracts in that they do not require the programmer to describe the intended behavior of program features which are unchanged across program versions. In this work, we present the formal semantics of our change contract language built on top of the Java Modeling Language (JML). Our change contract language can describe behavioral as well as structural changes. We evaluate the expressivity of the change contract language via a survey given to final year undergraduate students. The survey results enable us to understand the usability of our change contract language for purposes of writing contracts, comprehending written contracts, and modifying programs according to given change contracts. Finally, we discuss the tool support developed for our change contract language. The tool support enables (i) test generation to witness contract violation, as well as (ii) automated repair of certain tests which are broken due to program changes.

Finding Rare Numerical Stability Errors in Concurrent Computations
Hana Chockler, Karine Even, and Eran Yahav
(IBM Research, Israel; Technion, Israel)
A numerical algorithm is called stable if an error, in all possible executions of the algorithm, does not exceed a predefined bound. Introduction of concurrency to numerical algorithms results in a significant increase in the number of possible computations of the same result, due to different possible interleavings of concurrent threads. This can lead to instability of previously stable algorithms, since rounding can result in a larger error than expected for some interleavings. Such errors can be very rare, since the particular combination of rounding can occur in only a small fraction of interleavings. In this paper, we apply the cross-entropy method -- a generic approach to rare event simulation and combinatorial optimization -- to detect rare numerical instability in concurrent programs. The cross-entropy method iteratively samples a small number of executions and adjusts the probability distribution of possible scheduling decisions to increase the probability of encountering an error in a subsequent iteration. We demonstrate the effectiveness of our approach on implementations of several numerical algorithms with concurrency and rounding by truncation of intermediate computations. We describe several abstraction algorithms on top of the implementation of the cross-entropy method and show that with abstraction, our algorithms successfully find rare errors in programs with hundreds of threads. In fact, some of our abstractions lead to a state space whose size does not depend on the number of threads at all. We compare our approach to several existing testing algorithms and argue that its performance is superior to other techniques.

Parallel Bounded Analysis in Code with Rich Invariants by Refinement of Field Bounds
Nicolás Rosner, Juan Galeotti, Santiago Bermúdez, Guido Marucci Blas, Santiago Perez De Rosso, Lucas Pizzagalli, Luciano Zemín, and Marcelo F. Frias
(UBA, Argentina; Saarland University, Germany; ITBA, Argentina)
In this article we present a novel technique for automated parallel bug-finding based on the sequential analysis tool TACO. TACO is a tool based on SAT-solving for efficient bug-finding in Java code with rich class invariants. It prunes the SAT-solver's search space by introducing precise symmetry-breaking predicates and bounding the relational semantics of Java class fields. The bounds computed by TACO generally include a substantial amount of nondeterminism; its reduction allows us to split the original analysis into disjoint subproblems. We discuss the soundness and completeness of the decomposition. Furthermore, we present experimental results showing that MUCHO-TACO, our tool which implements this technique, yields significant speed-ups over TACO on commodity cluster hardware.

Fuzzing and GUI Testing

Efficient and Flexible GUI Test Execution via Test Merging
Pranavadatta Devaki, Suresh Thummalapenta, Nimit Singhania, and Saurabh Sinha
(IBM Research, India)
As a test suite evolves, it can accumulate redundant tests. To address this problem, many test-suite reduction techniques, based on different measures of redundancy, have been developed. A more subtle problem, that can also cause test-suite bloat and that has not been addressed by existing research, is the accumulation of similar tests. Similar tests are not redundant by any measure; but, they contain many common actions that are executed repeatedly, which over a large test suite, can degrade execution time substantially.
We present a test merging technique for GUI tests. Given a test suite, the technique identifies the tests that can be merged and creates a merged test, which covers all the application states that are exercised individually by the tests, but with the redundant common steps executed only once. The key novelty in the merging technique is that it compares the dynamic states induced by the tests to identify a semantically meaningful interleaving of steps from different tests. The technique not only improves the efficiency of test execution, but also ensures that there is no loss in the fault-revealing power of the original tests. In the empirical studies, conducted using four open-source web applications and one proprietary enterprise web application, in which over 3300 test cases and 19600 test steps were analyzed, the technique reduced the number of test steps by 29

Automatically Repairing Broken Workflows for Evolving GUI Applications
Sai Zhang, Hao Lü, and Michael D. Ernst

(University of Washington, USA)
A workflow is a sequence of UI actions to complete a specific task. In the course of a GUI application's evolution, changes ranging from a simple GUI refactoring to a complete rearchitecture can break an end-user's well-established workflow. It can be challenging to find a replacement workflow. To address this problem, we present a technique (and its tool implementation, called FlowFixer) that repairs a broken workflow. FlowFixer uses dynamic profiling, static analysis, and random testing to suggest a replacement UI action that fixes a broken workflow.
We evaluated FlowFixer on 16 broken workflows from 5 realworld GUI applications written in Java. In 13 workflows, the correct replacement action was FlowFixer's first suggestion. In 2 workflows, the correct replacement action was FlowFixer's second suggestion. The remaining workflow was un-repairable. Overall, FlowFixer produced significantly better results than two alternative approaches.

Semi-valid Input Coverage for Fuzz Testing
Petar Tsankov, Mohammad Torabi Dashti, and David Basin
(ETH Zurich, Switzerland)
We define semi-valid input coverage (SVCov), the first coverage criterion for fuzz testing. Our criterion is applicable whenever the valid inputs can be defined by a finite set of constraints. SVCov measures to what extent the tests cover the domain of semi-valid inputs, where an input is semi-valid if and only if it satisfies all the constraints but one.
We demonstrate SVCov's practical value in a case study on fuzz testing the Internet Key Exchange protocol (IKE). Our study shows that it is feasible to precisely define and efficiently measure SVCov. Moreover, SVCov provides essential information for improving the effectiveness of fuzz testing and enhancing fuzz-testing tools and libraries. In particular, by increasing coverage under SVCov, we have discovered a previously unknown vulnerability in a mature IKE implementation.

Automated Testing with Targeted Event Sequence Generation
Casper S. Jensen, Mukul R. Prasad, and Anders Møller

(Aarhus University, Denmark; Fujitsu Labs, USA)
Automated software testing aims to detect errors by producing test inputs that cover as much of the application source code as possible. Applications for mobile devices are typically event-driven, which raises the challenge of automatically producing event sequences that result in high coverage. Some existing approaches use random or model-based testing that largely treats the application as a black box. Other approaches use symbolic execution, either starting from the entry points of the applications or on specific event sequences. A common limitation of the existing approaches is that they often fail to reach the parts of the application code that require more complex event sequences.
We propose a two-phase technique for automatically finding event sequences that reach a given target line in the application code. The first phase performs concolic execution to build summaries of the individual event handlers of the application. The second phase builds event sequences backward from the target, using the summaries together with a UI model of the application. Our experiments on a collection of open source Android applications show that this technique can successfully produce event sequences that reach challenging targets.

Analysis and Testing of Non-functional Properties

Calculating Source Line Level Energy Information for Android Applications
Ding Li, Shuai Hao, William G. J. Halfond

, and Ramesh Govindan
(University of Southern California, USA)
The popularity of mobile apps continues to grow as developers take advantage of the sensors and data available on mobile devices. However, the increased functionality comes with a higher energy cost, which can cause a problem for users on battery constrained mobile devices. To improve the energy consumption of mobile apps, developers need detailed information about the energy consumption of their applications. Existing techniques have drawbacks that limit their usefulness or provide information at too high of a level of granularity, such as components or methods. Our approach is able to calculate source line level energy consumption information. It does this by combining hardware-based power measurements with program analysis and statistical modeling. Our empirical evaluation of the approach shows that it is fast and accurate.

Context-Sensitive Delta Inference for Identifying Workload-Dependent Performance Bottlenecks
Xusheng Xiao, Shi Han

, Dongmei Zhang

, and Tao Xie
(North Carolina State University, USA; Microsoft Research, China)
Software hangs can be caused by expensive operations in responsive actions (such as time-consuming operations in UI threads). Some of the expensive operations depend on the input workloads, referred to as workload-dependent performance bottlenecks (WDPBs). WDPBs are usually caused by workload-dependent loops (i.e., WDPB loops) that contain relatively expensive operations. Traditional performance testing and single-execution profiling may not reveal WDPBs due to incorrect assumptions of workloads. To address these issues, we propose the DeltaInfer approach that predicts WDPB loops under large workloads via inferring iteration counts of WDPB loops using complexity models for the workload size. DeltaInfer incorporates a novel concept named context-sensitive delta inference that consists of two parts: temporal inference for inferring the complexity models of different program locations, and spatial inference for identifying WDPB loops as WDPB candidates. We conducted evaluations on two popular open-source GUI applications, and identified impactful WDPBs that caused 10 performance bugs.

Info

Combining Model Checking and Testing with an Application to Reliability Prediction and Distribution
Lin Gui, Jun Sun, Yang Liu

, Yuan Jie Si, Jin Song Dong, and Xin Yu Wang
(National University of Singapore, Singapore; Singapore University of Technology and Design, Singapore; Nanyang Technological University, Singapore; Zhejiang University, China)
Testing provides a probabilistic assurance of system correctness. In general, testing relies on the assumptions that the system under test is deterministic so that test cases can be sampled. However, a challenge arises when a system under test behaves non-deterministiclly in a dynamic operating environment because it will be unknown how to sample test cases.
In this work, we propose a method combining hypothesis testing and probabilistic model checking so as to provide the ``assurance" and quantify the error bounds. The idea is to apply hypothesis testing to deterministic system components and use probabilistic model checking techniques to lift the results through non-determinism. Furthermore, if a requirement on the level of ``assurance" is given, we apply probabilistic model checking techniques to push down the requirement through non-determinism to individual components so that they can be verified using hypothesis testing. We motivate and demonstrate our method through an application of system reliability prediction and distribution. Our approach has been realized in a toolkit named RaPiD, which has been applied to investigate two real-world systems.

Info

Heap Analysis and Testing

Dynamically Validating Static Memory Leak Warnings
Mengchen Li, Yuanjun Chen, Linzhang Wang

, and Guoqing Xu
(Nanjing University, China; UC Irvine, USA)
File Edit Options Buffers Tools TeX Help Memory leaks have significant impact on software availability, performance, and security. Static analysis has been widely used to find memory leaks in C/C++ programs. Although a static analysis is able to find all potential leaks in a program, it often reports a great number of false warnings. Manually validating these warnings is a daunting task, which significantly limits the practicality of the analysis. In this paper, we develop a novel dynamic technique that automatically validates and categorizes such warnings to unleash the power of static memory leak detectors. Our technique analyzes each warning that contains information regarding the leaking allocation site and the leaking path, generates test cases to cover the leaking path, and tracks objects created by the leaking allocation site. Eventually, warnings are classified into four categories: MUST-LEAK, LIKELY-NOT-LEAK, BLOAT, and MAY-LEAK. Warnings in MUST-LEAK are guaranteed by our analysis to be true leaks. Warnings in LIKELY-NOT-LEAK are highly likely to be false warnings. Although we cannot provide any formal guarantee that they are not leaks, we have high confidence that this is the case. Warnings in BLOAT are also not likely to be leaks but they should be fixed to improve performance. Using our approach, the developer's manual validation effort needs to be focused only on warnings in the category MAY-LEAK, which is often much smaller than the original set.

Collecting a Heap of Shapes
Earl T. Barr

, Christian Bird, and Mark Marron
(University College London, UK; UC Davis, USA; Microsoft Research, USA; IMDEA Software Institute, Spain)
The program heap is fundamentally a simple mathematical concept --- a set of objects and a connectivity relation on them. However, a large gap exists between the set of heap structures that could be constructed and those that programmers actually build. To understand this gap, we empirically study heap structures and sharing relations in large object-oriented programs. To scale and make sense of real world heaps, any analysis must employ abstraction; our abstraction groups sets of objects by role and the aliasing present in pointer sets. We find that the heaps of real-world programs are, in practice, fundamentally simple structures that are largely constructed from a small number of simple structures and sharing idioms, such as the sharing of immutable or unique (e.g., singleton) objects. For instance, we find that, under our abstraction, 53--75% of pointers build tree structures and we classify all but 7--18% of aliasing pointers. These results provide actionable information for rethinking the design of annotation systems, memory allocation/collection, and program analyses.

Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding of Concurrency Bugs
Sangmin Park, Mary Jean Harrold, and Richard Vuduc
(Georgia Tech, USA)
This paper presents Griffin, a new fault-comprehension technique. Griffin provides a way to explain concurrency bugs using additional information over existing fault-localization techniques, and thus, bridges the gap between fault- localization and fault-fixing techniques. Griffin inputs a list of memory-access patterns and a coverage matrix, groups those patterns responsible for the same concurrency bug, and outputs the grouped patterns along with suspicious methods and bug graphs. Griffin is the first technique that handles multiple concurrency bugs. This paper also describes the implementation of Griffin in Java and C++, and shows the empirical evaluation of Griffin on a set of subjects. The results show that, for our subjects, Griffin clusters failing executions and memory-access patterns for the same bug with few false positives, provides suspicious methods that contain the locations to be fixed, and runs efficiently.

Races and Multi-threaded Analysis and Testing

Variable and Thread Bounding for Systematic Testing of Multithreaded Programs
Sandeep Bindal, Sorav Bansal, and Akash Lal

(IIT Delhi, India; Microsoft Research, India)
Previous approaches to systematic state-space exploration for testing multi-threaded programs have proposed context-bounding and depth-bounding to be effective ranking algorithms for testing multithreaded programs. This paper proposes two new metrics to rank thread schedules for systematic state-space exploration. Our metrics are based on characterization of a concurrency bug using v (the minimum number of distinct variables that need to be involved for the bug to manifest) and t (the minimum number of distinct threads among which scheduling constraints are required to manifest the bug). Our algorithm is based on the hypothesis that in practice, most concurrency bugs have low v (typically 1-2) and low t (typically 2-4) characteristics. We iteratively explore the search space of schedules in increasing orders of v and t. We show qualitatively and empirically that our algorithm finds common bugs in fewer number of execution runs, compared with previous approaches. We also show that using v and t improves the lower bounds on the probability of finding bugs through randomized algorithms. Systematic exploration of schedules requires instrumenting each variable access made by a program, which can be very expensive and severely limits the applicability of this approach. Previous work has avoided this problem by interposing only on synchronization operations (and ignoring other variable accesses). We demonstrate that by using variable bounding (v) and a static imprecise alias analysis, we can interpose on all variable accesses (and not just synchronization operations) at 10-100x less overhead than previous approaches.

EnforceMOP: A Runtime Property Enforcement System for Multithreaded Programs
Qingzhou Luo and Grigore Roşu

(University of Illinois at Urbana-Champaign, USA)
Multithreaded programs are hard to develop and test. In order for programs to avoid unexpected concurrent behaviors at runtime, for example data-races, synchronization mechanisms are typically used to enforce a safe subset of thread interleavings. Also, to test multithreaded programs, devel- opers need to enforce the precise thread schedules that they want to test. These tasks are nontrivial and error prone.
This paper presents EnforceMOP, a framework for specifying and enforcing complex properties in multithreaded Java programs. A property is enforced at runtime by blocking the threads whose next actions would violate it. This way the remaining threads, whose execution is safe, can make global progress until the system eventually reaches a global state in which the blocked threads can be safely unblocked and allowed to execute. Users of EnforceMOP can specify the properties to be enforced using the expressive MOP multi-formalism notation, and can provide code to be executed at deadlock (when no thread is safe to continue). EnforceMOP was used in two different kinds of applications. First, to enforce general properties in multithreaded programs, as a formal, semantic alternative to the existing rigid and sometimes expensive syntactic synchronization mechanisms. Second, to enforce testing desirable schedules in unit testing of multithreaded programs, as an alternative to the existing limited and often adhoc techniques. Results show that EnforceMOP is able to effectively express and enforce complex properties and schedules in both scenarios.

SimRacer: An Automated Framework to Support Testing for Process-Level Races
Tingting Yu, Witawas Srisa-an, and Gregg Rothermel
(University of Nebraska-Lincoln, USA)
Faults introduced by races are difficult to detect because they usually occur only under specific execution interleavings. Numerous program analysis and testing techniques have been proposed to detect races between threads. Little work, however, has addressed the problem of detecting and testing for process-level races, in which two processes access a shared resource without proper synchronization. In this paper, we present SIMRACER, a novel testing-based framework that allows engineers to effectively test for process-level races. SIMRACER first computes potential races based on runtime traces obtained by running existing tests on target processes, and then it controls process scheduling relative to the potential races so that real races can be created. We implemented SIMRACER on a commercial virtual platform that is widely used to support hardware/software co-design. We then evaluated its effectiveness on sixteen real-world applications containing known process-level races. Our results show that SIMRACER is effective at detecting process-level races, and more effective than traditional stress testing techniques at detecting faults caused by those races.

Practical Static Race Detection for Java Parallel Loops
Cosmin Radoi and Danny Dig
(University of Illinois, USA; Oregon State University, USA)
Despite significant progress in recent years, the important problem of static race detection remains open. Previous techniques took a general approach and looked for races by analyzing the effects induced by low-level concurrency constructs (e.g., java.lang.Thread). But constructs and libraries for expressing parallelism at a higher level (e.g., fork-join, futures, parallel loops) are becoming available in all major programming languages.
We claim that specializing an analysis to take advantage of the extra semantic information provided by the use of these constructs and libraries improves precision and scalability.
We present IteRace, a set of techniques that are specialized to use the intrinsic thread, safety, and data-flow structure of collections and of the new loop-parallelism mechanism to be introduced in Java 8. Our evaluation shows that IteRace is fast and precise enough to be practical. It scales to programs of hundreds of thousands of lines of code and it reports few race warnings, thus avoiding a common pitfall of static analyses. The tool revealed six bugs in real-world applications. We reported four of them, one had already been fixed, and three were new and the developers confirmed and fixed them.

Localisation, Patching, and Repair

Using Automated Program Repair for Evaluating the Effectiveness of Fault Localization Techniques
Yuhua Qi, Xiaoguang Mao, Yan Lei, and Chengsong Wang
(National University of Defense Technology, China)
Many techniques on automated fault localization (AFL) have been introduced to assist developers in debugging. Prior studies evaluate the localization technique from the viewpoint of developers: measuring how many benefits that developers can obtain from the localization technique used when debugging. However, these evaluation approaches are not always suitable, because it is difficult to quantify precisely the benefits due to the complex debugging behaviors of developers. In addition, recent user studies have presented that developers working with AFL do not correct the defects more efficiently than ones working with only traditional debugging techniques such as breakpoints, even when the effectiveness of AFL is artificially improved. In this paper we attempt to propose a new research direction of developing AFL techniques from the viewpoint of fully automated debugging including the program repair of automation, for which the activity of AFL is necessary. We also introduce the NCP score as the evaluation measurement to assess and compare various techniques from this perspective. Our experiment on 15 popular AFL techniques with 11 subject programs shipping with real-life field failures presents the evidence that these AFL techniques performing well in prior studies do not have better localization effectiveness according to NCP score. We also observe that Jaccard has the better performance over other techniques in our experiment.

Info

Data Model Property Inference and Repair
Jaideep Nijjar and Tevfik Bultan

(UC Santa Barbara, USA)
Nowadays many software applications are deployed over compute clouds using the three-tier architecture, where the persistent data for the application is stored in a backend datastore and is accessed and modified by the server-side code based on the user interactions at the client-side. The data model forms the foundation of these three tiers, and identifies the set of objects stored by the application and the relations (associations) among them. In this paper, we present techniques for automatically inferring properties about the data model by analyzing the relations among the object classes. We then check the inferred properties with respect to the semantics of the data model using automated verification techniques. For the properties that fail, we present techniques that generate fixes to the data model that establish the inferred properties. We implemented this approach for web applications built using the Ruby on Rails framework and applied it to five open source applications. Our experimental results demonstrate that our approach is effective in automatically identifying and fixing errors in data models of real-world web applications.

F3: Fault Localization for Field Failures
Wei Jin and Alessandro Orso

(Georgia Tech, USA)
Reproducing and debugging field failures--failures that occur on user machines after release--are challenging tasks for developers. To help the first task, in previous work we have proposed BugRedux, a technique for reproducing, in-house, failures observed in the field. Although BugRedux can help developers reproduce field failures, it does not provide any specific support for debugging such failures. To address this limitation, in this paper we present F3, a novel technique that builds on BugRedux and extends it with support for fault localization. Specifically, in F3 we extend our previous technique in two main ways: first, we modify BugRedux so that it generates multiple failing and passing executions "similar" to the observed field failure; second, we add to BugRedux debugging capabilities by combining it with a customized fault-localization technique. The results of our empirical evaluation, performed on a set of real-world programs and field failures, are promising: for all the failures considered, F3 was able to (1) synthesize passing and failing executions and (2) successfully use the synthesized executions to perform fault localization and, ultimately, help debugging.

Info

Mutating and Mocking

Selective Mutation Testing for Concurrent Code
Milos Gligoric, Lingming Zhang, Cristiano Pereira, and Gilles Pokam
(University of Illinois at Urbana-Champaign, USA; University of Texas at Austin, USA; Intel, USA)
Concurrent code is becoming increasingly important with the advent of multi-cores, but testing concurrent code is challenging. Researchers are developing new testing techniques and test suites for concurrent code, but evaluating these techniques and test suites often uses a small number of real or manually seeded bugs.
Mutation testing allows creating a large number of buggy programs to evaluate test suites. However, performing mutation testing is expensive even for sequential code, and the cost is higher for concurrent code where each test has to be executed for many (possibly all) thread schedules. The most widely used technique to speed up mutation testing is selective mutation, which reduces the number of mutants by applying only a subset of mutation operators such that test suites that kill all mutants generated by this subset also kill (almost) all mutants generated by all mutation operators. To date, selective mutation has been used only for sequential mutation operators.
This paper explores selective mutation for concurrent mutation operators. Our results identify several sets of concurrent mutation operators that can effectively reduce the number of mutants, show that operator-based selection is slightly better than random mutant selection, and show that sequential and concurrent mutation operators are independent, demonstrating the importance of studying concurrent mutation operators.

Faster Mutation Testing Inspired by Test Prioritization and Reduction
Lingming Zhang, Darko Marinov, and Sarfraz Khurshid
(University of Texas at Austin, USA; University of Illinois at Urbana-Champaign, USA)
Mutation testing is a well-known but costly approach for determining test adequacy. The central idea behind the approach is to generate mutants, which are small syntactic transformations of the program under test, and then to measure for a given test suite how many mutants it kills. A test t is said to kill a mutant m of program p if the output of t on m is different from the output of t on p. The effectiveness of mutation testing in determining the quality of a test suite relies on the ability to apply it using a large number of mutants. However, running many tests against many mutants is time consuming. We present a family of techniques to reduce the cost of mutation testing by prioritizing and reducing tests to more quickly determine the sets of killed and non-killed mutants. Experimental results show the effectiveness and efficiency of our techniques.

Info

Declarative Mocking
Hesam Samimi, Rebecca Hicks, Ari Fogel, and Todd Millstein

(UC Los Angeles, USA)
Test-driven methodologies encourage testing early and often. "Mock objects" support this approach by allowing a component to be tested before all depended-upon components are available. Today mock objects typically reflect little to none of an object's intended functionality, which makes it difficult and error-prone for developers to test rich properties of their code. This paper presents "declarative mocking", which enables the creation of expressive and reliable mock objects with relatively little effort. In our approach, developers write method specifications in a high-level logical language for the API being mocked, and a constraint solver dynamically executes these specifications when the methods are invoked. In addition to mocking functionality, this approach seamlessly allows data and other aspects of the environment to be easily mocked. We have implemented the approach as an extension to an existing tool for executable specifications in Java called PBnJ. We have performed an exploratory study of declarative mocking on several existing Java applications, in order to understand the power of the approach and to categorize its potential benefits and limitations. We also performed an experiment to port the unit tests of several open-source applications from a widely used mocking library to PBnJ. We found that more than half of these unit tests can be enhanced, in terms of the strength of properties and coverage, by exploiting executable specifications, with relatively little additional developer effort.

Info

Learning, Optimizing, and Searching

Orthogonal Exploration of the Search Space in Evolutionary Test Case Generation
Fitsum M. Kifetew

, Annibale Panichella, Andrea De Lucia

, Rocco Oliveto, and Paolo Tonella
(Fondazione Bruno Kessler, Italy; University of Salerno, Italy; University of Molise, Italy)
The effectiveness of evolutionary test case generation based on Genetic Algorithms (GAs) can be seriously impacted by genetic drift, a phenomenon that inhibits the ability of such algorithms to effectively diversify the search and look for alternative potential solutions. In such cases, the search becomes dominated by a small set of similar individuals that lead GAs to converge to a sub-optimal solution and to stagnate, without reaching the desired objective. This problem is particularly common for hard-to-cover program branches, associated with an extremely large solution space. In this paper, we propose an approach to solve this problem by integrating a mechanism for orthogonal exploration of the search space into standard GA. The diversity in the population is enriched by adding individuals in orthogonal directions, hence providing a more effective exploration of the solution space. To the best of our knowledge, no prior work has addressed explicitly the issue of evolution direction based diversification in the context of evolutionary testing. Results achieved on 17 Java classes indicate that the proposed enhancements make GA much more effective and efficient in automating the testing process. In particular, effectiveness (coverage) was significantly improved in 47% of the subjects and efficiency (search budget consumed) was improved in 85% of the subjects on which effectiveness remains the same.

Hybrid Learning: Interface Generation through Static, Dynamic, and Symbolic Analysis
Falk Howar, Dimitra Giannakopoulou, and Zvonimir Rakamarić
(CMU, USA; NASA Ames Research Center, USA; University of Utah, USA)
This paper addresses the problem of efficient generation of component interfaces through learning. Given a white-box component C with specified unsafe states, an interface captures safe orderings of invocations of C's public methods. In previous work we presented Psyco, an interface generation framework that combines automata learning with symbolic component analysis: learning drives the framework in exploring different combinations of method invocations, and symbolic analysis computes method guards corresponding to constraints on the method parameters for safe execution. In this work we propose solutions to the two main bottlenecks of Psyco. The explosion of method sequences that learning generates to validate its computed interfaces is reduced through partial order reduction resulting from a static analysis of the component. To avoid scalability issues associated with symbolic analysis, we propose novel algorithms that are primarily based on dynamic, concrete component execution, while resorting to symbolic analysis on a limited, as needed, basis. Dynamic execution enables the introduction of a concept of state matching, based on which our proposed approach detects, in some cases, that it has exhausted the exploration of all component behaviors. On the other hand, symbolic analysis is enhanced with symbolic summaries. Our new approach, X-Psyco, has been implemented in the Java PathFinder (JPF) software model checking platform. We demonstrated the effectiveness of X-Psyco on a number of realistic software components by generating more complete and precise interfaces than was previously possible.

Info

Optimizing Monitoring of Finite State Properties through Monitor Compaction
Rahul Purandare, Matthew B. Dwyer, and Sebastian Elbaum
(University of Nebraska-Lincoln, USA)
Runtime monitoring has proven effective in detecting property violations, but it can incur high overhead when monitoring just a single property - particularly when the property relates multiple objects. In practice developers will likely monitor multiple properties in the same execution which will lead to even higher overhead.
This paper presents the first study of overhead arising during the simultaneous monitoring of multiple properties. We present two techniques for mitigating overhead in such cases that exploit reductions in cost that arise from sharing of information between property monitors. This sharing permits a single monitor to function in place of many monitors. We evaluate these techniques on the DaCapo benchmarks and 8 temporal correctness properties and find that they offer significant overhead reductions, as high as 57%, as the number of monitored properties increases.

Whitebox Testing

Does Automated White-Box Test Generation Really Help Software Testers?
Gordon Fraser, Matt Staats, Phil McMinn, Andrea Arcuri, and Frank Padberg
(University of Sheffield, UK; KAIST, South Korea; Simula Research Laboratory, Norway; KIT, Germany)
Automated test generation techniques can efficiently produce test data that systematically cover structural aspects of a program. In the absence of a specification, a common assumption is that these tests relieve a developer of most of the work, as the act of testing is reduced to checking the results of the tests. Although this assumption has persisted for decades, there has been no conclusive evidence to date confirming it. However, the fact that the approach has only seen a limited uptake in industry suggests the contrary, and calls into question its practical usefulness. To investigate this issue, we performed a controlled experiment comparing a total of 49 subjects split between writing tests manually and writing tests with the aid of an automated unit test generation tool, EvoSuite. We found that, on one hand, tool support leads to clear improvements in commonly applied quality metrics such as code coverage (up to 300% increase). However, on the other hand, there was no measurable improvement in the number of bugs actually found by developers. Our results not only cast some doubt on how the research community evaluates test generation tools, but also point to improvements and future work necessary before automated test generation tools will be widely adopted by practitioners.

Info

Comparing Non-adequate Test Suites using Coverage Criteria
Milos Gligoric, Alex Groce

, Chaoqiang Zhang, Rohan Sharma, Mohammad Amin Alipour, and Darko Marinov

(University of Illinois at Urbana-Champaign, USA; Oregon State University, USA)
A fundamental question in software testing research is how to compare test suites, often as a means for comparing test-generation techniques. Researchers frequently compare test suites by measuring their coverage. A coverage criterion C provides a set of test requirements and measures how many requirements a given suite satisfies. A suite that satisfies 100% of the (feasible) requirements is C-adequate.
Previous rigorous evaluations of coverage criteria mostly focused on such adequate test suites: given criteria C and C′, are C-adequate suites (on average) more effective than C′-adequate suites? However, in many realistic cases producing adequate suites is impractical or even impossible. We present the first extensive study that evaluates coverage criteria for the common case of non-adequate test suites: given criteria C and C′, which one is better to use to compare test suites? Namely, if suites T1, T2 . . . Tn have coverage values c1, c2 . . . cn for C and c′1, c′2 . . . c′n for C′, is it better to compare suites based on c1, c2 . . . cn or based on c′1, c′ 2 . . . c′n?
We evaluate a large set of plausible criteria, including statement and branch coverage, as well as stronger criteria used in recent studies. Two criteria perform best: branch coverage and an intra-procedural acyclic path coverage.

Threats to the Validity and Value of Empirical Assessments of the Accuracy of Coverage-Based Fault Locators
Friedrich Steimann

, Marcus Frenkel, and Rui Abreu

(Fernuniversität in Hagen, Germany; University of Porto, Portugal)
Resuming past work on coverage-based fault localization, we find that empirical assessments of its accuracy are subject to so many imponderables that they are of limited value. To improve on this situation, we have compiled a comprehensive list of threats to be considered when attempting such assessments in the future. In addition, we propose the establishment of theoretical lower and upper bounds of fault localization accuracy that depend on properties of the subject programs (including their test suites) only. We make a suggestion for a lower bound and show that well-known fault locators do not uniformly perform better.

Info

Web-Based Analysis and Testing

An Empirical Study of PHP Feature Usage: A Static Analysis Perspective
Mark Hills, Paul Klint, and Jurgen Vinju

(CWI, Netherlands; INRIA, France)
PHP is one of the most popular languages for server-side application development. The language is highly dynamic, providing programmers with a large amount of flexibility. However, these dynamic features also have a cost, making it difficult to apply traditional static analysis techniques used in standard code analysis and transformation tools. As part of our work on creating analysis tools for PHP, we have conducted a study over a significant corpus of open-source PHP systems, looking at the sizes of actual PHP programs, which features of PHP are actually used, how often dynamic features appear, and how distributed these features are across the files that make up a PHP website. We have also looked at whether uses of these dynamic features are truly dynamic or are, in some cases, statically understandable, allowing us to identify specific patterns of use which can then be taken into account to build more precise tools. We believe this work will be of interest to creators of analysis tools for PHP, and that the methodology we present can be leveraged for other dynamic languages with similar features.

Info

Practical Blended Taint Analysis for JavaScript
Shiyi Wei and Barbara G. Ryder
(Virginia Tech, USA)
JavaScript is widely used in Web applications because of its flexibility and dynamic features. However, the latter pose challenges to static analyses aimed at finding security vulnerabilities, (e.g., taint analysis).
We present blended taint analysis, an instantiation of our general-purpose analysis framework for JavaScript, to illustrate how a combined dynamic/static analysis approach can deal with dynamic features by collecting generated code and other information at runtime. In empirical comparisons with two pure static taint analyses, we show blended taint analysis to be both more scalable and precise on JavaScript benchmark codes extracted from 12 popular websites at alexa. Our results show that blended taint analysis discovered 13 unique violations in 6 of the websites. In contrast, each of the static analyses identified less than half of these violations. Moreover, given a reasonable time budget of 10 minutes, both static analyses encountered webpages they could not analyze, sometimes significantly many such pages. Case studies demonstrate the quality of the blended taint analysis solution in comparison to that of pure static analysis.

Finding Your Way in the Testing Jungle: A Learning Approach to Web Security Testing
Omer Tripp, Omri Weisman, and Lotem Guy
(IBM, Israel; Tel Aviv University, Israel; Cybereason, Israel)
Black-box security testing of web applications is a hard problem. The main complication lies in the black-box assumption: The testing tool has limited insight into the workings of server-side defenses. This has traditionally led commercial as well as research vulnerability scanners toward heuristic approaches, such as testing each input point (e.g. HTTP parameter) with a short, predefined list of effective test payloads to balance between coverage and performance.
We take a fresh approach to the problem of security testing, casting it into a learning setting. In our approach, the testing algorithm has available a comprehensive database of test payloads, such that if the web application's defenses are broken, then with near certainty one of the candidate payloads is able to demonstrate the vulnerability. The question then becomes how to efficiently search through the payload space to find a good candidate. In our solution, the learning algorithm infers from a failed test---by analyzing the website's response---which other payloads are also likely to fail, thereby pruning substantial portions of the search space.
We have realized our approach in XSS Analyzer, an industry-level cross-site scripting (XSS) scanner featuring 500,000,000 test payloads. Our evaluation on 15,552 benchmarks shows solid results: XSS Analyzer achieves >99% coverage relative to brute-force traversal over all payloads, while trying only 10 payloads on average per input point. XSS Analyzer also outperforms several competing algorithms, including a mature commercial algorithm---featured in IBM Security AppScan Standard V8.5---by a far margin. XSS Analyzer has recently been integrated into the latest version of AppScan (V8.6) instead of that algorithm.

Doctoral Symposium

Temporal Properties and Concurrency

Debugging Non-deadlock Concurrency Bugs
Sangmin Park
(Georgia Tech, USA)
Concurrency bugs are difficult to debug because of nondeterministic behaviors of concurrent programs. Existing fault-localization techniques for non-deadlock concurrency bugs do not provide the comprehensive information to identify and understand the bugs. Existing automatic fault-fix techniques for concurrency bugs do not provide confidence that the fault has been fixed. To address the limitations of existing techniques, this research will develop techniques that will assist developers in several aspects of debugging, from finding the locations of concurrency bugs to providing confidence after the fix. The expected contributions of this research include a fault-localization technique that locates concurrency bugs, a fault-comprehension technique that assists understanding of bugs, a fix-confidence technique that provides confidence of fixes, and a framework that implements the techniques and that is used to perform empirical studies to evaluate the techniques.

Generation of Java Programs Properties from Test Purposes
Simone Hanazumi and Ana C. V. de Melo
(USP, Brazil)
Software formal verification is a technique used to ensure that computational systems have high-quality and work properly. From the system specification described in a formal language, we define properties that must be satisfied during system execution to guarantee the software quality. Then, these properties should be implemented in a verifier, tool responsible for running the verification and for notifying which properties were satisfied or not. When the verification process finishes, the verifier will indicate to software developers the possible location of each code fault in the system. The disadvantages of using formal verification are the high cost to apply this technique in real systems, and the necessity of having people with experience in formal languages and formal methods. In addition, the implementation of properties related to a particular system in a verifier is a complex task.
To help software developers in the application of formal verification in Java programs, this work proposes the generation of properties code, for direct use in a verifier, from test purposes derived from system formal requirements.

Performance and Databases

Generation of Test Databases using Sampling Methods
Teodora Sandra Buda
(University College Dublin, Ireland)
Populating the testing environment with relevant data represents a great challenge in software validation, generally requiring expert knowledge about the system under development, as its data critically impacts the outcome of the tests designed to assess the system. Current practices of populating the testing environments generally focus on developing efficient algorithms for generating synthetic data or use the production environment for testing purposes. The latter is an invaluable strategy to provide real test cases in order to discover issues that critically impact the user of the system. However, the production environment generally consists of large amounts of data that are difficult to handle and analyze. Database sampling from the production environment is a potential solution to overcome these challenges.
In this research, we propose two database sampling methods, VFDS and CoDS, with the objective of populating the testing environment. The first method is a very fast random sampling approach, while the latter aims at preserving the distribution of data in order to produce a representative sample. In particular, we focus on the dependencies between the data from different tables and the method tries to preserve the distributions of these dependencies.

Analysis of Performance Regression Testing Data by Transaction Profiles
Shadi Ghaith
(University College Dublin, Ireland)
Performance regression testing is an important step in the software development lifecycle, especially for enterprise applications. Commonly the analysis of performance regression testing to find anomalies is carried out manually and therefore can be error-prone, time consuming and sensitive to the input load. In our research, we propose a new technique that overcomes the above problems which helps the performance testing teams to improve their process and speeds up the entire software production process.

Formal Verification

Product-Line Verification with Feature-Oriented Contracts
Thomas Thüm
(University of Magdeburg, Germany)
Software product lines allow programmers to reuse code across similar software products. Software products are decomposed into separate modules representing user-visible features. Based on a selection of desired features, a customized software product can be generated automatically. However, these reuse mechanisms challenge existing techniques for specification and verification of software. Specifying and verifying each product involves redundant steps, and is often infeasible. We discuss how method contracts (i.e., preconditions and postconditions) can be used to efficiently specify and verify product lines.

Info

Formal Safety Proof: A Real Case Study in a Railway Interlocking System
Andrea Bonacchi
(University of Florence, Italy)
A challenging problem for model checking is represented by railway interlocking systems. It is a well known fact that interlocking systems, due to their inherent complexity related to the high number of variables involved, are not readily amenable to automatic verification, typically incurring in state space explosion problems. The study described in this paper aims at evaluating and experimenting the industrial application of verification by model checking for this class of systems. The choices made at the beginning of the study, also on the basis of specific requirements from the industrial partner, are presented, together with the advancement status of the project and the plans for its completion.

ISSTA 2013 – Proceedings

Preface

Technical Research

Contracts, Invariants, and Stability

Fuzzing and GUI Testing

Analysis and Testing of Non-functional Properties

Heap Analysis and Testing

Races and Multi-threaded Analysis and Testing

Localisation, Patching, and Repair

Mutating and Mocking

Learning, Optimizing, and Searching

Whitebox Testing

Web-Based Analysis and Testing

Doctoral Symposium

Temporal Properties and Concurrency

Performance and Databases

Formal Verification