
2011 International Symposium on Software Testing and Analysis (ISSTA 2011), July 17–21, 2011, Toronto, ON, Canada

ISSTA 2011 – Proceedings


Preface

Title Page


Foreword
It is our great pleasure to welcome you to Toronto for ISSTA 2011, the 20th International Symposium on Software Testing and Analysis. ISSTA is the leading research conference in software testing and analysis and brings together academics, industrial researchers, and practitioners to exchange new ideas, problems, and experience on how to analyze and test software systems. The ISSTA 2011 program includes keynotes, technical papers, workshops, and an industry-academia session.

Test Generation I

eXpress: Guided Path Exploration for Efficient Regression Test Generation
Kunal Taneja, Tao Xie, Nikolai Tillmann, and Jonathan De Halleux
(North Carolina State University, USA; Microsoft Research, USA)
Software programs evolve throughout their lifetime, undergoing various changes. While making these changes, software developers may introduce regression faults. It is desirable to detect these faults as quickly as possible to reduce the cost involved in fixing them. One existing solution is continuous testing, which runs an existing test suite to quickly find regression faults as soon as code changes are saved. However, the effectiveness of continuous testing depends on the capability of the existing test suite for finding behavioral differences across versions. To address this issue, we propose an approach, called eXpress, that conducts efficient regression test generation based on a path-exploration-based test generation (PBTG) technique, such as dynamic symbolic execution. eXpress prunes paths that are irrelevant to detecting behavioral differences in order to optimize the search strategy of a PBTG technique. As a result, the PBTG technique focuses its efforts on regression test generation. In addition, eXpress leverages the existing test suite (if available) for the original version to efficiently execute the changed code regions of the program and infect program states. Experimental results on 67 versions (in total) of four programs (two from the subject infrastructure repository and two from real-world open source projects) show that, using eXpress, a state-of-the-art PBTG technique, called Pex, requires about 36% less time (on average) to detect behavioral differences than without using eXpress. In addition, Pex with eXpress detects four behavioral differences that could not be detected without eXpress (within a time bound). Furthermore, our eXpress approach requires 67% less time to find behavioral differences by exploiting an existing test suite than exploration without using the test suite.

Statically-Directed Dynamic Automated Test Generation
Domagoj Babić, Lorenzo Martignoni, Stephen McCamant, and Dawn Song
(UC Berkeley, USA)
We present a new technique for exploiting static analysis to guide dynamic automated test generation for binary programs, prioritizing the paths to be explored. Our technique is a three-stage process, which alternates dynamic and static analysis. In the first stage, we run dynamic analysis with a small number of seed tests to resolve indirect jumps in the binary code and build a visibly pushdown automaton (VPA) reflecting the global control-flow of the program. Further, we augment the computed VPA with statically computable jumps not executed by the seed tests. In the second stage, we apply static analysis to the inferred automaton to find potential vulnerabilities, i.e., targets for the dynamic analysis. In the third stage, we use the results of the prior phases to assign weights to VPA edges. Our symbolic-execution based automated test generation tool then uses the weighted shortest-path lengths in the VPA to direct its exploration to the target potential vulnerabilities. Preliminary experiments on a suite of benchmarks extracted from real applications show that static analysis allows exploration to reach vulnerabilities it otherwise would not, and the generated test inputs prove that the static warnings indicate true positives.

Automatic Partial Loop Summarization in Dynamic Test Generation
Patrice Godefroid and Daniel Luchaup
(Microsoft Research, USA; University of Wisconsin – Madison, USA)
Whitebox fuzzing extends dynamic test generation based on symbolic execution and constraint solving from unit testing to whole-application security testing. Unfortunately, input-dependent loops may cause an explosion in the number of constraints to be solved and in the number of execution paths to be explored. In practice, whitebox fuzzers arbitrarily bound the number of constraints and paths due to input-dependent loops, at the risk of missing code and bugs.
In this work, we investigate the use of simple loop-guard pattern-matching rules to automatically guess an input constraint defining the number of iterations of input-dependent loops during dynamic symbolic execution. We discover the loop structure of the program on the fly, detect induction variables, which are variables modified by a constant value during loop iterations, and infer simple partial loop invariants relating the values of such variables. Whenever a guess is confirmed later during the current dynamic symbolic execution, we then inject new constraints representing pre- and post-loop conditions, effectively summarizing sets of executions of that loop. These pre- and postconditions are derived from partial loop invariants synthesized dynamically using pattern-matching rules on the loop guards and induction variables, without requiring any static analysis, theorem proving, or input-format specification. This technique has been implemented in the whitebox fuzzer SAGE, scales to large programs with many nested loops, and we present results of experiments with a Windows 7 image parser.
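To make the idea concrete, the following hypothetical Java loop (illustrative only; SAGE targets binaries, and none of these identifiers come from the paper) shows the kind of input-dependent loop, induction variable, and guessed partial summary the technique targets:

```java
// Hypothetical example (not taken from SAGE or the paper) of an input-dependent
// loop of the kind the summarization targets.
public class LoopSummaryExample {
    static int prefixLength(byte[] input) {
        int i = 0;                                   // induction variable: +1 per iteration
        while (i < input.length && input[i] != 0) {  // loop guard depends on symbolic input
            i++;
        }
        // A guessed partial summary relates the induction variable to the input:
        // on exit, either i == input.length, or input[i] == 0 and input[0..i-1] != 0.
        // Injecting such a post-condition as a constraint lets the solver pick the
        // iteration count directly instead of exploring one path per count.
        return i;
    }

    public static void main(String[] args) {
        System.out.println(prefixLength(new byte[] {7, 7, 0, 3}));  // prints 2
    }
}
```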

Symbolic Execution with Mixed Concrete-Symbolic Solving
Corina S. Păsăreanu, Neha Rungta, and Willem Visser
(CMU, USA; NASA Ames Research Center, USA; SGT Inc., USA; Stellenbosch University, South Africa)
Symbolic execution is a powerful static program analysis technique that has been used for the automated generation of test inputs. Directed Automated Random Testing (DART) is a dynamic variant of symbolic execution that initially uses random values to execute a program and collects symbolic path conditions during the execution. These conditions are then used to produce new inputs to execute the program along different paths. It has been argued that DART can handle situations where classical static symbolic execution fails due to incompleteness in decision procedures and its inability to handle external library calls.
We propose here a technique that mitigates these previous limitations of classical symbolic execution. The proposed technique splits the generated path conditions into (a) constraints that can be solved by a decision procedure and (b) complex non-linear constraints with uninterpreted functions to represent external library calls. The solutions generated from the decision procedure are used to simplify the complex constraints and the resulting path conditions are checked again for satisfiability. We also present heuristics that can further improve our technique. We show how our technique can enable classical symbolic execution to cover paths that other dynamic symbolic execution approaches cannot cover. Our method has been implemented within the Symbolic PathFinder tool and has been applied to several examples, including two from the NASA domain.
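As a hedged illustration of the split (hypothetical code; opaqueHash merely stands in for an external library call and does not come from the paper), consider a branch whose path condition mixes a simple constraint with a call the decision procedure cannot reason about:

```java
// Hypothetical illustration of a path condition that mixes a simple constraint
// with a call treated as an uninterpreted function; all names are invented.
public class MixedSolvingExample {
    // Stands in for an external library call that symbolic execution cannot model.
    static int opaqueHash(int y) {
        return (y * 31) ^ (y >>> 7);
    }

    static void target(int x, int y) {
        // Path condition of the true branch: (x > 3) && (opaqueHash(y) == x).
        // The split sends "x > 3" to the decision procedure (say, x = 4),
        // substitutes that solution into the complex part, and then checks the
        // simplified condition "opaqueHash(y) == 4" again for satisfiability,
        // using concrete executions of opaqueHash in place of the uninterpreted
        // function.
        if (x > 3 && opaqueHash(y) == x) {
            throw new IllegalStateException("hard-to-reach branch");
        }
    }

    public static void main(String[] args) {
        target(1, 0);  // safe call: the guard is false for these inputs
    }
}
```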

Models

Polyglot: Modeling and Analysis for Multiple Statechart Formalisms
Daniel Balasubramanian, Corina S. Păsăreanu, Michael W. Whalen, Gábor Karsai, and Michael Lowry
(Vanderbilt University, USA; CMU, USA; NASA Ames Research Center, USA; University of Minnesota, USA)
In large programs such as NASA Exploration, multiple systems that interact via safety-critical protocols are already designed with different Statechart variants. To verify these safety-critical systems, a unified framework is needed based on a formal semantics that captures the variants of Statecharts. We describe Polyglot, a unified framework for the analysis of models described using multiple Statechart formalisms. In this framework, Statechart models are translated into Java and analyzed using pluggable semantics for different variants operating in a polymorphic execution environment. The framework has been built on the basis of a parametric formal semantics that captures the common core of Statecharts with extensions for different variants, and addresses previous limitations. Polyglot has been integrated with the Java Pathfinder verification tool-set, providing analysis and test-case generation capabilities. We describe the application of this unified framework to the analysis of NASA/JPL's MER Arbiter whose interacting components were modeled using multiple Statechart formalisms.

Scalable Analysis of Conceptual Data Models
Matthew J. McGill, Laura K. Dillon, and R. E. K. Stirewalt
(Michigan State University, USA; LogicBlox Inc., USA)
Conceptual data models describe information systems without the burden of implementation details, and are increasingly used to generate code. They could also be analyzed for consistency and to generate test data, except that the expressive constraints supported by popular modeling notations make such analysis intractable. In an earlier empirical study of conceptual models created at LogicBlox Inc., Smaragdakis, Csallner, and Subramanian found that a restricted subset of ORM, called ORM−, includes the vast majority of constraints used in practice and, moreover, allows scalable analysis. After that study, however, LogicBlox Inc. obtained a new ORM modeling tool, which supports discovery and specification of more complex constraints than the previous tool. We report findings of a follow-up study of models constructed using the more powerful tool. Our study finds that LogicBlox developers increasingly rely on a small number of features not in the ORM− subset. We extend ORM− with support for two of them: objectification and a restricted class of external uniqueness constraints. The extensions significantly improve our ability to analyze the ORM models created by developers using the new tool. We also show that a recent change to ORM has rendered the original ORM− algorithms unsound, in general; but that an efficient test suffices to show that these algorithms are in fact sound for the ORM− constraints appearing in any of the models currently in use at LogicBlox.

Bounded Verification of Ruby on Rails Data Models
Jaideep Nijjar and Tevfik Bultan
(UC Santa Barbara, USA)
The use of scripting languages to build web applications has increased programmer productivity, but at the cost of degrading dependability. In this paper we focus on a class of bugs that appear in web applications that are built based on the Model-View-Controller architecture. Our goal is to automatically discover data model errors in Ruby on Rails applications. To this end, we created an automatic translator that converts data model expressions in Ruby on Rails applications to formal specifications. In particular, our translator takes Active Record specifications (which are used to specify data models in Ruby on Rails applications) as input and generates a data model in the Alloy language as output. We then use bounded verification techniques implemented in the Alloy Analyzer to look for errors in these formal data model specifications. We applied our approach to two open source web applications to demonstrate its feasibility.

Automated Framework for Formal Operator Task Analysis
Ayesha Yasmeen and Elsa L. Gunter
(University of Illinois at Urbana-Champaign, USA)
Aberrant behavior of human operators in safety critical systems can lead to severe or even fatal consequences. Human operators are unique in their decision making capability, judgment and nondeterminism. There is a need for a generalized framework that can allow capturing, modeling and analyzing the interactions between computer systems and human operators where the operators are allowed to deviate from their prescribed behaviors for executing a task. This will provide a formal understanding of the robustness of a computer system against possible aberrant behaviors by its human operators. We provide a framework for (i) modeling the human operators and the computer systems; (ii) formulating tolerable human operator action variations (protection envelopes); (iii) determining whether the computer system can maintain its guarantees if the human operators operate within their protection envelopes; and finally, (iv) determining the robustness of the computer system under weakening of the protection envelopes. We present Tutela, a tool that assists in accomplishing the first and second steps, automates the third step, and modestly assists in accomplishing the fourth step.

Analysis of Systems and Binary Code

Efficient, Sensitivity Resistant Binary Instrumentation
Andrew R. Bernat, Kevin Roundy, and Barton P. Miller
(University of Wisconsin – Madison, USA)
Binary instrumentation allows users to inject new code into programs without requiring source code, symbols, or debugging information. Instrumenting a binary requires structural modifications such as moving code, adding new code, and overwriting existing code; these modifications may unintentionally change the program's semantics. Binary instrumenters attempt to preserve the intended semantics of the program by further transforming the code to compensate for these structural modifications. Current instrumenters may fail to correctly preserve program semantics or impose significant unnecessary compensation cost because they lack a formal model of the impact of their structural modifications on program semantics. These weaknesses are particularly acute when instrumenting highly optimized or malicious code, making current instrumenters less useful as tools in the security or high-performance domains. We present a formal specification of how the structural modifications used by instrumentation affect a binary's visible behavior, and have adapted the Dyninst binary instrumenter to use this specification, thereby guaranteeing correct instrumentation while greatly reducing compensation costs. When compared against the fastest widely used instrumenters, our technique imposed 46% less overhead; furthermore, we can successfully instrument highly defensive binaries that are specifically looking for code patching and instrumentation.

Recovering the Toolchain Provenance of Binary Code
Nathan Rosenblum, Barton P. Miller, and Xiaojin Zhu
(University of Wisconsin – Madison, USA)
Program binaries are an artifact of a production process that begins with source code and ends with a string of bytes representing executable code. There are many reasons to want to know the specifics of this process for a given binary—for forensic investigation of malware, to diagnose the role of the compiler in crashes or performance problems, or for reverse engineering and decompilation—but binaries are not generally annotated with such provenance details. Intuitively, the binary code should exhibit properties specific to the process that produced it, but it is not at all clear how to find such properties and map them to specific elements of that process.
In this paper, we present an automatic technique to recover toolchain provenance: those details, such as the source language and the compiler and compilation options, that define the transformation process through which the binary was produced. We approach provenance recovery as a classification problem, discovering characteristics of binary code that are strongly associated with particular toolchain components and developing models that can infer the likely provenance of program binaries. Our experiments show that toolchain provenance can be recovered with high accuracy, approaching 100% accuracy for some components and yielding good results (90%) even when the binaries emitted by different components appear to be very similar.

Defective Error/Pointer Interactions in the Linux Kernel
Cindy Rubio-González and Ben Liblit
(University of Wisconsin – Madison, USA)


Concurrency

Testing Concurrent Programs on Relaxed Memory Models
Jacob Burnim, Koushik Sen, and Christos Stergiou
(UC Berkeley, USA)
High-performance concurrent libraries, such as lock-free data structures and custom synchronization primitives, are notoriously difficult to write correctly. Such code is often implemented without locks, instead using plain loads and stores and low-level operations like atomic compare-and-swaps and explicit memory fences. Such code must run correctly despite the relaxed memory model of the underlying compiler, virtual machine, and/or hardware. These memory models may reorder the reads and writes issued by a thread, greatly complicating parallel reasoning.
We propose RELAXER, a combination of predictive dynamic analysis and software testing, to help programmers write correct, highly-concurrent programs. Our technique works in two phases. First, RELAXER examines a sequentially-consistent run of a program under test and dynamically detects potential data races. These races are used to predict possible violations of sequential consistency under alternate executions on a relaxed memory model. In the second phase, RELAXER re-executes the program with a biased random scheduler and with a conservative simulation of a relaxed memory model in order to create with high probability a predicted sequential consistency violation. These executions can be used to test whether or not a program works as expected when the underlying memory model is not sequentially consistent.
We have implemented RELAXER for C and have evaluated it on several synchronization algorithms, concurrent data structures, and parallel applications. RELAXER generates many executions of these benchmarks with violations of sequential consistency, highlighting a number of bugs under relaxed memory models.
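The classic store-buffering idiom below (a textbook example, not taken from the paper) illustrates the kind of sequential-consistency violation RELAXER predicts and then tries to trigger:

```java
// Classic store-buffering illustration: under sequential consistency at least
// one thread must observe the other's write, but a relaxed memory model may
// reorder each thread's store and subsequent load so that both r1 and r2 are 0.
public class StoreBufferingExample {
    static int x = 0, y = 0;      // deliberately not volatile
    static int r1, r2;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { x = 1; r1 = y; });
        Thread t2 = new Thread(() -> { y = 1; r2 = x; });
        t1.start(); t2.start();
        t1.join();  t2.join();
        // Under any sequentially consistent interleaving, r1 == 0 && r2 == 0 is
        // impossible; observing it indicates a sequential-consistency violation
        // of the kind RELAXER aims to predict and then reproduce.
        System.out.println("r1=" + r1 + " r2=" + r2);
    }
}
```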

Change-Aware Preemption Prioritization
Vilas Jagannath, Qingzhou Luo, and Darko Marinov
(University of Illinois at Urbana-Champaign, USA)
Successful software evolves as developers add more features, respond to requirements changes, and fix faults. Regression testing is widely used for ensuring the validity of evolving software. As regression test suites grow over time, it becomes expensive to execute them. The problem is exacerbated when test suites contain multithreaded tests. These tests are generally long running as they explore many different thread schedules searching for concurrency faults such as data races, atomicity violations, and deadlocks. While many techniques have been proposed for regression test prioritization, selection, and minimization for sequential tests, there is not much work for multithreaded code.
We present a novel technique, called Change-Aware Preemption Prioritization (CAPP), that uses information about the changes in software evolution to prioritize the exploration of schedules in a multithreaded regression test. We have implemented CAPP in two frameworks for systematic exploration of multithreaded Java code. We evaluated CAPP on the detection of 15 faults in multithreaded Java programs, including large open-source programs. The results show that CAPP can substantially reduce the exploration required to detect multithreaded regression faults.

Persuasive Prediction of Concurrency Access Anomalies
Jeff Huang and Charles Zhang
(Hong Kong University of Science and Technology, China)
Predictive analysis is a powerful technique that exposes concurrency bugs in un-exercised program executions. However, current predictive analysis approaches lack the persuasiveness property, as they offer little assistance in helping programmers fully understand the execution history that triggers the predicted bugs. We present a persuasive bug prediction technique as well as a prototype tool, PECAN, for detecting general access anomalies (AAs) in concurrent programs. The main characteristic of PECAN is that, in addition to predicting AAs in a more general way, it generates “bug hatching clips” that deterministically instruct the input program to exercise the predicted AAs. The key ingredient of PECAN is an efficient offline schedule generation algorithm, with a proof of soundness, that is guaranteed to generate a feasible schedule for every real AA in programs that use locks in a nested way. We evaluate PECAN using twenty multi-threaded subjects, including six large concurrent systems, and our experiments demonstrate that PECAN is able to effectively predict and deterministically expose real AAs. Several serious and previously unknown bugs in large open source concurrent systems were also revealed in our experiments.

Program Analysis

Demand-Driven Context-Sensitive Alias Analysis for Java
Dacong Yan, Guoqing Xu, and Atanas Rountev
(Ohio State University, USA)
Software tools for program understanding, transformation, verification, and testing often require an efficient yet highly precise alias analysis. Typically this is done by computing points-to information, from which alias queries can be answered. This paper presents a novel context-sensitive, demand-driven alias analysis for Java that achieves efficiency by answering alias queries directly, instead of relying on an underlying points-to analysis. The analysis is formulated as a context-free-language (CFL) reachability problem over a language that models calling context sensitivity, and over another language that models field sensitivity (i.e., flow of reference values through fields of heap objects).
To improve analysis scalability, we propose to compute procedural reachability summaries online, during the CFL-reachability computation. This cannot be done indiscriminately, as the benefits of using the summary information do not necessarily outweigh the cost of computing it. Our approach selects for summarization only a subset of heavily-used methods (i.e., methods having a large number of incoming edges in the static call graph). We have performed a variety of studies on the proposed analysis. The experimental results show that, within the same time budget, the precision of the analysis is higher than that of a state-of-the-art highly-precise points-to analysis. In addition, the use of method summaries can lead to significant improvements in analysis performance.
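The hypothetical snippet below (identifiers invented for illustration) shows the kind of single alias query a demand-driven analysis answers without first computing whole-program points-to sets:

```java
// Hypothetical client of an alias query; the classes and the query named in
// the comments are illustrative only.
public class AliasQueryExample {
    static class Box { Object f; }

    static Object g;

    static void store(Box b, Object o) { b.f = o; }

    static void example(Box x, Box y) {
        Object o = new Object();
        store(x, o);          // a store through one calling context
        g = y.f;              // does g alias o?
        // A demand-driven analysis answers the single query alias(o, g) by
        // checking CFL-reachability between the two variables, matching call
        // and return edges (context sensitivity) and field store/load edges
        // (field sensitivity), instead of computing points-to sets for the
        // whole program first.
    }

    public static void main(String[] args) {
        Box b = new Box();
        example(b, b);        // here x and y alias, so the answer is "may alias"
    }
}
```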

Path- and Index-sensitive String Analysis Based on Monadic Second-order Logic
Takaaki Tateishi, Marco Pistoia, and Omer Tripp
(IBM Research Tokyo, Japan; IBM Research Watson, USA; IBM Software Group, USA; Tel Aviv University, Israel)
We propose a novel technique for statically verifying the strings generated by a program. The verification is conducted by encoding the program in Monadic Second-Order Logic (M2L). We use M2L to describe constraints among program variables and to abstract built-in string operations. Once we encode a program in M2L, a theorem prover for M2L, such as MONA, can automatically check if a string generated by the program satisfies a given specification, and if not, exhibit a counterexample. With this approach, we can naturally encode relationships among strings, accounting also for cases in which a program manipulates strings using indices. In addition, our string analysis is path sensitive in that it accounts for the effects of string and Boolean comparisons, as well as regular-expression matches.
We have implemented our string-analysis algorithm, and used it to augment an industrial security analysis for Web applications by automatically detecting and verifying sanitizers---methods that eliminate malicious patterns from untrusted strings, making those strings safe to use in security-sensitive operations. On the 8 benchmarks we analyzed, our string analyzer discovered 128 previously unknown sanitizers, compared to 71 sanitizers detected by a previously presented string analysis.
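For readers unfamiliar with the term, the hypothetical method below (not one of the 128 discovered sanitizers) shows the general shape of a sanitizer that such an analysis would try to detect and verify:

```java
// Illustrative sanitizer of the kind the analysis tries to detect and verify;
// the method and class names are invented.
public class SanitizerExample {
    // Removes characters that could break out of an HTML tag or attribute
    // context, making the untrusted string safe for that specific sink.
    static String sanitizeForHtml(String untrusted) {
        return untrusted.replaceAll("[<>\"']", "");
    }

    public static void main(String[] args) {
        System.out.println(sanitizeForHtml("<script>alert('x')</script>"));
        // prints: scriptalert(x)/script
    }
}
```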

Saving the World Wide Web from Vulnerable JavaScript
Salvatore Guarnieri, Marco Pistoia, Omer Tripp, Julian Dolby, Stephen Teilhet, and Ryan Berg
(IBM Research Watson, USA; University of Washington, USA; IBM Software Group, USA; Tel Aviv University, Israel)
JavaScript is the most popular client-side scripting language for Web applications. Exploitable JavaScript code exposes end users to integrity and confidentiality violations. Client-side vulnerabilities can cost an enterprise money and reputation, and cause serious damage to innocent users of the Web application. In spite of all this, recent research in the area of information-flow security has focused more on other languages that are more suitable for server-side programming, such as Java.
Static analysis of JavaScript code is very challenging due to the dynamic nature of the language. This paper presents Actarus, a novel, product-quality static taint analysis for JavaScript that scales to large programs and soundly models all the JavaScript constructs with the exception of reflective calls. This paper discusses the experimental results obtained by running Actarus on a collection of 9,726 Web pages obtained by crawling the 50 most visited Web sites worldwide as well as 19 other popular Web sites. The results expose 526 vulnerabilities in 11 sites. Those vulnerabilities, if exploited, can allow malicious JavaScript code execution.

Geometric Encoding: Forging the High Performance Context Sensitive Points-to Analysis for Java
Xiao Xiao and Charles Zhang
(Hong Kong University of Science and Technology, China)
Context-sensitive points-to analysis suffers from scalability problems. We present geometric encoding to capture the redundancy in points-to analysis. Compared to BDD and EPA, the state of the art, geometric encoding is much more efficient in processing the encoded facts, especially for higher-order context sensitivity with heap cloning. We also developed two precision-preserving techniques, constraints distillation and 1-CFA SCC modeling, to further improve efficiency, in addition to a precision-performance trade-off scheme. We evaluate our points-to algorithm with two variants of the geometric encoding, Geom and HeapIns, on 15 widely cited Java benchmarks. The evaluation shows that the Geom-based algorithm is 11x and 68x faster than the worklist-based and BDD-based 1-object-sensitive analyses in Paddle, respectively, and the speedup rises to 24x and 111x when the HeapIns algorithm is used. Meanwhile, despite this efficiency, the precision is still equal to, and sometimes better than, that of the 1-object-sensitive analysis.

Faults I

Are Automated Debugging Techniques Actually Helping Programmers?
Chris Parnin and Alessandro Orso
(Georgia Tech, USA)
Debugging is notoriously difficult and extremely time consuming. Researchers have therefore invested a considerable amount of effort in developing automated techniques and tools for supporting various debugging tasks. Although potentially useful, most of these techniques have yet to demonstrate their practical effectiveness. One common limitation of existing approaches, for instance, is their reliance on a set of strong assumptions on how developers behave when debugging (e.g., the fact that examining a faulty statement in isolation is enough for a developer to understand and fix the corresponding bug). In more general terms, most existing techniques just focus on selecting subsets of potentially faulty statements and ranking them according to some criterion. By doing so, they ignore the fact that understanding the root cause of a failure typically involves complex activities, such as navigating program dependencies and rerunning the program with different inputs. The overall goal of this research is to investigate how developers use and benefit from automated debugging tools through a set of human studies. As a first step in this direction, we perform a preliminary study on a set of developers by providing them with an automated debugging tool and two tasks to be performed with and without the tool. Our results provide initial evidence that several assumptions made by automated debugging techniques do not hold in practice. Through an analysis of the results, we also provide insights on potential directions for future work in the area of automated debugging.

On the Influence of Multiple Faults on Coverage-Based Fault Localization
Nicholas DiGiuseppe and James A. Jones
(UC Irvine, USA)
This paper presents an empirical study on the effects of the quantity of faults on statistical, coverage-based fault localization techniques. It was formerly believed that the effectiveness of fault-localization techniques was inversely proportional to the quantity of faults. In an attempt to verify this belief, we conducted a study on three programs varying in size on more than 13,000 multiple-fault versions. We found that the influence of multiple faults (1) was not as great as expected, (2) created a negligible effect on the effectiveness of the fault localization, and (3) was often even complementary to the fault-localization effectiveness. In general, even in the presence of many faults, at least one fault was found by the fault-localization technique with high effectiveness. We also found that some faults were localizable regardless of the presence of other faults, whereas other faults’ ability to be found by these techniques varied greatly in the presence of other faults. Because almost all real-world software contains multiple faults, these results impact the use of statistical fault-localization techniques and provide a greater understanding of their potential in practice.

Minimizing Reproduction of Software Failures
Martin Burger and Andreas Zeller
(Saarland University, Germany)
A program fails. What now? Taking a single failing run, we record and minimize the interaction between objects to the set of calls relevant for the failure. The result is a minimal unit test that faithfully reproduces the failure at will: "Out of these 14,628 calls, only 2 are required". In a study of 17 real-life bugs, our JINSI prototype reduced the search space to 13.7 % of the dynamic slice or 0.22 % of the source code, with only 1-12 calls left to examine.
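The end result can be pictured as an ordinary JUnit test containing only the failure-relevant calls; the sketch below is hypothetical (a toy class with a seeded bug), not one of the 17 studied bugs:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical illustration of what a minimized, failure-reproducing unit test
// looks like: of all recorded interactions, only the relevant calls remain.
public class MinimizedReproductionTest {
    // Toy class standing in for the component under test.
    static class Counter {
        private int value;
        void add(int delta) { value += delta; }
        void reset() { value = 1; }          // seeded bug: should set value to 0
        int get() { return value; }
    }

    @Test
    public void reproducesFailure() {
        Counter c = new Counter();
        c.reset();                           // kept call 1 of 2
        assertEquals(0, c.get());            // kept call 2 of 2: fails, exposing the bug
    }
}
```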

Detecting Anomalies in the Order of Equally-typed Method Arguments
Michael Pradel and Thomas R. Gross
(ETH Zurich, Switzerland)
In statically-typed programming languages, the compiler ensures that method arguments are passed in the expected order by checking the type of each argument. However, calls to methods with multiple equally-typed parameters slip through this check. The uncertainty about the correct argument order of equally-typed arguments can cause various problems, for example, if a programmer accidentally reverses two arguments. We present an automated, static program analysis that detects such problems without any input except for the source code of a program. The analysis leverages the observation that programmer-given identifier names convey information about the semantics of arguments, which can be used to assign equally-typed arguments to their expected position. We evaluate the approach with a large corpus of Java programs and show that our analysis finds relevant anomalies with a precision of 76%.
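A minimal hypothetical example of the anomaly class (names invented for illustration):

```java
// Hypothetical example of the kind of anomaly the analysis flags: two
// equally-typed parameters whose names suggest the call passes them reversed.
public class ArgumentOrderExample {
    static void resize(int width, int height) {
        System.out.println(width + "x" + height);
    }

    public static void main(String[] args) {
        int width = 400;
        int height = 300;
        // Both arguments are ints, so the compiler accepts either order; a
        // name-based analysis notices that "height" flows into the parameter
        // named "width" (and vice versa) and reports a likely swap.
        resize(height, width);   // prints 300x400, revealing the swap at run time
    }
}
```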

Combinatorial and Random Testing

Feedback Driven Adaptive Combinatorial Testing
Emine Dumlu, Cemal Yilmaz, Myra B. Cohen, and Adam Porter
(Sabanci University, Turkey; University of Nebraska at Lincoln, USA; University of Maryland, USA)
The configuration spaces of modern software systems are too large to test exhaustively. Combinatorial interaction testing (CIT) approaches, such as covering arrays, systematically sample the configuration space and test only the selected configurations. The basic justification for CIT approaches is that they can cost-effectively exercise all system behaviors caused by the settings of t or fewer options. We conjecture, however, that in practice many such behaviors are not actually tested because of masking effects -- failures that perturb execution so as to prevent some behaviors from being exercised. In this work we present a feedback-driven, adaptive, combinatorial testing approach aimed at detecting and working around masking effects. At each iteration we detect potential masking effects, heuristically isolate their likely causes, and then generate new covering arrays that allow previously masked combinations to be tested in the subsequent iteration. We empirically assess the effectiveness of the proposed approach on two large widely used open source software systems. Our results suggest that masking effects do exist and that our approach provides a promising and efficient way to work around them.

Using Binary Decision Diagrams for Combinatorial Test Design
Itai Segall, Rachel Tzoref-Brill, and Eitan Farchi
(IBM Research Haifa, Israel)
Combinatorial test design (CTD) is an effective test planning technique that reveals faulty feature interaction in a given system. The test space is modeled by a set of parameters, their respective values, and restrictions on the value combinations. A subset of the test space is then automatically constructed so that it covers all valid value combinations of every t parameters, where t is a user input. Various combinatorial testing tools exist, implementing different approaches to finding a set of tests that satisfies t-wise coverage. However, little consideration has been given to the process of defining the test space for CTD, which is usually a manual, labor-intensive, and error-prone effort. Potential errors include missing parameters and their values, wrong identification of parameters and of valid value combinations, and errors in the definition of restrictions that cause them not to capture the intended combinations. From our experience, lack of support for the test space definition process is one of the main obstacles in applying CTD to a wide range of testing domains.
In this work, we present a Cartesian product based methodology and technology that assist in defining a complete and consistent test space for CTD. We then show how using binary decision diagrams (BDDs) to represent and build the test space dramatically increases the scalability of our approach, making it applicable to large and complex real-life test design tasks, for which explicit representation of the test space is infeasible.
Finally, we show how BDDs can also be used to solve the CTD problem itself. We present a new and highly effective BDD-based approach for solving CTD, which finds a set of tests that satisfies t-wise coverage by subset selection. Our approach also supports advanced requirements, such as requirements on the distribution of values in the selected tests. We apply our algorithm to real-life testing problems of varying complexity, and show its superior performance.
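For readers new to CTD, the small sketch below illustrates what t-wise (here pairwise, t = 2) coverage means; it is a coverage checker over an invented 2x2x2 test space, not the paper's BDD-based construction algorithm:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of pairwise coverage over an invented test space; the
// parameters, values, and tests below are illustrative only.
public class PairwiseCoverageCheck {
    static final String[][] DOMAINS = {
        {"Win", "Linux"},          // parameter 0: OS
        {"Firefox", "Chrome"},     // parameter 1: Browser
        {"MySQL", "Postgres"}      // parameter 2: DB
    };
    // Four tests are enough to cover every value pair of this 2x2x2 space.
    static final String[][] TESTS = {
        {"Win",   "Firefox", "MySQL"},
        {"Win",   "Chrome",  "Postgres"},
        {"Linux", "Firefox", "Postgres"},
        {"Linux", "Chrome",  "MySQL"}
    };

    public static void main(String[] args) {
        Set<String> covered = new HashSet<>();
        for (String[] test : TESTS)
            for (int i = 0; i < test.length; i++)
                for (int j = i + 1; j < test.length; j++)
                    covered.add(i + "=" + test[i] + " & " + j + "=" + test[j]);
        int required = 0;
        for (int i = 0; i < DOMAINS.length; i++)
            for (int j = i + 1; j < DOMAINS.length; j++)
                required += DOMAINS[i].length * DOMAINS[j].length;
        System.out.println("covered " + covered.size() + " of " + required + " pairs");
        // prints: covered 12 of 12 pairs
    }
}
```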

Adaptive Random Testing: An Illusion of Effectiveness?
Andrea Arcuri and Lionel Briand
(Simula Research Laboratory, Norway; University of Oslo, Norway)
Adaptive Random Testing (ART) has been proposed as an enhancement to random testing, based on assumptions on how failing test cases are distributed in the input domain. The main assumption is that failing test cases are usually grouped into contiguous regions. Many papers have been published in which ART has been described as an effective alternative to random testing when using the average number of test case executions needed to find a failure (F-measure). But all the work in the literature is based either on simulations or case studies with unreasonably high failure rates. In this paper, we report on the largest empirical analysis of ART in the literature, in which 3727 mutated programs and nearly ten trillion test cases were used. Results show that ART is highly inefficient even on trivial problems when accounting for distance calculations among test cases, to an extent that probably prevents its practical use in most situations. For example, on the infamous Triangle Classification program, random testing finds failures in a few milliseconds whereas ART execution time is prohibitive. Even when assuming a small, fixed-size test set and looking at the probability of failure (P-measure), ART only fares slightly better than random testing, which is not sufficient to make it applicable in realistic conditions. We provide precise explanations of this phenomenon based on rigorous empirical analyses. For the simpler case of single-dimension input domains, we also perform formal analyses to support our claim that ART is of little use in most situations, unless drastic enhancements are developed. Such analyses help us explain some of the empirical results and identify the components of ART that need to be improved to make it a viable option in practice.
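To see where the distance-calculation overhead comes from, here is a minimal sketch of one common ART variant (fixed-size candidate set) on a one-dimensional numeric domain; it is illustrative only and not necessarily the exact algorithm evaluated in the paper:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal sketch of fixed-size-candidate-set ART on a 1-D numeric input domain,
// showing where the per-test distance computations come from. Illustrative only.
public class FscsArtSketch {
    static final int CANDIDATES = 10;

    static double nextTest(List<Double> executed, Random rnd) {
        if (executed.isEmpty()) return rnd.nextDouble();
        double best = 0.0, bestMinDist = -1.0;
        for (int c = 0; c < CANDIDATES; c++) {
            double candidate = rnd.nextDouble();
            double minDist = Double.MAX_VALUE;
            for (double e : executed)                       // one distance per executed test
                minDist = Math.min(minDist, Math.abs(candidate - e));
            if (minDist > bestMinDist) {
                bestMinDist = minDist;
                best = candidate;
            }
        }
        return best;   // the candidate farthest from all previously executed tests
    }

    public static void main(String[] args) {
        Random rnd = new Random(0);
        List<Double> executed = new ArrayList<>();
        for (int i = 0; i < 1000; i++) executed.add(nextTest(executed, rnd));
        // Selecting the n-th test costs O(n) distance computations per candidate,
        // so total cost grows quadratically with the number of tests: this is the
        // overhead that makes ART expensive relative to plain random testing.
        System.out.println("generated " + executed.size() + " tests");
    }
}
```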

Specification and Optimization

Iterative Refinement of Specification for Component Based Embedded Systems
Muzammil Shahbaz, K. C. Shashidhar, and Robert Eschbach
(University of Sheffield, UK; MPI-SWS, Germany; Fraunhofer IESE, Germany)
The current practice of component based engineering raises concerns in industry when the specification of proprietary components suffers from inaccuracy and incompleteness. Engineers face difficulties in producing quality systems when they lack knowledge of the interoperability of components. In order to address this issue, we present a novel framework for iterative refinement of specification for component based systems. The novelty is the use of a preliminary behavioral model as a source for triggering refinement iterations. Moreover, it exploits rigorous formal techniques to achieve high-level system validation as an integral part of the refinement procedure. The framework has been evaluated on an automotive system in which the embedded software control units were developed by third-party vendors. The final results produced an improved formal system specification that identified several behaviors that were previously unknown.

Using Automatic Persistent Memoization to Facilitate Data Analysis Scripting
Philip J. Guo and Dawson Engler
(Stanford University, USA)
Programmers across a wide range of disciplines (e.g., bioinformatics, neuroscience, econometrics, finance, data mining, information retrieval, machine learning) write scripts to parse, transform, process, and extract insights from data. To speed up iteration times, they split their analyses into stages and write extra code to save the intermediate results of each stage to files so that those results do not have to be re-computed in every subsequent run. As they explore and refine hypotheses, their scripts often create and process lots of intermediate data files. They need to properly manage the myriad of dependencies between their code and data files, or else their analyses will produce incorrect results.
To enable programmers to iterate quickly without needing to manage intermediate data files, we added a set of dynamic analyses to the programming language interpreter so that it automatically memoizes (caches) the results of long-running pure function calls to disk, manages dependencies between code and on-disk data, and later re-uses memoized results, rather than re-executing those functions, when guaranteed safe to do so. We created an implementation for Python and show how it enables programmers to iterate faster on their data analysis scripts while writing less code and not having to manage dependencies between their code and datasets.
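A minimal sketch of the underlying idea, written in Java for consistency with the other examples (the paper's implementation targets Python; unlike it, this sketch keys only on the argument value and does not track dependencies between code and data files):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Minimal, assumption-laden sketch of persistent memoization: results of a pure,
// long-running function are cached on disk so later runs can reuse them.
public class PersistentMemoExample {
    static final Path CACHE_DIR = Paths.get("memo-cache");

    // Stand-in for an expensive, deterministic analysis stage.
    static long expensiveStage(long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) sum += i * i;
        return sum;
    }

    static long memoizedStage(long n) throws IOException {
        Files.createDirectories(CACHE_DIR);
        Path cacheFile = CACHE_DIR.resolve("expensiveStage-" + n + ".txt");
        if (Files.exists(cacheFile)) {                       // reuse a previous result
            return Long.parseLong(Files.readString(cacheFile).trim());
        }
        long result = expensiveStage(n);                     // compute and persist
        Files.writeString(cacheFile, Long.toString(result));
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(memoizedStage(2_000_000L));       // computed on the first run
        System.out.println(memoizedStage(2_000_000L));       // served from disk afterwards
    }
}
```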

CoDeSe: Fast Deserialization via Code Generation
Milos Gligoric, Darko Marinov, and Sam Kamin
(University of Illinois at Urbana-Champaign, USA)
Many tools for automated testing, model checking, and debugging store and restore program states multiple times. Storing/restoring a program state is commonly done with serialization/deserialization. Traditionally, the format for stored states is based on data: serialization generates the data that encodes the state, and deserialization interprets this data to restore the state. We propose a new approach, called CoDeSe, where the format for stored states is based on code: serialization generates code whose execution restores the state, and deserialization simply executes the code. We implemented CoDeSe in Java and performed a number of experiments on deserialization of states. CoDeSe provides on average more than 6X speedup over the highly optimized deserialization from the standard Java library. Our new format also allows simple parallel deserialization that can provide additional speedup on top of the sequential CoDeSe but only for larger states.
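The contrast can be sketched as follows (hypothetical generated code; CoDeSe's actual output format may differ): instead of interpreting a byte stream that describes the state, deserialization simply runs generated code that rebuilds it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative contrast between data-based and code-based state restoration.
public class CodeBasedDeserializationExample {
    // Data-based deserialization interprets a byte stream describing the state.
    // Code-based deserialization instead executes generated code like this,
    // which the compiler/JIT can optimize, avoiding interpretation overhead:
    static List<String> restoreState() {
        List<String> state = new ArrayList<>(3);
        state.add("alpha");
        state.add("beta");
        state.add("gamma");
        return state;
    }

    public static void main(String[] args) {
        System.out.println(restoreState());   // prints [alpha, beta, gamma]
    }
}
```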

Faults II

Selecting Peers for Execution Comparison
William N. Sumner, Tao Bao, and Xiangyu Zhang
(Purdue University, USA)
Execution comparison is becoming more common as a means of debugging faulty programs or simply explaining program behavior. Often, such as when debugging, the goal is to understand particular aspects of a single execution, and it is not immediately clear against what we should compare this execution. Prior work has led to approaches for acquiring a second execution, or peer, with which to compare the first. The earliest of these searched test suites for suitable candidates. More recent work focuses on synthesizing a new execution, either by generating new input for the program or by directly mutating the execution itself. In spite of these proposals, it is unclear what advantages these different techniques for finding peers might have over each other. In this paper, we implement five different existing techniques and examine their impact on 20 real bugs. These bugs represent the full set of reported bugs for three programs during one year. We present a metric to evaluate the quality of the peers. It is based on the similarity of the peers to the executions of the patched programs. We also discuss in detail the different scenarios where these techniques hold advantages.

Generating Analyses for Detecting Faults in Path Segments
Wei Le and Mary Lou Soffa
(University of Virginia, USA)
Although static bug detectors are extensively applied, there is a cost in using them. One challenge is that static analysis often reports a large number of false positives but little diagnostic information. Also, individual bug detectors need to be built in response to new types of faults, and tuning a static tool for precision and scalability is time-consuming. This paper presents a novel framework that automatically generates scalable, interprocedural, path-sensitive analyses to detect user-specified faults. The framework consists of a specification technique that expresses faults and the information needed for their detection, a scalable, path-sensitive algorithm, and a generator that unifies the two. The analysis produced identifies not only faults but also the path segments where the root causes of a fault are located. The framework is general enough to handle both data- and control-centric faults. We implemented our framework and generated fault detectors for identifying buffer overflows, integer violations, null-pointer dereferences, and memory leaks. We experimentally demonstrate that the generated analyses scale to large deployed software, and that their detection capability is comparable to that of tools targeting a specific type of fault. In our experiments, we identify a total of 146 faults of the four types. While the length of path segments for the majority of faults is 1-4 procedures, we are able to detect faults deeply embedded in the code, spanning as many as 35 procedures.

Characterizing Failure-Causing Parameter Interactions by Adaptive Testing
Zhiqiang Zhang and Jian Zhang
(Chinese Academy of Sciences, China)
Combinatorial testing is a widely used black-box testing technique, which is used to detect failures caused by parameter interactions (we call them faulty interactions). Traditional combinatorial testing techniques provide fault detection, but most of them provide weak fault diagnosis. In this paper, we propose a new fault characterization method called “faulty interaction characterization” (FIC) and its binary search alternative FIC_BS to locate one failure-causing interaction in a single failing test case. In addition, we provide a tradeoff strategy for locating multiple faulty interactions in one test case. Our methods are based on adaptive black-box testing, in which test cases are generated based on the outcomes of previous tests. For locating a t-way faulty interaction, the number of test cases used is at most k (for FIC) or t(⌈log₂k⌉+1)+1 (for FIC_BS), where k is the number of parameters. Simulation experiments show that our method needs a smaller number of adaptive test cases than most existing methods for locating randomly generated faulty interactions, yet it has a stronger or equivalent ability to locate faulty interactions.
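A rough sketch of the one-parameter-at-a-time idea behind FIC, under the simplifying assumptions of a single faulty interaction and no masking (the fault model, parameters, and values below are invented):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Rough, simplified sketch of one-parameter-at-a-time isolation of a faulty
// interaction; not the paper's exact FIC or FIC_BS algorithm.
public class FicSketch {
    public static void main(String[] args) {
        // 5 parameters, each with values 0..2; the (hidden) faulty interaction is
        // "parameter 1 = 2 AND parameter 3 = 0".
        Predicate<int[]> fails = t -> t[1] == 2 && t[3] == 0;

        int[] failing = {1, 2, 2, 0, 1};         // a failing test case to start from
        List<Integer> suspicious = new ArrayList<>();
        for (int p = 0; p < failing.length; p++) {
            int original = failing[p];
            failing[p] = (original + 1) % 3;     // mutate one parameter
            if (!fails.test(failing)) {          // failure disappeared:
                suspicious.add(p);               // p is part of the faulty interaction
                failing[p] = original;           // restore it before continuing
            }
            // otherwise p is not needed for the failure; keep the mutated value
        }
        System.out.println("faulty interaction involves parameters " + suspicious);
        // prints: faulty interaction involves parameters [1, 3], using 5 tests
    }
}
```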

The Use of Mutation in Testing Experiments and its Sensitivity to External Threats
Akbar Siami Namin and Sahitya Kakarla
(Texas Tech University, USA)
Mutation analysis has emerged as a standard approach for the empirical assessment of testing techniques. Test practitioners judge the cost-effectiveness of testing strategies by the number of mutants the techniques detect. Though it adds fundamental rigor to empirical software testing, the use of mutants in the absence of real-world faults has raised the concern of whether mutants and real faults exhibit similar properties. This paper revisits this important concern and disseminates interesting findings regarding mutants and whether these synthetic faults can predict the fault detection ability of test suites. The results of the controlled experiments conducted in this paper show that mutation, when used in testing experiments, is highly sensitive to external threats caused by several influential factors, including mutation operators, test suite size, and programming language. This paper raises awareness of these threats and suggests that any interpretation or generalization of experimental findings based on mutation should be justified with respect to the influential factors involved.

Test Generation II

Combined Static and Dynamic Automated Test Generation
Sai Zhang, David Saff, Yingyi Bu, and Michael D. Ernst
(University of Washington, USA; Google Inc., USA; UC Irvine, USA)
In an object-oriented program, a unit test often consists of a sequence of method calls that create and mutate objects, then use them as arguments to a method under test. It is challenging to automatically generate sequences that are legal and behaviorally-diverse, that is, reaching as many different program states as possible.
This paper proposes a combined static and dynamic automated test generation approach to address these problems, for code without a formal specification. Our approach first uses dynamic analysis to infer a call sequence model from a sample execution, then uses static analysis to identify method dependence relations based on the fields they may read or write. Finally, both the dynamically-inferred model (which tends to be accurate but incomplete) and the statically-identified dependence information (which tends to be conservative) guide a random test generator to create legal and behaviorally-diverse tests.
Our Palus tool implements this testing approach. We compared its effectiveness with a pure random approach, a dynamic-random approach (without a static phase), and a static-random approach (without a dynamic phase) on several popular open-source Java programs. Tests generated by Palus achieved higher structural coverage and found more bugs.
Palus is also used internally at Google. It has found 22 previously unknown bugs in four well-tested Google products.

Generating Parameterized Unit Tests
Gordon Fraser and Andreas Zeller
(Saarland University, Germany)
State-of-the art techniques for automated test generation focus on generating executions that cover program behavior. As they do not generate oracles, it is up to the developer to figure out what a test does and how to check the correctness of the observed behavior. In this paper, we present an approach to generate parameterized unit tests - unit tests containing symbolic pre- and postconditions characterizing test input and test result. Starting from concrete inputs and results, we use test generation and mutation to systematically generalize pre- and postconditions while simplifying the computation steps. Evaluated on five open source libraries, the generated parameterized unit tests are (a) more expressive, characterizing general rather than concrete behavior; (b) need fewer computation steps, making them easier to understand; and (c) achieve a higher coverage than regular unit tests.
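The difference between a concrete test and its parameterized generalization can be sketched as follows (hypothetical tests written against java.util.LinkedList; a plain driver method stands in for a parameterized test runner so the sketch stays self-contained):

```java
import static org.junit.Assert.assertEquals;
import java.util.LinkedList;
import org.junit.Assume;
import org.junit.Test;

// Hypothetical sketch of a concrete unit test versus its parameterized form
// with an explicit precondition (assumption) and postcondition (assertion).
public class ParameterizedUnitTestSketch {
    // Concrete test: fixed input, fixed expected output.
    @Test
    public void concreteTest() {
        LinkedList<Integer> list = new LinkedList<>();
        list.add(23);
        assertEquals(23, (int) list.removeFirst());
    }

    // Parameterized form: the precondition and postcondition characterize the
    // behavior for any supplied value, not just 23.
    public void parameterizedTest(Integer value) {
        Assume.assumeTrue(value != null);          // precondition
        LinkedList<Integer> list = new LinkedList<>();
        list.add(value);
        assertEquals(value, list.removeFirst());   // postcondition
    }

    @Test
    public void drivesParameterizedTest() {
        // Minimal driver so the sketch runs without a theories/parameterized runner.
        parameterizedTest(23);
        parameterizedTest(-5);
    }
}
```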

High Coverage Testing of Haskell Programs
Tristan Allwood, Cristian Cadar, and Susan Eisenbach
(Imperial College London, UK)
This paper presents a new lightweight technique for automatically generating high coverage test suites for Haskell library code. Our approach combines four main features to increase test coverage: (1) automatically inferring the constructors and functions needed to generate test data; (2) using needed narrowing to take advantage of Haskell's lazy evaluation semantics; (3) inspecting elements inside returned data structures through the use of case statements, and (4) efficiently handling polymorphism by lazily instantiating all possible instances.
We have implemented this technique in Irulan, a fully automatic tool for systematic black-box unit testing of Haskell library code. We have designed Irulan to generate high coverage test suites and detect common programming errors in the process. We have applied Irulan to over 50 programs from the spectral and real suites of the nofib benchmark and show that it can effectively generate high-coverage test suites---exhibiting 70.83% coverage for spectral and 59.78% coverage for real---and find errors in these programs.
Our techniques are general enough to be useful for several other types of testing, and we also discuss our experience of using Irulan for property and regression testing.
