July 15th-20th, 2012, Minneapolis, MN, USA
Powered by
Conference Publishing Consulting

2012 International Symposium on Software Testing and Analysis (ISSTA), July 15–20, 2012, Minneapolis, MN, USA

ISSTA 2012 – Proceedings

Contents - Abstracts - Authors
Online Calendar - iCal File


Title Page

It is our great pleasure to welcome you to Minneapolis for ISSTA 2012, the 21st International Symposium on Software Testing and Analysis. ISSTA is the leading research conference in software testing and analysis and brings together academics, industrial researchers, and practitioners to exchange new ideas, problems, and experi- ence on how to analyze and test software systems. The ISSTA 2012 program includes keynotes, technical papers, workshops, and the University of Minnesota Summer Soft- ware Symposium, a forum targeting the local medical device industry.


Dynamic Analysis

RefaFlex: Safer Refactorings for Reflective Java Programs
Andreas Thies and Eric Bodden
(Fernunversität in Hagen, Germany; TU Darmstadt, Germany)
If programs access types and members through reflection, refactoring tools cannot guarantee that refactorings on those programs are behavior preserving. Refactoring approaches for highly reflective languages like Smalltalk therefore check behavior preservation using regression testing. In this paper we propose RefaFlex, a novel and more defensive approach towards the refactoring of reflective (Java) programs. RefaFlex uses a dynamic program analysis to log reflective calls during test runs and then uses this information to proactively prevent the programmer from executing refactorings that could otherwise alter the program's behavior. This makes re-running test cases obsolete: when a refactoring is permitted, tests passing originally are guaranteed to pass for the refactored program as well. In some cases, we further re-write reflective calls, permitting refactorings that would otherwise have to be rejected. We have implemented RefaFlex as an open source Eclipse plugin and offer extensions for six Eclipse refactoring tools addressing naming, typing, and accessibility issues. Our evaluation with 21,524 refactoring runs on three open source programs shows that our tool successfully prevents 1,358 non-behaviour-preserving refactorings which the plain Eclipse refactorings would have incorrectly permitted.
Article Search
THeME: A System for Testing by Hardware Monitoring Events
Kristen Walcott-Justice, Jason Mars, and Mary Lou Soffa
(University of Virginia, USA)
The overhead of test coverage analysis is dominated by monitoring the application, which is traditionally performed using instrumentation. However, instrumentation can prohibitively increase the time and especially the memory overhead of an application. As an alternative to instrumentation, we explore how recent hardware advances can be leveraged to improve the overheads of test coverage analysis. These hardware advances include hardware performance monitors and multicore technology. In this work, we present our system, THeME, a testing framework that replaces instrumentation with hardware monitoring. THeME consists of a runtime system that takes advantage of hardware mechanisms and multiple cores and a static component to further extend the coverage derived from hardware event sampling. The results show that up to 90% of the actual coverage can be determined with less time overhead and negligible code growth compared to instrumentation.
Article Search
Multi-slicing: A Compiler-Supported Parallel Approach to Data Dependence Profiling
Hongtao Yu and Zhiyuan Li
(Purdue University, USA)
Retrofitting existing software for the increasingly dominant multicore microprocessors has a strong appeal from the economic point of view. One of the key issues in such an effort is to fully understand the data dependences in the existing software. Unfortunately, current compilers have quite limited ability to analyze data dependences. Therefore, execution-driven data dependence profiling has gained significant interest because it can resolve memory access ambiguity exactly during program execution, which allows data dependences to be analyzed exactly. Although such dependence profiling is valid for specific inputs only, the insight it provides can be highly valuable to software engineers in their parallelization effort. On the other hand, dependence profiling itself can take tremendous memory and machine time. In this paper, we propose a novel dependence profiling method which, with the support of several new compiler and runtime techniques, partitions the profiling task into many independent slices, each requiring significantly less memory. Different slices can be profiled in parallel, producing subgraphs which are eventually combined automatically into the complete data dependence graph by the compiler. The slices can be extracted with different degrees of granularity. Experiments show that, for several well-known benchmark programs, our parallel scheme shortens the profiling time by a few orders of magnitude.
Article Search

Web Applications

Remedying the Eval that Men Do
Simon Holm Jensen, Peter A. Jonsson, and Anders Møller
(Aarhus University, Denmark)
A range of static analysis tools and techniques have been developed in recent years with the aim of helping JavaScript web application programmers produce code that is more robust, safe, and efficient. However, as shown in a previous large-scale study, many web applications use the JavaScript eval function to dynamically construct code from text strings in ways that obstruct existing static analyses. As a consequence, the analyses either fail to reason about the web applications or produce unsound or useless results. We present an approach to soundly and automatically transform many common uses of eval into other language constructs to enable sound static analysis of web applications. By eliminating calls to eval, we expand the applicability of static analysis for JavaScript web applications in general. The transformation we propose works by incorporating a refactoring technique into a dataflow analyzer. We report on our experimental results with a small collection of programming patterns extracted from popular web sites. Although there are inevitably cases where the transformation must give up, our technique succeeds in eliminating many nontrivial occurrences of eval.
Article Search
State Aware Test Case Regeneration for Improving Web Application Test Suite Coverage and Fault Detection
Nadia Alshahwan and Mark Harman
(University College London, UK)
This paper introduces two test cases regeneration approaches for web applications, one uses standard Def-Use testing but for state variables, the other uses a novel value-aware dataflow approach. Our overall approach is to combine requests from a test suite to form client-side request sequences, based on dataflow analysis of server-side session variables and database tables. We implemented our approach as a tool SART (State Aware Regeneration Tool) and used it to evaluate our proposed approaches on 4 real world web applications. Our results show that for all 4 applications, both server-side coverage and fault detection were statistically significantly improved. Even on relatively high quality test suites our algorithms improve average coverage by 14.74% and fault detection by 9.19%.
Article Search
ViewPoints: Differential String Analysis for Discovering Client- and Server-Side Input Validation Inconsistencies
Muath Alkhalaf, Shauvik Roy Choudhary, Mattia Fazzini, Tevfik Bultan, Alessandro Orso, and Christopher Kruegel
(UC Santa Barbara, USA; Georgia Tech, USA)
Since web applications are easily accessible, and often store a large amount of sensitive user information, they are a common target for attackers. In particular, attacks that focus on input validation vulnerabilities are extremely effective and dangerous. To address this problem, we developed ViewPoints--a technique that can identify erroneous or insufficient validation and sanitization of the user inputs by automatically discovering inconsistencies between client- and server-side input validation functions. Developers typically perform redundant input validation in both the front-end (client) and the back-end (server) components of a web application. Client- side validation is used to improve the responsiveness of the application, as it allows for responding without communicating with the server, whereas server-side validation is necessary for security reasons, as malicious users can easily circumvent client-side checks. ViewPoints (1) automatically extracts client- and server-side input validation functions, (2) models them as deterministic finite automata (DFAs), and (3) compares client- and server-side DFAs to identify and report the inconsistencies between the two sets of checks. Our initial evaluation of the technique is promising: when applied to a set of real-world web applications, ViewPoints was able to automatically identify a large number of inconsistencies in their input validation functions.
Article Search

Test Generation

Search-Based System Testing: High Coverage, No False Alarms
Florian Gross, Gordon Fraser, and Andreas Zeller
(Saarland University, Germany)
Modern test case generation techniques can automatically achieve high code coverage. If they operate on the unit level, they run the risk of generating inputs infeasible in reality, which, when causing failures, are painful to identify and eliminate. Running a unit test generator on five open source Java programs, we found that all of the 181 reported failures were false failures—that is, indicating a problem in the generated test case rather than the program. By generating test cases at the GUI level, our EXSYST prototype can avoid such false alarms by construction. In our evaluation, it achieves higher coverage than search-based test generators at the unit level; yet, every failure can be shown to be caused by a real sequence of input events. Whenever a system interface is available, we recommend considering search-based system testing as an alternative to avoid false failures.
Article Search
Swarm Testing
Alex Groce, Chaoqiang Zhang, Eric Eide, Yang Chen, and John Regehr
(Oregon State University, USA; University of Utah, USA)
Swarm testing is a novel and inexpensive way to improve the diversity of test cases generated during random testing. Increased diversity leads to improved coverage and fault detection. In swarm testing, the usual practice of potentially including all features in every test case is abandoned. Rather, a large “swarm” of randomly generated configurations, each of which omits some features, is used, with configurations receiving equal resources. We have identified two mechanisms by which feature omission leads to better exploration of a system’s state space. First, some features actively prevent the system from executing interesting behaviors; e.g., “pop” calls may prevent a stack data structure from executing a bug in its overflow detection logic. Second, even when there is no active suppression of behaviors, test features compete for space in each test, limiting the depth to which logic driven by features can be explored. Experimental results show that swarm testing increases coverage and can improve fault detection dramatically; for example, in a week of testing it found 42% more distinct ways to crash a collection of C compilers than did the heavily hand-tuned default configuration of a random tester.
Article Search
Compositional Load Test Generation for Software Pipelines
Pingyu Zhang, Sebastian Elbaum, and Matthew B. Dwyer
(University of Nebraska-Lincoln, USA)
Load tests validate whether a system’s performance is acceptable under extreme conditions. Traditional load testing approaches are black-box, inducing load by increasing the size or rate of the input. Symbolic execution based load testing techniques complement traditional approaches by enabling the selection of precise input values. However, as the programs under analysis or their required inputs increase in size, the analyses required by these techniques either fail to scale up or sacrifice test effectiveness. We propose a new approach that addresses this limitation by performing load test generation compositionally. It uses existing symbolic execution based techniques to analyze the performance of each system component in isolation, summarizes the results of those analyses, and then performs an analysis across those summaries to generate load tests for the whole system. In its current form, the approach can be applied to any system that is structured in the form of a software pipeline. A study of the approach revealed that it can generate effective load tests for Unix and XML pipelines while outperforming state-of-the-art techniques.
Article Search
Combining Model-Based and Combinatorial Testing for Effective Test Case Generation
Cu D. Nguyen, Alessandro Marchetto, and Paolo Tonella
(Fondazione Bruno Kessler, Italy)
Model-based testing relies on the assumption that effective adequacy criteria can be defined in terms of model coverage achieved by a set of test paths. However, such test paths are only abstract test cases and input test data must be specified to make them concrete. We propose a novel approach that combines model-based and combinatorial testing in order to generate executable and effective test cases from a model. Our approach starts from a finite state model and applies model-based testing to generate test paths that represent sequences of events to be executed against the system under test. Such paths are transformed to classification trees, enriched with domain input specifications such as data types and partitions. Finally, executable test cases are generated from those trees using t-way combinatorial criteria. While test cases that satisfy a combinatorial criterion can be generated for each individual test path obtained from the model, we introduce a post-optimization algorithm that can guarantee the combinatorial criterion of choice on the whole set of test paths extracted from the model. The resulting test suite is smaller, but it still satisfies the same adequacy criterion. We developed a tool and used it to evaluate our approach on 6 subject systems of various types and sizes, to study the effectiveness of the generated test suites, the reduction achieved by the post-optimization algorithm, as well as the effort required to produce them.
Article Search


A First Step Towards Algorithm Plagiarism Detection
Fangfang Zhang, Yoon-Chan Jhi, Dinghao Wu, Peng Liu, and Sencun Zhu
(Pennsylvania State University, USA; Samsung, South Korea)
In this work, we address the problem of algorithm plagiarism, which occurs when a plagiarist, violating intellectual property rights, steals others' algorithms and covertly implements them. In contrast to software plagiarism, which has been extensively studied, limited attention has been paid to algorithm plagiarism. In this paper, we propose two dynamic value-based approaches, namely N-version and annotation, for algorithm plagiarism detection. Our approaches are motivated by the observation that there exist some critical runtime values which are irreplaceable and uneliminatable for all implementations of the same algorithm. The N-version approach extracts such values by filtering out non-core values. The annotation approach leverages auxiliary information to flag important variables which contain core values. We also propose a value dependence graph based similarity metric in addition to the longest common subsequence based one, in order to address the potential value reordering attack. We have implemented a prototype and evaluated the proposed schemes on various algorithms. The results show that our approaches to algorithm plagiarism detection are practical, effective and resilient to many automatic obfuscation techniques.
Article Search
A Quantitative Study of Accuracy in System Call-Based Malware Detection
Davide Canali, Andrea Lanzi, Davide Balzarotti, Christopher Kruegel, Mihai Christodorescu, and Engin Kirda
(EURECOM, France; UC Santa Barbara, USA; IBM Research, USA; Northeastern University, USA)
Over the last decade, there has been a significant increase in the number and sophistication of malware-related attacks and infections. Many detection techniques have been proposed to mitigate the malware threat. A running theme among existing detection techniques is the similar promises of high detection rates, in spite of the wildly different models (or specification classes) of malicious activity used. In addition, the lack of a common testing methodology and the limited datasets used in the experiments make difficult to compare these models in order to determine which ones yield the best detection accuracy. In this paper, we present a systematic approach to measure how the choice of behavioral models influences the quality of a malware detector. We tackle this problem by executing a large number of testing experiments, in which we explored the parameter space of over 200 different models, corresponding to more than 220 million of signatures. Our results suggest that commonly held beliefs about simple models are incorrect in how they relate changes in complexity to changes in detection accuracy. This implies that accuracy is non-linear across the model space, and that analytical reasoning is insufficient for finding an optimal model, and has to be supplemented by testing and empirical measurements.
Article Search
Undangle: Early Detection of Dangling Pointers in Use-After-Free and Double-Free Vulnerabilities
Juan Caballero, Gustavo Grieco, Mark Marron, and Antonio Nappa
(IMDEA Software Institute, Spain)
Use-after-free vulnerabilities are rapidly growing in popularity, especially for exploiting web browsers. Use-after-free (and double-free) vulnerabilities are caused by a program operating on a dangling pointer. In this work we propose early detection, a novel runtime approach for finding and diagnosing use-after-free and double-free vulnerabilities. While previous work focuses on the creation of the vulnerability (i.e., the use of a dangling pointer), early detection shifts the focus to the creation of the dangling pointer(s) at the root of the vulnerability. Early detection increases the effectiveness of testing by identifying unsafe dangling pointers in executions where they are created but not used. It also accelerates vulnerability analysis and minimizes the risk of incomplete fixes, by automatically collecting information about all dangling pointers involved in the vulnerability. We implement our early detection technique in a tool called Undangle. We evaluate Undangle for vulnerability analysis on 8 real-world vulnerabilities. The analysis uncovers that two separate vulnerabilities in Firefox had a common root cause and that their patches did not completely fix the underlying bug. We also evaluate Undangle for testing on the Firefox web browser identifying a potential vulnerability.
Article Search

Symbolic Execution

Memoized Symbolic Execution
Guowei Yang, Corina S. Păsăreanu, and Sarfraz Khurshid
(University of Texas at Austin, USA; CMU, USA; NASA Ames Research Center, USA)
This paper introduces memoized symbolic execution (Memoise), a new approach for more efficient application of forward symbolic execution, which is a well-studied technique for systematic exploration of program behaviors based on bounded execution paths. Our key insight is that application of symbolic execution often requires several successive runs of the technique on largely similar underlying problems, e.g., running it once to check a program to find a bug, fixing the bug, and running it again to check the modified program. Memoise introduces a trie-based data structure that stores the key elements of a run of symbolic execution. Maintenance of the trie during successive runs allows re-use of previously computed results of symbolic execution without the need for re-computing them as is traditionally done. Experiments using our prototype implementation of Memoise show the benefits it holds in various standard scenarios of using symbolic execution, e.g., with iterative deepening of exploration depth, to perform regression analysis, or to enhance coverage using heuristics.
Article Search
Abstracting Path Conditions
Jan Strejček and Marek Trtík
(Masaryk University, Czech Republic)
We present a symbolic-execution-based algorithm that for a given program and a given program location in it produces a nontrivial necessary condition on input values to drive the program execution to the given location. The algorithm is based on computation of loop summaries for loops along acyclic paths leading to the target location. We also propose an application of necessary conditions in contemporary bug-finding and test-generation tools. Experimental results on several small benchmarks show that the presented technique can in some cases significantly improve performance of the tools.
Article Search
Probabilistic Symbolic Execution
Jaco Geldenhuys, Matthew B. Dwyer, and Willem Visser
(Stellenbosch University, South Africa; University of Nebraska-Lincoln, USA)
The continued development of efficient automated decision procedures has spurred the resurgence of research on symbolic execution over the past decade. Researchers have applied symbolic execution to a wide range of software analysis problems including: checking programs against contract specifications, inferring bounds on worst-case execution performance, and generating path-adequate test suites for widely used library code. In this paper, we explore the adaptation of symbolic execution to perform a more quantitative type of reasoning --- the calculation of estimates of the probability of executing portions of a program. We present an extension of the widely used Symbolic PathFinder symbolic execution system that calculates path probabilities. We exploit state-of-the-art computational algebra techniques to count the number of solutions to path conditions, yielding exact results for path probabilities. To mitigate the cost of using these techniques, we present two optimizations, PC slicing and count memoization, that significantly reduce the cost of probabilistic symbolic execution. Finally, we present the results of an empirical evaluation applying our technique to challenging library container implementations and illustrate the benefits that adding probabilities to program analyses may offer.
Article Search

Empirical Studies

A Human Study of Patch Maintainability
Zachary P. Fry, Bryan Landau, and Westley Weimer
(University of Virginia, USA)
Identifying and fixing defects is a crucial and expensive part of the software lifecycle. Measuring the quality of bug-fixing patches is a difficult task that affects both functional correctness and the future maintainability of the code base. Recent research interest in automatic patch generation makes a systematic understanding of patch maintainability and understandability even more critical. We present a human study involving over 150 participants, 32 real-world defects, and 40 distinct patches. In the study, humans perform tasks that demonstrate their understanding of the control flow, state, and maintainability aspects of code patches. As a baseline we use both human-written patches that were later reverted and also patches that have stood the test of time to ground our results. To address any potential lack of readability with machine-generated patches, we propose a system wherein such patches are augmented with synthesized, human-readable documentation that summarizes their effects and context. Our results show that machine-generated patches are slightly less maintainable than human-written ones, but that trend reverses when machine patches are augmented with our synthesized documentation. Finally, we examine the relationship between code features (such as the ratio of variable uses to assignments) with participants' abilities to complete the study tasks and thus explain a portion of the broad concept of patch quality.
Article Search
Understanding User Understanding: Determining Correctness of Generated Program Invariants
Matt Staats, Shin Hong, Moonzoo Kim, and Gregg Rothermel
(KAIST, South Korea; University of Nebraska-Lincoln, USA)
Recently, work has begun on automating the generation of test oracles, which are necessary to fully automate the testing process. One approach to such automation involves dynamic invariant generation which extracts invariants from program executions. To use such invariants as test oracles, however, it is necessary to distinguish correct from incorrect invariants, a process that currently requires human intervention. In this work we examine this process. In particular, we examine the ability of 30 users, across two empirical studies, to classify invariants generated from three Java programs. Our results indicate that users struggle to classify generated invariants: on average, they misclassify 9.1% to 31.7% of correct invariants and 26.1%-58.6% of incorrect invariants. These results contradict prior studies that suggest that classification by users is easy, and indicate that further work needs to be done to bridge the gap between the effectiveness of dynamic invariant generation in theory, and the ability of users to apply it in practice. Along these lines, we suggest several areas for future work.
Article Search
Empirical Investigation of Search Algorithms for Environment Model-Based Testing of Real-Time Embedded Software
Muhammad Zohaib Iqbal, Andrea Arcuri, and Lionel Briand
(Simula Research Laboratory, Norway; University of Oslo, Norway; University of Luxembourg, Luxembourg)
System testing of real-time embedded systems (RTES) is a challenging task and only a fully automated testing approach can scale up to the testing requirements of industrial RTES. One such approach, which offers the advantage for testing teams to be black-box, is to use environment models to automatically generate test cases and oracles and an environment simulator to enable earlier and more practical testing. In this paper, we propose novel heuristics for search-based, RTES system testing which are based on these environment models. We evaluate the fault detection effectiveness of two search-based algorithms, i.e., Genetic Algorithms and (1+1) Evolutionary Algorithm, when using these novel heuristics and their combinations. Preliminary experiments on 13 carefully selected, non-trivial artificial problems, show that, under certain conditions, these novel heuristics are effective at bringing the environment into a state exhibiting a system fault. The heuristic combination that showed the best overall performance on the artificial problems was applied on an industrial case study where it showed consistent results.
Article Search


Testing Concurrent Programs to Achieve High Synchronization Coverage
Shin Hong, Jaemin Ahn, Sangmin Park, Moonzoo Kim, and Mary Jean Harrold
(KAIST, South Korea; Georgia Tech, USA)
The effectiveness of software testing is often assessed by measuring coverage of some aspect of the software, such as its code. There is much research aimed at increasing code coverage of sequential software. However, there has been little research on increasing coverage for concurrent software. This paper presents a new technique that aims to achieve high coverage of concurrent programs by generating thread schedules to cover uncovered coverage requirements. Our technique first estimates synchronization-pair coverage requirements, and then generates thread schedules that are likely to cover uncovered coverage requirements. This paper also presents a description of a prototype tool that we implemented in Java, and the results of a set of studies we performed using the tool on a several open-source programs. The results show that, for our subject programs, our technique achieves higher coverage faster than random testing techniques; the estimation-based heuristic contributes substantially to the effectiveness of our technique.
Article Search
CARISMA: a Context-sensitive Approach to Race-condition sample-Instance Selection for Multithreaded Applications
Ke Zhai, Boni Xu, W. K. Chan, and T. H. Tse
(University of Hong Kong, Hong Kong; City University of Hong Kong, Hong Kong)
Dynamic race detectors can explore multiple thread schedules of a multithreaded program over the same input to detect data races. Although existing sampling-based precise race detectors reduce overheads effectively so that lightweight precise race detection can be performed in testing or post-deployment environments, they are ineffective in detecting races if the sampling rates are low. This paper presents CARISMA to address this problem. CARISMA exploits the insight that along an execution trace, a program may potentially handle many accesses to the memory locations created at the same site for similar purposes. Iterating over multiple execution trials of the same input, CARISMA estimates and distributes the sampling budgets among such location creation sites, and probabilistically collects a fraction of all accesses to the memory locations associated with such sites for subsequent race detection. Our experiment shows that, compared with PACER on the same platform and at the same sampling rate (such as 1%), CARISMA is significantly more effective.
Article Search
Cooperative Types for Controlling Thread Interference in Java
Jaeheon Yi, Tim Disney, Stephen N. Freund, and Cormac Flanagan
(UC Santa Cruz, USA; Williams College, USA)
Multithreaded programs are notoriously prone to unintended interference between concurrent threads. To address this problem, we argue that yield annotations in the source code should document all thread interference, and we present a type system for verifying the absence of undocumented interference in Java programs. Under this type system, well-typed programs behave as if context switches occur only at yield annotations. Thus, well-typed programs can be understood using intuitive sequential reasoning, except where yield annotations remind the programmer to account for thread interference. Experimental results show that yield annotations describe thread interference more precisely than prior techniques based on method-level atomicity specifications. In particular, yield annotations reduce the number of interference points one must reason about by an order of magnitude. The type system is also more precise than prior methods targeting race freedom. Moreover, yield annotations serve to highlight all known concurrency defects in our benchmark suite.
Article Search
Finding Errors in Multithreaded GUI Applications
Sai Zhang, Hao Lü, and Michael D. Ernst
(University of Washington, USA)
To keep a Graphical User Interface (GUI) responsive and active, a GUI application often has a main UI thread (or event dispatching thread) and spawns separate threads to handle lengthy operations in the background, such as expensive computation, I/O tasks, and network requests. Many GUI frameworks require all GUI objects to be accessed exclusively by the UI thread. If a GUI object is accessed from a non-UI thread, an invalid thread access error occurs and the whole application may abort. This paper presents a general technique to find such invalid thread access errors in multithreaded GUI applications. We formulate finding invalid thread access errors as a call graph reachability problem with thread spawning as the sources and GUI object accessing as the sinks. Standard call graph construction algorithms fail to build a good call graph for some modern GUI applications, because of heavy use of reflection. Thus, our technique builds reflection-aware call graphs. We implemented our technique and instantiated it for four popular Java GUI frameworks: SWT, the Eclipse plugin framework, Swing, and Android. In an evaluation on 9 programs comprising 89273 LOC, our technique found 5 previously-known errors and 5 new ones.
Article Search

Static Analysis

Static Memory Leak Detection Using Full-Sparse Value-Flow Analysis
Yulei Sui, Ding Ye, and Jingling Xue
(UNSW, Australia)
We introduce a static detector, Saber, for detecting memory leaks in C programs. Leveraging recent advances on sparse pointer analysis, Saber is the first to use a full-sparse value-flow analysis for leak detection. Saber tracks the flow of values from allocation to free sites using a sparse value-flow graph (SVFG) that captures def-use chains and value flows via assignments for all memory locations represented by both top-level and address-taken pointers. By exploiting field-, flow- and context-sensitivity during different phases of the analysis, Saber detects leaks in a program by solving a graph reachability problem on its SVFG. Saber, which is fully implemented in Open64, is effective at detecting 211 leaks in the 15 SPEC2000 C programs and five applications, while keeping the false positive rate at 18.5%. We have also compared Saber with Fastcheck (which analyzes allocated objects flowing only into top-level pointers) and Sparrow (which handles all allocated objects using abstract interpretation) using the 15 SPEC2000 C programs. Saber is as accurate as Sparrow but is 14.2X faster and reports 40.7% more bugs than Fastcheck at a slightly higher false positive rate but is only 3.7X slower.
Article Search
Static Detection of Brittle Parameter Typing
Michael Pradel, Severin Heiniger, and Thomas R. Gross
(ETH Zurich, Switzerland)
To avoid receiving incorrect arguments, a method specifies the expected type of each formal parameter. However, some parameter types are too general and have subtypes that the method does not expect as actual argument types. For example, this may happen if there is no common supertype that precisely describes all expected types. As a result of such brittle parameter typing, a caller may accidentally pass arguments unexpected by the callee without any warnings from the type system. This paper presents a fully automatic, static analysis to find brittle parameter typing and unexpected arguments given to brittle parameters. First, the analysis infers from callers of a method the types that arguments commonly have. Then, the analysis reports potentially unexpected arguments that stand out by having an unusual type. We apply the approach to 21 real-world Java programs that use the Swing API, an API providing various methods with brittle parameters. The analysis reveals 15 previously unknown bugs and code smells where programmers pass arguments that are compatible with the declared parameter type but nevertheless unexpected by the callee. The warnings reported by the analysis have 47% precision and 83% recall.
Article Search
Measuring Enforcement Windows with Symbolic Trace Interpretation: What Well-Behaved Programs Say
Devin Coughlin, Bor-Yuh Evan Chang, Amer Diwan, and Jeremy G. Siek
(University of Colorado at Boulder, USA; Google, USA)
A static analysis design is sufficient if it can prove the property of interest with an acceptable number of false alarms. Ultimately, the only way to confirm that an analysis design is sufficient is to implement it and run it on real-world programs. If the evaluation shows that the design is insufficient, the designer must return to the drawing board and repeat the process--wasting expensive implementation effort over and over again. In this paper, we make the observation that there is a minimal range of code needed to prove a property of interest under an ideal static analysis; we call such a range of code a validation scope. Armed with this observation, we create a dynamic measurement framework that quantifies validation scopes and thus enables designers to rule out insufficient designs at lower cost. A novel attribute of our framework is the ability to model aspects of static reasoning using dynamic execution measurements. To evaluate the flexibility of our framework, we instantiate it on an example property--null dereference errors--and measure validation scopes on real-world programs. We use a broad range of metrics that capture the difficulty of analyzing programs along varying dimensions. We also examine how validation scopes evolve as developers fix null dereference errors and as code matures. We find that bug fixes shorten validation scopes, that longer validation scopes are more likely to be buggy, and that overall validation scopes are remarkably stable as programs evolve.
Article Search

Bug Detection and Diagnosis

Detecting Inconsistencies via Universal Reachability Analysis
Aaron Tomb and Cormac Flanagan
(Galois, USA; UC Santa Cruz, USA)
Recent research has suggested that a large class of software bugs fall into the category of inconsistencies, or cases where two pieces of program code make incompatible assumptions. Existing approaches to inconsistency detection have used intentionally unsound techniques aimed at bug-finding rather than verification. We describe an inconsistency detection analysis that extends previous work and is based on the foundation of the weakest precondition calculus. On a closed program, this analysis can serve as a full verification technique, while in cases where some code is unknown, a theorem prover is incomplete, or specifications are incomplete, it can serve as bug finding technique with a low false-positive rate.
Article Search
Residual Investigation: Predictive and Precise Bug Detection
Kaituo Li, Christoph Reichenbach, Christoph Csallner, and Yannis Smaragdakis
(University of Massachusetts at Amherst, USA; University of Texas at Arlington, USA; University of Athens, Greece)
We introduce the concept of “residual investigation” for program analysis. A residual investigation is a dynamic check installed as a result of running a static analysis that reports a possible program error. The purpose is to observe conditions that indicate whether the statically predicted program fault is likely to be realizable and relevant. The key feature of a residual investigation is that it has to be much more precise (i.e., with fewer false warnings) than the static analysis alone, yet significantly more general (i.e., reporting more errors) than the dynamic tests in the program’s test suite pertinent to the statically reported error. That is, good residual investigations encode dynamic conditions that, when taken in conjunction with the static error report, increase confidence in the existence of an error, as well as its severity, without needing to directly observe a fault resulting from the error. We enhance the static analyzer FindBugs with several residual investigations, appropriately tuned to the static error patterns in FindBugs, and apply it to 7 large open-source systems and their native test suites. The result is an analysis with a low occurrence of false warnings (“false positives”) while reporting several actual errors that would not have been detected by mere execution of a program’s test suite.
Article Search
Isolating Failure Causes through Test Case Generation
Jeremias Rößler, Gordon Fraser, Andreas Zeller, and Alessandro Orso
(Saarland University, Germany; Georgia Tech, USA)
Manual debugging is driven by experiments—test runs that narrow down failure causes by systematically confirming or excluding individual factors. The BUGEX approach leverages test case generation to systematically isolate such causes from a single failing test run—causes such as properties of execution states or branches taken that correlate with the failure. Identifying these causes allows for deriving conclusions as: “The failure occurs whenever the daylight savings time starts at midnight local time.” In our evaluation, a prototype of BUGEX precisely pinpointed important failure explaining facts for six out of seven real-life bugs.
Article Search

Regression Testing

Efficient Regression Testing of Ontology-Driven Systems
Mijung Kim, Jake Cobb, Mary Jean Harrold, Tahsin Kurc, Alessandro Orso, Joel Saltz, Andrew Post, Kunal Malhotra, and Shamkant B. Navathe
(Georgia Tech, USA; Emory University, USA)
To manage and integrate information gathered from heterogeneous databases, an ontology is often used. Like all systems, ontology-driven systems evolve over time and must be regression tested to gain confidence in the behavior of the modified system. Because rerunning all existing tests can be extremely expensive, researchers have developed regression-test-selection (RTS) techniques that select a subset of the available tests that are affected by the changes, and use this subset to test the modified system. Existing RTS techniques have been shown to be effective, but they operate on the code and are unable to handle changes that involve ontologies. To address this limitation, we developed and present in this paper a novel RTS technique that targets ontology-driven systems. Our technique creates representations of the old and new ontologies, compares them to identify entities affected by the changes, and uses this information to select the subset of tests to rerun. We also describe in this paper OntoRetest, a tool that implements our technique and that we used to empirically evaluate our approach on two biomedical ontology-driven database systems. The results of our evaluation show that our technique is both efficient and effective in selecting tests to rerun and in reducing the overall time required to perform regression testing.
Article Search
Regression Mutation Testing
Lingming Zhang, Darko Marinov, Lu Zhang, and Sarfraz Khurshid
(University of Texas at Austin, USA; University of Illinois at Urbana-Champaign, USA; Peking University, China)
Mutation testing is one of the most powerful approaches for evaluating quality of test suites. However, mutation testing is also one of the most expensive testing approaches. This paper presents Regression Mutation Testing (ReMT), a new technique to speed up mutation testing for evolving systems. The key novelty of ReMT is to incrementally calculate mutation testing results for the new program version based on the results from the old program version; ReMT uses a static analysis to check which results can be safely reused. ReMT also employs a mutation-specific test prioritization to further speed up mutation testing. We present an empirical study on six evolving systems, whose sizes range from 3.9KLoC to 88.8KLoC. The empirical results show that ReMT can substantially reduce mutation testing costs, indicating a promising future for applying mutation testing on evolving software systems.
Article Search

proc time: 0.06