26th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2017), July 10–14, 2017, Santa Barbara, CA, USA
Technical Papers
Improving Testing
One Test to Rule Them All
Alex Groce, Josie Holmes, and Kevin Kellar
(Northern Arizona University, USA; Pennsylvania State University, USA; Crescent Valley High School, USA)
Test reduction has long been seen as critical for automated testing. However, traditional test reduction simply reduces the length of a test, but does not attempt to reduce semantic complexity. This paper extends previous efforts with algorithms for normalizing and generalizing tests. Rewriting tests into a normal form can reduce semantic complexity and even remove steps from an already delta-debugged test. Moreover, normalization dramatically reduces the number of tests that a reader must examine, partially addressing the ``fuzzer taming'' problem of discovering distinct faults in a set of failing tests. Generalization, in contrast, takes a test and reports what aspects of the test could have been changed while preserving the property that the test fails. Normalization plus generalization aids understanding of tests, including tests for complex and widely used APIs such as the NumPy numeric computation library and the ArcPy GIS scripting package. Normalization frequently reduces the number of tests to be examined by well over an order of magnitude, and often to just one test per fault. Together, ideally, normalization and generalization allow a user to replace reading a large set of tests that vary in unimportant ways with reading one annotated summary test.
@InProceedings{ISSTA17p1,
author = {Alex Groce and Josie Holmes and Kevin Kellar},
title = {One Test to Rule Them All},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {1--11},
doi = {},
year = {2017},
}
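The reduce-then-normalize idea in this abstract can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in for the paper's machinery: tests are lists of integers, `fails` is a toy failure predicate, and "normal form" is simply the smallest values that still fail.

```python
def reduce_test(steps, fails):
    """Delta-debugging-style reduction: drop steps while the test still fails."""
    i = 0
    while i < len(steps):
        candidate = steps[:i] + steps[i + 1:]
        if fails(candidate):
            steps = candidate  # step i was unnecessary
        else:
            i += 1
    return steps

def normalize(steps, fails):
    """Rewrite steps toward a canonical form: here, the smallest
    integer values that still make the test fail."""
    changed = True
    while changed:
        changed = False
        for i in range(len(steps)):
            for smaller in range(steps[i]):
                candidate = steps[:i] + [smaller] + steps[i + 1:]
                if fails(candidate):
                    steps, changed = candidate, True
                    break
            if changed:
                break
    return steps

# Two superficially different failing tests normalize to the same test.
fails = lambda steps: any(v >= 5 for v in steps)  # toy failure condition
print(normalize(reduce_test([7, 9, 2], fails), fails))  # [5]
print(normalize(reduce_test([6, 1], fails), fails))     # [5]
```

Both failing tests collapse to the single canonical test `[5]`, which is how normalization can shrink a large set of superficially different failures to roughly one test per fault.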
Reinforcement Learning for Automatic Test Case Prioritization and Selection in Continuous Integration
Helge Spieker, Arnaud Gotlieb, Dusica Marijan, and Morten Mossige
(Simula Research Laboratory, Norway; University of Stavanger, Norway; ABB Robotics, Norway)
Testing in Continuous Integration (CI) involves test case prioritization, selection, and execution at each cycle. Selecting the most promising test cases to detect bugs is hard if there is uncertainty about the impact of committed code changes or if traceability links between code and tests are not available. This paper introduces Retecs, a new method for automatically learning test case selection and prioritization in CI, with the goal of minimizing the round-trip time between code commits and developer feedback on failed test cases. The Retecs method uses reinforcement learning to select and prioritize test cases according to their duration, time of last execution, and failure history. In a constantly changing environment, where new test cases are created and obsolete test cases are deleted, the Retecs method learns to prioritize error-prone test cases higher under the guidance of a reward function and by observing previous CI cycles. By applying Retecs to data extracted from three industrial case studies, we show for the first time that reinforcement learning enables fruitful automatic adaptive test case selection and prioritization in CI and regression testing.
@InProceedings{ISSTA17p12,
author = {Helge Spieker and Arnaud Gotlieb and Dusica Marijan and Morten Mossige},
title = {Reinforcement Learning for Automatic Test Case Prioritization and Selection in Continuous Integration},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {12--22},
doi = {},
year = {2017},
}
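The reward-driven prioritization loop described above can be sketched in a few lines. This is not the Retecs method itself (which uses richer state and learned policies); it is a stripped-down value-learning loop with invented names, showing only how a reward for failing tests shifts future priorities.

```python
def prioritize(tests, value):
    """Rank test cases by their learned value, highest first."""
    return sorted(tests, key=lambda t: value.get(t, 0.0), reverse=True)

def observe_cycle(value, executed, failed, alpha=0.1):
    """After a CI cycle, reward tests that failed (i.e., revealed a bug)
    and move each test's value toward its observed reward."""
    for t in executed:
        reward = 1.0 if t in failed else 0.0
        v = value.get(t, 0.0)
        value[t] = v + alpha * (reward - v)

value = {}
tests = ["t1", "t2", "t3"]
for _ in range(20):                                # simulate 20 CI cycles
    order = prioritize(tests, value)
    observe_cycle(value, order, failed={"t2"})     # t2 keeps failing
print(prioritize(tests, value)[0])                 # t2
```

After a few cycles, the consistently failing test rises to the front of the schedule, which is the adaptive behavior the abstract describes.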
PerfRanker: Prioritization of Performance Regression Tests for Collection-Intensive Software
Shaikh Mostafa, Xiaoyin Wang, and Tao Xie
(University of Texas at San Antonio, USA; University of Illinois at Urbana-Champaign, USA)
Regression performance testing is an important but time- and resource-consuming phase during software development. Developers need to detect performance regressions as early as possible to reduce their negative impact and fixing cost. However, conducting regression performance testing frequently (e.g., after each commit) is prohibitively expensive. To address this issue, in this paper, we propose PerfRanker, the first approach to prioritizing test cases in performance regression testing for collection-intensive software, a common type of modern software that makes heavy use of collections. Our test prioritization is based on a performance impact analysis that estimates the performance impact of a given code revision on a given test execution. Evaluation shows that our approach covers the top 3 test cases whose performance is most affected within the top 30% to 37% of prioritized test cases, in contrast to the top 65% to 79% required by 3 baseline techniques.
@InProceedings{ISSTA17p23,
author = {Shaikh Mostafa and Xiaoyin Wang and Tao Xie},
title = {PerfRanker: Prioritization of Performance Regression Tests for Collection-Intensive Software},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {23--34},
doi = {},
year = {2017},
}
Artifacts Functional
Compiler-Assisted Test Acceleration on GPUs for Embedded Software
Vanya Yaneva, Ajitha Rajan, and Christophe Dubach
(University of Edinburgh, UK)
Embedded software is found everywhere, from our highly visible mobile devices to the confines of our cars in the form of smart sensors. Embedded software companies are under huge pressure to produce safe applications that limit risks, and testing is absolutely critical to alleviate concerns regarding safety and user privacy. This requires using large test suites throughout the development process, increasing time-to-market and ultimately hindering competitiveness.
Speeding up test execution is, therefore, of paramount importance for embedded software developers. This is traditionally achieved by running, in parallel, multiple tests on large-scale clusters of computers. However, this approach is costly in terms of infrastructure maintenance and energy consumed, and is at times inconvenient as developers have to wait for their tests to be scheduled on a shared resource.
We propose to look at exploiting GPUs (Graphics Processing Units) for running embedded software testing. GPUs are readily available in most computers and offer tremendous amounts of parallelism, making them an ideal target for embedded software testing. In this paper, we demonstrate, for the first time, how test executions of embedded C programs can be automatically performed on a GPU, without involving the end user. We take a compiler-assisted approach which automatically compiles the C program into GPU kernels for parallel execution of the input tests. Using this technique, we achieve an average speedup of 16× when compared to CPU execution of input tests across nine programs from an industry standard embedded benchmark suite.
@InProceedings{ISSTA17p35,
author = {Vanya Yaneva and Ajitha Rajan and Christophe Dubach},
title = {Compiler-Assisted Test Acceleration on GPUs for Embedded Software},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {35--45},
doi = {},
year = {2017},
}
Testing
Targeted Property-Based Testing
Andreas Löscher and Konstantinos Sagonas
(Uppsala University, Sweden)
We introduce targeted property-based testing, an enhanced form of property-based testing that aims to make the input generation component of a property-based testing tool guided by a search strategy rather than being completely random. Thus, this testing technique combines the advantages of both search-based and property-based testing. We demonstrate the technique with the framework we have built, called Target, and show its effectiveness on three case studies. The first of them demonstrates how Target can employ simulated annealing to generate sensor network topologies that form configurations with high energy consumption. The second case study shows how the generation of routing trees for a wireless network equipped with directional antennas can be guided to fulfill different energy metrics. The third case study employs Target to test the noninterference property of information-flow control abstract machine designs, and compares it with a sophisticated hand-written generator for programs of these abstract machines.
@InProceedings{ISSTA17p46,
author = {Andreas Löscher and Konstantinos Sagonas},
title = {Targeted Property-Based Testing},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {46--56},
doi = {},
year = {2017},
}
Artifacts Functional
Generating Unit Tests with Descriptive Names Or: Would You Name Your Children Thing1 and Thing2?
Ermira Daka, José Miguel Rojas, and Gordon Fraser
(University of Sheffield, UK)
The name of a unit test helps developers to understand the purpose and scenario of the test, and test names support developers when navigating amongst sets of unit tests. When unit tests are generated automatically, however, they tend to be given non-descriptive names such as “test0”, which provide none of the benefits a descriptive name can give a test. The underlying challenge is that automatically generated tests typically do not represent real scenarios and have no clear purpose other than covering code, which makes naming them difficult. In this paper, we present an automated approach which generates descriptive names for automatically generated unit tests by summarizing API-level coverage goals. The generated names are optimized to be short, descriptive of the test, have a clear relation to the covered code under test, and allow developers to uniquely distinguish tests in a test suite. An empirical evaluation with 47 participants shows that developers agree with the synthesized names, and that the synthesized names are as descriptive as manually written names. Study participants were even more accurate and faster at matching code and tests with synthesized names than with manually derived names.
@InProceedings{ISSTA17p57,
author = {Ermira Daka and José Miguel Rojas and Gordon Fraser},
title = {Generating Unit Tests with Descriptive Names Or: Would You Name Your Children Thing1 and Thing2?},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {57--67},
doi = {},
year = {2017},
}
Artifacts Functional
Symbolic Execution
Accelerating Array Constraints in Symbolic Execution
David M. Perry, Andrea Mattavelli, Xiangyu Zhang, and Cristian Cadar
(Purdue University, USA; Imperial College London, UK)
Despite significant recent advances, the effectiveness of symbolic execution is limited when used to test complex, real-world software. One of the main scalability challenges is related to constraint solving: large applications and long exploration paths lead to complex constraints, often involving big arrays indexed by symbolic expressions. In this paper, we propose a set of semantics-preserving transformations for array operations that take advantage of contextual information collected during symbolic execution. Our transformations lead to simpler encodings and hence better performance in constraint solving. The results we obtain are encouraging: we show, through an extensive experimental analysis, that our transformations help to significantly improve the performance of symbolic execution in the presence of arrays. We also show that our transformations enable the analysis of new code, which would be otherwise out of reach for symbolic execution.
@InProceedings{ISSTA17p68,
author = {David M. Perry and Andrea Mattavelli and Xiangyu Zhang and Cristian Cadar},
title = {Accelerating Array Constraints in Symbolic Execution},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {68--78},
doi = {},
year = {2017},
}
Artifacts Functional
Improving the Cost-Effectiveness of Symbolic Testing Techniques for Transport Protocol Implementations under Packet Dynamics
Wei Sun, Lisong Xu, and Sebastian Elbaum
(University of Nebraska-Lincoln, USA)
The majority of Internet traffic is transferred by transport protocols. The correctness of these transport protocol implementations is hard to validate, as their behaviors depend not only on their protocols but also on their network environments, which can introduce dynamic packet delay and loss. Random testing, widely used in industry due to its simplicity and low cost, struggles to detect packet-delay-related faults, which occur with low probability. Symbolic-execution-based testing is promising at detecting such low-probability faults, but it requires large testing budgets as it attempts to cover a prohibitively large input space of packet dynamics. To improve its cost-effectiveness, we propose two domain-specific heuristic techniques, called packet retransmission based priority and network state based priority, which are motivated by two common transport protocol properties. In our experiments using the Linux TFTP programs, our techniques improve the cost-effectiveness of symbolic-execution-based testing for transport protocols, detecting three times as many faults when the budget is in the range of minutes and hours.
@InProceedings{ISSTA17p79,
author = {Wei Sun and Lisong Xu and Sebastian Elbaum},
title = {Improving the Cost-Effectiveness of Symbolic Testing Techniques for Transport Protocol Implementations under Packet Dynamics},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {79--89},
doi = {},
year = {2017},
}
Artifacts Functional
Combining Symbolic Execution and Search-Based Testing for Programs with Complex Heap Inputs
Pietro Braione, Giovanni Denaro, Andrea Mattavelli, and Mauro Pezzè
(University of Milano-Bicocca, Italy; Imperial College London, UK; University of Lugano, Switzerland)
Despite the recent improvements in automatic test case generation, handling complex data structures as test inputs is still an open problem. Search-based approaches can generate sequences of method calls that instantiate structured inputs to exercise a relevant portion of the code, but fall short in building inputs to execute program elements whose reachability is determined by the structural features of the input structures themselves. Symbolic execution techniques can effectively handle structured inputs, but do not identify the sequences of method calls that instantiate the input structures through legal interfaces. In this paper, we propose a new approach to automatically generate test cases for programs with complex data structures as inputs. We use symbolic execution to generate path conditions that characterise the dependencies between the program paths and the input structures, and convert the path conditions to optimisation problems that we solve with search-based techniques to produce sequences of method calls that instantiate those inputs. Our preliminary results show that the approach is indeed effective in generating test cases for programs with complex data structures as inputs, thus opening a promising research direction.
@InProceedings{ISSTA17p90,
author = {Pietro Braione and Giovanni Denaro and Andrea Mattavelli and Mauro Pezzè},
title = {Combining Symbolic Execution and Search-Based Testing for Programs with Complex Heap Inputs},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {90--101},
doi = {},
year = {2017},
}
Artifacts Functional
Concurrency
Efficient Computation of Happens-Before Relation for Event-Driven Programs
Pallavi Maiya and Aditya Kanade
(IISc Bangalore, India)
An emerging style of programming is to use both threads and events to achieve better scalability. The improved scalability comes at the price of increased complexity, as both threads and events can follow non-deterministic schedules. The happens-before (HB) relation captures the space of possible schedules and forms the basis of various concurrency analyses. Improving efficiency of the HB computation can speed up these analyses.
In this paper, we identify a major bottleneck in computation of the HB relation for such event-driven programs. Event-driven programs are designed to interact continuously with their environment, and usually receive a large number of events even within a short span of time. This increases the cost of discovering the HB order among the events. We propose a novel data structure, called event graph, that maintains a subset of the HB relation to efficiently infer order between any pair of events. We present an algorithm, called EventTrack, which improves efficiency of vector clock based HB computation for event-driven programs using event graphs.
We have implemented EventTrack and evaluated it on traces of eight Android applications. Compared to the state-of-the-art technique, EventTrack gave an average speedup of 4.9X. The speedup ranged from 1.8X to 10.3X across the applications.
@InProceedings{ISSTA17p102,
author = {Pallavi Maiya and Aditya Kanade},
title = {Efficient Computation of Happens-Before Relation for Event-Driven Programs},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {102--112},
doi = {},
year = {2017},
}
Artifacts Functional
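The happens-before ordering that EventTrack accelerates can be sketched with plain vector clocks. This is only the baseline componentwise comparison, not the paper's event-graph data structure; the events and clock values below are illustrative.

```python
def happens_before(vc_a, vc_b):
    """a happens-before b iff a's vector clock is componentwise <= b's
    and strictly smaller in at least one component."""
    return all(x <= y for x, y in zip(vc_a, vc_b)) and vc_a != vc_b

# Three events from a two-thread trace, as (thread0, thread1) clocks.
e1, e2, e3 = (1, 0), (2, 0), (1, 1)
print(happens_before(e1, e2))  # True: e1 precedes e2 on the same thread
print(happens_before(e2, e3))  # False
print(happens_before(e3, e2))  # False: e2 and e3 are concurrent
```

When neither direction of the comparison holds, the two events are concurrent, which is exactly the case a race detector must examine.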
Automatic Detection and Validation of Race Conditions in Interrupt-Driven Embedded Software
Yu Wang, Linzhang Wang, Tingting Yu, Jianhua Zhao, and Xuandong Li
(Nanjing University, China; University of Kentucky, USA)
Interrupt-driven programs are widely deployed in safety-critical embedded systems to perform hardware- and resource-dependent data operation tasks. The frequent use of interrupts in these systems can cause race conditions to occur due to interactions between application tasks and interrupt handlers. Numerous program analysis and testing techniques have been proposed to detect races in multithreaded programs. Little work, however, has addressed race condition problems related to hardware interrupts. In this paper, we present SDRacer, an automated framework that can detect and validate race conditions in interrupt-driven embedded software. It uses a combination of static analysis and symbolic execution to generate input data for exercising the potential races. It then employs virtual platforms to dynamically validate these races by forcing the interrupts to occur at the potential racing points. We evaluate SDRacer on nine real-world embedded programs written in the C language. The results show that SDRacer can precisely detect race conditions.
@InProceedings{ISSTA17p113,
author = {Yu Wang and Linzhang Wang and Tingting Yu and Jianhua Zhao and Xuandong Li},
title = {Automatic Detection and Validation of Race Conditions in Interrupt-Driven Embedded Software},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {113--124},
doi = {},
year = {2017},
}
Monitoring Decentralized Specifications
Antoine El-Hokayem and Yliès Falcone
(Grenoble Alpes University, France; Inria, France; CNRS, France; Laboratoire d'Informatique de Grenoble, France)
We define two complementary approaches to monitor decentralized systems. The first relies on systems with a centralized specification, i.e., when the specification is written for the behavior of the entire system. To do so, our approach introduces a data structure that (i) keeps track of the execution of an automaton, (ii) has predictable parameters and size, and (iii) guarantees strong eventual consistency. The second approach defines decentralized specifications, wherein multiple specifications are provided for separate parts of the system. We study decentralized monitorability and present a general algorithm for monitoring decentralized specifications. We map three existing algorithms to our approaches and provide a framework for analyzing their behavior. Lastly, we introduce our tool, a framework for designing such decentralized algorithms and simulating their behavior.
@InProceedings{ISSTA17p125,
author = {Antoine El-Hokayem and Yliès Falcone},
title = {Monitoring Decentralized Specifications},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {125--135},
doi = {},
year = {2017},
}
Artifacts Functional
Dynamic Analysis
Effective Online Software Anomaly Detection
Yizhen Chen, Ming Ying, Daren Liu, Adil Alim, Feng Chen, and Mei-Hwa Chen
(SUNY Albany, USA)
While automatic online software anomaly detection is crucial for ensuring the quality of production software, current techniques are mostly inefficient and ineffective. For online software, its inputs are usually provided by the users at runtime and the validity of the outputs cannot be automatically verified without a predefined oracle. Furthermore, some online anomalous behavior may be caused by the anomalies in the execution context, rather than by any code defect, which are even more difficult to detect. Existing approaches tackle this problem by identifying certain properties observed from the executions of the software during a training process and using them to monitor online software behavior. However, they may require a large execution overhead for monitoring the properties, which limits the applicability of these approaches for online monitoring. We present a methodology that applies effective algorithms to select a close to optimal set of anomaly-revealing properties, which enables online anomaly detection with minimal execution overhead. Our empirical results show that an average of 76.5% of anomalies were detected by using at most 5.5% of execution overhead.
@InProceedings{ISSTA17p136,
author = {Yizhen Chen and Ming Ying and Daren Liu and Adil Alim and Feng Chen and Mei-Hwa Chen},
title = {Effective Online Software Anomaly Detection},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {136--146},
doi = {},
year = {2017},
}
Semi-automated Discovery of Server-Based Information Oversharing Vulnerabilities in Android Applications
William Koch, Abdelberi Chaabane, Manuel Egele, William Robertson, and Engin Kirda
(Boston University, USA; Northeastern University, USA)
Modern applications are often split into separate client and server tiers that communicate via message passing over the network. One well-understood threat to privacy for such applications is the leakage of sensitive user information either in transit or at the server. In response, an array of defensive techniques have been developed to identify or block unintended or malicious information leakage. However, prior work has primarily considered privacy leaks originating at the client directed at the server, while leakage in the reverse direction -- from the server to the client -- is comparatively under-studied. The question of whether and to what degree this leakage constitutes a threat remains an open question. We answer this question in the affirmative with Hush, a technique for semi-automatically identifying Server-based InFormation OvershariNg (SIFON) vulnerabilities in multi-tier applications. In particular, the technique detects SIFON vulnerabilities using a heuristic that overshared sensitive information from server-side APIs will not be displayed by the application's user interface. The technique first performs a scalable static program analysis to screen applications for potential vulnerabilities, and then attempts to confirm these candidates as true vulnerabilities with a partially-automated dynamic analysis. Our evaluation over a large corpus of Android applications demonstrates the effectiveness of the technique by discovering several previously-unknown SIFON vulnerabilities in eight applications.
@InProceedings{ISSTA17p147,
author = {William Koch and Abdelberi Chaabane and Manuel Egele and William Robertson and Engin Kirda},
title = {Semi-automated Discovery of Server-Based Information Oversharing Vulnerabilities in Android Applications},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {147--157},
doi = {},
year = {2017},
}
CPR: Cross Platform Binary Code Reuse via Platform Independent Trace Program
Yonghwi Kwon,
Weihang Wang,
Yunhui Zheng,
Xiangyu Zhang , and Dongyan Xu
(Purdue University, USA; IBM Research, USA)
The rapid growth of the Internet of Things (IoT) has created a number of new platforms recently. Unfortunately, this variety of IoT devices causes platform fragmentation, which makes software development on such devices challenging. In particular, existing programs cannot simply be reused on such devices, as they rely on certain underlying hardware and software interfaces, which we call platform dependencies. In this paper, we present CPR, a novel technique that synthesizes a platform-independent program from a platform-dependent program. Specifically, we leverage an existing system called PIEtrace, which can generate a platform-independent trace program. The generated trace program is platform independent, but it can only reproduce a specific execution path. Hence, we develop an algorithm to merge a set of platform-independent trace programs and synthesize a general program that can take multiple inputs. The synthesized platform-independent program is representative of the merged trace programs, and the results produced by the program are correct if no exceptions occur. Our evaluation results on 15 real-world applications show that CPR is highly effective at reusing existing binaries across platforms.
@InProceedings{ISSTA17p158,
author = {Yonghwi Kwon and Weihang Wang and Yunhui Zheng and Xiangyu Zhang and Dongyan Xu},
title = {CPR: Cross Platform Binary Code Reuse via Platform Independent Trace Program},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {158--169},
doi = {},
year = {2017},
}
An Actionable Performance Profiler for Optimizing the Order of Evaluations
Marija Selakovic, Thomas Glaser, and Michael Pradel
(TU Darmstadt, Germany)
The efficiency of programs often can be improved by applying relatively simple changes. To find such optimization opportunities, developers either rely on manual performance tuning, which is time-consuming and requires expert knowledge, or on traditional profilers, which show where resources are spent but not how to optimize the program. This paper presents a profiler that provides actionable advice, by not only finding optimization opportunities but by also suggesting code transformations that exploit them. Specifically, we focus on optimization opportunities related to the order of evaluating subexpressions that are part of a decision made by the program. To help developers find such reordering opportunities, we present DecisionProf, a dynamic analysis that automatically identifies the optimal order, for a given input, of checks in logical expressions and in switch statements. The key idea is to assess the computational costs of all possible orders, to find the optimal order, and to suggest a code transformation to the developer only if reordering yields a statistically significant performance improvement. Applying DecisionProf to 43 real-world JavaScript projects reveals 52 beneficial reordering opportunities. Optimizing the code as proposed by DecisionProf reduces the execution time of individual functions between 2.5% and 59%, and leads to statistically significant application-level performance improvements that range between 2.5% and 6.5%.
@InProceedings{ISSTA17p170,
author = {Marija Selakovic and Thomas Glaser and Michael Pradel},
title = {An Actionable Performance Profiler for Optimizing the Order of Evaluations},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {170--180},
doi = {},
year = {2017},
}
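The reordering idea behind DecisionProf can be sketched for the simple case of an or-chain with roughly equal per-check costs, where the expected evaluation cost is minimized by putting the most frequently true checks first. The check names and workload below are invented for illustration; the actual tool also profiles per-check costs and switch statements.

```python
def optimal_or_order(checks, inputs):
    """For a chain `c1 or c2 or ...`, evaluation stops at the first true
    check, so with roughly equal per-check costs the cheapest order puts
    the most frequently true checks first."""
    freq = {name: sum(1 for x in inputs if pred(x)) / len(inputs)
            for name, pred in checks.items()}
    return sorted(checks, key=lambda name: freq[name], reverse=True)

checks = {
    "is_special": lambda x: x > 100,     # rarely true on this workload
    "is_even":    lambda x: x % 2 == 0,  # true half the time
}
print(optimal_or_order(checks, inputs=range(10)))  # ['is_even', 'is_special']
```

On this workload, evaluating `is_even` first short-circuits the chain half the time, so the suggested order skips the rarer check whenever possible.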
The Web
Testing and Analysis of Web Applications using Page Models
Snigdha Athaiya and Raghavan Komondoor
(IISc Bangalore, India)
Web applications are difficult to analyze using code-based tools because data-flow and control-flow through the application occurs via both server-side code and client-side pages. Client-side pages are typically specified in a scripting language that is different from the main server-side language; moreover, the pages are generated dynamically from the scripts. To address these issues we propose a static-analysis approach that automatically constructs a “model” of each page in a given application. A page model is a code fragment in the same language as the server-side code, which faithfully over-approximates the possible elements of the page as well as the control-flows and data-flows due to these elements. The server-side code in conjunction with the page models then becomes a standard (non-web) program, thus amenable to analysis using standard code-based tools. We have implemented our approach in the context of J2EE applications. We demonstrate the versatility and usefulness of our approach by applying three standard analysis tools on the resultant programs from our approach: a concolic-execution based model checker (JPF), a dynamic fault localization tool (Zoltar), and a static slicer (Wala).
@InProceedings{ISSTA17p181,
author = {Snigdha Athaiya and Raghavan Komondoor},
title = {Testing and Analysis of Web Applications using Page Models},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {181--191},
doi = {},
year = {2017},
}
Artifacts Functional
Automated Layout Failure Detection for Responsive Web Pages without an Explicit Oracle
Thomas A. Walsh, Gregory M. Kapfhammer, and Phil McMinn
(University of Sheffield, UK; Allegheny College, USA)
As the number and variety of devices being used to access the World Wide Web grows exponentially, ensuring the correct presentation of a web page, regardless of the device used to browse it, is an important and challenging task. When developers adopt responsive web design (RWD) techniques, web pages modify their appearance to accommodate a device’s display constraints. However, a current lack of automated support means that presentation failures may go undetected in a page’s layout when rendered for different viewport sizes. A central problem is the difficulty in providing an automated “oracle” to validate RWD layouts against, meaning that checking for failures is largely a manual process in practice, which results in layout failures in many live responsive web sites. This paper presents an automated failure detection technique that checks the consistency of a responsive page’s layout across a range of viewport widths, obviating the need for an explicit oracle. In an empirical study, this method found failures in 16 of 26 real-world production pages studied, detecting 33 distinct failures in total.
@InProceedings{ISSTA17p192,
author = {Thomas A. Walsh and Gregory M. Kapfhammer and Phil McMinn},
title = {Automated Layout Failure Detection for Responsive Web Pages without an Explicit Oracle},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {192--202},
doi = {},
year = {2017},
}
Test Execution Checkpointing for Web Applications
Marco Guarnieri, Petar Tsankov, Tristan Buchs, Mohammad Torabi Dashti, and David Basin
(ETH Zurich, Switzerland; EPFL, Switzerland)
Test isolation is a prerequisite for the correct execution of test suites on web applications. We present Test Execution Checkpointing, a method for efficient test isolation. Our method instruments web applications to support checkpointing and exploits this support to isolate and optimize tests. We have implemented and evaluated this method on five popular PHP web applications. The results show that our method not only provides test isolation essentially for free, it also reduces testing time by 44% on average.
@InProceedings{ISSTA17p203,
author = {Marco Guarnieri and Petar Tsankov and Tristan Buchs and Mohammad Torabi Dashti and David Basin},
title = {Test Execution Checkpointing for Web Applications},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {203--214},
doi = {},
year = {2017},
}
Experience Report
Experience Paper: A Study on Behavioral Backward Incompatibilities of Java Software Libraries
Shaikh Mostafa, Rodney Rodriguez, and Xiaoyin Wang
(University of Texas at San Antonio, USA)
Nowadays, due to frequent technological innovation and market changes, software libraries are evolving very quickly. Backward compatibility has always been one of the most important requirements during the evolution of software platforms and libraries. However, backward compatibility is seldom fully achieved in practice, and many related software failures have been reported. Therefore, it is important to understand the status, major reasons, and impact of backward incompatibilities in real-world software. This paper presents an empirical study to understand behavioral changes of APIs during the evolution of software libraries. Specifically, we performed large-scale cross-version regression testing on 68 consecutive version pairs from 15 popular Java software libraries. Furthermore, we collected and studied 126 real-world bug reports on backward incompatibilities of software libraries. Our major findings include: (1) 1,094 test failures/errors and 296 behavioral backward incompatibilities are detected from 52 of the 68 consecutive version pairs; (2) there is a distribution mismatch between incompatibilities detected by library-side regression testing and bug-inducing incompatibilities; (3) the majority of behavioral backward incompatibilities are not well documented in API documents or release notes; and (4) 67% of fixed client bugs caused by backward incompatibilities in software libraries are fixed by client developers, through several simple change patterns applied to the backward-incompatible API invocation.
@InProceedings{ISSTA17p215,
author = {Shaikh Mostafa and Rodney Rodriguez and Xiaoyin Wang},
title = {Experience Paper: A Study on Behavioral Backward Incompatibilities of Java Software Libraries},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {215--225},
doi = {},
year = {2017},
}
Program Repair and Patching
Identifying Test-Suite-Overfitted Patches through Test Case Generation
Qi Xin and Steven P. Reiss
(Brown University, USA)
A typical automatic program repair technique that uses a test suite as the correctness criterion can produce a patched program that is test-suite-overfitted, or overfitting, which passes the test suite but does not actually repair the bug. In this paper, we propose DiffTGen, which identifies a patched program as overfitting by first generating new test inputs that uncover semantic differences between the original faulty program and the patched program, then testing the patched program based on those semantic differences, and finally generating test cases. Such a test case could be added to the original test suite to make it stronger and could prevent the repair technique from generating a similar overfitting patch again. We evaluated DiffTGen on 89 patches generated by four automatic repair techniques for Java, 79 of which are likely to be overfitting and incorrect. DiffTGen identifies a total of 39 (49.4%) overfitting patches and yields the corresponding test cases. We further show that an automatic repair technique, if configured with DiffTGen, could avoid yielding overfitting patches and potentially produce correct ones.
@InProceedings{ISSTA17p226,
author = {Qi Xin and Steven P. Reiss},
title = {Identifying Test-Suite-Overfitted Patches through Test Case Generation},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {226--236},
doi = {},
year = {2017},
}
Impact of Tool Support in Patch Construction
Anil Koyuncu, Tegawendé F. Bissyandé, Dongsun Kim, Jacques Klein, Martin Monperrus, and Yves Le Traon
(University of Luxembourg, Luxembourg; Inria, France; University of Lille, France)
In this work, we investigate the practice of patch construction in Linux kernel development, focusing on the differences between three patching processes: (1) patches crafted entirely manually to fix bugs, (2) patches derived from warnings of bug detection tools, and (3) patches generated automatically based on fix patterns. With this study, we provide the research community with concrete insights into the practice of patching, as well as into how the development community is currently embracing research and commercial patching tools to improve productivity in repair. The results of our study show that tool-supported patches are increasingly adopted by the developer community, while manually written patches are accepted more quickly. Patch application tools enable developers to remain committed to contributing patches to the code base. Our findings also include that, in actual development processes, patches generally implement several change operations spread over the code, even for patches fixing warnings from bug detection tools. Finally, this study shows that there is an opportunity to directly leverage the output of bug detection tools to readily generate patches that are appropriate for fixing the problem and consistent with manually written patches.
@InProceedings{ISSTA17p237,
author = {Anil Koyuncu and Tegawendé F. Bissyandé and Dongsun Kim and Jacques Klein and Martin Monperrus and Yves Le Traon},
title = {Impact of Tool Support in Patch Construction},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {237--248},
doi = {},
year = {2017},
}
Automated Repair of Layout Cross Browser Issues using Search-Based Techniques
Sonal Mahajan, Abdulmajeed Alameer, Phil McMinn, and William G. J. Halfond
(University of Southern California, USA; University of Sheffield, UK)
A consistent cross-browser user experience is crucial for the success of a website. Layout Cross Browser Issues (XBIs) can severely undermine a website’s success by causing web pages to render incorrectly in certain browsers, thereby negatively impacting users’ impression of the quality and services that the web page delivers. Existing Cross Browser Testing (XBT) techniques can only detect XBIs in websites. Repairing them is, hitherto, a manual task that is labor intensive and requires significant expertise. Addressing this concern, our paper proposes a technique for automatically repairing layout XBIs in websites using guided search-based techniques. Our empirical evaluation showed that our approach was able to successfully fix 86% of layout XBIs reported for 15 different web pages studied, thereby improving their cross-browser consistency.
@InProceedings{ISSTA17p249,
author = {Sonal Mahajan and Abdulmajeed Alameer and Phil McMinn and William G. J. Halfond},
title = {Automated Repair of Layout Cross Browser Issues using Search-Based Techniques},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {249--260},
doi = {},
year = {2017},
}
Artifacts Functional
Fault Localization and Mutation Testing
Boosting Spectrum-Based Fault Localization using PageRank
Mengshi Zhang, Xia Li, Lingming Zhang, and Sarfraz Khurshid
(University of Texas at Austin, USA; University of Texas at Dallas, USA)
Manual debugging is notoriously tedious and time consuming. Therefore, various automated fault localization techniques have been proposed to help with manual debugging. Among the existing fault localization techniques, spectrum-based fault localization (SBFL) is one of the most widely studied techniques due to being lightweight. A focus of existing SBFL techniques is to consider how to differentiate program source code entities (i.e., one dimension in program spectra); indeed, this focus is aligned with the ultimate goal of finding the faulty lines of code. Our key insight is to enhance existing SBFL techniques by additionally considering how to differentiate tests (i.e., the other dimension in program spectra), which, to the best of our knowledge, has not been studied in prior work.
We present PRFL, a lightweight technique that boosts spectrum-based fault localization by differentiating tests using the PageRank algorithm. Given the original program spectrum information, PRFL uses PageRank to recompute the spectrum information by considering the contributions of different tests. Then, traditional SBFL techniques can be applied to the recomputed spectrum information to achieve more effective fault localization. Although simple and lightweight, PRFL has been demonstrated to outperform state-of-the-art SBFL techniques significantly (e.g., ranking 42% more real faults within Top-1 compared with the most effective traditional SBFL technique) with low overhead (e.g., around 2 minutes of average extra overhead on real faults) on 357 real faults from 5 Defects4J projects and 30,692 artificial (i.e., mutation) faults from 87 GitHub projects, demonstrating a promising future for considering the contributions of different tests during fault localization.
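As background, plain SBFL ranks statements by a suspiciousness formula over the coverage spectrum; PRFL's contribution is to reweight test contributions in that spectrum with PageRank before such a formula is applied. A minimal sketch of the baseline using the Ochiai formula follows (the spectrum data and names are illustrative, not taken from the paper):

```python
import math

# Toy coverage spectrum: statement -> set of tests that execute it.
coverage = {
    "s1": {"t1", "t2"},
    "s2": {"t1", "t3"},        # covered by both failing tests
    "s3": {"t2", "t3", "t4"},
}
failing = {"t1", "t3"}
passing = {"t2", "t4"}

def ochiai(stmt):
    """Ochiai suspiciousness: ef / sqrt(totalf * (ef + ep))."""
    ef = len(coverage[stmt] & failing)   # failing tests covering stmt
    ep = len(coverage[stmt] & passing)   # passing tests covering stmt
    denom = math.sqrt(len(failing) * (ef + ep))
    return ef / denom if denom else 0.0

# Rank statements by suspiciousness; s2 surfaces at the top.
ranked = sorted(coverage, key=ochiai, reverse=True)
```

PRFL would replace the raw per-test counts here with PageRank-derived test weights before ranking; the formula application itself stays unchanged.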
@InProceedings{ISSTA17p261,
author = {Mengshi Zhang and Xia Li and Lingming Zhang and Sarfraz Khurshid},
title = {Boosting Spectrum-Based Fault Localization using PageRank},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {261--272},
doi = {},
year = {2017},
}
FLUCCS: Using Code and Change Metrics to Improve Fault Localization
Jeongju Sohn and Shin Yoo
(KAIST, South Korea)
Fault localization aims to support the debugging activities of human developers by highlighting the program elements that are suspected to be responsible for the observed failure. Spectrum-Based Fault Localization (SBFL), an existing localization technique that relies only on the coverage and pass/fail results of executed test cases, has been widely studied but also criticized for its lack of precision and limited effort reduction. To overcome the restrictions of techniques based purely on coverage, we extend SBFL with code and change metrics that have been studied in the context of defect prediction, such as size, age, and code churn. Using suspiciousness values from existing SBFL formulas and these source code metrics as features, we apply two learn-to-rank techniques: Genetic Programming (GP) and linear rank Support Vector Machines (SVMs). We evaluate our approach with a ten-fold cross validation of method-level fault localization, using 210 real-world faults from the Defects4J repository. GP with additional source code metrics ranks the faulty method at the top for 106 faults, and within the top five for 173 faults. This is a significant improvement over the state-of-the-art SBFL formulas, the best of which ranks 49 and 127 faults at the top and within the top five, respectively.
@InProceedings{ISSTA17p273,
author = {Jeongju Sohn and Shin Yoo},
title = {FLUCCS: Using Code and Change Metrics to Improve Fault Localization},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {273--283},
doi = {},
year = {2017},
}
Inferring Mutant Utility from Program Context
René Just, Bob Kurtz, and Paul Ammann
(University of Massachusetts, USA; George Mason University, USA)
Existing mutation techniques produce vast numbers of equivalent, trivial, and redundant mutants. Selective mutation strategies aim to reduce the inherent redundancy of full mutation analysis to obtain most of its benefit for a fraction of the cost. Unfortunately, recent research has shown that there is no fixed selective mutation strategy that is effective across a broad range of programs; the utility (i.e., usefulness) of a mutant produced by a given mutation operator varies greatly across programs.
This paper hypothesizes that mutant utility, in terms of equivalence, triviality, and dominance, can be predicted by incorporating context information from the program in which the mutant is embedded. Specifically, this paper (1) explains the intuition behind this hypothesis with a motivational example, (2) proposes an approach for modeling program context using a program's abstract syntax tree, and (3) proposes and evaluates a series of program-context models for predicting mutant utility. The results for 129 mutation operators show that program context information greatly increases the ability to predict mutant utility. The results further show that it is important to consider program context for individual mutation operators rather than mutation operator groups.
@InProceedings{ISSTA17p284,
author = {René Just and Bob Kurtz and Paul Ammann},
title = {Inferring Mutant Utility from Program Context},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {284--294},
doi = {},
year = {2017},
}
Artifacts Functional
Faster Mutation Analysis via Equivalence Modulo States
Bo Wang, Yingfei Xiong, Yangqingwei Shi, Lu Zhang, and Dan Hao
(Peking University, China)
Mutation analysis has many applications, such as assessing the quality of test suites and localizing faults. One important bottleneck of mutation analysis is scalability. The latest work explores the possibility of reducing redundant execution via split-stream execution. However, split-stream execution is only able to remove redundant execution before the first mutated statement. In this paper, we try to also reduce some of the redundant execution after the execution of the first mutated statement. We observe that, although many mutated statements are not equivalent, the execution results of those mutated statements may still be equivalent to the result of the original statement. In other words, the statements are equivalent modulo the current state. We propose a fast mutation analysis approach, AccMut. AccMut automatically detects the equivalence modulo states among a statement and its mutations, groups the statements into equivalence classes modulo states, and uses only one process to represent each class. In this way, we can significantly reduce the number of split processes. Our experiments show that our approach can further accelerate mutation analysis on top of split-stream execution, with a speedup of 2.56x on average.
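The core observation can be illustrated with a toy sketch (illustrative Python, not AccMut's LLVM-level implementation): mutants whose mutated expression evaluates to the same result in the current program state can share a single execution, so only the distinct results force a process split.

```python
# Group a statement and its mutants by their result in the current state.
# Variants evaluating to the same value are "equivalent modulo state":
# one process can represent the whole group.
def group_equivalent_mutants(state, variants):
    groups = {}  # result value -> names of variants producing it
    for name, expr in variants:
        groups.setdefault(expr(state), []).append(name)
    return list(groups.values())

state = {"a": 5, "b": 0}
variants = [
    ("original: a + b", lambda s: s["a"] + s["b"]),
    ("mutant: a - b",   lambda s: s["a"] - s["b"]),  # equal here, since b == 0
    ("mutant: a * b",   lambda s: s["a"] * s["b"]),  # differs: yields 0
]
groups = group_equivalent_mutants(state, variants)
# Three variants collapse into two equivalence classes in this state.
```

In a state where `b != 0`, the first mutant would fall into its own class; the grouping is relative to the current state, which is exactly what makes it cheaper than static equivalence checking.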
@InProceedings{ISSTA17p295,
author = {Bo Wang and Yingfei Xiong and Yangqingwei Shi and Lu Zhang and Dan Hao},
title = {Faster Mutation Analysis via Equivalence Modulo States},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {295--306},
doi = {},
year = {2017},
}
Static Analysis
Just-in-Time Static Analysis
Lisa Nguyen Quang Do, Karim Ali, Benjamin Livshits, Eric Bodden, Justin Smith, and Emerson Murphy-Hill
(Fraunhofer IEM, Germany; University of Alberta, Canada; Imperial College London, UK; University of Paderborn, Germany; North Carolina State University, USA)
We present the concept of Just-In-Time (JIT) static analysis that interleaves code development and bug fixing in an integrated development environment. Unlike traditional batch-style analysis tools, a JIT analysis tool presents warnings to code developers over time, providing the most relevant results quickly, and computing less relevant results incrementally later. In this paper, we describe general guidelines for designing JIT analyses. We also present a general recipe for transforming static data-flow analyses to JIT analyses through a concept of layered analysis execution. We illustrate this transformation through CHEETAH, a JIT taint analysis for Android applications. Our empirical evaluation of CHEETAH on real-world applications shows that our approach returns warnings quickly enough to avoid disrupting the normal workflow of developers. This result is confirmed by our user study, in which developers fixed data leaks twice as fast when using CHEETAH compared to an equivalent batch-style analysis.
@InProceedings{ISSTA17p307,
author = {Lisa Nguyen Quang Do and Karim Ali and Benjamin Livshits and Eric Bodden and Justin Smith and Emerson Murphy-Hill},
title = {Just-in-Time Static Analysis},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {307--317},
doi = {},
year = {2017},
}
Video
Info
Artifacts Functional
Refining Interprocedural Change-Impact Analysis using Equivalence Relations
Alex Gyori, Shuvendu K. Lahiri, and Nimrod Partush
(University of Illinois at Urbana-Champaign, USA; Microsoft Research, USA; Technion, Israel)
Change-impact analysis (CIA) is the task of determining the set of program elements impacted by a program change. Precise CIA has great potential to avoid expensive testing and code reviews for (parts of) changes that are refactorings (semantics-preserving). However, most statement-level CIA techniques suffer from imprecision, as they do not incorporate the semantics of the change. We formalize change impact in terms of the trace semantics of two program versions. We show how to leverage equivalence relations to make dataflow-based CIA aware of the change semantics, thereby improving precision in the presence of semantics-preserving changes. We propose an anytime algorithm that applies costly equivalence-relation inference incrementally to refine the set of impacted statements. We implemented a prototype and evaluated it on 322 real-world changes from open-source projects and benchmark programs used by prior research. The evaluation results show an average 35% improvement in the number of impacted statements compared to prior dataflow-based techniques.
@InProceedings{ISSTA17p318,
author = {Alex Gyori and Shuvendu K. Lahiri and Nimrod Partush},
title = {Refining Interprocedural Change-Impact Analysis using Equivalence Relations},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {318--328},
doi = {},
year = {2017},
}
Boosting the Precision of Virtual Call Integrity Protection with Partial Pointer Analysis for C++
Xiaokang Fan, Yulei Sui, Xiangke Liao, and Jingling Xue
(UNSW, Australia; National University of Defense Technology, China)
We present VIP, an approach to boosting the precision of Virtual call Integrity Protection for large-scale real-world C++ programs (e.g., Chrome) by using pointer analysis for the first time. VIP introduces two new techniques: (1) a sound and scalable partial pointer analysis for statically discovering the sets of legitimate targets at virtual callsites from separately compiled C++ modules, and (2) a lightweight instrumentation technique for performing (virtual call) integrity checks at runtime. VIP raises the bar against vtable hijacking attacks by providing stronger security guarantees than the CHA-based approach with comparable performance overhead. VIP is implemented in LLVM-3.8.0 and evaluated using SPEC programs and Chrome. Statically, VIP protects virtual calls more effectively than CHA, significantly reducing the sets of legitimate targets permitted at 20.3% of the virtual callsites per program, on average. Dynamically, VIP incurs an average (maximum) instrumentation overhead of 0.7% (3.3%), making it practically deployable as part of a compiler tool chain.
@InProceedings{ISSTA17p329,
author = {Xiaokang Fan and Yulei Sui and Xiangke Liao and Jingling Xue},
title = {Boosting the Precision of Virtual Call Integrity Protection with Partial Pointer Analysis for C++},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {329--340},
doi = {},
year = {2017},
}
Artifacts Functional
Lightweight Detection of Physical Unit Inconsistencies without Program Annotations
John-Paul Ore, Carrick Detweiler, and Sebastian Elbaum
(University of Nebraska-Lincoln, USA)
Systems interacting with the physical world operate on quantities measured in physical units. When unit operations in a program are inconsistent with the rules governing physical units, those systems may suffer. Existing approaches to supporting unit consistency in programs can impose an unacceptable burden on developers. In this paper, we present a lightweight static analysis approach focused on physical unit inconsistency detection that requires no end-user program annotation, modification, or migration. It does so by capitalizing on existing shared libraries that handle standardized physical units, common in the cyber-physical domain, to link class attributes of shared libraries to physical units. Then, leveraging rules from dimensional analysis, the approach propagates and infers units in programs that use these shared libraries, and detects inconsistent unit usage. We implement and evaluate the approach in a tool, analyzing 213 open-source systems containing over 900,000 LOC, finding inconsistencies in 11% of them, with an 87% true positive rate for a class of inconsistencies detected with high confidence. An initial survey of robot system developers finds that the unit inconsistencies detected by our tool are 'problematic', and we investigate how and when these inconsistencies occur.
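The dimensional-analysis rules that such an approach propagates can be sketched in a few lines (a hedged illustration, not the paper's tool): a unit is a map from base dimension to exponent; multiplication adds exponents, while addition of mismatched units signals an inconsistency.

```python
# Illustrative dimensional analysis: units as {base dimension: exponent}.
def mul(u, v):
    """Multiplying quantities adds unit exponents (e.g., m * s^-1 -> m/s)."""
    out = {d: u.get(d, 0) + v.get(d, 0) for d in set(u) | set(v)}
    return {d: e for d, e in out.items() if e != 0}  # drop cancelled dims

def add(u, v):
    """Adding quantities requires identical units; mismatch is flagged."""
    if u != v:
        raise TypeError(f"unit inconsistency: cannot add {u} and {v}")
    return u

meter, second = {"m": 1}, {"s": 1}
velocity = mul(meter, {"s": -1})           # m/s
acceleration = mul(velocity, {"s": -1})    # m/s^2
```

A checker built on these rules would seed units from known library types (e.g., a message field documented as meters), propagate them through expressions with `mul`-style rules, and report wherever an `add`-style rule fails.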
@InProceedings{ISSTA17p341,
author = {John-Paul Ore and Carrick Detweiler and Sebastian Elbaum},
title = {Lightweight Detection of Physical Unit Inconsistencies without Program Annotations},
booktitle = {Proc.\ ISSTA},
publisher = {ACM},
pages = {341--351},
doi = {},
year = {2017},
}
Artifacts Functional