ISSTA 2015 – Proceedings

Message from the Chairs
It is our great pleasure to welcome you to Baltimore, Maryland for ISSTA 2015, the 24th International Symposium on Software Testing and Analysis, to be held on July 13-17, 2015. ISSTA is the leading research symposium on software testing and analysis, bringing together academics, industrial researchers, and practitioners to exchange new ideas, problems, and experience on how to analyze and test software systems.

Main Research

Debugging

Evaluating the Usefulness of IR-Based Fault Localization Techniques
Qianqian Wang, Chris Parnin

, and Alessandro Orso

(Georgia Tech, USA; North Carolina State University, USA)
Software debugging is tedious and time consuming. To reduce the manual effort needed for debugging, researchers have proposed a considerable number of techniques to automate the process of fault localization; in particular, techniques based on information retrieval (IR) have drawn increased attention in recent years. Although reportedly effective, these techniques have some potential limitations that may affect their performance. First, their effectiveness is likely to depend heavily on the quality of the bug reports; unfortunately, high-quality bug reports that contain rich information are not always available. Second, these techniques have not been evaluated through studies that involve actual developers, which is less than ideal, as purely analytical evaluations can hardly show the actual usefulness of debugging techniques. The goal of this work is to evaluate the usefulness of IR-based techniques in real-world scenarios. Our investigation shows that bug reports do not always contain rich information, and that low-quality bug reports can considerably affect the effectiveness of these techniques. Our research also shows, through a user study, that high-quality bug reports benefit developers just as much as they benefit IR-based techniques. In fact, the information provided by IR-based techniques when operating on high-quality reports is only helpful to developers in a limited number of cases. And even in these cases, such information only helps developers get to the faulty file quickly, but does not help them in their most time consuming task: understanding and fixing the bug within that file.

Proactive Detection of Inadequate Diagnostic Messages for Software Configuration Errors
Sai Zhang and Michael D. Ernst

(University of Washington, USA)
This paper presents a technique to detect inadequate (i.e., missing or ambiguous) diagnostic messages for configuration errors issued by a configurable software system. The technique injects configuration errors into the software under test, monitors the software outcomes under the injected configuration errors, and uses natural language processing to analyze the output diagnostic message caused by each configuration error. The technique reports diagnostic messages that may be unhelpful in diagnosing a configuration error. We implemented the technique for Java in a tool, ConfDiagDetector. In an evaluation on 4 real-world, mature configurable systems, ConfDiagDetector reported 43 distinct inadequate diagnostic messages (25 missing and 18 ambiguous). 30 of the detected messages have been confirmed by their developers, and 12 more have been identified as inadequate by users in a user study. On average, Conf- DiagDetector required 5 minutes of programmer time and 3 minutes of compute time to detect each inadequate diagnostic message.

An Analysis of Patch Plausibility and Correctness for Generate-and-Validate Patch Generation Systems
Zichao Qi, Fan Long, Sara Achour, and Martin Rinard

(Massachusetts Institute of Technology, USA)
We analyze reported patches for three existing generate-and- validate patch generation systems (GenProg, RSRepair, and AE). The basic principle behind generate-and-validate systems is to accept only plausible patches that produce correct outputs for all inputs in the validation test suite. Because of errors in the patch evaluation infrastructure, the majority of the reported patches are not plausible — they do not produce correct outputs even for the inputs in the validation test suite. The overwhelming majority of the reported patches are not correct and are equivalent to a single modification that simply deletes functionality. Observed negative effects include the introduction of security vulnerabilities and the elimination of desirable functionality. We also present Kali, a generate-and-validate patch generation system that only deletes functionality. Working with a simpler and more effectively focused search space, Kali generates at least as many correct patches as prior GenProg, RSRepair, and AE systems. Kali also generates at least as many patches that produce correct outputs for the inputs in the validation test suite as the three prior systems. We also discuss the patches produced by ClearView, a generate-and-validate binary hot patching system that lever- ages learned invariants to produce patches that enable systems to survive otherwise fatal defects and security attacks. Our analysis indicates that ClearView successfully patches 9 of the 10 security vulnerabilities used to evaluate the system. At least 4 of these patches are correct.

aec-badge-issta

Web Security

BrowserAudit: Automated Testing of Browser Security Features
Charlie Hothersall-Thomas, Sergio Maffeis, and Chris Novakovic
(Netcraft, UK; Imperial College London, UK)
The security of the client side of a web application relies on browser features such as cookies, the same-origin policy and HTTPS. As the client side grows increasingly powerful and sophisticated, browser vendors have stepped up their offering of security mechanisms which can be leveraged to protect it. These are often introduced experimentally and informally and, as adoption increases, gradually become standardised (e.g., CSP, CORS and HSTS). Considering the diverse landscape of browser vendors, releases, and customised versions for mobile and embedded devices, there is a compelling need for a systematic assessment of browser security. We present BrowserAudit, a tool for testing that a deployed browser enforces the guarantees implied by the main standardised and experimental security mechanisms. It includes more than 400 fully-automated tests that exercise a broad range of security features, helping web users, application developers and security researchers to make an informed security assessment of a deployed browser. We validate BrowserAudit by discovering both fresh and known security-related bugs in major browsers.

Info

aec-badge-issta

Detection and Classiﬁcation of Malicious JavaScript via Attack Behavior Modelling
Yinxing Xue, Junjie Wang, Yang Liu

, Hao Xiao, Jun Sun, and Mahinthan Chandramohan
(Nanyang Technological University, Singapore; Singapore University of Technology and Design, Singapore)
Existing malicious JavaScript (JS) detection tools and commercial anti-virus tools mostly use feature-based or signature-based approaches to detect JS malware. These tools are weak in resistance to obfuscation and JS malware variants, not mentioning about providing detailed information of attack behaviors. Such limitations root in the incapability of capturing attack behaviors in these approches. In this paper, we propose to use Deterministic Finite Automaton (DFA) to abstract and summarize common behaviors of malicious JS of the same attack type. We propose an automatic behavior learning framework, named JS*, to learn DFAs from dynamic execution traces of JS malware, where we implement an effective online teacher by combining data dependency analysis, defense rules and trace replay mechanism. We evaluate JS* using real world data of 10000 benign and 276 malicious JS samples to cover 8 most-infectious attack types. The results demonstrate the scalability and effectiveness of our approach in the malware detection and classification, compared with commercial anti-virus tools. We also show how to use our DFAs to detect variants and new attacks.

Experience Report: An Empirical Study of PHP Security Mechanism Usage
Johannes Dahse and Thorsten Holz
(Ruhr University Bochum, Germany)
The World Wide Web mainly consists of web applications written in weakly typed scripting languages, with PHP being the most popular language in practice. Empirical evidence based on the analysis of vulnerabilities suggests that security is often added as an ad-hoc solution, rather than planning a web application with security in mind during the design phase. Although some best-practice guidelines emerged, no comprehensive security standards are available for developers. Thus, developers often apply their own favorite security mechanisms for data sanitization or validation to prohibit malicious input to a web application. In the context of our development of a new static code analysis tool for vulnerability detection, we studied commonly used input sanitization or validation mechanisms in 25 popular PHP applications. Our analysis of 2.5 million lines of code and over 26 thousand secured data flows provides a comprehensive overview of how developers utilize security mechanisms in practice regarding different markup contexts. In this paper, we discuss these security mechanisms in detail and reveal common pitfalls. For example, we found certain markup contexts and security mechanisms more frequently vulnerable than others. Our empirical study helps researchers, web developers, and tool developers to focus on error-prone markup contexts and security mechanisms in order to detect and mitigate vulnerabilities.

Mobile/Web Analysis

WuKong: A Scalable and Accurate Two-Phase Approach to Android App Clone Detection
Haoyu Wang, Yao Guo, Ziang Ma, and Xiangqun Chen

(Peking University, China)
Repackaged Android applications (app clones) have been found in many third-party markets, which not only compromise the copyright of original authors, but also pose threats to security and privacy of mobile users. Both fine-grained and coarse-grained approaches have been proposed to detect app clones. However, fine-grained techniques employing complicated clone detection algorithms are difficult to scale to hundreds of thousands of apps, while coarse-grained techniques based on simple features are scalable but less accurate. This paper proposes WuKong, a two-phase detection approach that includes a coarse-grained detection phase to identify suspicious apps by comparing light-weight static semantic features, and a fine-grained phase to compare more detailed features for only those apps found in the first phase. To further improve the detection speed and accuracy, we also introduce an automated clustering-based preprocessing step to filter third-party libraries before conducting app clone detection. Experiments on more than 100,000 Android apps collected from five Android markets demonstrate the effectiveness and scalability of our approach.

Systematic Execution of Android Test Suites in Adverse Conditions
Christoffer Quist Adamsen, Gianluca Mezzetti, and Anders Møller

(Aarhus University, Denmark)
Event-driven applications, such as, mobile apps, are difficult to test thoroughly. The application programmers often put significant effort into writing end-to-end test suites. Even though such tests often have high coverage of the source code, we find that they often focus on the expected behavior, not on occurrences of unusual events. On the other hand, automated testing tools may be capable of exploring the state space more systematically, but this is mostly without knowledge of the intended behavior of the individual applications. As a consequence, many programming errors remain unnoticed until they are encountered by the users. We propose a new methodology for testing by leveraging existing test suites such that each test case is systematically exposed to adverse conditions where certain unexpected events may interfere with the execution. In this way, we explore the interesting execution paths and take advantage of the assertions in the manually written test suite, while ensuring that the injected events do not affect the expected outcome. The main challenge that we address is how to accomplish this systematically and efficiently. We have evaluated the approach by implementing a tool, Thor, working on Android. The results on four real-world apps with existing test suites demonstrate that apps are often fragile with respect to certain unexpected events and that our methodology effectively increases the testing quality: Of 507 individual tests, 429 fail when exposed to adverse conditions, which reveals 66 distinct problems that are not detected by ordinary execution of the tests.

Info

aec-badge-issta

DLint: Dynamically Checking Bad Coding Practices in JavaScript
Liang Gong, Michael Pradel

, Manu Sridharan, and Koushik Sen
(University of California at Berkeley, USA; TU Darmstadt, Germany; Samsung Research, USA)
JavaScript has become one of the most popular programming languages, yet it is known for its suboptimal design. To effectively use JavaScript despite its design flaws, developers try to follow informal code quality rules that help avoid correctness, maintainability, performance, and security problems. Lightweight static analyses, implemented in "lint-like" tools, are widely used to find violations of these rules, but are of limited use because of the language's dynamic nature. This paper presents DLint, a dynamic analysis approach to check code quality rules in JavaScript. DLint consists of a generic framework and an extensible set of checkers that each addresses a particular rule. We formally describe and implement 28 checkers that address problems missed by state-of-the-art static approaches. Applying the approach in a comprehensive empirical study on over 200 popular web sites shows that static and dynamic checking complement each other. On average per web site, DLint detects 49 problems that are missed statically, including visible bugs on the web sites of IKEA, Hilton, eBay, and CNBC.

aec-badge-issta

Mobile Security

Scalable and Precise Taint Analysis for Android
Wei Huang, Yao Dong, Ana Milanova, and Julian Dolby

(Google, USA; Rensselaer Polytechnic Institute, USA; IBM Research, USA)
We propose a type-based taint analysis for Android. Concretely, we present DFlow, a context-sensitive information flow type system, and DroidInfer, the corresponding type inference analysis for detecting privacy leaks in Android apps. We present novel techniques for error reporting based on CFL-reachability, as well as novel techniques for handling of Android-specific features, including libraries, multiple entry points and callbacks, and inter-component communication. Empirical results show that our approach is scalable and precise. DroidInfer scales well in terms of time and memory and has false-positive rate of 15.7%. It detects privacy leaks in apps from the Google Play Store and in known malware.

aec-badge-issta

Dynamic Detection of Inter-application Communication Vulnerabilities in Android
Roee Hay, Omer Tripp, and Marco Pistoia
(IBM, Israel; IBM Research, USA)
A main aspect of the Android platform is Inter-Application Communication (IAC), which enables reuse of functionality across apps and app components via message passing. While a powerful feature, IAC also constitutes a serious attack surface. A malicious app can embed a payload into an IAC message, thereby driving the recipient app into a potentially vulnerable behavior if the message is processed without its fields first being sanitized or validated. We present what to our knowledge is the first comprehensive testing algorithm for Android IAC vulnerabilities. Toward this end, we first describe a catalog, stemming from our field experience, of 8 concrete vulnerability types that can potentially arise due to unsafe handling of incoming IAC messages. We then explain the main challenges that automated discovery of Android IAC vulnerabilities entails, including in particular path coverage and custom data fields, and present simple yet surprisingly effective solutions to these challenges. We have realized our testing approach as the IntentDroid system, which is available as a commercial cloud service. IntentDroid utilizes lightweight platform-level instrumentation, implemented via debug breakpoints (to run atop any Android device without any setup or customization), to recover IAC-relevant app-level behaviors. Evaluation of IntentDroid over a set of 80 top-popular apps has revealed a total 150 IAC vulnerabilities — some already fixed by the developers following our report — with a recall rate of 92% w.r.t. a ground truth established via manual auditing by a security expert.

aec-badge-issta

Modelgen: Mining Explicit Information Flow Specifications from Concrete Executions
Lazaro Clapp, Saswat Anand, and Alex Aiken

(Stanford University, USA)
We present a technique to mine explicit information flow specifications from concrete executions. These specifications can be consumed by a static taint analysis, enabling static analysis to work even when method definitions are missing or portions of the program are too difficult to analyze statically (e.g., due to dynamic features such as reflection). We present an implementation of our technique for the Android platform. When compared to a set of manually written specifications for 309 methods across 51 classes, our technique is able to recover 96.36% of these manual specifications and produces many more correct annotations that our manual models missed. We incorporate the generated specifications into an existing static taint analysis system, and show that they enable it to find additional true flows. Although our implementation is Android-specific, our approach is applicable to other application frameworks.

aec-badge-issta

Concurrency Analysis

When Truth Is Efficient: Analysing Concurrency
Ganesh Narayanaswamy
(University of Oxford, UK)
Concurrent systems are hard to develop and are even harder to analyse. The usual way to analyse concurrent systems is to give them interleaving semantics and exploit automata-based methods to investigate the resultant interleaved model. Such an approach is often hard to scale without additional tools to curb the interleaving-induced state space explosion. In this work we make an alternate case: for directly capturing the behaviour of concurrent systems using true concurrency. We show how to build composable, truly concurrent models for real-world programs written using one of the most widely adopted paradigms for developing massively parallel systems, the Message Passing Interface Standard (MPI). Our method employs general event structures to symbolically capture executions of MPI programs and uses this truly concurrent model, combined with our novel deadlock characterisation, to formulate a precise, scalable decision procedure that finds communication deadlocks in large MPI programs. We show that our analysis scales to systems with hundreds of processes and strongly outperforms state of the art interleaving semantics based approaches.

Pegasus: Automatic Barrier Inference for Stable Multithreaded Systems
Monika Dhok, Rashmi Mudduluru, and Murali Krishna Ramanathan
(Indian Institute of Science, India)
Deterministic multithreaded systems (DMTs) are designed to ensure reproducibility of program behavior for a given input. In these systems, even minor changes to the code (or input) can perturb the schedule. This increases the number of feasible schedules making reasoning about these programs harder. Stable multithreaded systems (StableMTs) address the problem such that a schedule is unaffected by minor changes. Unfortunately, determinism in these systems can potentially serialize the execution imposing a significant penalty on the performance. Programmer hints in the form of soft barriers attempt to eliminate the performance bottlenecks. However, the process is arduous, error-prone and requires manual intervention to reconsider the location of the barrier for every modification to the source code.
In this paper, we propose an effective approach to automate the task of adding soft barriers in the source code. Our approach analyzes the deterministic program executions to extract the program and semantic order of executions and builds a weighted constraint graph. Using this graph, a schedule is synthesized which is used to identify bottlenecks and insert soft barriers in the program source.
We validate our implementation, named PEGASUS, by applying it on multiple benchmarks. Our experimental results demonstrate that we are able to reduce the overall execution time of programs by upto 34% when compared to the execution time where barriers are inserted manually. Moreover, we observe a performance improvement ranging from 38% to 88% as compared to programs without barriers. Our experimental results show that adapting PEGASUS to infer barriers for multiple versions of a source program is seamless. The memory and time overheads associated with the usage of PEGASUS is negligible making it a vital cog in building scalable StableMTs.

ConcBugAssist: Constraint Solving for Diagnosis and Repair of Concurrency Bugs
Sepideh Khoshnood, Markus Kusano, and Chao Wang
(Virginia Tech, USA)
Programmers often have to spend a significant amount of time in- specting the software code and execution traces to identify the cause of a bug. For a multithreaded program, debugging is even more challenging due to the subtle interactions between threads and the often astronomical number of interleavings. In this work, we pro- pose a logical constraint based symbolic analysis method to aid in the diagnosis of concurrency bugs and to recommend repairs. Both diagnosis and repair are formulated as constraint solving prob- lems. Our method, by leveraging the power of satisfiability (SAT) solvers and a bounded model checker, performs a semantic analy- sis of the sequential computation as well as thread interactions. The constraint based analysis is designed for handling critical software with small to medium code size, but complex concurrency control, such as device drivers, implementations of synchronization proto- cols, and concurrent data structures. We have implemented our new method in a software tool and demonstrated its effectiveness in di- agnosing bugs in multithreaded C programs.

aec-badge-issta

Symbolic Execution

Enhancing Reuse of Constraint Solutions to Improve Symbolic Execution
Xiangyang Jia, Carlo Ghezzi, and Shi Ying
(Wuhan University, China; Politecnico di Milano, Italy)
Constraint solution reuse is an effective approach to save the time of constraint solving in symbolic execution. Most of the existing reuse approaches are based on syntactic or semantic equivalence of constraints. For example, the Green framework can reuse constraints which have different representations but are semantically equivalent, through canonizing constraints into syntactically equivalent normal forms. KLEE reuses constraints based on subset/superset querying. However, both equivalence-based approach and subset/superset-based approach cannot cover some kinds of reuse where atomic constraints are not equivalent. Our approach, called GreenTrie, is an extension to the Green framework, which supports constraint reuse based on the logical implication relations among constraints. GreenTrie provides a component, called L-Trie, which stores constraints and solutions into tries, indexed by an implication partial order graph of constraints. L-Trie is able to carry out logical reduction and logical subset and superset querying for given constraints, to check for reuse of previously solved constraints. We report the results of an experimental assessment of GreenTrie against the original Green framework and the KLEE approach, which shows that our extension achieves better reuse of constraint solving result and saves significant symbolic execution time.

S-Looper: Automatic Summarization for Multipath String Loops
Xiaofei Xie, Yang Liu

, Wei Le

, Xiaohong Li

, and Hongxu Chen
(Tianjin University, China; Nanyang Technological University, Singapore; Iowa State University, USA)
Loops are important yet most challenging program constructs to analyze for various program analysis tasks. Existing loop analysis techniques mainly handle well loops that contain only integer variables with a single path in the loop body. The key challenge in summarizing a multiple-path loop is that a loop traversal can yield a large number of possibilities due to the different execution orders of these paths located in the loop; when a loop contains a conditional branch related to string content, we potentially need to track every character in the string for loop summarization, which is expensive. In this paper, we propose an approach, named S-Looper, to automatically summarize a type of loops related to a string traversal. This type of loops can contain multiple paths, and the branch conditions in the loop can be related to string content. Our approach is to identify patterns of the string based on the branch conditions along each path in the loop. Based on such patterns, we then generate a loop summary that describes the path conditions of a loop traversal as well as the symbolic values of each variable at the exit of a loop. Combined with vulnerability conditions, we are thus able to generate test inputs that traverse a loop in a specific way and lead to exploitation. Our experiments show that handling such string loops can largely improve the buffer overflow detection capabilities of the existing symbolic analysis tool. We also compared our techniques with KLEE and PEX, and show that we can generate test inputs more effectively and efficiently.

Experience Report: How is Dynamic Symbolic Execution Different from Manual Testing? A Study on KLEE
Xiaoyin Wang

, Lingming Zhang, and Philip Tanofsky
(University of Texas at San Antonio, USA; University of Texas at Dallas, USA)
Software testing has been the major approach to software quality assurance for decades, but it typically involves intensive manual efforts. To reduce manual efforts, researchers have proposed numerous approaches to automate test-case generation, which is one of the most time-consuming tasks in software testing. One most recent achievement in the area is Dynamic Symbolic Execution (DSE), and tools based on DSE, such as KLEE, have been reported to generate test suites achieving higher code coverage than manually developed test suites. However, besides the competitive code coverage, there have been few studies to compare DSE-based test suites with manually developed test suites more thoroughly on various metrics to understand the detailed differences between the two testing methodologies. In this paper, we revisit the experimental study on the KLEE tool and GNU CoreUtils programs, and compare KLEE-based test suites with manually developed test suites on various aspects. We further carried out a qualitative study to investigates the reasons behind the differences in statistical results. The results of our studies show that while KLEE-based test suites are able to generate test cases with higher code coverage, they are relatively less effective on covering hard-to-cover code and killing mutants. Furthermore, our qualitative study reveals that KLEE-based test suites have advantages in exploring error-handling code and exhausting options, but are less effective on generating valid string inputs and exploring meaningful program behaviors.

Regression Testing

Practical Regression Test Selection with Dynamic File Dependencies
Milos Gligoric, Lamyaa Eloussi, and Darko Marinov

(University of Illinois at Urbana-Champaign, USA)
Regression testing is important but can be time-intensive. One approach to speed it up is regression test selection (RTS), which runs only a subset of tests. RTS was proposed over three decades ago but has not been widely adopted in practice. Meanwhile, testing frameworks, such as JUnit, are widely adopted and well integrated with many popular build systems. Hence, integrating RTS in a testing framework already used by many projects would increase the likelihood that RTS is adopted. We propose a new, lightweight RTS technique, called Ekstazi, that can integrate well with testing frameworks. Ekstazi tracks dynamic dependencies of tests on files, and unlike most prior RTS techniques, Ekstazi requires no integration with version-control systems. We implemented Ekstazi for Java and JUnit, and evaluated it on 615 revisions of 32 open-source projects (totaling almost 5M LOC) with shorter- and longer-running test suites. The results show that Ekstazi reduced the end-to-end testing time 32% on average, and 54% for longer-running test suites, compared to executing all tests. Ekstazi also has lower end-to-end time than the existing techniques, despite the fact that it selects more tests. Ekstazi has been adopted by several popular open source projects, including Apache Camel, Apache Commons Math, and Apache CXF.

Reliable Testing: Detecting State-Polluting Tests to Prevent Test Dependency
Alex Gyori, August Shi, Farah Hariri, and Darko Marinov

(University of Illinois at Urbana-Champaign, USA)
Writing reliable test suites for large object-oriented systems is complex and time consuming. One common cause of unreliable test suites are test dependencies that can cause tests to fail unexpectedly, not exposing bugs in the code under test but in the test code itself. Prior research has shown that the main reason for test dependencies is the “pollution” of state shared across tests.
We propose a technique, called , for finding tests that pollute the shared state. In a nutshell, finds tests that modify some location on the heap shared across tests or on the file system; a subsequent test could fail if it assumes the shared location to have the initial value before the state was modified. To aid in inspecting the pollutions, provides an access path through the heap that leads to the polluted value or the name of the file that was modified. We implemented a prototype tool for Java and evaluated it on projects, with a total of tests. reported , and our inspection found that of those are relevant pollutions that can easily affect other tests.

Empirical Evaluation of Pareto Efficient Multi-objective Regression Test Case Prioritisation
Michael G. Epitropakis, Shin Yoo, Mark Harman

, and Edmund K. Burke
(University of Stirling, UK; University College London, UK)
The aim of test case prioritisation is to determine an ordering of test cases that maximises the likelihood of early fault revelation. Previous prioritisation techniques have tended to be single objective, for which the additional greedy algorithm is the current state-of-the-art. Unlike test suite minimisation, multi objective test case prioritisation has not been thoroughly evaluated. This paper presents an extensive empirical study of the effectiveness of multi objective test case prioritisation, evaluating it on multiple versions of five widely-used benchmark programs and a much larger real world system of over 1 million lines of code. The paper also presents a lossless coverage compaction algorithm that dramatically scales the performance of all algorithms studied by between 2 and 4 orders of magnitude, making prioritisation practical for even very demanding problems.

Info

Search-Based Algorithms

Optimizing Selection of Competing Features via Feedback-Directed Evolutionary Algorithms
Tian Huat Tan, Yinxing Xue, Manman Chen, Jun Sun, Yang Liu

, and Jin Song Dong
(Singapore University of Technology and Design, Singapore; National University of Singapore, Singapore; Nanyang Technological University, Singapore)
Software that support various groups of customers usually require complicated configurations to attain different functionalities. To model the configuration options, feature model is proposed to capture the commonalities and competing variabilities of the product variants in software family or Software Product Line (SPL). A key challenge for deriving a new product is to find a set of features that do not have inconsistencies or conflicts, yet optimize multiple objectives (e.g., minimizing cost and maximizing number of features), which are often competing with each other. Existing works have attempted to make use of evolutionary algorithms (EAs) to address this problem. In this work, we incorporated a novel feedback-directed mechanism into existing EAs. Our empirical results have shown that our method has improved noticeably over all unguided version of EAs on the optimal feature selection. In particular, for case studies in SPLOT and LVAT repositories, the feedback-directed Indicator-Based EA (IBEA) has increased the number of correct solutions found by 72.33% and 75%, compared to unguided IBEA. In addition, by leveraging a pre-computed solution, we have found 34 sound solutions for Linux X86, which contains 6888 features, in less than 40 seconds.

aec-badge-issta

Automated Software Transplantation
Earl T. Barr

, Mark Harman

, Yue Jia, Alexandru Marginean, and Justyna Petke
(University College London, UK)
Automated transplantation would open many exciting avenues for software development: suppose we could autotransplant code from one system into another, entirely unrelated, system. This paper introduces a theory, an algorithm, and a tool that achieve this. Leveraging lightweight annotation, program analysis identifies an organ (interesting behavior to transplant); testing validates that the organ exhibits the desired behavior during its extraction and after its implantation into a host. While we do not claim automated transplantation is now a solved problem, our results are encouraging: we report that in 12 of 15 experiments, involving 5 donors and 3 hosts (all popular real-world systems), we successfully autotransplanted new functionality and passed all regression tests. Autotransplantation is also already useful: in 26 hours computation time we successfully autotransplanted the H.264 video encoding functionality from the x264 system to the VLC media player; compare this to upgrading x264 within VLC, a task that we estimate, from VLC's version history, took human programmers an average of 20 days of elapsed, as opposed to dedicated, time.

Info

aec-badge-issta

Automating Performance Bottleneck Detection using Search-Based Application Profiling
Du Shen, Qi Luo, Denys Poshyvanyk

, and Mark Grechanik
(College of William and Mary, USA; University of Illinois at Chicago, USA)
Application profiling is an important performance analysis technique, when an application under test is analyzed dynamically to determine its space and time complexities and the usage of its instructions. A big and important challenge is to profile nontrivial web applications with large numbers of combinations of their input parameter values. Identifying and understanding particular subsets of inputs leading to performance bottlenecks is mostly manual, intellectually intensive and laborious procedure.
We propose a novel approach for automating performance bottleneck detection using search-based input-sensitive application profiling. Our key idea is to use a genetic algorithm as a search heuristic for obtaining combinations of input parameter values that maximizes a fitness function that represents the elapsed execution time of the application. We implemented our approach, coined as Genetic Algorithm-driven Profiler (GA-Prof) that combines a search-based heuristic with contrast data mining of execution traces to accurately determine performance bottlenecks. We evaluated GA-Prof to determine how effectively and efficiently it can detect injected performance bottlenecks into three popular open source web applications. Our results demonstrate that GA-Prof efficiently explores a large space of input value combinations while automatically and accurately detecting performance bottlenecks, thus suggesting that it is effective for automatic profiling.

Verification

Test-Case Generation for Runtime Analysis and Vice Versa: Verification of Aircraft Separation Assurance
Marko Dimjašević and Dimitra Giannakopoulou
(University of Utah, USA; NASA Ames Research Center, USA)
This paper addresses the problem of specifying properties of aircraft separation assurance software, and verifying these properties at runtime. In particular, we target AutoResolver, a large, complex air-traffic control system that predicts and resolves aircraft loss of separation. In previous work, we developed a light-weight testing environment for AutoResolver. Our work contributed a wrapper around AutoResolver, which enabled the automated generation and fast execution of hundreds of thousands of tests. The focus of the work presented here is in specifying requirements for AutoResolver, in ensuring the generation of test cases that cover these requirements, and in developing a runtime infrastructure for automatically checking the requirements. Such infrastructure must be completely transparent to the AutoResolver code base. Our work combines test-case generation and runtime verification in innovative ways in order to address these challenges. The paper includes a detailed evaluation and discussion of our verification effort.

Reliability Assessment for Distributed Systems via Communication Abstraction and Refinement
Lin Gui, Jun Sun, Yang Liu

, and Jin Song Dong
(National University of Singapore, Singapore; Singapore University of Technology and Design, Singapore; Nanyang Technological University, Singapore)
Distributed systems like cloud-based services are ever more popular. Assessing the reliability of distributed systems is highly non-trivial. Particularly, the order of executions among distributed components adds a dimension of non-determinism, which invalidates existing reliability assessment methods based on Markov chains. Probabilistic model checking based on models like Markov decision processes is designed to deal with scenarios involving both probabilistic behavior (e.g., reliabilities of system components) and non-determinism. However, its application is currently limited by state space explosion, which makes reliability assessment of distributed system particularly difficult. In this work, we improve the probabilistic model checking through a method of abstraction and reduction, which controls the communications among system components and actively reduces the size of each component. We prove the soundness and completeness of the proposed approach. Through an implementation in a software toolkit and evaluations with several systems, we show that our approach often reduces the size of the state space by several orders of magnitude, while still producing sound and accurate assessment.

Reusing Constraint Proofs in Program Analysis
Andrea Aquino, Francesco A. Bianchi, Meixian Chen, Giovanni Denaro, and Mauro Pezzè

(University of Lugano, Switzerland; University of Milano-Bicocca, Italy)
Symbolic analysis techniques have largely improved over the years, and are now approaching an industrial maturity level. One of the main limitations to the scalability of symbolic analysis is the impact of constraint solving that is still a relevant bottleneck for the applicability of symbolic techniques, despite the dramatic improvements of the last decades. In this paper we discuss a novel approach to deal with the constraint solving bottleneck. Starting from the observation that constraints may recur during the analysis of the same as well as different programs, we investigate the advantages of complementing constraint solving with searching for the satisfiability proof of a constraint in a repository of constraint proofs. We extend recent proposals with powerful simplifications and an original canonical form of the constraints that reduce syntactically different albeit equivalent constraints to the same form, and thus facilitate the search for equivalent constraints in large repositories. The experimental results we attained indicate that the proposed approach improves over both similar solutions and state of the art constraint solvers.

Random Testing

Feedback-Controlled Random Test Generation
Kohsuke Yatoh, Kazunori Sakamoto, Fuyuki Ishikawa

, and Shinichi Honiden
(University of Tokyo, Japan; National Institute of Informatics, Japan)
Feedback-directed random test generation is a widely used technique to generate random method sequences. It leverages feedback to guide generation. However, the validity of feedback guidance has not been challenged yet. In this paper, we investigate the characteristics of feedback-directed random test generation and propose a method that exploits the obtained knowledge that excessive feedback limits the diversity of tests. First, we show that the feedback loop of feedback-directed generation algorithm is a positive feedback loop and amplifies the bias that emerges in the candidate value pool. This over-directs the generation and limits the diversity of generated tests. Thus, limiting the amount of feedback can improve diversity and effectiveness of generated tests. Second, we propose a method named feedback-controlled random test generation, which aggressively controls the feedback in order to promote diversity of generated tests. Experiments on eight different, real-world application libraries indicate that our method increases branch coverage by 78% to 204% over the original feedback-directed algorithm on large-scale utility libraries.

Info

aec-badge-issta

Randomized Stress-Testing of Link-Time Optimizers
Vu Le, Chengnian Sun, and Zhendong Su
(University of California at Davis, USA)
Link-time optimization (LTO) is an increasingly important and adopted modern optimization technology. It is currently supported by many production compilers, including GCC, LLVM, and Microsoft Visual C/C++. Despite its complexity, but because it is more recent, LTO is relatively less tested compared to the more mature, traditional optimizations. To evaluate and help improve the quality of LTO, we present the first extensive effort to stress-test the LTO components of GCC and LLVM, the two most widely-used production C compilers. In 11 months, we have discovered and reported 37 bugs (12 in GCC; 25 in LLVM). Developers have confirmed 21 of our bugs, and fixed 11 of them. Our core technique is differential testing and realized in the tool Proteus. We leverage existing compiler testing tools (Csmith and Orion) to generate single-file test programs and address two important challenges specific for LTO testing. First, to thoroughly exercise LTO, Proteus automatically transforms a single-file program into multiple compilation units and stochastically assigns each an optimization level. Second, for effective bug reporting, we develop a practical mechanism to reduce LTO bugs involving multiple files. Our results clearly demonstrate Proteus’s utility; we plan to make ours a continuous effort in validating link-time optimizers.

Automated Unit Test Generation during Software Development: A Controlled Experiment and Think-Aloud Observations
José Miguel Rojas

, Gordon Fraser, and Andrea Arcuri
(University of Sheffield, UK; Scienta, Norway; University of Luxembourg, Luxembourg)
Automated unit test generation tools can produce tests that are superior to manually written ones in terms of code coverage, but are these tests helpful to developers while they are writing code? A developer would first need to know when and how to apply such a tool, and would then need to understand the resulting tests in order to provide test oracles and to diagnose and fix any faults that the tests reveal. Considering all this, does automatically generating unit tests provide any benefit over simply writing unit tests manually? We empirically investigated the effects of using an automated unit test generation tool (EvoSuite) during development. A controlled experiment with 41 students shows that using EvoSuite leads to an average branch coverage increase of +13%, and 36% less time is spent on testing compared to writing unit tests manually. However, there is no clear effect on the quality of the implementations, as it depends on how the test generation tool and the generated tests are used. In-depth analysis, using five think-aloud observations with professional programmers, confirms the necessity to increase the usability of automated unit test generation tools, to integrate them better during software development, and to educate software developers on how to best use those tools.

Info

aec-badge-issta

Domain-Specific Testing

Calculation Coverage Testing in Scientific Applications
Yoshiki Sato, Shumpei Hozumi, and Shigeru Chiba

(University of Tokyo, Japan)
A typical implementation of scientific applications includes a large number of iterative calculations. For performance optimization, these calculations are often partitioned, grouped, and reordered for execution. Since this refactoring is repeatedly performed during development, it is one of the major source of bugs and thus tool support is necessary for debugging. This study discusses this problem and proposes tool support through a testing framework. This testing framework can help developers perform the tests we call calculation coverage testing. It investigates whether the grouped calculations cover all the calculations performed by the original (and often naively implemented) program. It also investigates whether their execution order is correct. To demonstrate this idea, we also presents HPCUnit, our prototype testing framework for Java, and then reports an empirical study applying it to the Java Grande Forum Benchmark Suite.

aec-badge-issta

Automatic Fault Injection for Driver Robustness Testing
Kai Cong, Li Lei, Zhenkun Yang, and Fei Xie
(Portland State University, USA)
Robustness testing is a crucial stage in the device driver development cycle. To accelerate driver robustness testing, effective fault scenarios need to be generated and injected without requiring much time and human effort. In this paper, we present a practical approach to automatic runtime generation and injection of fault scenarios for driver robustness testing. We identify target functions that can fail from runtime execution traces, generate effective fault scenarios on these target functions using a bounded trace-based iterative strategy, and inject the generated fault scenarios at runtime to test driver robustness using a permutation-based injection mechanism. We have evaluated our approach on 12 Linux device drivers and found 28 severe bugs. All these bugs have been further validated via manual fault injection. The results demonstrate that our approach is useful and efficient in generating fault scenarios for driver robustness testing with little manual effort.

Preventing Data Errors with Continuous Testing
Kıvanç Muşlu, Yuriy Brun

, and Alexandra Meliou
(University of Washington, USA; University of Massachusetts, USA)
Today, software systems that rely on data are ubiquitous, and ensuring the data's quality is an increasingly important challenge as data errors result in annual multi-billion dollar losses. While software debugging and testing have received heavy research attention, less effort has been devoted to data debugging: identifying system errors caused by well-formed but incorrect data. We present continuous data testing (CDT), a low-overhead, delay-free technique that quickly identifies likely data errors. CDT continuously executes domain-specific test queries; when a test fails, CDT unobtrusively warns the user or administrator. We implement CDT in the ConTest prototype for the PostgreSQL database management system. A feasibility user study with 96 humans shows that ConTest was extremely effective in a setting with a data entry application at guarding against data errors: With ConTest, users corrected 98.4% of their errors, as opposed to 40.2% without, even when we injected 40% false positives into ConTest's output. Further, when using ConTest, users corrected data entry errors 3.2 times faster than when using state-of-the-art methods.

Model-Based Testing

Automatic Generation of System Test Cases from Use Case Specifications
Chunhui Wang, Fabrizio Pastore

, Arda Goknil, Lionel Briand

, and Zohaib Iqbal
(University of Luxembourg, Luxembourg; National University of Computer and Emerging Sciences, Pakistan)
In safety critical domains, system test cases are often derived from functional requirements in natural language (NL) and traceability between requirements and their corresponding test cases is usually mandatory. The definition of test cases is therefore time-consuming and error prone, especially so given the quickly rising complexity of embedded systems in many critical domains. Though considerable research has been devoted to automatic generation of system test cases from NL requirements, most of the proposed approaches re- quire significant manual intervention or additional, complex behavioral modelling. This significantly hinders their applicability in practice. In this paper, we propose Use Case Modelling for System Tests Generation (UMTG), an approach that automatically generates executable system test cases from use case spec- ifications and a domain model, the latter including a class diagram and constraints. Our rationale and motivation are that, in many environments, including that of our industry partner in the reported case study, both use case specifica- tions and domain modelling are common and accepted prac- tice, whereas behavioural modelling is considered a difficult and expensive exercise if it is to be complete and precise. In order to extract behavioral information from use cases and enable test automation, UMTG employs Natural Language Processing (NLP), a restricted form of use case specifica- tions, and constraint solving.

RTCM: A Natural Language Based, Automated, and Practical Test Case Generation Framework
Tao Yue

, Shaukat Ali

, and Man Zhang
(Simula Research Laboratory, Norway; University of Oslo, Norway)
Based on our experience of collaborating with industry, we observed that test case generation usually relies on test case specifications (TCSs), commonly written in natural language, specifying test cases of a System Under Test at a high level of abstraction. In practice, TCSs are commonly used by test engineers as reference documents to perform these activities: 1) Manually executing test cases in TCSs; 2) Manually coding test cases in a test scripting language for automated test case execution. In the latter case, the gap between TCSs and executable test cases has to be filled by test engineers, requiring a significant amount of coding effort and domain knowledge. Motivated by the above observations from the industry, we first propose, in this paper, a TCS language, named as Restricted Test Case Modeling (RTCM), which is based on natural language and composed of an easy-to-use template, a set of restriction rules and keywords. Second, we propose a test case generation tool (aToucan4Test), which takes TCSs in RTCM as input and generates either manual test cases or automatically executable test cases, based on various coverage criteria defined on RTCM. To assess the applicability of RTCM, we manually modeled two industrial case studies and examined 30 automatically generated TCSs. To evaluate aToucan4Test, we modeled three subsystems of a Video Conferencing System developed by Cisco Systems, Norway and automatically generated executable test cases. These test cases were successfully executed on two commercial software versions. In the paper, we also discuss our experience of applying RTCM and aToucan4Test in an industrial context and compare our approach with other model-based testing methodologies.

Tool Demonstrations

Dynamic Taint Tracking for Java with Phosphor (Demo)
Jonathan Bell and Gail Kaiser

(Columbia University, USA)
Dynamic taint tracking is an information flow analysis that can be applied to many areas of testing. Phosphor is the first portable, accurate and performant dynamic taint tracking system for Java. While previous systems for performing general-purpose taint tracking in the JVM required specialized research JVMs, Phosphor works with standard off-the-shelf JVMs (such as Oracle's HotSpot and OpenJDK's IcedTea). Phosphor also differs from previous portable JVM taint tracking systems that were not general purpose (e.g. tracked only tags on Strings and no other type), in that it tracks tags on all variables. We have also made several enhancements to Phosphor, to track taint tags through control flow (in addition to data flow), as well as to track an arbitrary number of relationships between taint tags (rather than be limited to only 32 tags). In this demonstration, we show how developers writing testing tools can benefit from Phosphor, and explain briefly how to interact with it.

Info

TSTL: A Language and Tool for Testing (Demo)
Alex Groce

, Jervis Pinto, Pooria Azimi, and Pranjal Mittal
(Oregon State University, USA)
Writing a test harness is a difficult and repetitive program- ming task, and the lack of tool support for customized auto- mated testing is an obstacle to the adoption of more sophis- ticated testing in industry. This paper presents TSTL, the Template Scripting Testing Language, which allows users to specify the general form of valid tests for a system in a simple but expressive language, and tools to support testing based on a TSTL definition. TSTL is a minimalist template- based domain-specific language, using the source language of the Software Under Test (SUT) to support most operations, but adding declarative idioms for testing. TSTL compiles to a common testing interface that hides the details of the SUT and provides support for logging, code coverage, delta debugging, and other core testing functionality, making it easy to write universal testing tools such as random testers or model checkers that apply to all TSTL-defined harnesses. TSTL is currently available for Python, but easily adapted to other languages as well.

CanaryAdvisor: A Statistical-Based Tool for Canary Testing (Demo)
Alexander Tarvo, Peter F. Sweeney, Nick Mitchell, V.T. Rajan, Matthew Arnold, and Ioana Baldini
(IBM Research, USA)
Canary testing is an emerging technique that offers to minimize the risk of deploying a new version of software. It does so by slowly transferring load from the current to the new ("canary") version. As this ramp-up progresses, a human compares the performance and correctness of the two versions, and assesses whether to abort the canary version. For canary testing to be effective, a plethora of metrics must be analyzed, including CPU utilization and logged errors, across hundreds to thousands of machines. Performing this analysis manually is both time consuming and error prone. In this paper, we present CanaryAdvisor, a tool for automatic canary testing of cloud-based applications. CanaryAdvisor continuously monitors the deployed versions of an application and detects degradations in correctness, performance, and/or scalability. We describe our design and implementation of the CanaryAdvisor and outline open challenges.

SAMC: A Fast Model Checker for Finding Heisenbugs in Distributed Systems (Demo)
Tanakorn Leesatapornwongsa and Haryadi S. Gunawi

(University of Chicago, USA)
We present SAMC, an open-source model checker that can be integrated to many modern distributed cloud systems. SAMC can find concurrency bugs caused by non-deterministic dis- tributed events. We have successfully integrated SAMC to Hadoop, ZooKeeper and Cassandra.

Info

Doctoral Symposium

Making Your Crashes Work for You (Doctoral Symposium)
Peter Ohmann
(University of Wisconsin-Madison, USA)
Debugging is difficult and costly. Developers greatly value full traces and complete, reproducible crash recordings, but these are impractical for deployed software. Fortunately, failing applications can leave behind a snapshot of their crashing state in the form of a core dump. Unfortunately, crash data alone often leaves substantial ambiguity in the program's execution.
My thesis work aims both to improve the quality of information extracted from core dumps and enhance this readily-available information. My prior work showed that automated postmortem analysis results can be significantly improved by targeted, lightweight, and tunable instrumentation (0-5% run-time overhead). My thesis aims to expand this work in three directions: improved tracing and dump data recovery, expanded postmortem analyses, and improved tracing efficiency based on previously-observed failures.

Scalable Program Analysis through Proof Caching (Doctoral Symposium)
Andrea Aquino
(University of Lugano, Switzerland)
Despite the remarkable advances attained by the SMT community in the last decade, solving complex formulas still represents the main bottleneck to the scalability of program analysis techniques. Recent research work has shown that formulas generated during program analysis recur, and such redundancy can be captured and exploited by means of caching frameworks to avoid repeating complex queries to solvers. Although current approaches show that reusing formulas syntactically can indeed reduce the impact of SMT solvers on program analysis, they still suffer from being logic-dependent, and performing poorly on huge sets of heterogenous formulas. The core idea of our approach is to go beyond merely syntactical caching frameworks by designing a caching framework that is able to reuse proofs instead of formulas. In fact, even formulas that are syntactically different can share solutions. We aim to study the recurrence of proofs across heterogeneous formulas, and to define a technique to efficiently retrieve such proofs. We plan to exploit a suitable distance function that measures the amount of proofs shared by two formulas to allow the efficient retrieval of candidate proofs within a potentially large space of proofs. In this paper, we present the problem, draft the core idea, discuss the early results and present our research plans.

Collaborative Testing across Shared Software Components (Doctoral Symposium)
Teng Long
(University of Maryland, USA)
Components of numerous component-based software systems are often developed and tested by multiple stakeholders, and there often exists significant overlap and synergy in the processes of testing systems with shared components. This work demonstrates that testers of shared software components can save effort by avoiding redundant work, and improve the test quality of each component as well as overall software systems by using information obtained when testing across multiple components. My research contains three parts. First, I investigate current testing practices for component-based software systems to find the overlap and synergy we conjecture exists. Second, I design and implement infrastructure and related tools to facilitate communication and data sharing between testers. Third, I design various testing processes to implement different collaborative testing algorithms and apply them to large actively developed software systems.

Cost-Aware Combinatorial Interaction Testing (Doctoral Symposium)
Gulsen Demiroz
(Sabanci University, Turkey)
The configuration spaces of software systems are often too large to test exhaustively. Combinatorial interaction testing approaches, such as covering arrays, systematically sample the configuration space and test only the selected configurations. Traditional t-way covering arrays aim to cover all t-way combinations of option settings in a minimum number of configurations. By doing so, they assume that the testing cost of a configuration is the same for all configurations. In my thesis work, we however argue that, in practice, the actual testing cost may differ from one configuration to another and that accounting for these differences can improve the cost-effectiveness of covering arrays. To this end, we introduced a new novel combinatorial object, called a cost-aware covering array where a t-way cost-aware covering array is a t-way covering array that minimizes a given cost function. As part of progress, we developed an algorithm for a simple, yet important scenario, and the results of our empirical studies suggest that cost-aware covering arrays can greatly reduce the actual cost of testing compared to traditional covering arrays. We also defined a framework for defining the cost function but then we observed that manually creating these cost models is impractical. Hence our first future goal is to develop an approach for automatically discovering cost models for complex configuration spaces. Our second future goal is then to develop algorithms to generate cost-aware covering arrays for more general cost scenarios. Our focus is currently on meta-heuristic search algorithms such as simulated annealing and genetic algorithms to construct cost-aware covering arrays. Another goal is to expand the cost framework to be test-case aware where not every test case is valid for a configuration, hence the cost of running the test suite is actually different for each configuration.

Mining Change History for Test-Plan Generation (Doctoral Symposium)
Thomas Rolfsnes
(Simula Research Laboratory, Norway)
Regression testing is an essential step in safeguarding the evolution of a system, yet there is often not enough time to exercise all available tests. Identifying the subset of tests that can reveal potential issues introduced by a change is a challenge. It requires identifying the tests that test the chan- ged part of the software. Furthermore, and more challeng- ing, it requires identifying the parts of the system that are potentially affected by that change, a task typically done by means of static program analysis. In this doctoral research, we investigate an alternative approach, using software repos- itory mining. We propose a method that mines the change history of a system to uncover dependencies, and uses these for test-selection and test-prioritization. By reducing the amount of test to exercise, and limiting time spend on test- plan creation (i.e., selecting and prioritizing tests), the aim of our approach is to increase cost-effectiveness of software regression testing.

ISSTA 2015 – Proceedings

Frontmatter

Main Research

Debugging

Web Security

Mobile/Web Analysis

Mobile Security

Concurrency Analysis

Symbolic Execution

Regression Testing

Search-Based Algorithms

Verification

Random Testing

Domain-Specific Testing

Model-Based Testing

Tool Demonstrations

Doctoral Symposium