ASE 2018
33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018)
Powered by
Conference Publishing Consulting

33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018), September 3–7, 2018, Montpellier, France

ASE 2018 – Proceedings

Contents - Abstracts - Authors


Title Page

Message from the Chairs
It is our great pleasure and honor to welcome everyone to the 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE 2018). The ASE conference series is the premier research forum for automated software engineering. Each year it brings together researchers and practitioners from academia and industry to discuss foundations, techniques, and tools for automated analysis, design, implementation, testing, and maintenance of software systems.
ASE 2018 Conference Organization
Organizing Committee General Chair Marianne Huchard University of Montpellier, France Program Co-Chairs Gordon Fraser University of Passau, Germany Christian Kästner Carnegie Mellon University, USA
ASE 2018 Sponsors


Software Heritage: Collecting, Preserving, and Sharing All Our Source Code (Keynote)
Roberto Di Cosmo
(Inria, France; University Paris Diderot, France)
Software Heritage is a non profit initiative whose ambitious goal is to collect, preserve and share the source code of all software ever written, with its full development history, building a universal source code software knowledge base. Software Heritage addresses a variety of needs: preserving our scientific and technological knowledge, enabling better software development and reuse for society and industry, fostering better science, and building an essential infrastructure for large scale, reproducible software studies. We have already collected over 4 billions unique source files from over 80 millions repositories, and organised them into a giant Merkle graph, with full deduplication across all repositories. This allows us to cope with the growth of collaborative software development, and provides a unique vantage point for observing its evolution. In this talk, we will highlight the new challenges and opportunities that Software Heritage brings up.
Publisher's Version Article Search
Automated Requirements Engineering Challenges with Examples from Small Unmanned Aerial Systems (Keynote)
Jane Cleland-Huang
(University of Notre Dame, USA)
Requirements Engineering includes various activities aimed at discovering, analyzing, validating, evolving, and managing software and systems requirements. Many of these activities are human facing, effort intensive, and sometimes error prone. They could benefit greatly from cutting edge advances in automation. However, the software engineering community has primarily focused on automating other aspects of the development process such as testing, code analytics, and mining software respositories. As a result, advances in software analytics have had superficial impact upon advancing the state of art and practice in the field of requirements engineering. Two primary inhibitors are the lack of publicly available datasets and poorly publicized industry-relevant open requirements analytic challenges. To empower the Automated Software Engineering community to tackle open Requirements Engineering challenges, the talk will describe the rapidly evolving landscape of requirements engineering, clearly articulate open challenges, draw upon examples from an ongoing, agile, safety-critical project in the domain of Unmanned Aerial Vehicles, and introduce Dronology as a new community dataset.
Publisher's Version Article Search
Implementation Science for Software Engineering: Bridging the Gap between Research and Practice (Keynote)
Lauren Herckis
(Carnegie Mellon University, USA)
Software engineering research rarely impacts practitioners in the field. A desire to facilitate transfer alone is not sufficient to bridge the gap between research and practice. Fields from medicine to education have acknowledged a similar challenge over the past 25 years. An empirical approach to the translation of research into evidence-based practice has emerged from the resulting discussion. Implementation science has fundamentally changed the way that biomedical research is conducted, and has revolutionized the daily practice of doctors, social workers, epidemiologists, early childhood educators, and more. In this talk we will explore the methods, frameworks, and practices of implementation science and their application to novel disciplines, including software engineering research. We will close by proposing some directions for future software engineering research to facilitate transfer.
Publisher's Version Article Search
The Need for Context in Software Engineering (IEEE CS Harlan Mills Award Keynote)
Gail C. Murphy
(University of British Columbia, Canada)
The development of a software system requires the orchestration of many different people using many different tools. Despite the need for a developer who is performing a task to understand the context in which that task is undertaken, the tools we imagine, build and provide to support software developers are typically isolated. Instead of the tools helping a developer work within the appropriate context, it is the developer who must bring the context to the tools. In this talk, I will argue that the lack of context in the software engineering tools we build limits the effectiveness of developers and of our software development practices.
Publisher's Version Article Search

Technical Research Papers


On Adopting Linters to Deal with Performance Concerns in Android Apps
Sarra Habchi, Xavier Blanc, and Romain Rouvoy
(Inria, France; University of Lille, France; University of Bordeaux, France)
With millions of applications (apps) distributed through mobile markets, engaging and retaining end-users challenge Android developers to deliver a nearly perfect user experience. As mobile apps run in resource-limited devices, performance is a critical criterion for the quality of experience. Therefore, developers are expected to pay much attention to limit performance bad practices. On the one hand, many studies already identified such performance bad practices and showed that they can heavily impact app performance. Hence, many static analysers, a.k.a. linters, have been proposed to detect and fix these bad practices. On the other hand, other studies have shown that Android developers tend to deal with performance reactively and they rarely build on linters to detect and fix performance bad practices. In this paper, we therefore perform a qualitative study to investigate this gap between research and development community. In particular, we performed interviews with 14 experienced Android developers to identify the perceived benefits and constraints of using linters to identify performance bad practices in Android apps. Our observations can have a direct impact on developers and the research community. Specifically, we describe why and how developers leverage static source code analysers to improve the performance of their apps. On top of that, we bring to light important challenges faced by developers when it comes to adopting static analysis for performance purposes.
Publisher's Version Article Search Artifacts Available
PerfLearner: Learning from Bug Reports to Understand and Generate Performance Test Frames
Xue Han, Tingting Yu, and David Lo
(University of Kentucky, USA; Singapore Management University, Singapore)
Software performance is important for ensuring the quality of software products. Performance bugs, defined as programming errors that cause significant performance degradation, can lead to slow systems and poor user experience. While there has been some research on automated performance testing such as test case generation, the main idea is to select workload values to increase the program execution times. These techniques often assume the initial test cases have the right combination of input parameters and focus on evolving values of certain input parameters. However, such an assumption may not hold for highly configurable real-word applications, in which the combinations of input parameters can be very large. In this paper, we manually analyze 300 bug reports from three large open source projects - Apache HTTP Server, MySQL, and Mozilla Firefox. We found that 1) exposing performance bugs often requires combinations of multiple input parameters, and 2) certain input parameters are frequently involved in exposing performance bugs. Guided by these findings, we designed and evaluated an automated approach, PerfLearner, to extract execution commands and input parameters from descriptions of performance bug reports and use them to generate test frames for guiding actual performance test case generation.
Publisher's Version Article Search
AutoConfig: Automatic Configuration Tuning for Distributed Message Systems
Liang Bao, Xin Liu, Ziheng Xu, and Baoyin Fang
(Xidian University, China; University of California at Davis, USA)
Distributed message systems (DMSs) serve as the communication backbone for many real-time streaming data processing applications. To support the vast diversity of such applications, DMSs provide a large number of parameters to configure. However, It overwhelms for most users to configure these parameters well for better performance. Although many automatic configuration approaches have been proposed to address this issue, critical challenges still remain: 1) to train a better and robust performance prediction model using a limited number of samples, and 2) to search for a high-dimensional parameter space efficiently within a time constraint. In this paper, we propose AutoConfig -- an automatic configuration system that can optimize producer-side throughput on DMSs. AutoConfig constructs a novel comparison-based model (CBM) that is more robust that the prediction-based model (PBM) used by previous learning-based approaches. Furthermore, AutoConfig uses a weighted Latin hypercube sampling (wLHS) approach to select a set of samples that can provide a better coverage over the high-dimensional parameter space. wLHS allows AutoConfig to search for more promising configurations using the trained CBM. We have implemented AutoConfig on the Kafka platform, and evaluated it using eight different testing scenarios deployed on a public cloud. Experimental results show that our CBM can obtain better results than that of PBM under the same random forests based model. Furthermore, AutoConfig outperforms default configurations by 215.40% on average, and five state-of-the-art configuration algorithms by 7.21%-64.56%.
Publisher's Version Article Search
Is This Class Thread-Safe? Inferring Documentation using Graph-Based Learning
Andrew Habib and Michael Pradel
(TU Darmstadt, Germany)
Thread-safe classes are pervasive in concurrent, object-oriented software. However, many classes lack documentation regarding their safety guarantees under multi-threaded usage. This lack of documentation forces developers who use a class in a concurrent program to either carefully inspect the implementation of the class, to conservatively synchronize all accesses to it, or to optimistically assume that the class is thread-safe. To overcome the lack of documentation, we present TSFinder, an approach to automatically classify classes as supposedly thread-safe or thread-unsafe. The key idea is to combine a lightweight static analysis that extracts a graph representation from classes with a graph-based classifier. After training the classifier with classes known to be thread-safe and thread-unsafe, it achieves an accuracy of 94.5% on previously unseen classes, enabling the approach to infer thread safety documentation with high confidence. The classifier takes about 3 seconds per class, i.e., it is efficient enough to infer documentation for many classes.
Publisher's Version Article Search Info

Testing Studies

A Large-Scale Study of Test Coverage Evolution
Michael Hilton, Jonathan Bell, and Darko Marinov
(Carnegie Mellon University, USA; George Mason University, USA; University of Illinois at Urbana-Champaign, USA)
Statement coverage is commonly used as a measure of test suite quality. Coverage is often used as a part of a code review process: if a patch decreases overall coverage, or is itself not covered, then the patch is scrutinized more closely. Traditional studies of how coverage changes with code evolution have examined the overall coverage of the entire program, and more recent work directly examines the coverage of patches (changed statements). We present an evaluation much larger than prior studies and moreover consider a new, important kind of change --- coverage changes of unchanged statements. We present a large-scale evaluation of code coverage evolution over 7,816 builds of 47 projects written in popular languages including Java, Python, and Scala. We find that in large, mature projects, simply measuring the change to statement coverage does not capture the nuances of code evolution. Going beyond considering statement coverage as a simple ratio, we examine how the set of statements covered evolves between project revisions. We present and study new ways to assess the impact of a patch on a project's test suite quality that both separates coverage of the patch from coverage of the non-patch, and separates changes in coverage from changes in the set of statements covered.
Publisher's Version Article Search
Effectiveness and Challenges in Generating Concurrent Tests for Thread-Safe Classes
Valerio Terragni and Mauro Pezzè
(University of Lugano, Switzerland)
Developing correct and efficient concurrent programs is difficult and error-prone, due to the complexity of thread synchronization. Often, developers alleviate such problem by relying on thread-safe classes, which encapsulate most synchronization-related challenges. Thus, testing such classes is crucial to ensure the reliability of the concurrency aspects of programs. Some recent techniques and corresponding tools tackle the problem of testing thread-safe classes by automatically generating concurrent tests. In this paper, we present a comprehensive study of the state-of-the-art techniques and an independent empirical evaluation of the publicly available tools. We conducted the study by executing all tools on the JaConTeBe benchmark that contains 47 well-documented concurrency faults. Our results show that 8 out of 47 faults (17%) were detected by at least one tool. By studying the issues of the tools and the generated tests, we derive insights to guide future research on improving the effectiveness of automated concurrent test generation.
Publisher's Version Article Search

Build and Test Automation

Scalable Incremental Building with Dynamic Task Dependencies
Gabriël Konat, Sebastian Erdweg, and Eelco Visser
(Delft University of Technology, Netherlands)
Incremental build systems are essential for fast, reproducible software builds. Incremental build systems enable short feedback cycles when they capture dependencies precisely and selectively execute build tasks efficiently. A much overlooked feature of build systems is the expressiveness of the scripting language, which directly influences the maintainability of build scripts. In this paper, we present a new incremental build algorithm that allows build engineers to use a full-fledged programming language with explicit task invocation, value and file inspection facilities, and conditional and iterative language constructs. In contrast to prior work on incrementality for such programmable builds, our algorithm scales with the number of tasks affected by a change and is independent of the size of the software project being built. Specifically, our algorithm accepts a set of changed files, transitively detects and re-executes affected build tasks, but also accounts for new task dependencies discovered during building. We have evaluated the performance of our algorithm in a real-world case study and confirm its scalability.
Publisher's Version Article Search
Noise and Heterogeneity in Historical Build Data: An Empirical Study of Travis CI
Keheliya Gallaba, Christian Macho, Martin Pinzger, and Shane McIntosh
(McGill University, Canada; University of Klagenfurt, Austria)
Automated builds, which may pass or fail, provide feedback to a development team about changes to the codebase. A passing build indicates that the change compiles cleanly and tests (continue to) pass. A failing (a.k.a., broken) build indicates that there are issues that require attention. Without a closer analysis of the nature of build outcome data, practitioners and researchers are likely to make two critical assumptions: (1) build results are not noisy; however, passing builds may contain failing or skipped jobs that are actively or passively ignored; and (2) builds are equal; however, builds vary in terms of the number of jobs and configurations. To investigate the degree to which these assumptions about build breakage hold, we perform an empirical study of 3.7 million build jobs spanning 1,276 open source projects. We find that: (1) 12% of passing builds have an actively ignored failure; (2) 9% of builds have a misleading or incorrect outcome on average; and (3) at least 44% of the broken builds contain passing jobs, i.e., the breakage is local to a subset of build variants. Like other software archives, build data is noisy and complex. Analysis of build data requires nuance.
Publisher's Version Article Search

Quality Assurance for Machine Learning Techniques

Automated Directed Fairness Testing
Sakshi Udeshi, Pryanshu Arora, and Sudipta Chattopadhyay
(Singapore University of Technology and Design, Singapore; BITS Pilani, India)
Fairness is a critical trait in decision making. As machine-learning models are increasingly being used in sensitive application domains (e.g. education and employment) for decision making, it is crucial that the decisions computed by such models are free of unintended bias. But how can we automatically validate the fairness of arbitrary machine-learning models? For a given machine-learning model and a set of sensitive input parameters, our Aeqitas approach automatically discovers discriminatory inputs that highlight fairness violation. At the core of Aeqitas are three novel strategies to employ probabilistic search over the input space with the objective of uncovering fairness violation. Our Aeqitas approach leverages inherent robustness property in common machine-learning models to design and implement scalable test generation methodologies. An appealing feature of our generated test inputs is that they can be systematically added to the training set of the underlying model and improve its fairness. To this end, we design a fully automated module that guarantees to improve the fairness of the model. We implemented Aeqitas and we have evaluated it on six stateof- the-art classifiers. Our subjects also include a classifier that was designed with fairness in mind. We show that Aeqitas effectively generates inputs to uncover fairness violation in all the subject classifiers and systematically improves the fairness of respective models using the generated test inputs. In our evaluation, Aeqitas generates up to 70% discriminatory inputs (w.r.t. the total number of inputs generated) and leverages these inputs to improve the fairness up to 94%.
Publisher's Version Article Search
Concolic Testing for Deep Neural Networks
Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening
(University of Oxford, UK; University of Liverpool, UK)
Concolic testing combines program execution and symbolic analysis to explore the execution paths of a software program. In this paper, we develop the first concolic testing approach for Deep Neural Networks (DNNs). More specifically, we utilise quantified linear arithmetic over rationals to express test requirements that have been studied in the literature, and then develop a coherent method to perform concolic testing with the aim of better coverage. Our experimental results show the effectiveness of the concolic testing approach in both achieving high coverage and finding adversarial examples.
Publisher's Version Article Search
DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems
Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang
(Harbin Institute of Technology, China; Nanyang Technological University, Singapore; Carnegie Mellon University, USA; Kyushu University, Japan; University of Illinois at Urbana-Champaign, USA; Monash University, Australia)
Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems.
Publisher's Version Article Search
DeepRoad: GAN-Based Metamorphic Testing and Input Validation Framework for Autonomous Driving Systems
Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid
(University of Texas at Austin, USA; Southern University of Science and Technology, China; University of Texas at Dallas, USA)
While Deep Neural Networks (DNNs) have established the fundamentals of image-based autonomous driving systems, they may exhibit erroneous behaviors and cause fatal accidents. To address the safety issues in autonomous driving systems, a recent set of testing techniques have been designed to automatically generate artificial driving scenes to enrich test suite, e.g., generating new input images transformed from the original ones. However, these techniques are insufficient due to two limitations: first, many such synthetic images often lack diversity of driving scenes, and hence compromise the resulting efficacy and reliability. Second, for machine-learning-based systems, a mismatch between training and application domain can dramatically degrade system accuracy, such that it is necessary to validate inputs for improving system robustness. In this paper, we propose DeepRoad, an unsupervised DNN-based framework for automatically testing the consistency of DNN-based autonomous driving systems and online validation. First, DeepRoad automatically synthesizes large amounts of diverse driving scenes without using image transformation rules (e.g. scale, shear and rotation). In particular, DeepRoad is able to produce driving scenes with various weather conditions (including those with rather extreme conditions) by applying Generative Adversarial Networks (GANs) along with the corresponding real-world weather scenes. Second, DeepRoad utilizes metamorphic testing techniques to check the consistency of such systems using synthetic images. Third, DeepRoad validates input images for DNN-based systems by measuring the distance of the input and training images using their VGGNet features. We implement DeepRoad to test three well-recognized DNN-based autonomous driving systems in Udacity self-driving car challenge. The experimental results demonstrate that DeepRoad can detect thousands of inconsistent behaviors for these systems, and effectively validate input images to potentially enhance the system robustness as well.
Publisher's Version Article Search


Testing Autonomous Cars for Feature Interaction Failures using Many-Objective Search
Raja Ben Abdessalem, Annibale Panichella, Shiva Nejati, Lionel C. Briand, and Thomas Stifter
(University of Luxembourg, Luxembourg; Delft University of Technology, Netherlands; IEE, Luxembourg)
Complex systems such as autonomous cars are typically built as a composition of features that are independent units of functionality. Features tend to interact and impact one another's behavior in unknown ways. A challenge is to detect and manage feature interactions, in particular, those that violate system requirements, hence leading to failures. In this paper, we propose a technique to detect feature interaction failures by casting this problem into a search-based test generation problem. We define a set of hybrid test objectives (distance functions) that combine traditional coverage-based heuristics with new heuristics specifically aimed at revealing feature interaction failures. We develop a new search-based test generation algorithm, called FITEST, that is guided by our hybrid test objectives. FITEST extends recently proposed many-objective evolutionary algorithms to reduce the time required to compute fitness values. We evaluate our approach using two versions of an industrial self-driving system. Our results show that our hybrid test objectives are able to identify more than twice as many feature interaction failures as two baseline test objectives used in the software testing literature (i.e., coverage-based and failure-based test objectives). Further, the feedback from domain experts indicates that the detected feature interaction failures represent real faults in their systems that were not previously identified based on analysis of the system features and their requirements.
Publisher's Version Article Search
Tackling Combinatorial Explosion: A Study of Industrial Needs and Practices for Analyzing Highly Configurable Systems
Mukelabai Mukelabai, Damir Nešić, Salome Maro, Thorsten Berger, and Jan-Philipp Steghöfer
(Chalmers University of Technology, Sweden; University of Gothenburg, Sweden; KTH, Sweden)
Highly configurable systems are complex pieces of software. To tackle this complexity, hundreds of dedicated analysis techniques have been conceived, many of which able to analyze system properties for all possible system configurations, as opposed to traditional, single-system analyses. Unfortunately, it is largely unknown whether these techniques are adopted in practice, whether they address actual needs, or what strategies practitioners actually apply to analyze highly configurable systems. We present a study of analysis practices and needs in industry. It relied on a survey with 27 practitioners engineering highly configurable systems and follow-up interviews with 15 of them, covering 18 different companies from eight countries. We confirm that typical properties considered in the literature (e.g., reliability) are relevant, that consistency between variability models and artifacts is critical, but that the majority of analyses for specifications of configuration options (a.k.a., variability model analysis) is not perceived as needed. We identified rather pragmatic analysis strategies, including practices to avoid the need for analysis. For instance, testing with experience-based sampling is the most commonly applied strategy, while systematic sampling is rarely applicable. We discuss analyses that are missing and synthesize our insights into suggestions for future research.
Publisher's Version Article Search Info
Understanding and Detecting Evolution-Induced Compatibility Issues in Android Apps
Dongjie He, Lian Li, Lei Wang, Hengjie Zheng, Guangwei Li, and Jingling Xue
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia)

The frequent release of Android OS and its various versions bring many compatibility issues to Android Apps. This paper studies and addresses such evolution-induced compatibility problems. We conduct an extensive empirical study over 11 different Android versions and 4,936 Android Apps. Our study shows that there are drastic API changes between adjacent Android versions, with averagely 140.8 new types, 1,505.6 new methods, and 979.2 new fields being introduced in each release. However, the Android Support Library (provided by the Android OS) only supports less than 23% of the newly added methods, with much less support for new types and fields. As a result, 91.84% of Android Apps write additional code to support different OS versions. Furthermore, 88.65% of the supporting codes share a common pattern, which directly compares variable android.os.Build.VERSION.SDK_INT with a constant version number, to use an API of particular versions.

Based on our findings, we develop a new tool called IctApiFinder, to detect incompatible API usages in Android applications. IctApiFinder effectively computes the OS versions on which an API may be invoked, using an inter-procedural data-flow analysis framework. It detects numerous incompatible API usages in 361 out of 1,425 Apps. Compared to Android Lint, IctApiFinder is sound and able to reduce the false positives by 82.1%. We have reported the issues to 13 Apps developers. At present, 5 of them have already been confirmed by the original developers and 3 of them have already been fixed.

Publisher's Version Article Search

Mining and Crowd Sourcing

Characterizing the Natural Language Descriptions in Software Logging Statements
Pinjia He, Zhuangbin Chen, Shilin He, and Michael R. Lyu
(Chinese University of Hong Kong, China)

Logging is a common programming practice of great importance in modern software development, because software logs have been widely used in various software maintenance tasks. To provide high-quality logs, developers need to design the description text in logging statements carefully. Inappropriate descriptions will slow down or even mislead the maintenance process, such as postmortem analysis. However, there is currently a lack of rigorous guide and specifications on developer logging behaviors, which makes the construction of description text in logging statements a challenging problem. To fill this significant gap, in this paper, we systematically study what developers log, with focus on the usage of natural language descriptions in logging statements. We obtain 6 valuable findings by conducting source code analysis on 10 Java projects and 7 C# projects, which contain 28,532,975 LOC and 115,159 logging statements in total. Furthermore, our study demonstrates the potential of automated description text generation for logging statements by obtaining up to 49.04 BLEU-4 score and 62.1 ROUGE-L score using a simple information retrieval method. To facilitate future research in this field, the datasets have been publicly released.

Publisher's Version Article Search
Assessing the Type Annotation Burden
John-Paul Ore, Sebastian Elbaum, Carrick Detweiler, and Lambros Karkazis
(University of Nebraska-Lincoln, USA)
Type annotations provide a link between program variables and domain-specific types. When combined with a type system, these annotations can enable early fault detection. For type annotations to be cost-effective in practice, they need to be both accurate and affordable for developers. We lack, however, an understanding of how burdensome type annotation is for developers. Hence, this work explores three fundamental questions: 1) how accurately do developers make type annotations; 2) how long does a single annotation take; and, 3) if a system could automatically suggest a type annotation, how beneficial to accuracy are correct suggestions and how detrimental are incorrect suggestions? We present results of a study of 71 programmers using 20 random code artifacts that contain variables with physical unit types that must be annotated. Subjects choose a correct type annotation only 51% of the time and take an average of 136 seconds to make a single correct annotation. Our qualitative analysis reveals that variable names and reasoning over mathematical operations are the leading clues for type selection. We find that suggesting the correct type boosts accuracy to 73%, while making a poor suggestion decreases accuracy to 28%. We also explore what state-of-the-art automated type annotation systems can and cannot do to help developers with type annotations, and identify implications for tool developers.
Publisher's Version Article Search
Mining File Histories: Should We Consider Branches?
Vladimir Kovalenko, Fabio Palomba, and Alberto Bacchelli
(Delft University of Technology, Netherlands; University of Zurich, Switzerland)
Modern distributed version control systems, such as Git, offer support for branching — the possibility to develop parts of software outside the master trunk. Consideration of the repository structure in Mining Software Repository (MSR) studies requires a thorough approach to mining, but there is no well-documented, widespread methodology regarding the handling of merge commits and branches. Moreover, there is still a lack of knowledge of the extent to which considering branches during MSR studies impacts the results of the studies. In this study, we set out to evaluate the importance of proper handling of branches when calculating file modification histories. We analyze over 1,400 Git repositories of four open source ecosystems and compute modification histories for over two million files, using two different algorithms. One algorithm only follows the first parent of each commit when traversing the repository, the other returns the full modification history of a file across all branches. We show that the two algorithms consistently deliver different results, but the scale of the difference varies across projects and ecosystems. Further, we evaluate the importance of accurate mining of file histories by comparing the performance of common techniques that rely on file modification history — reviewer recommendation, change recommendation, and defect prediction — for two algorithms of file history retrieval. We find that considering full file histories leads to an increase in the techniques’ performance that is rather modest.
Publisher's Version Article Search Info
Tell Them Apart: Distilling Technology Differences from Crowd-Scale Comparison Discussions
Yi Huang, Chunyang Chen, Zhenchang Xing, Tian Lin, and Yang Liu
(Australian National University, Australia; Monash University, Australia; Nanyang Technological University, Singapore)
ERROR in abstract.
Publisher's Version Article Search Info


ReScue: Crafting Regular Expression DoS Attacks
Yuju Shen, Yanyan Jiang, Chang Xu, Ping Yu, Xiaoxing Ma, and Jian Lu
(Nanjing University, China)
Regular expression (regex) with modern extensions is one of the most popular string processing tools. However, poorly-designed regexes can yield exponentially many matching steps, and lead to regex Denial-of-Service (ReDoS) attacks under well-conceived string inputs. This paper presents Rescue, a three-phase gray-box analytical technique, to automatically generate ReDoS strings to highlight vulnerabilities of given regexes. Rescue systematically seeds (by a genetic search), incubates (by another genetic search), and finally pumps (by a regex-dedicated algorithm) for generating strings with maximized search time. We implemenmted the Rescue tool and evaluated it against 29,088 practical regexes in real-world projects. The evaluation results show that Rescue found 49% more attack strings compared with the best existing technique, and applying Rescue to popular GitHub projects discovered ten previously unknown ReDoS vulnerabilities.
Publisher's Version Article Search Info
TDroid: Exposing App Switching Attacks in Android with Control Flow Specialization
Jie Liu, Diyu Wu, and Jingling Xue
(UNSW, Australia)

The Android multitasking mechanism can be plagued with app switching attacks, in which a malicious app replaces the legitimate top activity of the focused app with one of its own, thus mounting, e.g., phishing and denial-of-service attacks. Existing market-level defenses are still ineffective, as static analysis is fundamentally unable to reason about the intention of an app and dynamic analysis has low coverage.

We introduce TDroid, a new market-level approach to detecting app switching attacks. The challenge lies in how to handle a plethora of input-dependent branch predicates (forming an exponential number of paths) that control the execution of the code responsible for launching such attacks. TDroid tackles this challenge by combining static and dynamic analysis to analyze an app without producing any false positives. In its static analysis, TDroid transforms the app into runnable slices containing potentially app switching attacks, one slice per attack. In its dynamic analysis, TDroid executes these slices on an Android phone or emulator to expose their malicious GUIs. The novelty lies in the use of a new trigger-oriented slicing technique in producing runnable slices so that certain input-dependent branch predicates are specialized to execute always some fixed branches.

Evaluated with a large set of malware apps, TDroid is shown to outperform the state of the art, by detecting substantially more app switching attacks, in a few minutes per app, on average.

Publisher's Version Article Search
Model-Driven Run-Time Enforcement of Complex Role-Based Access Control Policies
Ameni Ben Fadhel, Domenico Bianculli, and Lionel C. Briand
(University of Luxembourg, Luxembourg)
A Role-based Access Control (RBAC) mechanism prevents unauthorized users to perform an operation, according to authorization policies which are defined on the user’s role within an enterprise. Several models have been proposed to specify complex RBAC policies. However, existing approaches for policy enforcement do not fully support all the types of policies that can be expressed in these models, which hinders their adoption among practitioners. In this paper we propose a model-driven enforcement framework for complex policies captured by GemRBAC+CTX, a comprehensive RBAC model proposed in the literature. We reduce the problem of making an access decision to checking whether a system state (from an RBAC point of view), expressed as an instance of the GemRBAC+CTX model, satisfies the constraints corresponding to the RBAC policies to be enforced at run time. We provide enforcement algorithms for various types of access requests and events, and a prototype tool (MORRO) implementing them. We also show how to integrate MORRO into an industrial Web application. The evaluation results show the applicability of our approach on a industrial system and its scalability with respect to the various parameters characterizing an AC configuration.
Publisher's Version Article Search
ContractFuzzer: Fuzzing Smart Contracts for Vulnerability Detection
Bo Jiang, Ye Liu, and W. K. Chan
(Beihang University, China; City University of Hong Kong, China)
Decentralized cryptocurrencies feature the use of blockchain to transfer values among peers on networks without central agency. Smart contracts are programs running on top of the blockchain consensus protocol to enable people make agreements while minimizing trusts. Millions of smart contracts have been deployed in various decentralized applications. The security vulnerabilities within those smart contracts pose significant threats to their applications. Indeed, many critical security vulnerabilities within smart contracts on Ethereum platform have caused huge financial losses to their users. In this work, we present ContractFuzzer, a novel fuzzer to test Ethereum smart contracts for security vulnerabilities. ContractFuzzer generates fuzzing inputs based on the ABI specifications of smart contracts, defines test oracles to detect security vulnerabilities, instruments the EVM to log smart contracts runtime behaviors, and analyzes these logs to report security vulnerabilities. Our fuzzing of 6991 smart contracts has flagged more than 459 vulnerabilities with high precision. In particular, our fuzzing tool successfully detects the vulnerability of the DAO contract that leads to USD 60 million loss and the vulnerabilities of Parity Wallet that have led to the loss of USD 30 million and the freezing of USD 150 million worth of Ether.
Publisher's Version Article Search Info Artifacts Available

Developer Tools

SEEDE: Simultaneous Execution and Editing in a Development Environment
Steven P. Reiss, Qi Xin, and Jeff Huang
(Brown University, USA)
We introduce a tool within the Code Bubbles development environment that allows for continuous execution as the programmer edits. The tool, SEEDE, shows both the intermediate and final results of execution in terms of variables, control and data flow, output, and graphics. These results are updated as the user edits. The tool can be used to help the user write new code or to find and fix bugs. The tool is explicitly designed to let the user quickly explore the execution of a method along with all the code it invokes, possibly while writing or modifying the code. The user can start continuous execution either at a breakpoint or for a test case. This paper describes the tool, its implementation, and its user interface. It presents an initial user study of the tool demonstrating its potential utility.
Publisher's Version Article Search
Effective API Recommendation without Historical Software Repositories
Xiaoyu Liu, LiGuo Huang, and Vincent Ng
(Southern Methodist University, USA; University of Texas at Dallas, USA)
It is time-consuming and labor-intensive to learn and locate the correct API for programming tasks. Thus, it is beneficial to perform API recommendation automatically. The graph-based statistical model has been shown to recommend top-10 API candidates effectively. It falls short, however, in accurately recommending an actual top-1 API. To address this weakness, we propose RecRank, an approach and tool that applies a novel ranking-based discriminative approach leveraging API usage path features to improve top-1 API recommendation. Empirical evaluation on a large corpus of (1385+8) open source projects shows that RecRank significantly improves top-1 API recommendation accuracy and mean reciprocal rank when compared to state-of-the-art API recommendation approaches.
Publisher's Version Article Search
API Method Recommendation without Worrying about the Task-API Knowledge Gap
Qiao Huang, Xin Xia, Zhenchang Xing, David Lo, and Xinyu Wang
(Zhejiang University, China; Monash University, Australia; Australian National University, Australia; Singapore Management University, Singapore)
Developers often need to search for appropriate APIs for their programming tasks. Although most libraries have API reference documentation, it is not easy to find appropriate APIs due to the lexical gap and knowledge gap between the natural language description of the programming task and the API description in API documentation. Here, the lexical gap refers to the fact that the same semantic meaning can be expressed by different words, and the knowledge gap refers to the fact that API documentation mainly describes API functionality and structure but lacks other types of information like concepts and purposes, which are usually the key information in the task description. In this paper, we propose an API recommendation approach named BIKER (Bi-Information source based KnowledgE Recommendation) to tackle these two gaps. To bridge the lexical gap, BIKER uses word embedding technique to calculate the similarity score between two text descriptions. Inspired by our survey findings that developers incorporate Stack Overflow posts and API documentation for bridging the knowledge gap, BIKER leverages Stack Overflow posts to extract candidate APIs for a program task, and ranks candidate APIs by considering the query’s similarity with both Stack Overflow posts and API documentation. It also summarizes supplementary information (e.g., API description, code examples in Stack Overflow posts) for each API to help developers select the APIs that are most relevant to their tasks. Our evaluation with 413 API-related questions confirms the effectiveness of BIKER for both class- and method-level API recommendation, compared with state-of-the-art baselines. Our user study with 28 Java developers further demonstrates the practicality of BIKER for API search.
Publisher's Version Article Search
An Automated Approach to Estimating Code Coverage Measures via Execution Logs
Boyuan Chen, Jian Song, Peng Xu, Xing Hu, and Zhen Ming (Jack) Jiang
(York University, Canada; Baidu, China)

Software testing is a widely used technique to ensure the quality of software systems. Code coverage measures are commonly used to evaluate and improve the existing test suites. Based on our industrial and open source studies, existing state-of-the-art code coverage tools are only used during unit and integration testing due to issues like engineering challenges, performance overhead, and incomplete results. To resolve these issues, in this paper we have proposed an automated approach, called LogCoCo, to estimating code coverage measures using the readily available execution logs. Using program analysis techniques, LogCoCo matches the execution logs with their corresponding code paths and estimates three different code coverage criteria: method coverage, statement coverage, and branch coverage. Case studies on one open source system (HBase) and five commercial systems from Baidu and systems show that: (1) the results of LogCoCo are highly accurate (>96% in seven out of nine experiments) under a variety of testing activities (unit testing, integration testing, and benchmarking); and (2) the results of LogCoCo can be used to evaluate and improve the existing test suites. Our collaborators at Baidu are currently considering adopting LogCoCo and use it on a daily basis.

Publisher's Version Article Search

Static Analysis

How Many of All Bugs Do We Find? A Study of Static Bug Detectors
Andrew Habib and Michael Pradel
(TU Darmstadt, Germany)
Static bug detectors are becoming increasingly popular and are widely used by professional software developers. While most work on bug detectors focuses on whether they find bugs at all, and on how many false positives they report in addition to legitimate warnings, the inverse question is often neglected: How many of all real-world bugs do static bug detectors find? This paper addresses this question by studying the results of applying three widely used static bug detectors to an extended version of the Defects4J dataset that consists of 15 Java projects with 594 known bugs. To decide which of these bugs the tools detect, we use a novel methodology that combines an automatic analysis of warnings and bugs with a manual validation of each candidate of a detected bug. The results of the study show that: (i) static bug detectors find a non-negligible amount of all bugs, (ii) different tools are mostly complementary to each other, and (iii) current bug detectors miss the large majority of the studied bugs. A detailed analysis of bugs missed by the static detectors shows that some bugs could have been found by variants of the existing detectors, while others are domain-specific problems that do not match any existing bug pattern. These findings help potential users of such tools to assess their utility, motivate and outline directions for future work on static bug detection, and provide a basis for future comparisons of static bug detection with other bug finding techniques, such as manual and automated testing.
Publisher's Version Article Search Info
TRIMMER: Application Specialization for Code Debloating
Hashim Sharif, Muhammad Abubakar, Ashish Gehani, and Fareed Zaffar
(University of Illinois at Urbana-Champaign, USA; Lahore University of Management Sciences, Pakistan; SRI International, USA)

With the proliferation of new hardware architectures and ever-evolving user requirements, the software stack is becoming increasingly bloated. In practice, only a limited subset of the supported functionality is utilized in a particular usage context, thereby presenting an opportunity to eliminate unused features. In the past, program specialization has been proposed as a mechanism for enabling automatic software debloating. In this work, we show how existing program specialization techniques lack the analyses required for providing code simplification for real-world programs. We present an approach that uses stronger analysis techniques to take advantage of constant configuration data, thereby enabling more effective debloating. We developed Trimmer, an application specialization tool that leverages user-provided configuration data to specialize an application to its deployment context. The specialization process attempts to eliminate the application functionality that is unused in the user-defined context. Our evaluation demonstrates Trimmer can effectively reduce code bloat. For 13 applications spanning various domains, we observe a mean binary size reduction of 21% and a maximum reduction of 75%. We also show specialization reduces the surface for code-reuse attacks by reducing the number of exploitable gadgets. For the evaluated programs, we observe a 20% mean reduction in the total gadget count and a maximum reduction of 87%.

Publisher's Version Article Search
A Unified Lattice Model and Framework for Purity Analyses
Dominik Helm, Florian Kübler, Michael Eichberg, Michael Reif, and Mira Mezini
(TU Darmstadt, Germany)
Analyzing methods in object-oriented programs whether they are side-effect free and also deterministic, i.e., mathematically pure, has been the target of extensive research. Identifying such methods helps to find code smells and security related issues, and also helps analyses detecting concurrency bugs. Pure methods are also used by formal verification approaches as the foundations for specifications and proving the pureness is necessary to ensure correct specifications. However, so far no common terminology exists which describes the purity of methods. Furthermore, some terms (e.g., pure or side-effect free) are also used inconsistently. Further, all current approaches only report selected purity information making them only suitable for a smaller subset of the potential use cases. In this paper, we present a fine-grained unified lattice model which puts the purity levels found in the literature into relation and which adds a new level that generalizes existing definitions. We have also implemented a scalable, modularized purity analysis which produces significantly more precise results for real-world programs than the best-performing related work. The analysis shows that all defined levels are found in real-world projects.
Publisher's Version Article Search

Verification 1

Control Flow-Guided SMT Solving for Program Verification
Jianhui Chen and Fei He
(Tsinghua University, China)
Satisfiability modulo theories (SMT) solvers have been widely applied as the reasoning engine for diverse software analysis and verification technologies. The efficiency of the SMT solver has significant effects on the performance of these technologies. However, the current SMT solvers are designed for the general purpose of constraint solving. Many useful knowledge of programs cannot be utilized during the SMT solving. As a result, the SMT solver may spend a lot of effort to explore redundant search space. In this paper, we propose a novel approach for utilizing control-flow knowledge in SMT solving. With this technique, the search space can be considerably reduced and the efficiency of SMT solving is observably improved. We conducted extensive experiments on credible benchmarks, the results show orders of magnitude improvements of our approach.
Publisher's Version Article Search
PaMpeR: Proof Method Recommendation System for Isabelle/HOL
Yutaka Nagashima and Yilun He
(Czech Technical University, Czechia; University of Innsbruck, Austria; University of Sydney, Australia)

Deciding which sub-tool to use for a given proof state requires expertise specific to each interactive theorem prover (ITP). To mitigate this problem, we present PaMpeR, a proof method recommendation system for Isabelle/HOL. Given a proof state, PaMpeR recommends proof methods to discharge the proof goal and provides qualitative explanations as to why it suggests these methods. PaMpeR generates these recommendations based on existing hand-written proof corpora, thus transferring experienced users’ expertise to new users. Our evaluation shows that PaMpeR correctly predicts experienced users’ proof methods invocation especially when it comes to special purpose proof methods.

Publisher's Version Article Search

Maintenance and Machine Learning

Neural-Machine-Translation-Based Commit Message Generation: How Far Are We?
Zhongxin Liu, Xin Xia, Ahmed E. Hassan, David Lo, Zhenchang Xing, and Xinyu Wang
(Zhejiang University, China; Monash University, Australia; Queen's University, Canada; Singapore Management University, Singapore; Australian National University, Australia)

Commit messages can be regarded as the documentation of software changes. These messages describe the content and purposes of changes, hence are useful for program comprehension and software maintenance. However, due to the lack of time and direct motivation, commit messages sometimes are neglected by developers. To address this problem, Jiang et al. proposed an approach (we refer to it as NMT), which leverages a neural machine translation algorithm to automatically generate short commit messages from code. The reported performance of their approach is promising, however, they did not explore why their approach performs well. Thus, in this paper, we first perform an in-depth analysis of their experimental results. We find that (1) Most of the test diffs from which NMT can generate high-quality messages are similar to one or more training diffs at the token level. (2) About 16% of the commit messages in Jiang et al.’s dataset are noisy due to being automatically generated or due to them describing repetitive trivial changes. (3) The performance of NMT declines by a large amount after removing such noisy commit messages. In addition, NMT is complicated and time-consuming. Inspired by our first finding, we proposed a simpler and faster approach, named NNGen (Nearest Neighbor Generator), to generate concise commit messages using the nearest neighbor algorithm. Our experimental results show that NNGen is over 2,600 times faster than NMT, and outperforms NMT in terms of BLEU (an accuracy measure that is widely used to evaluate machine translation systems) by 21%. Finally, we also discuss some observations for the road ahead for automated commit message generation to inspire other researchers.

Publisher's Version Article Search
Deep Learning Based Feature Envy Detection
Hui Liu, Zhifeng Xu, and Yanzhen Zou
(Beijing Institute of Technology, China; Peking University, China)
Software refactoring is widely employed to improve software quality. A key step in software refactoring is to identify which part of the software should be refactored. To facilitate the identification, a number of approaches have been proposed to identify certain structures in the code (called code smells) that suggest the possibility of refactoring. Most of such approaches rely on manually designed heuristics to map manually selected source code metrics to predictions. However, it is challenging to manually select the best features, especially textual features. It is also difficult to manually construct the optimal heuristics. To this end, in this paper we propose a deep learning based novel approach to detecting feature envy, one of the most common code smells. The key insight is that deep neural networks and advanced deep learning techniques could automatically select features (especially textual features) of source code for feature envy detection, and could automatically build the complex mapping between such features and predictions. We also propose an automatic approach to generating labeled training data for the neural network based classifier, which does not require any human intervention. Evaluation results on open-source applications suggest that the proposed approach significantly improves the state-of-the-art in both detecting feature envy smells and recommending destinations for identified smelly methods.
Publisher's Version Article Search
Improving Automatic Source Code Summarization via Deep Reinforcement Learning
Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu
(Zhejiang University, China; Chinese Academy of Sciences, China; University of Technology Sydney, Australia; University of Illinois at Chicago, USA)
Code summarization provides a high level natural language description of the function performed by code, as it can benefit the software maintenance, code categorization and retrieval. To the best of our knowledge, most state-of-the-art approaches follow an encoder-decoder framework which encodes the code into a hidden space and then decode it into natural language space, suffering from two major drawbacks: a) Their encoders only consider the sequential content of code, ignoring the tree structure which is also critical for the task of code summarization; b) Their decoders are typically trained to predict the next word by maximizing the likelihood of next ground-truth word with previous ground-truth word given. However, it is expected to generate the entire sequence from scratch at test time. This discrepancy can cause an exposure bias issue, making the learnt decoder suboptimal. In this paper, we incorporate an abstract syntax tree structure as well as sequential content of code snippets into a deep reinforcement learning framework (i.e., actor-critic network). The actor network provides the confidence of predicting the next word according to current state. On the other hand, the critic network evaluates the reward value of all possible extensions of the current state and can provide global guidance for explorations. We employ an advantage reward composed of BLEU metric to train both networks. Comprehensive experiments on a real-world dataset show the effectiveness of our proposed model when compared with some state-of-the-art methods.
Publisher's Version Article Search

Symbolic Execution

Template-Guided Concolic Testing via Online Learning
Sooyoung Cha, Seonho Lee, and Hakjoo Oh
(Korea University, South Korea)
We present template-guided concolic testing, a new technique for effectively reducing the search space in concolic testing. Addressing the path-explosion problem has been a significant challenge in concolic testing. Diverse search heuristics have been proposed to mitigate this problem but using search heuristics alone is not sufficient to substantially improve code coverage for real-world programs. The goal of this paper is to complement existing techniques and achieve higher coverage by exploiting templates in concolic testing. In our approach, a template is a partially symbolized input vector whose job is to reduce the search space. However, choosing a right set of templates is nontrivial and significantly affects the final performance of our approach. We present an algorithm that automatically learns useful templates online, based on data collected from previous runs of concolic testing. The experimental results with open-source programs show that our technique achieves greater branch coverage and finds bugs more effectively than conventional concolic testing.
Publisher's Version Article Search
Android Testing via Synthetic Symbolic Execution
Xiang Gao, Shin Hwei Tan, Zhen Dong, and Abhik Roychoudhury
(National University of Singapore, Singapore; Southern University of Science and Technology, China)
ERROR in abstract.
Publisher's Version Article Search
PARTI: A Multi-interval Theory Solver for Symbolic Execution
Oscar Soria Dustmann, Klaus Wehrle, and Cristian Cadar
(RWTH Aachen University, Germany; Imperial College London, UK)
Symbolic execution is an effective program analysis technique whose scalability largely depends on the ability to quickly solve large numbers of first-order logic queries. We propose an effective general technique for speeding up the solving of queries in the theory of arrays and bit-vectors with a specific structure, while otherwise falling back to a complete solver. The technique has two stages: a learning stage that determines the solution sets of each symbolic variable, and a decision stage that uses this information to quickly determine the satisfiability of certain types of queries. The main challenges involve deciding which operators to support and precisely dealing with integer type casts and arithmetic underflow and overflow. We implemented this technique in an incomplete solver called PARTI (``PARtial Theory solver for Intervals''), directly integrating it into the popular KLEE symbolic execution engine. We applied KLEE with PARTI and a state-of-the-art SMT solver to synthetic and real-world benchmarks. We found that PARTI practically does not hurt performance while many times achieving order-of-magnitude speedups.
Publisher's Version Article Search


Client-Specific Equivalence Checking
Federico Mora, Yi Li, Julia Rubin, and Marsha Chechik
(University of Toronto, Canada; University of British Columbia, Canada)
ERROR in abstract.
Publisher's Version Article Search
Replay without Recording of Production Bugs for Service Oriented Applications
Nipun Arora, Jonathan Bell, Franjo Ivančić, Gail Kaiser, and Baishakhi Ray
(Dropbox, USA; George Mason University, USA; Google, USA; Columbia University, USA)
Short time-to-localize and time-to-fix for production bugs is extremely important for any 24x7 service-oriented application (SOA). Debugging buggy behavior in deployed applications is hard, as it requires careful reproduction of a similar environment and workload. Prior approaches for automatically reproducing production failures do not scale to large SOA systems. Our key insight is that for many failures in SOA systems (e.g., many semantic and performance bugs), a failure can automatically be reproduced solely by relaying network packets to replicas of suspect services, an insight that we validated through a manual study of 16 real bugs across five different systems. This paper presents Parikshan, an application monitoring framework that leverages user-space virtualization and network proxy technologies to provide a sandbox “debug” environment. In this “debug” environment, developers are free to attach debuggers and analysis tools without impacting performance or correctness of the production environment. In comparison to existing monitoring solutions that can slow down production applications, Parikshan allows application monitoring at significantly lower overhead.
Publisher's Version Article Search
Reducing Interactive Refactoring Effort via Clustering-Based Multi-objective Search
Vahid Alizadeh and Marouane Kessentini
(University of Michigan, USA)
Refactoring is nowadays widely adopted in the industry because bad design decisions can be very costly and extremely risky. On the one hand, automated refactoring does not always lead to the desired design. On the other hand, manual refactoring is error-prone, time-consuming and not practical for radical changes. Thus, recent research trends in the field focused on integrating developers feedback into automated refactoring recommendations because developers understand the problem domain intuitively and may have a clear target design in mind. However, this interactive process can be repetitive, expensive, and tedious since developers must evaluate recommended refactorings, and adapt them to the targeted design especially in large systems where the number of possible strategies can grow exponentially. In this paper, we propose an interactive approach combining the use of multi-objective and unsupervised learning to reduce the developer's interaction effort when refactoring systems. We generate, first, using multi-objective search different possible refactoring strategies by finding a trade-off between several conflicting quality attributes. Then, an unsupervised learning algorithm clusters the different trade-off solutions, called the Pareto front, to guide the developers in selecting their region of interests and reduce the number of refactoring options to explore. The feedback from the developer, both at the cluster and solution levels, are used to automatically generate constraints to reduce the search space in the next iterations and focus on the region of developer preferences. We selected 14 active developers to manually evaluate the effectiveness our tool on 5 open source projects and one industrial system. The results show that the participants found their desired refactorings faster and more accurate than the current state of the art.
Publisher's Version Article Search

Software Quality

FairFuzz: A Targeted Mutation Strategy for Increasing Greybox Fuzz Testing Coverage
Caroline Lemieux and Koushik Sen
(University of California at Berkeley, USA)
In recent years, fuzz testing has proven itself to be one of the most effective techniques for finding correctness bugs and security vulnerabilities in practice. One particular fuzz testing tool, American Fuzzy Lop (AFL), has become popular thanks to its ease-of-use and bug-finding power. However, AFL remains limited in the bugs it can find since it simply does not cover large regions of code. If it does not cover parts of the code, it will not find bugs there. We propose a two-pronged approach to increase the coverage achieved by AFL. First, the approach automatically identifies branches exercised by few AFL-produced inputs (rare branches), which often guard code that is empirically hard to cover by naively mutating inputs. The second part of the approach is a novel mutation mask creation algorithm, which allows mutations to be biased towards producing inputs hitting a given rare branch. This mask is dynamically computed during fuzz testing and can be adapted to other testing targets. We implement this approach on top of AFL in a tool named FairFuzz. We conduct evaluation on real-world programs against state-of-the-art versions of AFL. We find that on these programs FairFuzz achieves high branch coverage at a faster rate that state-of-the-art versions of AFL. In addition, on programs with nested conditional structure, it achieves sustained increases in branch coverage after 24 hours (average 10.6% increase). In qualitative analysis, we find that FairFuzz has an increased capacity to automatically discover keywords.
Publisher's Version Article Search
Efficiently Manifesting Asynchronous Programming Errors in Android Apps
Lingling Fan, Ting Su, Sen Chen, Guozhu Meng, Yang Liu, Lihua Xu, and Geguang Pu
(East China Normal University, China; Nanyang Technological University, Singapore; Chinese Academy of Sciences, China; New York University Shanghai, China)
Android, the #1 mobile app framework, enforces the single-GUI-thread model, in which a single UI thread manages GUI rendering and event dispatching. Due to this model, it is vital to avoid blocking the UI thread for responsiveness. One common practice is to offload long-running tasks into async threads. To achieve this, Android provides various async programming constructs, and leaves evelopers themselves to obey the rules implied by the model. However, as our study reveals, more than 25% apps violate these rules and introduce hard-to-detect, fail-stop errors, which we term as aysnc programming errors (APEs). To this end, this paper introduces APEChecker, a technique to automatically and efficiently manifest APEs. The key idea is to characterize APEs as specific fault patterns, and synergistically combine static analysis and dynamic UI exploration to detect and verify such errors. Among the 40 real-world Android apps, APEChecker unveils and processes 61 APEs, of which 51 are confirmed (83.6% hit rate). Specifically, APEChecker detects 3X more APEs than the state-of-art testing tools (Monkey, Sapienz and Stoat), and reduces testing time from half an hour to a few minutes. On a specific type of APEs, APEChecker confirms 5X more errors than the data race detection tool, EventRacer, with very few false alarms.
Publisher's Version Article Search
Expandable Group Identification in Spreadsheets
Wensheng Dou, Shi Han, Liang Xu, Dongmei Zhang, and Jun Wei
(University of Chinese Academy of Sciences, China; Institute of Software at Chinese Academy of Sciences, China; Microsoft Research, China)
Spreadsheets are widely used in various business tasks. Spreadsheet users may put similar data and computations by repeating a block of cells (a unit) in their spreadsheets. We name the unit and all its expanding ones as an expandable group. All units in an expandable group share the same or similar formats and semantics. As a data storage and management tool, expandable groups represent the fundamental structure in spreadsheets. However, existing spreadsheet systems do not recognize any expandable groups. Therefore, other spreadsheet analysis tools, e.g., data integration and fault detection, cannot utilize this structure of expandable groups to perform precise analysis. In this paper, we propose ExpCheck to automatically extract expandable groups in spreadsheets. We observe that continuous units that share the similar formats and semantics are likely to be an expandable group. Inspired by this, we inspect the format of each cell and its corresponding semantics, and further classify them into expandable groups according to their similarity. We evaluate ExpCheck on 120 spreadsheets randomly sampled from the EUSES and VEnron corpora. The experimental results show that ExpCheck is effective. ExpCheck successfully detect expandable groups with F1-measure of 73.1%, significantly outperforming the state-of-the-art techniques (F1-measure of 13.3%).
Publisher's Version Article Search
Break the Dead End of Dynamic Slicing: Localizing Data and Control Omission Bug
Yun Lin, Jun Sun, Lyly Tran, Guangdong Bai, Haijun Wang, and Jinsong Dong
(National University of Singapore, Singapore; Singapore University of Technology and Design, Singapore; Singapore Institute of Technology, Singapore; Nanyang Technological University, Singapore)
Dynamic slicing is a common way of identifying the root cause when a program fault is revealed. With the dynamic slicing technique, the programmers can follow data and control flow along the program execution trace to the root cause. However, the technique usually fails to work on omission bugs, i.e., the faults which are caused by missing executing some code. In many cases, dynamic slicing over-skips the root cause when an omission bug happens, leading the debugging process to a dead end. In this work, we conduct an empirical study on the omission bugs in the Defects4J bug repository. Our study shows that (1) omission bugs are prevalent (46.4%) among all the studied bugs; (2) there are repeating patterns on causes and fixes of the omission bugs; (3) the patterns of fixing omission bugs serve as a strong hint to break the slicing dead end. Based on our findings, we train a neural network model on the omission bugs in Defects4J repository to recommend where to approach when slicing can no long work. We conduct an experiment by applying our approach on 3193 mutated omission bugs which slicing fails to locate. The results show that our approach outperforms random benchmark on breaking the dead end and localizing the mutated omission bugs (63.8% over 2.8%).
Publisher's Version Article Search

Architecture and Requirements

A Genetic Algorithm for Goal-Conflict Identification
Renzo Degiovanni, Facundo Molina, Germán Regis, and Nazareno Aguirre
(National University of Río Cuarto, Argentina; CONICET, Argentina)
ERROR in abstract.
Publisher's Version Article Search Info

Mobile Analysis

Understanding and Detecting Callback Compatibility Issues for Android Applications
Huaxun Huang, Lili Wei, Yepang Liu, and Shing-Chi Cheung
(Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China)
ERROR in abstract.
Publisher's Version Article Search
Detecting and Summarizing GUI Changes in Evolving Mobile Apps
Kevin Moran, Cody Watson, John Hoskins, George Purnell, and Denys Poshyvanyk
(College of William and Mary, USA)
Mobile applications have become a popular software development domain in recent years due in part to a large user base, capable hardware, and accessible platforms. However, mobile developers also face unique challenges, including pressure for frequent releases to keep pace with rapid platform evolution, hardware iteration, and user feedback. Due to this rapid pace of evolution, developers need automated support for documenting the changes made to their apps in order to aid in program comprehension. One of the more challenging types of changes to document in mobile apps are those made to the graphical user interface (GUI) due to its abstract, pixel-based representation. In this paper, we present a fully automated approach, called GCAT, for detecting and summarizing GUI changes during the evolution of mobile apps. GCAT leverages computer vision techniques and natural language generation to accurately and concisely summarize changes made to the GUI of a mobile app between successive commits or releases. We evaluate the performance of our approach in terms of its precision and recall in detecting GUI changes compared to developer specified changes, and investigate the utility of the generated change reports in a controlled user study. Our results indicate that GCAT is capable of accurately detecting and classifying GUI changes - outperforming developers - while providing useful documentation.
Publisher's Version Article Search
Empirically Assessing Opportunities for Prefetching and Caching in Mobile Apps
Yixue Zhao, Paul Wat, Marcelo Schmitt Laser, and Nenad Medvidović
(University of Southern California, USA)
Network latency in mobile software has a large impact on user experience, with potentially severe economic consequences. Prefetching and caching have been shown effective in reducing the latencies in browser-based systems. However, those techniques cannot be directly applied to the emerging domain of mobile apps because of the differences in network interactions. Moreover, there is a lack of research on prefetching and caching techniques that may be suitable for the mobile app domain, and it is not clear whether such techniques can be effective or whether they are even feasible. This paper takes the first step toward answering these questions by conducting a comprehensive study to understand the characteristics of HTTP requests in over 1,000 popular Android apps. Our work focuses on the prefetchability of requests using static program analysis techniques and cacheability of resulting responses. We find that there is a substantial opportunity to leverage prefetching and caching in mobile apps, but that suitable techniques must take into account the nature of apps’ network interactions and idiosyncrasies such as untrustworthy HTTP header information. Our observations provide guidelines for developers to utilize prefetching and caching schemes in app development, and motivate future research in this area.
Publisher's Version Article Search
Safe Stream-Based Programming with Refinement Types
Benno Stein, Lazaro Clapp, Manu Sridharan, and Bor-Yuh Evan Chang
(University of Colorado Boulder, USA; Uber Technologies, USA)
In stream-based programming, data sources are abstracted as a stream of values that can be manipulated via callback functions. Stream-based programming is exploding in popularity, as it provides a powerful and expressive paradigm for handling asynchronous data sources in interactive software. However, high-level stream abstractions can also make it difficult for developers to reason about control- and data-flow relationships in their programs. This is particularly impactful when asynchronous stream-based code interacts with thread-limited features such as UI frameworks that restrict UI access to a single thread, since the threading behavior of streaming constructs is often non-intuitive and insufficiently documented. In this paper, we present a type-based approach that can statically prove the thread-safety of UI accesses in stream-based software. Our key insight is that the fluent APIs of stream-processing frameworks enable the tracking of threads via type-refinement, making it possible to reason automatically about what thread a piece of code runs on -- a difficult problem in general. We implement the system as an annotation-based Java typechecker for Android programs built upon the popular ReactiveX framework and evaluate its efficacy by annotating and analyzing 8 open-source apps, where we find 33 instances of unsafe UI access while incurring an annotation burden of only one annotation per 186 source lines of code. We also report on our experience applying the typechecker to two much larger apps from the Uber Technologies, Inc. codebase, where it currently runs on every code change and blocks changes that introduce potential threading bugs.
Publisher's Version Article Search


Automated Model Repair for Alloy
Kaiyuan Wang, Allison Sullivan, and Sarfraz Khurshid
(University of Texas at Austin, USA)
Automated program repair is an active research area. However, existing research focuses mostly on imperative code, e.g. in Java. In this paper, we study the problem of repairing declarative models in Alloy -- a first order relational logic with transitive closure. We introduce ARepair, the first technique for repairing Alloy models. ARepair follows the spirit of traditional automated program repair techniques. Specifically, ARepair takes as input a faulty Alloy model and a test suite that contains some failing test, and outputs a repaired model that is correct with respect to the given tests. ARepair integrates ideas from mutation testing and program synthesis to provide an effective solution for repairing Alloy models. The experimental results show that ARepair can fix 28 out of 38 real-world faulty models we collected.
Publisher's Version Article Search
PFix: Fixing Concurrency Bugs Based on Memory Access Patterns
Huarui Lin, Zan Wang, Shuang Liu, Jun Sun, Dongdi Zhang, and Guangning Wei
(Tianjin University, China; Singapore University of Technology and Design, Singapore)
Concurrency bugs of a multi-threaded program may only manifest with certain scheduling, i.e., they are heisenbugs which are observed only from time to time if we execute the same program with the same input multiple times. They are notoriously hard to fix. In this work, we propose an approach to automatically fix concurrency bugs. Compared to previous approaches, our key idea is to systematically fix concurrency bugs by inferring locking policies from failure inducing memory-access patterns. That is, we automatically identify memory-access patterns which are correlated with the manifestation of the bug, and then conjecture what is the intended locking policy of the program. Afterwards, we fix the program by implementing the locking policy so that the failure inducing memory-access patterns are made impossible. We have implemented our approach in a toolkit called PFix which supports Java programs. We applied PFix to a set of 23 concurrency bugs and are able to automatically fix 19 of them. In comparison, Grail which is the state-of-the-art tool for fixing concurrency bugs in Java programs can only fix 3 of them correctly.
Publisher's Version Article Search
Generating Reusable Web Components from Mockups
Mohammad Bajammal, Davood Mazinanian, and Ali Mesbah
(University of British Columbia, Canada)
The transformation of a user interface mockup designed by a graphic designer to web components in the final app built by a web developer is often laborious, involving manual and time consuming steps. We propose an approach to automate this aspect of web development by generating reusable web components from a mockup. Our approach employs visual analysis of the mockup, and unsupervised learning of visual cues to create reusable web components (e.g., React components). We evaluated our approach, implemented in a tool called VizMod, on five real-world web mockups, and assessed the transformations and generated components through comparison with web development experts. The results show that VizMod achieves on average 94% precision and 75% recall in terms of agreement with the developers' assessment. Furthermore, the refactorings yielded 22% code reusability, on average.
Publisher's Version Article Search
Semantic Crash Bucketing
Rijnard van Tonder, John Kotheimer, and Claire Le Goues
(Carnegie Mellon University, USA)
Precise crash triage is important for automated dynamic testing tools, like fuzzers. At scale, fuzzers produce millions of crashing inputs. Fuzzers use heuristics, like stack hashes, to cut down on duplicate bug reports. These heuristics are fast, but often imprecise: even after deduplication, hundreds of uniquely reported crashes can still correspond to the same bug. Remaining crashes must be inspected manually, incurring considerable effort. In this paper we present Semantic Crash Bucketing, a generic method for precise crash bucketing using program transformation. Semantic Crash Bucketing maps crashing inputs to unique bugs as a function of changing a program (i.e., a semantic delta). We observe that a real bug fix precisely identifies crashes belonging to the same bug. Our insight is to approximate real bug fixes with lightweight program transformation to obtain the same level of precision. Our approach uses (a) patch templates and (b) semantic feedback from the program to automatically generate and apply approximate fixes for general bug classes. Our evaluation shows that approximate fixes are competitive with using true fixes for crash bucketing, and significantly outperforms built-in deduplication techniques for three state of the art fuzzers.
Publisher's Version Article Search

Verification 2

A Symbolic Model Checking Approach to the Analysis of String and Length Constraints
Hung-En Wang, Shih-Yu Chen, Fang Yu, and Jie-Hong R. Jiang
(National Taiwan University, Taiwan; National Chengchi University, Taiwan)
Strings with length constraints are prominent in software security analysis. Recent endeavors have made significant progress in developing constraint solvers for strings and integers. Most prior methods are based on deduction with inference rules or analysis using automata. The former may be inefficient when the constraints involve complex string manipulations such as language replacement; the latter may not be easily extended to handle length constraints and may be inadequate for counterexample generation due to approximation. Inspired by recent work on string analysis with logic circuit representation, we propose a new method for solving string with length constraints by an implicit representation of automata with length encoding. The length-encoded automata are of infinite states and can represent languages beyond regular expressions. By converting string and length constraints into a dependency graph of manipulations over length-encoded automata, a symbolic model checker for infinite state systems can be leveraged as an engine for the analysis of string and length constraints. Experiments show that our method has its unique capability of handling complex string and length constraints not solvable by existing methods.
Publisher's Version Article Search Info
Domain-Independent Multi-threaded Software Model Checking
Dirk Beyer and Karlheinz Friedberger
(LMU Munich, Germany)
Recent development of software aims at massively parallel execution, because of the trend to increase the number of processing units per CPU socket. But many approaches for program analysis are not designed to benefit from a multi-threaded execution and lack support to utilize multi-core computers. Rewriting existing algorithms is difficult and error-prone, and the design of new parallel algorithms also has limitations. An orthogonal problem is the granularity: computing each successor state in parallel seems too fine-grained, so the open question is to find the right structural level for parallel execution. We propose an elegant solution to these problems: Block summaries should be computed in parallel. Many successful approaches to software verification are based on summaries of control-flow blocks, large blocks, or function bodies. Block-abstraction memoization is a successful domain-independent approach for summary-based program analysis. We redesigned the verification approach of block-abstraction memoization starting from its original recursive definition, such that it can run in a parallel manner for utilizing the available computation resources without losing its advantages of being independent from a certain abstract domain. We present an implementation of our new approach for multi-core shared-memory machines. The experimental evaluation shows that our summary-based approach has no significant overhead compared to the existing sequential approach and that it has a significant speedup when using multi-threading.
Publisher's Version Article Search Artifacts Available
Scheduling Constraint Based Abstraction Refinement for Weak Memory Models
Liangze Yin, Wei Dong, Wanwei Liu, and Ji Wang
(National University of Defense Technology, China)
Scheduling constraint based abstraction refinement (SCAR) is one of the most efficient methods for verifying programs under sequential consistency (SC). However, most multi-processor architectures implement weak memory models (WMMs) in order to improve the performance of a program. Due to the nondeterministic execution of those memory operations by the same thread, the behavior of a program under WMMs is much more complex than that under SC, which significantly increases the verification complexity. This paper elegantly extends the SCAR method to WMMs such as TSO and PSO. To capture the order requirements of an abstraction counterexample under WMMs, we have enriched the event order graph (EOG) of a counterexample such that it is competent for both SC and WMMs. We have also proposed a unified EOG generation method which can always obtain a minimal EOG efficiently. Experimental results on a large set of multi-threaded C programs show promising results of our method. It significantly outperforms state-of-the-art tools, and the time and memory it required to verify a program under TSO and PSO are roughly comparable to that under SC.
Publisher's Version Article Search

Code Differencing and Merging

Datalog-Based Scalable Semantic Diffing of Concurrent Programs
Chungha Sung, Shuvendu K. Lahiri, Constantin Enea, and Chao Wang
(University of Southern California, USA; Microsoft Research, USA; University of Paris Diderot, France)
When an evolving program is modified to address issues related to thread synchronization, there is a need to confirm the change is correct, i.e., it does not introduce unexpected behavior. However, manually comparing two programs to identify the semantic difference is labor intensive and error prone, whereas techniques based on model checking are computationally expensive. To fill the gap, we develop a fast and approximate static analysis for computing synchronization differences of two programs. The method is fast because, instead of relying on heavy-weight model checking techniques, it leverages a polynomial-time Datalog-based program analysis framework to compute differentiating data-flow edges, i.e., edges allowed by one program but not the other. Although approximation is used our method is sufficiently accurate due to careful design of the Datalog inference rules and iterative increase of the required data-flow edges for representing a difference. We have implemented our method and evaluated it on a large number of multithreaded C programs to confirm its ability to produce, often within seconds, the same differences obtained by human; in contrast, prior techniques based on model checking take minutes or even hours and thus can be 10x to 1000x slower.
Publisher's Version Article Search Info
αDiff: Cross-Version Binary Code Similarity Detection with DNN
Bingchang Liu, Wei Huo, Chao Zhang, Wenchao Li, Feng Li, Aihua Piao, and Wei Zou
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Tsinghua University, China)
ERROR in abstract.
Publisher's Version Article Search
ClDiff: Generating Concise Linked Code Differences
Kaifeng Huang, Bihuan Chen, Xin Peng, Daihong Zhou, Ying Wang, Yang Liu, and Wenyun Zhao
(Fudan University, China; Nanyang Technological University, Singapore)
Analyzing and understanding source code changes is important in a variety of software maintenance tasks. To this end, many code differencing and code change summarization methods have been proposed. For some tasks (e.g. code review and software merging), however, those differencing methods generate too fine-grained a representation of code changes, and those summarization methods generate too coarse-grained a representation of code changes. Moreover, they do not consider the relationships among code changes. Therefore, the generated differences or summaries make it not easy to analyze and understand code changes in some software maintenance tasks. In this paper, we propose a code differencing approach, named CLDIFF, to generate concise linked code differences whose granularity is in between the existing code differencing and code change summarization methods. The goal of CLDIFF is to generate more easily understandable code differences. CLDIFF takes source code files before and after changes as inputs, and consists of three steps. First, it pre-processes the source code files by pruning unchanged declara- tions from the parsed abstract syntax trees. Second, it generates concise code differences by grouping fine-grained code differences at or above the statement level and describing high-level changes in each group. Third, it links the related concise code differences according to five pre-defined links. Experiments with 12 Java projects (74,387 commits) and a human study with 10 participants have indicated the accuracy, conciseness, performance and usefulness of CLDIFF.
Publisher's Version Article Search

Mobile Security

Characterizing and Identifying Misexposed Activities in Android Applications
Jiwei Yan, Xi Deng, Ping Wang, Tianyong Wu, Jun Yan, and Jian Zhang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Exported Activity (EA), a kind of activities in Android apps that can be launched by external components, is one of the most important inter-component communication (ICC) mechanisms to realize the interaction and cooperation among multiple apps. Existing works have pointed out that, once exposed, an activity will be vulnerable to malicious ICC attacks, such as permission leakage attack. Unfortunately, it is observed that a considerable number of activities in commercial apps are exposed inadvertently, while few works have studied the necessity and reasonability of such exposure. This work takes the first step to systematically study the exposing behavior of EAs through analyzing 13,873 Android apps. It utilizes the EA associated call relationships extracted from byte-code via data-flow analysis, as well as the launch conditions obtained from the manifest files, to guide the study on the usage and misexposure of EAs. The empirical findings are that the EA mechanism is widely adopted in development and the activities are liable to be misexposed due to the developers' misunderstanding or carelessness. Further study on subsets of apps selected according to different criteria indicates that the misexposed EAs have specific characteristics, which are manually summarized into six typical misuse patterns. As a consequence, ten heuristics are designed to decide whether an activity should be exposed or not and are implemented into an automatic tool called Mist. Experiments on the collected apps show that around one fifth EAs are unnecessarily exposed and there are more than one third EAs whose exposure may not be suggested.
Publisher's Version Article Search
A Tale of Two Cities: How WebView Induces Bugs to Android Applications
Jiajun Hu, Lili Wei, Yepang Liu, Shing-Chi Cheung, and Huaxun Huang
(Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China)

WebView is a widely used Android component that augments a native app with web browser capabilities. It eases the interactions between an app’s native code and web code. However, the interaction mechanism of WebView induces new types of bugs in Android apps. Understanding the characteristics and manifestation of these WebView-induced bugs (ωBugs for short) facilitates the correct usages of WebViews in Android apps. This motivates us to conduct the first empirical study on ωBugs based on those found in popular open-source Android apps. Our study identified the major root causes and consequences of ωBugs and made interesting observations that can be leveraged for detecting and diagnosing ωBugs. Based on the empirical study, we further propose an automated testing technique ωDroid to effectively expose ωBugs in Android apps. In our experiments, ωDroid successfully discovered 30 unique and previously-unknown ωBugs when applied to 146 open-source Android apps. We reported the 30 ωBugs to the corresponding app developers. Out of these 30 ωBugs, 14 were confirmed and 7 of them were fixed. This shows that ωDroid can effectively detect ωBugs that are of the developers’ concern.

Publisher's Version Article Search
Dual-Force: Understanding WebView Malware via Cross-Language Forced Execution
Zhenhao Tang, Juan Zhai, Minxue Pan, Yousra Aafer, Shiqing Ma, Xiangyu Zhang, and Jianhua Zhao
(Nanjing University, China; Purdue University, USA)
Modern Android malwares tend to use advanced techniques to cover their malicious behaviors. They usually feature multi-staged, condition-guarded and environment-specific payloads. An increasing number of them utilize WebView, particularly the two-way communications between Java and JavaScript, to evade detection and analysis of existing techniques. We propose Dual-Force, a forced execution technique which simultaneously forces both Java and JavaScript code of WebView applications to execute along various paths without requiring any environment setup or providing any inputs manually. As such, the hidden payloads of WebView malwares are forcefully exposed. The technique features a novel execution model that allows forced execution to suppress exceptions and continue execution. Experimental results show that Dual-Force precisely exposes malicious payload in 119 out of 150 WebView malwares. Compared to the state-of-the-art, Dual-Force can expose 23% more malicious behaviors.
Publisher's Version Article Search
Self-Protection of Android Systems from Inter-component Communication Attacks
Mahmoud Hammad, Joshua Garcia, and Sam Malek
(University of California at Irvine, USA)
ERROR in abstract.
Publisher's Version Article Search

Experience Papers

An Empirical Study of Android Test Generation Tools in Industrial Cases
Wenyu Wang, Dengfeng Li, Wei Yang, Yurui Cao, Zhenwen Zhang, Yuetang Deng, and Tao Xie
(University of Illinois at Urbana-Champaign, USA; University of Texas at Dallas, USA; Tencent, China)
User Interface (UI) testing is a popular approach to ensure the quality of mobile apps. Numerous test generation tools have been developed to support UI testing on mobile apps, especially for Android apps. Previous work evaluates and compares different test generation tools using only relatively simple open-source apps, while real-world industrial apps tend to have more complex functionalities and implementations. There is no direct comparison among test generation tools with regard to effectiveness and ease-of-use on these industrial apps. To address such limitation, we study existing state-of-the-art or state-of-the-practice test generation tools on 68 widely-used industrial apps. We directly compare the tools with regard to code coverage and fault-detection ability. According to our results, Monkey, a state-of-the-practice tool from Google, achieves the highest method coverage on 22 of 41 apps whose method coverage data can be obtained. Of all 68 apps under study, Monkey also achieves the highest activity coverage on 35 apps, while Stoat, a state-of-the-art tool, is able to trigger the highest number of unique crashes on 23 apps. By analyzing the experimental results, we provide suggestions for combining different test generation tools to achieve better performance. We also report our experience in applying these tools to industrial apps under study. Our study results give insights on how Android UI test generation tools could be improved to better handle complex industrial apps.
Publisher's Version Article Search
Achieving Test Automation with Testers without Coding Skills: An Industrial Report
Davrondzhon Gafurov, Arne Erik Hurum, and Martin Markman
(Norwegian Directorate of eHealth, Norway)
We present a process driven test automation solution which enables delegating (part of) automation tasks from test automation engineer (expensive resource) to test analyst (non-developer, less expensive). In our approach, a test automation engineer implements test steps (or actions) which are executed automatically. Such automated test steps represent user actions in the system under test and specified by a natural language which is understandable by a non-technical person. Then, a test analyst with a domain knowledge organizes automated steps combined with test input to create an automated test case. It should be emphasized that the test analyst does not need to possess programming skills to create, modify or execute automated test cases. We refine benchmark test automation architecture to be better suitable for an effective separation and sharing of responsibilities between the test automation engineer (with coding skills) and test analyst (with a domain knowledge). In addition, we propose a metric to empirically estimate cooperation between test automation engineer and test analyst's works. The proposed automation solution has been defined based on our experience in the development and maintenance of Helsenorg, the national electronic health services in Norway which has had over one million of visits per month past year, and we still use it to automate the execution of regression tests.
Publisher's Version Article Search
Navigating the Maze: The Impact of Configurability in Bioinformatics Software
Mikaela Cashman, Myra B. Cohen, Priya Ranjan, and Robert W. Cottingham
(University of Nebraska-Lincoln, USA; Oak Ridge National Laboratory, USA)
The bioinformatics software domain contains thousands of applications for automating tasks such as the pairwise alignment of DNA sequences, building and reasoning about metabolic models or simulating growth of an organism. Its end users range from sophisticated developers to those with little computational experience. In response to their needs, developers provide many options to customize the way their algorithms are tuned. Yet there is little or no automated help for the user in determining the consequences or impact of the options they choose. In this paper we describe our experience working with configurable bioinformatics tools. We find limited documentation and help for combining and selecting options along with variation in both functionality and performance. We also find previously undetected faults. We summarize our findings with a set of lessons learned, and present a roadmap for creating automated techniques to interact with bioinformatics software. We believe these will generalize to other types of scientific software.
Publisher's Version Article Search Info
Automatically Testing Implementations of Numerical Abstract Domains
Alexandra Bugariu, Valentin Wüstholz, Maria Christakis, and Peter Müller
(ETH Zurich, Switzerland; MPI-SWS, Germany)
ERROR in abstract.
Publisher's Version Article Search
Experiences Applying Automated Architecture Analysis Tool Suites
Ran Mo, Will Snipes, Yuanfang Cai, Srini Ramaswamy, Rick Kazman, and Martin Naedele
(Central China Normal University, China; ABB Corporate Research, USA; Drexel University, USA; ABB, USA; Carnegie Mellon University, USA; University of Hawaii, USA; ABB, Switzerland)
In this paper, we report our experiences of applying three complementary automated software architecture analysis techniques, supported by a tool suite, called DV8, to 8 industrial projects within a large company. DV8 includes two state-of-the-art architecture-level maintainability metrics—Decoupling Level and Propagation Cost, an architecture flaw detection tool, and an architecture root detection tool. We collected development process data from the project teams as input to these tools, reported the results back to the practitioners, and followed up with telephone conferences and interviews. Our experiences revealed that the metrics scores, quantitative debt analysis, and architecture flaw visualization can effectively bridge the gap between management and development, help them decide if, when, and where to refactor. In particular, the metrics scores, compared against industrial benchmarks, faithfully reflected the practitioners’ intuitions about the maintainability of their projects, and enabled them to better understand the maintainability relative to other projects internal to their company, and to other industrial products. The automatically detected architecture flaws and roots enabled the practitioners to precisely pinpoint, visualize, and quantify the “hotspots" within the systems that are responsible for high maintenance costs. Except for the two smallest projects for which both architecture metrics indicated high maintainability, all other projects are planning or have already begun refactorings to address the problems detected by our analyses. We are working on further automating the tool chain, and transforming the analysis suite into deployable services accessible by all projects within the company.
Publisher's Version Article Search

New Ideas Papers

Continuous Code Quality: Are We (Really) Doing That?
Carmine Vassallo, Fabio Palomba, Alberto Bacchelli, and Harald C. Gall
(University of Zurich, Switzerland)
Continuous Integration (CI) is a software engineering practice where developers constantly integrate their changes to a project through an automated build process. The goal of CI is to provide developers with prompt feedback on several quality dimensions after each change. Indeed, previous studies provided empirical evidence on a positive association between properly following CI principles and source code quality. A core principle behind CI is Continuous Code Quality (also known as CCQ, which includes automated testing and automated code inspection) may appear simple and effective, yet we know little about its practical adoption. In this paper, we propose a preliminary empirical investigation aimed at understanding how rigorously practitioners follow CCQ. Our study reveals a strong dichotomy between theory and practice: developers do not perform continuous inspection but rather control for quality only at the end of a sprint and most of the times only on the release branch. Preprint []. Data and Materials [].
Publisher's Version Article Search
RUDSEA: Recommending Updates of Dockerfiles via Software Environment Analysis
Foyzul Hassan, Rodney Rodriguez, and Xiaoyin Wang
(University of Texas at San Antonio, USA)
Dockerfiles are configuration files of docker images which package all dependencies of a software to enable convenient software deployment and porting. In other words, dockerfiles list all environment assumptions of a software application's build and / or execution, so they need to be frequently updated when the environment assumptions change during fast software evolution. In this paper, we propose RUDSEA, a novel approach to recommend updates of dockerfiles to developers based on analyzing changes on software environment assumptions and their impacts. Our evaluation on 1,199 real-world instruction updates shows that RUDSEA can recommend correct update locations for 78.5% of the updates, and correct code changes for 44.1% of the updates.
Publisher's Version Article Search
Delta Debugging Microservice Systems
Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Wenhai Li, Chao Ji, and Dan Ding
(Fudan University, China; University of Illinois at Urbana-Champaign, USA; Singapore University of Technology and Design, Singapore)
Debugging microservice systems involves the deployment and manipulation of microservice systems on a containerized environment and faces unique challenges due to the high complexity and dynamism of microservices. To address these challenges, in this paper, we propose a debugging approach for microservice systems based on the delta debugging algorithm, which is to minimize failureinducing deltas of circumstances (e.g., deployment, environmental configurations) for effective debugging. Our approach includes novel techniques for defining, deploying/manipulating, and executing deltas following the idea of delta debugging. In particular, to construct a (failing) circumstance space for delta debugging to minimize, our approach defines a set of dimensions that can affect the execution of microservice systems. Our experimental study on a medium-size microservice benchmark system shows that our approach can effectively identify failure-inducing deltas that help diagnose the root causes.
Publisher's Version Article Search
Personalized Teammate Recommendation for Crowdsourced Software Developers
Luting Ye, Hailong Sun, Xu Wang, and Jiaruijue Wang
(Beihang University, China; Beijing Advanced Innovation Center for Big Data and Brain Computing, China)
Most crowdsourced software development platforms adopt contest paradigm to solicit contributions from the community. To attain competitiveness in complex tasks, crowdsourced software developers often choose to work with others collaboratively. However, existing crowdsourcing platforms generally assume independent contributions from developers and do not provide effective support for team formation. Prior studies on team recommendation aim at optimizing task outcomes by recommending the most suitable team for a task instead of finding appropriate collaborators for a specific person. In this work, we are concerned with teammate recommendation for crowdsourcing developers. First, we present the results of an empirical study of Kaggle, which shows that developers’personal teammate preferences are mainly affected by three factors. Second, we give a collaboration willingness model to characterize developers’ teammate preferences and formulate teammate recommendation as an optimization problem. Then we design a heuristic algorithm to find suitable teammates for a developer. Finally, we have conducted a set of experiments on a Kaggle dataset to evaluate the effectiveness of our approach.
Publisher's Version Article Search
S-gram: Towards Semantic-Aware Security Auditing for Ethereum Smart Contracts
Han Liu, Chao Liu, Wenqi Zhao, Yu Jiang, and Jiaguang Sun
(Tsinghua University, China; Peking University, China; Ant Financial, China)
Smart contracts, as a promising and powerful application on the Ethereum blockchain, have been growing rapidly in the past few years. Since they are highly vulnerable to different forms of attacks, their security becomes a top priority. However, existing security auditing techniques are either limited in fnding vulnerabilities (rely on pre-defned bug paterns) or very expensive (rely on program analysis), thus are insufcient for Ethereum. To mitigate these limitations, we proposed a novel semanticaware security auditing technique called S-gram for Ethereum. The key insight is a combination of N-gram language modeling and lightweight static semantic labeling, which can learn statistical regularities of contract tokens and capture high-level semantics as well (e.g., flow sensitivity of a transaction). S-gram can be used to predict potential vulnerabilities by identifying irregular token sequences and optimize existing in-depth analyzers (e.g., symbolic execution engines, fuzzers etc.). We have implemented S-gram for Solidity smart contracts in Ethereum. The evaluation demonstrated the potential of S-gram in identifying possible security issues.
Publisher's Version Article Search
An Evolutionary Approach for Analyzing Alloy Specifications
Jianghao Wang, Hamid Bagheri, and Myra B. Cohen
(University of Nebraska-Lincoln, USA)
Formal methods use mathematical notations and logical reasoning to precisely define a program's specifications, from which we can instantiate valid instances of a system. With these techniques we can perform a multitude of tasks to check system dependability. Despite the existence of many automated tools including ones considered lightweight, they still lack a strong adoption in practice. At the crux of this problem, is scalability and applicability to large real world applications. In this paper we show how to relax the completeness guarantee without much loss, since soundness is maintained. We have extended a popular lightweight analysis, Alloy, with a genetic algorithm. Our new tool, EvoAlloy, works at the level of finite relations generated by Kodkod and evolves the chromosomes based on the failed constraints. In a feasibility study we demonstrate that we can find solutions to a set of specifications beyond the scope where traditional Alloy fails. While small specifications take longer with EvoAlloy, the scalability means we can handle larger specifications. Our future vision is that when specifications are small we can maintain both soundness and completeness, but when this fails, EvoAlloy can switch to its genetic algorithm.
Publisher's Version Article Search
A Neural Framework for Retrieval and Summarization of Source Code
Qingying Chen and Minghui Zhou
(Peking University, China)
Code retrieval and summarization are two tasks often employed by software developers to reuse code that spreads over online repositories. In this paper, we present a neural framework that allows bidirectional mapping between source code and natural language to improve these two tasks. Our framework, BVAE, is designed to have two Variational AutoEncoders (VAEs) to model bimodal data: C-VAE for source code and L-VAE for natural language. Both VAEs are trained jointly to reconstruct their input as much as possible with regularization that captures the closeness between the latent variables of code and description. BVAE could learn semantic vector representations for both code and description and generate completely new descriptions for arbitrary code snippets. We design two instance models of BVAE for retrieval and summarization tasks respectively and evaluate their performance on a benchmark which involves two programming languages: C# and SQL. Experiments demonstrate BVAE’s potential on the two tasks.
Publisher's Version Article Search
An Empirical Investigation into Learning Bug-Fixing Patches in the Wild via Neural Machine Translation
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk
(College of William and Mary, USA; University of Lugano, Switzerland; University of Sannio, Italy)
Millions of open-source projects with numerous bug fixes are available in code repositories. This proliferation of software development histories can be leveraged to learn how to fix common programming bugs. To explore such a potential, we perform an empirical study to assess the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects. We mine millions of bug-fixes from the change histories of GitHub repositories to extract meaningful examples of such bug-fixes. Then, we abstract the buggy and corresponding fixed code, and use them to train an Encoder-Decoder model able to translate buggy code into its fixed version. Our model is able to fix hundreds of unique buggy methods in the wild. Overall, this model is capable of predicting fixed patches generated by developers in 9% of the cases.
Publisher's Version Article Search
Loop Path Reduction by State Pruning
Jianxiong Gao and Steven S. Lumetta
(University of Illinois at Urbana-Champaign, USA)

Path explosion has been a problem for symbolic execution for a long time. The key to avoid path explosion is to limit the number of paths generated within loops while maintaining high code coverage. Full symbolic execution creates paths for every possible execution path. Frequently, paths within loops do not contribute to code coverage. Branches within loops may generate new states at every iteration. The path explosion problem created by loops often stops symbolic execution to reach deeper parts of the code.

In this paper, we propose a new path exploration method that reduces the number of states needed to achieve high coverage. Our algorithm limits the number of new states created by first prioritizing states, and then pruning the states that do not contribute to code coverage. Our algorithm does not require loop invariant inference/loop summarization, nor does it bound the number of iterations of loop exploration. The proposed algorithm can thus handle a broader set of loops than previous approaches. In fact, our algorithm is orthogonal to loop summarization techniques and search-guide heuristics, so it complements the current methods.

We have implemented our algorithm using KLEE and tested with 235 student-generated versions of a classroom assignment. Our results show that our algorithm helps to achieve the same coverage with speedup of 11.8× for 117 out of the 235 programs, while adding 15% max observed and 2% average overhead over the 50% of programs not benefiting from the technique. The maximum speedup for a single program is 52.3×.

Publisher's Version Article Search
node2defect: Using Network Embedding to Improve Software Defect Prediction
Yu Qu, Ting Liu, Jianlei Chi, Yangxu Jin, Di Cui, Ancheng He, and Qinghua Zheng
(Xi'an Jiaotong University, China)
Network measures have been proved to be useful in predicting software defects. Leveraging the dependency relationships between software modules, network measures can capture various structural features of software systems. However, existing studies have relied on user-defined network measures (e.g., degree statistics or centrality metrics), which are inflexible and require high computation cost, to describe the structural features. In this paper, we propose a new method called node2defect which uses a newly proposed network embedding technique, node2vec, to automatically learn to encode dependency network structure into low-dimensional vector spaces to improve software defect prediction. Specifically, we firstly construct a program's Class Dependency Network. Then node2vec is used to automatically learn structural features of the network. After that, we combine the learned features with traditional software engineering features, for accurate defect prediction. We evaluate our method on 15 open source programs. The experimental results show that in average, node2defect improves the state-of-the-art approach by 9.15% in terms of F-measure.
Publisher's Version Article Search
Towards Automating Disambiguation of Regulations: Using the Wisdom of Crowds
Manasi Patwardhan, Abhishek Sainani, Richa Sharma, Shirish Karande, and Smita Ghaisas
(TCS Research, India)
Compliant software is a critical need of all modern businesses. Disambiguating regulations to derive requirements is therefore an important software engineering activity. Regulations however are ridden with ambiguities that make their comprehension a challenge, seemingly surmountable only by legal experts. Since legal experts' involvement in every project is expensive, approaches to automate the disambiguation need to be explored. These approaches however require a large amount of annotated data. Collecting data exclusively from experts is not a scalable and affordable solution. In this paper, we present the results of a crowd sourcing experiment to collect annotations on ambiguities in regulations from professional software engineers. We discuss an approach to automate the arduous and critical step of identifying ground truth labels by employing crowd consensus using Expectation Maximization (EM). We demonstrate that the annotations reaching a consensus match those of experts with an accuracy of 87%.
Publisher's Version Article Search

Tool Demonstrations

jStanley: Placing a Green Thumb on Java Collections
Rui Pereira, Pedro Simão, Jácome Cunha, and João Saraiva
(INESC TEC, Portugal; University of Minho, Portugal; NOVA-LINCS, Portugal; Universidade Nova Lisboa, Portugal)
Software developers are more and more eager to understand their code’s energy performance. However, even with such knowledge it is di cult to know how to improve the code. Indeed, little tool support exists to understand the energy consumption pro le of a software system and to eventually (automatically) improve its code. In this paper we present a tool termed jStanley which automatically nds collections in Java programs that can be replaced by others with a positive impact on the energy consumption as well as on the execution time. In seconds, developers obtain information about energy-eager collection usage. jStanley will further suggest alternative collections to improve the code, making it use less time, energy, or a combination of both. The preliminary evaluation we ran using jStanley shows energy gains between 2% and 17%, and a reduction in execution time between 2% and 13%. A video can be seen at
Publisher's Version Article Search Info
SRCIROR: A Toolset for Mutation Testing of C Source Code and LLVM Intermediate Representation
Farah Hariri and August Shi
(University of Illinois at Urbana-Champaign, USA)
We present SRCIROR (pronounced “sorcerer“), a toolset for performing mutation testing at the levels of C/C++ source code (SRC) and the LLVM compiler intermediate representation (IR). At the SRC level, SRCIROR identifies program constructs for mutation by pattern-matching on the Clang AST. At the IR level, SRCIROR directly mutates the LLVM IR instructions through LLVM passes. Our implementation enables SRCIROR to (1) handle any program that Clang can handle, extending to large programs with a minimal overhead, and (2) have a small percentage of invalid mutants that do not compile. SRCIROR enables performing mutation testing using the same classes of mutation operators at both the SRC and IR levels, and it is easily extensible to support more operators. In addition, SRCIROR can collect coverage to generate mutants only for covered code elements. Our tool is publicly available on GitHub ( We evaluate SRCIROR on Coreutils subjects. Our evaluation shows interesting differences between SRC and IR, demonstrating the value of SRCIROR in enabling mutation testing research across different levels of code representation.
Publisher's Version Article Search
Lightweight Source Code Monitoring with Triggr
Alim Ozdemir, Ayse Tosun, Hakan Erdogmus, and Rui Abreu
(Istanbul Technical University, Turkey; Carnegie Mellon University, USA; University of Lisbon, Portugal)
Existing tools for monitoring the quality of codebases modified by multiple developers tend to be centralized and inflexible. These tools increase the visibility of quality by producing effective reports and visualizations when a change is made to the codebase and triggering alerts when undesirable situations occur. However, their configuration is invariably both (a) centrally managed in that individual maintainers cannot define local rules to receive customized feedback when a change occurs in a specific part of the code in which they are particularly interested, and (b) coarse-grained in that analyses cannot be turned on and off below the file level. Triggr, the tool proposed in this paper, addresses these limitations by allowing distributed, customized, and fine-grained monitoring. It is a lightweight re-implementation of our previous tool, CodeAware, which adopts the same paradigm. The tool listens on a codebase’s shared repository using an event-based approach, and can send alerts to subscribed developers based on rules defined locally by them. Triggr is open-source and available at A demonstration video can be found at
Publisher's Version Article Search
OCELOT: A Search-Based Test-Data Generation Tool for C
Simone Scalabrino, Giovanni Grano, Dario Di Nucci, Michele Guerra, Andrea De Lucia, Harald C. Gall, and Rocco Oliveto
(University of Molise, Italy; University of Zurich, Switzerland; Vrije Universiteit Brussel, Belgium; University of Salerno, Italy)
Automatically generating test cases plays an important role to reduce the time spent by developers during the testing phase. In last years, several approaches have been proposed to tackle such a problem: amongst others, search-based techniques have been shown to be particularly promising. In this paper we describe Ocelot, a search-based tool for the automatic generation of test cases in C. Ocelot allows practitioners to write skeletons of test cases for their programs and researchers to easily implement and experiment new approaches for automatic test-data generation. We show that Ocelot achieves a higher coverage compared to a competitive tool in 81% of the cases. Ocelot is publicly available to support both researchers and practitioners.
Publisher's Version Article Search
Live Versioning of Web Applications through Refactoring
Julián Grigera, Juan Cruz Gardey, Alejandra Garrido, and Gustavo Rossi
(National University of La Plata, Argentina; CIC, Argentina; CONICET, Argentina)
Client-Side Web Refactorings (CSWRs) allow improving interaction aspects of web applications by applying changes to the user interface without altering the code base, even in production settings. However, developers are not always willing, or even allowed to apply external adaptations to their applications’ user interface. Besides, CSWRs do not guarantee improvements in all contexts, so it may be unwise to install them in a production version. We propose a tool that allows creating private versions of a running web application almost automatically. Using this tool, developers or usability experts can easily combine CSWRs to create alternative versions of web applications, without the need of creating a cloned sandbox environment for each version. This yields many uses, such as quickly setting up user tests, showing live alternatives to Product Owners, and even performing A/B testing. The tool is built on top of Kobold, a service that allows applying CSWRs to fix usability smells. Kobold with versioning is available at: A screencast of the tool is available at
Publisher's Version Article Search Video
code_call_lens: Raising the Developer Awareness of Critical Code
Andrea Janes, Michael Mairegger, and Barbara Russo
(Free University of Bolzano, Italy)
As a developer, it is often complex to foresee the impact of changes in source code on usage, e.g., it is time-consuming to find out all components that will be impacted by a change or estimate the impact on the usability of a failing piece of code. It is therefore hard to decide how much effort in quality assurance is justifiable to obtain the desired business goals. In this paper, to reduce the difficulty for developers to understand the importance of source code, we propose an automated way to provide this information to developers as they are working on a given piece of code. As a proof-of-concept, we developed a plug-in for Microsoft Visual Studio Code that informs about the importance of source code methods based on the frequency of usage by the end-users of the developed software. The plug-in aims to increase the awareness developers have about the importance of source code in an unobtrusive way, helping them to prioritize their effort to quality assurance, technical excellence, and usability. code_call_lens can be downloaded from GitHub at
Publisher's Version Article Search
Code2graph: Automatic Generation of Static Call Graphs for Python Source Code
Gharib Gharibi, Rashmi Tripathi, and Yugyung Lee
(University of Missouri-Kansas City, USA)
A static call graph is an imperative prerequisite used in most interprocedural analyses and software comprehension tools. However, there is a lack of software tools that can automatically analyze the Python source-code and construct its static call graph. In this paper, we introduce a prototype Python tool, named code2graph, which automates the tasks of (1) analyzing the Python source-code and extracting its structure, (2) constructing static call graphs from the source code, and (3) generating a similarity matrix of all possible execution paths in the system. Our goal is twofold: First, assist the developers in understanding the overall structure of the system. Second, provide a stepping stone for further research that can utilize the tool in software searching and similarity detection applications. For example, clustering the execution paths into a logical workflow of the system would be applied to automate specific software tasks. Code2graph has been successfully used to generate static call graphs and similarity matrices of the paths for three popular open-source Deep Learning projects (TensorFlow, Keras, PyTorch). A tool demo is available at
Publisher's Version Article Search Video
The Electrum Analyzer: Model Checking Relational First-Order Temporal Specifications
Julien Brunel, David Chemouil, Alcino Cunha, and Nuno Macedo
(ONERA, France; University of Toulouse, France; INESC TEC, Portugal; University of Minho, Portugal)
This paper presents the Electrum Analyzer, a free-software tool to validate and perform model checking of Electrum specifications. Electrum is an extension of Alloy that enriches its relational logic with LTL operators, thus simplifying the specification of dynamic systems. The Analyzer supports both automatic bounded model checking, with an encoding into SAT, and unbounded model checking, with an encoding into SMV. Instance, or counter-example, traces are presented back to the user in a unified visualizer. Features to speed up model checking are offered, including a decomposed parallel solving strategy and the extraction of symbolic bounds.
Publisher's Version Article Search Video Info
ESBMC 5.0: An Industrial-Strength C Model Checker
Mikhail R. Gadelha, Felipe R. Monteiro, Jeremy Morse, Lucas C. Cordeiro, Bernd Fischer, and Denis A. Nicole
(University of Southampton, UK; Federal University of Amazonas, Brazil; University of Bristol, UK; University of Manchester, UK; Stellenbosch University, South Africa)
ERROR in abstract.
Publisher's Version Article Search Video Info
L-CMP: An Automatic Learning-Based Parameterized Verification Tool
Jialun Cao, Yongjian Li, and Jun Pang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; University of Luxembourg, Luxembourg)
This demo introduces L-CMP, an automatic learning-based parameterized verification tool. It can verify parameterized protocols by combining machine learning and model checking techniques. Given a parameterized protocol, L-CMP learns a set of auxiliary invariants and implements verification of the protocol using the invariants automatically. In particular, the learned auxiliary invariants are straightforward and readable. The experimental results show that L-CMP can successfully verify a number of cache coherence protocols, including the industrial-scale FLASH protocol. The video is available at, and L-CMPL-CMP can be downloaded at ArabelaTso/Learning-Based-ParaVerifer.
Publisher's Version Article Search
VulSeeker: A Semantic Learning Based Vulnerability Seeker for Cross-Platform Binary
Jian Gao, Xin Yang, Ying Fu, Yu Jiang, and Jiaguang Sun
(Tsinghua University, China)
Code reuse improves software development efficiency, however, vulnerabilities can be introduced inadvertently. Many existing works compute the code similarity based on CFGs to determine whether a binary function contains a known vulnerability. Unfortunately, their performance in cross-platform binary search is challenged. This paper presents VulSeeker, a semantic learning based vulnerability seeker for cross-platform binary. Given a target function and a vulnerable function, VulSeeker first constructs the labeled semantic flow graphs and extracts basic block features as numerical vectors for both of them. Then the embedding vector of the whole binary function is generated by feeding the numerical vectors of basic blocks to the customized semantics aware DNN model. Finally, the similarity of the two binary functions is measured based on the Cosine distance. The experimental results show that VulSeeker outperforms the state-of-the-art approaches in terms of accuracy. For example, compared to the most recent and related work Gemini, VulSeeker finds 50.00% more vulnerabilities in the top-10 candidates and 13.89% more in the top-50 candidates, and improves the values of AUC and ACC for 8.23% and 12.14% respectively. The video is presented at
Publisher's Version Article Search
CPA-SymExec: Efficient Symbolic Execution in CPAchecker
Dirk Beyer and Thomas Lemberger
(LMU Munich, Germany)
We present CPA-SymExec, a tool for symbolic execution that is implemented in the open-source, configurable verification framework CPAchecker. Our implementation automatically detects which symbolic facts to track, in order to obtain a small set of constraints that are necessary to decide reachability of a program area of interest. CPA-SymExec is based on abstraction and counterexample-guided abstraction refinement (CEGAR), and uses a constraint-interpolation approach to detect symbolic facts. We show that our implementation can better mitigate the path-explosion problem than symbolic execution without abstraction, by comparing the performance to the state-of-the-art Klee-based symbolic-execution engine Symbiotic and to Klee itself. For the experiments we use two kinds of analysis tasks: one for finding an executable path to a specific location of interest (e.g., if a test vector is desired to show that a certain behavior occurs), and one for confirming that no executable path to a specific location exists (e.g., if it is desired to show that a certain behavior never occurs). CPA-SymExec is released under the Apache 2 license and available (inclusive source code) at A demonstration video is available at
Publisher's Version Article Search Video Info Artifacts Available
CANAL: A Cache Timing Analysis Framework via LLVM Transformation
Chungha Sung, Brandon Paulsen, and Chao Wang
(University of Southern California, USA)
ERROR in abstract.
Publisher's Version Article Search Video Info
Descartes: A PITest Engine to Detect Pseudo-Tested Methods: Tool Demonstration
Oscar Luis Vera-Pérez, Martin Monperrus, and Benoit Baudry
(Inria, France; KTH, Sweden)
Descartes is a tool that implements extreme mutation operators and aims at finding pseudo-tested methods in Java projects. It leverages the efficient transformation and runtime features of PITest. The demonstration compares Descartes with Gregor, the default mutation engine provided by PITest, in a set of real open source projects. It considers the execution time, number of mutants created and the relationship between the mutation scores produced by both engines. It provides some insights on the main features exposed byDescartes.
Publisher's Version Article Search Info
DKVF: A Framework for Rapid Prototyping and Evaluating Distributed Key-Value Stores
Mohammad Roohitavaf and Sandeep Kulkarni
(Michigan State University, USA)
We present our framework DKVF that enables one to quickly prototype and evaluate new consistency protocols for key-value stores. DKVF is designed based on the separation of concerns in creating distributed data stores. This separation of concerns allows the designers of consistency protocols to only focus on the high-level consistency protocols which gives them the opportunity to quickly deploy a consistency protocol and evaluate its performance. Moreover, the loose coupling of the different components allows us to easily change different components (e.g. storage engine) of an implementation. We demonstrate DKVF by implementing four existing protocols --eventual consistency, COPS, GentleRain and CausalSpartan-- with it. The implementation of these protocols was very convenient with DKVF, as it only required to write a piece of code for the consistency component that is very close to the pseudocode of the original papers. Hence, it was possible to achieve this in just 1-2 days per protocol. DKVF also comes with a toolset that facilitates running clusters and performing experiments. Tutorial video:
Publisher's Version Article Search Video Info
DroidMate-2: A Platform for Android Test Generation
Nataniel P. Borges Jr., Jenny Hotzkow, and Andreas Zeller
(Saarland University, Germany)
Android applications (apps) represent an ever increasing portion of the software market. Automated test input generators are the state of the art for testing and security analysis. We introduce DroidMate-2 (DM-2), a platform to easily assist both developers and researchers to customize, develop and test new test generators. DM-2 can be used without app instrumentation or operating system modifications, as a test generator on real devices and emulators for app testing or regression testing. Additionally, it provides sensitive resource monitoring or blocking capabilities through a lightweight app instrumentation, out-of-thebox statement coverage measurement through a fully-fledged app instrumentation and native experiment reproducibility. In our experiments we compared DM-2 against DroidBot, a state-of-the-art test generator by measuring statement coverage. Our results show that DM-2 reached 96% of its peak coverage in less than 2/3 of the time needed by DroidBot, allowing for better and more efficient tests. On short runs (5 minutes) DM-2 outperformed DroidBot by 7% while in longer runs (1 hour) this difference increases to 8%. ACM DL Artifact: For the details see:
Publisher's Version Article Search Info Artifacts Available

Doctoral Symposium

Session 1

Assessing and Evaluating Functional Suitability of Software
Philipp Haindl
(JKU Linz, Austria)
While formal task models allow definition and time-based assessment of user-interaction, they have not yet been used as a baseline for assessing the variability patterns of user interaction with software. Consequently, operationalizing these variability patterns enables us to evaluate how suitable the defined task-execution paths are for the user to achieve a predefined goal. Improvements of the software could thereby be evidence-based on knowledge of changes' effects on functional suitability for the user rather than on prospective or opinion-based usage scenarios. Following a design thinking approach tailored to software engineering, understanding and observing these variability patterns are mandatory steps for continuous user-centered improvement of software. In practice however, due to the absence of an operational quality model for functional suitability it is hardly possible to effectively put these operational measures into context and to derive concrete software improvement actions. Having operational suitability metrics is even more important for software engineers as it allows to increasingly focus development activities especially on functionality which has business value for the user.
Publisher's Version Article Search

Session 2

Automatic Mining of Constraints for Monitoring Systems of Systems
Thomas Krismayer
(JKU Linz, Austria)
The behavior of complex software-intensive systems of systems often only fully emerges during operation, when all systems interact with each other and with their environment. Runtime monitoring approaches are thus used to detect deviations from the expected behavior, which is commonly defined by engineers, e.g., using temporal logic or domain-specific languages. However, the deep domain knowledge required to specify constraints is often not available during the development of systems of systems with multiple teams independently working on heterogeneous components. In this paper, we thus describe our ongoing PhD research to automatically mine constraints for runtime monitoring from recorded events. Our approach mines constraints on event occurrence, timing, data, and combinations of these properties. The approach further presents the mined constraints to users offering multiple ranking strategies and can also be used to support users in system evolution scenarios.
Publisher's Version Article Search
Towards Automatic Restrictification of CUDA Kernel Arguments
Rokiatou Diarra
(University of Paris-Sud, France)
ERROR in abstract.
Publisher's Version Article Search
A DSL for Requirements in the Context of a Seamless Approach
Florian Galinier
(IRIT, France; University of Toulouse, France)
Reducing the lack of consistency between requirements and the system that should satisfy these requirements is one of the major issue in Requirement Engineering (RE). The objective of my thesis work is to propose a seamless approach, allowing users to express requirements, specifications and the system itself in a unique language. The purpose of formal approaches is to reduce inconsistency. However, most developers are not familiar with these approaches, and they are not often used outside the critical systems domain. Since we want that non-experts can also use our approach to validate systems in the early stage of their development, we propose a Domain Specific Language (DSL) that is: (i) close to natural language, and (ii) based on a formal semantics. Using Model-Driven Engineering (MDE), this language bridges the gap not only between the several stakeholders that can be involved in a project, considering their different backgrounds, but also between the requirements and the code.
Publisher's Version Article Search

Session 3

A Multi-objective Framework for Effective Performance Fault Injection in Distributed Systems
Luca Traini
(University of L'Aquila, Italy)
Modern distributed systems should be built to anticipate performance degradation. Often requests in these systems involve ten to thousands Remote Procedure Calls, each of which can be a source of performance degradation. The PhD programme presented here intends to address this issue by providing automated instruments to effectively drive performance fault injection in distributed systems. The envisioned approach exploits multi-objective search-based techniques to automatically find small combinations of tiny performance degradations induced by specific RPCs,which have significant impacts on the user-perceived performance. Automating the search of these events will improve the ability to inject performance issues in production in order to force developers to anticipate and mitigate them.
Publisher's Version Article Search
Top-Down Model-Driven Engineering of Web Services from Extended OpenAPI Models
David Sferruzza
(LS2N, France)
Web services engineering is a crucial subject, because web services are often built to be used by other programs; thus they should have a good documentation targeting developers. Furthermore, when building a digital product, engineers need to build several programs that interact with a central instance of web services. OpenAPI, a popular industry standard, makes possible to document web services in order to quickly make a prototype of the product. It allows a top-down process where developers iterate to build an OpenAPI model that describes the web services they want, and then implement both the web services and the programs that will consume them. However, when doing such rapid prototyping, developers tend to either skip this design phase and implement web services right away, or stop updating the OpenAPI model when the product is released; in both cases they cannot take advantage of having an OpenAPI model aligned with the implementation. We show how OpenAPI can be extended to add implementation details inside models. These extensions link services to assemblies of components that describe computations. Hence a top-down development process that keeps model and implementation aligned. Moreover, this makes possible for developers to benefit from more support features while keeping the same flexibility.
Publisher's Version Article Search
Differential Program Analysis with Fuzzing and Symbolic Execution
Yannic Noller
(Humboldt-Universität zu Berlin, Germany)
Differential program analysis means to identify the behavioral divergences in one or multiple programs, and it can be classified into two categories: identify the behavioral divergences (1) between two program versions for the same input (aka regression analysis), and (2) for the same program with two different inputs (e.g, side-channel analysis). Most of the existent approaches for both subproblems try to solve it with single techniques, which suffer from its weaknesses like scalability issues or imprecision. This research proposes to combine two very strong techniques, namely fuzzing and symbolic execution to tackle these problems and provide scalable solutions for real-world applications. The proposed approaches will be implemented on top of state-of-the-art tools like AFL and Symbolic PathFinder to evaluate them against existent work.
Publisher's Version Article Search
Software Engineering Techniques Applied to Relational Databases
Julien Delplanque
(University of Lille, France; CNRS, France; Inria, France)
Relational databases play a central role in software systems. As any software artefact they are subject to software engineering related problems. That apply even more when these databases hold behaviour in the form of stored functions, triggers, etc... The purpose of this article is to present a research plan to handle this problematic.
Publisher's Version Article Search

Journal-First Papers

Automatically Quantifying the Impact of a Change in Systems (Journal-First Abstract)
Nada Almasri, Luay Tahat, and Bogdan Korel
(Gulf University for Science and Technology, Kuwait; Illinois Institute of Technology, USA)
Software maintenance is becoming more challenging with the increased complexity of the software and the frequently applied changes. Performing impact analysis before the actual implementation of a change is a crucial task during system maintenance. While many tools and techniques are available to measure the impact of a change at the code level, only a few research work is done to measure the impact of a change at an earlier stage in the development process. Measuring the impact of a change at the model level speeds up the maintenance process allowing early discovery of critical components of the system before applying the actual change at the code level. In this paper, we present model-based impact analysis approach for state-based systems such as telecommunication or embedded systems. The proposed approach uses model dependencies to automatically measure the expected impact for a requested change instead of relying on the expertise of system maintainers, and it generates two impact sets representing the lower bound and the upper bound of the impact. Although it can be extended to other behavioral models, the presented approach mainly addresses extended finite-state machine (EFSM) models. An empirical study is conducted on six EFSM models to investigate the usefulness of the proposed approach. The results show that on average the size of the impact after a single modification (a change in a one EFSM transition) ranges between 14 and 38 % of the total size of the model. For a modification involving multiple transitions, the average size of the impact ranges between 30 and 64 % of the total size of the model. Additionally, we investigated the relationships (correlation) between the structure of the EFSM model, and the size of the impact sets. Upon preliminary analysis of the correlation, the concepts of model density and data density were defined, and it was found that they could be the major factors influencing the sizes of impact sets for models. As a result, these factors can be used to determine the types of models for which the proposed approach is the most appropriate.
Publisher's Version Article Search
Estimating the Number of Remaining Links in Traceability Recovery (Journal-First Abstract)
Davide Falessi, Massimiliano Di Penta, Gerardo Canfora, and Giovanni Cantone
(California Polytechnic State University, USA; University of Sannio, Italy; University of Rome Tor Vergata, Italy)
Although very important in software engineering, establishing traceability links between software artifacts is extremely tedious, error-prone, and it requires significant effort. Even when approaches for automated traceability recovery exist, these provide the requirements analyst with a, usually very long, ranked list of candidate links that needs to be manually inspected. In this paper we introduce an approach called Estimation of the Number of Remaining Links (ENRL) which aims at estimating, via Machine Learning (ML) classifiers, the number of remaining positive links in a ranked list of candidate traceability links produced by a Natural Language Processing techniques-based recovery approach. We have evaluated the accuracy of the ENRL approach by considering several ML classifiers and NLP techniques on three datasets from industry and academia, and concerning traceability links among different kinds of software artifacts including requirements, use cases, design documents, source code, and test cases. Results from our study indicate that: (i) specific estimation models are able to provide accurate estimates of the number of remaining positive links; (ii) the estimation accuracy depends on the choice of the NLP technique, and (iii) univariate estimation models outperform multivariate ones.
Publisher's Version Article Search
Bounded Model Checking of C++ Programs Based on the Qt Cross-Platform Framework (Journal-First Abstract)
Felipe R. Monteiro, Mário A. P. Garcia, Lucas C. Cordeiro, and Eddie B. de Lima Filho
(Federal University of Amazonas, Brazil; University of Manchester, UK; TPV Technology, Brazil)
This work proposes an abstraction of the Qt framework, named as Qt Operational Model (QtOM), which is integrated into two different verification approaches: explicit-state model checking and symbolic (bounded) model checking. The proposed methodology is the first one to formally verify Qt-based applications, which has the potential to devise new directions for software verification of portable code. The full version of this paper is published in Software Testing, Verification and Reliability, on 02 March 2017 and it is available at
Publisher's Version Article Search Video Info
Understanding Semi-structured Merge Conflict Characteristics in Open-Source Java Projects (Journal-First Abstract)
Paola Accioly, Paulo Borba, and Guilherme Cavalcanti
(Federal University of Pernambuco, Brazil)
ERROR in abstract.
Publisher's Version Article Search

proc time: 0.57