ASE 2018 – Author Index |
Contents -
Abstracts -
Authors
|
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Aafer, Yousra |
![]() Zhenhao Tang, Juan Zhai, Minxue Pan, Yousra Aafer, Shiqing Ma, Xiangyu Zhang, and Jianhua Zhao (Nanjing University, China; Purdue University, USA) Modern Android malwares tend to use advanced techniques to cover their malicious behaviors. They usually feature multi-staged, condition-guarded and environment-specific payloads. An increasing number of them utilize WebView, particularly the two-way communications between Java and JavaScript, to evade detection and analysis of existing techniques. We propose Dual-Force, a forced execution technique which simultaneously forces both Java and JavaScript code of WebView applications to execute along various paths without requiring any environment setup or providing any inputs manually. As such, the hidden payloads of WebView malwares are forcefully exposed. The technique features a novel execution model that allows forced execution to suppress exceptions and continue execution. Experimental results show that Dual-Force precisely exposes malicious payload in 119 out of 150 WebView malwares. Compared to the state-of-the-art, Dual-Force can expose 23% more malicious behaviors. ![]() ![]() |
|
Abdessalem, Raja Ben |
![]() Raja Ben Abdessalem, Annibale Panichella, Shiva Nejati, Lionel C. Briand, and Thomas Stifter (University of Luxembourg, Luxembourg; Delft University of Technology, Netherlands; IEE, Luxembourg) Complex systems such as autonomous cars are typically built as a composition of features that are independent units of functionality. Features tend to interact and impact one another's behavior in unknown ways. A challenge is to detect and manage feature interactions, in particular, those that violate system requirements, hence leading to failures. In this paper, we propose a technique to detect feature interaction failures by casting this problem into a search-based test generation problem. We define a set of hybrid test objectives (distance functions) that combine traditional coverage-based heuristics with new heuristics specifically aimed at revealing feature interaction failures. We develop a new search-based test generation algorithm, called FITEST, that is guided by our hybrid test objectives. FITEST extends recently proposed many-objective evolutionary algorithms to reduce the time required to compute fitness values. We evaluate our approach using two versions of an industrial self-driving system. Our results show that our hybrid test objectives are able to identify more than twice as many feature interaction failures as two baseline test objectives used in the software testing literature (i.e., coverage-based and failure-based test objectives). Further, the feedback from domain experts indicates that the detected feature interaction failures represent real faults in their systems that were not previously identified based on analysis of the system features and their requirements. ![]() ![]() |
|
Abreu, Rui |
![]() Alim Ozdemir, Ayse Tosun, Hakan Erdogmus, and Rui Abreu (Istanbul Technical University, Turkey; Carnegie Mellon University, USA; University of Lisbon, Portugal) Existing tools for monitoring the quality of codebases modified by multiple developers tend to be centralized and inflexible. These tools increase the visibility of quality by producing effective reports and visualizations when a change is made to the codebase and triggering alerts when undesirable situations occur. However, their configuration is invariably both (a) centrally managed in that individual maintainers cannot define local rules to receive customized feedback when a change occurs in a specific part of the code in which they are particularly interested, and (b) coarse-grained in that analyses cannot be turned on and off below the file level. Triggr, the tool proposed in this paper, addresses these limitations by allowing distributed, customized, and fine-grained monitoring. It is a lightweight re-implementation of our previous tool, CodeAware, which adopts the same paradigm. The tool listens on a codebase’s shared repository using an event-based approach, and can send alerts to subscribed developers based on rules defined locally by them. Triggr is open-source and available at https://github.com/lyzerk/Triggr. A demonstration video can be found at https://youtu.be/qQs9aDwXJjY. ![]() ![]() |
|
Abubakar, Muhammad |
![]() Hashim Sharif, Muhammad Abubakar, Ashish Gehani, and Fareed Zaffar (University of Illinois at Urbana-Champaign, USA; Lahore University of Management Sciences, Pakistan; SRI International, USA) With the proliferation of new hardware architectures and ever-evolving user requirements, the software stack is becoming increasingly bloated. In practice, only a limited subset of the supported functionality is utilized in a particular usage context, thereby presenting an opportunity to eliminate unused features. In the past, program specialization has been proposed as a mechanism for enabling automatic software debloating. In this work, we show how existing program specialization techniques lack the analyses required for providing code simplification for real-world programs. We present an approach that uses stronger analysis techniques to take advantage of constant configuration data, thereby enabling more effective debloating. We developed Trimmer, an application specialization tool that leverages user-provided configuration data to specialize an application to its deployment context. The specialization process attempts to eliminate the application functionality that is unused in the user-defined context. Our evaluation demonstrates Trimmer can effectively reduce code bloat. For 13 applications spanning various domains, we observe a mean binary size reduction of 21% and a maximum reduction of 75%. We also show specialization reduces the surface for code-reuse attacks by reducing the number of exploitable gadgets. For the evaluated programs, we observe a 20% mean reduction in the total gadget count and a maximum reduction of 87%. ![]() ![]() |
|
Accioly, Paola |
![]() Paola Accioly, Paulo Borba, and Guilherme Cavalcanti (Federal University of Pernambuco, Brazil) In a collaborative development environment, tasks are commonly assigned to developers working independent from each other. As a result, when trying to integrate these contributions, one might have to deal with conflicting changes. Such conflicts might be detected when merging contributions (merge conflicts), when building the system (build conflicts), or when running tests (semantic conflicts). Regarding such conflicts, previous studies show that they occur frequently, and impair developers’ productivity, as understanding and solving them is a demanding and tedious task that might introduce defects. However, despite the existing evidence in the literature, the structure of changes that lead to conflicts has not been studied yet. Understanding the underlying structure of conflicts, and the involved syntactic language elements, might shed light on how to better avoid them. For example, awareness tools that inform users about ongoing parallel changes such as Syde and Palantír can benefit from knowing the most common conflict patterns to become more efficient. With that aim, in this paper we focus on understanding the underlying structure of merge conflicts. ![]() ![]() |
|
Aguirre, Nazareno |
![]() Renzo Degiovanni, Facundo Molina, Germán Regis, and Nazareno Aguirre (National University of Río Cuarto, Argentina; CONICET, Argentina) Goal-conflict analysis has been widely used as an abstraction for risk analysis in goal-oriented requirements engineering approaches. In this context, where the expected behaviour of the system-to-be is captured in terms of domain properties and goals, identifying combinations of circumstances that may make the goals diverge, i.e., not to be satisfied as a whole, is of most importance. Various approaches have been proposed in order to automatically identify boundary conditions, i.e., formulas capturing goal-divergent situations, but they either apply only to some specific goal expressions, or are affected by scalability issues that make them applicable only to relatively small specifications. In this paper, we present a novel approach to automatically identify boundary conditions, using evolutionary computation. More precisely, we develop a genetic algorithm that, given the LTL formulation of the domain properties and the goals, it searches for formulas that capture divergences in the specification. We exploit a modern LTL satisfiability checker to successfully guide our genetic algorithm to the solutions. We assess our technique on a set of case studies, and show that our genetic algorithm is able to find boundary conditions that cannot be generated by related approaches, and is able to efficiently scale to LTL specifications that other approaches are unable to deal with. ![]() ![]() ![]() |
|
Alizadeh, Vahid |
![]() Vahid Alizadeh and Marouane Kessentini (University of Michigan, USA) Refactoring is nowadays widely adopted in the industry because bad design decisions can be very costly and extremely risky. On the one hand, automated refactoring does not always lead to the desired design. On the other hand, manual refactoring is error-prone, time-consuming and not practical for radical changes. Thus, recent research trends in the field focused on integrating developers feedback into automated refactoring recommendations because developers understand the problem domain intuitively and may have a clear target design in mind. However, this interactive process can be repetitive, expensive, and tedious since developers must evaluate recommended refactorings, and adapt them to the targeted design especially in large systems where the number of possible strategies can grow exponentially. In this paper, we propose an interactive approach combining the use of multi-objective and unsupervised learning to reduce the developer's interaction effort when refactoring systems. We generate, first, using multi-objective search different possible refactoring strategies by finding a trade-off between several conflicting quality attributes. Then, an unsupervised learning algorithm clusters the different trade-off solutions, called the Pareto front, to guide the developers in selecting their region of interests and reduce the number of refactoring options to explore. The feedback from the developer, both at the cluster and solution levels, are used to automatically generate constraints to reduce the search space in the next iterations and focus on the region of developer preferences. We selected 14 active developers to manually evaluate the effectiveness our tool on 5 open source projects and one industrial system. The results show that the participants found their desired refactorings faster and more accurate than the current state of the art. ![]() ![]() |
|
Almasri, Nada |
![]() Nada Almasri, Luay Tahat, and Bogdan Korel (Gulf University for Science and Technology, Kuwait; Illinois Institute of Technology, USA) Software maintenance is becoming more challenging with the increased complexity of the software and the frequently applied changes. Performing impact analysis before the actual implementation of a change is a crucial task during system maintenance. While many tools and techniques are available to measure the impact of a change at the code level, only a few research work is done to measure the impact of a change at an earlier stage in the development process. Measuring the impact of a change at the model level speeds up the maintenance process allowing early discovery of critical components of the system before applying the actual change at the code level. In this paper, we present model-based impact analysis approach for state-based systems such as telecommunication or embedded systems. The proposed approach uses model dependencies to automatically measure the expected impact for a requested change instead of relying on the expertise of system maintainers, and it generates two impact sets representing the lower bound and the upper bound of the impact. Although it can be extended to other behavioral models, the presented approach mainly addresses extended finite-state machine (EFSM) models. An empirical study is conducted on six EFSM models to investigate the usefulness of the proposed approach. The results show that on average the size of the impact after a single modification (a change in a one EFSM transition) ranges between 14 and 38 % of the total size of the model. For a modification involving multiple transitions, the average size of the impact ranges between 30 and 64 % of the total size of the model. Additionally, we investigated the relationships (correlation) between the structure of the EFSM model, and the size of the impact sets. Upon preliminary analysis of the correlation, the concepts of model density and data density were defined, and it was found that they could be the major factors influencing the sizes of impact sets for models. As a result, these factors can be used to determine the types of models for which the proposed approach is the most appropriate. ![]() ![]() |
|
Arora, Nipun |
![]() Nipun Arora, Jonathan Bell, Franjo Ivančić, Gail Kaiser, and Baishakhi Ray (Dropbox, USA; George Mason University, USA; Google, USA; Columbia University, USA) Short time-to-localize and time-to-fix for production bugs is extremely important for any 24x7 service-oriented application (SOA). Debugging buggy behavior in deployed applications is hard, as it requires careful reproduction of a similar environment and workload. Prior approaches for automatically reproducing production failures do not scale to large SOA systems. Our key insight is that for many failures in SOA systems (e.g., many semantic and performance bugs), a failure can automatically be reproduced solely by relaying network packets to replicas of suspect services, an insight that we validated through a manual study of 16 real bugs across five different systems. This paper presents Parikshan, an application monitoring framework that leverages user-space virtualization and network proxy technologies to provide a sandbox “debug” environment. In this “debug” environment, developers are free to attach debuggers and analysis tools without impacting performance or correctness of the production environment. In comparison to existing monitoring solutions that can slow down production applications, Parikshan allows application monitoring at significantly lower overhead. ![]() ![]() |
|
Arora, Pryanshu |
![]() Sakshi Udeshi, Pryanshu Arora, and Sudipta Chattopadhyay (Singapore University of Technology and Design, Singapore; BITS Pilani, India) Fairness is a critical trait in decision making. As machine-learning models are increasingly being used in sensitive application domains (e.g. education and employment) for decision making, it is crucial that the decisions computed by such models are free of unintended bias. But how can we automatically validate the fairness of arbitrary machine-learning models? For a given machine-learning model and a set of sensitive input parameters, our Aeqitas approach automatically discovers discriminatory inputs that highlight fairness violation. At the core of Aeqitas are three novel strategies to employ probabilistic search over the input space with the objective of uncovering fairness violation. Our Aeqitas approach leverages inherent robustness property in common machine-learning models to design and implement scalable test generation methodologies. An appealing feature of our generated test inputs is that they can be systematically added to the training set of the underlying model and improve its fairness. To this end, we design a fully automated module that guarantees to improve the fairness of the model. We implemented Aeqitas and we have evaluated it on six stateof- the-art classifiers. Our subjects also include a classifier that was designed with fairness in mind. We show that Aeqitas effectively generates inputs to uncover fairness violation in all the subject classifiers and systematically improves the fairness of respective models using the generated test inputs. In our evaluation, Aeqitas generates up to 70% discriminatory inputs (w.r.t. the total number of inputs generated) and leverages these inputs to improve the fairness up to 94%. ![]() ![]() |
|
Bacchelli, Alberto |
![]() Carmine Vassallo, Fabio Palomba, Alberto Bacchelli, and Harald C. Gall (University of Zurich, Switzerland) Continuous Integration (CI) is a software engineering practice where developers constantly integrate their changes to a project through an automated build process. The goal of CI is to provide developers with prompt feedback on several quality dimensions after each change. Indeed, previous studies provided empirical evidence on a positive association between properly following CI principles and source code quality. A core principle behind CI is Continuous Code Quality (also known as CCQ, which includes automated testing and automated code inspection) may appear simple and effective, yet we know little about its practical adoption. In this paper, we propose a preliminary empirical investigation aimed at understanding how rigorously practitioners follow CCQ. Our study reveals a strong dichotomy between theory and practice: developers do not perform continuous inspection but rather control for quality only at the end of a sprint and most of the times only on the release branch. Preprint [https://doi.org/10.5281/zenodo.1341036]. Data and Materials [http://doi.org/10.5281/zenodo.1341015]. ![]() ![]() ![]() Vladimir Kovalenko, Fabio Palomba, and Alberto Bacchelli (Delft University of Technology, Netherlands; University of Zurich, Switzerland) Modern distributed version control systems, such as Git, offer support for branching — the possibility to develop parts of software outside the master trunk. Consideration of the repository structure in Mining Software Repository (MSR) studies requires a thorough approach to mining, but there is no well-documented, widespread methodology regarding the handling of merge commits and branches. Moreover, there is still a lack of knowledge of the extent to which considering branches during MSR studies impacts the results of the studies. In this study, we set out to evaluate the importance of proper handling of branches when calculating file modification histories. We analyze over 1,400 Git repositories of four open source ecosystems and compute modification histories for over two million files, using two different algorithms. One algorithm only follows the first parent of each commit when traversing the repository, the other returns the full modification history of a file across all branches. We show that the two algorithms consistently deliver different results, but the scale of the difference varies across projects and ecosystems. Further, we evaluate the importance of accurate mining of file histories by comparing the performance of common techniques that rely on file modification history — reviewer recommendation, change recommendation, and defect prediction — for two algorithms of file history retrieval. We find that considering full file histories leads to an increase in the techniques’ performance that is rather modest. ![]() ![]() ![]() |
|
Bagheri, Hamid |
![]() Jianghao Wang, Hamid Bagheri, and Myra B. Cohen (University of Nebraska-Lincoln, USA) Formal methods use mathematical notations and logical reasoning to precisely define a program's specifications, from which we can instantiate valid instances of a system. With these techniques we can perform a multitude of tasks to check system dependability. Despite the existence of many automated tools including ones considered lightweight, they still lack a strong adoption in practice. At the crux of this problem, is scalability and applicability to large real world applications. In this paper we show how to relax the completeness guarantee without much loss, since soundness is maintained. We have extended a popular lightweight analysis, Alloy, with a genetic algorithm. Our new tool, EvoAlloy, works at the level of finite relations generated by Kodkod and evolves the chromosomes based on the failed constraints. In a feasibility study we demonstrate that we can find solutions to a set of specifications beyond the scope where traditional Alloy fails. While small specifications take longer with EvoAlloy, the scalability means we can handle larger specifications. Our future vision is that when specifications are small we can maintain both soundness and completeness, but when this fails, EvoAlloy can switch to its genetic algorithm. ![]() ![]() |
|
Bai, Guangdong |
![]() Yun Lin, Jun Sun, Lyly Tran, Guangdong Bai, Haijun Wang, and Jinsong Dong (National University of Singapore, Singapore; Singapore University of Technology and Design, Singapore; Singapore Institute of Technology, Singapore; Nanyang Technological University, Singapore) Dynamic slicing is a common way of identifying the root cause when a program fault is revealed. With the dynamic slicing technique, the programmers can follow data and control flow along the program execution trace to the root cause. However, the technique usually fails to work on omission bugs, i.e., the faults which are caused by missing executing some code. In many cases, dynamic slicing over-skips the root cause when an omission bug happens, leading the debugging process to a dead end. In this work, we conduct an empirical study on the omission bugs in the Defects4J bug repository. Our study shows that (1) omission bugs are prevalent (46.4%) among all the studied bugs; (2) there are repeating patterns on causes and fixes of the omission bugs; (3) the patterns of fixing omission bugs serve as a strong hint to break the slicing dead end. Based on our findings, we train a neural network model on the omission bugs in Defects4J repository to recommend where to approach when slicing can no long work. We conduct an experiment by applying our approach on 3193 mutated omission bugs which slicing fails to locate. The results show that our approach outperforms random benchmark on breaking the dead end and localizing the mutated omission bugs (63.8% over 2.8%). ![]() ![]() |
|
Bajammal, Mohammad |
![]() Mohammad Bajammal, Davood Mazinanian, and Ali Mesbah (University of British Columbia, Canada) The transformation of a user interface mockup designed by a graphic designer to web components in the final app built by a web developer is often laborious, involving manual and time consuming steps. We propose an approach to automate this aspect of web development by generating reusable web components from a mockup. Our approach employs visual analysis of the mockup, and unsupervised learning of visual cues to create reusable web components (e.g., React components). We evaluated our approach, implemented in a tool called VizMod, on five real-world web mockups, and assessed the transformations and generated components through comparison with web development experts. The results show that VizMod achieves on average 94% precision and 75% recall in terms of agreement with the developers' assessment. Furthermore, the refactorings yielded 22% code reusability, on average. ![]() ![]() |
|
Bao, Liang |
![]() Liang Bao, Xin Liu, Ziheng Xu, and Baoyin Fang (Xidian University, China; University of California at Davis, USA) Distributed message systems (DMSs) serve as the communication backbone for many real-time streaming data processing applications. To support the vast diversity of such applications, DMSs provide a large number of parameters to configure. However, It overwhelms for most users to configure these parameters well for better performance. Although many automatic configuration approaches have been proposed to address this issue, critical challenges still remain: 1) to train a better and robust performance prediction model using a limited number of samples, and 2) to search for a high-dimensional parameter space efficiently within a time constraint. In this paper, we propose AutoConfig -- an automatic configuration system that can optimize producer-side throughput on DMSs. AutoConfig constructs a novel comparison-based model (CBM) that is more robust that the prediction-based model (PBM) used by previous learning-based approaches. Furthermore, AutoConfig uses a weighted Latin hypercube sampling (wLHS) approach to select a set of samples that can provide a better coverage over the high-dimensional parameter space. wLHS allows AutoConfig to search for more promising configurations using the trained CBM. We have implemented AutoConfig on the Kafka platform, and evaluated it using eight different testing scenarios deployed on a public cloud. Experimental results show that our CBM can obtain better results than that of PBM under the same random forests based model. Furthermore, AutoConfig outperforms default configurations by 215.40% on average, and five state-of-the-art configuration algorithms by 7.21%-64.56%. ![]() ![]() |
|
Baudry, Benoit |
![]() Oscar Luis Vera-Pérez, Martin Monperrus, and Benoit Baudry (Inria, France; KTH, Sweden) Descartes is a tool that implements extreme mutation operators and aims at finding pseudo-tested methods in Java projects. It leverages the efficient transformation and runtime features of PITest. The demonstration compares Descartes with Gregor, the default mutation engine provided by PITest, in a set of real open source projects. It considers the execution time, number of mutants created and the relationship between the mutation scores produced by both engines. It provides some insights on the main features exposed byDescartes. ![]() ![]() ![]() |
|
Bavota, Gabriele |
![]() Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk (College of William and Mary, USA; University of Lugano, Switzerland; University of Sannio, Italy) Millions of open-source projects with numerous bug fixes are available in code repositories. This proliferation of software development histories can be leveraged to learn how to fix common programming bugs. To explore such a potential, we perform an empirical study to assess the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects. We mine millions of bug-fixes from the change histories of GitHub repositories to extract meaningful examples of such bug-fixes. Then, we abstract the buggy and corresponding fixed code, and use them to train an Encoder-Decoder model able to translate buggy code into its fixed version. Our model is able to fix hundreds of unique buggy methods in the wild. Overall, this model is capable of predicting fixed patches generated by developers in 9% of the cases. ![]() ![]() |
|
Bell, Jonathan |
![]() Nipun Arora, Jonathan Bell, Franjo Ivančić, Gail Kaiser, and Baishakhi Ray (Dropbox, USA; George Mason University, USA; Google, USA; Columbia University, USA) Short time-to-localize and time-to-fix for production bugs is extremely important for any 24x7 service-oriented application (SOA). Debugging buggy behavior in deployed applications is hard, as it requires careful reproduction of a similar environment and workload. Prior approaches for automatically reproducing production failures do not scale to large SOA systems. Our key insight is that for many failures in SOA systems (e.g., many semantic and performance bugs), a failure can automatically be reproduced solely by relaying network packets to replicas of suspect services, an insight that we validated through a manual study of 16 real bugs across five different systems. This paper presents Parikshan, an application monitoring framework that leverages user-space virtualization and network proxy technologies to provide a sandbox “debug” environment. In this “debug” environment, developers are free to attach debuggers and analysis tools without impacting performance or correctness of the production environment. In comparison to existing monitoring solutions that can slow down production applications, Parikshan allows application monitoring at significantly lower overhead. ![]() ![]() ![]() Michael Hilton, Jonathan Bell, and Darko Marinov (Carnegie Mellon University, USA; George Mason University, USA; University of Illinois at Urbana-Champaign, USA) Statement coverage is commonly used as a measure of test suite quality. Coverage is often used as a part of a code review process: if a patch decreases overall coverage, or is itself not covered, then the patch is scrutinized more closely. Traditional studies of how coverage changes with code evolution have examined the overall coverage of the entire program, and more recent work directly examines the coverage of patches (changed statements). We present an evaluation much larger than prior studies and moreover consider a new, important kind of change --- coverage changes of unchanged statements. We present a large-scale evaluation of code coverage evolution over 7,816 builds of 47 projects written in popular languages including Java, Python, and Scala. We find that in large, mature projects, simply measuring the change to statement coverage does not capture the nuances of code evolution. Going beyond considering statement coverage as a simple ratio, we examine how the set of statements covered evolves between project revisions. We present and study new ways to assess the impact of a patch on a project's test suite quality that both separates coverage of the patch from coverage of the non-patch, and separates changes in coverage from changes in the set of statements covered. ![]() ![]() |
|
Ben Fadhel, Ameni |
![]() Ameni Ben Fadhel, Domenico Bianculli, and Lionel C. Briand (University of Luxembourg, Luxembourg) A Role-based Access Control (RBAC) mechanism prevents unauthorized users to perform an operation, according to authorization policies which are defined on the user’s role within an enterprise. Several models have been proposed to specify complex RBAC policies. However, existing approaches for policy enforcement do not fully support all the types of policies that can be expressed in these models, which hinders their adoption among practitioners. In this paper we propose a model-driven enforcement framework for complex policies captured by GemRBAC+CTX, a comprehensive RBAC model proposed in the literature. We reduce the problem of making an access decision to checking whether a system state (from an RBAC point of view), expressed as an instance of the GemRBAC+CTX model, satisfies the constraints corresponding to the RBAC policies to be enforced at run time. We provide enforcement algorithms for various types of access requests and events, and a prototype tool (MORRO) implementing them. We also show how to integrate MORRO into an industrial Web application. The evaluation results show the applicability of our approach on a industrial system and its scalability with respect to the various parameters characterizing an AC configuration. ![]() ![]() |
|
Berger, Thorsten |
![]() Mukelabai Mukelabai, Damir Nešić, Salome Maro, Thorsten Berger, and Jan-Philipp Steghöfer (Chalmers University of Technology, Sweden; University of Gothenburg, Sweden; KTH, Sweden) Highly configurable systems are complex pieces of software. To tackle this complexity, hundreds of dedicated analysis techniques have been conceived, many of which able to analyze system properties for all possible system configurations, as opposed to traditional, single-system analyses. Unfortunately, it is largely unknown whether these techniques are adopted in practice, whether they address actual needs, or what strategies practitioners actually apply to analyze highly configurable systems. We present a study of analysis practices and needs in industry. It relied on a survey with 27 practitioners engineering highly configurable systems and follow-up interviews with 15 of them, covering 18 different companies from eight countries. We confirm that typical properties considered in the literature (e.g., reliability) are relevant, that consistency between variability models and artifacts is critical, but that the majority of analyses for specifications of configuration options (a.k.a., variability model analysis) is not perceived as needed. We identified rather pragmatic analysis strategies, including practices to avoid the need for analysis. For instance, testing with experience-based sampling is the most commonly applied strategy, while systematic sampling is rarely applicable. We discuss analyses that are missing and synthesize our insights into suggestions for future research. ![]() ![]() ![]() |
|
Beyer, Dirk |
![]() Dirk Beyer and Karlheinz Friedberger (LMU Munich, Germany) Recent development of software aims at massively parallel execution, because of the trend to increase the number of processing units per CPU socket. But many approaches for program analysis are not designed to benefit from a multi-threaded execution and lack support to utilize multi-core computers. Rewriting existing algorithms is difficult and error-prone, and the design of new parallel algorithms also has limitations. An orthogonal problem is the granularity: computing each successor state in parallel seems too fine-grained, so the open question is to find the right structural level for parallel execution. We propose an elegant solution to these problems: Block summaries should be computed in parallel. Many successful approaches to software verification are based on summaries of control-flow blocks, large blocks, or function bodies. Block-abstraction memoization is a successful domain-independent approach for summary-based program analysis. We redesigned the verification approach of block-abstraction memoization starting from its original recursive definition, such that it can run in a parallel manner for utilizing the available computation resources without losing its advantages of being independent from a certain abstract domain. We present an implementation of our new approach for multi-core shared-memory machines. The experimental evaluation shows that our summary-based approach has no significant overhead compared to the existing sequential approach and that it has a significant speedup when using multi-threading. ![]() ![]() ![]() ![]() ![]() Dirk Beyer and Thomas Lemberger (LMU Munich, Germany) We present CPA-SymExec, a tool for symbolic execution that is implemented in the open-source, configurable verification framework CPAchecker. Our implementation automatically detects which symbolic facts to track, in order to obtain a small set of constraints that are necessary to decide reachability of a program area of interest. CPA-SymExec is based on abstraction and counterexample-guided abstraction refinement (CEGAR), and uses a constraint-interpolation approach to detect symbolic facts. We show that our implementation can better mitigate the path-explosion problem than symbolic execution without abstraction, by comparing the performance to the state-of-the-art Klee-based symbolic-execution engine Symbiotic and to Klee itself. For the experiments we use two kinds of analysis tasks: one for finding an executable path to a specific location of interest (e.g., if a test vector is desired to show that a certain behavior occurs), and one for confirming that no executable path to a specific location exists (e.g., if it is desired to show that a certain behavior never occurs). CPA-SymExec is released under the Apache 2 license and available (inclusive source code) at https://cpachecker.sosy-lab.org. A demonstration video is available at https://youtu.be/qoBHtvPKtnw. ![]() ![]() ![]() ![]() ![]() ![]() |
|
Bianculli, Domenico |
![]() Ameni Ben Fadhel, Domenico Bianculli, and Lionel C. Briand (University of Luxembourg, Luxembourg) A Role-based Access Control (RBAC) mechanism prevents unauthorized users to perform an operation, according to authorization policies which are defined on the user’s role within an enterprise. Several models have been proposed to specify complex RBAC policies. However, existing approaches for policy enforcement do not fully support all the types of policies that can be expressed in these models, which hinders their adoption among practitioners. In this paper we propose a model-driven enforcement framework for complex policies captured by GemRBAC+CTX, a comprehensive RBAC model proposed in the literature. We reduce the problem of making an access decision to checking whether a system state (from an RBAC point of view), expressed as an instance of the GemRBAC+CTX model, satisfies the constraints corresponding to the RBAC policies to be enforced at run time. We provide enforcement algorithms for various types of access requests and events, and a prototype tool (MORRO) implementing them. We also show how to integrate MORRO into an industrial Web application. The evaluation results show the applicability of our approach on a industrial system and its scalability with respect to the various parameters characterizing an AC configuration. ![]() ![]() |
|
Blanc, Xavier |
![]() Sarra Habchi, Xavier Blanc, and Romain Rouvoy (Inria, France; University of Lille, France; University of Bordeaux, France) With millions of applications (apps) distributed through mobile markets, engaging and retaining end-users challenge Android developers to deliver a nearly perfect user experience. As mobile apps run in resource-limited devices, performance is a critical criterion for the quality of experience. Therefore, developers are expected to pay much attention to limit performance bad practices. On the one hand, many studies already identified such performance bad practices and showed that they can heavily impact app performance. Hence, many static analysers, a.k.a. linters, have been proposed to detect and fix these bad practices. On the other hand, other studies have shown that Android developers tend to deal with performance reactively and they rarely build on linters to detect and fix performance bad practices. In this paper, we therefore perform a qualitative study to investigate this gap between research and development community. In particular, we performed interviews with 14 experienced Android developers to identify the perceived benefits and constraints of using linters to identify performance bad practices in Android apps. Our observations can have a direct impact on developers and the research community. Specifically, we describe why and how developers leverage static source code analysers to improve the performance of their apps. On top of that, we bring to light important challenges faced by developers when it comes to adopting static analysis for performance purposes. ![]() ![]() ![]() ![]() |
|
Borba, Paulo |
![]() Paola Accioly, Paulo Borba, and Guilherme Cavalcanti (Federal University of Pernambuco, Brazil) In a collaborative development environment, tasks are commonly assigned to developers working independent from each other. As a result, when trying to integrate these contributions, one might have to deal with conflicting changes. Such conflicts might be detected when merging contributions (merge conflicts), when building the system (build conflicts), or when running tests (semantic conflicts). Regarding such conflicts, previous studies show that they occur frequently, and impair developers’ productivity, as understanding and solving them is a demanding and tedious task that might introduce defects. However, despite the existing evidence in the literature, the structure of changes that lead to conflicts has not been studied yet. Understanding the underlying structure of conflicts, and the involved syntactic language elements, might shed light on how to better avoid them. For example, awareness tools that inform users about ongoing parallel changes such as Syde and Palantír can benefit from knowing the most common conflict patterns to become more efficient. With that aim, in this paper we focus on understanding the underlying structure of merge conflicts. ![]() ![]() |
|
Borges Jr., Nataniel P. |
![]() Nataniel P. Borges Jr., Jenny Hotzkow, and Andreas Zeller (Saarland University, Germany) Android applications (apps) represent an ever increasing portion of the software market. Automated test input generators are the state of the art for testing and security analysis. We introduce DroidMate-2 (DM-2), a platform to easily assist both developers and researchers to customize, develop and test new test generators. DM-2 can be used without app instrumentation or operating system modifications, as a test generator on real devices and emulators for app testing or regression testing. Additionally, it provides sensitive resource monitoring or blocking capabilities through a lightweight app instrumentation, out-of-thebox statement coverage measurement through a fully-fledged app instrumentation and native experiment reproducibility. In our experiments we compared DM-2 against DroidBot, a state-of-the-art test generator by measuring statement coverage. Our results show that DM-2 reached 96% of its peak coverage in less than 2/3 of the time needed by DroidBot, allowing for better and more efficient tests. On short runs (5 minutes) DM-2 outperformed DroidBot by 7% while in longer runs (1 hour) this difference increases to 8%. ACM DL Artifact: https://www.doi.org/10.1145/3264864 For the details see: https://github.com/uds-se/droidmate/wiki/ASE-2018:-Data ![]() ![]() ![]() ![]() ![]() |
|
Briand, Lionel C. |
![]() Raja Ben Abdessalem, Annibale Panichella, Shiva Nejati, Lionel C. Briand, and Thomas Stifter (University of Luxembourg, Luxembourg; Delft University of Technology, Netherlands; IEE, Luxembourg) Complex systems such as autonomous cars are typically built as a composition of features that are independent units of functionality. Features tend to interact and impact one another's behavior in unknown ways. A challenge is to detect and manage feature interactions, in particular, those that violate system requirements, hence leading to failures. In this paper, we propose a technique to detect feature interaction failures by casting this problem into a search-based test generation problem. We define a set of hybrid test objectives (distance functions) that combine traditional coverage-based heuristics with new heuristics specifically aimed at revealing feature interaction failures. We develop a new search-based test generation algorithm, called FITEST, that is guided by our hybrid test objectives. FITEST extends recently proposed many-objective evolutionary algorithms to reduce the time required to compute fitness values. We evaluate our approach using two versions of an industrial self-driving system. Our results show that our hybrid test objectives are able to identify more than twice as many feature interaction failures as two baseline test objectives used in the software testing literature (i.e., coverage-based and failure-based test objectives). Further, the feedback from domain experts indicates that the detected feature interaction failures represent real faults in their systems that were not previously identified based on analysis of the system features and their requirements. ![]() ![]() ![]() Ameni Ben Fadhel, Domenico Bianculli, and Lionel C. Briand (University of Luxembourg, Luxembourg) A Role-based Access Control (RBAC) mechanism prevents unauthorized users to perform an operation, according to authorization policies which are defined on the user’s role within an enterprise. Several models have been proposed to specify complex RBAC policies. However, existing approaches for policy enforcement do not fully support all the types of policies that can be expressed in these models, which hinders their adoption among practitioners. In this paper we propose a model-driven enforcement framework for complex policies captured by GemRBAC+CTX, a comprehensive RBAC model proposed in the literature. We reduce the problem of making an access decision to checking whether a system state (from an RBAC point of view), expressed as an instance of the GemRBAC+CTX model, satisfies the constraints corresponding to the RBAC policies to be enforced at run time. We provide enforcement algorithms for various types of access requests and events, and a prototype tool (MORRO) implementing them. We also show how to integrate MORRO into an industrial Web application. The evaluation results show the applicability of our approach on a industrial system and its scalability with respect to the various parameters characterizing an AC configuration. ![]() ![]() |
|
Brunel, Julien |
![]() Julien Brunel, David Chemouil, Alcino Cunha, and Nuno Macedo (ONERA, France; University of Toulouse, France; INESC TEC, Portugal; University of Minho, Portugal) This paper presents the Electrum Analyzer, a free-software tool to validate and perform model checking of Electrum specifications. Electrum is an extension of Alloy that enriches its relational logic with LTL operators, thus simplifying the specification of dynamic systems. The Analyzer supports both automatic bounded model checking, with an encoding into SAT, and unbounded model checking, with an encoding into SMV. Instance, or counter-example, traces are presented back to the user in a unified visualizer. Features to speed up model checking are offered, including a decomposed parallel solving strategy and the extraction of symbolic bounds. ![]() ![]() ![]() ![]() |
|
Bugariu, Alexandra |
![]() Alexandra Bugariu, Valentin Wüstholz, Maria Christakis, and Peter Müller (ETH Zurich, Switzerland; MPI-SWS, Germany) Static program analyses are routinely applied as the basis of code optimizations and to detect safety and security issues in software systems. For their results to be reliable, static analyses should be sound (i.e., should not produce false negatives) and precise (i.e., should report a low number of false positives). Even though it is possible to prove properties of the design of a static analysis, ensuring soundness and precision for its implementation is challenging. Complex algorithms and sophisticated optimizations make static analyzers difficult to implement and test. In this paper, we present an automatic technique to test, among other properties, the soundness and precision of abstract domains, the core of all static analyzers based on abstract interpretation. In order to cover a wide range of test data and input states, we construct inputs by applying sequences of abstract-domain operations to representative domain elements, and vary the operations through gray-box fuzzing. We use mathematical properties of abstract domains as test oracles. Our experimental evaluation demonstrates the effectiveness of our approach. We detected several previously unknown soundness and precision errors in widely-used abstract domains. Our experiments also show that our approach is more effective than dynamic symbolic execution and than fuzzing the test inputs directly. ![]() ![]() |
|
Cadar, Cristian |
![]() Oscar Soria Dustmann, Klaus Wehrle, and Cristian Cadar (RWTH Aachen University, Germany; Imperial College London, UK) Symbolic execution is an effective program analysis technique whose scalability largely depends on the ability to quickly solve large numbers of first-order logic queries. We propose an effective general technique for speeding up the solving of queries in the theory of arrays and bit-vectors with a specific structure, while otherwise falling back to a complete solver. The technique has two stages: a learning stage that determines the solution sets of each symbolic variable, and a decision stage that uses this information to quickly determine the satisfiability of certain types of queries. The main challenges involve deciding which operators to support and precisely dealing with integer type casts and arithmetic underflow and overflow. We implemented this technique in an incomplete solver called PARTI (``PARtial Theory solver for Intervals''), directly integrating it into the popular KLEE symbolic execution engine. We applied KLEE with PARTI and a state-of-the-art SMT solver to synthetic and real-world benchmarks. We found that PARTI practically does not hurt performance while many times achieving order-of-magnitude speedups. ![]() ![]() |
|
Cai, Yuanfang |
![]() Ran Mo, Will Snipes, Yuanfang Cai, Srini Ramaswamy, Rick Kazman, and Martin Naedele (Central China Normal University, China; ABB Corporate Research, USA; Drexel University, USA; ABB, USA; Carnegie Mellon University, USA; University of Hawaii, USA; ABB, Switzerland) In this paper, we report our experiences of applying three complementary automated software architecture analysis techniques, supported by a tool suite, called DV8, to 8 industrial projects within a large company. DV8 includes two state-of-the-art architecture-level maintainability metrics—Decoupling Level and Propagation Cost, an architecture flaw detection tool, and an architecture root detection tool. We collected development process data from the project teams as input to these tools, reported the results back to the practitioners, and followed up with telephone conferences and interviews. Our experiences revealed that the metrics scores, quantitative debt analysis, and architecture flaw visualization can effectively bridge the gap between management and development, help them decide if, when, and where to refactor. In particular, the metrics scores, compared against industrial benchmarks, faithfully reflected the practitioners’ intuitions about the maintainability of their projects, and enabled them to better understand the maintainability relative to other projects internal to their company, and to other industrial products. The automatically detected architecture flaws and roots enabled the practitioners to precisely pinpoint, visualize, and quantify the “hotspots" within the systems that are responsible for high maintenance costs. Except for the two smallest projects for which both architecture metrics indicated high maintainability, all other projects are planning or have already begun refactorings to address the problems detected by our analyses. We are working on further automating the tool chain, and transforming the analysis suite into deployable services accessible by all projects within the company. ![]() ![]() |
|
Canfora, Gerardo |
![]() Davide Falessi, Massimiliano Di Penta, Gerardo Canfora, and Giovanni Cantone (California Polytechnic State University, USA; University of Sannio, Italy; University of Rome Tor Vergata, Italy) Although very important in software engineering, establishing traceability links between software artifacts is extremely tedious, error-prone, and it requires significant effort. Even when approaches for automated traceability recovery exist, these provide the requirements analyst with a, usually very long, ranked list of candidate links that needs to be manually inspected. In this paper we introduce an approach called Estimation of the Number of Remaining Links (ENRL) which aims at estimating, via Machine Learning (ML) classifiers, the number of remaining positive links in a ranked list of candidate traceability links produced by a Natural Language Processing techniques-based recovery approach. We have evaluated the accuracy of the ENRL approach by considering several ML classifiers and NLP techniques on three datasets from industry and academia, and concerning traceability links among different kinds of software artifacts including requirements, use cases, design documents, source code, and test cases. Results from our study indicate that: (i) specific estimation models are able to provide accurate estimates of the number of remaining positive links; (ii) the estimation accuracy depends on the choice of the NLP technique, and (iii) univariate estimation models outperform multivariate ones. ![]() ![]() |
|
Cantone, Giovanni |
![]() Davide Falessi, Massimiliano Di Penta, Gerardo Canfora, and Giovanni Cantone (California Polytechnic State University, USA; University of Sannio, Italy; University of Rome Tor Vergata, Italy) Although very important in software engineering, establishing traceability links between software artifacts is extremely tedious, error-prone, and it requires significant effort. Even when approaches for automated traceability recovery exist, these provide the requirements analyst with a, usually very long, ranked list of candidate links that needs to be manually inspected. In this paper we introduce an approach called Estimation of the Number of Remaining Links (ENRL) which aims at estimating, via Machine Learning (ML) classifiers, the number of remaining positive links in a ranked list of candidate traceability links produced by a Natural Language Processing techniques-based recovery approach. We have evaluated the accuracy of the ENRL approach by considering several ML classifiers and NLP techniques on three datasets from industry and academia, and concerning traceability links among different kinds of software artifacts including requirements, use cases, design documents, source code, and test cases. Results from our study indicate that: (i) specific estimation models are able to provide accurate estimates of the number of remaining positive links; (ii) the estimation accuracy depends on the choice of the NLP technique, and (iii) univariate estimation models outperform multivariate ones. ![]() ![]() |
|
Cao, Jialun |
![]() Jialun Cao, Yongjian Li, and Jun Pang (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; University of Luxembourg, Luxembourg) This demo introduces L-CMP, an automatic learning-based parameterized verification tool. It can verify parameterized protocols by combining machine learning and model checking techniques. Given a parameterized protocol, L-CMP learns a set of auxiliary invariants and implements verification of the protocol using the invariants automatically. In particular, the learned auxiliary invariants are straightforward and readable. The experimental results show that L-CMP can successfully verify a number of cache coherence protocols, including the industrial-scale FLASH protocol. The video is available at https://youtu.be/6Dl2HiiiS4E, and L-CMPL-CMP can be downloaded at https://github.com/ ArabelaTso/Learning-Based-ParaVerifer. ![]() ![]() |
|
Cao, Yurui |
![]() Wenyu Wang, Dengfeng Li, Wei Yang, Yurui Cao, Zhenwen Zhang, Yuetang Deng, and Tao Xie (University of Illinois at Urbana-Champaign, USA; University of Texas at Dallas, USA; Tencent, China) User Interface (UI) testing is a popular approach to ensure the quality of mobile apps. Numerous test generation tools have been developed to support UI testing on mobile apps, especially for Android apps. Previous work evaluates and compares different test generation tools using only relatively simple open-source apps, while real-world industrial apps tend to have more complex functionalities and implementations. There is no direct comparison among test generation tools with regard to effectiveness and ease-of-use on these industrial apps. To address such limitation, we study existing state-of-the-art or state-of-the-practice test generation tools on 68 widely-used industrial apps. We directly compare the tools with regard to code coverage and fault-detection ability. According to our results, Monkey, a state-of-the-practice tool from Google, achieves the highest method coverage on 22 of 41 apps whose method coverage data can be obtained. Of all 68 apps under study, Monkey also achieves the highest activity coverage on 35 apps, while Stoat, a state-of-the-art tool, is able to trigger the highest number of unique crashes on 23 apps. By analyzing the experimental results, we provide suggestions for combining different test generation tools to achieve better performance. We also report our experience in applying these tools to industrial apps under study. Our study results give insights on how Android UI test generation tools could be improved to better handle complex industrial apps. ![]() ![]() |
|
Cashman, Mikaela |
![]() Mikaela Cashman, Myra B. Cohen, Priya Ranjan, and Robert W. Cottingham (University of Nebraska-Lincoln, USA; Oak Ridge National Laboratory, USA) The bioinformatics software domain contains thousands of applications for automating tasks such as the pairwise alignment of DNA sequences, building and reasoning about metabolic models or simulating growth of an organism. Its end users range from sophisticated developers to those with little computational experience. In response to their needs, developers provide many options to customize the way their algorithms are tuned. Yet there is little or no automated help for the user in determining the consequences or impact of the options they choose. In this paper we describe our experience working with configurable bioinformatics tools. We find limited documentation and help for combining and selecting options along with variation in both functionality and performance. We also find previously undetected faults. We summarize our findings with a set of lessons learned, and present a roadmap for creating automated techniques to interact with bioinformatics software. We believe these will generalize to other types of scientific software. ![]() ![]() ![]() |
|
Cavalcanti, Guilherme |
![]() Paola Accioly, Paulo Borba, and Guilherme Cavalcanti (Federal University of Pernambuco, Brazil) In a collaborative development environment, tasks are commonly assigned to developers working independent from each other. As a result, when trying to integrate these contributions, one might have to deal with conflicting changes. Such conflicts might be detected when merging contributions (merge conflicts), when building the system (build conflicts), or when running tests (semantic conflicts). Regarding such conflicts, previous studies show that they occur frequently, and impair developers’ productivity, as understanding and solving them is a demanding and tedious task that might introduce defects. However, despite the existing evidence in the literature, the structure of changes that lead to conflicts has not been studied yet. Understanding the underlying structure of conflicts, and the involved syntactic language elements, might shed light on how to better avoid them. For example, awareness tools that inform users about ongoing parallel changes such as Syde and Palantír can benefit from knowing the most common conflict patterns to become more efficient. With that aim, in this paper we focus on understanding the underlying structure of merge conflicts. ![]() ![]() |
|
Cha, Sooyoung |
![]() Sooyoung Cha, Seonho Lee, and Hakjoo Oh (Korea University, South Korea) We present template-guided concolic testing, a new technique for effectively reducing the search space in concolic testing. Addressing the path-explosion problem has been a significant challenge in concolic testing. Diverse search heuristics have been proposed to mitigate this problem but using search heuristics alone is not sufficient to substantially improve code coverage for real-world programs. The goal of this paper is to complement existing techniques and achieve higher coverage by exploiting templates in concolic testing. In our approach, a template is a partially symbolized input vector whose job is to reduce the search space. However, choosing a right set of templates is nontrivial and significantly affects the final performance of our approach. We present an algorithm that automatically learns useful templates online, based on data collected from previous runs of concolic testing. The experimental results with open-source programs show that our technique achieves greater branch coverage and finds bugs more effectively than conventional concolic testing. ![]() ![]() |
|
Chan, W. K. |
![]() Bo Jiang, Ye Liu, and W. K. Chan (Beihang University, China; City University of Hong Kong, China) Decentralized cryptocurrencies feature the use of blockchain to transfer values among peers on networks without central agency. Smart contracts are programs running on top of the blockchain consensus protocol to enable people make agreements while minimizing trusts. Millions of smart contracts have been deployed in various decentralized applications. The security vulnerabilities within those smart contracts pose significant threats to their applications. Indeed, many critical security vulnerabilities within smart contracts on Ethereum platform have caused huge financial losses to their users. In this work, we present ContractFuzzer, a novel fuzzer to test Ethereum smart contracts for security vulnerabilities. ContractFuzzer generates fuzzing inputs based on the ABI specifications of smart contracts, defines test oracles to detect security vulnerabilities, instruments the EVM to log smart contracts runtime behaviors, and analyzes these logs to report security vulnerabilities. Our fuzzing of 6991 smart contracts has flagged more than 459 vulnerabilities with high precision. In particular, our fuzzing tool successfully detects the vulnerability of the DAO contract that leads to USD 60 million loss and the vulnerabilities of Parity Wallet that have led to the loss of USD 30 million and the freezing of USD 150 million worth of Ether. ![]() ![]() ![]() ![]() ![]() |
|
Chang, Bor-Yuh Evan |
![]() Benno Stein, Lazaro Clapp, Manu Sridharan, and Bor-Yuh Evan Chang (University of Colorado Boulder, USA; Uber Technologies, USA) In stream-based programming, data sources are abstracted as a stream of values that can be manipulated via callback functions. Stream-based programming is exploding in popularity, as it provides a powerful and expressive paradigm for handling asynchronous data sources in interactive software. However, high-level stream abstractions can also make it difficult for developers to reason about control- and data-flow relationships in their programs. This is particularly impactful when asynchronous stream-based code interacts with thread-limited features such as UI frameworks that restrict UI access to a single thread, since the threading behavior of streaming constructs is often non-intuitive and insufficiently documented. In this paper, we present a type-based approach that can statically prove the thread-safety of UI accesses in stream-based software. Our key insight is that the fluent APIs of stream-processing frameworks enable the tracking of threads via type-refinement, making it possible to reason automatically about what thread a piece of code runs on -- a difficult problem in general. We implement the system as an annotation-based Java typechecker for Android programs built upon the popular ReactiveX framework and evaluate its efficacy by annotating and analyzing 8 open-source apps, where we find 33 instances of unsafe UI access while incurring an annotation burden of only one annotation per 186 source lines of code. We also report on our experience applying the typechecker to two much larger apps from the Uber Technologies, Inc. codebase, where it currently runs on every code change and blocks changes that introduce potential threading bugs. ![]() ![]() |
|
Chattopadhyay, Sudipta |
![]() Sakshi Udeshi, Pryanshu Arora, and Sudipta Chattopadhyay (Singapore University of Technology and Design, Singapore; BITS Pilani, India) Fairness is a critical trait in decision making. As machine-learning models are increasingly being used in sensitive application domains (e.g. education and employment) for decision making, it is crucial that the decisions computed by such models are free of unintended bias. But how can we automatically validate the fairness of arbitrary machine-learning models? For a given machine-learning model and a set of sensitive input parameters, our Aeqitas approach automatically discovers discriminatory inputs that highlight fairness violation. At the core of Aeqitas are three novel strategies to employ probabilistic search over the input space with the objective of uncovering fairness violation. Our Aeqitas approach leverages inherent robustness property in common machine-learning models to design and implement scalable test generation methodologies. An appealing feature of our generated test inputs is that they can be systematically added to the training set of the underlying model and improve its fairness. To this end, we design a fully automated module that guarantees to improve the fairness of the model. We implemented Aeqitas and we have evaluated it on six stateof- the-art classifiers. Our subjects also include a classifier that was designed with fairness in mind. We show that Aeqitas effectively generates inputs to uncover fairness violation in all the subject classifiers and systematically improves the fairness of respective models using the generated test inputs. In our evaluation, Aeqitas generates up to 70% discriminatory inputs (w.r.t. the total number of inputs generated) and leverages these inputs to improve the fairness up to 94%. ![]() ![]() |
|
Chechik, Marsha |
![]() Federico Mora, Yi Li, Julia Rubin, and Marsha Chechik (University of Toronto, Canada; University of British Columbia, Canada) Software is often built by integrating components created by different teams or even different organizations. With little understanding of changes in dependent components, it is challenging to maintain correctness and robustness of the entire system. In this paper, we investigate the effect of component changes on the behavior of their clients. We observe that changes in a component are often irrelevant to a particular client and thus can be adopted without any delays or negative effects. Following this observation, we formulate the notion of client-specific equivalence checking (CSE) and develop an automated technique optimized for checking such equivalence. We evaluate our technique on a set of benchmarks, including those from the existing literature on equivalence checking, and show its applicability and effectiveness. ![]() ![]() |
|
Chemouil, David |
![]() Julien Brunel, David Chemouil, Alcino Cunha, and Nuno Macedo (ONERA, France; University of Toulouse, France; INESC TEC, Portugal; University of Minho, Portugal) This paper presents the Electrum Analyzer, a free-software tool to validate and perform model checking of Electrum specifications. Electrum is an extension of Alloy that enriches its relational logic with LTL operators, thus simplifying the specification of dynamic systems. The Analyzer supports both automatic bounded model checking, with an encoding into SAT, and unbounded model checking, with an encoding into SMV. Instance, or counter-example, traces are presented back to the user in a unified visualizer. Features to speed up model checking are offered, including a decomposed parallel solving strategy and the extraction of symbolic bounds. ![]() ![]() ![]() ![]() |
|
Chen, Bihuan |
![]() Kaifeng Huang, Bihuan Chen, Xin Peng, Daihong Zhou, Ying Wang, Yang Liu, and Wenyun Zhao (Fudan University, China; Nanyang Technological University, Singapore) Analyzing and understanding source code changes is important in a variety of software maintenance tasks. To this end, many code differencing and code change summarization methods have been proposed. For some tasks (e.g. code review and software merging), however, those differencing methods generate too fine-grained a representation of code changes, and those summarization methods generate too coarse-grained a representation of code changes. Moreover, they do not consider the relationships among code changes. Therefore, the generated differences or summaries make it not easy to analyze and understand code changes in some software maintenance tasks. In this paper, we propose a code differencing approach, named CLDIFF, to generate concise linked code differences whose granularity is in between the existing code differencing and code change summarization methods. The goal of CLDIFF is to generate more easily understandable code differences. CLDIFF takes source code files before and after changes as inputs, and consists of three steps. First, it pre-processes the source code files by pruning unchanged declara- tions from the parsed abstract syntax trees. Second, it generates concise code differences by grouping fine-grained code differences at or above the statement level and describing high-level changes in each group. Third, it links the related concise code differences according to five pre-defined links. Experiments with 12 Java projects (74,387 commits) and a human study with 10 participants have indicated the accuracy, conciseness, performance and usefulness of CLDIFF. ![]() ![]() |
|
Chen, Boyuan |
![]() Boyuan Chen, Jian Song, Peng Xu, Xing Hu, and Zhen Ming (Jack) Jiang (York University, Canada; Baidu, China) Software testing is a widely used technique to ensure the quality of software systems. Code coverage measures are commonly used to evaluate and improve the existing test suites. Based on our industrial and open source studies, existing state-of-the-art code coverage tools are only used during unit and integration testing due to issues like engineering challenges, performance overhead, and incomplete results. To resolve these issues, in this paper we have proposed an automated approach, called LogCoCo, to estimating code coverage measures using the readily available execution logs. Using program analysis techniques, LogCoCo matches the execution logs with their corresponding code paths and estimates three different code coverage criteria: method coverage, statement coverage, and branch coverage. Case studies on one open source system (HBase) and five commercial systems from Baidu and systems show that: (1) the results of LogCoCo are highly accurate (>96% in seven out of nine experiments) under a variety of testing activities (unit testing, integration testing, and benchmarking); and (2) the results of LogCoCo can be used to evaluate and improve the existing test suites. Our collaborators at Baidu are currently considering adopting LogCoCo and use it on a daily basis. ![]() ![]() |
|
Chen, Chunyang |
![]() Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang (Harbin Institute of Technology, China; Nanyang Technological University, Singapore; Carnegie Mellon University, USA; Kyushu University, Japan; University of Illinois at Urbana-Champaign, USA; Monash University, Australia) Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems. ![]() ![]() ![]() Yi Huang, Chunyang Chen, Zhenchang Xing, Tian Lin, and Yang Liu (Australian National University, Australia; Monash University, Australia; Nanyang Technological University, Singapore) Developers can use different technologies for many software development tasks in their work. However, when faced with several technologies with comparable functionalities, it is not easy for developers to select the most appropriate one, as comparisons among technologies are time-consuming by trial and error. Instead, developers can resort to expert articles, read official documents or ask questions in QA sites for technology comparison, but it is opportunistic to get a comprehensive comparison as online information is often fragmented or contradictory. To overcome these limitations, we propose the diffTech system that exploits the crowdsourced discussions from Stack Overflow, and assists technology comparison with an informative summary of different comparison aspects. We first build a large database of comparable technologies in software engineering by mining tags in Stack Overflow, and then locate comparative sentences about comparable technologies with natural language processing methods. We further mine prominent comparison aspects by clustering similar comparative sentences and representing each cluster with its keywords. The evaluation demonstrates both the accuracy and usefulness of our model and we implement our approach into a practical website for public use. ![]() ![]() ![]() |
|
Chen, Jianhui |
![]() Jianhui Chen and Fei He (Tsinghua University, China) Satisfiability modulo theories (SMT) solvers have been widely applied as the reasoning engine for diverse software analysis and verification technologies. The efficiency of the SMT solver has significant effects on the performance of these technologies. However, the current SMT solvers are designed for the general purpose of constraint solving. Many useful knowledge of programs cannot be utilized during the SMT solving. As a result, the SMT solver may spend a lot of effort to explore redundant search space. In this paper, we propose a novel approach for utilizing control-flow knowledge in SMT solving. With this technique, the search space can be considerably reduced and the efficiency of SMT solving is observably improved. We conducted extensive experiments on credible benchmarks, the results show orders of magnitude improvements of our approach. ![]() ![]() |
|
Chen, Qingying |
![]() Qingying Chen and Minghui Zhou (Peking University, China) Code retrieval and summarization are two tasks often employed by software developers to reuse code that spreads over online repositories. In this paper, we present a neural framework that allows bidirectional mapping between source code and natural language to improve these two tasks. Our framework, BVAE, is designed to have two Variational AutoEncoders (VAEs) to model bimodal data: C-VAE for source code and L-VAE for natural language. Both VAEs are trained jointly to reconstruct their input as much as possible with regularization that captures the closeness between the latent variables of code and description. BVAE could learn semantic vector representations for both code and description and generate completely new descriptions for arbitrary code snippets. We design two instance models of BVAE for retrieval and summarization tasks respectively and evaluate their performance on a benchmark which involves two programming languages: C# and SQL. Experiments demonstrate BVAE’s potential on the two tasks. ![]() ![]() |
|
Chen, Sen |
![]() Lingling Fan, Ting Su, Sen Chen, Guozhu Meng, Yang Liu, Lihua Xu, and Geguang Pu (East China Normal University, China; Nanyang Technological University, Singapore; Chinese Academy of Sciences, China; New York University Shanghai, China) Android, the #1 mobile app framework, enforces the single-GUI-thread model, in which a single UI thread manages GUI rendering and event dispatching. Due to this model, it is vital to avoid blocking the UI thread for responsiveness. One common practice is to offload long-running tasks into async threads. To achieve this, Android provides various async programming constructs, and leaves evelopers themselves to obey the rules implied by the model. However, as our study reveals, more than 25% apps violate these rules and introduce hard-to-detect, fail-stop errors, which we term as aysnc programming errors (APEs). To this end, this paper introduces APEChecker, a technique to automatically and efficiently manifest APEs. The key idea is to characterize APEs as specific fault patterns, and synergistically combine static analysis and dynamic UI exploration to detect and verify such errors. Among the 40 real-world Android apps, APEChecker unveils and processes 61 APEs, of which 51 are confirmed (83.6% hit rate). Specifically, APEChecker detects 3X more APEs than the state-of-art testing tools (Monkey, Sapienz and Stoat), and reduces testing time from half an hour to a few minutes. On a specific type of APEs, APEChecker confirms 5X more errors than the data race detection tool, EventRacer, with very few false alarms. ![]() ![]() |
|
Chen, Shih-Yu |
![]() Hung-En Wang, Shih-Yu Chen, Fang Yu, and Jie-Hong R. Jiang (National Taiwan University, Taiwan; National Chengchi University, Taiwan) Strings with length constraints are prominent in software security analysis. Recent endeavors have made significant progress in developing constraint solvers for strings and integers. Most prior methods are based on deduction with inference rules or analysis using automata. The former may be inefficient when the constraints involve complex string manipulations such as language replacement; the latter may not be easily extended to handle length constraints and may be inadequate for counterexample generation due to approximation. Inspired by recent work on string analysis with logic circuit representation, we propose a new method for solving string with length constraints by an implicit representation of automata with length encoding. The length-encoded automata are of infinite states and can represent languages beyond regular expressions. By converting string and length constraints into a dependency graph of manipulations over length-encoded automata, a symbolic model checker for infinite state systems can be leveraged as an engine for the analysis of string and length constraints. Experiments show that our method has its unique capability of handling complex string and length constraints not solvable by existing methods. ![]() ![]() ![]() |
|
Chen, Zhuangbin |
![]() Pinjia He, Zhuangbin Chen, Shilin He, and Michael R. Lyu (Chinese University of Hong Kong, China) Logging is a common programming practice of great importance in modern software development, because software logs have been widely used in various software maintenance tasks. To provide high-quality logs, developers need to design the description text in logging statements carefully. Inappropriate descriptions will slow down or even mislead the maintenance process, such as postmortem analysis. However, there is currently a lack of rigorous guide and specifications on developer logging behaviors, which makes the construction of description text in logging statements a challenging problem. To fill this significant gap, in this paper, we systematically study what developers log, with focus on the usage of natural language descriptions in logging statements. We obtain 6 valuable findings by conducting source code analysis on 10 Java projects and 7 C# projects, which contain 28,532,975 LOC and 115,159 logging statements in total. Furthermore, our study demonstrates the potential of automated description text generation for logging statements by obtaining up to 49.04 BLEU-4 score and 62.1 ROUGE-L score using a simple information retrieval method. To facilitate future research in this field, the datasets have been publicly released. ![]() ![]() |
|
Cheung, Shing-Chi |
![]() Huaxun Huang, Lili Wei, Yepang Liu, and Shing-Chi Cheung (Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) The control flows of Android apps are largely driven by the protocols that govern how callback APIs are invoked in response to various events. When these callback APIs evolve along with the Android framework, the changes in their invocation protocols can induce unexpected control flows to existing Android apps, causing various compatibility issues. We refer to these issues as callback compatibility issues. While Android framework updates have received due attention, little is known about their impacts on app control flows and the callback compatibility issues thus induced. To bridge the gap, we examined Android documentations and conducted an empirical study on 100 real-world callback compatibility issues to investigate how these issues were induced by callback API evolutions. Based on our empirical findings, we propose a graph-based model to capture the control flow inconsistencies caused by API evolutions and devise a static analysis technique, Cider, to detect callback compatibility issues. Our evaluation of Cider on 20 popular open-source Android apps shows that Cider is effective. It detected 13 new callback compatibility issues in these apps, among which 12 issues were confirmed and 9 issues were fixed. ![]() ![]() ![]() Jiajun Hu, Lili Wei, Yepang Liu, Shing-Chi Cheung, and Huaxun Huang (Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) WebView is a widely used Android component that augments a native app with web browser capabilities. It eases the interactions between an app’s native code and web code. However, the interaction mechanism of WebView induces new types of bugs in Android apps. Understanding the characteristics and manifestation of these WebView-induced bugs (ωBugs for short) facilitates the correct usages of WebViews in Android apps. This motivates us to conduct the first empirical study on ωBugs based on those found in popular open-source Android apps. Our study identified the major root causes and consequences of ωBugs and made interesting observations that can be leveraged for detecting and diagnosing ωBugs. Based on the empirical study, we further propose an automated testing technique ωDroid to effectively expose ωBugs in Android apps. In our experiments, ωDroid successfully discovered 30 unique and previously-unknown ωBugs when applied to 146 open-source Android apps. We reported the 30 ωBugs to the corresponding app developers. Out of these 30 ωBugs, 14 were confirmed and 7 of them were fixed. This shows that ωDroid can effectively detect ωBugs that are of the developers’ concern. ![]() ![]() |
|
Chi, Jianlei |
![]() Yu Qu, Ting Liu, Jianlei Chi, Yangxu Jin, Di Cui, Ancheng He, and Qinghua Zheng (Xi'an Jiaotong University, China) Network measures have been proved to be useful in predicting software defects. Leveraging the dependency relationships between software modules, network measures can capture various structural features of software systems. However, existing studies have relied on user-defined network measures (e.g., degree statistics or centrality metrics), which are inflexible and require high computation cost, to describe the structural features. In this paper, we propose a new method called node2defect which uses a newly proposed network embedding technique, node2vec, to automatically learn to encode dependency network structure into low-dimensional vector spaces to improve software defect prediction. Specifically, we firstly construct a program's Class Dependency Network. Then node2vec is used to automatically learn structural features of the network. After that, we combine the learned features with traditional software engineering features, for accurate defect prediction. We evaluate our method on 15 open source programs. The experimental results show that in average, node2defect improves the state-of-the-art approach by 9.15% in terms of F-measure. ![]() ![]() |
|
Christakis, Maria |
![]() Alexandra Bugariu, Valentin Wüstholz, Maria Christakis, and Peter Müller (ETH Zurich, Switzerland; MPI-SWS, Germany) Static program analyses are routinely applied as the basis of code optimizations and to detect safety and security issues in software systems. For their results to be reliable, static analyses should be sound (i.e., should not produce false negatives) and precise (i.e., should report a low number of false positives). Even though it is possible to prove properties of the design of a static analysis, ensuring soundness and precision for its implementation is challenging. Complex algorithms and sophisticated optimizations make static analyzers difficult to implement and test. In this paper, we present an automatic technique to test, among other properties, the soundness and precision of abstract domains, the core of all static analyzers based on abstract interpretation. In order to cover a wide range of test data and input states, we construct inputs by applying sequences of abstract-domain operations to representative domain elements, and vary the operations through gray-box fuzzing. We use mathematical properties of abstract domains as test oracles. Our experimental evaluation demonstrates the effectiveness of our approach. We detected several previously unknown soundness and precision errors in widely-used abstract domains. Our experiments also show that our approach is more effective than dynamic symbolic execution and than fuzzing the test inputs directly. ![]() ![]() |
|
Clapp, Lazaro |
![]() Benno Stein, Lazaro Clapp, Manu Sridharan, and Bor-Yuh Evan Chang (University of Colorado Boulder, USA; Uber Technologies, USA) In stream-based programming, data sources are abstracted as a stream of values that can be manipulated via callback functions. Stream-based programming is exploding in popularity, as it provides a powerful and expressive paradigm for handling asynchronous data sources in interactive software. However, high-level stream abstractions can also make it difficult for developers to reason about control- and data-flow relationships in their programs. This is particularly impactful when asynchronous stream-based code interacts with thread-limited features such as UI frameworks that restrict UI access to a single thread, since the threading behavior of streaming constructs is often non-intuitive and insufficiently documented. In this paper, we present a type-based approach that can statically prove the thread-safety of UI accesses in stream-based software. Our key insight is that the fluent APIs of stream-processing frameworks enable the tracking of threads via type-refinement, making it possible to reason automatically about what thread a piece of code runs on -- a difficult problem in general. We implement the system as an annotation-based Java typechecker for Android programs built upon the popular ReactiveX framework and evaluate its efficacy by annotating and analyzing 8 open-source apps, where we find 33 instances of unsafe UI access while incurring an annotation burden of only one annotation per 186 source lines of code. We also report on our experience applying the typechecker to two much larger apps from the Uber Technologies, Inc. codebase, where it currently runs on every code change and blocks changes that introduce potential threading bugs. ![]() ![]() |
|
Cleland-Huang, Jane |
![]() Jane Cleland-Huang (University of Notre Dame, USA) Requirements Engineering includes various activities aimed at discovering, analyzing, validating, evolving, and managing software and systems requirements. Many of these activities are human facing, effort intensive, and sometimes error prone. They could benefit greatly from cutting edge advances in automation. However, the software engineering community has primarily focused on automating other aspects of the development process such as testing, code analytics, and mining software respositories. As a result, advances in software analytics have had superficial impact upon advancing the state of art and practice in the field of requirements engineering. Two primary inhibitors are the lack of publicly available datasets and poorly publicized industry-relevant open requirements analytic challenges. To empower the Automated Software Engineering community to tackle open Requirements Engineering challenges, the talk will describe the rapidly evolving landscape of requirements engineering, clearly articulate open challenges, draw upon examples from an ongoing, agile, safety-critical project in the domain of Unmanned Aerial Vehicles, and introduce Dronology as a new community dataset. ![]() ![]() |
|
Cohen, Myra B. |
![]() Mikaela Cashman, Myra B. Cohen, Priya Ranjan, and Robert W. Cottingham (University of Nebraska-Lincoln, USA; Oak Ridge National Laboratory, USA) The bioinformatics software domain contains thousands of applications for automating tasks such as the pairwise alignment of DNA sequences, building and reasoning about metabolic models or simulating growth of an organism. Its end users range from sophisticated developers to those with little computational experience. In response to their needs, developers provide many options to customize the way their algorithms are tuned. Yet there is little or no automated help for the user in determining the consequences or impact of the options they choose. In this paper we describe our experience working with configurable bioinformatics tools. We find limited documentation and help for combining and selecting options along with variation in both functionality and performance. We also find previously undetected faults. We summarize our findings with a set of lessons learned, and present a roadmap for creating automated techniques to interact with bioinformatics software. We believe these will generalize to other types of scientific software. ![]() ![]() ![]() ![]() Jianghao Wang, Hamid Bagheri, and Myra B. Cohen (University of Nebraska-Lincoln, USA) Formal methods use mathematical notations and logical reasoning to precisely define a program's specifications, from which we can instantiate valid instances of a system. With these techniques we can perform a multitude of tasks to check system dependability. Despite the existence of many automated tools including ones considered lightweight, they still lack a strong adoption in practice. At the crux of this problem, is scalability and applicability to large real world applications. In this paper we show how to relax the completeness guarantee without much loss, since soundness is maintained. We have extended a popular lightweight analysis, Alloy, with a genetic algorithm. Our new tool, EvoAlloy, works at the level of finite relations generated by Kodkod and evolves the chromosomes based on the failed constraints. In a feasibility study we demonstrate that we can find solutions to a set of specifications beyond the scope where traditional Alloy fails. While small specifications take longer with EvoAlloy, the scalability means we can handle larger specifications. Our future vision is that when specifications are small we can maintain both soundness and completeness, but when this fails, EvoAlloy can switch to its genetic algorithm. ![]() ![]() |
|
Cordeiro, Lucas C. |
![]() Felipe R. Monteiro, Mário A. P. Garcia, Lucas C. Cordeiro, and Eddie B. de Lima Filho (Federal University of Amazonas, Brazil; University of Manchester, UK; TPV Technology, Brazil) This work proposes an abstraction of the Qt framework, named as Qt Operational Model (QtOM), which is integrated into two different verification approaches: explicit-state model checking and symbolic (bounded) model checking. The proposed methodology is the first one to formally verify Qt-based applications, which has the potential to devise new directions for software verification of portable code. The full version of this paper is published in Software Testing, Verification and Reliability, on 02 March 2017 and it is available at https://doi.org/10.1002/stvr.1632. ![]() ![]() ![]() ![]() ![]() Mikhail R. Gadelha, Felipe R. Monteiro, Jeremy Morse, Lucas C. Cordeiro, Bernd Fischer, and Denis A. Nicole (University of Southampton, UK; Federal University of Amazonas, Brazil; University of Bristol, UK; University of Manchester, UK; Stellenbosch University, South Africa) ESBMC is a mature, permissively licensed open-source context-bounded model checker for the verification of single- and multi-threaded C programs. It can verify both predefined safety properties (e.g., bounds check, pointer safety, overflow) and user-defined program assertions automatically. ESBMC provides C++ and Python APIs to access internal data structures, allowing inspection and extension at any stage of the verification process. We discuss improvements over previous versions of ESBMC, including the description of new front- and back-ends, IEEE floating-point support, and an improved k-induction algorithm. A demonstration is available at https://www.youtube.com/watch?v=YcJjXHlN1v8. ![]() ![]() ![]() ![]() |
|
Cottingham, Robert W. |
![]() Mikaela Cashman, Myra B. Cohen, Priya Ranjan, and Robert W. Cottingham (University of Nebraska-Lincoln, USA; Oak Ridge National Laboratory, USA) The bioinformatics software domain contains thousands of applications for automating tasks such as the pairwise alignment of DNA sequences, building and reasoning about metabolic models or simulating growth of an organism. Its end users range from sophisticated developers to those with little computational experience. In response to their needs, developers provide many options to customize the way their algorithms are tuned. Yet there is little or no automated help for the user in determining the consequences or impact of the options they choose. In this paper we describe our experience working with configurable bioinformatics tools. We find limited documentation and help for combining and selecting options along with variation in both functionality and performance. We also find previously undetected faults. We summarize our findings with a set of lessons learned, and present a roadmap for creating automated techniques to interact with bioinformatics software. We believe these will generalize to other types of scientific software. ![]() ![]() ![]() |
|
Cui, Di |
![]() Yu Qu, Ting Liu, Jianlei Chi, Yangxu Jin, Di Cui, Ancheng He, and Qinghua Zheng (Xi'an Jiaotong University, China) Network measures have been proved to be useful in predicting software defects. Leveraging the dependency relationships between software modules, network measures can capture various structural features of software systems. However, existing studies have relied on user-defined network measures (e.g., degree statistics or centrality metrics), which are inflexible and require high computation cost, to describe the structural features. In this paper, we propose a new method called node2defect which uses a newly proposed network embedding technique, node2vec, to automatically learn to encode dependency network structure into low-dimensional vector spaces to improve software defect prediction. Specifically, we firstly construct a program's Class Dependency Network. Then node2vec is used to automatically learn structural features of the network. After that, we combine the learned features with traditional software engineering features, for accurate defect prediction. We evaluate our method on 15 open source programs. The experimental results show that in average, node2defect improves the state-of-the-art approach by 9.15% in terms of F-measure. ![]() ![]() |
|
Cunha, Alcino |
![]() Julien Brunel, David Chemouil, Alcino Cunha, and Nuno Macedo (ONERA, France; University of Toulouse, France; INESC TEC, Portugal; University of Minho, Portugal) This paper presents the Electrum Analyzer, a free-software tool to validate and perform model checking of Electrum specifications. Electrum is an extension of Alloy that enriches its relational logic with LTL operators, thus simplifying the specification of dynamic systems. The Analyzer supports both automatic bounded model checking, with an encoding into SAT, and unbounded model checking, with an encoding into SMV. Instance, or counter-example, traces are presented back to the user in a unified visualizer. Features to speed up model checking are offered, including a decomposed parallel solving strategy and the extraction of symbolic bounds. ![]() ![]() ![]() ![]() |
|
Cunha, Jácome |
![]() Rui Pereira, Pedro Simão, Jácome Cunha, and João Saraiva (INESC TEC, Portugal; University of Minho, Portugal; NOVA-LINCS, Portugal; Universidade Nova Lisboa, Portugal) Software developers are more and more eager to understand their code’s energy performance. However, even with such knowledge it is di cult to know how to improve the code. Indeed, little tool support exists to understand the energy consumption pro le of a software system and to eventually (automatically) improve its code. In this paper we present a tool termed jStanley which automatically nds collections in Java programs that can be replaced by others with a positive impact on the energy consumption as well as on the execution time. In seconds, developers obtain information about energy-eager collection usage. jStanley will further suggest alternative collections to improve the code, making it use less time, energy, or a combination of both. The preliminary evaluation we ran using jStanley shows energy gains between 2% and 17%, and a reduction in execution time between 2% and 13%. A video can be seen at https://greensoftwarelab.github.io/jStanley. ![]() ![]() ![]() |
|
Degiovanni, Renzo |
![]() Renzo Degiovanni, Facundo Molina, Germán Regis, and Nazareno Aguirre (National University of Río Cuarto, Argentina; CONICET, Argentina) Goal-conflict analysis has been widely used as an abstraction for risk analysis in goal-oriented requirements engineering approaches. In this context, where the expected behaviour of the system-to-be is captured in terms of domain properties and goals, identifying combinations of circumstances that may make the goals diverge, i.e., not to be satisfied as a whole, is of most importance. Various approaches have been proposed in order to automatically identify boundary conditions, i.e., formulas capturing goal-divergent situations, but they either apply only to some specific goal expressions, or are affected by scalability issues that make them applicable only to relatively small specifications. In this paper, we present a novel approach to automatically identify boundary conditions, using evolutionary computation. More precisely, we develop a genetic algorithm that, given the LTL formulation of the domain properties and the goals, it searches for formulas that capture divergences in the specification. We exploit a modern LTL satisfiability checker to successfully guide our genetic algorithm to the solutions. We assess our technique on a set of case studies, and show that our genetic algorithm is able to find boundary conditions that cannot be generated by related approaches, and is able to efficiently scale to LTL specifications that other approaches are unable to deal with. ![]() ![]() ![]() |
|
De Lima Filho, Eddie B. |
![]() Felipe R. Monteiro, Mário A. P. Garcia, Lucas C. Cordeiro, and Eddie B. de Lima Filho (Federal University of Amazonas, Brazil; University of Manchester, UK; TPV Technology, Brazil) This work proposes an abstraction of the Qt framework, named as Qt Operational Model (QtOM), which is integrated into two different verification approaches: explicit-state model checking and symbolic (bounded) model checking. The proposed methodology is the first one to formally verify Qt-based applications, which has the potential to devise new directions for software verification of portable code. The full version of this paper is published in Software Testing, Verification and Reliability, on 02 March 2017 and it is available at https://doi.org/10.1002/stvr.1632. ![]() ![]() ![]() ![]() |
|
Delplanque, Julien |
![]() Julien Delplanque (University of Lille, France; CNRS, France; Inria, France) Relational databases play a central role in software systems. As any software artefact they are subject to software engineering related problems. That apply even more when these databases hold behaviour in the form of stored functions, triggers, etc... The purpose of this article is to present a research plan to handle this problematic. ![]() ![]() |
|
De Lucia, Andrea |
![]() Simone Scalabrino, Giovanni Grano, Dario Di Nucci, Michele Guerra, Andrea De Lucia, Harald C. Gall, and Rocco Oliveto (University of Molise, Italy; University of Zurich, Switzerland; Vrije Universiteit Brussel, Belgium; University of Salerno, Italy) Automatically generating test cases plays an important role to reduce the time spent by developers during the testing phase. In last years, several approaches have been proposed to tackle such a problem: amongst others, search-based techniques have been shown to be particularly promising. In this paper we describe Ocelot, a search-based tool for the automatic generation of test cases in C. Ocelot allows practitioners to write skeletons of test cases for their programs and researchers to easily implement and experiment new approaches for automatic test-data generation. We show that Ocelot achieves a higher coverage compared to a competitive tool in 81% of the cases. Ocelot is publicly available to support both researchers and practitioners. ![]() ![]() |
|
Deng, Xi |
![]() Jiwei Yan, Xi Deng, Ping Wang, Tianyong Wu, Jun Yan, and Jian Zhang (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China) Exported Activity (EA), a kind of activities in Android apps that can be launched by external components, is one of the most important inter-component communication (ICC) mechanisms to realize the interaction and cooperation among multiple apps. Existing works have pointed out that, once exposed, an activity will be vulnerable to malicious ICC attacks, such as permission leakage attack. Unfortunately, it is observed that a considerable number of activities in commercial apps are exposed inadvertently, while few works have studied the necessity and reasonability of such exposure. This work takes the first step to systematically study the exposing behavior of EAs through analyzing 13,873 Android apps. It utilizes the EA associated call relationships extracted from byte-code via data-flow analysis, as well as the launch conditions obtained from the manifest files, to guide the study on the usage and misexposure of EAs. The empirical findings are that the EA mechanism is widely adopted in development and the activities are liable to be misexposed due to the developers' misunderstanding or carelessness. Further study on subsets of apps selected according to different criteria indicates that the misexposed EAs have specific characteristics, which are manually summarized into six typical misuse patterns. As a consequence, ten heuristics are designed to decide whether an activity should be exposed or not and are implemented into an automatic tool called Mist. Experiments on the collected apps show that around one fifth EAs are unnecessarily exposed and there are more than one third EAs whose exposure may not be suggested. ![]() ![]() |
|
Deng, Yuetang |
![]() Wenyu Wang, Dengfeng Li, Wei Yang, Yurui Cao, Zhenwen Zhang, Yuetang Deng, and Tao Xie (University of Illinois at Urbana-Champaign, USA; University of Texas at Dallas, USA; Tencent, China) User Interface (UI) testing is a popular approach to ensure the quality of mobile apps. Numerous test generation tools have been developed to support UI testing on mobile apps, especially for Android apps. Previous work evaluates and compares different test generation tools using only relatively simple open-source apps, while real-world industrial apps tend to have more complex functionalities and implementations. There is no direct comparison among test generation tools with regard to effectiveness and ease-of-use on these industrial apps. To address such limitation, we study existing state-of-the-art or state-of-the-practice test generation tools on 68 widely-used industrial apps. We directly compare the tools with regard to code coverage and fault-detection ability. According to our results, Monkey, a state-of-the-practice tool from Google, achieves the highest method coverage on 22 of 41 apps whose method coverage data can be obtained. Of all 68 apps under study, Monkey also achieves the highest activity coverage on 35 apps, while Stoat, a state-of-the-art tool, is able to trigger the highest number of unique crashes on 23 apps. By analyzing the experimental results, we provide suggestions for combining different test generation tools to achieve better performance. We also report our experience in applying these tools to industrial apps under study. Our study results give insights on how Android UI test generation tools could be improved to better handle complex industrial apps. ![]() ![]() |
|
Detweiler, Carrick |
![]() John-Paul Ore, Sebastian Elbaum, Carrick Detweiler, and Lambros Karkazis (University of Nebraska-Lincoln, USA) Type annotations provide a link between program variables and domain-specific types. When combined with a type system, these annotations can enable early fault detection. For type annotations to be cost-effective in practice, they need to be both accurate and affordable for developers. We lack, however, an understanding of how burdensome type annotation is for developers. Hence, this work explores three fundamental questions: 1) how accurately do developers make type annotations; 2) how long does a single annotation take; and, 3) if a system could automatically suggest a type annotation, how beneficial to accuracy are correct suggestions and how detrimental are incorrect suggestions? We present results of a study of 71 programmers using 20 random code artifacts that contain variables with physical unit types that must be annotated. Subjects choose a correct type annotation only 51% of the time and take an average of 136 seconds to make a single correct annotation. Our qualitative analysis reveals that variable names and reasoning over mathematical operations are the leading clues for type selection. We find that suggesting the correct type boosts accuracy to 73%, while making a poor suggestion decreases accuracy to 28%. We also explore what state-of-the-art automated type annotation systems can and cannot do to help developers with type annotations, and identify implications for tool developers. ![]() ![]() |
|
Diarra, Rokiatou |
![]() Rokiatou Diarra (University of Paris-Sud, France) Many procedural languages, such as C and C++, have pointers. Pointers are powerful and convenient, but pointer aliasing still hinders compiler optimizations, despite several years of research on pointer aliasing analysis. Because alias analysis is a difficult task and results are not always accurate, the ISO C standard 99 has added a keyword, named restrict to allow the programmer to specify non−aliasing as an aid to the compiler′s optimizer and to thereby possibly improve performance. The task of annotating pointers with the restrict keyword is still left to the programmer. This task is, in general, tedious and prone to errors especially since the C does not perform any verification to ensure that restrict keyword is not misplaced. In this paper we present a static analysis tool that (i) finds CUDA kernels call sites in which actual parameters do not alias; (ii) clones the kernels called at such sites; (iii) after performing an alias analysis in these kernels, adds the restrict keyword to their arguments; and (iv) replaces the original kernel call by a call to the optimized clone whenever possible. ![]() ![]() |
|
Di Cosmo, Roberto |
![]() Roberto Di Cosmo (Inria, France; University Paris Diderot, France) Software Heritage is a non profit initiative whose ambitious goal is to collect, preserve and share the source code of all software ever written, with its full development history, building a universal source code software knowledge base. Software Heritage addresses a variety of needs: preserving our scientific and technological knowledge, enabling better software development and reuse for society and industry, fostering better science, and building an essential infrastructure for large scale, reproducible software studies. We have already collected over 4 billions unique source files from over 80 millions repositories, and organised them into a giant Merkle graph, with full deduplication across all repositories. This allows us to cope with the growth of collaborative software development, and provides a unique vantage point for observing its evolution. In this talk, we will highlight the new challenges and opportunities that Software Heritage brings up. ![]() ![]() |
|
Ding, Dan |
![]() Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Wenhai Li, Chao Ji, and Dan Ding (Fudan University, China; University of Illinois at Urbana-Champaign, USA; Singapore University of Technology and Design, Singapore) Debugging microservice systems involves the deployment and manipulation of microservice systems on a containerized environment and faces unique challenges due to the high complexity and dynamism of microservices. To address these challenges, in this paper, we propose a debugging approach for microservice systems based on the delta debugging algorithm, which is to minimize failureinducing deltas of circumstances (e.g., deployment, environmental configurations) for effective debugging. Our approach includes novel techniques for defining, deploying/manipulating, and executing deltas following the idea of delta debugging. In particular, to construct a (failing) circumstance space for delta debugging to minimize, our approach defines a set of dimensions that can affect the execution of microservice systems. Our experimental study on a medium-size microservice benchmark system shows that our approach can effectively identify failure-inducing deltas that help diagnose the root causes. ![]() ![]() |
|
Di Nucci, Dario |
![]() Simone Scalabrino, Giovanni Grano, Dario Di Nucci, Michele Guerra, Andrea De Lucia, Harald C. Gall, and Rocco Oliveto (University of Molise, Italy; University of Zurich, Switzerland; Vrije Universiteit Brussel, Belgium; University of Salerno, Italy) Automatically generating test cases plays an important role to reduce the time spent by developers during the testing phase. In last years, several approaches have been proposed to tackle such a problem: amongst others, search-based techniques have been shown to be particularly promising. In this paper we describe Ocelot, a search-based tool for the automatic generation of test cases in C. Ocelot allows practitioners to write skeletons of test cases for their programs and researchers to easily implement and experiment new approaches for automatic test-data generation. We show that Ocelot achieves a higher coverage compared to a competitive tool in 81% of the cases. Ocelot is publicly available to support both researchers and practitioners. ![]() ![]() |
|
Di Penta, Massimiliano |
![]() Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk (College of William and Mary, USA; University of Lugano, Switzerland; University of Sannio, Italy) Millions of open-source projects with numerous bug fixes are available in code repositories. This proliferation of software development histories can be leveraged to learn how to fix common programming bugs. To explore such a potential, we perform an empirical study to assess the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects. We mine millions of bug-fixes from the change histories of GitHub repositories to extract meaningful examples of such bug-fixes. Then, we abstract the buggy and corresponding fixed code, and use them to train an Encoder-Decoder model able to translate buggy code into its fixed version. Our model is able to fix hundreds of unique buggy methods in the wild. Overall, this model is capable of predicting fixed patches generated by developers in 9% of the cases. ![]() ![]() ![]() Davide Falessi, Massimiliano Di Penta, Gerardo Canfora, and Giovanni Cantone (California Polytechnic State University, USA; University of Sannio, Italy; University of Rome Tor Vergata, Italy) Although very important in software engineering, establishing traceability links between software artifacts is extremely tedious, error-prone, and it requires significant effort. Even when approaches for automated traceability recovery exist, these provide the requirements analyst with a, usually very long, ranked list of candidate links that needs to be manually inspected. In this paper we introduce an approach called Estimation of the Number of Remaining Links (ENRL) which aims at estimating, via Machine Learning (ML) classifiers, the number of remaining positive links in a ranked list of candidate traceability links produced by a Natural Language Processing techniques-based recovery approach. We have evaluated the accuracy of the ENRL approach by considering several ML classifiers and NLP techniques on three datasets from industry and academia, and concerning traceability links among different kinds of software artifacts including requirements, use cases, design documents, source code, and test cases. Results from our study indicate that: (i) specific estimation models are able to provide accurate estimates of the number of remaining positive links; (ii) the estimation accuracy depends on the choice of the NLP technique, and (iii) univariate estimation models outperform multivariate ones. ![]() ![]() |
|
Dong, Jinsong |
![]() Yun Lin, Jun Sun, Lyly Tran, Guangdong Bai, Haijun Wang, and Jinsong Dong (National University of Singapore, Singapore; Singapore University of Technology and Design, Singapore; Singapore Institute of Technology, Singapore; Nanyang Technological University, Singapore) Dynamic slicing is a common way of identifying the root cause when a program fault is revealed. With the dynamic slicing technique, the programmers can follow data and control flow along the program execution trace to the root cause. However, the technique usually fails to work on omission bugs, i.e., the faults which are caused by missing executing some code. In many cases, dynamic slicing over-skips the root cause when an omission bug happens, leading the debugging process to a dead end. In this work, we conduct an empirical study on the omission bugs in the Defects4J bug repository. Our study shows that (1) omission bugs are prevalent (46.4%) among all the studied bugs; (2) there are repeating patterns on causes and fixes of the omission bugs; (3) the patterns of fixing omission bugs serve as a strong hint to break the slicing dead end. Based on our findings, we train a neural network model on the omission bugs in Defects4J repository to recommend where to approach when slicing can no long work. We conduct an experiment by applying our approach on 3193 mutated omission bugs which slicing fails to locate. The results show that our approach outperforms random benchmark on breaking the dead end and localizing the mutated omission bugs (63.8% over 2.8%). ![]() ![]() |
|
Dong, Wei |
![]() Liangze Yin, Wei Dong, Wanwei Liu, and Ji Wang (National University of Defense Technology, China) Scheduling constraint based abstraction refinement (SCAR) is one of the most efficient methods for verifying programs under sequential consistency (SC). However, most multi-processor architectures implement weak memory models (WMMs) in order to improve the performance of a program. Due to the nondeterministic execution of those memory operations by the same thread, the behavior of a program under WMMs is much more complex than that under SC, which significantly increases the verification complexity. This paper elegantly extends the SCAR method to WMMs such as TSO and PSO. To capture the order requirements of an abstraction counterexample under WMMs, we have enriched the event order graph (EOG) of a counterexample such that it is competent for both SC and WMMs. We have also proposed a unified EOG generation method which can always obtain a minimal EOG efficiently. Experimental results on a large set of multi-threaded C programs show promising results of our method. It significantly outperforms state-of-the-art tools, and the time and memory it required to verify a program under TSO and PSO are roughly comparable to that under SC. ![]() ![]() |
|
Dong, Zhen |
![]() Xiang Gao, Shin Hwei Tan, Zhen Dong, and Abhik Roychoudhury (National University of Singapore, Singapore; Southern University of Science and Technology, China) Symbolic execution of Android applications is challenging as it involves either building a customized VM for Android or modeling the Android libraries. Since the Android Runtime evolves from one version to another, building a high-fidelity symbolic execution engine involves modeling the effect of the libraries and their evolved versions. Without simulating the behavior of Android libraries, path divergence may occur due to constraint loss when the symbolic values flow into Android framework and these values later affect the subsequent path taken. Previous works such as JPF-Android have relied on the modeling of execution environment such as libraries. In this work, we build a dynamic symbolic execution engine for Android apps, without any manual modeling of execution environment. Environment (or library) dependent control flow decisions in the application will trigger an on-demand program synthesis step to automatically deduce a representation of the library.This representation is refined on-the-fly by running the corresponding library multiple times.The overarching goal of the refinement is to enhance behavioral coverage and to alleviate the path divergence problem during symbolic execution. Moreover, our library synthesis can be made context-specific. Compared to traditional synthesis approaches which aim to synthesize the complete library code, our context-specific synthesis engine can generate more precise expressions for a given context. The evaluation of our dynamic symbolic execution engine, built on top of JDART, shows that the library models obtained from program synthesis are often more accurate than the semi-manual models in JPF-Android. Furthermore, our symbolic execution engine could reach more branch targets, as compared to using the JPF-Android models. ![]() ![]() |
|
Dou, Wensheng |
![]() Wensheng Dou, Shi Han, Liang Xu, Dongmei Zhang, and Jun Wei (University of Chinese Academy of Sciences, China; Institute of Software at Chinese Academy of Sciences, China; Microsoft Research, China) Spreadsheets are widely used in various business tasks. Spreadsheet users may put similar data and computations by repeating a block of cells (a unit) in their spreadsheets. We name the unit and all its expanding ones as an expandable group. All units in an expandable group share the same or similar formats and semantics. As a data storage and management tool, expandable groups represent the fundamental structure in spreadsheets. However, existing spreadsheet systems do not recognize any expandable groups. Therefore, other spreadsheet analysis tools, e.g., data integration and fault detection, cannot utilize this structure of expandable groups to perform precise analysis. In this paper, we propose ExpCheck to automatically extract expandable groups in spreadsheets. We observe that continuous units that share the similar formats and semantics are likely to be an expandable group. Inspired by this, we inspect the format of each cell and its corresponding semantics, and further classify them into expandable groups according to their similarity. We evaluate ExpCheck on 120 spreadsheets randomly sampled from the EUSES and VEnron corpora. The experimental results show that ExpCheck is effective. ExpCheck successfully detect expandable groups with F1-measure of 73.1%, significantly outperforming the state-of-the-art techniques (F1-measure of 13.3%). ![]() ![]() |
|
Dustmann, Oscar Soria |
![]() Oscar Soria Dustmann, Klaus Wehrle, and Cristian Cadar (RWTH Aachen University, Germany; Imperial College London, UK) Symbolic execution is an effective program analysis technique whose scalability largely depends on the ability to quickly solve large numbers of first-order logic queries. We propose an effective general technique for speeding up the solving of queries in the theory of arrays and bit-vectors with a specific structure, while otherwise falling back to a complete solver. The technique has two stages: a learning stage that determines the solution sets of each symbolic variable, and a decision stage that uses this information to quickly determine the satisfiability of certain types of queries. The main challenges involve deciding which operators to support and precisely dealing with integer type casts and arithmetic underflow and overflow. We implemented this technique in an incomplete solver called PARTI (``PARtial Theory solver for Intervals''), directly integrating it into the popular KLEE symbolic execution engine. We applied KLEE with PARTI and a state-of-the-art SMT solver to synthetic and real-world benchmarks. We found that PARTI practically does not hurt performance while many times achieving order-of-magnitude speedups. ![]() ![]() |
|
Eichberg, Michael |
![]() Dominik Helm, Florian Kübler, Michael Eichberg, Michael Reif, and Mira Mezini (TU Darmstadt, Germany) Analyzing methods in object-oriented programs whether they are side-effect free and also deterministic, i.e., mathematically pure, has been the target of extensive research. Identifying such methods helps to find code smells and security related issues, and also helps analyses detecting concurrency bugs. Pure methods are also used by formal verification approaches as the foundations for specifications and proving the pureness is necessary to ensure correct specifications. However, so far no common terminology exists which describes the purity of methods. Furthermore, some terms (e.g., pure or side-effect free) are also used inconsistently. Further, all current approaches only report selected purity information making them only suitable for a smaller subset of the potential use cases. In this paper, we present a fine-grained unified lattice model which puts the purity levels found in the literature into relation and which adds a new level that generalizes existing definitions. We have also implemented a scalable, modularized purity analysis which produces significantly more precise results for real-world programs than the best-performing related work. The analysis shows that all defined levels are found in real-world projects. ![]() ![]() |
|
Elbaum, Sebastian |
![]() John-Paul Ore, Sebastian Elbaum, Carrick Detweiler, and Lambros Karkazis (University of Nebraska-Lincoln, USA) Type annotations provide a link between program variables and domain-specific types. When combined with a type system, these annotations can enable early fault detection. For type annotations to be cost-effective in practice, they need to be both accurate and affordable for developers. We lack, however, an understanding of how burdensome type annotation is for developers. Hence, this work explores three fundamental questions: 1) how accurately do developers make type annotations; 2) how long does a single annotation take; and, 3) if a system could automatically suggest a type annotation, how beneficial to accuracy are correct suggestions and how detrimental are incorrect suggestions? We present results of a study of 71 programmers using 20 random code artifacts that contain variables with physical unit types that must be annotated. Subjects choose a correct type annotation only 51% of the time and take an average of 136 seconds to make a single correct annotation. Our qualitative analysis reveals that variable names and reasoning over mathematical operations are the leading clues for type selection. We find that suggesting the correct type boosts accuracy to 73%, while making a poor suggestion decreases accuracy to 28%. We also explore what state-of-the-art automated type annotation systems can and cannot do to help developers with type annotations, and identify implications for tool developers. ![]() ![]() |
|
Enea, Constantin |
![]() Chungha Sung, Shuvendu K. Lahiri, Constantin Enea, and Chao Wang (University of Southern California, USA; Microsoft Research, USA; University of Paris Diderot, France) When an evolving program is modified to address issues related to thread synchronization, there is a need to confirm the change is correct, i.e., it does not introduce unexpected behavior. However, manually comparing two programs to identify the semantic difference is labor intensive and error prone, whereas techniques based on model checking are computationally expensive. To fill the gap, we develop a fast and approximate static analysis for computing synchronization differences of two programs. The method is fast because, instead of relying on heavy-weight model checking techniques, it leverages a polynomial-time Datalog-based program analysis framework to compute differentiating data-flow edges, i.e., edges allowed by one program but not the other. Although approximation is used our method is sufficiently accurate due to careful design of the Datalog inference rules and iterative increase of the required data-flow edges for representing a difference. We have implemented our method and evaluated it on a large number of multithreaded C programs to confirm its ability to produce, often within seconds, the same differences obtained by human; in contrast, prior techniques based on model checking take minutes or even hours and thus can be 10x to 1000x slower. ![]() ![]() ![]() |
|
Erdogmus, Hakan |
![]() Alim Ozdemir, Ayse Tosun, Hakan Erdogmus, and Rui Abreu (Istanbul Technical University, Turkey; Carnegie Mellon University, USA; University of Lisbon, Portugal) Existing tools for monitoring the quality of codebases modified by multiple developers tend to be centralized and inflexible. These tools increase the visibility of quality by producing effective reports and visualizations when a change is made to the codebase and triggering alerts when undesirable situations occur. However, their configuration is invariably both (a) centrally managed in that individual maintainers cannot define local rules to receive customized feedback when a change occurs in a specific part of the code in which they are particularly interested, and (b) coarse-grained in that analyses cannot be turned on and off below the file level. Triggr, the tool proposed in this paper, addresses these limitations by allowing distributed, customized, and fine-grained monitoring. It is a lightweight re-implementation of our previous tool, CodeAware, which adopts the same paradigm. The tool listens on a codebase’s shared repository using an event-based approach, and can send alerts to subscribed developers based on rules defined locally by them. Triggr is open-source and available at https://github.com/lyzerk/Triggr. A demonstration video can be found at https://youtu.be/qQs9aDwXJjY. ![]() ![]() |
|
Erdweg, Sebastian |
![]() Gabriël Konat, Sebastian Erdweg, and Eelco Visser (Delft University of Technology, Netherlands) Incremental build systems are essential for fast, reproducible software builds. Incremental build systems enable short feedback cycles when they capture dependencies precisely and selectively execute build tasks efficiently. A much overlooked feature of build systems is the expressiveness of the scripting language, which directly influences the maintainability of build scripts. In this paper, we present a new incremental build algorithm that allows build engineers to use a full-fledged programming language with explicit task invocation, value and file inspection facilities, and conditional and iterative language constructs. In contrast to prior work on incrementality for such programmable builds, our algorithm scales with the number of tasks affected by a change and is independent of the size of the software project being built. Specifically, our algorithm accepts a set of changed files, transitively detects and re-executes affected build tasks, but also accounts for new task dependencies discovered during building. We have evaluated the performance of our algorithm in a real-world case study and confirm its scalability. ![]() ![]() |
|
Falessi, Davide |
![]() Davide Falessi, Massimiliano Di Penta, Gerardo Canfora, and Giovanni Cantone (California Polytechnic State University, USA; University of Sannio, Italy; University of Rome Tor Vergata, Italy) Although very important in software engineering, establishing traceability links between software artifacts is extremely tedious, error-prone, and it requires significant effort. Even when approaches for automated traceability recovery exist, these provide the requirements analyst with a, usually very long, ranked list of candidate links that needs to be manually inspected. In this paper we introduce an approach called Estimation of the Number of Remaining Links (ENRL) which aims at estimating, via Machine Learning (ML) classifiers, the number of remaining positive links in a ranked list of candidate traceability links produced by a Natural Language Processing techniques-based recovery approach. We have evaluated the accuracy of the ENRL approach by considering several ML classifiers and NLP techniques on three datasets from industry and academia, and concerning traceability links among different kinds of software artifacts including requirements, use cases, design documents, source code, and test cases. Results from our study indicate that: (i) specific estimation models are able to provide accurate estimates of the number of remaining positive links; (ii) the estimation accuracy depends on the choice of the NLP technique, and (iii) univariate estimation models outperform multivariate ones. ![]() ![]() |
|
Fan, Lingling |
![]() Lingling Fan, Ting Su, Sen Chen, Guozhu Meng, Yang Liu, Lihua Xu, and Geguang Pu (East China Normal University, China; Nanyang Technological University, Singapore; Chinese Academy of Sciences, China; New York University Shanghai, China) Android, the #1 mobile app framework, enforces the single-GUI-thread model, in which a single UI thread manages GUI rendering and event dispatching. Due to this model, it is vital to avoid blocking the UI thread for responsiveness. One common practice is to offload long-running tasks into async threads. To achieve this, Android provides various async programming constructs, and leaves evelopers themselves to obey the rules implied by the model. However, as our study reveals, more than 25% apps violate these rules and introduce hard-to-detect, fail-stop errors, which we term as aysnc programming errors (APEs). To this end, this paper introduces APEChecker, a technique to automatically and efficiently manifest APEs. The key idea is to characterize APEs as specific fault patterns, and synergistically combine static analysis and dynamic UI exploration to detect and verify such errors. Among the 40 real-world Android apps, APEChecker unveils and processes 61 APEs, of which 51 are confirmed (83.6% hit rate). Specifically, APEChecker detects 3X more APEs than the state-of-art testing tools (Monkey, Sapienz and Stoat), and reduces testing time from half an hour to a few minutes. On a specific type of APEs, APEChecker confirms 5X more errors than the data race detection tool, EventRacer, with very few false alarms. ![]() ![]() |
|
Fang, Baoyin |
![]() Liang Bao, Xin Liu, Ziheng Xu, and Baoyin Fang (Xidian University, China; University of California at Davis, USA) Distributed message systems (DMSs) serve as the communication backbone for many real-time streaming data processing applications. To support the vast diversity of such applications, DMSs provide a large number of parameters to configure. However, It overwhelms for most users to configure these parameters well for better performance. Although many automatic configuration approaches have been proposed to address this issue, critical challenges still remain: 1) to train a better and robust performance prediction model using a limited number of samples, and 2) to search for a high-dimensional parameter space efficiently within a time constraint. In this paper, we propose AutoConfig -- an automatic configuration system that can optimize producer-side throughput on DMSs. AutoConfig constructs a novel comparison-based model (CBM) that is more robust that the prediction-based model (PBM) used by previous learning-based approaches. Furthermore, AutoConfig uses a weighted Latin hypercube sampling (wLHS) approach to select a set of samples that can provide a better coverage over the high-dimensional parameter space. wLHS allows AutoConfig to search for more promising configurations using the trained CBM. We have implemented AutoConfig on the Kafka platform, and evaluated it using eight different testing scenarios deployed on a public cloud. Experimental results show that our CBM can obtain better results than that of PBM under the same random forests based model. Furthermore, AutoConfig outperforms default configurations by 215.40% on average, and five state-of-the-art configuration algorithms by 7.21%-64.56%. ![]() ![]() |
|
Fischer, Bernd |
![]() Mikhail R. Gadelha, Felipe R. Monteiro, Jeremy Morse, Lucas C. Cordeiro, Bernd Fischer, and Denis A. Nicole (University of Southampton, UK; Federal University of Amazonas, Brazil; University of Bristol, UK; University of Manchester, UK; Stellenbosch University, South Africa) ESBMC is a mature, permissively licensed open-source context-bounded model checker for the verification of single- and multi-threaded C programs. It can verify both predefined safety properties (e.g., bounds check, pointer safety, overflow) and user-defined program assertions automatically. ESBMC provides C++ and Python APIs to access internal data structures, allowing inspection and extension at any stage of the verification process. We discuss improvements over previous versions of ESBMC, including the description of new front- and back-ends, IEEE floating-point support, and an improved k-induction algorithm. A demonstration is available at https://www.youtube.com/watch?v=YcJjXHlN1v8. ![]() ![]() ![]() ![]() |
|
Friedberger, Karlheinz |
![]() Dirk Beyer and Karlheinz Friedberger (LMU Munich, Germany) Recent development of software aims at massively parallel execution, because of the trend to increase the number of processing units per CPU socket. But many approaches for program analysis are not designed to benefit from a multi-threaded execution and lack support to utilize multi-core computers. Rewriting existing algorithms is difficult and error-prone, and the design of new parallel algorithms also has limitations. An orthogonal problem is the granularity: computing each successor state in parallel seems too fine-grained, so the open question is to find the right structural level for parallel execution. We propose an elegant solution to these problems: Block summaries should be computed in parallel. Many successful approaches to software verification are based on summaries of control-flow blocks, large blocks, or function bodies. Block-abstraction memoization is a successful domain-independent approach for summary-based program analysis. We redesigned the verification approach of block-abstraction memoization starting from its original recursive definition, such that it can run in a parallel manner for utilizing the available computation resources without losing its advantages of being independent from a certain abstract domain. We present an implementation of our new approach for multi-core shared-memory machines. The experimental evaluation shows that our summary-based approach has no significant overhead compared to the existing sequential approach and that it has a significant speedup when using multi-threading. ![]() ![]() ![]() ![]() |
|
Fu, Ying |
![]() Jian Gao, Xin Yang, Ying Fu, Yu Jiang, and Jiaguang Sun (Tsinghua University, China) Code reuse improves software development efficiency, however, vulnerabilities can be introduced inadvertently. Many existing works compute the code similarity based on CFGs to determine whether a binary function contains a known vulnerability. Unfortunately, their performance in cross-platform binary search is challenged. This paper presents VulSeeker, a semantic learning based vulnerability seeker for cross-platform binary. Given a target function and a vulnerable function, VulSeeker first constructs the labeled semantic flow graphs and extracts basic block features as numerical vectors for both of them. Then the embedding vector of the whole binary function is generated by feeding the numerical vectors of basic blocks to the customized semantics aware DNN model. Finally, the similarity of the two binary functions is measured based on the Cosine distance. The experimental results show that VulSeeker outperforms the state-of-the-art approaches in terms of accuracy. For example, compared to the most recent and related work Gemini, VulSeeker finds 50.00% more vulnerabilities in the top-10 candidates and 13.89% more in the top-50 candidates, and improves the values of AUC and ACC for 8.23% and 12.14% respectively. The video is presented at https://youtu.be/Mw0mr84gpI8. ![]() ![]() |
|
Gadelha, Mikhail R. |
![]() Mikhail R. Gadelha, Felipe R. Monteiro, Jeremy Morse, Lucas C. Cordeiro, Bernd Fischer, and Denis A. Nicole (University of Southampton, UK; Federal University of Amazonas, Brazil; University of Bristol, UK; University of Manchester, UK; Stellenbosch University, South Africa) ESBMC is a mature, permissively licensed open-source context-bounded model checker for the verification of single- and multi-threaded C programs. It can verify both predefined safety properties (e.g., bounds check, pointer safety, overflow) and user-defined program assertions automatically. ESBMC provides C++ and Python APIs to access internal data structures, allowing inspection and extension at any stage of the verification process. We discuss improvements over previous versions of ESBMC, including the description of new front- and back-ends, IEEE floating-point support, and an improved k-induction algorithm. A demonstration is available at https://www.youtube.com/watch?v=YcJjXHlN1v8. ![]() ![]() ![]() ![]() |
|
Gafurov, Davrondzhon |
![]() Davrondzhon Gafurov, Arne Erik Hurum, and Martin Markman (Norwegian Directorate of eHealth, Norway) We present a process driven test automation solution which enables delegating (part of) automation tasks from test automation engineer (expensive resource) to test analyst (non-developer, less expensive). In our approach, a test automation engineer implements test steps (or actions) which are executed automatically. Such automated test steps represent user actions in the system under test and specified by a natural language which is understandable by a non-technical person. Then, a test analyst with a domain knowledge organizes automated steps combined with test input to create an automated test case. It should be emphasized that the test analyst does not need to possess programming skills to create, modify or execute automated test cases. We refine benchmark test automation architecture to be better suitable for an effective separation and sharing of responsibilities between the test automation engineer (with coding skills) and test analyst (with a domain knowledge). In addition, we propose a metric to empirically estimate cooperation between test automation engineer and test analyst's works. The proposed automation solution has been defined based on our experience in the development and maintenance of Helsenorg, the national electronic health services in Norway which has had over one million of visits per month past year, and we still use it to automate the execution of regression tests. ![]() ![]() |
|
Galinier, Florian |
![]() Florian Galinier (IRIT, France; University of Toulouse, France) Reducing the lack of consistency between requirements and the system that should satisfy these requirements is one of the major issue in Requirement Engineering (RE). The objective of my thesis work is to propose a seamless approach, allowing users to express requirements, specifications and the system itself in a unique language. The purpose of formal approaches is to reduce inconsistency. However, most developers are not familiar with these approaches, and they are not often used outside the critical systems domain. Since we want that non-experts can also use our approach to validate systems in the early stage of their development, we propose a Domain Specific Language (DSL) that is: (i) close to natural language, and (ii) based on a formal semantics. Using Model-Driven Engineering (MDE), this language bridges the gap not only between the several stakeholders that can be involved in a project, considering their different backgrounds, but also between the requirements and the code. ![]() ![]() |
|
Gall, Harald C. |
![]() Carmine Vassallo, Fabio Palomba, Alberto Bacchelli, and Harald C. Gall (University of Zurich, Switzerland) Continuous Integration (CI) is a software engineering practice where developers constantly integrate their changes to a project through an automated build process. The goal of CI is to provide developers with prompt feedback on several quality dimensions after each change. Indeed, previous studies provided empirical evidence on a positive association between properly following CI principles and source code quality. A core principle behind CI is Continuous Code Quality (also known as CCQ, which includes automated testing and automated code inspection) may appear simple and effective, yet we know little about its practical adoption. In this paper, we propose a preliminary empirical investigation aimed at understanding how rigorously practitioners follow CCQ. Our study reveals a strong dichotomy between theory and practice: developers do not perform continuous inspection but rather control for quality only at the end of a sprint and most of the times only on the release branch. Preprint [https://doi.org/10.5281/zenodo.1341036]. Data and Materials [http://doi.org/10.5281/zenodo.1341015]. ![]() ![]() ![]() Simone Scalabrino, Giovanni Grano, Dario Di Nucci, Michele Guerra, Andrea De Lucia, Harald C. Gall, and Rocco Oliveto (University of Molise, Italy; University of Zurich, Switzerland; Vrije Universiteit Brussel, Belgium; University of Salerno, Italy) Automatically generating test cases plays an important role to reduce the time spent by developers during the testing phase. In last years, several approaches have been proposed to tackle such a problem: amongst others, search-based techniques have been shown to be particularly promising. In this paper we describe Ocelot, a search-based tool for the automatic generation of test cases in C. Ocelot allows practitioners to write skeletons of test cases for their programs and researchers to easily implement and experiment new approaches for automatic test-data generation. We show that Ocelot achieves a higher coverage compared to a competitive tool in 81% of the cases. Ocelot is publicly available to support both researchers and practitioners. ![]() ![]() |
|
Gallaba, Keheliya |
![]() Keheliya Gallaba, Christian Macho, Martin Pinzger, and Shane McIntosh (McGill University, Canada; University of Klagenfurt, Austria) Automated builds, which may pass or fail, provide feedback to a development team about changes to the codebase. A passing build indicates that the change compiles cleanly and tests (continue to) pass. A failing (a.k.a., broken) build indicates that there are issues that require attention. Without a closer analysis of the nature of build outcome data, practitioners and researchers are likely to make two critical assumptions: (1) build results are not noisy; however, passing builds may contain failing or skipped jobs that are actively or passively ignored; and (2) builds are equal; however, builds vary in terms of the number of jobs and configurations. To investigate the degree to which these assumptions about build breakage hold, we perform an empirical study of 3.7 million build jobs spanning 1,276 open source projects. We find that: (1) 12% of passing builds have an actively ignored failure; (2) 9% of builds have a misleading or incorrect outcome on average; and (3) at least 44% of the broken builds contain passing jobs, i.e., the breakage is local to a subset of build variants. Like other software archives, build data is noisy and complex. Analysis of build data requires nuance. ![]() ![]() |
|
Gao, Jian |
![]() Jian Gao, Xin Yang, Ying Fu, Yu Jiang, and Jiaguang Sun (Tsinghua University, China) Code reuse improves software development efficiency, however, vulnerabilities can be introduced inadvertently. Many existing works compute the code similarity based on CFGs to determine whether a binary function contains a known vulnerability. Unfortunately, their performance in cross-platform binary search is challenged. This paper presents VulSeeker, a semantic learning based vulnerability seeker for cross-platform binary. Given a target function and a vulnerable function, VulSeeker first constructs the labeled semantic flow graphs and extracts basic block features as numerical vectors for both of them. Then the embedding vector of the whole binary function is generated by feeding the numerical vectors of basic blocks to the customized semantics aware DNN model. Finally, the similarity of the two binary functions is measured based on the Cosine distance. The experimental results show that VulSeeker outperforms the state-of-the-art approaches in terms of accuracy. For example, compared to the most recent and related work Gemini, VulSeeker finds 50.00% more vulnerabilities in the top-10 candidates and 13.89% more in the top-50 candidates, and improves the values of AUC and ACC for 8.23% and 12.14% respectively. The video is presented at https://youtu.be/Mw0mr84gpI8. ![]() ![]() |
|
Gao, Jianxiong |
![]() Jianxiong Gao and Steven S. Lumetta (University of Illinois at Urbana-Champaign, USA) Path explosion has been a problem for symbolic execution for a long time. The key to avoid path explosion is to limit the number of paths generated within loops while maintaining high code coverage. Full symbolic execution creates paths for every possible execution path. Frequently, paths within loops do not contribute to code coverage. Branches within loops may generate new states at every iteration. The path explosion problem created by loops often stops symbolic execution to reach deeper parts of the code. In this paper, we propose a new path exploration method that reduces the number of states needed to achieve high coverage. Our algorithm limits the number of new states created by first prioritizing states, and then pruning the states that do not contribute to code coverage. Our algorithm does not require loop invariant inference/loop summarization, nor does it bound the number of iterations of loop exploration. The proposed algorithm can thus handle a broader set of loops than previous approaches. In fact, our algorithm is orthogonal to loop summarization techniques and search-guide heuristics, so it complements the current methods. We have implemented our algorithm using KLEE and tested with 235 student-generated versions of a classroom assignment. Our results show that our algorithm helps to achieve the same coverage with speedup of 11.8× for 117 out of the 235 programs, while adding 15% max observed and 2% average overhead over the 50% of programs not benefiting from the technique. The maximum speedup for a single program is 52.3×. ![]() ![]() |
|
Gao, Xiang |
![]() Xiang Gao, Shin Hwei Tan, Zhen Dong, and Abhik Roychoudhury (National University of Singapore, Singapore; Southern University of Science and Technology, China) Symbolic execution of Android applications is challenging as it involves either building a customized VM for Android or modeling the Android libraries. Since the Android Runtime evolves from one version to another, building a high-fidelity symbolic execution engine involves modeling the effect of the libraries and their evolved versions. Without simulating the behavior of Android libraries, path divergence may occur due to constraint loss when the symbolic values flow into Android framework and these values later affect the subsequent path taken. Previous works such as JPF-Android have relied on the modeling of execution environment such as libraries. In this work, we build a dynamic symbolic execution engine for Android apps, without any manual modeling of execution environment. Environment (or library) dependent control flow decisions in the application will trigger an on-demand program synthesis step to automatically deduce a representation of the library.This representation is refined on-the-fly by running the corresponding library multiple times.The overarching goal of the refinement is to enhance behavioral coverage and to alleviate the path divergence problem during symbolic execution. Moreover, our library synthesis can be made context-specific. Compared to traditional synthesis approaches which aim to synthesize the complete library code, our context-specific synthesis engine can generate more precise expressions for a given context. The evaluation of our dynamic symbolic execution engine, built on top of JDART, shows that the library models obtained from program synthesis are often more accurate than the semi-manual models in JPF-Android. Furthermore, our symbolic execution engine could reach more branch targets, as compared to using the JPF-Android models. ![]() ![]() |
|
Garcia, Joshua |
![]() Mahmoud Hammad, Joshua Garcia, and Sam Malek (University of California at Irvine, USA) The current security mechanisms for Android apps, both static and dynamic analysis approaches, are insufficient for detection and prevention of the increasingly dynamic and sophisticated security attacks. Static analysis approaches suffer from false positives whereas dynamic analysis approaches suffer from false negatives. Moreover, they all lack the ability to efficiently analyze systems with incremental changes—such as adding/removing apps, granting/revoking permissions, and dynamic components’ communications. Each time the system changes, the entire analysis needs to be repeated, making the existing approaches inefficient for practical use. To mitigate their shortcomings, we have developed SALMA, a novel self-protecting Android software system that monitors itself and adapts its behavior at runtime to prevent a wide-range of security risks. SALMA maintains a precise architectural model, represented as a Multiple-Domain-Matrix, and incrementally and efficiently analyzes an Android system in response to incremental system changes. The maintained architecture is used to reason about the running Android system. Every time the system changes, SALMA determines (1) the impacted part of the system, and (2) the subset of the security analyses that need to be performed, thereby greatly improving the performance of the approach. Our experimental results on hundreds of real-world apps corroborate SALMA’s scalability and efficiency as well as its ability to detect and prevent security attacks at runtime with minimal disruption. ![]() ![]() |
|
Garcia, Mário A. P. |
![]() Felipe R. Monteiro, Mário A. P. Garcia, Lucas C. Cordeiro, and Eddie B. de Lima Filho (Federal University of Amazonas, Brazil; University of Manchester, UK; TPV Technology, Brazil) This work proposes an abstraction of the Qt framework, named as Qt Operational Model (QtOM), which is integrated into two different verification approaches: explicit-state model checking and symbolic (bounded) model checking. The proposed methodology is the first one to formally verify Qt-based applications, which has the potential to devise new directions for software verification of portable code. The full version of this paper is published in Software Testing, Verification and Reliability, on 02 March 2017 and it is available at https://doi.org/10.1002/stvr.1632. ![]() ![]() ![]() ![]() |
|
Gardey, Juan Cruz |
![]() Julián Grigera, Juan Cruz Gardey, Alejandra Garrido, and Gustavo Rossi (National University of La Plata, Argentina; CIC, Argentina; CONICET, Argentina) Client-Side Web Refactorings (CSWRs) allow improving interaction aspects of web applications by applying changes to the user interface without altering the code base, even in production settings. However, developers are not always willing, or even allowed to apply external adaptations to their applications’ user interface. Besides, CSWRs do not guarantee improvements in all contexts, so it may be unwise to install them in a production version. We propose a tool that allows creating private versions of a running web application almost automatically. Using this tool, developers or usability experts can easily combine CSWRs to create alternative versions of web applications, without the need of creating a cloned sandbox environment for each version. This yields many uses, such as quickly setting up user tests, showing live alternatives to Product Owners, and even performing A/B testing. The tool is built on top of Kobold, a service that allows applying CSWRs to fix usability smells. Kobold with versioning is available at: https://autorefactoring.lifia.info.unlp.edu.ar. A screencast of the tool is available at https://youtu.be/LVc3BOtVP3I. ![]() ![]() ![]() |
|
Garrido, Alejandra |
![]() Julián Grigera, Juan Cruz Gardey, Alejandra Garrido, and Gustavo Rossi (National University of La Plata, Argentina; CIC, Argentina; CONICET, Argentina) Client-Side Web Refactorings (CSWRs) allow improving interaction aspects of web applications by applying changes to the user interface without altering the code base, even in production settings. However, developers are not always willing, or even allowed to apply external adaptations to their applications’ user interface. Besides, CSWRs do not guarantee improvements in all contexts, so it may be unwise to install them in a production version. We propose a tool that allows creating private versions of a running web application almost automatically. Using this tool, developers or usability experts can easily combine CSWRs to create alternative versions of web applications, without the need of creating a cloned sandbox environment for each version. This yields many uses, such as quickly setting up user tests, showing live alternatives to Product Owners, and even performing A/B testing. The tool is built on top of Kobold, a service that allows applying CSWRs to fix usability smells. Kobold with versioning is available at: https://autorefactoring.lifia.info.unlp.edu.ar. A screencast of the tool is available at https://youtu.be/LVc3BOtVP3I. ![]() ![]() ![]() |
|
Gehani, Ashish |
![]() Hashim Sharif, Muhammad Abubakar, Ashish Gehani, and Fareed Zaffar (University of Illinois at Urbana-Champaign, USA; Lahore University of Management Sciences, Pakistan; SRI International, USA) With the proliferation of new hardware architectures and ever-evolving user requirements, the software stack is becoming increasingly bloated. In practice, only a limited subset of the supported functionality is utilized in a particular usage context, thereby presenting an opportunity to eliminate unused features. In the past, program specialization has been proposed as a mechanism for enabling automatic software debloating. In this work, we show how existing program specialization techniques lack the analyses required for providing code simplification for real-world programs. We present an approach that uses stronger analysis techniques to take advantage of constant configuration data, thereby enabling more effective debloating. We developed Trimmer, an application specialization tool that leverages user-provided configuration data to specialize an application to its deployment context. The specialization process attempts to eliminate the application functionality that is unused in the user-defined context. Our evaluation demonstrates Trimmer can effectively reduce code bloat. For 13 applications spanning various domains, we observe a mean binary size reduction of 21% and a maximum reduction of 75%. We also show specialization reduces the surface for code-reuse attacks by reducing the number of exploitable gadgets. For the evaluated programs, we observe a 20% mean reduction in the total gadget count and a maximum reduction of 87%. ![]() ![]() |
|
Ghaisas, Smita |
![]() Manasi Patwardhan, Abhishek Sainani, Richa Sharma, Shirish Karande, and Smita Ghaisas (TCS Research, India) Compliant software is a critical need of all modern businesses. Disambiguating regulations to derive requirements is therefore an important software engineering activity. Regulations however are ridden with ambiguities that make their comprehension a challenge, seemingly surmountable only by legal experts. Since legal experts' involvement in every project is expensive, approaches to automate the disambiguation need to be explored. These approaches however require a large amount of annotated data. Collecting data exclusively from experts is not a scalable and affordable solution. In this paper, we present the results of a crowd sourcing experiment to collect annotations on ambiguities in regulations from professional software engineers. We discuss an approach to automate the arduous and critical step of identifying ground truth labels by employing crowd consensus using Expectation Maximization (EM). We demonstrate that the annotations reaching a consensus match those of experts with an accuracy of 87%. ![]() ![]() |
|
Gharibi, Gharib |
![]() Gharib Gharibi, Rashmi Tripathi, and Yugyung Lee (University of Missouri-Kansas City, USA) A static call graph is an imperative prerequisite used in most interprocedural analyses and software comprehension tools. However, there is a lack of software tools that can automatically analyze the Python source-code and construct its static call graph. In this paper, we introduce a prototype Python tool, named code2graph, which automates the tasks of (1) analyzing the Python source-code and extracting its structure, (2) constructing static call graphs from the source code, and (3) generating a similarity matrix of all possible execution paths in the system. Our goal is twofold: First, assist the developers in understanding the overall structure of the system. Second, provide a stepping stone for further research that can utilize the tool in software searching and similarity detection applications. For example, clustering the execution paths into a logical workflow of the system would be applied to automate specific software tasks. Code2graph has been successfully used to generate static call graphs and similarity matrices of the paths for three popular open-source Deep Learning projects (TensorFlow, Keras, PyTorch). A tool demo is available at https://youtu.be/ecctePpcAKU. ![]() ![]() ![]() |
|
Grano, Giovanni |
![]() Simone Scalabrino, Giovanni Grano, Dario Di Nucci, Michele Guerra, Andrea De Lucia, Harald C. Gall, and Rocco Oliveto (University of Molise, Italy; University of Zurich, Switzerland; Vrije Universiteit Brussel, Belgium; University of Salerno, Italy) Automatically generating test cases plays an important role to reduce the time spent by developers during the testing phase. In last years, several approaches have been proposed to tackle such a problem: amongst others, search-based techniques have been shown to be particularly promising. In this paper we describe Ocelot, a search-based tool for the automatic generation of test cases in C. Ocelot allows practitioners to write skeletons of test cases for their programs and researchers to easily implement and experiment new approaches for automatic test-data generation. We show that Ocelot achieves a higher coverage compared to a competitive tool in 81% of the cases. Ocelot is publicly available to support both researchers and practitioners. ![]() ![]() |
|
Grigera, Julián |
![]() Julián Grigera, Juan Cruz Gardey, Alejandra Garrido, and Gustavo Rossi (National University of La Plata, Argentina; CIC, Argentina; CONICET, Argentina) Client-Side Web Refactorings (CSWRs) allow improving interaction aspects of web applications by applying changes to the user interface without altering the code base, even in production settings. However, developers are not always willing, or even allowed to apply external adaptations to their applications’ user interface. Besides, CSWRs do not guarantee improvements in all contexts, so it may be unwise to install them in a production version. We propose a tool that allows creating private versions of a running web application almost automatically. Using this tool, developers or usability experts can easily combine CSWRs to create alternative versions of web applications, without the need of creating a cloned sandbox environment for each version. This yields many uses, such as quickly setting up user tests, showing live alternatives to Product Owners, and even performing A/B testing. The tool is built on top of Kobold, a service that allows applying CSWRs to fix usability smells. Kobold with versioning is available at: https://autorefactoring.lifia.info.unlp.edu.ar. A screencast of the tool is available at https://youtu.be/LVc3BOtVP3I. ![]() ![]() ![]() |
|
Guerra, Michele |
![]() Simone Scalabrino, Giovanni Grano, Dario Di Nucci, Michele Guerra, Andrea De Lucia, Harald C. Gall, and Rocco Oliveto (University of Molise, Italy; University of Zurich, Switzerland; Vrije Universiteit Brussel, Belgium; University of Salerno, Italy) Automatically generating test cases plays an important role to reduce the time spent by developers during the testing phase. In last years, several approaches have been proposed to tackle such a problem: amongst others, search-based techniques have been shown to be particularly promising. In this paper we describe Ocelot, a search-based tool for the automatic generation of test cases in C. Ocelot allows practitioners to write skeletons of test cases for their programs and researchers to easily implement and experiment new approaches for automatic test-data generation. We show that Ocelot achieves a higher coverage compared to a competitive tool in 81% of the cases. Ocelot is publicly available to support both researchers and practitioners. ![]() ![]() |
|
Habchi, Sarra |
![]() Sarra Habchi, Xavier Blanc, and Romain Rouvoy (Inria, France; University of Lille, France; University of Bordeaux, France) With millions of applications (apps) distributed through mobile markets, engaging and retaining end-users challenge Android developers to deliver a nearly perfect user experience. As mobile apps run in resource-limited devices, performance is a critical criterion for the quality of experience. Therefore, developers are expected to pay much attention to limit performance bad practices. On the one hand, many studies already identified such performance bad practices and showed that they can heavily impact app performance. Hence, many static analysers, a.k.a. linters, have been proposed to detect and fix these bad practices. On the other hand, other studies have shown that Android developers tend to deal with performance reactively and they rarely build on linters to detect and fix performance bad practices. In this paper, we therefore perform a qualitative study to investigate this gap between research and development community. In particular, we performed interviews with 14 experienced Android developers to identify the perceived benefits and constraints of using linters to identify performance bad practices in Android apps. Our observations can have a direct impact on developers and the research community. Specifically, we describe why and how developers leverage static source code analysers to improve the performance of their apps. On top of that, we bring to light important challenges faced by developers when it comes to adopting static analysis for performance purposes. ![]() ![]() ![]() ![]() |
|
Habib, Andrew |
![]() Andrew Habib and Michael Pradel (TU Darmstadt, Germany) Static bug detectors are becoming increasingly popular and are widely used by professional software developers. While most work on bug detectors focuses on whether they find bugs at all, and on how many false positives they report in addition to legitimate warnings, the inverse question is often neglected: How many of all real-world bugs do static bug detectors find? This paper addresses this question by studying the results of applying three widely used static bug detectors to an extended version of the Defects4J dataset that consists of 15 Java projects with 594 known bugs. To decide which of these bugs the tools detect, we use a novel methodology that combines an automatic analysis of warnings and bugs with a manual validation of each candidate of a detected bug. The results of the study show that: (i) static bug detectors find a non-negligible amount of all bugs, (ii) different tools are mostly complementary to each other, and (iii) current bug detectors miss the large majority of the studied bugs. A detailed analysis of bugs missed by the static detectors shows that some bugs could have been found by variants of the existing detectors, while others are domain-specific problems that do not match any existing bug pattern. These findings help potential users of such tools to assess their utility, motivate and outline directions for future work on static bug detection, and provide a basis for future comparisons of static bug detection with other bug finding techniques, such as manual and automated testing. ![]() ![]() ![]() ![]() Andrew Habib and Michael Pradel (TU Darmstadt, Germany) Thread-safe classes are pervasive in concurrent, object-oriented software. However, many classes lack documentation regarding their safety guarantees under multi-threaded usage. This lack of documentation forces developers who use a class in a concurrent program to either carefully inspect the implementation of the class, to conservatively synchronize all accesses to it, or to optimistically assume that the class is thread-safe. To overcome the lack of documentation, we present TSFinder, an approach to automatically classify classes as supposedly thread-safe or thread-unsafe. The key idea is to combine a lightweight static analysis that extracts a graph representation from classes with a graph-based classifier. After training the classifier with classes known to be thread-safe and thread-unsafe, it achieves an accuracy of 94.5% on previously unseen classes, enabling the approach to infer thread safety documentation with high confidence. The classifier takes about 3 seconds per class, i.e., it is efficient enough to infer documentation for many classes. ![]() ![]() ![]() |
|
Haindl, Philipp |
![]() Philipp Haindl (JKU Linz, Austria) While formal task models allow definition and time-based assessment of user-interaction, they have not yet been used as a baseline for assessing the variability patterns of user interaction with software. Consequently, operationalizing these variability patterns enables us to evaluate how suitable the defined task-execution paths are for the user to achieve a predefined goal. Improvements of the software could thereby be evidence-based on knowledge of changes' effects on functional suitability for the user rather than on prospective or opinion-based usage scenarios. Following a design thinking approach tailored to software engineering, understanding and observing these variability patterns are mandatory steps for continuous user-centered improvement of software. In practice however, due to the absence of an operational quality model for functional suitability it is hardly possible to effectively put these operational measures into context and to derive concrete software improvement actions. Having operational suitability metrics is even more important for software engineers as it allows to increasingly focus development activities especially on functionality which has business value for the user. ![]() ![]() |
|
Hammad, Mahmoud |
![]() Mahmoud Hammad, Joshua Garcia, and Sam Malek (University of California at Irvine, USA) The current security mechanisms for Android apps, both static and dynamic analysis approaches, are insufficient for detection and prevention of the increasingly dynamic and sophisticated security attacks. Static analysis approaches suffer from false positives whereas dynamic analysis approaches suffer from false negatives. Moreover, they all lack the ability to efficiently analyze systems with incremental changes—such as adding/removing apps, granting/revoking permissions, and dynamic components’ communications. Each time the system changes, the entire analysis needs to be repeated, making the existing approaches inefficient for practical use. To mitigate their shortcomings, we have developed SALMA, a novel self-protecting Android software system that monitors itself and adapts its behavior at runtime to prevent a wide-range of security risks. SALMA maintains a precise architectural model, represented as a Multiple-Domain-Matrix, and incrementally and efficiently analyzes an Android system in response to incremental system changes. The maintained architecture is used to reason about the running Android system. Every time the system changes, SALMA determines (1) the impacted part of the system, and (2) the subset of the security analyses that need to be performed, thereby greatly improving the performance of the approach. Our experimental results on hundreds of real-world apps corroborate SALMA’s scalability and efficiency as well as its ability to detect and prevent security attacks at runtime with minimal disruption. ![]() ![]() |
|
Han, Shi |
![]() Wensheng Dou, Shi Han, Liang Xu, Dongmei Zhang, and Jun Wei (University of Chinese Academy of Sciences, China; Institute of Software at Chinese Academy of Sciences, China; Microsoft Research, China) Spreadsheets are widely used in various business tasks. Spreadsheet users may put similar data and computations by repeating a block of cells (a unit) in their spreadsheets. We name the unit and all its expanding ones as an expandable group. All units in an expandable group share the same or similar formats and semantics. As a data storage and management tool, expandable groups represent the fundamental structure in spreadsheets. However, existing spreadsheet systems do not recognize any expandable groups. Therefore, other spreadsheet analysis tools, e.g., data integration and fault detection, cannot utilize this structure of expandable groups to perform precise analysis. In this paper, we propose ExpCheck to automatically extract expandable groups in spreadsheets. We observe that continuous units that share the similar formats and semantics are likely to be an expandable group. Inspired by this, we inspect the format of each cell and its corresponding semantics, and further classify them into expandable groups according to their similarity. We evaluate ExpCheck on 120 spreadsheets randomly sampled from the EUSES and VEnron corpora. The experimental results show that ExpCheck is effective. ExpCheck successfully detect expandable groups with F1-measure of 73.1%, significantly outperforming the state-of-the-art techniques (F1-measure of 13.3%). ![]() ![]() |
|
Han, Xue |
![]() Xue Han, Tingting Yu, and David Lo (University of Kentucky, USA; Singapore Management University, Singapore) Software performance is important for ensuring the quality of software products. Performance bugs, defined as programming errors that cause significant performance degradation, can lead to slow systems and poor user experience. While there has been some research on automated performance testing such as test case generation, the main idea is to select workload values to increase the program execution times. These techniques often assume the initial test cases have the right combination of input parameters and focus on evolving values of certain input parameters. However, such an assumption may not hold for highly configurable real-word applications, in which the combinations of input parameters can be very large. In this paper, we manually analyze 300 bug reports from three large open source projects - Apache HTTP Server, MySQL, and Mozilla Firefox. We found that 1) exposing performance bugs often requires combinations of multiple input parameters, and 2) certain input parameters are frequently involved in exposing performance bugs. Guided by these findings, we designed and evaluated an automated approach, PerfLearner, to extract execution commands and input parameters from descriptions of performance bug reports and use them to generate test frames for guiding actual performance test case generation. ![]() ![]() |
|
Hariri, Farah |
![]() Farah Hariri and August Shi (University of Illinois at Urbana-Champaign, USA) We present SRCIROR (pronounced “sorcerer“), a toolset for performing mutation testing at the levels of C/C++ source code (SRC) and the LLVM compiler intermediate representation (IR). At the SRC level, SRCIROR identifies program constructs for mutation by pattern-matching on the Clang AST. At the IR level, SRCIROR directly mutates the LLVM IR instructions through LLVM passes. Our implementation enables SRCIROR to (1) handle any program that Clang can handle, extending to large programs with a minimal overhead, and (2) have a small percentage of invalid mutants that do not compile. SRCIROR enables performing mutation testing using the same classes of mutation operators at both the SRC and IR levels, and it is easily extensible to support more operators. In addition, SRCIROR can collect coverage to generate mutants only for covered code elements. Our tool is publicly available on GitHub (https://github.com/TestingResearchIllinois/srciror). We evaluate SRCIROR on Coreutils subjects. Our evaluation shows interesting differences between SRC and IR, demonstrating the value of SRCIROR in enabling mutation testing research across different levels of code representation. ![]() ![]() |
|
Hassan, Ahmed E. |
![]() Zhongxin Liu, Xin Xia, Ahmed E. Hassan, David Lo, Zhenchang Xing, and Xinyu Wang (Zhejiang University, China; Monash University, Australia; Queen's University, Canada; Singapore Management University, Singapore; Australian National University, Australia) Commit messages can be regarded as the documentation of software changes. These messages describe the content and purposes of changes, hence are useful for program comprehension and software maintenance. However, due to the lack of time and direct motivation, commit messages sometimes are neglected by developers. To address this problem, Jiang et al. proposed an approach (we refer to it as NMT), which leverages a neural machine translation algorithm to automatically generate short commit messages from code. The reported performance of their approach is promising, however, they did not explore why their approach performs well. Thus, in this paper, we first perform an in-depth analysis of their experimental results. We find that (1) Most of the test diffs from which NMT can generate high-quality messages are similar to one or more training diffs at the token level. (2) About 16% of the commit messages in Jiang et al.’s dataset are noisy due to being automatically generated or due to them describing repetitive trivial changes. (3) The performance of NMT declines by a large amount after removing such noisy commit messages. In addition, NMT is complicated and time-consuming. Inspired by our first finding, we proposed a simpler and faster approach, named NNGen (Nearest Neighbor Generator), to generate concise commit messages using the nearest neighbor algorithm. Our experimental results show that NNGen is over 2,600 times faster than NMT, and outperforms NMT in terms of BLEU (an accuracy measure that is widely used to evaluate machine translation systems) by 21%. Finally, we also discuss some observations for the road ahead for automated commit message generation to inspire other researchers. ![]() ![]() |
|
Hassan, Foyzul |
![]() Foyzul Hassan, Rodney Rodriguez, and Xiaoyin Wang (University of Texas at San Antonio, USA) Dockerfiles are configuration files of docker images which package all dependencies of a software to enable convenient software deployment and porting. In other words, dockerfiles list all environment assumptions of a software application's build and / or execution, so they need to be frequently updated when the environment assumptions change during fast software evolution. In this paper, we propose RUDSEA, a novel approach to recommend updates of dockerfiles to developers based on analyzing changes on software environment assumptions and their impacts. Our evaluation on 1,199 real-world instruction updates shows that RUDSEA can recommend correct update locations for 78.5% of the updates, and correct code changes for 44.1% of the updates. ![]() ![]() |
|
He, Ancheng |
![]() Yu Qu, Ting Liu, Jianlei Chi, Yangxu Jin, Di Cui, Ancheng He, and Qinghua Zheng (Xi'an Jiaotong University, China) Network measures have been proved to be useful in predicting software defects. Leveraging the dependency relationships between software modules, network measures can capture various structural features of software systems. However, existing studies have relied on user-defined network measures (e.g., degree statistics or centrality metrics), which are inflexible and require high computation cost, to describe the structural features. In this paper, we propose a new method called node2defect which uses a newly proposed network embedding technique, node2vec, to automatically learn to encode dependency network structure into low-dimensional vector spaces to improve software defect prediction. Specifically, we firstly construct a program's Class Dependency Network. Then node2vec is used to automatically learn structural features of the network. After that, we combine the learned features with traditional software engineering features, for accurate defect prediction. We evaluate our method on 15 open source programs. The experimental results show that in average, node2defect improves the state-of-the-art approach by 9.15% in terms of F-measure. ![]() ![]() |
|
He, Dongjie |
![]() Dongjie He, Lian Li, Lei Wang, Hengjie Zheng, Guangwei Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) The frequent release of Android OS and its various versions bring many compatibility issues to Android Apps. This paper studies and addresses such evolution-induced compatibility problems. We conduct an extensive empirical study over 11 different Android versions and 4,936 Android Apps. Our study shows that there are drastic API changes between adjacent Android versions, with averagely 140.8 new types, 1,505.6 new methods, and 979.2 new fields being introduced in each release. However, the Android Support Library (provided by the Android OS) only supports less than 23% of the newly added methods, with much less support for new types and fields. As a result, 91.84% of Android Apps write additional code to support different OS versions. Furthermore, 88.65% of the supporting codes share a common pattern, which directly compares variable android.os.Build.VERSION.SDK_INT with a constant version number, to use an API of particular versions. Based on our findings, we develop a new tool called IctApiFinder, to detect incompatible API usages in Android applications. IctApiFinder effectively computes the OS versions on which an API may be invoked, using an inter-procedural data-flow analysis framework. It detects numerous incompatible API usages in 361 out of 1,425 Apps. Compared to Android Lint, IctApiFinder is sound and able to reduce the false positives by 82.1%. We have reported the issues to 13 Apps developers. At present, 5 of them have already been confirmed by the original developers and 3 of them have already been fixed. ![]() ![]() |
|
He, Fei |
![]() Jianhui Chen and Fei He (Tsinghua University, China) Satisfiability modulo theories (SMT) solvers have been widely applied as the reasoning engine for diverse software analysis and verification technologies. The efficiency of the SMT solver has significant effects on the performance of these technologies. However, the current SMT solvers are designed for the general purpose of constraint solving. Many useful knowledge of programs cannot be utilized during the SMT solving. As a result, the SMT solver may spend a lot of effort to explore redundant search space. In this paper, we propose a novel approach for utilizing control-flow knowledge in SMT solving. With this technique, the search space can be considerably reduced and the efficiency of SMT solving is observably improved. We conducted extensive experiments on credible benchmarks, the results show orders of magnitude improvements of our approach. ![]() ![]() |
|
He, Pinjia |
![]() Pinjia He, Zhuangbin Chen, Shilin He, and Michael R. Lyu (Chinese University of Hong Kong, China) Logging is a common programming practice of great importance in modern software development, because software logs have been widely used in various software maintenance tasks. To provide high-quality logs, developers need to design the description text in logging statements carefully. Inappropriate descriptions will slow down or even mislead the maintenance process, such as postmortem analysis. However, there is currently a lack of rigorous guide and specifications on developer logging behaviors, which makes the construction of description text in logging statements a challenging problem. To fill this significant gap, in this paper, we systematically study what developers log, with focus on the usage of natural language descriptions in logging statements. We obtain 6 valuable findings by conducting source code analysis on 10 Java projects and 7 C# projects, which contain 28,532,975 LOC and 115,159 logging statements in total. Furthermore, our study demonstrates the potential of automated description text generation for logging statements by obtaining up to 49.04 BLEU-4 score and 62.1 ROUGE-L score using a simple information retrieval method. To facilitate future research in this field, the datasets have been publicly released. ![]() ![]() |
|
He, Shilin |
![]() Pinjia He, Zhuangbin Chen, Shilin He, and Michael R. Lyu (Chinese University of Hong Kong, China) Logging is a common programming practice of great importance in modern software development, because software logs have been widely used in various software maintenance tasks. To provide high-quality logs, developers need to design the description text in logging statements carefully. Inappropriate descriptions will slow down or even mislead the maintenance process, such as postmortem analysis. However, there is currently a lack of rigorous guide and specifications on developer logging behaviors, which makes the construction of description text in logging statements a challenging problem. To fill this significant gap, in this paper, we systematically study what developers log, with focus on the usage of natural language descriptions in logging statements. We obtain 6 valuable findings by conducting source code analysis on 10 Java projects and 7 C# projects, which contain 28,532,975 LOC and 115,159 logging statements in total. Furthermore, our study demonstrates the potential of automated description text generation for logging statements by obtaining up to 49.04 BLEU-4 score and 62.1 ROUGE-L score using a simple information retrieval method. To facilitate future research in this field, the datasets have been publicly released. ![]() ![]() |
|
He, Yilun |
![]() Yutaka Nagashima and Yilun He (Czech Technical University, Czechia; University of Innsbruck, Austria; University of Sydney, Australia) Deciding which sub-tool to use for a given proof state requires expertise specific to each interactive theorem prover (ITP). To mitigate this problem, we present PaMpeR, a proof method recommendation system for Isabelle/HOL. Given a proof state, PaMpeR recommends proof methods to discharge the proof goal and provides qualitative explanations as to why it suggests these methods. PaMpeR generates these recommendations based on existing hand-written proof corpora, thus transferring experienced users’ expertise to new users. Our evaluation shows that PaMpeR correctly predicts experienced users’ proof methods invocation especially when it comes to special purpose proof methods. ![]() ![]() |
|
Helm, Dominik |
![]() Dominik Helm, Florian Kübler, Michael Eichberg, Michael Reif, and Mira Mezini (TU Darmstadt, Germany) Analyzing methods in object-oriented programs whether they are side-effect free and also deterministic, i.e., mathematically pure, has been the target of extensive research. Identifying such methods helps to find code smells and security related issues, and also helps analyses detecting concurrency bugs. Pure methods are also used by formal verification approaches as the foundations for specifications and proving the pureness is necessary to ensure correct specifications. However, so far no common terminology exists which describes the purity of methods. Furthermore, some terms (e.g., pure or side-effect free) are also used inconsistently. Further, all current approaches only report selected purity information making them only suitable for a smaller subset of the potential use cases. In this paper, we present a fine-grained unified lattice model which puts the purity levels found in the literature into relation and which adds a new level that generalizes existing definitions. We have also implemented a scalable, modularized purity analysis which produces significantly more precise results for real-world programs than the best-performing related work. The analysis shows that all defined levels are found in real-world projects. ![]() ![]() |
|
Herckis, Lauren |
![]() Lauren Herckis (Carnegie Mellon University, USA) Software engineering research rarely impacts practitioners in the field. A desire to facilitate transfer alone is not sufficient to bridge the gap between research and practice. Fields from medicine to education have acknowledged a similar challenge over the past 25 years. An empirical approach to the translation of research into evidence-based practice has emerged from the resulting discussion. Implementation science has fundamentally changed the way that biomedical research is conducted, and has revolutionized the daily practice of doctors, social workers, epidemiologists, early childhood educators, and more. In this talk we will explore the methods, frameworks, and practices of implementation science and their application to novel disciplines, including software engineering research. We will close by proposing some directions for future software engineering research to facilitate transfer. ![]() ![]() |
|
Hilton, Michael |
![]() Michael Hilton, Jonathan Bell, and Darko Marinov (Carnegie Mellon University, USA; George Mason University, USA; University of Illinois at Urbana-Champaign, USA) Statement coverage is commonly used as a measure of test suite quality. Coverage is often used as a part of a code review process: if a patch decreases overall coverage, or is itself not covered, then the patch is scrutinized more closely. Traditional studies of how coverage changes with code evolution have examined the overall coverage of the entire program, and more recent work directly examines the coverage of patches (changed statements). We present an evaluation much larger than prior studies and moreover consider a new, important kind of change --- coverage changes of unchanged statements. We present a large-scale evaluation of code coverage evolution over 7,816 builds of 47 projects written in popular languages including Java, Python, and Scala. We find that in large, mature projects, simply measuring the change to statement coverage does not capture the nuances of code evolution. Going beyond considering statement coverage as a simple ratio, we examine how the set of statements covered evolves between project revisions. We present and study new ways to assess the impact of a patch on a project's test suite quality that both separates coverage of the patch from coverage of the non-patch, and separates changes in coverage from changes in the set of statements covered. ![]() ![]() |
|
Hoskins, John |
![]() Kevin Moran, Cody Watson, John Hoskins, George Purnell, and Denys Poshyvanyk (College of William and Mary, USA) Mobile applications have become a popular software development domain in recent years due in part to a large user base, capable hardware, and accessible platforms. However, mobile developers also face unique challenges, including pressure for frequent releases to keep pace with rapid platform evolution, hardware iteration, and user feedback. Due to this rapid pace of evolution, developers need automated support for documenting the changes made to their apps in order to aid in program comprehension. One of the more challenging types of changes to document in mobile apps are those made to the graphical user interface (GUI) due to its abstract, pixel-based representation. In this paper, we present a fully automated approach, called GCAT, for detecting and summarizing GUI changes during the evolution of mobile apps. GCAT leverages computer vision techniques and natural language generation to accurately and concisely summarize changes made to the GUI of a mobile app between successive commits or releases. We evaluate the performance of our approach in terms of its precision and recall in detecting GUI changes compared to developer specified changes, and investigate the utility of the generated change reports in a controlled user study. Our results indicate that GCAT is capable of accurately detecting and classifying GUI changes - outperforming developers - while providing useful documentation. ![]() ![]() |
|
Hotzkow, Jenny |
![]() Nataniel P. Borges Jr., Jenny Hotzkow, and Andreas Zeller (Saarland University, Germany) Android applications (apps) represent an ever increasing portion of the software market. Automated test input generators are the state of the art for testing and security analysis. We introduce DroidMate-2 (DM-2), a platform to easily assist both developers and researchers to customize, develop and test new test generators. DM-2 can be used without app instrumentation or operating system modifications, as a test generator on real devices and emulators for app testing or regression testing. Additionally, it provides sensitive resource monitoring or blocking capabilities through a lightweight app instrumentation, out-of-thebox statement coverage measurement through a fully-fledged app instrumentation and native experiment reproducibility. In our experiments we compared DM-2 against DroidBot, a state-of-the-art test generator by measuring statement coverage. Our results show that DM-2 reached 96% of its peak coverage in less than 2/3 of the time needed by DroidBot, allowing for better and more efficient tests. On short runs (5 minutes) DM-2 outperformed DroidBot by 7% while in longer runs (1 hour) this difference increases to 8%. ACM DL Artifact: https://www.doi.org/10.1145/3264864 For the details see: https://github.com/uds-se/droidmate/wiki/ASE-2018:-Data ![]() ![]() ![]() ![]() ![]() |
|
Hu, Jiajun |
![]() Jiajun Hu, Lili Wei, Yepang Liu, Shing-Chi Cheung, and Huaxun Huang (Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) WebView is a widely used Android component that augments a native app with web browser capabilities. It eases the interactions between an app’s native code and web code. However, the interaction mechanism of WebView induces new types of bugs in Android apps. Understanding the characteristics and manifestation of these WebView-induced bugs (ωBugs for short) facilitates the correct usages of WebViews in Android apps. This motivates us to conduct the first empirical study on ωBugs based on those found in popular open-source Android apps. Our study identified the major root causes and consequences of ωBugs and made interesting observations that can be leveraged for detecting and diagnosing ωBugs. Based on the empirical study, we further propose an automated testing technique ωDroid to effectively expose ωBugs in Android apps. In our experiments, ωDroid successfully discovered 30 unique and previously-unknown ωBugs when applied to 146 open-source Android apps. We reported the 30 ωBugs to the corresponding app developers. Out of these 30 ωBugs, 14 were confirmed and 7 of them were fixed. This shows that ωDroid can effectively detect ωBugs that are of the developers’ concern. ![]() ![]() |
|
Hu, Xing |
![]() Boyuan Chen, Jian Song, Peng Xu, Xing Hu, and Zhen Ming (Jack) Jiang (York University, Canada; Baidu, China) Software testing is a widely used technique to ensure the quality of software systems. Code coverage measures are commonly used to evaluate and improve the existing test suites. Based on our industrial and open source studies, existing state-of-the-art code coverage tools are only used during unit and integration testing due to issues like engineering challenges, performance overhead, and incomplete results. To resolve these issues, in this paper we have proposed an automated approach, called LogCoCo, to estimating code coverage measures using the readily available execution logs. Using program analysis techniques, LogCoCo matches the execution logs with their corresponding code paths and estimates three different code coverage criteria: method coverage, statement coverage, and branch coverage. Case studies on one open source system (HBase) and five commercial systems from Baidu and systems show that: (1) the results of LogCoCo are highly accurate (>96% in seven out of nine experiments) under a variety of testing activities (unit testing, integration testing, and benchmarking); and (2) the results of LogCoCo can be used to evaluate and improve the existing test suites. Our collaborators at Baidu are currently considering adopting LogCoCo and use it on a daily basis. ![]() ![]() |
|
Huang, Huaxun |
![]() Huaxun Huang, Lili Wei, Yepang Liu, and Shing-Chi Cheung (Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) The control flows of Android apps are largely driven by the protocols that govern how callback APIs are invoked in response to various events. When these callback APIs evolve along with the Android framework, the changes in their invocation protocols can induce unexpected control flows to existing Android apps, causing various compatibility issues. We refer to these issues as callback compatibility issues. While Android framework updates have received due attention, little is known about their impacts on app control flows and the callback compatibility issues thus induced. To bridge the gap, we examined Android documentations and conducted an empirical study on 100 real-world callback compatibility issues to investigate how these issues were induced by callback API evolutions. Based on our empirical findings, we propose a graph-based model to capture the control flow inconsistencies caused by API evolutions and devise a static analysis technique, Cider, to detect callback compatibility issues. Our evaluation of Cider on 20 popular open-source Android apps shows that Cider is effective. It detected 13 new callback compatibility issues in these apps, among which 12 issues were confirmed and 9 issues were fixed. ![]() ![]() ![]() Jiajun Hu, Lili Wei, Yepang Liu, Shing-Chi Cheung, and Huaxun Huang (Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) WebView is a widely used Android component that augments a native app with web browser capabilities. It eases the interactions between an app’s native code and web code. However, the interaction mechanism of WebView induces new types of bugs in Android apps. Understanding the characteristics and manifestation of these WebView-induced bugs (ωBugs for short) facilitates the correct usages of WebViews in Android apps. This motivates us to conduct the first empirical study on ωBugs based on those found in popular open-source Android apps. Our study identified the major root causes and consequences of ωBugs and made interesting observations that can be leveraged for detecting and diagnosing ωBugs. Based on the empirical study, we further propose an automated testing technique ωDroid to effectively expose ωBugs in Android apps. In our experiments, ωDroid successfully discovered 30 unique and previously-unknown ωBugs when applied to 146 open-source Android apps. We reported the 30 ωBugs to the corresponding app developers. Out of these 30 ωBugs, 14 were confirmed and 7 of them were fixed. This shows that ωDroid can effectively detect ωBugs that are of the developers’ concern. ![]() ![]() |
|
Huang, Jeff |
![]() Steven P. Reiss, Qi Xin, and Jeff Huang (Brown University, USA) We introduce a tool within the Code Bubbles development environment that allows for continuous execution as the programmer edits. The tool, SEEDE, shows both the intermediate and final results of execution in terms of variables, control and data flow, output, and graphics. These results are updated as the user edits. The tool can be used to help the user write new code or to find and fix bugs. The tool is explicitly designed to let the user quickly explore the execution of a method along with all the code it invokes, possibly while writing or modifying the code. The user can start continuous execution either at a breakpoint or for a test case. This paper describes the tool, its implementation, and its user interface. It presents an initial user study of the tool demonstrating its potential utility. ![]() ![]() |
|
Huang, Kaifeng |
![]() Kaifeng Huang, Bihuan Chen, Xin Peng, Daihong Zhou, Ying Wang, Yang Liu, and Wenyun Zhao (Fudan University, China; Nanyang Technological University, Singapore) Analyzing and understanding source code changes is important in a variety of software maintenance tasks. To this end, many code differencing and code change summarization methods have been proposed. For some tasks (e.g. code review and software merging), however, those differencing methods generate too fine-grained a representation of code changes, and those summarization methods generate too coarse-grained a representation of code changes. Moreover, they do not consider the relationships among code changes. Therefore, the generated differences or summaries make it not easy to analyze and understand code changes in some software maintenance tasks. In this paper, we propose a code differencing approach, named CLDIFF, to generate concise linked code differences whose granularity is in between the existing code differencing and code change summarization methods. The goal of CLDIFF is to generate more easily understandable code differences. CLDIFF takes source code files before and after changes as inputs, and consists of three steps. First, it pre-processes the source code files by pruning unchanged declara- tions from the parsed abstract syntax trees. Second, it generates concise code differences by grouping fine-grained code differences at or above the statement level and describing high-level changes in each group. Third, it links the related concise code differences according to five pre-defined links. Experiments with 12 Java projects (74,387 commits) and a human study with 10 participants have indicated the accuracy, conciseness, performance and usefulness of CLDIFF. ![]() ![]() |
|
Huang, LiGuo |
![]() Xiaoyu Liu, LiGuo Huang, and Vincent Ng (Southern Methodist University, USA; University of Texas at Dallas, USA) It is time-consuming and labor-intensive to learn and locate the correct API for programming tasks. Thus, it is beneficial to perform API recommendation automatically. The graph-based statistical model has been shown to recommend top-10 API candidates effectively. It falls short, however, in accurately recommending an actual top-1 API. To address this weakness, we propose RecRank, an approach and tool that applies a novel ranking-based discriminative approach leveraging API usage path features to improve top-1 API recommendation. Empirical evaluation on a large corpus of (1385+8) open source projects shows that RecRank significantly improves top-1 API recommendation accuracy and mean reciprocal rank when compared to state-of-the-art API recommendation approaches. ![]() ![]() |
|
Huang, Qiao |
![]() Qiao Huang, Xin Xia, Zhenchang Xing, David Lo, and Xinyu Wang (Zhejiang University, China; Monash University, Australia; Australian National University, Australia; Singapore Management University, Singapore) Developers often need to search for appropriate APIs for their programming tasks. Although most libraries have API reference documentation, it is not easy to find appropriate APIs due to the lexical gap and knowledge gap between the natural language description of the programming task and the API description in API documentation. Here, the lexical gap refers to the fact that the same semantic meaning can be expressed by different words, and the knowledge gap refers to the fact that API documentation mainly describes API functionality and structure but lacks other types of information like concepts and purposes, which are usually the key information in the task description. In this paper, we propose an API recommendation approach named BIKER (Bi-Information source based KnowledgE Recommendation) to tackle these two gaps. To bridge the lexical gap, BIKER uses word embedding technique to calculate the similarity score between two text descriptions. Inspired by our survey findings that developers incorporate Stack Overflow posts and API documentation for bridging the knowledge gap, BIKER leverages Stack Overflow posts to extract candidate APIs for a program task, and ranks candidate APIs by considering the query’s similarity with both Stack Overflow posts and API documentation. It also summarizes supplementary information (e.g., API description, code examples in Stack Overflow posts) for each API to help developers select the APIs that are most relevant to their tasks. Our evaluation with 413 API-related questions confirms the effectiveness of BIKER for both class- and method-level API recommendation, compared with state-of-the-art baselines. Our user study with 28 Java developers further demonstrates the practicality of BIKER for API search. ![]() ![]() |
|
Huang, Xiaowei |
![]() Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening (University of Oxford, UK; University of Liverpool, UK) Concolic testing combines program execution and symbolic analysis to explore the execution paths of a software program. In this paper, we develop the first concolic testing approach for Deep Neural Networks (DNNs). More specifically, we utilise quantified linear arithmetic over rationals to express test requirements that have been studied in the literature, and then develop a coherent method to perform concolic testing with the aim of better coverage. Our experimental results show the effectiveness of the concolic testing approach in both achieving high coverage and finding adversarial examples. ![]() ![]() |
|
Huang, Yi |
![]() Yi Huang, Chunyang Chen, Zhenchang Xing, Tian Lin, and Yang Liu (Australian National University, Australia; Monash University, Australia; Nanyang Technological University, Singapore) Developers can use different technologies for many software development tasks in their work. However, when faced with several technologies with comparable functionalities, it is not easy for developers to select the most appropriate one, as comparisons among technologies are time-consuming by trial and error. Instead, developers can resort to expert articles, read official documents or ask questions in QA sites for technology comparison, but it is opportunistic to get a comprehensive comparison as online information is often fragmented or contradictory. To overcome these limitations, we propose the diffTech system that exploits the crowdsourced discussions from Stack Overflow, and assists technology comparison with an informative summary of different comparison aspects. We first build a large database of comparable technologies in software engineering by mining tags in Stack Overflow, and then locate comparative sentences about comparable technologies with natural language processing methods. We further mine prominent comparison aspects by clustering similar comparative sentences and representing each cluster with its keywords. The evaluation demonstrates both the accuracy and usefulness of our model and we implement our approach into a practical website for public use. ![]() ![]() ![]() |
|
Huo, Wei |
![]() Bingchang Liu, Wei Huo, Chao Zhang, Wenchao Li, Feng Li, Aihua Piao, and Wei Zou (Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Tsinghua University, China) Binary code similarity detection (BCSD) has many applications, including patch analysis, plagiarism detection, malware detection, and vulnerability search etc. Existing solutions usually perform comparisons over specific syntactic features extracted from binary code, based on expert knowledge. They have either high performance overheads or low detection accuracy. Moreover, few solutions are suitable for detecting similarities between cross-version binaries, which may not only diverge in syntactic structures but also diverge slightly in semantics. In this paper, we propose a solution αDiff, employing three semantic features, to address the cross-version BCSD challenge. It first extracts the intra-function feature of each binary function using a deep neural network (DNN). The DNN works directly on raw bytes of each function, rather than features (e.g., syntactic structures) provided by experts. αDiff further analyzes the function call graph of each binary, which are relatively stable in cross-version binaries, and extracts the inter-function and inter-module features. Then, a distance is computed based on these three features and used for BCSD. We have implemented a prototype of αDiff, and evaluated it on a dataset with about 2.5 million samples. The result shows that αDiff outperforms state-of-the-art static solutions by over 10 percentages on average in different BCSD settings. ![]() ![]() |
|
Hurum, Arne Erik |
![]() Davrondzhon Gafurov, Arne Erik Hurum, and Martin Markman (Norwegian Directorate of eHealth, Norway) We present a process driven test automation solution which enables delegating (part of) automation tasks from test automation engineer (expensive resource) to test analyst (non-developer, less expensive). In our approach, a test automation engineer implements test steps (or actions) which are executed automatically. Such automated test steps represent user actions in the system under test and specified by a natural language which is understandable by a non-technical person. Then, a test analyst with a domain knowledge organizes automated steps combined with test input to create an automated test case. It should be emphasized that the test analyst does not need to possess programming skills to create, modify or execute automated test cases. We refine benchmark test automation architecture to be better suitable for an effective separation and sharing of responsibilities between the test automation engineer (with coding skills) and test analyst (with a domain knowledge). In addition, we propose a metric to empirically estimate cooperation between test automation engineer and test analyst's works. The proposed automation solution has been defined based on our experience in the development and maintenance of Helsenorg, the national electronic health services in Norway which has had over one million of visits per month past year, and we still use it to automate the execution of regression tests. ![]() ![]() |
|
Ivančić, Franjo |
![]() Nipun Arora, Jonathan Bell, Franjo Ivančić, Gail Kaiser, and Baishakhi Ray (Dropbox, USA; George Mason University, USA; Google, USA; Columbia University, USA) Short time-to-localize and time-to-fix for production bugs is extremely important for any 24x7 service-oriented application (SOA). Debugging buggy behavior in deployed applications is hard, as it requires careful reproduction of a similar environment and workload. Prior approaches for automatically reproducing production failures do not scale to large SOA systems. Our key insight is that for many failures in SOA systems (e.g., many semantic and performance bugs), a failure can automatically be reproduced solely by relaying network packets to replicas of suspect services, an insight that we validated through a manual study of 16 real bugs across five different systems. This paper presents Parikshan, an application monitoring framework that leverages user-space virtualization and network proxy technologies to provide a sandbox “debug” environment. In this “debug” environment, developers are free to attach debuggers and analysis tools without impacting performance or correctness of the production environment. In comparison to existing monitoring solutions that can slow down production applications, Parikshan allows application monitoring at significantly lower overhead. ![]() ![]() |
|
Janes, Andrea |
![]() Andrea Janes, Michael Mairegger, and Barbara Russo (Free University of Bolzano, Italy) As a developer, it is often complex to foresee the impact of changes in source code on usage, e.g., it is time-consuming to find out all components that will be impacted by a change or estimate the impact on the usability of a failing piece of code. It is therefore hard to decide how much effort in quality assurance is justifiable to obtain the desired business goals. In this paper, to reduce the difficulty for developers to understand the importance of source code, we propose an automated way to provide this information to developers as they are working on a given piece of code. As a proof-of-concept, we developed a plug-in for Microsoft Visual Studio Code that informs about the importance of source code methods based on the frequency of usage by the end-users of the developed software. The plug-in aims to increase the awareness developers have about the importance of source code in an unobtrusive way, helping them to prioritize their effort to quality assurance, technical excellence, and usability. code_call_lens can be downloaded from GitHub at https://github.com/xxMUROxx/vscode.code_call_lens. ![]() ![]() |
|
Ji, Chao |
![]() Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Wenhai Li, Chao Ji, and Dan Ding (Fudan University, China; University of Illinois at Urbana-Champaign, USA; Singapore University of Technology and Design, Singapore) Debugging microservice systems involves the deployment and manipulation of microservice systems on a containerized environment and faces unique challenges due to the high complexity and dynamism of microservices. To address these challenges, in this paper, we propose a debugging approach for microservice systems based on the delta debugging algorithm, which is to minimize failureinducing deltas of circumstances (e.g., deployment, environmental configurations) for effective debugging. Our approach includes novel techniques for defining, deploying/manipulating, and executing deltas following the idea of delta debugging. In particular, to construct a (failing) circumstance space for delta debugging to minimize, our approach defines a set of dimensions that can affect the execution of microservice systems. Our experimental study on a medium-size microservice benchmark system shows that our approach can effectively identify failure-inducing deltas that help diagnose the root causes. ![]() ![]() |
|
Jiang, Bo |
![]() Bo Jiang, Ye Liu, and W. K. Chan (Beihang University, China; City University of Hong Kong, China) Decentralized cryptocurrencies feature the use of blockchain to transfer values among peers on networks without central agency. Smart contracts are programs running on top of the blockchain consensus protocol to enable people make agreements while minimizing trusts. Millions of smart contracts have been deployed in various decentralized applications. The security vulnerabilities within those smart contracts pose significant threats to their applications. Indeed, many critical security vulnerabilities within smart contracts on Ethereum platform have caused huge financial losses to their users. In this work, we present ContractFuzzer, a novel fuzzer to test Ethereum smart contracts for security vulnerabilities. ContractFuzzer generates fuzzing inputs based on the ABI specifications of smart contracts, defines test oracles to detect security vulnerabilities, instruments the EVM to log smart contracts runtime behaviors, and analyzes these logs to report security vulnerabilities. Our fuzzing of 6991 smart contracts has flagged more than 459 vulnerabilities with high precision. In particular, our fuzzing tool successfully detects the vulnerability of the DAO contract that leads to USD 60 million loss and the vulnerabilities of Parity Wallet that have led to the loss of USD 30 million and the freezing of USD 150 million worth of Ether. ![]() ![]() ![]() ![]() ![]() |
|
Jiang, Jie-Hong R. |
![]() Hung-En Wang, Shih-Yu Chen, Fang Yu, and Jie-Hong R. Jiang (National Taiwan University, Taiwan; National Chengchi University, Taiwan) Strings with length constraints are prominent in software security analysis. Recent endeavors have made significant progress in developing constraint solvers for strings and integers. Most prior methods are based on deduction with inference rules or analysis using automata. The former may be inefficient when the constraints involve complex string manipulations such as language replacement; the latter may not be easily extended to handle length constraints and may be inadequate for counterexample generation due to approximation. Inspired by recent work on string analysis with logic circuit representation, we propose a new method for solving string with length constraints by an implicit representation of automata with length encoding. The length-encoded automata are of infinite states and can represent languages beyond regular expressions. By converting string and length constraints into a dependency graph of manipulations over length-encoded automata, a symbolic model checker for infinite state systems can be leveraged as an engine for the analysis of string and length constraints. Experiments show that our method has its unique capability of handling complex string and length constraints not solvable by existing methods. ![]() ![]() ![]() |
|
Jiang, Yanyan |
![]() Yuju Shen, Yanyan Jiang, Chang Xu, Ping Yu, Xiaoxing Ma, and Jian Lu (Nanjing University, China) Regular expression (regex) with modern extensions is one of the most popular string processing tools. However, poorly-designed regexes can yield exponentially many matching steps, and lead to regex Denial-of-Service (ReDoS) attacks under well-conceived string inputs. This paper presents Rescue, a three-phase gray-box analytical technique, to automatically generate ReDoS strings to highlight vulnerabilities of given regexes. Rescue systematically seeds (by a genetic search), incubates (by another genetic search), and finally pumps (by a regex-dedicated algorithm) for generating strings with maximized search time. We implemenmted the Rescue tool and evaluated it against 29,088 practical regexes in real-world projects. The evaluation results show that Rescue found 49% more attack strings compared with the best existing technique, and applying Rescue to popular GitHub projects discovered ten previously unknown ReDoS vulnerabilities. ![]() ![]() ![]() |
|
Jiang, Yu |
![]() Han Liu, Chao Liu, Wenqi Zhao, Yu Jiang, and Jiaguang Sun (Tsinghua University, China; Peking University, China; Ant Financial, China) Smart contracts, as a promising and powerful application on the Ethereum blockchain, have been growing rapidly in the past few years. Since they are highly vulnerable to different forms of attacks, their security becomes a top priority. However, existing security auditing techniques are either limited in fnding vulnerabilities (rely on pre-defned bug paterns) or very expensive (rely on program analysis), thus are insufcient for Ethereum. To mitigate these limitations, we proposed a novel semanticaware security auditing technique called S-gram for Ethereum. The key insight is a combination of N-gram language modeling and lightweight static semantic labeling, which can learn statistical regularities of contract tokens and capture high-level semantics as well (e.g., flow sensitivity of a transaction). S-gram can be used to predict potential vulnerabilities by identifying irregular token sequences and optimize existing in-depth analyzers (e.g., symbolic execution engines, fuzzers etc.). We have implemented S-gram for Solidity smart contracts in Ethereum. The evaluation demonstrated the potential of S-gram in identifying possible security issues. ![]() ![]() ![]() Jian Gao, Xin Yang, Ying Fu, Yu Jiang, and Jiaguang Sun (Tsinghua University, China) Code reuse improves software development efficiency, however, vulnerabilities can be introduced inadvertently. Many existing works compute the code similarity based on CFGs to determine whether a binary function contains a known vulnerability. Unfortunately, their performance in cross-platform binary search is challenged. This paper presents VulSeeker, a semantic learning based vulnerability seeker for cross-platform binary. Given a target function and a vulnerable function, VulSeeker first constructs the labeled semantic flow graphs and extracts basic block features as numerical vectors for both of them. Then the embedding vector of the whole binary function is generated by feeding the numerical vectors of basic blocks to the customized semantics aware DNN model. Finally, the similarity of the two binary functions is measured based on the Cosine distance. The experimental results show that VulSeeker outperforms the state-of-the-art approaches in terms of accuracy. For example, compared to the most recent and related work Gemini, VulSeeker finds 50.00% more vulnerabilities in the top-10 candidates and 13.89% more in the top-50 candidates, and improves the values of AUC and ACC for 8.23% and 12.14% respectively. The video is presented at https://youtu.be/Mw0mr84gpI8. ![]() ![]() |
|
Jiang, Zhen Ming (Jack) |
![]() Boyuan Chen, Jian Song, Peng Xu, Xing Hu, and Zhen Ming (Jack) Jiang (York University, Canada; Baidu, China) Software testing is a widely used technique to ensure the quality of software systems. Code coverage measures are commonly used to evaluate and improve the existing test suites. Based on our industrial and open source studies, existing state-of-the-art code coverage tools are only used during unit and integration testing due to issues like engineering challenges, performance overhead, and incomplete results. To resolve these issues, in this paper we have proposed an automated approach, called LogCoCo, to estimating code coverage measures using the readily available execution logs. Using program analysis techniques, LogCoCo matches the execution logs with their corresponding code paths and estimates three different code coverage criteria: method coverage, statement coverage, and branch coverage. Case studies on one open source system (HBase) and five commercial systems from Baidu and systems show that: (1) the results of LogCoCo are highly accurate (>96% in seven out of nine experiments) under a variety of testing activities (unit testing, integration testing, and benchmarking); and (2) the results of LogCoCo can be used to evaluate and improve the existing test suites. Our collaborators at Baidu are currently considering adopting LogCoCo and use it on a daily basis. ![]() ![]() |
|
Jin, Yangxu |
![]() Yu Qu, Ting Liu, Jianlei Chi, Yangxu Jin, Di Cui, Ancheng He, and Qinghua Zheng (Xi'an Jiaotong University, China) Network measures have been proved to be useful in predicting software defects. Leveraging the dependency relationships between software modules, network measures can capture various structural features of software systems. However, existing studies have relied on user-defined network measures (e.g., degree statistics or centrality metrics), which are inflexible and require high computation cost, to describe the structural features. In this paper, we propose a new method called node2defect which uses a newly proposed network embedding technique, node2vec, to automatically learn to encode dependency network structure into low-dimensional vector spaces to improve software defect prediction. Specifically, we firstly construct a program's Class Dependency Network. Then node2vec is used to automatically learn structural features of the network. After that, we combine the learned features with traditional software engineering features, for accurate defect prediction. We evaluate our method on 15 open source programs. The experimental results show that in average, node2defect improves the state-of-the-art approach by 9.15% in terms of F-measure. ![]() ![]() |
|
Juefei-Xu, Felix |
![]() Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang (Harbin Institute of Technology, China; Nanyang Technological University, Singapore; Carnegie Mellon University, USA; Kyushu University, Japan; University of Illinois at Urbana-Champaign, USA; Monash University, Australia) Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems. ![]() ![]() |
|
Kaiser, Gail |
![]() Nipun Arora, Jonathan Bell, Franjo Ivančić, Gail Kaiser, and Baishakhi Ray (Dropbox, USA; George Mason University, USA; Google, USA; Columbia University, USA) Short time-to-localize and time-to-fix for production bugs is extremely important for any 24x7 service-oriented application (SOA). Debugging buggy behavior in deployed applications is hard, as it requires careful reproduction of a similar environment and workload. Prior approaches for automatically reproducing production failures do not scale to large SOA systems. Our key insight is that for many failures in SOA systems (e.g., many semantic and performance bugs), a failure can automatically be reproduced solely by relaying network packets to replicas of suspect services, an insight that we validated through a manual study of 16 real bugs across five different systems. This paper presents Parikshan, an application monitoring framework that leverages user-space virtualization and network proxy technologies to provide a sandbox “debug” environment. In this “debug” environment, developers are free to attach debuggers and analysis tools without impacting performance or correctness of the production environment. In comparison to existing monitoring solutions that can slow down production applications, Parikshan allows application monitoring at significantly lower overhead. ![]() ![]() |
|
Karande, Shirish |
![]() Manasi Patwardhan, Abhishek Sainani, Richa Sharma, Shirish Karande, and Smita Ghaisas (TCS Research, India) Compliant software is a critical need of all modern businesses. Disambiguating regulations to derive requirements is therefore an important software engineering activity. Regulations however are ridden with ambiguities that make their comprehension a challenge, seemingly surmountable only by legal experts. Since legal experts' involvement in every project is expensive, approaches to automate the disambiguation need to be explored. These approaches however require a large amount of annotated data. Collecting data exclusively from experts is not a scalable and affordable solution. In this paper, we present the results of a crowd sourcing experiment to collect annotations on ambiguities in regulations from professional software engineers. We discuss an approach to automate the arduous and critical step of identifying ground truth labels by employing crowd consensus using Expectation Maximization (EM). We demonstrate that the annotations reaching a consensus match those of experts with an accuracy of 87%. ![]() ![]() |
|
Karkazis, Lambros |
![]() John-Paul Ore, Sebastian Elbaum, Carrick Detweiler, and Lambros Karkazis (University of Nebraska-Lincoln, USA) Type annotations provide a link between program variables and domain-specific types. When combined with a type system, these annotations can enable early fault detection. For type annotations to be cost-effective in practice, they need to be both accurate and affordable for developers. We lack, however, an understanding of how burdensome type annotation is for developers. Hence, this work explores three fundamental questions: 1) how accurately do developers make type annotations; 2) how long does a single annotation take; and, 3) if a system could automatically suggest a type annotation, how beneficial to accuracy are correct suggestions and how detrimental are incorrect suggestions? We present results of a study of 71 programmers using 20 random code artifacts that contain variables with physical unit types that must be annotated. Subjects choose a correct type annotation only 51% of the time and take an average of 136 seconds to make a single correct annotation. Our qualitative analysis reveals that variable names and reasoning over mathematical operations are the leading clues for type selection. We find that suggesting the correct type boosts accuracy to 73%, while making a poor suggestion decreases accuracy to 28%. We also explore what state-of-the-art automated type annotation systems can and cannot do to help developers with type annotations, and identify implications for tool developers. ![]() ![]() |
|
Kazman, Rick |
![]() Ran Mo, Will Snipes, Yuanfang Cai, Srini Ramaswamy, Rick Kazman, and Martin Naedele (Central China Normal University, China; ABB Corporate Research, USA; Drexel University, USA; ABB, USA; Carnegie Mellon University, USA; University of Hawaii, USA; ABB, Switzerland) In this paper, we report our experiences of applying three complementary automated software architecture analysis techniques, supported by a tool suite, called DV8, to 8 industrial projects within a large company. DV8 includes two state-of-the-art architecture-level maintainability metrics—Decoupling Level and Propagation Cost, an architecture flaw detection tool, and an architecture root detection tool. We collected development process data from the project teams as input to these tools, reported the results back to the practitioners, and followed up with telephone conferences and interviews. Our experiences revealed that the metrics scores, quantitative debt analysis, and architecture flaw visualization can effectively bridge the gap between management and development, help them decide if, when, and where to refactor. In particular, the metrics scores, compared against industrial benchmarks, faithfully reflected the practitioners’ intuitions about the maintainability of their projects, and enabled them to better understand the maintainability relative to other projects internal to their company, and to other industrial products. The automatically detected architecture flaws and roots enabled the practitioners to precisely pinpoint, visualize, and quantify the “hotspots" within the systems that are responsible for high maintenance costs. Except for the two smallest projects for which both architecture metrics indicated high maintainability, all other projects are planning or have already begun refactorings to address the problems detected by our analyses. We are working on further automating the tool chain, and transforming the analysis suite into deployable services accessible by all projects within the company. ![]() ![]() |
|
Kessentini, Marouane |
![]() Vahid Alizadeh and Marouane Kessentini (University of Michigan, USA) Refactoring is nowadays widely adopted in the industry because bad design decisions can be very costly and extremely risky. On the one hand, automated refactoring does not always lead to the desired design. On the other hand, manual refactoring is error-prone, time-consuming and not practical for radical changes. Thus, recent research trends in the field focused on integrating developers feedback into automated refactoring recommendations because developers understand the problem domain intuitively and may have a clear target design in mind. However, this interactive process can be repetitive, expensive, and tedious since developers must evaluate recommended refactorings, and adapt them to the targeted design especially in large systems where the number of possible strategies can grow exponentially. In this paper, we propose an interactive approach combining the use of multi-objective and unsupervised learning to reduce the developer's interaction effort when refactoring systems. We generate, first, using multi-objective search different possible refactoring strategies by finding a trade-off between several conflicting quality attributes. Then, an unsupervised learning algorithm clusters the different trade-off solutions, called the Pareto front, to guide the developers in selecting their region of interests and reduce the number of refactoring options to explore. The feedback from the developer, both at the cluster and solution levels, are used to automatically generate constraints to reduce the search space in the next iterations and focus on the region of developer preferences. We selected 14 active developers to manually evaluate the effectiveness our tool on 5 open source projects and one industrial system. The results show that the participants found their desired refactorings faster and more accurate than the current state of the art. ![]() ![]() |
|
Khurshid, Sarfraz |
![]() Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid (University of Texas at Austin, USA; Southern University of Science and Technology, China; University of Texas at Dallas, USA) While Deep Neural Networks (DNNs) have established the fundamentals of image-based autonomous driving systems, they may exhibit erroneous behaviors and cause fatal accidents. To address the safety issues in autonomous driving systems, a recent set of testing techniques have been designed to automatically generate artificial driving scenes to enrich test suite, e.g., generating new input images transformed from the original ones. However, these techniques are insufficient due to two limitations: first, many such synthetic images often lack diversity of driving scenes, and hence compromise the resulting efficacy and reliability. Second, for machine-learning-based systems, a mismatch between training and application domain can dramatically degrade system accuracy, such that it is necessary to validate inputs for improving system robustness. In this paper, we propose DeepRoad, an unsupervised DNN-based framework for automatically testing the consistency of DNN-based autonomous driving systems and online validation. First, DeepRoad automatically synthesizes large amounts of diverse driving scenes without using image transformation rules (e.g. scale, shear and rotation). In particular, DeepRoad is able to produce driving scenes with various weather conditions (including those with rather extreme conditions) by applying Generative Adversarial Networks (GANs) along with the corresponding real-world weather scenes. Second, DeepRoad utilizes metamorphic testing techniques to check the consistency of such systems using synthetic images. Third, DeepRoad validates input images for DNN-based systems by measuring the distance of the input and training images using their VGGNet features. We implement DeepRoad to test three well-recognized DNN-based autonomous driving systems in Udacity self-driving car challenge. The experimental results demonstrate that DeepRoad can detect thousands of inconsistent behaviors for these systems, and effectively validate input images to potentially enhance the system robustness as well. ![]() ![]() ![]() Kaiyuan Wang, Allison Sullivan, and Sarfraz Khurshid (University of Texas at Austin, USA) Automated program repair is an active research area. However, existing research focuses mostly on imperative code, e.g. in Java. In this paper, we study the problem of repairing declarative models in Alloy -- a first order relational logic with transitive closure. We introduce ARepair, the first technique for repairing Alloy models. ARepair follows the spirit of traditional automated program repair techniques. Specifically, ARepair takes as input a faulty Alloy model and a test suite that contains some failing test, and outputs a repaired model that is correct with respect to the given tests. ARepair integrates ideas from mutation testing and program synthesis to provide an effective solution for repairing Alloy models. The experimental results show that ARepair can fix 28 out of 38 real-world faulty models we collected. ![]() ![]() |
|
Konat, Gabriël |
![]() Gabriël Konat, Sebastian Erdweg, and Eelco Visser (Delft University of Technology, Netherlands) Incremental build systems are essential for fast, reproducible software builds. Incremental build systems enable short feedback cycles when they capture dependencies precisely and selectively execute build tasks efficiently. A much overlooked feature of build systems is the expressiveness of the scripting language, which directly influences the maintainability of build scripts. In this paper, we present a new incremental build algorithm that allows build engineers to use a full-fledged programming language with explicit task invocation, value and file inspection facilities, and conditional and iterative language constructs. In contrast to prior work on incrementality for such programmable builds, our algorithm scales with the number of tasks affected by a change and is independent of the size of the software project being built. Specifically, our algorithm accepts a set of changed files, transitively detects and re-executes affected build tasks, but also accounts for new task dependencies discovered during building. We have evaluated the performance of our algorithm in a real-world case study and confirm its scalability. ![]() ![]() |
|
Korel, Bogdan |
![]() Nada Almasri, Luay Tahat, and Bogdan Korel (Gulf University for Science and Technology, Kuwait; Illinois Institute of Technology, USA) Software maintenance is becoming more challenging with the increased complexity of the software and the frequently applied changes. Performing impact analysis before the actual implementation of a change is a crucial task during system maintenance. While many tools and techniques are available to measure the impact of a change at the code level, only a few research work is done to measure the impact of a change at an earlier stage in the development process. Measuring the impact of a change at the model level speeds up the maintenance process allowing early discovery of critical components of the system before applying the actual change at the code level. In this paper, we present model-based impact analysis approach for state-based systems such as telecommunication or embedded systems. The proposed approach uses model dependencies to automatically measure the expected impact for a requested change instead of relying on the expertise of system maintainers, and it generates two impact sets representing the lower bound and the upper bound of the impact. Although it can be extended to other behavioral models, the presented approach mainly addresses extended finite-state machine (EFSM) models. An empirical study is conducted on six EFSM models to investigate the usefulness of the proposed approach. The results show that on average the size of the impact after a single modification (a change in a one EFSM transition) ranges between 14 and 38 % of the total size of the model. For a modification involving multiple transitions, the average size of the impact ranges between 30 and 64 % of the total size of the model. Additionally, we investigated the relationships (correlation) between the structure of the EFSM model, and the size of the impact sets. Upon preliminary analysis of the correlation, the concepts of model density and data density were defined, and it was found that they could be the major factors influencing the sizes of impact sets for models. As a result, these factors can be used to determine the types of models for which the proposed approach is the most appropriate. ![]() ![]() |
|
Kotheimer, John |
![]() Rijnard van Tonder, John Kotheimer, and Claire Le Goues (Carnegie Mellon University, USA) Precise crash triage is important for automated dynamic testing tools, like fuzzers. At scale, fuzzers produce millions of crashing inputs. Fuzzers use heuristics, like stack hashes, to cut down on duplicate bug reports. These heuristics are fast, but often imprecise: even after deduplication, hundreds of uniquely reported crashes can still correspond to the same bug. Remaining crashes must be inspected manually, incurring considerable effort. In this paper we present Semantic Crash Bucketing, a generic method for precise crash bucketing using program transformation. Semantic Crash Bucketing maps crashing inputs to unique bugs as a function of changing a program (i.e., a semantic delta). We observe that a real bug fix precisely identifies crashes belonging to the same bug. Our insight is to approximate real bug fixes with lightweight program transformation to obtain the same level of precision. Our approach uses (a) patch templates and (b) semantic feedback from the program to automatically generate and apply approximate fixes for general bug classes. Our evaluation shows that approximate fixes are competitive with using true fixes for crash bucketing, and significantly outperforms built-in deduplication techniques for three state of the art fuzzers. ![]() ![]() |
|
Kovalenko, Vladimir |
![]() Vladimir Kovalenko, Fabio Palomba, and Alberto Bacchelli (Delft University of Technology, Netherlands; University of Zurich, Switzerland) Modern distributed version control systems, such as Git, offer support for branching — the possibility to develop parts of software outside the master trunk. Consideration of the repository structure in Mining Software Repository (MSR) studies requires a thorough approach to mining, but there is no well-documented, widespread methodology regarding the handling of merge commits and branches. Moreover, there is still a lack of knowledge of the extent to which considering branches during MSR studies impacts the results of the studies. In this study, we set out to evaluate the importance of proper handling of branches when calculating file modification histories. We analyze over 1,400 Git repositories of four open source ecosystems and compute modification histories for over two million files, using two different algorithms. One algorithm only follows the first parent of each commit when traversing the repository, the other returns the full modification history of a file across all branches. We show that the two algorithms consistently deliver different results, but the scale of the difference varies across projects and ecosystems. Further, we evaluate the importance of accurate mining of file histories by comparing the performance of common techniques that rely on file modification history — reviewer recommendation, change recommendation, and defect prediction — for two algorithms of file history retrieval. We find that considering full file histories leads to an increase in the techniques’ performance that is rather modest. ![]() ![]() ![]() |
|
Krismayer, Thomas |
![]() Thomas Krismayer (JKU Linz, Austria) The behavior of complex software-intensive systems of systems often only fully emerges during operation, when all systems interact with each other and with their environment. Runtime monitoring approaches are thus used to detect deviations from the expected behavior, which is commonly defined by engineers, e.g., using temporal logic or domain-specific languages. However, the deep domain knowledge required to specify constraints is often not available during the development of systems of systems with multiple teams independently working on heterogeneous components. In this paper, we thus describe our ongoing PhD research to automatically mine constraints for runtime monitoring from recorded events. Our approach mines constraints on event occurrence, timing, data, and combinations of these properties. The approach further presents the mined constraints to users offering multiple ranking strategies and can also be used to support users in system evolution scenarios. ![]() ![]() |
|
Kroening, Daniel |
![]() Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening (University of Oxford, UK; University of Liverpool, UK) Concolic testing combines program execution and symbolic analysis to explore the execution paths of a software program. In this paper, we develop the first concolic testing approach for Deep Neural Networks (DNNs). More specifically, we utilise quantified linear arithmetic over rationals to express test requirements that have been studied in the literature, and then develop a coherent method to perform concolic testing with the aim of better coverage. Our experimental results show the effectiveness of the concolic testing approach in both achieving high coverage and finding adversarial examples. ![]() ![]() |
|
Kübler, Florian |
![]() Dominik Helm, Florian Kübler, Michael Eichberg, Michael Reif, and Mira Mezini (TU Darmstadt, Germany) Analyzing methods in object-oriented programs whether they are side-effect free and also deterministic, i.e., mathematically pure, has been the target of extensive research. Identifying such methods helps to find code smells and security related issues, and also helps analyses detecting concurrency bugs. Pure methods are also used by formal verification approaches as the foundations for specifications and proving the pureness is necessary to ensure correct specifications. However, so far no common terminology exists which describes the purity of methods. Furthermore, some terms (e.g., pure or side-effect free) are also used inconsistently. Further, all current approaches only report selected purity information making them only suitable for a smaller subset of the potential use cases. In this paper, we present a fine-grained unified lattice model which puts the purity levels found in the literature into relation and which adds a new level that generalizes existing definitions. We have also implemented a scalable, modularized purity analysis which produces significantly more precise results for real-world programs than the best-performing related work. The analysis shows that all defined levels are found in real-world projects. ![]() ![]() |
|
Kulkarni, Sandeep |
![]() Mohammad Roohitavaf and Sandeep Kulkarni (Michigan State University, USA) We present our framework DKVF that enables one to quickly prototype and evaluate new consistency protocols for key-value stores. DKVF is designed based on the separation of concerns in creating distributed data stores. This separation of concerns allows the designers of consistency protocols to only focus on the high-level consistency protocols which gives them the opportunity to quickly deploy a consistency protocol and evaluate its performance. Moreover, the loose coupling of the different components allows us to easily change different components (e.g. storage engine) of an implementation. We demonstrate DKVF by implementing four existing protocols --eventual consistency, COPS, GentleRain and CausalSpartan-- with it. The implementation of these protocols was very convenient with DKVF, as it only required to write a piece of code for the consistency component that is very close to the pseudocode of the original papers. Hence, it was possible to achieve this in just 1-2 days per protocol. DKVF also comes with a toolset that facilitates running clusters and performing experiments. Tutorial video: https://www.youtube.com/watch?v=MFJQzsJkwfc&list=PLErtSVEHsnBJvoQQI6iqGn61oNrUVVuST ![]() ![]() ![]() ![]() |
|
Kwiatkowska, Marta |
![]() Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening (University of Oxford, UK; University of Liverpool, UK) Concolic testing combines program execution and symbolic analysis to explore the execution paths of a software program. In this paper, we develop the first concolic testing approach for Deep Neural Networks (DNNs). More specifically, we utilise quantified linear arithmetic over rationals to express test requirements that have been studied in the literature, and then develop a coherent method to perform concolic testing with the aim of better coverage. Our experimental results show the effectiveness of the concolic testing approach in both achieving high coverage and finding adversarial examples. ![]() ![]() |
|
Lahiri, Shuvendu K. |
![]() Chungha Sung, Shuvendu K. Lahiri, Constantin Enea, and Chao Wang (University of Southern California, USA; Microsoft Research, USA; University of Paris Diderot, France) When an evolving program is modified to address issues related to thread synchronization, there is a need to confirm the change is correct, i.e., it does not introduce unexpected behavior. However, manually comparing two programs to identify the semantic difference is labor intensive and error prone, whereas techniques based on model checking are computationally expensive. To fill the gap, we develop a fast and approximate static analysis for computing synchronization differences of two programs. The method is fast because, instead of relying on heavy-weight model checking techniques, it leverages a polynomial-time Datalog-based program analysis framework to compute differentiating data-flow edges, i.e., edges allowed by one program but not the other. Although approximation is used our method is sufficiently accurate due to careful design of the Datalog inference rules and iterative increase of the required data-flow edges for representing a difference. We have implemented our method and evaluated it on a large number of multithreaded C programs to confirm its ability to produce, often within seconds, the same differences obtained by human; in contrast, prior techniques based on model checking take minutes or even hours and thus can be 10x to 1000x slower. ![]() ![]() ![]() |
|
Laser, Marcelo Schmitt |
![]() Yixue Zhao, Paul Wat, Marcelo Schmitt Laser, and Nenad Medvidović (University of Southern California, USA) Network latency in mobile software has a large impact on user experience, with potentially severe economic consequences. Prefetching and caching have been shown effective in reducing the latencies in browser-based systems. However, those techniques cannot be directly applied to the emerging domain of mobile apps because of the differences in network interactions. Moreover, there is a lack of research on prefetching and caching techniques that may be suitable for the mobile app domain, and it is not clear whether such techniques can be effective or whether they are even feasible. This paper takes the first step toward answering these questions by conducting a comprehensive study to understand the characteristics of HTTP requests in over 1,000 popular Android apps. Our work focuses on the prefetchability of requests using static program analysis techniques and cacheability of resulting responses. We find that there is a substantial opportunity to leverage prefetching and caching in mobile apps, but that suitable techniques must take into account the nature of apps’ network interactions and idiosyncrasies such as untrustworthy HTTP header information. Our observations provide guidelines for developers to utilize prefetching and caching schemes in app development, and motivate future research in this area. ![]() ![]() |
|
Lee, Seonho |
![]() Sooyoung Cha, Seonho Lee, and Hakjoo Oh (Korea University, South Korea) We present template-guided concolic testing, a new technique for effectively reducing the search space in concolic testing. Addressing the path-explosion problem has been a significant challenge in concolic testing. Diverse search heuristics have been proposed to mitigate this problem but using search heuristics alone is not sufficient to substantially improve code coverage for real-world programs. The goal of this paper is to complement existing techniques and achieve higher coverage by exploiting templates in concolic testing. In our approach, a template is a partially symbolized input vector whose job is to reduce the search space. However, choosing a right set of templates is nontrivial and significantly affects the final performance of our approach. We present an algorithm that automatically learns useful templates online, based on data collected from previous runs of concolic testing. The experimental results with open-source programs show that our technique achieves greater branch coverage and finds bugs more effectively than conventional concolic testing. ![]() ![]() |
|
Lee, Yugyung |
![]() Gharib Gharibi, Rashmi Tripathi, and Yugyung Lee (University of Missouri-Kansas City, USA) A static call graph is an imperative prerequisite used in most interprocedural analyses and software comprehension tools. However, there is a lack of software tools that can automatically analyze the Python source-code and construct its static call graph. In this paper, we introduce a prototype Python tool, named code2graph, which automates the tasks of (1) analyzing the Python source-code and extracting its structure, (2) constructing static call graphs from the source code, and (3) generating a similarity matrix of all possible execution paths in the system. Our goal is twofold: First, assist the developers in understanding the overall structure of the system. Second, provide a stepping stone for further research that can utilize the tool in software searching and similarity detection applications. For example, clustering the execution paths into a logical workflow of the system would be applied to automate specific software tasks. Code2graph has been successfully used to generate static call graphs and similarity matrices of the paths for three popular open-source Deep Learning projects (TensorFlow, Keras, PyTorch). A tool demo is available at https://youtu.be/ecctePpcAKU. ![]() ![]() ![]() |
|
Le Goues, Claire |
![]() Rijnard van Tonder, John Kotheimer, and Claire Le Goues (Carnegie Mellon University, USA) Precise crash triage is important for automated dynamic testing tools, like fuzzers. At scale, fuzzers produce millions of crashing inputs. Fuzzers use heuristics, like stack hashes, to cut down on duplicate bug reports. These heuristics are fast, but often imprecise: even after deduplication, hundreds of uniquely reported crashes can still correspond to the same bug. Remaining crashes must be inspected manually, incurring considerable effort. In this paper we present Semantic Crash Bucketing, a generic method for precise crash bucketing using program transformation. Semantic Crash Bucketing maps crashing inputs to unique bugs as a function of changing a program (i.e., a semantic delta). We observe that a real bug fix precisely identifies crashes belonging to the same bug. Our insight is to approximate real bug fixes with lightweight program transformation to obtain the same level of precision. Our approach uses (a) patch templates and (b) semantic feedback from the program to automatically generate and apply approximate fixes for general bug classes. Our evaluation shows that approximate fixes are competitive with using true fixes for crash bucketing, and significantly outperforms built-in deduplication techniques for three state of the art fuzzers. ![]() ![]() |
|
Lemberger, Thomas |
![]() Dirk Beyer and Thomas Lemberger (LMU Munich, Germany) We present CPA-SymExec, a tool for symbolic execution that is implemented in the open-source, configurable verification framework CPAchecker. Our implementation automatically detects which symbolic facts to track, in order to obtain a small set of constraints that are necessary to decide reachability of a program area of interest. CPA-SymExec is based on abstraction and counterexample-guided abstraction refinement (CEGAR), and uses a constraint-interpolation approach to detect symbolic facts. We show that our implementation can better mitigate the path-explosion problem than symbolic execution without abstraction, by comparing the performance to the state-of-the-art Klee-based symbolic-execution engine Symbiotic and to Klee itself. For the experiments we use two kinds of analysis tasks: one for finding an executable path to a specific location of interest (e.g., if a test vector is desired to show that a certain behavior occurs), and one for confirming that no executable path to a specific location exists (e.g., if it is desired to show that a certain behavior never occurs). CPA-SymExec is released under the Apache 2 license and available (inclusive source code) at https://cpachecker.sosy-lab.org. A demonstration video is available at https://youtu.be/qoBHtvPKtnw. ![]() ![]() ![]() ![]() ![]() ![]() |
|
Lemieux, Caroline |
![]() Caroline Lemieux and Koushik Sen (University of California at Berkeley, USA) In recent years, fuzz testing has proven itself to be one of the most effective techniques for finding correctness bugs and security vulnerabilities in practice. One particular fuzz testing tool, American Fuzzy Lop (AFL), has become popular thanks to its ease-of-use and bug-finding power. However, AFL remains limited in the bugs it can find since it simply does not cover large regions of code. If it does not cover parts of the code, it will not find bugs there. We propose a two-pronged approach to increase the coverage achieved by AFL. First, the approach automatically identifies branches exercised by few AFL-produced inputs (rare branches), which often guard code that is empirically hard to cover by naively mutating inputs. The second part of the approach is a novel mutation mask creation algorithm, which allows mutations to be biased towards producing inputs hitting a given rare branch. This mask is dynamically computed during fuzz testing and can be adapted to other testing targets. We implement this approach on top of AFL in a tool named FairFuzz. We conduct evaluation on real-world programs against state-of-the-art versions of AFL. We find that on these programs FairFuzz achieves high branch coverage at a faster rate that state-of-the-art versions of AFL. In addition, on programs with nested conditional structure, it achieves sustained increases in branch coverage after 24 hours (average 10.6% increase). In qualitative analysis, we find that FairFuzz has an increased capacity to automatically discover keywords. ![]() ![]() |
|
Li, Bo |
![]() Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang (Harbin Institute of Technology, China; Nanyang Technological University, Singapore; Carnegie Mellon University, USA; Kyushu University, Japan; University of Illinois at Urbana-Champaign, USA; Monash University, Australia) Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems. ![]() ![]() |
|
Li, Dengfeng |
![]() Wenyu Wang, Dengfeng Li, Wei Yang, Yurui Cao, Zhenwen Zhang, Yuetang Deng, and Tao Xie (University of Illinois at Urbana-Champaign, USA; University of Texas at Dallas, USA; Tencent, China) User Interface (UI) testing is a popular approach to ensure the quality of mobile apps. Numerous test generation tools have been developed to support UI testing on mobile apps, especially for Android apps. Previous work evaluates and compares different test generation tools using only relatively simple open-source apps, while real-world industrial apps tend to have more complex functionalities and implementations. There is no direct comparison among test generation tools with regard to effectiveness and ease-of-use on these industrial apps. To address such limitation, we study existing state-of-the-art or state-of-the-practice test generation tools on 68 widely-used industrial apps. We directly compare the tools with regard to code coverage and fault-detection ability. According to our results, Monkey, a state-of-the-practice tool from Google, achieves the highest method coverage on 22 of 41 apps whose method coverage data can be obtained. Of all 68 apps under study, Monkey also achieves the highest activity coverage on 35 apps, while Stoat, a state-of-the-art tool, is able to trigger the highest number of unique crashes on 23 apps. By analyzing the experimental results, we provide suggestions for combining different test generation tools to achieve better performance. We also report our experience in applying these tools to industrial apps under study. Our study results give insights on how Android UI test generation tools could be improved to better handle complex industrial apps. ![]() ![]() |
|
Li, Feng |
![]() Bingchang Liu, Wei Huo, Chao Zhang, Wenchao Li, Feng Li, Aihua Piao, and Wei Zou (Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Tsinghua University, China) Binary code similarity detection (BCSD) has many applications, including patch analysis, plagiarism detection, malware detection, and vulnerability search etc. Existing solutions usually perform comparisons over specific syntactic features extracted from binary code, based on expert knowledge. They have either high performance overheads or low detection accuracy. Moreover, few solutions are suitable for detecting similarities between cross-version binaries, which may not only diverge in syntactic structures but also diverge slightly in semantics. In this paper, we propose a solution αDiff, employing three semantic features, to address the cross-version BCSD challenge. It first extracts the intra-function feature of each binary function using a deep neural network (DNN). The DNN works directly on raw bytes of each function, rather than features (e.g., syntactic structures) provided by experts. αDiff further analyzes the function call graph of each binary, which are relatively stable in cross-version binaries, and extracts the inter-function and inter-module features. Then, a distance is computed based on these three features and used for BCSD. We have implemented a prototype of αDiff, and evaluated it on a dataset with about 2.5 million samples. The result shows that αDiff outperforms state-of-the-art static solutions by over 10 percentages on average in different BCSD settings. ![]() ![]() |
|
Li, Guangwei |
![]() Dongjie He, Lian Li, Lei Wang, Hengjie Zheng, Guangwei Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) The frequent release of Android OS and its various versions bring many compatibility issues to Android Apps. This paper studies and addresses such evolution-induced compatibility problems. We conduct an extensive empirical study over 11 different Android versions and 4,936 Android Apps. Our study shows that there are drastic API changes between adjacent Android versions, with averagely 140.8 new types, 1,505.6 new methods, and 979.2 new fields being introduced in each release. However, the Android Support Library (provided by the Android OS) only supports less than 23% of the newly added methods, with much less support for new types and fields. As a result, 91.84% of Android Apps write additional code to support different OS versions. Furthermore, 88.65% of the supporting codes share a common pattern, which directly compares variable android.os.Build.VERSION.SDK_INT with a constant version number, to use an API of particular versions. Based on our findings, we develop a new tool called IctApiFinder, to detect incompatible API usages in Android applications. IctApiFinder effectively computes the OS versions on which an API may be invoked, using an inter-procedural data-flow analysis framework. It detects numerous incompatible API usages in 361 out of 1,425 Apps. Compared to Android Lint, IctApiFinder is sound and able to reduce the false positives by 82.1%. We have reported the issues to 13 Apps developers. At present, 5 of them have already been confirmed by the original developers and 3 of them have already been fixed. ![]() ![]() |
|
Li, Li |
![]() Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang (Harbin Institute of Technology, China; Nanyang Technological University, Singapore; Carnegie Mellon University, USA; Kyushu University, Japan; University of Illinois at Urbana-Champaign, USA; Monash University, Australia) Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems. ![]() ![]() |
|
Li, Lian |
![]() Dongjie He, Lian Li, Lei Wang, Hengjie Zheng, Guangwei Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) The frequent release of Android OS and its various versions bring many compatibility issues to Android Apps. This paper studies and addresses such evolution-induced compatibility problems. We conduct an extensive empirical study over 11 different Android versions and 4,936 Android Apps. Our study shows that there are drastic API changes between adjacent Android versions, with averagely 140.8 new types, 1,505.6 new methods, and 979.2 new fields being introduced in each release. However, the Android Support Library (provided by the Android OS) only supports less than 23% of the newly added methods, with much less support for new types and fields. As a result, 91.84% of Android Apps write additional code to support different OS versions. Furthermore, 88.65% of the supporting codes share a common pattern, which directly compares variable android.os.Build.VERSION.SDK_INT with a constant version number, to use an API of particular versions. Based on our findings, we develop a new tool called IctApiFinder, to detect incompatible API usages in Android applications. IctApiFinder effectively computes the OS versions on which an API may be invoked, using an inter-procedural data-flow analysis framework. It detects numerous incompatible API usages in 361 out of 1,425 Apps. Compared to Android Lint, IctApiFinder is sound and able to reduce the false positives by 82.1%. We have reported the issues to 13 Apps developers. At present, 5 of them have already been confirmed by the original developers and 3 of them have already been fixed. ![]() ![]() |
|
Li, Wenchao |
![]() Bingchang Liu, Wei Huo, Chao Zhang, Wenchao Li, Feng Li, Aihua Piao, and Wei Zou (Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Tsinghua University, China) Binary code similarity detection (BCSD) has many applications, including patch analysis, plagiarism detection, malware detection, and vulnerability search etc. Existing solutions usually perform comparisons over specific syntactic features extracted from binary code, based on expert knowledge. They have either high performance overheads or low detection accuracy. Moreover, few solutions are suitable for detecting similarities between cross-version binaries, which may not only diverge in syntactic structures but also diverge slightly in semantics. In this paper, we propose a solution αDiff, employing three semantic features, to address the cross-version BCSD challenge. It first extracts the intra-function feature of each binary function using a deep neural network (DNN). The DNN works directly on raw bytes of each function, rather than features (e.g., syntactic structures) provided by experts. αDiff further analyzes the function call graph of each binary, which are relatively stable in cross-version binaries, and extracts the inter-function and inter-module features. Then, a distance is computed based on these three features and used for BCSD. We have implemented a prototype of αDiff, and evaluated it on a dataset with about 2.5 million samples. The result shows that αDiff outperforms state-of-the-art static solutions by over 10 percentages on average in different BCSD settings. ![]() ![]() |
|
Li, Wenhai |
![]() Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Wenhai Li, Chao Ji, and Dan Ding (Fudan University, China; University of Illinois at Urbana-Champaign, USA; Singapore University of Technology and Design, Singapore) Debugging microservice systems involves the deployment and manipulation of microservice systems on a containerized environment and faces unique challenges due to the high complexity and dynamism of microservices. To address these challenges, in this paper, we propose a debugging approach for microservice systems based on the delta debugging algorithm, which is to minimize failureinducing deltas of circumstances (e.g., deployment, environmental configurations) for effective debugging. Our approach includes novel techniques for defining, deploying/manipulating, and executing deltas following the idea of delta debugging. In particular, to construct a (failing) circumstance space for delta debugging to minimize, our approach defines a set of dimensions that can affect the execution of microservice systems. Our experimental study on a medium-size microservice benchmark system shows that our approach can effectively identify failure-inducing deltas that help diagnose the root causes. ![]() ![]() |
|
Li, Yi |
![]() Federico Mora, Yi Li, Julia Rubin, and Marsha Chechik (University of Toronto, Canada; University of British Columbia, Canada) Software is often built by integrating components created by different teams or even different organizations. With little understanding of changes in dependent components, it is challenging to maintain correctness and robustness of the entire system. In this paper, we investigate the effect of component changes on the behavior of their clients. We observe that changes in a component are often irrelevant to a particular client and thus can be adopted without any delays or negative effects. Following this observation, we formulate the notion of client-specific equivalence checking (CSE) and develop an automated technique optimized for checking such equivalence. We evaluate our technique on a set of benchmarks, including those from the existing literature on equivalence checking, and show its applicability and effectiveness. ![]() ![]() |
|
Li, Yongjian |
![]() Jialun Cao, Yongjian Li, and Jun Pang (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; University of Luxembourg, Luxembourg) This demo introduces L-CMP, an automatic learning-based parameterized verification tool. It can verify parameterized protocols by combining machine learning and model checking techniques. Given a parameterized protocol, L-CMP learns a set of auxiliary invariants and implements verification of the protocol using the invariants automatically. In particular, the learned auxiliary invariants are straightforward and readable. The experimental results show that L-CMP can successfully verify a number of cache coherence protocols, including the industrial-scale FLASH protocol. The video is available at https://youtu.be/6Dl2HiiiS4E, and L-CMPL-CMP can be downloaded at https://github.com/ ArabelaTso/Learning-Based-ParaVerifer. ![]() ![]() |
|
Lin, Huarui |
![]() Huarui Lin, Zan Wang, Shuang Liu, Jun Sun, Dongdi Zhang, and Guangning Wei (Tianjin University, China; Singapore University of Technology and Design, Singapore) Concurrency bugs of a multi-threaded program may only manifest with certain scheduling, i.e., they are heisenbugs which are observed only from time to time if we execute the same program with the same input multiple times. They are notoriously hard to fix. In this work, we propose an approach to automatically fix concurrency bugs. Compared to previous approaches, our key idea is to systematically fix concurrency bugs by inferring locking policies from failure inducing memory-access patterns. That is, we automatically identify memory-access patterns which are correlated with the manifestation of the bug, and then conjecture what is the intended locking policy of the program. Afterwards, we fix the program by implementing the locking policy so that the failure inducing memory-access patterns are made impossible. We have implemented our approach in a toolkit called PFix which supports Java programs. We applied PFix to a set of 23 concurrency bugs and are able to automatically fix 19 of them. In comparison, Grail which is the state-of-the-art tool for fixing concurrency bugs in Java programs can only fix 3 of them correctly. ![]() ![]() |
|
Lin, Tian |
![]() Yi Huang, Chunyang Chen, Zhenchang Xing, Tian Lin, and Yang Liu (Australian National University, Australia; Monash University, Australia; Nanyang Technological University, Singapore) Developers can use different technologies for many software development tasks in their work. However, when faced with several technologies with comparable functionalities, it is not easy for developers to select the most appropriate one, as comparisons among technologies are time-consuming by trial and error. Instead, developers can resort to expert articles, read official documents or ask questions in QA sites for technology comparison, but it is opportunistic to get a comprehensive comparison as online information is often fragmented or contradictory. To overcome these limitations, we propose the diffTech system that exploits the crowdsourced discussions from Stack Overflow, and assists technology comparison with an informative summary of different comparison aspects. We first build a large database of comparable technologies in software engineering by mining tags in Stack Overflow, and then locate comparative sentences about comparable technologies with natural language processing methods. We further mine prominent comparison aspects by clustering similar comparative sentences and representing each cluster with its keywords. The evaluation demonstrates both the accuracy and usefulness of our model and we implement our approach into a practical website for public use. ![]() ![]() ![]() |
|
Lin, Yun |
![]() Yun Lin, Jun Sun, Lyly Tran, Guangdong Bai, Haijun Wang, and Jinsong Dong (National University of Singapore, Singapore; Singapore University of Technology and Design, Singapore; Singapore Institute of Technology, Singapore; Nanyang Technological University, Singapore) Dynamic slicing is a common way of identifying the root cause when a program fault is revealed. With the dynamic slicing technique, the programmers can follow data and control flow along the program execution trace to the root cause. However, the technique usually fails to work on omission bugs, i.e., the faults which are caused by missing executing some code. In many cases, dynamic slicing over-skips the root cause when an omission bug happens, leading the debugging process to a dead end. In this work, we conduct an empirical study on the omission bugs in the Defects4J bug repository. Our study shows that (1) omission bugs are prevalent (46.4%) among all the studied bugs; (2) there are repeating patterns on causes and fixes of the omission bugs; (3) the patterns of fixing omission bugs serve as a strong hint to break the slicing dead end. Based on our findings, we train a neural network model on the omission bugs in Defects4J repository to recommend where to approach when slicing can no long work. We conduct an experiment by applying our approach on 3193 mutated omission bugs which slicing fails to locate. The results show that our approach outperforms random benchmark on breaking the dead end and localizing the mutated omission bugs (63.8% over 2.8%). ![]() ![]() |
|
Liu, Bingchang |
![]() Bingchang Liu, Wei Huo, Chao Zhang, Wenchao Li, Feng Li, Aihua Piao, and Wei Zou (Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Tsinghua University, China) Binary code similarity detection (BCSD) has many applications, including patch analysis, plagiarism detection, malware detection, and vulnerability search etc. Existing solutions usually perform comparisons over specific syntactic features extracted from binary code, based on expert knowledge. They have either high performance overheads or low detection accuracy. Moreover, few solutions are suitable for detecting similarities between cross-version binaries, which may not only diverge in syntactic structures but also diverge slightly in semantics. In this paper, we propose a solution αDiff, employing three semantic features, to address the cross-version BCSD challenge. It first extracts the intra-function feature of each binary function using a deep neural network (DNN). The DNN works directly on raw bytes of each function, rather than features (e.g., syntactic structures) provided by experts. αDiff further analyzes the function call graph of each binary, which are relatively stable in cross-version binaries, and extracts the inter-function and inter-module features. Then, a distance is computed based on these three features and used for BCSD. We have implemented a prototype of αDiff, and evaluated it on a dataset with about 2.5 million samples. The result shows that αDiff outperforms state-of-the-art static solutions by over 10 percentages on average in different BCSD settings. ![]() ![]() |
|
Liu, Chao |
![]() Han Liu, Chao Liu, Wenqi Zhao, Yu Jiang, and Jiaguang Sun (Tsinghua University, China; Peking University, China; Ant Financial, China) Smart contracts, as a promising and powerful application on the Ethereum blockchain, have been growing rapidly in the past few years. Since they are highly vulnerable to different forms of attacks, their security becomes a top priority. However, existing security auditing techniques are either limited in fnding vulnerabilities (rely on pre-defned bug paterns) or very expensive (rely on program analysis), thus are insufcient for Ethereum. To mitigate these limitations, we proposed a novel semanticaware security auditing technique called S-gram for Ethereum. The key insight is a combination of N-gram language modeling and lightweight static semantic labeling, which can learn statistical regularities of contract tokens and capture high-level semantics as well (e.g., flow sensitivity of a transaction). S-gram can be used to predict potential vulnerabilities by identifying irregular token sequences and optimize existing in-depth analyzers (e.g., symbolic execution engines, fuzzers etc.). We have implemented S-gram for Solidity smart contracts in Ethereum. The evaluation demonstrated the potential of S-gram in identifying possible security issues. ![]() ![]() |
|
Liu, Cong |
![]() Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid (University of Texas at Austin, USA; Southern University of Science and Technology, China; University of Texas at Dallas, USA) While Deep Neural Networks (DNNs) have established the fundamentals of image-based autonomous driving systems, they may exhibit erroneous behaviors and cause fatal accidents. To address the safety issues in autonomous driving systems, a recent set of testing techniques have been designed to automatically generate artificial driving scenes to enrich test suite, e.g., generating new input images transformed from the original ones. However, these techniques are insufficient due to two limitations: first, many such synthetic images often lack diversity of driving scenes, and hence compromise the resulting efficacy and reliability. Second, for machine-learning-based systems, a mismatch between training and application domain can dramatically degrade system accuracy, such that it is necessary to validate inputs for improving system robustness. In this paper, we propose DeepRoad, an unsupervised DNN-based framework for automatically testing the consistency of DNN-based autonomous driving systems and online validation. First, DeepRoad automatically synthesizes large amounts of diverse driving scenes without using image transformation rules (e.g. scale, shear and rotation). In particular, DeepRoad is able to produce driving scenes with various weather conditions (including those with rather extreme conditions) by applying Generative Adversarial Networks (GANs) along with the corresponding real-world weather scenes. Second, DeepRoad utilizes metamorphic testing techniques to check the consistency of such systems using synthetic images. Third, DeepRoad validates input images for DNN-based systems by measuring the distance of the input and training images using their VGGNet features. We implement DeepRoad to test three well-recognized DNN-based autonomous driving systems in Udacity self-driving car challenge. The experimental results demonstrate that DeepRoad can detect thousands of inconsistent behaviors for these systems, and effectively validate input images to potentially enhance the system robustness as well. ![]() ![]() |
|
Liu, Han |
![]() Han Liu, Chao Liu, Wenqi Zhao, Yu Jiang, and Jiaguang Sun (Tsinghua University, China; Peking University, China; Ant Financial, China) Smart contracts, as a promising and powerful application on the Ethereum blockchain, have been growing rapidly in the past few years. Since they are highly vulnerable to different forms of attacks, their security becomes a top priority. However, existing security auditing techniques are either limited in fnding vulnerabilities (rely on pre-defned bug paterns) or very expensive (rely on program analysis), thus are insufcient for Ethereum. To mitigate these limitations, we proposed a novel semanticaware security auditing technique called S-gram for Ethereum. The key insight is a combination of N-gram language modeling and lightweight static semantic labeling, which can learn statistical regularities of contract tokens and capture high-level semantics as well (e.g., flow sensitivity of a transaction). S-gram can be used to predict potential vulnerabilities by identifying irregular token sequences and optimize existing in-depth analyzers (e.g., symbolic execution engines, fuzzers etc.). We have implemented S-gram for Solidity smart contracts in Ethereum. The evaluation demonstrated the potential of S-gram in identifying possible security issues. ![]() ![]() |
|
Liu, Hui |
![]() Hui Liu, Zhifeng Xu, and Yanzhen Zou (Beijing Institute of Technology, China; Peking University, China) Software refactoring is widely employed to improve software quality. A key step in software refactoring is to identify which part of the software should be refactored. To facilitate the identification, a number of approaches have been proposed to identify certain structures in the code (called code smells) that suggest the possibility of refactoring. Most of such approaches rely on manually designed heuristics to map manually selected source code metrics to predictions. However, it is challenging to manually select the best features, especially textual features. It is also difficult to manually construct the optimal heuristics. To this end, in this paper we propose a deep learning based novel approach to detecting feature envy, one of the most common code smells. The key insight is that deep neural networks and advanced deep learning techniques could automatically select features (especially textual features) of source code for feature envy detection, and could automatically build the complex mapping between such features and predictions. We also propose an automatic approach to generating labeled training data for the neural network based classifier, which does not require any human intervention. Evaluation results on open-source applications suggest that the proposed approach significantly improves the state-of-the-art in both detecting feature envy smells and recommending destinations for identified smelly methods. ![]() ![]() |
|
Liu, Jie |
![]() Jie Liu, Diyu Wu, and Jingling Xue (UNSW, Australia) The Android multitasking mechanism can be plagued with app switching attacks, in which a malicious app replaces the legitimate top activity of the focused app with one of its own, thus mounting, e.g., phishing and denial-of-service attacks. Existing market-level defenses are still ineffective, as static analysis is fundamentally unable to reason about the intention of an app and dynamic analysis has low coverage. We introduce TDroid, a new market-level approach to detecting app switching attacks. The challenge lies in how to handle a plethora of input-dependent branch predicates (forming an exponential number of paths) that control the execution of the code responsible for launching such attacks. TDroid tackles this challenge by combining static and dynamic analysis to analyze an app without producing any false positives. In its static analysis, TDroid transforms the app into runnable slices containing potentially app switching attacks, one slice per attack. In its dynamic analysis, TDroid executes these slices on an Android phone or emulator to expose their malicious GUIs. The novelty lies in the use of a new trigger-oriented slicing technique in producing runnable slices so that certain input-dependent branch predicates are specialized to execute always some fixed branches. Evaluated with a large set of malware apps, TDroid is shown to outperform the state of the art, by detecting substantially more app switching attacks, in a few minutes per app, on average. ![]() ![]() |
|
Liu, Shuang |
![]() Huarui Lin, Zan Wang, Shuang Liu, Jun Sun, Dongdi Zhang, and Guangning Wei (Tianjin University, China; Singapore University of Technology and Design, Singapore) Concurrency bugs of a multi-threaded program may only manifest with certain scheduling, i.e., they are heisenbugs which are observed only from time to time if we execute the same program with the same input multiple times. They are notoriously hard to fix. In this work, we propose an approach to automatically fix concurrency bugs. Compared to previous approaches, our key idea is to systematically fix concurrency bugs by inferring locking policies from failure inducing memory-access patterns. That is, we automatically identify memory-access patterns which are correlated with the manifestation of the bug, and then conjecture what is the intended locking policy of the program. Afterwards, we fix the program by implementing the locking policy so that the failure inducing memory-access patterns are made impossible. We have implemented our approach in a toolkit called PFix which supports Java programs. We applied PFix to a set of 23 concurrency bugs and are able to automatically fix 19 of them. In comparison, Grail which is the state-of-the-art tool for fixing concurrency bugs in Java programs can only fix 3 of them correctly. ![]() ![]() |
|
Liu, Ting |
![]() Yu Qu, Ting Liu, Jianlei Chi, Yangxu Jin, Di Cui, Ancheng He, and Qinghua Zheng (Xi'an Jiaotong University, China) Network measures have been proved to be useful in predicting software defects. Leveraging the dependency relationships between software modules, network measures can capture various structural features of software systems. However, existing studies have relied on user-defined network measures (e.g., degree statistics or centrality metrics), which are inflexible and require high computation cost, to describe the structural features. In this paper, we propose a new method called node2defect which uses a newly proposed network embedding technique, node2vec, to automatically learn to encode dependency network structure into low-dimensional vector spaces to improve software defect prediction. Specifically, we firstly construct a program's Class Dependency Network. Then node2vec is used to automatically learn structural features of the network. After that, we combine the learned features with traditional software engineering features, for accurate defect prediction. We evaluate our method on 15 open source programs. The experimental results show that in average, node2defect improves the state-of-the-art approach by 9.15% in terms of F-measure. ![]() ![]() |
|
Liu, Wanwei |
![]() Liangze Yin, Wei Dong, Wanwei Liu, and Ji Wang (National University of Defense Technology, China) Scheduling constraint based abstraction refinement (SCAR) is one of the most efficient methods for verifying programs under sequential consistency (SC). However, most multi-processor architectures implement weak memory models (WMMs) in order to improve the performance of a program. Due to the nondeterministic execution of those memory operations by the same thread, the behavior of a program under WMMs is much more complex than that under SC, which significantly increases the verification complexity. This paper elegantly extends the SCAR method to WMMs such as TSO and PSO. To capture the order requirements of an abstraction counterexample under WMMs, we have enriched the event order graph (EOG) of a counterexample such that it is competent for both SC and WMMs. We have also proposed a unified EOG generation method which can always obtain a minimal EOG efficiently. Experimental results on a large set of multi-threaded C programs show promising results of our method. It significantly outperforms state-of-the-art tools, and the time and memory it required to verify a program under TSO and PSO are roughly comparable to that under SC. ![]() ![]() |
|
Liu, Xiaoyu |
![]() Xiaoyu Liu, LiGuo Huang, and Vincent Ng (Southern Methodist University, USA; University of Texas at Dallas, USA) It is time-consuming and labor-intensive to learn and locate the correct API for programming tasks. Thus, it is beneficial to perform API recommendation automatically. The graph-based statistical model has been shown to recommend top-10 API candidates effectively. It falls short, however, in accurately recommending an actual top-1 API. To address this weakness, we propose RecRank, an approach and tool that applies a novel ranking-based discriminative approach leveraging API usage path features to improve top-1 API recommendation. Empirical evaluation on a large corpus of (1385+8) open source projects shows that RecRank significantly improves top-1 API recommendation accuracy and mean reciprocal rank when compared to state-of-the-art API recommendation approaches. ![]() ![]() |
|
Liu, Xin |
![]() Liang Bao, Xin Liu, Ziheng Xu, and Baoyin Fang (Xidian University, China; University of California at Davis, USA) Distributed message systems (DMSs) serve as the communication backbone for many real-time streaming data processing applications. To support the vast diversity of such applications, DMSs provide a large number of parameters to configure. However, It overwhelms for most users to configure these parameters well for better performance. Although many automatic configuration approaches have been proposed to address this issue, critical challenges still remain: 1) to train a better and robust performance prediction model using a limited number of samples, and 2) to search for a high-dimensional parameter space efficiently within a time constraint. In this paper, we propose AutoConfig -- an automatic configuration system that can optimize producer-side throughput on DMSs. AutoConfig constructs a novel comparison-based model (CBM) that is more robust that the prediction-based model (PBM) used by previous learning-based approaches. Furthermore, AutoConfig uses a weighted Latin hypercube sampling (wLHS) approach to select a set of samples that can provide a better coverage over the high-dimensional parameter space. wLHS allows AutoConfig to search for more promising configurations using the trained CBM. We have implemented AutoConfig on the Kafka platform, and evaluated it using eight different testing scenarios deployed on a public cloud. Experimental results show that our CBM can obtain better results than that of PBM under the same random forests based model. Furthermore, AutoConfig outperforms default configurations by 215.40% on average, and five state-of-the-art configuration algorithms by 7.21%-64.56%. ![]() ![]() |
|
Liu, Yang |
![]() Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang (Harbin Institute of Technology, China; Nanyang Technological University, Singapore; Carnegie Mellon University, USA; Kyushu University, Japan; University of Illinois at Urbana-Champaign, USA; Monash University, Australia) Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems. ![]() ![]() ![]() Yi Huang, Chunyang Chen, Zhenchang Xing, Tian Lin, and Yang Liu (Australian National University, Australia; Monash University, Australia; Nanyang Technological University, Singapore) Developers can use different technologies for many software development tasks in their work. However, when faced with several technologies with comparable functionalities, it is not easy for developers to select the most appropriate one, as comparisons among technologies are time-consuming by trial and error. Instead, developers can resort to expert articles, read official documents or ask questions in QA sites for technology comparison, but it is opportunistic to get a comprehensive comparison as online information is often fragmented or contradictory. To overcome these limitations, we propose the diffTech system that exploits the crowdsourced discussions from Stack Overflow, and assists technology comparison with an informative summary of different comparison aspects. We first build a large database of comparable technologies in software engineering by mining tags in Stack Overflow, and then locate comparative sentences about comparable technologies with natural language processing methods. We further mine prominent comparison aspects by clustering similar comparative sentences and representing each cluster with its keywords. The evaluation demonstrates both the accuracy and usefulness of our model and we implement our approach into a practical website for public use. ![]() ![]() ![]() ![]() Lingling Fan, Ting Su, Sen Chen, Guozhu Meng, Yang Liu, Lihua Xu, and Geguang Pu (East China Normal University, China; Nanyang Technological University, Singapore; Chinese Academy of Sciences, China; New York University Shanghai, China) Android, the #1 mobile app framework, enforces the single-GUI-thread model, in which a single UI thread manages GUI rendering and event dispatching. Due to this model, it is vital to avoid blocking the UI thread for responsiveness. One common practice is to offload long-running tasks into async threads. To achieve this, Android provides various async programming constructs, and leaves evelopers themselves to obey the rules implied by the model. However, as our study reveals, more than 25% apps violate these rules and introduce hard-to-detect, fail-stop errors, which we term as aysnc programming errors (APEs). To this end, this paper introduces APEChecker, a technique to automatically and efficiently manifest APEs. The key idea is to characterize APEs as specific fault patterns, and synergistically combine static analysis and dynamic UI exploration to detect and verify such errors. Among the 40 real-world Android apps, APEChecker unveils and processes 61 APEs, of which 51 are confirmed (83.6% hit rate). Specifically, APEChecker detects 3X more APEs than the state-of-art testing tools (Monkey, Sapienz and Stoat), and reduces testing time from half an hour to a few minutes. On a specific type of APEs, APEChecker confirms 5X more errors than the data race detection tool, EventRacer, with very few false alarms. ![]() ![]() ![]() Kaifeng Huang, Bihuan Chen, Xin Peng, Daihong Zhou, Ying Wang, Yang Liu, and Wenyun Zhao (Fudan University, China; Nanyang Technological University, Singapore) Analyzing and understanding source code changes is important in a variety of software maintenance tasks. To this end, many code differencing and code change summarization methods have been proposed. For some tasks (e.g. code review and software merging), however, those differencing methods generate too fine-grained a representation of code changes, and those summarization methods generate too coarse-grained a representation of code changes. Moreover, they do not consider the relationships among code changes. Therefore, the generated differences or summaries make it not easy to analyze and understand code changes in some software maintenance tasks. In this paper, we propose a code differencing approach, named CLDIFF, to generate concise linked code differences whose granularity is in between the existing code differencing and code change summarization methods. The goal of CLDIFF is to generate more easily understandable code differences. CLDIFF takes source code files before and after changes as inputs, and consists of three steps. First, it pre-processes the source code files by pruning unchanged declara- tions from the parsed abstract syntax trees. Second, it generates concise code differences by grouping fine-grained code differences at or above the statement level and describing high-level changes in each group. Third, it links the related concise code differences according to five pre-defined links. Experiments with 12 Java projects (74,387 commits) and a human study with 10 participants have indicated the accuracy, conciseness, performance and usefulness of CLDIFF. ![]() ![]() |
|
Liu, Ye |
![]() Bo Jiang, Ye Liu, and W. K. Chan (Beihang University, China; City University of Hong Kong, China) Decentralized cryptocurrencies feature the use of blockchain to transfer values among peers on networks without central agency. Smart contracts are programs running on top of the blockchain consensus protocol to enable people make agreements while minimizing trusts. Millions of smart contracts have been deployed in various decentralized applications. The security vulnerabilities within those smart contracts pose significant threats to their applications. Indeed, many critical security vulnerabilities within smart contracts on Ethereum platform have caused huge financial losses to their users. In this work, we present ContractFuzzer, a novel fuzzer to test Ethereum smart contracts for security vulnerabilities. ContractFuzzer generates fuzzing inputs based on the ABI specifications of smart contracts, defines test oracles to detect security vulnerabilities, instruments the EVM to log smart contracts runtime behaviors, and analyzes these logs to report security vulnerabilities. Our fuzzing of 6991 smart contracts has flagged more than 459 vulnerabilities with high precision. In particular, our fuzzing tool successfully detects the vulnerability of the DAO contract that leads to USD 60 million loss and the vulnerabilities of Parity Wallet that have led to the loss of USD 30 million and the freezing of USD 150 million worth of Ether. ![]() ![]() ![]() ![]() ![]() |
|
Liu, Yepang |
![]() Huaxun Huang, Lili Wei, Yepang Liu, and Shing-Chi Cheung (Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) The control flows of Android apps are largely driven by the protocols that govern how callback APIs are invoked in response to various events. When these callback APIs evolve along with the Android framework, the changes in their invocation protocols can induce unexpected control flows to existing Android apps, causing various compatibility issues. We refer to these issues as callback compatibility issues. While Android framework updates have received due attention, little is known about their impacts on app control flows and the callback compatibility issues thus induced. To bridge the gap, we examined Android documentations and conducted an empirical study on 100 real-world callback compatibility issues to investigate how these issues were induced by callback API evolutions. Based on our empirical findings, we propose a graph-based model to capture the control flow inconsistencies caused by API evolutions and devise a static analysis technique, Cider, to detect callback compatibility issues. Our evaluation of Cider on 20 popular open-source Android apps shows that Cider is effective. It detected 13 new callback compatibility issues in these apps, among which 12 issues were confirmed and 9 issues were fixed. ![]() ![]() ![]() Jiajun Hu, Lili Wei, Yepang Liu, Shing-Chi Cheung, and Huaxun Huang (Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) WebView is a widely used Android component that augments a native app with web browser capabilities. It eases the interactions between an app’s native code and web code. However, the interaction mechanism of WebView induces new types of bugs in Android apps. Understanding the characteristics and manifestation of these WebView-induced bugs (ωBugs for short) facilitates the correct usages of WebViews in Android apps. This motivates us to conduct the first empirical study on ωBugs based on those found in popular open-source Android apps. Our study identified the major root causes and consequences of ωBugs and made interesting observations that can be leveraged for detecting and diagnosing ωBugs. Based on the empirical study, we further propose an automated testing technique ωDroid to effectively expose ωBugs in Android apps. In our experiments, ωDroid successfully discovered 30 unique and previously-unknown ωBugs when applied to 146 open-source Android apps. We reported the 30 ωBugs to the corresponding app developers. Out of these 30 ωBugs, 14 were confirmed and 7 of them were fixed. This shows that ωDroid can effectively detect ωBugs that are of the developers’ concern. ![]() ![]() |
|
Liu, Zhongxin |
![]() Zhongxin Liu, Xin Xia, Ahmed E. Hassan, David Lo, Zhenchang Xing, and Xinyu Wang (Zhejiang University, China; Monash University, Australia; Queen's University, Canada; Singapore Management University, Singapore; Australian National University, Australia) Commit messages can be regarded as the documentation of software changes. These messages describe the content and purposes of changes, hence are useful for program comprehension and software maintenance. However, due to the lack of time and direct motivation, commit messages sometimes are neglected by developers. To address this problem, Jiang et al. proposed an approach (we refer to it as NMT), which leverages a neural machine translation algorithm to automatically generate short commit messages from code. The reported performance of their approach is promising, however, they did not explore why their approach performs well. Thus, in this paper, we first perform an in-depth analysis of their experimental results. We find that (1) Most of the test diffs from which NMT can generate high-quality messages are similar to one or more training diffs at the token level. (2) About 16% of the commit messages in Jiang et al.’s dataset are noisy due to being automatically generated or due to them describing repetitive trivial changes. (3) The performance of NMT declines by a large amount after removing such noisy commit messages. In addition, NMT is complicated and time-consuming. Inspired by our first finding, we proposed a simpler and faster approach, named NNGen (Nearest Neighbor Generator), to generate concise commit messages using the nearest neighbor algorithm. Our experimental results show that NNGen is over 2,600 times faster than NMT, and outperforms NMT in terms of BLEU (an accuracy measure that is widely used to evaluate machine translation systems) by 21%. Finally, we also discuss some observations for the road ahead for automated commit message generation to inspire other researchers. ![]() ![]() |
|
Lo, David |
![]() Xue Han, Tingting Yu, and David Lo (University of Kentucky, USA; Singapore Management University, Singapore) Software performance is important for ensuring the quality of software products. Performance bugs, defined as programming errors that cause significant performance degradation, can lead to slow systems and poor user experience. While there has been some research on automated performance testing such as test case generation, the main idea is to select workload values to increase the program execution times. These techniques often assume the initial test cases have the right combination of input parameters and focus on evolving values of certain input parameters. However, such an assumption may not hold for highly configurable real-word applications, in which the combinations of input parameters can be very large. In this paper, we manually analyze 300 bug reports from three large open source projects - Apache HTTP Server, MySQL, and Mozilla Firefox. We found that 1) exposing performance bugs often requires combinations of multiple input parameters, and 2) certain input parameters are frequently involved in exposing performance bugs. Guided by these findings, we designed and evaluated an automated approach, PerfLearner, to extract execution commands and input parameters from descriptions of performance bug reports and use them to generate test frames for guiding actual performance test case generation. ![]() ![]() ![]() Qiao Huang, Xin Xia, Zhenchang Xing, David Lo, and Xinyu Wang (Zhejiang University, China; Monash University, Australia; Australian National University, Australia; Singapore Management University, Singapore) Developers often need to search for appropriate APIs for their programming tasks. Although most libraries have API reference documentation, it is not easy to find appropriate APIs due to the lexical gap and knowledge gap between the natural language description of the programming task and the API description in API documentation. Here, the lexical gap refers to the fact that the same semantic meaning can be expressed by different words, and the knowledge gap refers to the fact that API documentation mainly describes API functionality and structure but lacks other types of information like concepts and purposes, which are usually the key information in the task description. In this paper, we propose an API recommendation approach named BIKER (Bi-Information source based KnowledgE Recommendation) to tackle these two gaps. To bridge the lexical gap, BIKER uses word embedding technique to calculate the similarity score between two text descriptions. Inspired by our survey findings that developers incorporate Stack Overflow posts and API documentation for bridging the knowledge gap, BIKER leverages Stack Overflow posts to extract candidate APIs for a program task, and ranks candidate APIs by considering the query’s similarity with both Stack Overflow posts and API documentation. It also summarizes supplementary information (e.g., API description, code examples in Stack Overflow posts) for each API to help developers select the APIs that are most relevant to their tasks. Our evaluation with 413 API-related questions confirms the effectiveness of BIKER for both class- and method-level API recommendation, compared with state-of-the-art baselines. Our user study with 28 Java developers further demonstrates the practicality of BIKER for API search. ![]() ![]() ![]() Zhongxin Liu, Xin Xia, Ahmed E. Hassan, David Lo, Zhenchang Xing, and Xinyu Wang (Zhejiang University, China; Monash University, Australia; Queen's University, Canada; Singapore Management University, Singapore; Australian National University, Australia) Commit messages can be regarded as the documentation of software changes. These messages describe the content and purposes of changes, hence are useful for program comprehension and software maintenance. However, due to the lack of time and direct motivation, commit messages sometimes are neglected by developers. To address this problem, Jiang et al. proposed an approach (we refer to it as NMT), which leverages a neural machine translation algorithm to automatically generate short commit messages from code. The reported performance of their approach is promising, however, they did not explore why their approach performs well. Thus, in this paper, we first perform an in-depth analysis of their experimental results. We find that (1) Most of the test diffs from which NMT can generate high-quality messages are similar to one or more training diffs at the token level. (2) About 16% of the commit messages in Jiang et al.’s dataset are noisy due to being automatically generated or due to them describing repetitive trivial changes. (3) The performance of NMT declines by a large amount after removing such noisy commit messages. In addition, NMT is complicated and time-consuming. Inspired by our first finding, we proposed a simpler and faster approach, named NNGen (Nearest Neighbor Generator), to generate concise commit messages using the nearest neighbor algorithm. Our experimental results show that NNGen is over 2,600 times faster than NMT, and outperforms NMT in terms of BLEU (an accuracy measure that is widely used to evaluate machine translation systems) by 21%. Finally, we also discuss some observations for the road ahead for automated commit message generation to inspire other researchers. ![]() ![]() |
|
Lu, Jian |
![]() Yuju Shen, Yanyan Jiang, Chang Xu, Ping Yu, Xiaoxing Ma, and Jian Lu (Nanjing University, China) Regular expression (regex) with modern extensions is one of the most popular string processing tools. However, poorly-designed regexes can yield exponentially many matching steps, and lead to regex Denial-of-Service (ReDoS) attacks under well-conceived string inputs. This paper presents Rescue, a three-phase gray-box analytical technique, to automatically generate ReDoS strings to highlight vulnerabilities of given regexes. Rescue systematically seeds (by a genetic search), incubates (by another genetic search), and finally pumps (by a regex-dedicated algorithm) for generating strings with maximized search time. We implemenmted the Rescue tool and evaluated it against 29,088 practical regexes in real-world projects. The evaluation results show that Rescue found 49% more attack strings compared with the best existing technique, and applying Rescue to popular GitHub projects discovered ten previously unknown ReDoS vulnerabilities. ![]() ![]() ![]() |
|
Lumetta, Steven S. |
![]() Jianxiong Gao and Steven S. Lumetta (University of Illinois at Urbana-Champaign, USA) Path explosion has been a problem for symbolic execution for a long time. The key to avoid path explosion is to limit the number of paths generated within loops while maintaining high code coverage. Full symbolic execution creates paths for every possible execution path. Frequently, paths within loops do not contribute to code coverage. Branches within loops may generate new states at every iteration. The path explosion problem created by loops often stops symbolic execution to reach deeper parts of the code. In this paper, we propose a new path exploration method that reduces the number of states needed to achieve high coverage. Our algorithm limits the number of new states created by first prioritizing states, and then pruning the states that do not contribute to code coverage. Our algorithm does not require loop invariant inference/loop summarization, nor does it bound the number of iterations of loop exploration. The proposed algorithm can thus handle a broader set of loops than previous approaches. In fact, our algorithm is orthogonal to loop summarization techniques and search-guide heuristics, so it complements the current methods. We have implemented our algorithm using KLEE and tested with 235 student-generated versions of a classroom assignment. Our results show that our algorithm helps to achieve the same coverage with speedup of 11.8× for 117 out of the 235 programs, while adding 15% max observed and 2% average overhead over the 50% of programs not benefiting from the technique. The maximum speedup for a single program is 52.3×. ![]() ![]() |
|
Lyu, Michael R. |
![]() Pinjia He, Zhuangbin Chen, Shilin He, and Michael R. Lyu (Chinese University of Hong Kong, China) Logging is a common programming practice of great importance in modern software development, because software logs have been widely used in various software maintenance tasks. To provide high-quality logs, developers need to design the description text in logging statements carefully. Inappropriate descriptions will slow down or even mislead the maintenance process, such as postmortem analysis. However, there is currently a lack of rigorous guide and specifications on developer logging behaviors, which makes the construction of description text in logging statements a challenging problem. To fill this significant gap, in this paper, we systematically study what developers log, with focus on the usage of natural language descriptions in logging statements. We obtain 6 valuable findings by conducting source code analysis on 10 Java projects and 7 C# projects, which contain 28,532,975 LOC and 115,159 logging statements in total. Furthermore, our study demonstrates the potential of automated description text generation for logging statements by obtaining up to 49.04 BLEU-4 score and 62.1 ROUGE-L score using a simple information retrieval method. To facilitate future research in this field, the datasets have been publicly released. ![]() ![]() |
|
Ma, Lei |
![]() Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang (Harbin Institute of Technology, China; Nanyang Technological University, Singapore; Carnegie Mellon University, USA; Kyushu University, Japan; University of Illinois at Urbana-Champaign, USA; Monash University, Australia) Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems. ![]() ![]() |
|
Ma, Shiqing |
![]() Zhenhao Tang, Juan Zhai, Minxue Pan, Yousra Aafer, Shiqing Ma, Xiangyu Zhang, and Jianhua Zhao (Nanjing University, China; Purdue University, USA) Modern Android malwares tend to use advanced techniques to cover their malicious behaviors. They usually feature multi-staged, condition-guarded and environment-specific payloads. An increasing number of them utilize WebView, particularly the two-way communications between Java and JavaScript, to evade detection and analysis of existing techniques. We propose Dual-Force, a forced execution technique which simultaneously forces both Java and JavaScript code of WebView applications to execute along various paths without requiring any environment setup or providing any inputs manually. As such, the hidden payloads of WebView malwares are forcefully exposed. The technique features a novel execution model that allows forced execution to suppress exceptions and continue execution. Experimental results show that Dual-Force precisely exposes malicious payload in 119 out of 150 WebView malwares. Compared to the state-of-the-art, Dual-Force can expose 23% more malicious behaviors. ![]() ![]() |
|
Ma, Xiaoxing |
![]() Yuju Shen, Yanyan Jiang, Chang Xu, Ping Yu, Xiaoxing Ma, and Jian Lu (Nanjing University, China) Regular expression (regex) with modern extensions is one of the most popular string processing tools. However, poorly-designed regexes can yield exponentially many matching steps, and lead to regex Denial-of-Service (ReDoS) attacks under well-conceived string inputs. This paper presents Rescue, a three-phase gray-box analytical technique, to automatically generate ReDoS strings to highlight vulnerabilities of given regexes. Rescue systematically seeds (by a genetic search), incubates (by another genetic search), and finally pumps (by a regex-dedicated algorithm) for generating strings with maximized search time. We implemenmted the Rescue tool and evaluated it against 29,088 practical regexes in real-world projects. The evaluation results show that Rescue found 49% more attack strings compared with the best existing technique, and applying Rescue to popular GitHub projects discovered ten previously unknown ReDoS vulnerabilities. ![]() ![]() ![]() |
|
Macedo, Nuno |
![]() Julien Brunel, David Chemouil, Alcino Cunha, and Nuno Macedo (ONERA, France; University of Toulouse, France; INESC TEC, Portugal; University of Minho, Portugal) This paper presents the Electrum Analyzer, a free-software tool to validate and perform model checking of Electrum specifications. Electrum is an extension of Alloy that enriches its relational logic with LTL operators, thus simplifying the specification of dynamic systems. The Analyzer supports both automatic bounded model checking, with an encoding into SAT, and unbounded model checking, with an encoding into SMV. Instance, or counter-example, traces are presented back to the user in a unified visualizer. Features to speed up model checking are offered, including a decomposed parallel solving strategy and the extraction of symbolic bounds. ![]() ![]() ![]() ![]() |
|
Macho, Christian |
![]() Keheliya Gallaba, Christian Macho, Martin Pinzger, and Shane McIntosh (McGill University, Canada; University of Klagenfurt, Austria) Automated builds, which may pass or fail, provide feedback to a development team about changes to the codebase. A passing build indicates that the change compiles cleanly and tests (continue to) pass. A failing (a.k.a., broken) build indicates that there are issues that require attention. Without a closer analysis of the nature of build outcome data, practitioners and researchers are likely to make two critical assumptions: (1) build results are not noisy; however, passing builds may contain failing or skipped jobs that are actively or passively ignored; and (2) builds are equal; however, builds vary in terms of the number of jobs and configurations. To investigate the degree to which these assumptions about build breakage hold, we perform an empirical study of 3.7 million build jobs spanning 1,276 open source projects. We find that: (1) 12% of passing builds have an actively ignored failure; (2) 9% of builds have a misleading or incorrect outcome on average; and (3) at least 44% of the broken builds contain passing jobs, i.e., the breakage is local to a subset of build variants. Like other software archives, build data is noisy and complex. Analysis of build data requires nuance. ![]() ![]() |
|
Mairegger, Michael |
![]() Andrea Janes, Michael Mairegger, and Barbara Russo (Free University of Bolzano, Italy) As a developer, it is often complex to foresee the impact of changes in source code on usage, e.g., it is time-consuming to find out all components that will be impacted by a change or estimate the impact on the usability of a failing piece of code. It is therefore hard to decide how much effort in quality assurance is justifiable to obtain the desired business goals. In this paper, to reduce the difficulty for developers to understand the importance of source code, we propose an automated way to provide this information to developers as they are working on a given piece of code. As a proof-of-concept, we developed a plug-in for Microsoft Visual Studio Code that informs about the importance of source code methods based on the frequency of usage by the end-users of the developed software. The plug-in aims to increase the awareness developers have about the importance of source code in an unobtrusive way, helping them to prioritize their effort to quality assurance, technical excellence, and usability. code_call_lens can be downloaded from GitHub at https://github.com/xxMUROxx/vscode.code_call_lens. ![]() ![]() |
|
Malek, Sam |
![]() Mahmoud Hammad, Joshua Garcia, and Sam Malek (University of California at Irvine, USA) The current security mechanisms for Android apps, both static and dynamic analysis approaches, are insufficient for detection and prevention of the increasingly dynamic and sophisticated security attacks. Static analysis approaches suffer from false positives whereas dynamic analysis approaches suffer from false negatives. Moreover, they all lack the ability to efficiently analyze systems with incremental changes—such as adding/removing apps, granting/revoking permissions, and dynamic components’ communications. Each time the system changes, the entire analysis needs to be repeated, making the existing approaches inefficient for practical use. To mitigate their shortcomings, we have developed SALMA, a novel self-protecting Android software system that monitors itself and adapts its behavior at runtime to prevent a wide-range of security risks. SALMA maintains a precise architectural model, represented as a Multiple-Domain-Matrix, and incrementally and efficiently analyzes an Android system in response to incremental system changes. The maintained architecture is used to reason about the running Android system. Every time the system changes, SALMA determines (1) the impacted part of the system, and (2) the subset of the security analyses that need to be performed, thereby greatly improving the performance of the approach. Our experimental results on hundreds of real-world apps corroborate SALMA’s scalability and efficiency as well as its ability to detect and prevent security attacks at runtime with minimal disruption. ![]() ![]() |
|
Marinov, Darko |
![]() Michael Hilton, Jonathan Bell, and Darko Marinov (Carnegie Mellon University, USA; George Mason University, USA; University of Illinois at Urbana-Champaign, USA) Statement coverage is commonly used as a measure of test suite quality. Coverage is often used as a part of a code review process: if a patch decreases overall coverage, or is itself not covered, then the patch is scrutinized more closely. Traditional studies of how coverage changes with code evolution have examined the overall coverage of the entire program, and more recent work directly examines the coverage of patches (changed statements). We present an evaluation much larger than prior studies and moreover consider a new, important kind of change --- coverage changes of unchanged statements. We present a large-scale evaluation of code coverage evolution over 7,816 builds of 47 projects written in popular languages including Java, Python, and Scala. We find that in large, mature projects, simply measuring the change to statement coverage does not capture the nuances of code evolution. Going beyond considering statement coverage as a simple ratio, we examine how the set of statements covered evolves between project revisions. We present and study new ways to assess the impact of a patch on a project's test suite quality that both separates coverage of the patch from coverage of the non-patch, and separates changes in coverage from changes in the set of statements covered. ![]() ![]() |
|
Markman, Martin |
![]() Davrondzhon Gafurov, Arne Erik Hurum, and Martin Markman (Norwegian Directorate of eHealth, Norway) We present a process driven test automation solution which enables delegating (part of) automation tasks from test automation engineer (expensive resource) to test analyst (non-developer, less expensive). In our approach, a test automation engineer implements test steps (or actions) which are executed automatically. Such automated test steps represent user actions in the system under test and specified by a natural language which is understandable by a non-technical person. Then, a test analyst with a domain knowledge organizes automated steps combined with test input to create an automated test case. It should be emphasized that the test analyst does not need to possess programming skills to create, modify or execute automated test cases. We refine benchmark test automation architecture to be better suitable for an effective separation and sharing of responsibilities between the test automation engineer (with coding skills) and test analyst (with a domain knowledge). In addition, we propose a metric to empirically estimate cooperation between test automation engineer and test analyst's works. The proposed automation solution has been defined based on our experience in the development and maintenance of Helsenorg, the national electronic health services in Norway which has had over one million of visits per month past year, and we still use it to automate the execution of regression tests. ![]() ![]() |
|
Maro, Salome |
![]() Mukelabai Mukelabai, Damir Nešić, Salome Maro, Thorsten Berger, and Jan-Philipp Steghöfer (Chalmers University of Technology, Sweden; University of Gothenburg, Sweden; KTH, Sweden) Highly configurable systems are complex pieces of software. To tackle this complexity, hundreds of dedicated analysis techniques have been conceived, many of which able to analyze system properties for all possible system configurations, as opposed to traditional, single-system analyses. Unfortunately, it is largely unknown whether these techniques are adopted in practice, whether they address actual needs, or what strategies practitioners actually apply to analyze highly configurable systems. We present a study of analysis practices and needs in industry. It relied on a survey with 27 practitioners engineering highly configurable systems and follow-up interviews with 15 of them, covering 18 different companies from eight countries. We confirm that typical properties considered in the literature (e.g., reliability) are relevant, that consistency between variability models and artifacts is critical, but that the majority of analyses for specifications of configuration options (a.k.a., variability model analysis) is not perceived as needed. We identified rather pragmatic analysis strategies, including practices to avoid the need for analysis. For instance, testing with experience-based sampling is the most commonly applied strategy, while systematic sampling is rarely applicable. We discuss analyses that are missing and synthesize our insights into suggestions for future research. ![]() ![]() ![]() |
|
Mazinanian, Davood |
![]() Mohammad Bajammal, Davood Mazinanian, and Ali Mesbah (University of British Columbia, Canada) The transformation of a user interface mockup designed by a graphic designer to web components in the final app built by a web developer is often laborious, involving manual and time consuming steps. We propose an approach to automate this aspect of web development by generating reusable web components from a mockup. Our approach employs visual analysis of the mockup, and unsupervised learning of visual cues to create reusable web components (e.g., React components). We evaluated our approach, implemented in a tool called VizMod, on five real-world web mockups, and assessed the transformations and generated components through comparison with web development experts. The results show that VizMod achieves on average 94% precision and 75% recall in terms of agreement with the developers' assessment. Furthermore, the refactorings yielded 22% code reusability, on average. ![]() ![]() |
|
McIntosh, Shane |
![]() Keheliya Gallaba, Christian Macho, Martin Pinzger, and Shane McIntosh (McGill University, Canada; University of Klagenfurt, Austria) Automated builds, which may pass or fail, provide feedback to a development team about changes to the codebase. A passing build indicates that the change compiles cleanly and tests (continue to) pass. A failing (a.k.a., broken) build indicates that there are issues that require attention. Without a closer analysis of the nature of build outcome data, practitioners and researchers are likely to make two critical assumptions: (1) build results are not noisy; however, passing builds may contain failing or skipped jobs that are actively or passively ignored; and (2) builds are equal; however, builds vary in terms of the number of jobs and configurations. To investigate the degree to which these assumptions about build breakage hold, we perform an empirical study of 3.7 million build jobs spanning 1,276 open source projects. We find that: (1) 12% of passing builds have an actively ignored failure; (2) 9% of builds have a misleading or incorrect outcome on average; and (3) at least 44% of the broken builds contain passing jobs, i.e., the breakage is local to a subset of build variants. Like other software archives, build data is noisy and complex. Analysis of build data requires nuance. ![]() ![]() |
|
Medvidović, Nenad |
![]() Yixue Zhao, Paul Wat, Marcelo Schmitt Laser, and Nenad Medvidović (University of Southern California, USA) Network latency in mobile software has a large impact on user experience, with potentially severe economic consequences. Prefetching and caching have been shown effective in reducing the latencies in browser-based systems. However, those techniques cannot be directly applied to the emerging domain of mobile apps because of the differences in network interactions. Moreover, there is a lack of research on prefetching and caching techniques that may be suitable for the mobile app domain, and it is not clear whether such techniques can be effective or whether they are even feasible. This paper takes the first step toward answering these questions by conducting a comprehensive study to understand the characteristics of HTTP requests in over 1,000 popular Android apps. Our work focuses on the prefetchability of requests using static program analysis techniques and cacheability of resulting responses. We find that there is a substantial opportunity to leverage prefetching and caching in mobile apps, but that suitable techniques must take into account the nature of apps’ network interactions and idiosyncrasies such as untrustworthy HTTP header information. Our observations provide guidelines for developers to utilize prefetching and caching schemes in app development, and motivate future research in this area. ![]() ![]() |
|
Meng, Guozhu |
![]() Lingling Fan, Ting Su, Sen Chen, Guozhu Meng, Yang Liu, Lihua Xu, and Geguang Pu (East China Normal University, China; Nanyang Technological University, Singapore; Chinese Academy of Sciences, China; New York University Shanghai, China) Android, the #1 mobile app framework, enforces the single-GUI-thread model, in which a single UI thread manages GUI rendering and event dispatching. Due to this model, it is vital to avoid blocking the UI thread for responsiveness. One common practice is to offload long-running tasks into async threads. To achieve this, Android provides various async programming constructs, and leaves evelopers themselves to obey the rules implied by the model. However, as our study reveals, more than 25% apps violate these rules and introduce hard-to-detect, fail-stop errors, which we term as aysnc programming errors (APEs). To this end, this paper introduces APEChecker, a technique to automatically and efficiently manifest APEs. The key idea is to characterize APEs as specific fault patterns, and synergistically combine static analysis and dynamic UI exploration to detect and verify such errors. Among the 40 real-world Android apps, APEChecker unveils and processes 61 APEs, of which 51 are confirmed (83.6% hit rate). Specifically, APEChecker detects 3X more APEs than the state-of-art testing tools (Monkey, Sapienz and Stoat), and reduces testing time from half an hour to a few minutes. On a specific type of APEs, APEChecker confirms 5X more errors than the data race detection tool, EventRacer, with very few false alarms. ![]() ![]() |
|
Mesbah, Ali |
![]() Mohammad Bajammal, Davood Mazinanian, and Ali Mesbah (University of British Columbia, Canada) The transformation of a user interface mockup designed by a graphic designer to web components in the final app built by a web developer is often laborious, involving manual and time consuming steps. We propose an approach to automate this aspect of web development by generating reusable web components from a mockup. Our approach employs visual analysis of the mockup, and unsupervised learning of visual cues to create reusable web components (e.g., React components). We evaluated our approach, implemented in a tool called VizMod, on five real-world web mockups, and assessed the transformations and generated components through comparison with web development experts. The results show that VizMod achieves on average 94% precision and 75% recall in terms of agreement with the developers' assessment. Furthermore, the refactorings yielded 22% code reusability, on average. ![]() ![]() |
|
Mezini, Mira |
![]() Dominik Helm, Florian Kübler, Michael Eichberg, Michael Reif, and Mira Mezini (TU Darmstadt, Germany) Analyzing methods in object-oriented programs whether they are side-effect free and also deterministic, i.e., mathematically pure, has been the target of extensive research. Identifying such methods helps to find code smells and security related issues, and also helps analyses detecting concurrency bugs. Pure methods are also used by formal verification approaches as the foundations for specifications and proving the pureness is necessary to ensure correct specifications. However, so far no common terminology exists which describes the purity of methods. Furthermore, some terms (e.g., pure or side-effect free) are also used inconsistently. Further, all current approaches only report selected purity information making them only suitable for a smaller subset of the potential use cases. In this paper, we present a fine-grained unified lattice model which puts the purity levels found in the literature into relation and which adds a new level that generalizes existing definitions. We have also implemented a scalable, modularized purity analysis which produces significantly more precise results for real-world programs than the best-performing related work. The analysis shows that all defined levels are found in real-world projects. ![]() ![]() |
|
Mo, Ran |
![]() Ran Mo, Will Snipes, Yuanfang Cai, Srini Ramaswamy, Rick Kazman, and Martin Naedele (Central China Normal University, China; ABB Corporate Research, USA; Drexel University, USA; ABB, USA; Carnegie Mellon University, USA; University of Hawaii, USA; ABB, Switzerland) In this paper, we report our experiences of applying three complementary automated software architecture analysis techniques, supported by a tool suite, called DV8, to 8 industrial projects within a large company. DV8 includes two state-of-the-art architecture-level maintainability metrics—Decoupling Level and Propagation Cost, an architecture flaw detection tool, and an architecture root detection tool. We collected development process data from the project teams as input to these tools, reported the results back to the practitioners, and followed up with telephone conferences and interviews. Our experiences revealed that the metrics scores, quantitative debt analysis, and architecture flaw visualization can effectively bridge the gap between management and development, help them decide if, when, and where to refactor. In particular, the metrics scores, compared against industrial benchmarks, faithfully reflected the practitioners’ intuitions about the maintainability of their projects, and enabled them to better understand the maintainability relative to other projects internal to their company, and to other industrial products. The automatically detected architecture flaws and roots enabled the practitioners to precisely pinpoint, visualize, and quantify the “hotspots" within the systems that are responsible for high maintenance costs. Except for the two smallest projects for which both architecture metrics indicated high maintainability, all other projects are planning or have already begun refactorings to address the problems detected by our analyses. We are working on further automating the tool chain, and transforming the analysis suite into deployable services accessible by all projects within the company. ![]() ![]() |
|
Molina, Facundo |
![]() Renzo Degiovanni, Facundo Molina, Germán Regis, and Nazareno Aguirre (National University of Río Cuarto, Argentina; CONICET, Argentina) Goal-conflict analysis has been widely used as an abstraction for risk analysis in goal-oriented requirements engineering approaches. In this context, where the expected behaviour of the system-to-be is captured in terms of domain properties and goals, identifying combinations of circumstances that may make the goals diverge, i.e., not to be satisfied as a whole, is of most importance. Various approaches have been proposed in order to automatically identify boundary conditions, i.e., formulas capturing goal-divergent situations, but they either apply only to some specific goal expressions, or are affected by scalability issues that make them applicable only to relatively small specifications. In this paper, we present a novel approach to automatically identify boundary conditions, using evolutionary computation. More precisely, we develop a genetic algorithm that, given the LTL formulation of the domain properties and the goals, it searches for formulas that capture divergences in the specification. We exploit a modern LTL satisfiability checker to successfully guide our genetic algorithm to the solutions. We assess our technique on a set of case studies, and show that our genetic algorithm is able to find boundary conditions that cannot be generated by related approaches, and is able to efficiently scale to LTL specifications that other approaches are unable to deal with. ![]() ![]() ![]() |
|
Monperrus, Martin |
![]() Oscar Luis Vera-Pérez, Martin Monperrus, and Benoit Baudry (Inria, France; KTH, Sweden) Descartes is a tool that implements extreme mutation operators and aims at finding pseudo-tested methods in Java projects. It leverages the efficient transformation and runtime features of PITest. The demonstration compares Descartes with Gregor, the default mutation engine provided by PITest, in a set of real open source projects. It considers the execution time, number of mutants created and the relationship between the mutation scores produced by both engines. It provides some insights on the main features exposed byDescartes. ![]() ![]() ![]() |
|
Monteiro, Felipe R. |
![]() Felipe R. Monteiro, Mário A. P. Garcia, Lucas C. Cordeiro, and Eddie B. de Lima Filho (Federal University of Amazonas, Brazil; University of Manchester, UK; TPV Technology, Brazil) This work proposes an abstraction of the Qt framework, named as Qt Operational Model (QtOM), which is integrated into two different verification approaches: explicit-state model checking and symbolic (bounded) model checking. The proposed methodology is the first one to formally verify Qt-based applications, which has the potential to devise new directions for software verification of portable code. The full version of this paper is published in Software Testing, Verification and Reliability, on 02 March 2017 and it is available at https://doi.org/10.1002/stvr.1632. ![]() ![]() ![]() ![]() ![]() Mikhail R. Gadelha, Felipe R. Monteiro, Jeremy Morse, Lucas C. Cordeiro, Bernd Fischer, and Denis A. Nicole (University of Southampton, UK; Federal University of Amazonas, Brazil; University of Bristol, UK; University of Manchester, UK; Stellenbosch University, South Africa) ESBMC is a mature, permissively licensed open-source context-bounded model checker for the verification of single- and multi-threaded C programs. It can verify both predefined safety properties (e.g., bounds check, pointer safety, overflow) and user-defined program assertions automatically. ESBMC provides C++ and Python APIs to access internal data structures, allowing inspection and extension at any stage of the verification process. We discuss improvements over previous versions of ESBMC, including the description of new front- and back-ends, IEEE floating-point support, and an improved k-induction algorithm. A demonstration is available at https://www.youtube.com/watch?v=YcJjXHlN1v8. ![]() ![]() ![]() ![]() |
|
Mora, Federico |
![]() Federico Mora, Yi Li, Julia Rubin, and Marsha Chechik (University of Toronto, Canada; University of British Columbia, Canada) Software is often built by integrating components created by different teams or even different organizations. With little understanding of changes in dependent components, it is challenging to maintain correctness and robustness of the entire system. In this paper, we investigate the effect of component changes on the behavior of their clients. We observe that changes in a component are often irrelevant to a particular client and thus can be adopted without any delays or negative effects. Following this observation, we formulate the notion of client-specific equivalence checking (CSE) and develop an automated technique optimized for checking such equivalence. We evaluate our technique on a set of benchmarks, including those from the existing literature on equivalence checking, and show its applicability and effectiveness. ![]() ![]() |
|
Moran, Kevin |
![]() Kevin Moran, Cody Watson, John Hoskins, George Purnell, and Denys Poshyvanyk (College of William and Mary, USA) Mobile applications have become a popular software development domain in recent years due in part to a large user base, capable hardware, and accessible platforms. However, mobile developers also face unique challenges, including pressure for frequent releases to keep pace with rapid platform evolution, hardware iteration, and user feedback. Due to this rapid pace of evolution, developers need automated support for documenting the changes made to their apps in order to aid in program comprehension. One of the more challenging types of changes to document in mobile apps are those made to the graphical user interface (GUI) due to its abstract, pixel-based representation. In this paper, we present a fully automated approach, called GCAT, for detecting and summarizing GUI changes during the evolution of mobile apps. GCAT leverages computer vision techniques and natural language generation to accurately and concisely summarize changes made to the GUI of a mobile app between successive commits or releases. We evaluate the performance of our approach in terms of its precision and recall in detecting GUI changes compared to developer specified changes, and investigate the utility of the generated change reports in a controlled user study. Our results indicate that GCAT is capable of accurately detecting and classifying GUI changes - outperforming developers - while providing useful documentation. ![]() ![]() |
|
Morse, Jeremy |
![]() Mikhail R. Gadelha, Felipe R. Monteiro, Jeremy Morse, Lucas C. Cordeiro, Bernd Fischer, and Denis A. Nicole (University of Southampton, UK; Federal University of Amazonas, Brazil; University of Bristol, UK; University of Manchester, UK; Stellenbosch University, South Africa) ESBMC is a mature, permissively licensed open-source context-bounded model checker for the verification of single- and multi-threaded C programs. It can verify both predefined safety properties (e.g., bounds check, pointer safety, overflow) and user-defined program assertions automatically. ESBMC provides C++ and Python APIs to access internal data structures, allowing inspection and extension at any stage of the verification process. We discuss improvements over previous versions of ESBMC, including the description of new front- and back-ends, IEEE floating-point support, and an improved k-induction algorithm. A demonstration is available at https://www.youtube.com/watch?v=YcJjXHlN1v8. ![]() ![]() ![]() ![]() |
|
Mukelabai, Mukelabai |
![]() Mukelabai Mukelabai, Damir Nešić, Salome Maro, Thorsten Berger, and Jan-Philipp Steghöfer (Chalmers University of Technology, Sweden; University of Gothenburg, Sweden; KTH, Sweden) Highly configurable systems are complex pieces of software. To tackle this complexity, hundreds of dedicated analysis techniques have been conceived, many of which able to analyze system properties for all possible system configurations, as opposed to traditional, single-system analyses. Unfortunately, it is largely unknown whether these techniques are adopted in practice, whether they address actual needs, or what strategies practitioners actually apply to analyze highly configurable systems. We present a study of analysis practices and needs in industry. It relied on a survey with 27 practitioners engineering highly configurable systems and follow-up interviews with 15 of them, covering 18 different companies from eight countries. We confirm that typical properties considered in the literature (e.g., reliability) are relevant, that consistency between variability models and artifacts is critical, but that the majority of analyses for specifications of configuration options (a.k.a., variability model analysis) is not perceived as needed. We identified rather pragmatic analysis strategies, including practices to avoid the need for analysis. For instance, testing with experience-based sampling is the most commonly applied strategy, while systematic sampling is rarely applicable. We discuss analyses that are missing and synthesize our insights into suggestions for future research. ![]() ![]() ![]() |
|
Müller, Peter |
![]() Alexandra Bugariu, Valentin Wüstholz, Maria Christakis, and Peter Müller (ETH Zurich, Switzerland; MPI-SWS, Germany) Static program analyses are routinely applied as the basis of code optimizations and to detect safety and security issues in software systems. For their results to be reliable, static analyses should be sound (i.e., should not produce false negatives) and precise (i.e., should report a low number of false positives). Even though it is possible to prove properties of the design of a static analysis, ensuring soundness and precision for its implementation is challenging. Complex algorithms and sophisticated optimizations make static analyzers difficult to implement and test. In this paper, we present an automatic technique to test, among other properties, the soundness and precision of abstract domains, the core of all static analyzers based on abstract interpretation. In order to cover a wide range of test data and input states, we construct inputs by applying sequences of abstract-domain operations to representative domain elements, and vary the operations through gray-box fuzzing. We use mathematical properties of abstract domains as test oracles. Our experimental evaluation demonstrates the effectiveness of our approach. We detected several previously unknown soundness and precision errors in widely-used abstract domains. Our experiments also show that our approach is more effective than dynamic symbolic execution and than fuzzing the test inputs directly. ![]() ![]() |
|
Murphy, Gail C. |
![]() Gail C. Murphy (University of British Columbia, Canada) The development of a software system requires the orchestration of many different people using many different tools. Despite the need for a developer who is performing a task to understand the context in which that task is undertaken, the tools we imagine, build and provide to support software developers are typically isolated. Instead of the tools helping a developer work within the appropriate context, it is the developer who must bring the context to the tools. In this talk, I will argue that the lack of context in the software engineering tools we build limits the effectiveness of developers and of our software development practices. ![]() ![]() |
|
Naedele, Martin |
![]() Ran Mo, Will Snipes, Yuanfang Cai, Srini Ramaswamy, Rick Kazman, and Martin Naedele (Central China Normal University, China; ABB Corporate Research, USA; Drexel University, USA; ABB, USA; Carnegie Mellon University, USA; University of Hawaii, USA; ABB, Switzerland) In this paper, we report our experiences of applying three complementary automated software architecture analysis techniques, supported by a tool suite, called DV8, to 8 industrial projects within a large company. DV8 includes two state-of-the-art architecture-level maintainability metrics—Decoupling Level and Propagation Cost, an architecture flaw detection tool, and an architecture root detection tool. We collected development process data from the project teams as input to these tools, reported the results back to the practitioners, and followed up with telephone conferences and interviews. Our experiences revealed that the metrics scores, quantitative debt analysis, and architecture flaw visualization can effectively bridge the gap between management and development, help them decide if, when, and where to refactor. In particular, the metrics scores, compared against industrial benchmarks, faithfully reflected the practitioners’ intuitions about the maintainability of their projects, and enabled them to better understand the maintainability relative to other projects internal to their company, and to other industrial products. The automatically detected architecture flaws and roots enabled the practitioners to precisely pinpoint, visualize, and quantify the “hotspots" within the systems that are responsible for high maintenance costs. Except for the two smallest projects for which both architecture metrics indicated high maintainability, all other projects are planning or have already begun refactorings to address the problems detected by our analyses. We are working on further automating the tool chain, and transforming the analysis suite into deployable services accessible by all projects within the company. ![]() ![]() |
|
Nagashima, Yutaka |
![]() Yutaka Nagashima and Yilun He (Czech Technical University, Czechia; University of Innsbruck, Austria; University of Sydney, Australia) Deciding which sub-tool to use for a given proof state requires expertise specific to each interactive theorem prover (ITP). To mitigate this problem, we present PaMpeR, a proof method recommendation system for Isabelle/HOL. Given a proof state, PaMpeR recommends proof methods to discharge the proof goal and provides qualitative explanations as to why it suggests these methods. PaMpeR generates these recommendations based on existing hand-written proof corpora, thus transferring experienced users’ expertise to new users. Our evaluation shows that PaMpeR correctly predicts experienced users’ proof methods invocation especially when it comes to special purpose proof methods. ![]() ![]() |
|
Nejati, Shiva |
![]() Raja Ben Abdessalem, Annibale Panichella, Shiva Nejati, Lionel C. Briand, and Thomas Stifter (University of Luxembourg, Luxembourg; Delft University of Technology, Netherlands; IEE, Luxembourg) Complex systems such as autonomous cars are typically built as a composition of features that are independent units of functionality. Features tend to interact and impact one another's behavior in unknown ways. A challenge is to detect and manage feature interactions, in particular, those that violate system requirements, hence leading to failures. In this paper, we propose a technique to detect feature interaction failures by casting this problem into a search-based test generation problem. We define a set of hybrid test objectives (distance functions) that combine traditional coverage-based heuristics with new heuristics specifically aimed at revealing feature interaction failures. We develop a new search-based test generation algorithm, called FITEST, that is guided by our hybrid test objectives. FITEST extends recently proposed many-objective evolutionary algorithms to reduce the time required to compute fitness values. We evaluate our approach using two versions of an industrial self-driving system. Our results show that our hybrid test objectives are able to identify more than twice as many feature interaction failures as two baseline test objectives used in the software testing literature (i.e., coverage-based and failure-based test objectives). Further, the feedback from domain experts indicates that the detected feature interaction failures represent real faults in their systems that were not previously identified based on analysis of the system features and their requirements. ![]() ![]() |
|
Nešić, Damir |
![]() Mukelabai Mukelabai, Damir Nešić, Salome Maro, Thorsten Berger, and Jan-Philipp Steghöfer (Chalmers University of Technology, Sweden; University of Gothenburg, Sweden; KTH, Sweden) Highly configurable systems are complex pieces of software. To tackle this complexity, hundreds of dedicated analysis techniques have been conceived, many of which able to analyze system properties for all possible system configurations, as opposed to traditional, single-system analyses. Unfortunately, it is largely unknown whether these techniques are adopted in practice, whether they address actual needs, or what strategies practitioners actually apply to analyze highly configurable systems. We present a study of analysis practices and needs in industry. It relied on a survey with 27 practitioners engineering highly configurable systems and follow-up interviews with 15 of them, covering 18 different companies from eight countries. We confirm that typical properties considered in the literature (e.g., reliability) are relevant, that consistency between variability models and artifacts is critical, but that the majority of analyses for specifications of configuration options (a.k.a., variability model analysis) is not perceived as needed. We identified rather pragmatic analysis strategies, including practices to avoid the need for analysis. For instance, testing with experience-based sampling is the most commonly applied strategy, while systematic sampling is rarely applicable. We discuss analyses that are missing and synthesize our insights into suggestions for future research. ![]() ![]() ![]() |
|
Ng, Vincent |
![]() Xiaoyu Liu, LiGuo Huang, and Vincent Ng (Southern Methodist University, USA; University of Texas at Dallas, USA) It is time-consuming and labor-intensive to learn and locate the correct API for programming tasks. Thus, it is beneficial to perform API recommendation automatically. The graph-based statistical model has been shown to recommend top-10 API candidates effectively. It falls short, however, in accurately recommending an actual top-1 API. To address this weakness, we propose RecRank, an approach and tool that applies a novel ranking-based discriminative approach leveraging API usage path features to improve top-1 API recommendation. Empirical evaluation on a large corpus of (1385+8) open source projects shows that RecRank significantly improves top-1 API recommendation accuracy and mean reciprocal rank when compared to state-of-the-art API recommendation approaches. ![]() ![]() |
|
Nicole, Denis A. |
![]() Mikhail R. Gadelha, Felipe R. Monteiro, Jeremy Morse, Lucas C. Cordeiro, Bernd Fischer, and Denis A. Nicole (University of Southampton, UK; Federal University of Amazonas, Brazil; University of Bristol, UK; University of Manchester, UK; Stellenbosch University, South Africa) ESBMC is a mature, permissively licensed open-source context-bounded model checker for the verification of single- and multi-threaded C programs. It can verify both predefined safety properties (e.g., bounds check, pointer safety, overflow) and user-defined program assertions automatically. ESBMC provides C++ and Python APIs to access internal data structures, allowing inspection and extension at any stage of the verification process. We discuss improvements over previous versions of ESBMC, including the description of new front- and back-ends, IEEE floating-point support, and an improved k-induction algorithm. A demonstration is available at https://www.youtube.com/watch?v=YcJjXHlN1v8. ![]() ![]() ![]() ![]() |
|
Noller, Yannic |
![]() Yannic Noller (Humboldt-Universität zu Berlin, Germany) Differential program analysis means to identify the behavioral divergences in one or multiple programs, and it can be classified into two categories: identify the behavioral divergences (1) between two program versions for the same input (aka regression analysis), and (2) for the same program with two different inputs (e.g, side-channel analysis). Most of the existent approaches for both subproblems try to solve it with single techniques, which suffer from its weaknesses like scalability issues or imprecision. This research proposes to combine two very strong techniques, namely fuzzing and symbolic execution to tackle these problems and provide scalable solutions for real-world applications. The proposed approaches will be implemented on top of state-of-the-art tools like AFL and Symbolic PathFinder to evaluate them against existent work. ![]() ![]() |
|
Oh, Hakjoo |
![]() Sooyoung Cha, Seonho Lee, and Hakjoo Oh (Korea University, South Korea) We present template-guided concolic testing, a new technique for effectively reducing the search space in concolic testing. Addressing the path-explosion problem has been a significant challenge in concolic testing. Diverse search heuristics have been proposed to mitigate this problem but using search heuristics alone is not sufficient to substantially improve code coverage for real-world programs. The goal of this paper is to complement existing techniques and achieve higher coverage by exploiting templates in concolic testing. In our approach, a template is a partially symbolized input vector whose job is to reduce the search space. However, choosing a right set of templates is nontrivial and significantly affects the final performance of our approach. We present an algorithm that automatically learns useful templates online, based on data collected from previous runs of concolic testing. The experimental results with open-source programs show that our technique achieves greater branch coverage and finds bugs more effectively than conventional concolic testing. ![]() ![]() |
|
Oliveto, Rocco |
![]() Simone Scalabrino, Giovanni Grano, Dario Di Nucci, Michele Guerra, Andrea De Lucia, Harald C. Gall, and Rocco Oliveto (University of Molise, Italy; University of Zurich, Switzerland; Vrije Universiteit Brussel, Belgium; University of Salerno, Italy) Automatically generating test cases plays an important role to reduce the time spent by developers during the testing phase. In last years, several approaches have been proposed to tackle such a problem: amongst others, search-based techniques have been shown to be particularly promising. In this paper we describe Ocelot, a search-based tool for the automatic generation of test cases in C. Ocelot allows practitioners to write skeletons of test cases for their programs and researchers to easily implement and experiment new approaches for automatic test-data generation. We show that Ocelot achieves a higher coverage compared to a competitive tool in 81% of the cases. Ocelot is publicly available to support both researchers and practitioners. ![]() ![]() |
|
Ore, John-Paul |
![]() John-Paul Ore, Sebastian Elbaum, Carrick Detweiler, and Lambros Karkazis (University of Nebraska-Lincoln, USA) Type annotations provide a link between program variables and domain-specific types. When combined with a type system, these annotations can enable early fault detection. For type annotations to be cost-effective in practice, they need to be both accurate and affordable for developers. We lack, however, an understanding of how burdensome type annotation is for developers. Hence, this work explores three fundamental questions: 1) how accurately do developers make type annotations; 2) how long does a single annotation take; and, 3) if a system could automatically suggest a type annotation, how beneficial to accuracy are correct suggestions and how detrimental are incorrect suggestions? We present results of a study of 71 programmers using 20 random code artifacts that contain variables with physical unit types that must be annotated. Subjects choose a correct type annotation only 51% of the time and take an average of 136 seconds to make a single correct annotation. Our qualitative analysis reveals that variable names and reasoning over mathematical operations are the leading clues for type selection. We find that suggesting the correct type boosts accuracy to 73%, while making a poor suggestion decreases accuracy to 28%. We also explore what state-of-the-art automated type annotation systems can and cannot do to help developers with type annotations, and identify implications for tool developers. ![]() ![]() |
|
Ozdemir, Alim |
![]() Alim Ozdemir, Ayse Tosun, Hakan Erdogmus, and Rui Abreu (Istanbul Technical University, Turkey; Carnegie Mellon University, USA; University of Lisbon, Portugal) Existing tools for monitoring the quality of codebases modified by multiple developers tend to be centralized and inflexible. These tools increase the visibility of quality by producing effective reports and visualizations when a change is made to the codebase and triggering alerts when undesirable situations occur. However, their configuration is invariably both (a) centrally managed in that individual maintainers cannot define local rules to receive customized feedback when a change occurs in a specific part of the code in which they are particularly interested, and (b) coarse-grained in that analyses cannot be turned on and off below the file level. Triggr, the tool proposed in this paper, addresses these limitations by allowing distributed, customized, and fine-grained monitoring. It is a lightweight re-implementation of our previous tool, CodeAware, which adopts the same paradigm. The tool listens on a codebase’s shared repository using an event-based approach, and can send alerts to subscribed developers based on rules defined locally by them. Triggr is open-source and available at https://github.com/lyzerk/Triggr. A demonstration video can be found at https://youtu.be/qQs9aDwXJjY. ![]() ![]() |
|
Palomba, Fabio |
![]() Carmine Vassallo, Fabio Palomba, Alberto Bacchelli, and Harald C. Gall (University of Zurich, Switzerland) Continuous Integration (CI) is a software engineering practice where developers constantly integrate their changes to a project through an automated build process. The goal of CI is to provide developers with prompt feedback on several quality dimensions after each change. Indeed, previous studies provided empirical evidence on a positive association between properly following CI principles and source code quality. A core principle behind CI is Continuous Code Quality (also known as CCQ, which includes automated testing and automated code inspection) may appear simple and effective, yet we know little about its practical adoption. In this paper, we propose a preliminary empirical investigation aimed at understanding how rigorously practitioners follow CCQ. Our study reveals a strong dichotomy between theory and practice: developers do not perform continuous inspection but rather control for quality only at the end of a sprint and most of the times only on the release branch. Preprint [https://doi.org/10.5281/zenodo.1341036]. Data and Materials [http://doi.org/10.5281/zenodo.1341015]. ![]() ![]() ![]() Vladimir Kovalenko, Fabio Palomba, and Alberto Bacchelli (Delft University of Technology, Netherlands; University of Zurich, Switzerland) Modern distributed version control systems, such as Git, offer support for branching — the possibility to develop parts of software outside the master trunk. Consideration of the repository structure in Mining Software Repository (MSR) studies requires a thorough approach to mining, but there is no well-documented, widespread methodology regarding the handling of merge commits and branches. Moreover, there is still a lack of knowledge of the extent to which considering branches during MSR studies impacts the results of the studies. In this study, we set out to evaluate the importance of proper handling of branches when calculating file modification histories. We analyze over 1,400 Git repositories of four open source ecosystems and compute modification histories for over two million files, using two different algorithms. One algorithm only follows the first parent of each commit when traversing the repository, the other returns the full modification history of a file across all branches. We show that the two algorithms consistently deliver different results, but the scale of the difference varies across projects and ecosystems. Further, we evaluate the importance of accurate mining of file histories by comparing the performance of common techniques that rely on file modification history — reviewer recommendation, change recommendation, and defect prediction — for two algorithms of file history retrieval. We find that considering full file histories leads to an increase in the techniques’ performance that is rather modest. ![]() ![]() ![]() |
|
Pan, Minxue |
![]() Zhenhao Tang, Juan Zhai, Minxue Pan, Yousra Aafer, Shiqing Ma, Xiangyu Zhang, and Jianhua Zhao (Nanjing University, China; Purdue University, USA) Modern Android malwares tend to use advanced techniques to cover their malicious behaviors. They usually feature multi-staged, condition-guarded and environment-specific payloads. An increasing number of them utilize WebView, particularly the two-way communications between Java and JavaScript, to evade detection and analysis of existing techniques. We propose Dual-Force, a forced execution technique which simultaneously forces both Java and JavaScript code of WebView applications to execute along various paths without requiring any environment setup or providing any inputs manually. As such, the hidden payloads of WebView malwares are forcefully exposed. The technique features a novel execution model that allows forced execution to suppress exceptions and continue execution. Experimental results show that Dual-Force precisely exposes malicious payload in 119 out of 150 WebView malwares. Compared to the state-of-the-art, Dual-Force can expose 23% more malicious behaviors. ![]() ![]() |
|
Pang, Jun |
![]() Jialun Cao, Yongjian Li, and Jun Pang (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; University of Luxembourg, Luxembourg) This demo introduces L-CMP, an automatic learning-based parameterized verification tool. It can verify parameterized protocols by combining machine learning and model checking techniques. Given a parameterized protocol, L-CMP learns a set of auxiliary invariants and implements verification of the protocol using the invariants automatically. In particular, the learned auxiliary invariants are straightforward and readable. The experimental results show that L-CMP can successfully verify a number of cache coherence protocols, including the industrial-scale FLASH protocol. The video is available at https://youtu.be/6Dl2HiiiS4E, and L-CMPL-CMP can be downloaded at https://github.com/ ArabelaTso/Learning-Based-ParaVerifer. ![]() ![]() |
|
Panichella, Annibale |
![]() Raja Ben Abdessalem, Annibale Panichella, Shiva Nejati, Lionel C. Briand, and Thomas Stifter (University of Luxembourg, Luxembourg; Delft University of Technology, Netherlands; IEE, Luxembourg) Complex systems such as autonomous cars are typically built as a composition of features that are independent units of functionality. Features tend to interact and impact one another's behavior in unknown ways. A challenge is to detect and manage feature interactions, in particular, those that violate system requirements, hence leading to failures. In this paper, we propose a technique to detect feature interaction failures by casting this problem into a search-based test generation problem. We define a set of hybrid test objectives (distance functions) that combine traditional coverage-based heuristics with new heuristics specifically aimed at revealing feature interaction failures. We develop a new search-based test generation algorithm, called FITEST, that is guided by our hybrid test objectives. FITEST extends recently proposed many-objective evolutionary algorithms to reduce the time required to compute fitness values. We evaluate our approach using two versions of an industrial self-driving system. Our results show that our hybrid test objectives are able to identify more than twice as many feature interaction failures as two baseline test objectives used in the software testing literature (i.e., coverage-based and failure-based test objectives). Further, the feedback from domain experts indicates that the detected feature interaction failures represent real faults in their systems that were not previously identified based on analysis of the system features and their requirements. ![]() ![]() |
|
Patwardhan, Manasi |
![]() Manasi Patwardhan, Abhishek Sainani, Richa Sharma, Shirish Karande, and Smita Ghaisas (TCS Research, India) Compliant software is a critical need of all modern businesses. Disambiguating regulations to derive requirements is therefore an important software engineering activity. Regulations however are ridden with ambiguities that make their comprehension a challenge, seemingly surmountable only by legal experts. Since legal experts' involvement in every project is expensive, approaches to automate the disambiguation need to be explored. These approaches however require a large amount of annotated data. Collecting data exclusively from experts is not a scalable and affordable solution. In this paper, we present the results of a crowd sourcing experiment to collect annotations on ambiguities in regulations from professional software engineers. We discuss an approach to automate the arduous and critical step of identifying ground truth labels by employing crowd consensus using Expectation Maximization (EM). We demonstrate that the annotations reaching a consensus match those of experts with an accuracy of 87%. ![]() ![]() |
|
Paulsen, Brandon |
![]() Chungha Sung, Brandon Paulsen, and Chao Wang (University of Southern California, USA) A unified modeling framework for non-functional properties of a program is essential for research in software analysis and verification, since it reduces burdens on individual researchers to implement new approaches and compare existing approaches. We present CANAL, a framework that models the cache behaviors of a program by transforming its intermediate representation in the LLVM compiler. CANAL inserts auxiliary variables and instructions over these variables, to allow standard verification tools to handle a new class of cache related properties, e.g., for computing the worst-case execution time and detecting side-channel leaks. We demonstrate the effectiveness of using three verification tools: KLEE, SMACK and Crab-llvm. We confirm the accuracy of our cache model by comparing with CPU cycle-accurate simulation results of GEM5. CANAL is available on GitHub(https://github.com/canalcache/canal) and YouTube(https://youtu.be/JDou3F1j2nY). ![]() ![]() ![]() ![]() |
|
Peng, Xin |
![]() Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Wenhai Li, Chao Ji, and Dan Ding (Fudan University, China; University of Illinois at Urbana-Champaign, USA; Singapore University of Technology and Design, Singapore) Debugging microservice systems involves the deployment and manipulation of microservice systems on a containerized environment and faces unique challenges due to the high complexity and dynamism of microservices. To address these challenges, in this paper, we propose a debugging approach for microservice systems based on the delta debugging algorithm, which is to minimize failureinducing deltas of circumstances (e.g., deployment, environmental configurations) for effective debugging. Our approach includes novel techniques for defining, deploying/manipulating, and executing deltas following the idea of delta debugging. In particular, to construct a (failing) circumstance space for delta debugging to minimize, our approach defines a set of dimensions that can affect the execution of microservice systems. Our experimental study on a medium-size microservice benchmark system shows that our approach can effectively identify failure-inducing deltas that help diagnose the root causes. ![]() ![]() ![]() Kaifeng Huang, Bihuan Chen, Xin Peng, Daihong Zhou, Ying Wang, Yang Liu, and Wenyun Zhao (Fudan University, China; Nanyang Technological University, Singapore) Analyzing and understanding source code changes is important in a variety of software maintenance tasks. To this end, many code differencing and code change summarization methods have been proposed. For some tasks (e.g. code review and software merging), however, those differencing methods generate too fine-grained a representation of code changes, and those summarization methods generate too coarse-grained a representation of code changes. Moreover, they do not consider the relationships among code changes. Therefore, the generated differences or summaries make it not easy to analyze and understand code changes in some software maintenance tasks. In this paper, we propose a code differencing approach, named CLDIFF, to generate concise linked code differences whose granularity is in between the existing code differencing and code change summarization methods. The goal of CLDIFF is to generate more easily understandable code differences. CLDIFF takes source code files before and after changes as inputs, and consists of three steps. First, it pre-processes the source code files by pruning unchanged declara- tions from the parsed abstract syntax trees. Second, it generates concise code differences by grouping fine-grained code differences at or above the statement level and describing high-level changes in each group. Third, it links the related concise code differences according to five pre-defined links. Experiments with 12 Java projects (74,387 commits) and a human study with 10 participants have indicated the accuracy, conciseness, performance and usefulness of CLDIFF. ![]() ![]() |
|
Pereira, Rui |
![]() Rui Pereira, Pedro Simão, Jácome Cunha, and João Saraiva (INESC TEC, Portugal; University of Minho, Portugal; NOVA-LINCS, Portugal; Universidade Nova Lisboa, Portugal) Software developers are more and more eager to understand their code’s energy performance. However, even with such knowledge it is di cult to know how to improve the code. Indeed, little tool support exists to understand the energy consumption pro le of a software system and to eventually (automatically) improve its code. In this paper we present a tool termed jStanley which automatically nds collections in Java programs that can be replaced by others with a positive impact on the energy consumption as well as on the execution time. In seconds, developers obtain information about energy-eager collection usage. jStanley will further suggest alternative collections to improve the code, making it use less time, energy, or a combination of both. The preliminary evaluation we ran using jStanley shows energy gains between 2% and 17%, and a reduction in execution time between 2% and 13%. A video can be seen at https://greensoftwarelab.github.io/jStanley. ![]() ![]() ![]() |
|
Pezzè, Mauro |
![]() Valerio Terragni and Mauro Pezzè (University of Lugano, Switzerland) Developing correct and efficient concurrent programs is difficult and error-prone, due to the complexity of thread synchronization. Often, developers alleviate such problem by relying on thread-safe classes, which encapsulate most synchronization-related challenges. Thus, testing such classes is crucial to ensure the reliability of the concurrency aspects of programs. Some recent techniques and corresponding tools tackle the problem of testing thread-safe classes by automatically generating concurrent tests. In this paper, we present a comprehensive study of the state-of-the-art techniques and an independent empirical evaluation of the publicly available tools. We conducted the study by executing all tools on the JaConTeBe benchmark that contains 47 well-documented concurrency faults. Our results show that 8 out of 47 faults (17%) were detected by at least one tool. By studying the issues of the tools and the generated tests, we derive insights to guide future research on improving the effectiveness of automated concurrent test generation. ![]() ![]() |
|
Piao, Aihua |
![]() Bingchang Liu, Wei Huo, Chao Zhang, Wenchao Li, Feng Li, Aihua Piao, and Wei Zou (Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Tsinghua University, China) Binary code similarity detection (BCSD) has many applications, including patch analysis, plagiarism detection, malware detection, and vulnerability search etc. Existing solutions usually perform comparisons over specific syntactic features extracted from binary code, based on expert knowledge. They have either high performance overheads or low detection accuracy. Moreover, few solutions are suitable for detecting similarities between cross-version binaries, which may not only diverge in syntactic structures but also diverge slightly in semantics. In this paper, we propose a solution αDiff, employing three semantic features, to address the cross-version BCSD challenge. It first extracts the intra-function feature of each binary function using a deep neural network (DNN). The DNN works directly on raw bytes of each function, rather than features (e.g., syntactic structures) provided by experts. αDiff further analyzes the function call graph of each binary, which are relatively stable in cross-version binaries, and extracts the inter-function and inter-module features. Then, a distance is computed based on these three features and used for BCSD. We have implemented a prototype of αDiff, and evaluated it on a dataset with about 2.5 million samples. The result shows that αDiff outperforms state-of-the-art static solutions by over 10 percentages on average in different BCSD settings. ![]() ![]() |
|
Pinzger, Martin |
![]() Keheliya Gallaba, Christian Macho, Martin Pinzger, and Shane McIntosh (McGill University, Canada; University of Klagenfurt, Austria) Automated builds, which may pass or fail, provide feedback to a development team about changes to the codebase. A passing build indicates that the change compiles cleanly and tests (continue to) pass. A failing (a.k.a., broken) build indicates that there are issues that require attention. Without a closer analysis of the nature of build outcome data, practitioners and researchers are likely to make two critical assumptions: (1) build results are not noisy; however, passing builds may contain failing or skipped jobs that are actively or passively ignored; and (2) builds are equal; however, builds vary in terms of the number of jobs and configurations. To investigate the degree to which these assumptions about build breakage hold, we perform an empirical study of 3.7 million build jobs spanning 1,276 open source projects. We find that: (1) 12% of passing builds have an actively ignored failure; (2) 9% of builds have a misleading or incorrect outcome on average; and (3) at least 44% of the broken builds contain passing jobs, i.e., the breakage is local to a subset of build variants. Like other software archives, build data is noisy and complex. Analysis of build data requires nuance. ![]() ![]() |
|
Poshyvanyk, Denys |
![]() Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk (College of William and Mary, USA; University of Lugano, Switzerland; University of Sannio, Italy) Millions of open-source projects with numerous bug fixes are available in code repositories. This proliferation of software development histories can be leveraged to learn how to fix common programming bugs. To explore such a potential, we perform an empirical study to assess the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects. We mine millions of bug-fixes from the change histories of GitHub repositories to extract meaningful examples of such bug-fixes. Then, we abstract the buggy and corresponding fixed code, and use them to train an Encoder-Decoder model able to translate buggy code into its fixed version. Our model is able to fix hundreds of unique buggy methods in the wild. Overall, this model is capable of predicting fixed patches generated by developers in 9% of the cases. ![]() ![]() ![]() Kevin Moran, Cody Watson, John Hoskins, George Purnell, and Denys Poshyvanyk (College of William and Mary, USA) Mobile applications have become a popular software development domain in recent years due in part to a large user base, capable hardware, and accessible platforms. However, mobile developers also face unique challenges, including pressure for frequent releases to keep pace with rapid platform evolution, hardware iteration, and user feedback. Due to this rapid pace of evolution, developers need automated support for documenting the changes made to their apps in order to aid in program comprehension. One of the more challenging types of changes to document in mobile apps are those made to the graphical user interface (GUI) due to its abstract, pixel-based representation. In this paper, we present a fully automated approach, called GCAT, for detecting and summarizing GUI changes during the evolution of mobile apps. GCAT leverages computer vision techniques and natural language generation to accurately and concisely summarize changes made to the GUI of a mobile app between successive commits or releases. We evaluate the performance of our approach in terms of its precision and recall in detecting GUI changes compared to developer specified changes, and investigate the utility of the generated change reports in a controlled user study. Our results indicate that GCAT is capable of accurately detecting and classifying GUI changes - outperforming developers - while providing useful documentation. ![]() ![]() |
|
Pradel, Michael |
![]() Andrew Habib and Michael Pradel (TU Darmstadt, Germany) Static bug detectors are becoming increasingly popular and are widely used by professional software developers. While most work on bug detectors focuses on whether they find bugs at all, and on how many false positives they report in addition to legitimate warnings, the inverse question is often neglected: How many of all real-world bugs do static bug detectors find? This paper addresses this question by studying the results of applying three widely used static bug detectors to an extended version of the Defects4J dataset that consists of 15 Java projects with 594 known bugs. To decide which of these bugs the tools detect, we use a novel methodology that combines an automatic analysis of warnings and bugs with a manual validation of each candidate of a detected bug. The results of the study show that: (i) static bug detectors find a non-negligible amount of all bugs, (ii) different tools are mostly complementary to each other, and (iii) current bug detectors miss the large majority of the studied bugs. A detailed analysis of bugs missed by the static detectors shows that some bugs could have been found by variants of the existing detectors, while others are domain-specific problems that do not match any existing bug pattern. These findings help potential users of such tools to assess their utility, motivate and outline directions for future work on static bug detection, and provide a basis for future comparisons of static bug detection with other bug finding techniques, such as manual and automated testing. ![]() ![]() ![]() ![]() Andrew Habib and Michael Pradel (TU Darmstadt, Germany) Thread-safe classes are pervasive in concurrent, object-oriented software. However, many classes lack documentation regarding their safety guarantees under multi-threaded usage. This lack of documentation forces developers who use a class in a concurrent program to either carefully inspect the implementation of the class, to conservatively synchronize all accesses to it, or to optimistically assume that the class is thread-safe. To overcome the lack of documentation, we present TSFinder, an approach to automatically classify classes as supposedly thread-safe or thread-unsafe. The key idea is to combine a lightweight static analysis that extracts a graph representation from classes with a graph-based classifier. After training the classifier with classes known to be thread-safe and thread-unsafe, it achieves an accuracy of 94.5% on previously unseen classes, enabling the approach to infer thread safety documentation with high confidence. The classifier takes about 3 seconds per class, i.e., it is efficient enough to infer documentation for many classes. ![]() ![]() ![]() |
|
Pu, Geguang |
![]() Lingling Fan, Ting Su, Sen Chen, Guozhu Meng, Yang Liu, Lihua Xu, and Geguang Pu (East China Normal University, China; Nanyang Technological University, Singapore; Chinese Academy of Sciences, China; New York University Shanghai, China) Android, the #1 mobile app framework, enforces the single-GUI-thread model, in which a single UI thread manages GUI rendering and event dispatching. Due to this model, it is vital to avoid blocking the UI thread for responsiveness. One common practice is to offload long-running tasks into async threads. To achieve this, Android provides various async programming constructs, and leaves evelopers themselves to obey the rules implied by the model. However, as our study reveals, more than 25% apps violate these rules and introduce hard-to-detect, fail-stop errors, which we term as aysnc programming errors (APEs). To this end, this paper introduces APEChecker, a technique to automatically and efficiently manifest APEs. The key idea is to characterize APEs as specific fault patterns, and synergistically combine static analysis and dynamic UI exploration to detect and verify such errors. Among the 40 real-world Android apps, APEChecker unveils and processes 61 APEs, of which 51 are confirmed (83.6% hit rate). Specifically, APEChecker detects 3X more APEs than the state-of-art testing tools (Monkey, Sapienz and Stoat), and reduces testing time from half an hour to a few minutes. On a specific type of APEs, APEChecker confirms 5X more errors than the data race detection tool, EventRacer, with very few false alarms. ![]() ![]() |
|
Purnell, George |
![]() Kevin Moran, Cody Watson, John Hoskins, George Purnell, and Denys Poshyvanyk (College of William and Mary, USA) Mobile applications have become a popular software development domain in recent years due in part to a large user base, capable hardware, and accessible platforms. However, mobile developers also face unique challenges, including pressure for frequent releases to keep pace with rapid platform evolution, hardware iteration, and user feedback. Due to this rapid pace of evolution, developers need automated support for documenting the changes made to their apps in order to aid in program comprehension. One of the more challenging types of changes to document in mobile apps are those made to the graphical user interface (GUI) due to its abstract, pixel-based representation. In this paper, we present a fully automated approach, called GCAT, for detecting and summarizing GUI changes during the evolution of mobile apps. GCAT leverages computer vision techniques and natural language generation to accurately and concisely summarize changes made to the GUI of a mobile app between successive commits or releases. We evaluate the performance of our approach in terms of its precision and recall in detecting GUI changes compared to developer specified changes, and investigate the utility of the generated change reports in a controlled user study. Our results indicate that GCAT is capable of accurately detecting and classifying GUI changes - outperforming developers - while providing useful documentation. ![]() ![]() |
|
Qu, Yu |
![]() Yu Qu, Ting Liu, Jianlei Chi, Yangxu Jin, Di Cui, Ancheng He, and Qinghua Zheng (Xi'an Jiaotong University, China) Network measures have been proved to be useful in predicting software defects. Leveraging the dependency relationships between software modules, network measures can capture various structural features of software systems. However, existing studies have relied on user-defined network measures (e.g., degree statistics or centrality metrics), which are inflexible and require high computation cost, to describe the structural features. In this paper, we propose a new method called node2defect which uses a newly proposed network embedding technique, node2vec, to automatically learn to encode dependency network structure into low-dimensional vector spaces to improve software defect prediction. Specifically, we firstly construct a program's Class Dependency Network. Then node2vec is used to automatically learn structural features of the network. After that, we combine the learned features with traditional software engineering features, for accurate defect prediction. We evaluate our method on 15 open source programs. The experimental results show that in average, node2defect improves the state-of-the-art approach by 9.15% in terms of F-measure. ![]() ![]() |
|
Ramaswamy, Srini |
![]() Ran Mo, Will Snipes, Yuanfang Cai, Srini Ramaswamy, Rick Kazman, and Martin Naedele (Central China Normal University, China; ABB Corporate Research, USA; Drexel University, USA; ABB, USA; Carnegie Mellon University, USA; University of Hawaii, USA; ABB, Switzerland) In this paper, we report our experiences of applying three complementary automated software architecture analysis techniques, supported by a tool suite, called DV8, to 8 industrial projects within a large company. DV8 includes two state-of-the-art architecture-level maintainability metrics—Decoupling Level and Propagation Cost, an architecture flaw detection tool, and an architecture root detection tool. We collected development process data from the project teams as input to these tools, reported the results back to the practitioners, and followed up with telephone conferences and interviews. Our experiences revealed that the metrics scores, quantitative debt analysis, and architecture flaw visualization can effectively bridge the gap between management and development, help them decide if, when, and where to refactor. In particular, the metrics scores, compared against industrial benchmarks, faithfully reflected the practitioners’ intuitions about the maintainability of their projects, and enabled them to better understand the maintainability relative to other projects internal to their company, and to other industrial products. The automatically detected architecture flaws and roots enabled the practitioners to precisely pinpoint, visualize, and quantify the “hotspots" within the systems that are responsible for high maintenance costs. Except for the two smallest projects for which both architecture metrics indicated high maintainability, all other projects are planning or have already begun refactorings to address the problems detected by our analyses. We are working on further automating the tool chain, and transforming the analysis suite into deployable services accessible by all projects within the company. ![]() ![]() |
|
Ranjan, Priya |
![]() Mikaela Cashman, Myra B. Cohen, Priya Ranjan, and Robert W. Cottingham (University of Nebraska-Lincoln, USA; Oak Ridge National Laboratory, USA) The bioinformatics software domain contains thousands of applications for automating tasks such as the pairwise alignment of DNA sequences, building and reasoning about metabolic models or simulating growth of an organism. Its end users range from sophisticated developers to those with little computational experience. In response to their needs, developers provide many options to customize the way their algorithms are tuned. Yet there is little or no automated help for the user in determining the consequences or impact of the options they choose. In this paper we describe our experience working with configurable bioinformatics tools. We find limited documentation and help for combining and selecting options along with variation in both functionality and performance. We also find previously undetected faults. We summarize our findings with a set of lessons learned, and present a roadmap for creating automated techniques to interact with bioinformatics software. We believe these will generalize to other types of scientific software. ![]() ![]() ![]() |
|
Ray, Baishakhi |
![]() Nipun Arora, Jonathan Bell, Franjo Ivančić, Gail Kaiser, and Baishakhi Ray (Dropbox, USA; George Mason University, USA; Google, USA; Columbia University, USA) Short time-to-localize and time-to-fix for production bugs is extremely important for any 24x7 service-oriented application (SOA). Debugging buggy behavior in deployed applications is hard, as it requires careful reproduction of a similar environment and workload. Prior approaches for automatically reproducing production failures do not scale to large SOA systems. Our key insight is that for many failures in SOA systems (e.g., many semantic and performance bugs), a failure can automatically be reproduced solely by relaying network packets to replicas of suspect services, an insight that we validated through a manual study of 16 real bugs across five different systems. This paper presents Parikshan, an application monitoring framework that leverages user-space virtualization and network proxy technologies to provide a sandbox “debug” environment. In this “debug” environment, developers are free to attach debuggers and analysis tools without impacting performance or correctness of the production environment. In comparison to existing monitoring solutions that can slow down production applications, Parikshan allows application monitoring at significantly lower overhead. ![]() ![]() |
|
Regis, Germán |
![]() Renzo Degiovanni, Facundo Molina, Germán Regis, and Nazareno Aguirre (National University of Río Cuarto, Argentina; CONICET, Argentina) Goal-conflict analysis has been widely used as an abstraction for risk analysis in goal-oriented requirements engineering approaches. In this context, where the expected behaviour of the system-to-be is captured in terms of domain properties and goals, identifying combinations of circumstances that may make the goals diverge, i.e., not to be satisfied as a whole, is of most importance. Various approaches have been proposed in order to automatically identify boundary conditions, i.e., formulas capturing goal-divergent situations, but they either apply only to some specific goal expressions, or are affected by scalability issues that make them applicable only to relatively small specifications. In this paper, we present a novel approach to automatically identify boundary conditions, using evolutionary computation. More precisely, we develop a genetic algorithm that, given the LTL formulation of the domain properties and the goals, it searches for formulas that capture divergences in the specification. We exploit a modern LTL satisfiability checker to successfully guide our genetic algorithm to the solutions. We assess our technique on a set of case studies, and show that our genetic algorithm is able to find boundary conditions that cannot be generated by related approaches, and is able to efficiently scale to LTL specifications that other approaches are unable to deal with. ![]() ![]() ![]() |
|
Reif, Michael |
![]() Dominik Helm, Florian Kübler, Michael Eichberg, Michael Reif, and Mira Mezini (TU Darmstadt, Germany) Analyzing methods in object-oriented programs whether they are side-effect free and also deterministic, i.e., mathematically pure, has been the target of extensive research. Identifying such methods helps to find code smells and security related issues, and also helps analyses detecting concurrency bugs. Pure methods are also used by formal verification approaches as the foundations for specifications and proving the pureness is necessary to ensure correct specifications. However, so far no common terminology exists which describes the purity of methods. Furthermore, some terms (e.g., pure or side-effect free) are also used inconsistently. Further, all current approaches only report selected purity information making them only suitable for a smaller subset of the potential use cases. In this paper, we present a fine-grained unified lattice model which puts the purity levels found in the literature into relation and which adds a new level that generalizes existing definitions. We have also implemented a scalable, modularized purity analysis which produces significantly more precise results for real-world programs than the best-performing related work. The analysis shows that all defined levels are found in real-world projects. ![]() ![]() |
|
Reiss, Steven P. |
![]() Steven P. Reiss, Qi Xin, and Jeff Huang (Brown University, USA) We introduce a tool within the Code Bubbles development environment that allows for continuous execution as the programmer edits. The tool, SEEDE, shows both the intermediate and final results of execution in terms of variables, control and data flow, output, and graphics. These results are updated as the user edits. The tool can be used to help the user write new code or to find and fix bugs. The tool is explicitly designed to let the user quickly explore the execution of a method along with all the code it invokes, possibly while writing or modifying the code. The user can start continuous execution either at a breakpoint or for a test case. This paper describes the tool, its implementation, and its user interface. It presents an initial user study of the tool demonstrating its potential utility. ![]() ![]() |
|
Rodriguez, Rodney |
![]() Foyzul Hassan, Rodney Rodriguez, and Xiaoyin Wang (University of Texas at San Antonio, USA) Dockerfiles are configuration files of docker images which package all dependencies of a software to enable convenient software deployment and porting. In other words, dockerfiles list all environment assumptions of a software application's build and / or execution, so they need to be frequently updated when the environment assumptions change during fast software evolution. In this paper, we propose RUDSEA, a novel approach to recommend updates of dockerfiles to developers based on analyzing changes on software environment assumptions and their impacts. Our evaluation on 1,199 real-world instruction updates shows that RUDSEA can recommend correct update locations for 78.5% of the updates, and correct code changes for 44.1% of the updates. ![]() ![]() |
|
Roohitavaf, Mohammad |
![]() Mohammad Roohitavaf and Sandeep Kulkarni (Michigan State University, USA) We present our framework DKVF that enables one to quickly prototype and evaluate new consistency protocols for key-value stores. DKVF is designed based on the separation of concerns in creating distributed data stores. This separation of concerns allows the designers of consistency protocols to only focus on the high-level consistency protocols which gives them the opportunity to quickly deploy a consistency protocol and evaluate its performance. Moreover, the loose coupling of the different components allows us to easily change different components (e.g. storage engine) of an implementation. We demonstrate DKVF by implementing four existing protocols --eventual consistency, COPS, GentleRain and CausalSpartan-- with it. The implementation of these protocols was very convenient with DKVF, as it only required to write a piece of code for the consistency component that is very close to the pseudocode of the original papers. Hence, it was possible to achieve this in just 1-2 days per protocol. DKVF also comes with a toolset that facilitates running clusters and performing experiments. Tutorial video: https://www.youtube.com/watch?v=MFJQzsJkwfc&list=PLErtSVEHsnBJvoQQI6iqGn61oNrUVVuST ![]() ![]() ![]() ![]() |
|
Rossi, Gustavo |
![]() Julián Grigera, Juan Cruz Gardey, Alejandra Garrido, and Gustavo Rossi (National University of La Plata, Argentina; CIC, Argentina; CONICET, Argentina) Client-Side Web Refactorings (CSWRs) allow improving interaction aspects of web applications by applying changes to the user interface without altering the code base, even in production settings. However, developers are not always willing, or even allowed to apply external adaptations to their applications’ user interface. Besides, CSWRs do not guarantee improvements in all contexts, so it may be unwise to install them in a production version. We propose a tool that allows creating private versions of a running web application almost automatically. Using this tool, developers or usability experts can easily combine CSWRs to create alternative versions of web applications, without the need of creating a cloned sandbox environment for each version. This yields many uses, such as quickly setting up user tests, showing live alternatives to Product Owners, and even performing A/B testing. The tool is built on top of Kobold, a service that allows applying CSWRs to fix usability smells. Kobold with versioning is available at: https://autorefactoring.lifia.info.unlp.edu.ar. A screencast of the tool is available at https://youtu.be/LVc3BOtVP3I. ![]() ![]() ![]() |
|
Rouvoy, Romain |
![]() Sarra Habchi, Xavier Blanc, and Romain Rouvoy (Inria, France; University of Lille, France; University of Bordeaux, France) With millions of applications (apps) distributed through mobile markets, engaging and retaining end-users challenge Android developers to deliver a nearly perfect user experience. As mobile apps run in resource-limited devices, performance is a critical criterion for the quality of experience. Therefore, developers are expected to pay much attention to limit performance bad practices. On the one hand, many studies already identified such performance bad practices and showed that they can heavily impact app performance. Hence, many static analysers, a.k.a. linters, have been proposed to detect and fix these bad practices. On the other hand, other studies have shown that Android developers tend to deal with performance reactively and they rarely build on linters to detect and fix performance bad practices. In this paper, we therefore perform a qualitative study to investigate this gap between research and development community. In particular, we performed interviews with 14 experienced Android developers to identify the perceived benefits and constraints of using linters to identify performance bad practices in Android apps. Our observations can have a direct impact on developers and the research community. Specifically, we describe why and how developers leverage static source code analysers to improve the performance of their apps. On top of that, we bring to light important challenges faced by developers when it comes to adopting static analysis for performance purposes. ![]() ![]() ![]() ![]() |
|
Roychoudhury, Abhik |
![]() Xiang Gao, Shin Hwei Tan, Zhen Dong, and Abhik Roychoudhury (National University of Singapore, Singapore; Southern University of Science and Technology, China) Symbolic execution of Android applications is challenging as it involves either building a customized VM for Android or modeling the Android libraries. Since the Android Runtime evolves from one version to another, building a high-fidelity symbolic execution engine involves modeling the effect of the libraries and their evolved versions. Without simulating the behavior of Android libraries, path divergence may occur due to constraint loss when the symbolic values flow into Android framework and these values later affect the subsequent path taken. Previous works such as JPF-Android have relied on the modeling of execution environment such as libraries. In this work, we build a dynamic symbolic execution engine for Android apps, without any manual modeling of execution environment. Environment (or library) dependent control flow decisions in the application will trigger an on-demand program synthesis step to automatically deduce a representation of the library.This representation is refined on-the-fly by running the corresponding library multiple times.The overarching goal of the refinement is to enhance behavioral coverage and to alleviate the path divergence problem during symbolic execution. Moreover, our library synthesis can be made context-specific. Compared to traditional synthesis approaches which aim to synthesize the complete library code, our context-specific synthesis engine can generate more precise expressions for a given context. The evaluation of our dynamic symbolic execution engine, built on top of JDART, shows that the library models obtained from program synthesis are often more accurate than the semi-manual models in JPF-Android. Furthermore, our symbolic execution engine could reach more branch targets, as compared to using the JPF-Android models. ![]() ![]() |
|
Ruan, Wenjie |
![]() Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening (University of Oxford, UK; University of Liverpool, UK) Concolic testing combines program execution and symbolic analysis to explore the execution paths of a software program. In this paper, we develop the first concolic testing approach for Deep Neural Networks (DNNs). More specifically, we utilise quantified linear arithmetic over rationals to express test requirements that have been studied in the literature, and then develop a coherent method to perform concolic testing with the aim of better coverage. Our experimental results show the effectiveness of the concolic testing approach in both achieving high coverage and finding adversarial examples. ![]() ![]() |
|
Rubin, Julia |
![]() Federico Mora, Yi Li, Julia Rubin, and Marsha Chechik (University of Toronto, Canada; University of British Columbia, Canada) Software is often built by integrating components created by different teams or even different organizations. With little understanding of changes in dependent components, it is challenging to maintain correctness and robustness of the entire system. In this paper, we investigate the effect of component changes on the behavior of their clients. We observe that changes in a component are often irrelevant to a particular client and thus can be adopted without any delays or negative effects. Following this observation, we formulate the notion of client-specific equivalence checking (CSE) and develop an automated technique optimized for checking such equivalence. We evaluate our technique on a set of benchmarks, including those from the existing literature on equivalence checking, and show its applicability and effectiveness. ![]() ![]() |
|
Russo, Barbara |
![]() Andrea Janes, Michael Mairegger, and Barbara Russo (Free University of Bolzano, Italy) As a developer, it is often complex to foresee the impact of changes in source code on usage, e.g., it is time-consuming to find out all components that will be impacted by a change or estimate the impact on the usability of a failing piece of code. It is therefore hard to decide how much effort in quality assurance is justifiable to obtain the desired business goals. In this paper, to reduce the difficulty for developers to understand the importance of source code, we propose an automated way to provide this information to developers as they are working on a given piece of code. As a proof-of-concept, we developed a plug-in for Microsoft Visual Studio Code that informs about the importance of source code methods based on the frequency of usage by the end-users of the developed software. The plug-in aims to increase the awareness developers have about the importance of source code in an unobtrusive way, helping them to prioritize their effort to quality assurance, technical excellence, and usability. code_call_lens can be downloaded from GitHub at https://github.com/xxMUROxx/vscode.code_call_lens. ![]() ![]() |
|
Sainani, Abhishek |
![]() Manasi Patwardhan, Abhishek Sainani, Richa Sharma, Shirish Karande, and Smita Ghaisas (TCS Research, India) Compliant software is a critical need of all modern businesses. Disambiguating regulations to derive requirements is therefore an important software engineering activity. Regulations however are ridden with ambiguities that make their comprehension a challenge, seemingly surmountable only by legal experts. Since legal experts' involvement in every project is expensive, approaches to automate the disambiguation need to be explored. These approaches however require a large amount of annotated data. Collecting data exclusively from experts is not a scalable and affordable solution. In this paper, we present the results of a crowd sourcing experiment to collect annotations on ambiguities in regulations from professional software engineers. We discuss an approach to automate the arduous and critical step of identifying ground truth labels by employing crowd consensus using Expectation Maximization (EM). We demonstrate that the annotations reaching a consensus match those of experts with an accuracy of 87%. ![]() ![]() |
|
Saraiva, João |
![]() Rui Pereira, Pedro Simão, Jácome Cunha, and João Saraiva (INESC TEC, Portugal; University of Minho, Portugal; NOVA-LINCS, Portugal; Universidade Nova Lisboa, Portugal) Software developers are more and more eager to understand their code’s energy performance. However, even with such knowledge it is di cult to know how to improve the code. Indeed, little tool support exists to understand the energy consumption pro le of a software system and to eventually (automatically) improve its code. In this paper we present a tool termed jStanley which automatically nds collections in Java programs that can be replaced by others with a positive impact on the energy consumption as well as on the execution time. In seconds, developers obtain information about energy-eager collection usage. jStanley will further suggest alternative collections to improve the code, making it use less time, energy, or a combination of both. The preliminary evaluation we ran using jStanley shows energy gains between 2% and 17%, and a reduction in execution time between 2% and 13%. A video can be seen at https://greensoftwarelab.github.io/jStanley. ![]() ![]() ![]() |
|
Scalabrino, Simone |
![]() Simone Scalabrino, Giovanni Grano, Dario Di Nucci, Michele Guerra, Andrea De Lucia, Harald C. Gall, and Rocco Oliveto (University of Molise, Italy; University of Zurich, Switzerland; Vrije Universiteit Brussel, Belgium; University of Salerno, Italy) Automatically generating test cases plays an important role to reduce the time spent by developers during the testing phase. In last years, several approaches have been proposed to tackle such a problem: amongst others, search-based techniques have been shown to be particularly promising. In this paper we describe Ocelot, a search-based tool for the automatic generation of test cases in C. Ocelot allows practitioners to write skeletons of test cases for their programs and researchers to easily implement and experiment new approaches for automatic test-data generation. We show that Ocelot achieves a higher coverage compared to a competitive tool in 81% of the cases. Ocelot is publicly available to support both researchers and practitioners. ![]() ![]() |
|
Sen, Koushik |
![]() Caroline Lemieux and Koushik Sen (University of California at Berkeley, USA) In recent years, fuzz testing has proven itself to be one of the most effective techniques for finding correctness bugs and security vulnerabilities in practice. One particular fuzz testing tool, American Fuzzy Lop (AFL), has become popular thanks to its ease-of-use and bug-finding power. However, AFL remains limited in the bugs it can find since it simply does not cover large regions of code. If it does not cover parts of the code, it will not find bugs there. We propose a two-pronged approach to increase the coverage achieved by AFL. First, the approach automatically identifies branches exercised by few AFL-produced inputs (rare branches), which often guard code that is empirically hard to cover by naively mutating inputs. The second part of the approach is a novel mutation mask creation algorithm, which allows mutations to be biased towards producing inputs hitting a given rare branch. This mask is dynamically computed during fuzz testing and can be adapted to other testing targets. We implement this approach on top of AFL in a tool named FairFuzz. We conduct evaluation on real-world programs against state-of-the-art versions of AFL. We find that on these programs FairFuzz achieves high branch coverage at a faster rate that state-of-the-art versions of AFL. In addition, on programs with nested conditional structure, it achieves sustained increases in branch coverage after 24 hours (average 10.6% increase). In qualitative analysis, we find that FairFuzz has an increased capacity to automatically discover keywords. ![]() ![]() |
|
Sferruzza, David |
![]() |