ISSTA 2022

31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2022), July 18–22, 2022, Virtual, South Korea

ISSTA 2022 – Preliminary Table of Contents



Title Page

Message from the Chairs



Accepted Papers

Unicorn: Detect Runtime Errors in Time-Series Databases with Hybrid Input Synthesis
Zhiyong Wu, Jie Liang, Mingzhe Wang, Chijin Zhou, and Yu Jiang
(Tsinghua University, China; ShuimuYulin, China)
The ubiquitous use of time-series databases in the safety-critical Internet of Things domain demands strict security and correctness. One successful approach in database bug detection is fuzzing, where hundreds of bugs have been detected automatically in relational databases. However, it cannot be easily applied to time-series databases: the bulk of time-series logic is unreachable because of mismatched query specifications, and serious bugs are undetectable because of implicitly handled exceptions. In this paper, we propose Unicorn to secure time-series databases with automated fuzzing. First, we design hybrid input synthesis to generate high-quality queries which not only cover time-series features but also ensure grammar correctness. Then, Unicorn uses proactive exception detection to discover minuscule-symptom bugs which hide behind implicit exception handling. With the specialized design oriented to time-series databases, Unicorn outperforms the state-of-the-art database fuzzers in terms of coverage and bugs. Specifically, Unicorn outperforms SQLsmith and SQLancer on widely used time-series databases IoTDB, KairosDB, TimescaleDB, TDEngine, QuestDB, and GridDB in the number of basic blocks by 21%-199% and 34%-693%, respectively. More importantly, Unicorn has discovered 42 previously unknown bugs.

Understanding Device Integration Bugs in Smart Home System
Tao Wang, Kangkang Zhang, Wei Chen, Wensheng Dou, Jiaxin Zhu, Jun Wei, and Tao Huang
(Institute of Software at Chinese Academy of Sciences, China)

A Large-Scale Empirical Analysis of the Vulnerabilities Introduced by Third-Party Components in IoT Firmware
Binbin Zhao, Shouling Ji, Jiacheng Xu, Yuan Tian, Qiuyang Wei, Qinying Wang, Chenyang Lyu, Xuhong Zhang, Changting Lin, Jingzheng Wu, and Raheem Beyah
(Georgia Institute of Technology, USA; Zhejiang University, China; University of Virginia, USA; Institute of Software at Chinese Academy of Sciences, China)
As the core of IoT devices, firmware is undoubtedly vital. Currently, the development of IoT firmware heavily depends on third-party components (TPCs), which significantly improves development efficiency and reduces cost. Nevertheless, TPCs are not necessarily secure, and vulnerabilities in TPCs in turn affect the security of IoT firmware. Existing work pays little attention to the vulnerabilities introduced by TPCs, and we still lack a comprehensive understanding of the security impact of TPC vulnerabilities on firmware. To fill this knowledge gap, we design and implement FirmSec, which leverages syntactical features and control-flow graph features to detect TPCs in firmware at the version level, and then recognizes the corresponding vulnerabilities. Based on FirmSec, we present the first large-scale analysis of the usage of TPCs and the corresponding vulnerabilities in firmware. More specifically, we perform an analysis on 34,136 firmware images, including 11,086 publicly accessible firmware images and 23,050 private firmware images from TSmart. We successfully detect 584 TPCs and identify 128,757 vulnerabilities caused by 429 CVEs. Our in-depth analysis reveals the diversity of security issues for different kinds of firmware from various vendors, and discovers that some well-known vulnerabilities are still deeply rooted in many firmware images. We also find that the TPCs used in firmware lag behind by five years on average. Besides, we explore the geographical distribution of vulnerable devices, and find that the security situation of devices in several regions, e.g., South Korea and China, is more severe than in others. Further analysis shows that 2,478 commercial firmware images have potentially violated GPL/AGPL licensing terms. To facilitate future research, we will open-source our dataset.

jTrans: Jump-Aware Transformer for Binary Code Similarity Detection
Hao Wang, Wenjie Qu, Gilad Katz, Wenyu Zhu, Zeyu Gao, Han Qiu, Jianwei Zhuge, and Chao Zhang
(Tsinghua University, China; Huazhong University of Science and Technology, China; Ben-Gurion University of the Negev, Israel; University of Science and Technology of China, China; Beijing National Research Center for Information Science and Technology, China)
Binary code similarity detection (BCSD) has important applications in various fields such as vulnerability detection, software component analysis, and reverse engineering. Recent studies have shown that deep neural networks (DNNs) can comprehend instructions or control-flow graphs (CFGs) of binary code and support BCSD. In this study, we propose a novel Transformer-based approach, namely jTrans, to learn representations of binary code. It is the first solution that embeds control-flow information of binary code into Transformer-based language models, by using a novel jump-aware representation of the analyzed binaries and a newly designed pre-training task. Additionally, we release to the community a newly created large dataset of binaries, BinaryCorp, which is the most diverse to date. Evaluation results show that jTrans outperforms state-of-the-art (SOTA) approaches on this more challenging dataset by 30.5% (i.e., from 32.0% to 62.5%). In a real-world task of known vulnerability searching, jTrans achieves a recall that is 2X higher than existing SOTA baselines.

Patch Correctness Assessment in Automated Program Repair Based on the Impact of Patches on Production and Test Code
Ali Ghanbari and Andrian Marcus
(Iowa State University, USA; University of Texas at Dallas, USA)
Test-based generate-and-validate automated program repair (APR) systems often generate many patches that pass the test suite without fixing the bug. The generated patches must be manually inspected by the developers, so previous research proposed various techniques for automatic correctness assessment of APR-generated patches. Among them, dynamic patch correctness assessment techniques rely on the assumption that, when running the originally passing test cases, the correct patches will not alter the program behavior in a significant way, e.g., removing the code implementing correct functionality of the program. In this paper, we propose and evaluate a novel technique, named Shibboleth, for automatic correctness assessment of the patches generated by test-based generate-and-validate APR systems. Unlike existing works, the impact of the patches is captured along three complementary facets, allowing more effective patch correctness assessment. Specifically, we measure the impact of patches on both production code (via syntactic and semantic similarity) and test code (via code coverage of passing tests) to separate the patches that result in similar programs and that do not delete desired program elements. Shibboleth assesses the correctness of patches via both ranking and classification. We evaluated Shibboleth on 1,871 patches, generated by 29 Java-based APR systems for Defects4J programs. The technique outperforms state-of-the-art ranking and classification techniques. Specifically, in our ranking data set, in 43% (66%) of the cases, Shibboleth ranks the correct patch in top-1 (top-2) positions, and in classification mode applied on our classification data set, it achieves an accuracy and F1-score of 0.887 and 0.852, respectively.
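The three facets above can be pictured as a combined ranking signal. The following is a purely illustrative sketch: the weighted-average aggregation and the `patch_score` helper are hypothetical, not Shibboleth's actual ranking function.

```python
def patch_score(syntactic_sim, semantic_sim, coverage_sim, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted average of three per-patch facets, each in [0, 1]:
    production-code syntactic similarity, semantic similarity, and
    passing-test coverage similarity. Higher = more likely correct."""
    w1, w2, w3 = weights
    return (w1 * syntactic_sim + w2 * semantic_sim + w3 * coverage_sim) / (w1 + w2 + w3)

def rank_patches(patches):
    """Rank candidate (name, (syn, sem, cov)) patches by descending score,
    so patches that preserve behavior and program elements come first."""
    return sorted(patches, key=lambda p: patch_score(*p[1]), reverse=True)
```

A patch that deletes desired functionality would score low on all three facets and sink to the bottom of the ranking.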

ATR: Template-Based Repair for Alloy Specifications
Guolong Zheng, ThanhVu Nguyen, Simón Gutiérrez Brida, Germán Regis, Nazareno Aguirre, Marcelo F. Frias, and Hamid Bagheri
(University of Nebraska-Lincoln, USA; George Mason University, USA; University of Rio Cuarto, Argentina; CONICET, Argentina; Instituto Tecnológico de Buenos Aires, Argentina)
Automatic Program Repair (APR) is a practical research topic that studies techniques to automatically repair programs to fix bugs. Most existing APR techniques are designed for imperative programming languages, such as C and Java, and rely on analyzing correct and incorrect executions of programs to identify and repair suspicious statements.
We introduce a new APR approach for software specifications written in the Alloy declarative language, where specifications are not “executed”, but rather converted into logical formulas and analyzed using backend constraint solvers, to find specification instances and counterexamples to assertions. We present ATR, a technique that takes as input an Alloy specification with some violated assertion and returns a repaired specification that satisfies the assertion. The key ideas are (i) analyzing the differences between counterexamples that do not satisfy the assertion and instances that do satisfy the assertion to guide the repair and (ii) generating repair candidates from specific templates and pruning the space of repair candidates using the counterexamples and satisfying instances. Experimental results using existing large Alloy benchmarks show that ATR is effective in generating difficult repairs. ATR repairs 66.3% of 1,974 faulty specifications, including specification repairs that cannot be handled by existing Alloy repair techniques. ATR and all benchmarks are open-source and available in the following GitHub repository:

FDG: A Precise Measurement of Fault Diagnosability Gain of Test Cases
Gabin An and Shin Yoo
(KAIST, South Korea)
The performance of many Fault Localisation (FL) techniques directly depends on the quality of the used test suites. Consequently, it is extremely useful to be able to precisely measure how much diagnostic power each test case can introduce when added to a test suite used for FL. Such a measure can help us not only to prioritise and select test cases to be used for FL, but also to effectively augment test suites that are too weak to be used with FL techniques. We propose FDG, a new measure of Fault Diagnosability Gain for individual test cases. The design of FDG is based on our analysis of existing metrics that are designed to prioritise test cases for better FL. Unlike other metrics, FDG exploits the ongoing FL results to emphasise the parts of the program for which more information is needed. Our evaluation of FDG with Defects4J shows that it can successfully help the augmentation of test suites for better FL. When given only a few failing test cases (2.3 test cases on average), FDG can effectively augment the given test suite by prioritising the test cases generated automatically by EvoSuite: the augmentation can improve the acc@1 and acc@10 of the FL results by 11.6x and 2.2x on average, after requiring only ten human judgements on the correctness of the assertions EvoSuite generates.
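To make the idea of diagnostic power concrete, here is a deliberately simplified proxy: score a candidate test by how much of the current fault-localisation suspicion mass it covers. This is an illustration only; `diagnosability_gain` is a hypothetical helper, not the FDG measure defined in the paper.

```python
def diagnosability_gain(candidate_cov, suspiciousness):
    """Hypothetical proxy for a test's diagnostic value: the fraction of
    total FL suspiciousness carried by the program elements the candidate
    test covers. FDG itself is more refined; this just shows the shape."""
    total = sum(suspiciousness.values()) or 1.0
    covered = sum(s for elem, s in suspiciousness.items() if elem in candidate_cov)
    return covered / total

# Elements a, b, c with FL suspiciousness scores; test t1 exercises the
# most suspicious code, so it should be prioritized over t2.
susp = {"a": 0.8, "b": 0.1, "c": 0.1}
t1, t2 = {"a", "b"}, {"c"}
```

A prioritizer would then pick, at each step, the unlabeled test with the highest gain given the ongoing FL results.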

Path-Sensitive Code Embedding via Contrastive Learning for Software Vulnerability Detection
Xiao Cheng, Guanqin Zhang, Haoyu Wang, and Yulei Sui
(University of Technology Sydney, Australia; Huazhong University of Science and Technology, China)
Machine learning and its promising branch deep learning have shown success in a wide range of application domains. Recently, much effort has been expended on applying deep learning techniques (e.g., graph neural networks) to static vulnerability detection as an alternative to conventional bug detection methods. To obtain the structural information of code, current learning approaches typically abstract a program in the form of graphs (e.g., data-flow graphs, abstract syntax trees), and then train an underlying classification model based on the (sub)graphs of safe and vulnerable code fragments for vulnerability prediction. However, these models are still insufficient for precise bug detection, because the objective of these models is to produce classification results rather than comprehending the semantics of vulnerabilities, e.g., pinpoint bug triggering paths, which are essential for static bug detection.
This paper presents ContraFlow, a selective yet precise contrastive value-flow embedding approach to statically detect software vulnerabilities. The novelty of ContraFlow lies in selecting and preserving feasible value-flow (aka program dependence) paths through a pretrained path embedding model using self-supervised contrastive learning, thus significantly reducing the amount of labeled data required for training expensive downstream models for path-based vulnerability detection. We evaluated ContraFlow on 288 real-world projects by comparing it against eight recent learning-based approaches. ContraFlow outperforms these eight baselines by up to 334.1%, 317.9%, and 58.3% in informedness, markedness, and F1 score, and achieves up to 450.0%, 192.3%, and 450.0% improvements in mean statement recall, mean statement precision, and mean IoU, respectively, in locating buggy statements.
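Self-supervised contrastive pretraining of this kind typically optimizes an InfoNCE-style objective: pull an anchor path embedding toward a positive (e.g., an equivalent path) and push it away from negatives. The sketch below is the generic loss, not ContraFlow's exact formulation.

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE contrastive loss over embedding similarity:
    -log softmax of the anchor-positive similarity against all pairs.
    Lower loss = anchor is closer to its positive than to negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # log-sum-exp for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)
```

Minimizing this loss over many sampled path pairs yields embeddings in which feasible, semantically similar value-flow paths cluster together.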

Finding Permission Bugs in Smart Contracts with Role Mining
Ye Liu, Yi Li, Shang-Wei Lin, and Cyrille Artho
(Nanyang Technological University, Singapore; KTH, Sweden)

ε-Weakened Robustness of Deep Neural Networks
Pei Huang, Yuting Yang, Minghao Liu, Fuqi Jia, Feifei Ma, and Jian Zhang
(Institute of Software at Chinese Academy of Sciences, China; Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)

A Large-Scale Study of Usability Criteria Addressed by Static Analysis Tools
Marcus Nachtigall, Michael Schlichtig, and Eric Bodden
(University of Paderborn, Germany; Fraunhofer IEM, Germany)

Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning (Replicability Study)
Michael Weiss and Paolo Tonella
(USI Lugano, Switzerland)
Test Input Prioritizers (TIP) for Deep Neural Networks (DNN) are an important technique to handle the typically very large test datasets efficiently, saving computation and labelling costs. This is particularly true for large-scale, deployed systems, where inputs observed in production are recorded to serve as potential test or training data for the next versions of the system. Feng et al. propose DeepGini, a very fast and simple TIP, and show that it outperforms more elaborate techniques such as neuron and surprise coverage. In a large-scale study (4 case studies, 8 test datasets, 32,200 trained models) we verify their findings. However, we also find that other comparable or even simpler baselines from the field of uncertainty quantification, such as the predicted softmax likelihood or the entropy of the predicted softmax likelihoods, perform equally well as DeepGini.
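DeepGini's score is simply the Gini impurity of the model's softmax output, which is why it is so cheap to compute. A minimal sketch of the idea (an illustration, not the authors' code):

```python
def gini(probs):
    """DeepGini-style score: 1 - sum(p_i^2) over the softmax output.
    Higher = more uncertain prediction, so the input is ranked earlier
    for labelling/testing."""
    return 1.0 - sum(p * p for p in probs)

def prioritize(batch):
    """Rank (input_id, softmax_probs) pairs by descending impurity."""
    return sorted(batch, key=lambda item: gini(item[1]), reverse=True)

# A confident prediction scores near 0; a uniform one scores highest.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
```

The entropy and maximum-softmax baselines mentioned above are drop-in replacements for `gini` in the same ranking loop.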

SnapFuzz: High-Throughput Fuzzing of Network Applications
Anastasios Andronidis and Cristian Cadar
(Imperial College London, UK)

Deadlock Prediction via Generalized Dependency
Jinpeng Zhou, Hanmei Yang, John Lange, and Tongping Liu
(University of Pittsburgh, USA; University of Massachusetts at Amherst, USA; Oak Ridge National Laboratory, USA)

eTainter: Detecting Gas-Related Vulnerabilities in Smart Contracts
Asem Ghaleb, Julia Rubin, and Karthik Pattabiraman
(University of British Columbia, Canada)
The execution of smart contracts on the Ethereum blockchain consumes gas paid for by users submitting contracts' invocation requests. A contract execution proceeds as long as the users dedicate enough gas, within the limit set by Ethereum. If insufficient gas is provided, the contract execution halts and changes made during execution get reverted. Unfortunately, contracts may contain code patterns that increase execution cost, causing the contracts to run out of gas. These patterns can be manipulated by malicious attackers to induce unwanted behavior in the targeted victim contracts, e.g., Denial-of-Service (DoS) attacks. We call these gas-related vulnerabilities. We propose eTainter, a static analyzer for detecting gas-related vulnerabilities based on taint tracking in the bytecode of smart contracts. We evaluate eTainter by comparing it with the prior work, MadMax, on a dataset of annotated contracts. The results show that eTainter outperforms MadMax in both precision and recall, and that eTainter has a precision of 90% based on manual inspection. We also use eTainter to perform large-scale analysis of 60,612 real-world contracts on the Ethereum blockchain. We find that gas-related vulnerabilities exist in 2,763 of these contracts, and that eTainter analyzes a contract in eight seconds, on average.
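Bytecode-level taint tracking of the kind described above propagates a "user-controlled" mark through each instruction's operands. The toy tracker below conveys the spirit on a straight-line trace; it is a sketch of the general technique, not eTainter itself, and the instruction names are illustrative.

```python
def track_taint(trace, tainted_inputs):
    """Propagate taint through (dst, op, srcs) instructions:
    dst becomes tainted iff any source operand is tainted."""
    tainted = set(tainted_inputs)
    for dst, op, srcs in trace:
        if any(s in tainted for s in srcs):
            tainted.add(dst)
        else:
            tainted.discard(dst)
    return tainted

# User calldata flows into a value n that might bound an unbounded loop
# (a gas-related sink), while m stays constant-derived and untainted.
trace = [
    ("x", "calldataload", ["input"]),
    ("n", "add", ["x", "const1"]),
    ("m", "mul", ["const2", "const3"]),
]
```

A gas-vulnerability report would then fire when a tainted value like `n` reaches a gas-sensitive sink such as a loop bound or an array-growth operation.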

TELL: Log Level Suggestions via Modeling Multi-level Code Block Information
Jiahao Liu, Jun Zeng, Xiang Wang, Kaihang Ji, and Zhenkai Liang
(National University of Singapore, Singapore; University of Science and Technology of China, China)
Developers insert logging statements into source code to monitor system execution, which forms the basis for software debugging and maintenance. To distinguish diverse runtime information, each log message is assigned a verbosity level (e.g., trace and error). However, choosing an appropriate verbosity level is a challenging and error-prone task due to the lack of specifications for log level usage. Prior solutions aim to suggest log levels based on the code block in which a logging statement resides (i.e., intra-block features). Such suggestions, however, do not consider information from surrounding blocks (i.e., inter-block features), which also plays an important role in revealing logging characteristics.
To address this issue, we combine multiple levels of code block information (i.e., intra-block and inter-block features) into a joint graph structure called Flow of Abstract Syntax Tree (FAST). To explicitly exploit multi-level block features, we design a new neural architecture, Hierarchical Block Graph Network (HBGN), on the FAST. In particular, it leverages graph neural networks to encode both the intra-block and inter-block features into code block representations and guide log level suggestions. We implement a prototype system, TeLL, and evaluate its effectiveness on nine large-scale software systems. Experimental results showcase TeLL's advantage in predicting log levels over the state-of-the-art approaches.

An Empirical Study on the Effectiveness of Static C Code Analyzers for Vulnerability Detection
Stephan Lipp, Sebastian Banescu, and Alexander Pretschner
(TU Munich, Germany)

Almost Correct Invariants: Synthesizing Inductive Invariants by Fuzzing Proofs
Sumit Lahiri and Subhajit Roy
(IIT Kanpur, India)
Real-life programs contain multiple operations whose semantics are unavailable to verification engines, like third-party library calls, inline assembly and SIMD instructions, special compiler-provided primitives, and queries to uninterpretable machine learning models. Even with the exceptional success story of program verification, synthesis of inductive invariants for such “open” programs has remained a challenge. Currently, this problem is handled by manually “closing” the program—by providing hand-written stubs that attempt to capture the behavior of the unmodelled operations; writing stubs is not only difficult and tedious, but the stubs are often incorrect—raising serious questions about the whole endeavor. In this work, we propose Almost Correct Invariants as an automated strategy for synthesizing inductive invariants for such “open” programs. We adopt an active learning strategy where a data-driven learner proposes candidate invariants. Unlike prior work that attempts to verify invariants, we attempt to falsify them: we reduce the falsification problem to a set of reachability checks on non-deterministic programs, and we ride on the success of modern fuzzers to answer these reachability queries. Our tool, Achar, automatically synthesizes inductive invariants that are sufficient to prove the correctness of the target programs. We compare Achar with a state-of-the-art invariant synthesis tool that employs theorem proving on formulae built over the program source. Though Achar is without strong soundness guarantees, our experiments show that even when we provide almost no access to the program source, Achar outperforms the state-of-the-art invariant generator that has complete access to the source.
We also evaluate Achar on programs that current invariant synthesis engines cannot handle—programs that invoke external library calls, inline assembly, and queries to convolution neural networks; Achar successfully infers the necessary inductive invariants within a reasonable time.
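The falsification-as-reachability reduction can be sketched in a few lines: search for a state that satisfies a candidate invariant but whose successor does not, i.e., an inductiveness counterexample. In the sketch below the fuzzer is replaced by plain enumeration, and `falsify_invariant` is an illustrative helper, not Achar's actual encoding.

```python
def falsify_invariant(inv, body, samples):
    """Reduce invariant falsification to a reachability check: look for a
    state s with inv(s) true but inv(body(s)) false. A fuzzer would
    generate `samples`; here we simply enumerate candidates."""
    for s in samples:
        if inv(s) and not inv(body(s)):
            return s  # reaching this point falsifies inductiveness
    return None

# Candidate invariant x <= 10 for the loop body x := x + 1 is not
# inductive: x = 10 satisfies it, but its successor 11 does not.
```

Crucially, `body` can be an opaque executable (a library call, inline assembly, or a neural network query): the check only needs to run it, never to reason about its semantics.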

Testing Dafny (Experience Paper)
Ahmed Irfan, Sorawee Porncharoenwase, Zvonimir Rakamarić, Neha Rungta, and Emina Torlak
(Amazon Web Services, USA; University of Washington, USA)
Verification toolchains are widely used to prove the correctness of critical software systems. To build confidence in their results, it is important to develop testing frameworks that help detect bugs in these toolchains. Inspired by the success of fuzzing in finding bugs in compilers and SMT solvers, we have built the first fuzzing and differential testing framework for Dafny, a high-level programming language with a Floyd-Hoare-style program verifier and compilers to C#, Java, Go, and JavaScript.
This paper presents our experience building and using XDsmith, a testing framework that targets the entire Dafny toolchain, from verification to compilation. XDsmith randomly generates annotated programs in a subset of Dafny that is free of loops and heap-mutating operations. The generated programs include preconditions, postconditions, and assertions, and they have a known verification outcome. These programs are used to test the soundness and precision of the Dafny verifier, and to perform differential testing on the four Dafny compilers. Using XDsmith, we uncovered 31 bugs across the Dafny verifier and compilers, each of which has been confirmed by the Dafny developers. Moreover, 8 of these bugs have been fixed in the mainline release of Dafny.
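The differential-testing half of this pipeline boils down to a voting oracle: run each generated program through every backend and flag any disagreement. In the sketch below the backend callables are trivial stand-ins for the real Dafny compilers, used only to show the shape of the check.

```python
def differential_check(program, backends):
    """Run one generated program through every backend and report which
    backends disagree with the first one's result (the voting oracle)."""
    results = {name: run(program) for name, run in backends.items()}
    baseline = next(iter(results.values()))
    disagreeing = {name for name, r in results.items() if r != baseline}
    return results, disagreeing

# Two stand-in "compilers" that agree, plus one with an injected bug
# that mis-handles non-positive inputs.
backends = {
    "csharp": lambda p: abs(p),
    "java":   lambda p: abs(p),
    "go":     lambda p: p if p > 0 else p,  # injected bug: no abs
}
```

Because XDsmith's generated programs have a known verification outcome, the same harness can also compare the verifier's verdict against the expected one to catch soundness and precision bugs.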

Improving Cross-Platform Binary Analysis using Representation Learning via Graph Alignment
Geunwoo Kim, Sanghyun Hong, Michael Franz, and Dokyung Song
(University of California at Irvine, USA; Oregon State University, USA; Yonsei University, South Korea)

Combining Static Analysis Error Traces with Dynamic Symbolic Execution (Experience Paper)
Frank Busse, Pritam Gharat, Cristian Cadar, and Alastair F. Donaldson
(Imperial College London, UK)

SLIME: Program-Sensitive Energy Allocation for Fuzzing
Chenyang Lyu, Hong Liang, Shouling Ji, Xuhong Zhang, Binbin Zhao, Meng Han, Yun Li, Zhe Wang, Wenhai Wang, and Raheem Beyah
(Zhejiang University, China; Georgia Institute of Technology, USA; Huawei Technologies, n.n.; Institute of Computing Technology at Chinese Academy of Sciences, China)
The energy allocation strategy is one of the most popular techniques in fuzzing to improve code coverage and vulnerability discovery. The core intuition is that fuzzers should allocate more computational energy to the seed files whose mutation is most likely to trigger unique paths and crashes. Existing solutions usually define several properties, e.g., the execution speed, the file size, and the number of triggered edges in the control flow graph, to serve as the key measurements in their allocation logic to estimate the potential of a seed. The efficiency of a property is usually assumed to be the same across different programs. However, we find that this assumption is not always valid. As a result, state-of-the-art energy allocation solutions with static allocation logic struggle to achieve desirable performance across different programs.
To address the above problem, we propose a novel program-sensitive solution, named SLIME, to enable adaptive energy allocation on the seed files with various properties for each program. Specifically, SLIME first designs multiple property-aware queues, with each queue containing the seed files with a specific property. Second, to improve the return on investment, SLIME leverages a customized Upper Confidence Bound Variance-aware (UCB-V) algorithm to statistically select the property queue with the highest estimated reward, i.e., the one expected to find the most new unique execution paths and crashes. Finally, SLIME mutates the seed files in the selected property queue to perform property-adaptive fuzzing on a program. We evaluate SLIME against the state-of-the-art open-source fuzzers AFL, MOPT, AFL++, AFL++HIER, EcoFuzz, and TortoiseFuzz on 9 real-world programs. The results demonstrate that SLIME discovers 3.53X, 0.24X, 0.62X, 1.54X, 0.88X, and 3.81X more unique vulnerabilities compared to the above fuzzers, respectively. We will open source the prototype of SLIME to facilitate future fuzzing research.
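The queue-selection step is a multi-armed bandit problem, and one common form of the UCB-V index (empirical mean plus a variance-aware exploration bonus) is easy to state. The sketch below shows that standard form; SLIME's customized variant may differ, and the parameter names here are illustrative.

```python
import math

def ucb_v(mean, var, n, t, b=1.0):
    """Standard UCB-V index for one arm (property queue): empirical reward
    mean plus a bonus that grows with reward variance and shrinks as the
    arm is tried more often. n = pulls of this arm, t = total rounds,
    b = an upper bound on the reward range."""
    bonus = math.sqrt(2.0 * var * math.log(t) / n) + 3.0 * b * math.log(t) / n
    return mean + bonus

# At each round the fuzzer would mutate seeds from the queue with the
# highest index, balancing exploitation and variance-aware exploration.
```

A rarely tried queue keeps a large bonus, so it still gets fuzzed occasionally even if its mean reward has looked poor so far.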

BET: Black-Box Efficient Testing for Convolutional Neural Networks
Jialai Wang, Han Qiu, Yi Rong, Hengkai Ye, Qi Li, Zongpeng Li, and Chao Zhang
(Tsinghua University, China; Purdue University, USA; Beijing National Research Center for Information Science and Technology, China)
It is important to test convolutional neural networks (CNNs) to identify defects (e.g., error-inducing inputs) before deploying them in security-sensitive scenarios. Although existing white-box testing methods can effectively test CNN models with high neuron coverage, they are not applicable to privacy-sensitive scenarios where full knowledge of target CNN models is lacking. In this work, we propose a novel Black-box Efficient Testing (BET) method for CNN models. The core insight of BET is that CNNs are generally prone to be affected by continuous perturbations. Thus, by generating such continuous perturbations in a black-box manner, we design a tunable objective function to guide our testing process for thoroughly exploring defects in different decision boundaries of the target CNN models. We further design an efficiency-centric policy to find more error-inducing inputs within a fixed query budget. We conduct extensive evaluations with three well-known datasets and five popular CNN structures. The results show that BET significantly outperforms existing white-box and black-box testing methods in terms of effective error-inducing inputs found within a fixed query/inference budget. We further show that the error-inducing inputs found by BET can be used to fine-tune the target model, improving its accuracy by up to 3%.

Program Vulnerability Repair via Inductive Inference
Yuntong Zhang, Xiang Gao, Gregory J. Duck, and Abhik Roychoudhury
(National University of Singapore, Singapore; Beihang University, China)

MDPFuzz: Testing Models Solving Markov Decision Processes
Qi Pang, Yuanyuan Yuan, and Shuai Wang
(Hong Kong University of Science and Technology, China)
The Markov decision process (MDP) provides a mathematical framework for modeling sequential decision-making problems, many of which are crucial to security and safety, such as autonomous driving and robot control. The rapid development of artificial intelligence research has created efficient methods for solving MDPs, such as deep neural networks (DNNs), reinforcement learning (RL), and imitation learning (IL). However, these popular models for solving MDPs are neither thoroughly tested nor rigorously reliable.
We present MDPFuzz, the first blackbox fuzz testing framework for models solving MDPs. MDPFuzz forms testing oracles by checking whether the target model enters abnormal and dangerous states. During fuzzing, MDPFuzz decides which mutated state to retain by measuring if it can reduce cumulative rewards or form a new state sequence. We design efficient techniques to quantify the “freshness” of a state sequence using Gaussian mixture models (GMMs) and dynamic expectation-maximization (DynEM). We also prioritize states with high potential of revealing crashes by estimating the local sensitivity of target models over states.
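Density-based "freshness" of the kind described here can be illustrated with a one-component stand-in for the GMM + DynEM machinery: fit a per-dimension Gaussian to the states seen so far and score new states by negative log-likelihood, so unfamiliar states score high. A simplification for illustration, not the paper's model.

```python
import math

def fit_gaussian(states):
    """Fit an axis-aligned Gaussian to the seen state vectors
    (a single-component stand-in for a Gaussian mixture model)."""
    dims = len(states[0])
    means = [sum(s[d] for s in states) / len(states) for d in range(dims)]
    vars_ = [max(sum((s[d] - means[d]) ** 2 for s in states) / len(states), 1e-6)
             for d in range(dims)]
    return means, vars_

def freshness(state, means, vars_):
    """Negative log-likelihood under the fitted model: higher means the
    state looks less like anything the fuzzer has retained so far."""
    nll = 0.0
    for x, m, v in zip(state, means, vars_):
        nll += 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return nll

seen = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1), (0.05, 0.0)]
```

During fuzzing, mutated states with high freshness (or reduced cumulative reward) are the ones worth retaining for further mutation.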
MDPFuzz is evaluated on five state-of-the-art models for solving MDPs, including supervised DNN, RL, IL, and multi-agent RL. Our evaluation includes scenarios of autonomous driving, aircraft collision avoidance, and two games that are often used to benchmark RL. During a 12-hour run, we find over 80 crash-triggering state sequences on each model. We show inspiring findings that crash-triggering states, though they look normal, induce distinct neuron activation patterns compared with normal states. We further develop an abnormal behavior detector to harden all the evaluated models and repair them with the findings of MDPFuzz to significantly enhance their robustness without sacrificing accuracy.

Automated Testing of Image Captioning Systems
Boxi Yu, Zhiqing Zhong, Xinran Qin, Jiayi Yao, Yuancheng Wang, and Pinjia He
(Chinese University of Hong Kong, China; South China University of Technology, China)
Image captioning (IC) systems, which automatically generate a text description of the salient objects in an image (real or synthetic), have seen great progress over the past few years due to the development of deep neural networks. IC plays an indispensable role in human society, for example, labeling massive photos for scientific studies and assisting visually-impaired people in perceiving the world. However, even the top-notch IC systems, such as Microsoft Azure Cognitive Services and IBM Image Caption Generator, may return incorrect results, leading to the omission of important objects, deep misunderstanding, and threats to personal safety.
To address this problem, we propose MetaIC, the first metamorphic testing approach to validate IC systems. Our core idea is that the object names should exhibit directional changes after object insertion. Specifically, MetaIC (1) extracts objects from existing images to construct an object corpus; (2) inserts an object into an image via novel object resizing and location tuning algorithms; and (3) reports image pairs whose captions do not exhibit differences in an expected way. In our evaluation, we use MetaIC to test one widely-adopted image captioning API and five state-of-the-art (SOTA) image captioning models. Using 1,000 seeds, MetaIC successfully reports 16,825 erroneous issues with high precision (84.9%-98.4%). There are three kinds of errors: misclassification, omission, and incorrect quantity. We visualize the errors reported by MetaIC, which shows that flexible overlapping setting facilitates IC testing by increasing and diversifying the reported errors. In addition, MetaIC can be further generalized to detect label errors in the training dataset, which has successfully detected 151 incorrect labels in MS COCO Caption, a standard dataset in image captioning.
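The metamorphic relation at the core of this approach can be stated in a few lines: after inserting an object into an image, the new caption should mention the inserted object and keep every previously captioned one. The checker below is a sketch of that relation over extracted object-name sets, not the authors' implementation.

```python
def metamorphic_violation(objects_before, inserted, objects_after):
    """Check the MetaIC-style relation on captioned object sets.
    Returns a list of violated expectations: 'omission' if the inserted
    object is not captioned, and 'dropped:<names>' if previously
    captioned objects disappeared after insertion."""
    issues = []
    if inserted not in objects_after:
        issues.append("omission")
    missing = objects_before - objects_after - {inserted}
    if missing:
        issues.append("dropped:" + ",".join(sorted(missing)))
    return issues
```

Image pairs whose captions violate the relation are exactly the ones reported as erroneous issues (misclassification, omission, or incorrect quantity in the paper's taxonomy).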

An Extensive Study on Pre-trained Models for Program Understanding and Generation
Zhengran Zeng, Hanzhuo Tan ORCID logo, Haotian Zhang, Jing Li, Yuqun Zhang, and Lingming Zhang ORCID logo
(Southern University of Science and Technology, China; Hong Kong Polytechnic University, China; Kwai; University of Illinois at Urbana-Champaign, USA)
Automatic program understanding and generation techniques could significantly advance the productivity of programmers and have been widely studied by academia and industry. Recently, the advent of the pre-training paradigm has inspired researchers to develop general-purpose pre-trained models that can be applied to a broad range of program understanding and generation tasks. Such pre-trained models, derived via self-supervised objectives on large unlabelled corpora, can be fine-tuned on downstream tasks (such as code search and code generation) with minimal adaptation. Although these pre-trained models claim superiority over prior techniques, they seldom follow equivalent evaluation protocols, e.g., they are hardly evaluated on identical benchmarks, tasks, or settings. Consequently, there is a pressing need for a comprehensive study of the pre-trained models' effectiveness, versatility, and limitations to provide implications and guidance for future development in this area. To this end, we first perform an extensive study of eight open-access pre-trained models over a large benchmark on seven representative code tasks to assess their reproducibility. We further compare the pre-trained models and domain-specific state-of-the-art techniques to validate the effectiveness of pre-training. Finally, we investigate the robustness of the pre-trained models by inspecting their performance variations under adversarial attacks. Through the study, we find that while we can in general replicate the original performance of the pre-trained models on their evaluated tasks and adopted benchmarks, subtle performance fluctuations can refute the findings in their original papers. Moreover, none of the existing pre-trained models dominates all the others. We also find that the pre-trained models can significantly outperform non-pre-trained state-of-the-art techniques in program understanding tasks.
Furthermore, we perform the first study of the robustness of natural language-programming language pre-trained models under adversarial attacks and find that a simple random attack approach can easily fool the state-of-the-art pre-trained models and thus incur security issues. Finally, we provide multiple practical guidelines for advancing future research on pre-trained models for program understanding and generation.

Article Search
ASRTest: Automated Testing for Deep-Neural-Network-Driven Speech Recognition Systems
Pin Ji, Yang Feng, Jia Liu, Zhihong Zhao, and Zhenyu Chen
(Nanjing University, China)

Article Search
Metamorphic Relations via Relaxations: An Approach to Obtain Oracles for Action-Policy Testing
Hasan Ferit Eniser, Timo P. Gros, Valentin Wüstholz, Jörg Hoffmann, and Maria Christakis
(MPI-SWS, Germany; Saarland University, Germany; ConsenSys, Germany; DFKI, Germany)

Article Search
Hunting Bugs with Accelerated Optimal Graph Vertex Matching
Xiaohui Zhang, Yuanjun Gong, Bin Liang, Jianjun Huang, Wei You, Wenchang Shi, and Jian Zhang
(Renmin University of China, China; Institute of Software at Chinese Academy of Sciences, China)

Article Search
AEON: A Method for Automatic Evaluation of NLP Test Cases
Jen-tse HuangORCID logo, Jianping Zhang ORCID logo, Wenxuan Wang ORCID logo, Pinjia He ORCID logo, Yuxin Su ORCID logo, and Michael R. Lyu ORCID logo
(Chinese University of Hong Kong, China; Sun Yat-sen University, China)
Due to the labor-intensive nature of manual test oracle construction, various automated testing techniques have been proposed to enhance the reliability of Natural Language Processing (NLP) software. In theory, these techniques mutate an existing test case (e.g., a sentence with its label) and assume the generated one preserves an equivalent or similar semantic meaning and thus the same label. However, in practice, many of the generated test cases fail to preserve similar semantic meaning and are unnatural (e.g., contain grammar errors), which leads to a high false alarm rate. Our evaluation study finds that 44% of the test cases generated by the state-of-the-art (SOTA) approaches are false alarms. These test cases require extensive manual checking effort, and instead of improving NLP software, they can even degrade it when utilized in model training. To address this problem, we propose AEON, for Automatic Evaluation Of NLP test cases. For each generated test case, it outputs scores based on semantic similarity and language naturalness. We employ AEON to evaluate test cases generated by four popular testing techniques on five datasets across three typical NLP tasks. The results show that AEON aligns best with human judgment. In particular, AEON achieves the best average precision in detecting semantically inconsistent test cases, outperforming the best baseline metric by 10%. In addition, AEON also has the highest average precision in finding unnatural test cases, surpassing the baselines by more than 15%. Moreover, model training with test cases prioritized by AEON leads to models that are more accurate and robust, demonstrating AEON's potential for improving NLP software.
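AEON's two scores come from neural semantic-similarity and language models; as a toy stand-in for the semantic-consistency check only, a bag-of-words cosine similarity conveys the underlying idea (the function names and the threshold are illustrative, not AEON's):

```python
# Toy illustration of scoring a mutated NLP test case against its seed
# (the real AEON uses neural models; bag-of-words cosine similarity
# stands in here purely for intuition).
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sentences as word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def score_test_case(seed, mutant, threshold=0.5):
    """Flag a generated test case as a likely false alarm when it drifts
    too far from the seed sentence's meaning."""
    sim = cosine_similarity(seed, mutant)
    return {"similarity": round(sim, 3), "false_alarm": sim < threshold}

print(score_test_case("the movie was great", "the film was great"))
# → {'similarity': 0.75, 'false_alarm': False}
```

Word overlap obviously misses paraphrases ("movie" vs. "film" already costs similarity here), which is exactly why AEON uses learned semantic representations instead.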

Article Search
Park: Accelerating Smart Contract Vulnerability Detection via Parallel-Fork Symbolic Execution
Peilin Zheng, Zibin Zheng, and Xiapu LuoORCID logo
(Sun Yat-sen University, China; Hong Kong Polytechnic University, China)

Article Search
Using Pre-trained Language Models to Resolve Textual and Semantic Merge Conflicts (Experience Paper)
Jialu Zhang, Todd Mytkowicz, Mike Kaufman, Ruzica Piskac ORCID logo, and Shuvendu K. Lahiri
(Yale University, USA; Microsoft Research, USA; Microsoft, USA)

Article Search
LiRTest: Augmenting LiDAR Point Clouds for Automated Testing of Autonomous Driving Systems
An Guo, Yang Feng, and Zhenyu Chen
(Nanjing University, China)

Article Search
Test Mimicry to Assess the Exploitability of Library Vulnerabilities
Hong Jin Kang, Truong Giang Nguyen, Bach Le, Corina S. Păsăreanu, and David LoORCID logo
(Singapore Management University, Singapore; University of Melbourne, Australia; Carnegie Mellon University, USA; NASA Ames Research Center, USA)

Article Search
Combining Solution Reuse and Bound Tightening for Efficient Analysis of Evolving Systems
Clay StevensORCID logo and Hamid Bagheri ORCID logo
(University of Nebraska-Lincoln, USA)
Software engineers have long employed formal verification to ensure the safety and validity of their system designs. As the system changes---often via predictable, domain-specific operations---their models must also change, requiring system designers to repeatedly execute the same formal verification on similar system models. State-of-the-art formal verification techniques can be expensive at scale, and that cost is multiplied by repeated analysis. This paper presents a novel analysis technique---implemented in a tool called SoRBoT---which can automatically determine domain-specific optimizations that dramatically reduce the cost of repeatedly analyzing evolving systems. Unlike all prior approaches, which focus either on tightening the bounds for analysis or on reusing all or part of prior solutions, SoRBoT's automated derivation of domain-specific optimizations combines the benefits of both solution reuse and bound tightening while avoiding the main pitfalls of each. We experimentally evaluate SoRBoT against state-of-the-art techniques for verifying evolving specifications, demonstrating that it substantially exceeds their run-time performance while introducing only negligible overhead, in contrast to the expensive additional computations those techniques require.

Article Search
The Raise of Machine Learning Hyperparameter Constraints in Python Code
Ingkarat Rak-amnouykit, Ana Milanova, Guillaume BaudartORCID logo, Martin Hirzel, and Julian Dolby
(Rensselaer Polytechnic Institute, USA; Inria, France; ENS-PSL University, France; IBM Research, USA)

Article Search
Automated Test Generation for REST APIs: No Time to Rest Yet
Myeongsoo Kim, Qi Xin, Saurabh Sinha, and Alessandro Orso
(Georgia Institute of Technology, USA; Wuhan University, China; IBM Research, USA)

Article Search
Detecting and Fixing Data Loss Issues in Android Apps
Wunan Guo, Zhen Dong, Liwei Shen, Wei Tian, Ting Su ORCID logo, and Xin Peng
(Fudan University, China; East China Normal University, China)

Article Search
TensileFuzz: Facilitating Seed Input Generation in Fuzzing via String Constraint Solving
Xuwei Liu, Wei You, Zhuo Zhang, and Xiangyu Zhang
(Purdue University, USA; Renmin University of China, China)

Article Search
Evolution-Aware Detection of Order-Dependent Flaky Tests
Chengpeng Li and August Shi
(University of Texas at Austin, USA)

Article Search
On the Use of Evaluation Measures for Defect Prediction Studies
Rebecca Moussa and Federica SarroORCID logo
(University College London, UK)

Article Search
Learning to Identify Semantic Bugs for String Processing Programs
Charaka Geethal Kapugama, Van-Thuan PhamORCID logo, Aldeida Aleti, and Marcel BöhmeORCID logo
(Monash University, Australia; University of Melbourne, Australia; MPI-SP, Germany)

Article Search
Automatically Detecting API-induced Compatibility Issues in Android Apps: A Comparative Analysis (Replicability Study)
Pei Liu, Yanjie Zhao, Haipeng CaiORCID logo, Mattia Fazzini, John Grundy ORCID logo, and Li Li
(Monash University, Australia; Washington State University at Pullman, USA; University of Minnesota, USA)
Fragmentation is a serious problem in the Android ecosystem. It is mainly caused by the fast evolution of the system itself and the various customizations independently maintained by different smartphone manufacturers. Many efforts have attempted to mitigate its impact via approaches that automatically pinpoint compatibility issues in Android apps. Unfortunately, at this stage, it is still unknown whether this objective has been fulfilled and whether the existing approaches can indeed be replicated and reliably leveraged to pinpoint compatibility issues in the wild. We therefore propose to fill this gap by first conducting a literature review on this topic to identify all the available approaches. Among the nine identified approaches, we then try our best to reproduce them based on their original datasets. After that, we go one step further and empirically compare those approaches against common datasets with real-world apps containing compatibility issues. Experimental results show that existing tools can indeed be reproduced, but their capabilities are quite distinct, as confirmed by the fact that there is only a small overlap in the results reported by the selected tools. This evidence suggests that more effort should be spent by our community to achieve sound compatibility issue detection.

Article Search
HybridRepair: Towards Annotation-Efficient Repair for Deep Learning Models
Yu Li ORCID logo, Muxi Chen ORCID logo, and Qiang Xu ORCID logo
(Chinese University of Hong Kong, China)
A well-trained deep learning (DL) model often cannot achieve expected performance after deployment due to the mismatch between the distributions of the training data and the field data in the operational environment. Therefore, repairing DL models is critical, especially when deployed on increasingly larger tasks with shifted distributions.
Generally speaking, it is easy to obtain a large amount of field data. Existing solutions develop various techniques to select a subset for annotation and then fine-tune the model for repair. While effective, achieving a higher repair rate inevitably incurs higher labeling costs. To mitigate this problem, we propose a novel annotation-efficient repair solution for DL models, namely HybridRepair, wherein we take a holistic approach that coordinates the use of a small amount of annotated data and a large amount of unlabeled data for repair. Our key insight is that accurate yet sufficient training data is needed to repair the corresponding failure region in the data distribution. Under a given labeling budget, on the one hand, we selectively annotate some data in failure regions and propagate their labels to the neighboring data. On the other hand, we take advantage of semi-supervised learning (SSL) techniques to further boost the training data density. However, different from existing SSL solutions that try to use all the unlabeled data, we use only a selected part of them, considering the impact of distribution shift on SSL solutions. Experimental results show that HybridRepair outperforms both state-of-the-art DL model repair solutions and semi-supervised techniques for model improvement, especially when there is a distribution shift between the training data and the field data. Our code is available at:

Article Search
Finding Bugs in Gremlin-Based Graph Database Systems via Randomized Differential Testing
Yingying Zheng, Wensheng Dou, Yicheng Wang, Zheng Qin, Lei Tang, Yu Gao, Dong Wang, Wei Wang, and Jun Wei
(Institute of Software at Chinese Academy of Sciences, China)

Article Search
NCScope: Hardware-Assisted Analyzer for Native Code in Android Apps
Hao Zhou, Shuohan Wu, Xiapu LuoORCID logo, Ting Wang, Yajin Zhou, Chao Zhang ORCID logo, and Haipeng CaiORCID logo
(Hong Kong Polytechnic University, China; Pennsylvania State University, USA; Zhejiang University, China; Tsinghua University, China; Beijing National Research Center for Information Science and Technology, China; Washington State University, USA)

Article Search
Cross-Lingual Transfer Learning for Statistical Type Inference
Zhiming Li, Xiaofei Xie, Haoliang Li, Zhengzi Xu, Yi LiORCID logo, and Yang Liu
(Nanyang Technological University, Singapore; Singapore Management University, Singapore; City University of Hong Kong, China)
Statistical type inference systems have hitherto relied heavily on supervised learning approaches, which require laborious manual effort to collect and label large amounts of data. Most Turing-complete imperative languages share similar control- and data-flow structures, which makes it possible to transfer knowledge learned from one language to another. In this paper, we propose a cross-lingual transfer learning framework, PLATO, for statistical type inference, which allows us to leverage prior knowledge learned from the labeled dataset of one language and transfer it to others, e.g., Python to JavaScript, Java to JavaScript, etc. PLATO is powered by a novel kernelized attention mechanism that constrains the attention scope of the backbone Transformer model such that the model is forced to base its predictions on features commonly shared among languages. In addition, we propose a syntax enhancement that augments learning on the feature overlap among language domains. Furthermore, PLATO can also be used to improve the performance of conventional supervised type inference by introducing cross-language augmentation, which enables the model to learn more general features across multiple languages. We evaluated PLATO under two settings: 1) in the cross-domain scenario, where the target language data is unlabeled or only partially labeled, PLATO outperforms the state-of-the-art domain transfer techniques by a large margin, e.g., improving the Python to TypeScript baseline by +14.6%@EM and +18.6%@weighted-F1; and 2) in the conventional monolingual supervised scenario, PLATO improves the Python baseline by +4.10%@EM and +1.90%@weighted-F1 with the introduction of cross-lingual augmentation.

Article Search
Precise and Efficient Atomicity Violation Detection for Interrupt-Driven Programs via Staged Path Pruning
Chao Li ORCID logo, Rui Chen ORCID logo, Boxiang Wang ORCID logo, Tingting Yu ORCID logo, Dongdong Gao ORCID logo, and Mengfei Yang
(Beijing Institute of Control Engineering, China; Beijing Sunwise Information Technology, China; Xidian University, China; China Academy of Space Technology, China)
Interrupt-driven programs are widely used in aerospace and other safety-critical areas. However, uncertain interleaved execution of interrupts may cause concurrency bugs, which could result in serious safety problems. Most previous research on detecting interrupt concurrency bugs focuses on data races, which are usually benign, as shown in empirical studies. Some studies focus on pattern-based atomicity violations, which are most likely harmful, but they cannot achieve high precision and scalability simultaneously. This paper presents intAtom, a precise and efficient static detection technique for interrupt atomicity violations, described by the access interleaving pattern. The key point is that it eliminates false violations by staged path pruning with constraint solving. It first identifies all the violation candidates using data-flow analysis and access interleaving pattern matching. intAtom then analyzes the path feasibility between two consecutive accesses in the preempted task/interrupt, in order to recognize the atomicity intention of developers, with the help of which it filters out some candidates. Finally, it performs modular path pruning by constructing symbolic summaries and selecting representative preemption points to eliminate infeasible paths in concurrent contexts efficiently. All the path feasibility checking is based on sparse value-flow analysis, which makes intAtom scalable. intAtom is evaluated on a benchmark and 6 real-world aerospace embedded programs. The experimental results show that intAtom reduces false positives by 72% and improves detection speed by 3 times, compared to the state-of-the-art methods. Furthermore, it finishes analyzing the real-world aerospace embedded software very quickly, with an average FP rate of 19.6%, while finding 19 bugs that were confirmed by developers.
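The access interleaving pattern that seeds intAtom's candidate identification can be sketched as follows. This is a heavy simplification of the paper's data-flow analysis (the function and variable names are ours): a candidate arises when a task makes two consecutive accesses to a shared variable that some interrupt handler also writes.

```python
# Minimal sketch of the candidate-identification stage of intAtom-style
# detection (our simplification): flag a pair of consecutive task accesses
# to a shared variable that an interrupt handler can overwrite in between.

def find_candidates(task_accesses, isr_writes):
    """task_accesses: ordered list of (variable, 'r' or 'w') in the task.
    isr_writes: set of variables written by interrupt handlers.
    Returns (index, variable) pairs matching the access interleaving pattern."""
    candidates = []
    for i in range(len(task_accesses) - 1):
        (v1, _), (v2, _) = task_accesses[i], task_accesses[i + 1]
        if v1 == v2 and v1 in isr_writes:
            candidates.append((i, v1))
    return candidates

# A task reads then writes 'flag'; an ISR also writes 'flag' → candidate.
task = [("flag", "r"), ("flag", "w"), ("buf", "w")]
print(find_candidates(task, isr_writes={"flag"}))  # → [(0, 'flag')]
```

The paper's staged path pruning then discards candidates whose interleavings are infeasible; that constraint-solving step is well beyond this sketch.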

Article Search
Detecting Resource Utilization Bugs Induced by Variant Lifecycles in Android
Yifei Lu, Minxue PanORCID logo, Yu Pei, and Xuandong Li
(Nanjing University, China; Hong Kong Polytechnic University, China)

Article Search
Efficient Greybox Fuzzing of Applications in Linux-Based IoT Devices via Enhanced User-Mode Emulation
Yaowen Zheng, Yuekang Li, Cen Zhang, Hongsong Zhu, Yang Liu, and Limin Sun
(Nanyang Technological University, Singapore; Institute of Information Engineering at Chinese Academy of Sciences, China)

Article Search

Conditionally Accepted Papers

On the Use of Mutation Analysis for Evaluating Student Test Suite Quality
James Perretta, Andrew DeOrio, Arjun Guha, and Jonathan Bell
(Northeastern University, USA; University of Michigan, USA)

Article Search
WASAI: Uncovering Vulnerabilities in Wasm Smart Contracts
Weimin Chen, Zihan Sun, Haoyu Wang, Xiapu LuoORCID logo, Haipeng CaiORCID logo, and Lei Wu
(Hong Kong Polytechnic University, China; Beijing University of Posts and Telecommunications, China; Huazhong University of Science and Technology, China; Washington State University at Pullman, USA; Zhejiang University, China)

Article Search
CIRCLE: Continual Repair across Programming Languages
Wei Yuan, Quanjun Zhang, Tieke He, Chunrong Fang, Nguyen Quoc Viet Hung, Xiaodong Hao, and Hongzhi Yin
(University of Queensland, Australia; Nanjing University, China; Griffith University, Australia)
Automatic Program Repair (APR) aims at fixing buggy source code with less manual debugging effort, playing a vital role in improving software reliability and development productivity. Recent APR work has achieved remarkable progress by applying deep learning (DL), particularly neural machine translation (NMT) techniques. However, we observe that existing DL-based APR models suffer from at least two severe drawbacks: (1) most of them can only generate patches for a single programming language; as a result, to repair multiple languages, we have to build and train many repair models; (2) most of them are developed offline and therefore cannot adapt to newly arriving requirements.
To address the above problems, we propose a T5-based APR framework equipped with continual learning ability across multiple programming languages, namely ContInual Repair aCross Programming LanguagEs (CIRCLE). Specifically, (1) CIRCLE utilizes a prompting function to narrow the gap between natural language processing (NLP) pre-training tasks and APR. (2) CIRCLE adopts a difficulty-based rehearsal strategy to achieve lifelong learning for APR without access to the full historical data. (3) An elastic regularization method is employed to further strengthen CIRCLE's continual learning ability, preventing catastrophic forgetting. (4) CIRCLE applies a simple but effective re-repairing method to revise generated errors caused by crossing multiple programming languages.
We train CIRCLE on four languages (i.e., C, Java, JavaScript, and Python) and evaluate it on five commonly used benchmarks. The experimental results demonstrate that CIRCLE not only effectively and efficiently repairs multiple programming languages in continual learning settings, but also achieves state-of-the-art performance (e.g., fixing 64 Defects4J bugs) with a single repair model.

Article Search
DocTer: Extracting Constraints from Documentation to Test Deep Learning API Functions
Danning Xie, Yitong Li, Mijung Kim, Hung Viet Pham, Lin Tan, Xiangyu Zhang, and Michael W. Godfrey
(Purdue University, USA; University of Waterloo, Canada; Ulsan National Institute of Science and Technology, South Korea)

Article Search
PermDroid: Automatically Testing Permission-Related Behaviour of Android Applications
Shuaihao Yang, Zigang Zeng, and Wei Song ORCID logo
(Nanjing University of Science and Technology, China)
The Android runtime permission model allows users to grant and revoke permissions at runtime. To verify the robustness of apps, developers have to test them repeatedly under a wide range of permission combinations, which is time-consuming and unsuited to regression testing. Existing app testing techniques are of limited help in this context, as they seldom consider different permission combinations explicitly. To address this issue, we present PermDroid to automatically test the permission-related behaviour of apps with permissions granted/revoked dynamically. PermDroid first statically constructs a state transition graph (STG) for the app; it then utilizes the STG for permission-directed exploration, testing permission-related behaviour only under combinations of the relevant permissions. The experimental results on 50 real-world Android apps demonstrate the effectiveness and efficiency of PermDroid: the average permission-related API invocation coverage reaches 72.38% in 10 minutes, and seven permission-related bugs are uncovered, six of which are not detected by competing tools.
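The combinatorial cost motivating PermDroid's permission-directed exploration is easy to make concrete: exhaustively testing n runtime permissions means 2^n grant/revoke combinations. A minimal enumeration sketch (the permission names are illustrative, and real testing would drive an app under each combination rather than just list them):

```python
# Naive enumeration of grant/revoke decisions over runtime permissions,
# illustrating the exponential blow-up that exhaustive testing faces.
from itertools import product

def permission_combinations(permissions):
    """Yield every grant/revoke assignment as a dict of decisions."""
    for decisions in product([True, False], repeat=len(permissions)):
        yield dict(zip(permissions, decisions))

perms = ["CAMERA", "ACCESS_FINE_LOCATION", "RECORD_AUDIO"]
combos = list(permission_combinations(perms))
print(len(combos))  # → 8 combinations for just 3 permissions
print(combos[0])    # → {'CAMERA': True, 'ACCESS_FINE_LOCATION': True, 'RECORD_AUDIO': True}
```

Restricting exploration to combinations of the permissions relevant to each behaviour, as PermDroid's STG-guided strategy does, is what keeps this tractable.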

Article Search
SmartDagger: A Bytecode-Based Static Analysis Approach for Detecting Cross-Contract Vulnerability
Zeqin Liao ORCID logo, Zibin Zheng, Xiao Chen, and Yuhong Nan ORCID logo
(Sun Yat-sen University, China)

Article Search
Detecting Multi-sensor Fusion Errors in Advanced Driver-Assistance Systems
Ziyuan ZhongORCID logo, Zhisheng Hu ORCID logo, Shengjian Guo ORCID logo, Xinyang Zhang ORCID logo, Zhenyu Zhong ORCID logo, and Baishakhi RayORCID logo
(Columbia University, USA; Baidu Security, USA)
Advanced Driver-Assistance Systems (ADAS) have been thriving and widely deployed in recent years. In general, these systems receive sensor data, compute driving decisions, and output control signals to the vehicles. To smooth out the uncertainties brought by sensor outputs, they usually leverage multi-sensor fusion (MSF) to fuse the sensor outputs and produce a more reliable understanding of the surroundings. However, MSF cannot completely eliminate the uncertainties, since it lacks knowledge about which sensor provides the most accurate data and how to optimally integrate the data provided by the sensors. As a result, critical consequences can occur unexpectedly. In this work, we observe that popular MSF methods in an industry-grade ADAS can mislead car control and result in serious safety hazards. We define the failures (e.g., car crashes) caused by faulty MSF as fusion errors and develop a novel evolutionary-based, domain-specific search framework, FusED, for the efficient detection of fusion errors. We further apply causality analysis to show that the found fusion errors are indeed caused by the MSF method. We evaluate our framework on two widely used MSF methods in two driving environments. Experimental results show that FusED identifies more than 150 fusion errors. Finally, we provide several suggestions to improve the MSF methods we study.

RegMiner: Towards Constructing a Large Regression Dataset from Code Evolution History
Xuezhi Song, Yun Lin, Siang Hwee Ng, Yijian Wu, Xin Peng, Jin Song Dong, and Hong Mei
(Fudan University, China; National University of Singapore, Singapore; Peking University, China)

Article Search
One Step Further: Evaluating Interpreters using Metamorphic Testing
Ming Fan ORCID logo, JiaLi Wei, WuXia Jin, Zhou Xu, Wenying Wei, and Ting Liu
(Xi'an Jiaotong University, China; Chongqing University, China)
The black-box nature of Deep Neural Networks (DNNs) makes it difficult for people to understand why a DNN makes a specific decision, which restricts its application in critical tasks. Recently, many interpreters (interpretation methods) have been proposed to improve the transparency of DNNs by providing relevant features in the form of a saliency map. However, different interpreters might provide different interpretation results for the same classification case, which motivates us to conduct a robustness evaluation of interpreters. The biggest challenge in evaluating interpreters is the testing oracle problem, i.e., it is hard to label ground-truth interpretation results. To fill this critical gap, we first use images with bounding boxes from an object detection system and images inserted with backdoor triggers as our original ground-truth dataset. Then, we apply metamorphic testing to extend the dataset via three operators: inserting an object, deleting an object, and feature-squeezing the image background. Our key intuition is that after these three operations, which do not modify the primary detected objects, the interpretation results should not change for good interpreters. Finally, we measure the quality of interpretation results quantitatively with the Intersection-over-Minimum (IoMin) score and evaluate interpreters based on the statistics of metamorphic relation failures.
We evaluate seven popular interpreters on 877,324 metamorphic images in diverse scenes. The results show that our approach can quantitatively evaluate interpreters' robustness, where Grad-CAM provides the most reliable interpretation results among the seven interpreters.
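The IoMin score used above admits a direct statement: the size of the overlap between two regions divided by the size of the smaller region. A sketch on pixel-coordinate sets (our simplified formulation of the metric as described, not the paper's implementation):

```python
# Intersection-over-Minimum (IoMin) between two regions, represented as
# sets of pixel coordinates. 1.0 means the smaller region lies entirely
# inside the larger one; 0.0 means no overlap.

def iomin(region_a, region_b):
    """|A ∩ B| / min(|A|, |B|), with empty regions scoring 0.0."""
    if not region_a or not region_b:
        return 0.0
    inter = len(region_a & region_b)
    return inter / min(len(region_a), len(region_b))

truth = {(x, y) for x in range(4) for y in range(4)}    # 4×4 ground-truth box
salient = {(x, y) for x in range(2) for y in range(2)}  # 2×2 region inside it
print(iomin(truth, salient))  # → 1.0
```

Dividing by the minimum (rather than the union, as in IoU) avoids penalizing a saliency map merely for being tighter or looser than the ground-truth box.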

Article Search
PrIntFuzz: Fuzzing Linux Drivers via Automated Virtual Device Simulation
Zheyu Ma ORCID logo, Bodong Zhao ORCID logo, Letu Ren ORCID logo, Zheming Li ORCID logo, Siqi Ma ORCID logo, Xiapu LuoORCID logo, and Chao Zhang ORCID logo
(Tsinghua University, China; UNSW, Australia; Hong Kong Polytechnic University, China; Beijing National Research Center for Information Science and Technology, China)
Linux drivers share the same address space and privilege as the core of the kernel but have a much larger code base and attack surface. Linux drivers are not well tested and have weaker security guarantees than the kernel. Lacking support from hardware devices, existing fuzzing solutions fail to cover a large portion of the driver code, e.g., the initialization code and interrupt handlers. In this paper, we present PrIntFuzz, an efficient and universal fuzzing framework that can test this overlooked driver code, including the PRobing code and INTerrupt handlers. PrIntFuzz first extracts knowledge from the driver through inter-procedural, field-sensitive, path-sensitive, and flow-sensitive static analysis. Then it utilizes the information to build a flexible and efficient simulator, which supports device probing, hardware interrupt emulation, and device I/O interception. Lastly, PrIntFuzz applies a multi-dimensional fuzzing strategy to explore the overlooked code. We have developed a prototype of PrIntFuzz and successfully simulated 311 virtual PCI (Peripheral Component Interconnect) devices, 472 virtual I2C (Inter-Integrated Circuit) devices, and 169 virtual USB (Universal Serial Bus) devices, and found 150 bugs in the corresponding device drivers. We have submitted patches for these bugs to the Linux kernel community, and 59 patches have been merged so far. In a control experiment on Linux 5.10-rc6, PrIntFuzz found 99 bugs, while the state-of-the-art fuzzer found only 50. PrIntFuzz covers 11,968 basic blocks on the latest Linux kernel, while the state-of-the-art fuzzer Syzkaller covers only 2,353 basic blocks.

Article Search
