
32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2023), July 17–21, 2023, Seattle, WA, USA

ISSTA 2023 – Proceedings


Frontmatter

Title Page


Welcome from the Chairs
It is our great pleasure to welcome you to ISSTA 2023, the 32nd edition of the International Symposium on Software Testing and Analysis, to be held on July 18–20, 2023, in Seattle, USA. The symposium has become a premier scientific event in the expanding area of software testing and analysis, with a strong appeal to researchers from all continents.

ISSTA 2023 Organization
Committees

Papers

CydiOS: A Model-Based Testing Framework for iOS Apps
Shuohan Wu, Jianfeng Li, Hao Zhou, Yongsheng Fang, Kaifa Zhao, Haoyu Wang, Chenxiong Qian, and Xiapu Luo
(Hong Kong Polytechnic University, China; Xi’an Jiaotong University, China; Beijing University of Posts and Telecommunications, China; Huazhong University of Science and Technology, China; University of Hong Kong, China)
To make an app stand out in an increasingly competitive market, developers must ensure its quality to deliver a better user experience. UI testing is a popular technique for quality assurance, as it can thoroughly test an app from the users’ perspective. However, while considerable research has studied UI testing on the Android platform, there is no comparable research for iOS. This paper introduces CydiOS, a novel approach to performing model-based testing of iOS apps. CydiOS enhances existing static analysis to build a more complete static model of the app under test. We propose an approach to retrieve runtime information and obtain the real-time app context, which can be mapped onto the model. To improve the effectiveness of UI testing, we also introduce a potential-aware search algorithm to guide test execution. We compare CydiOS with four representative algorithms (i.e., random, depth-first, Stoat, and APE). We have evaluated CydiOS on 50 popular apps from the App Store, and the results show that CydiOS outperforms the other tools, achieving both higher code coverage and higher screen coverage. We open-source CydiOS at https://github.com/SoftWare2022Testing/CydiOS, where a demo video can also be found.

Publisher's Version Published Artifact Artifacts Available
Improving Bit-Blasting for Nonlinear Integer Constraints
Fuqi Jia, Rui Han, Pei Huang, Minghao Liu, Feifei Ma, and Jian Zhang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Stanford University, USA)
Nonlinear integer constraints are common and difficult in the verification and analysis of software/hardware. SMT(QF_NIA) generalizes such constraints as a Boolean combination of nonlinear integer arithmetic constraints. A classical method to solve SMT(QF_NIA) is bit-blasting, which reduces the problem to Boolean satisfiability. Currently, existing pure bit-blasting-based solvers are not competitive with other state-of-the-art SMT solvers. Bit-blasting-based methods suffer from several problems: first, bit-blasting is hampered by nonlinear multiplication operations; second, it sometimes does not search in a proper search space; and third, it contains redundancy.
In this paper, we focus on improving the efficiency of the bit-blasting-based method. To decide on a proper search space, we propose an adaptive function for hard nonlinear multiplications and heuristic strategies that analyze specific constraints. We also observe that different orders of successive additions result in bit vectors with different bit-widths, and we propose an optimal order-decision algorithm to remove this redundancy. We implement the proposed methods in a solver named BLAN. Experiments demonstrate that BLAN outperforms other state-of-the-art SMT solvers (AProVE, cvc5, MathSAT, Yices2, and Z3) on the satisfiable SMT(QF_NIA) instances in SMT-LIB. We provide an outlook on solving unsatisfiable instances by combining BLAN with other solvers. A sensitivity analysis also demonstrates the effectiveness of the proposed methods.
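To make the bit-width observation concrete, here is a minimal Python sketch (an illustration, not BLAN’s actual algorithm): if adding two bit-vectors of widths a and b conservatively yields max(a, b) + 1 bits, then the order of successive additions changes both the intermediate and the final bit-widths, and a Huffman-style greedy order that merges the narrowest terms first keeps the blasted formula smaller.

    import heapq

    def add_width(a, b):
        # A sum of two bit-vectors of widths a and b needs at most
        # max(a, b) + 1 bits, so every addition grows the result by one bit.
        return max(a, b) + 1

    def fold_left(widths):
        # Add the terms in the order they appear in the expression.
        acc, total = widths[0], 0
        for w in widths[1:]:
            acc = add_width(acc, w)
            total += acc
        return acc, total        # final width, sum of intermediate widths

    def fold_greedy(widths):
        # Huffman-style order: always add the two narrowest terms first.
        heap = list(widths)
        heapq.heapify(heap)
        total = 0
        while len(heap) > 1:
            w = add_width(heapq.heappop(heap), heapq.heappop(heap))
            total += w
            heapq.heappush(heap, w)
        return heap[0], total

    terms = [8, 1, 1, 1]         # bit-widths of four summands
    print(fold_left(terms))      # (11, 30): wide intermediate results
    print(fold_greedy(terms))    # (9, 14):  narrower results, less redundancy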

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
CONCORD: Clone-Aware Contrastive Learning for Source Code
Yangruibo Ding, Saikat Chakraborty, Luca Buratti, Saurabh Pujar, Alessandro Morari, Gail Kaiser, and Baishakhi Ray
(Columbia University, USA; Microsoft Research, USA; IBM Research, USA)
Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection.
While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it is also essential to factor in how developers code day-to-day when learning general-purpose representations. On the one hand, human developers tend to write repetitive programs, referencing existing code snippets from the current codebase or from online resources (e.g., Stack Overflow) rather than implementing functions from scratch; such behavior results in a vast number of code clones. On the other hand, a clone that deviates by mistake might trigger malicious program behaviors.
Thus, as a proxy to incorporate developers' coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised pre-training strategy to place benign clones closer in the representation space while moving deviants further apart. We show that CONCORD's clone-aware pre-training drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can improve existing pre-trained models to learn better representations that consequently become more efficient in both identifying semantically equivalent programs and differentiating buggy from non-buggy code.
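A minimal sketch of such a training signal (illustrative only, not CONCORD’s exact objective): a contrastive loss that pulls a program’s embedding toward its benign clone and pushes it away from its bug-injected deviant. The encoder producing the embeddings is assumed and stubbed out with random tensors.

    import torch
    import torch.nn.functional as F

    def clone_aware_loss(anchor, clone, deviant, temperature=0.07):
        # Inputs are [batch, dim] embeddings from a code encoder.
        a = F.normalize(anchor, dim=-1)
        p = F.normalize(clone, dim=-1)
        n = F.normalize(deviant, dim=-1)
        pos = torch.sum(a * p, dim=-1) / temperature  # similarity to the clone
        neg = torch.sum(a * n, dim=-1) / temperature  # similarity to the deviant
        # Softmax over {positive, negative}: maximize the clone's share.
        logits = torch.stack([pos, neg], dim=-1)
        return F.cross_entropy(logits, torch.zeros(len(a), dtype=torch.long))

    embed = lambda: torch.randn(8, 256)   # stand-in encoder outputs
    print(clone_aware_loss(embed(), embed(), embed()))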

Publisher's Version
Towards Efficient Fine-Tuning of Pre-trained Code Models: An Experimental Study and Beyond
Ensheng Shi, Yanlin Wang, Hongyu Zhang, Lun Du, Shi Han, Dongmei Zhang, and Hongbin Sun
(Xi’an Jiaotong University, China; Sun Yat-sen University, China; Chongqing University, China; Microsoft, China)
Recently, fine-tuning pre-trained code models such as CodeBERT on downstream tasks has achieved great success in many software testing and analysis tasks. While effective and prevalent, fine-tuning the pre-trained parameters incurs a large computational cost. In this paper, we conduct an extensive experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning. We then propose efficient alternatives to fine-tuning large pre-trained code models based on the following findings. Our experimental study shows that (1) the lexical, syntactic, and structural properties of source code are encoded in the lower, intermediate, and higher layers, respectively, while the semantic property spans the entire model. (2) The process of fine-tuning preserves most of the code properties. Specifically, the basic code properties captured by the lower and intermediate layers are preserved during fine-tuning. Furthermore, we find that only the representations of the top two layers change most during fine-tuning for various downstream tasks. (3) Based on these findings, we propose Telly to efficiently fine-tune pre-trained code models via layer freezing. Extensive experimental results on five diverse downstream tasks demonstrate that the training parameters and the corresponding time cost are greatly reduced, while performance is similar or better.
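A minimal sketch of the layer-freezing idea (the cut-off point and training setup are illustrative, not Telly’s exact configuration): freeze the embeddings and lower transformer layers of a CodeBERT-style model so that only the top layers, whose representations change most during fine-tuning, are updated.

    import torch
    from transformers import AutoModel

    # CodeBERT is RoBERTa-based, so its transformer blocks live in
    # model.encoder.layer (a ModuleList of 12 layers).
    model = AutoModel.from_pretrained("microsoft/codebert-base")

    FREEZE_UP_TO = 10   # illustrative: freeze embeddings + bottom 10 layers

    for param in model.embeddings.parameters():
        param.requires_grad = False
    for layer in model.encoder.layer[:FREEZE_UP_TO]:
        for param in layer.parameters():
            param.requires_grad = False

    # Only the remaining (top-layer) parameters go to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=2e-5)
    print(f"trainable parameters: {sum(p.numel() for p in trainable):,}")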

Publisher's Version
Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper)
Xu Nie, Ningke Li, Kailong Wang, Shangguang Wang, Xiapu Luo, and Haoyu Wang
(Huazhong University of Science and Technology, China; Beijing University of Posts and Telecommunications, China; Hong Kong Polytechnic University, China)
Software system complexity and security vulnerability diversity are plausible sources of the persistent challenges in software vulnerability research. Applying deep learning methods to automatic vulnerability detection has proven an effective complement to traditional detection approaches. Unfortunately, the lack of well-qualified benchmark datasets can critically restrict the effectiveness of deep learning-based vulnerability detection techniques. Specifically, the long-standing presence of erroneous labels in existing vulnerability datasets may lead to inaccurate, biased, and even flawed results.
In this paper, we aim to obtain an in-depth understanding and explanation of the causes of label errors. To this end, we systematically analyze the diversified datasets used by state-of-the-art learning-based vulnerability detection approaches and examine their techniques for collecting vulnerable source code. We find that label errors heavily impact mainstream vulnerability detection models, with a worst-case average F1 drop of 20.7%. As mitigation, we introduce two approaches to dataset denoising, which enhance model performance by an average of 10.4%. Leveraging these dataset denoising methods, we provide a feasible solution for obtaining high-quality labeled datasets.
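As a generic illustration of dataset denoising (the paper contributes its own two approaches; this sketch shows a common confident-learning-style baseline over assumed feature vectors): flag samples whose given label receives low cross-validated predicted probability and drop them before training.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict

    X = np.random.rand(1000, 32)        # stand-in code-feature vectors
    y = np.random.randint(0, 2, 1000)   # noisy vulnerable/benign labels

    proba = cross_val_predict(RandomForestClassifier(n_estimators=100),
                              X, y, cv=5, method="predict_proba")
    self_conf = proba[np.arange(len(y)), y]  # confidence in the given label
    keep = self_conf >= 0.3                  # drop likely-mislabeled samples
    X_clean, y_clean = X[keep], y[keep]
    print(f"kept {keep.sum()} / {len(y)} samples")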

Publisher's Version
Pattern-Based Peephole Optimizations with Java JIT Tests
Zhiqiang Zang, Aditya Thimmaiah, and Milos Gligoric
(University of Texas at Austin, USA)
We present JOG, a framework that facilitates developing Java JIT peephole optimizations alongside JIT tests. JOG enables developers to write a pattern, in Java itself, that specifies a desired code transformation via the code before and after the optimization, as well as any necessary preconditions. Such patterns can be written in the same way that tests of the optimizations are already written in OpenJDK. JOG translates each pattern into C/C++ code that can be integrated as a JIT optimization pass, and it also generates Java tests for the optimizations from the patterns. Furthermore, JOG can automatically detect possible shadowing relations between pairs of optimizations, where the effect of the shadowed optimization is overridden by the other. Our evaluation shows that JOG makes it easier to write readable JIT optimizations alongside tests without decreasing the effectiveness of the JIT optimizations. We wrote 162 patterns, including 68 existing optimizations in OpenJDK, 92 new optimizations adapted from LLVM, and two new optimizations that we proposed. We opened eight pull requests (PRs) for OpenJDK, including six for new optimizations, one on removing shadowed optimizations, and one for newly generated JIT tests; seven PRs have already been integrated into the master branch of OpenJDK.

Publisher's Version
Icicle: A Re-designed Emulator for Grey-Box Firmware Fuzzing
Michael Chesser, Surya Nepal, and Damith C. Ranasinghe
(University of Adelaide, Australia; CSIRO’s Data61, Australia)
Emulation-based fuzzers enable testing binaries without source code and facilitate testing embedded applications, where automated execution on the target hardware architecture is difficult and slow. The instrumentation techniques added to extract feedback and guide input mutations towards generating effective test cases are at the core of modern fuzzers. However, modern emulation-based fuzzers have evolved by re-purposing general-purpose emulators; consequently, developing and integrating fuzzing techniques, such as instrumentation methods, is difficult and often done in an ad-hoc manner, specific to an instruction set architecture (ISA). This limits state-of-the-art fuzzing techniques to a few ISAs, such as x86/x86-64 or ARM/AArch64; this is a significant problem for fuzzing firmware across diverse ISAs.
This study presents our efforts to re-think emulation for fuzzing. We design and implement a fuzzing-specific, multi-architecture emulation framework, Icicle. We demonstrate the capability to add instrumentation once, in an architecture-agnostic manner, with low execution overhead. We employ Icicle as the emulator for a state-of-the-art ARM firmware fuzzer, Fuzzware, and replicate its results. Significantly, we demonstrate that the new instrumentation available in Icicle enabled the discovery of new bugs. We demonstrate the fidelity of Icicle and the efficacy of architecture-agnostic instrumentation by discovering bugs in benchmarks that require a known and specific operational capability of instrumentation techniques, across a diverse set of instruction set architectures (x86-64, ARM/AArch64, RISC-V, MIPS). Further, to demonstrate the effectiveness of Icicle in discovering bugs in an architecture currently unsupported by emulation-based fuzzers, we perform a fuzzing campaign with real-world firmware binaries for Texas Instruments’ MSP430 ISA and discover 7 new bugs.

Publisher's Version
Fine-Grained Code Clone Detection with Block-Based Splitting of Abstract Syntax Tree
Tiancheng Hu, Zijing Xu, Yilin Fang, Yueming Wu, Bin Yuan, Deqing Zou, and Hai Jin
(Huazhong University of Science and Technology, China; Nanyang Technological University, Singapore)
Code clone detection aims to find similar code fragments and is gaining importance in the field of software engineering. There are several types of techniques for detecting code clones. Text-based or token-based code clone detectors are scalable and efficient but lack consideration of syntax, resulting in poor performance in detecting syntactic code clones. Although some tree-based methods have been proposed to detect syntactic or semantic code clones with decent performance, they are mostly time-consuming and lack scalability. In addition, these detection methods cannot perform fine-grained code clone detection: they are unable to distinguish the concrete code blocks that are cloned. In this paper, we design Tamer, a scalable and fine-grained tree-based syntactic code clone detector. Specifically, we propose a novel method to transform the complex abstract syntax tree into simple subtrees. It accelerates detection and enables fine-grained analysis of clone pairs to locate the concretely cloned parts of the code. To examine the detection performance and scalability of Tamer, we evaluate it on the widely used dataset BigCloneBench. Experimental results show that Tamer outperforms ten state-of-the-art code clone detection tools (i.e., CCAligner, SourcererCC, Siamese, NIL, NiCad, LVMapper, Deckard, Yang2018, CCFinder, and CloneWorks).
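A minimal sketch of block-based AST splitting (illustrative; Tamer targets Java via BigCloneBench, whereas this uses Python’s ast module, and a real detector would first normalize identifiers and literals to catch renamed clones): split a program into block-level subtrees and fingerprint each one, so matching fingerprints point at the concrete cloned blocks.

    import ast, hashlib

    def block_fingerprints(source):
        # Split the AST into block-level subtrees (functions, loops,
        # branches) and fingerprint each one for clone comparison.
        prints = {}
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.For, ast.While, ast.If)):
                dump = ast.dump(node, annotate_fields=False)
                digest = hashlib.md5(dump.encode()).hexdigest()
                prints[digest] = (type(node).__name__, node.lineno)
        return prints

    # Hypothetical input files for the two programs being compared.
    a = block_fingerprints(open("candidate.py").read())
    b = block_fingerprints(open("reference.py").read())
    for h in a.keys() & b.keys():                  # matching subtrees
        print("cloned block:", a[h], "<->", b[h])  # block kind + line number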

Publisher's Version
Reducing the Memory Footprint of IFDS-Based Data-Flow Analyses using Fine-Grained Garbage Collection
Dongjie He, Yujiang Gui, Yaoqing Gao, and Jingling Xue
(UNSW, Australia; Huawei Toronto Research Center, Canada)
The IFDS algorithm can be both memory- and compute-intensive for large programs, as it needs to store a huge number of path edges in memory and process them until a fixed point is reached. In general, an IFDS-based data-flow analysis, such as taint analysis, aims to discover the data-flow facts only at some program points. Maintaining a huge number of path edges (many visited only once) wastes memory resources and consequently reduces scalability and efficiency (due to frequent re-hashings of the path-edge data structure used).
This paper introduces a fine-grained garbage collection (GC) algorithm that enables (multi-threaded) IFDS to reduce its memory footprint by removing non-live path edges (i.e., ones that are no longer needed for establishing other path edges) from its path-edge data structure. The resulting IFDS algorithm, named FPC, retains the correctness, precision, and termination properties of IFDS while avoiding redundant re-processing of GC’ed path edges (in the presence of unknown recursive cycles that may be formed in future iterations of the analysis). Unlike CleanDroid, which augments IFDS with a coarse-grained GC algorithm to collect path edges at the method level, FPC is fine-grained, collecting path edges at the data-fact level. As a result, FPC can collect more path edges than CleanDroid and consequently causes fewer re-hashings of the path-edge data structure. In our evaluation, we apply an IFDS-based taint analysis to a set of 28 Android apps. FPC can scalably analyze three apps that CleanDroid fails to run to completion (under a 3-hour budget per app) due to out-of-memory (OoM) errors. For the remaining 25 apps, FPC reduces the number of path edges and the memory usage incurred under CleanDroid by 4.4× and 1.4× on average, respectively, and consequently outperforms CleanDroid by 1.7× on average (18.5× in the best case).

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Hybrid Inlining: A Framework for Compositional and Context-Sensitive Static Analysis
Jiangchao Liu, Jierui Liu, Peng Di, Diyu Wu, Hengjie Zheng, Alex X. Liu, and Jingling Xue
(Ant Group, China; ByteDance, China; UNSW, Australia)
Context-sensitivity is essential for achieving good precision in inter-procedural static analysis. To be context-sensitive, a top-down analysis needs to fully inline all the statements of a callee at all its call sites, leading to statement explosion. Compositional analysis, which inlines summaries of all the callees, scales up but often loses precision, as it is not strictly context-sensitive. We propose a compositional and strictly context-sensitive framework for static analysis. This framework is based on a key observation: a compositional analysis often loses precision only on some critical statements that need to be analyzed context-sensitively. Our approach hybridly inlines the critical statements and the summaries of the non-critical statements of each callee, thus avoiding re-analyzing the non-critical ones. In addition, our analysis summarizes critical statements lazily, stopping their propagation once the accumulated calling context is adequate. We have designed and implemented several analyses (including a pointer analysis) based on this framework. Our evaluation of the pointer analysis shows that it can analyze large Java programs from the DaCapo benchmark suite and from industry in minutes. Compared to context-insensitive analysis, Hybrid Inlining introduces only 65% and 1% additional time overhead on DaCapo and the industrial applications, respectively.

Publisher's Version
Green Fuzzing: A Saturation-Based Stopping Criterion using Vulnerability Prediction
Stephan Lipp, Daniel Elsner, Severin Kacianka, Alexander Pretschner, Marcel Böhme, and Sebastian Banescu
(TU Munich, Germany; MPI-SP, Germany; Monash University, Australia)
Fuzzing is a widely used automated testing technique that uses random inputs to provoke program crashes indicating security breaches. A difficult but important question is when to stop a fuzzing campaign. Usually, a campaign is terminated when the number of crashes and/or covered code elements has not increased over a certain period of time. To avoid premature termination when a ramp-up time is needed before vulnerabilities are reached, code coverage is often preferred over crash count for deciding when to terminate a campaign. However, a campaign might only increase coverage of non-security-critical code or repeatedly trigger the same crashes. For these reasons, both code coverage and crash count tend to overestimate fuzzing effectiveness, unnecessarily increasing the duration, and thus the cost, of the testing process.
The present paper explores the tradeoff between the amount of saved fuzzing time and the number of missed bugs when stopping campaigns based on the saturation of covered, potentially vulnerable functions rather than triggered crashes or regular function coverage. In a large-scale empirical evaluation of 30 open-source C programs with a total of 240 security bugs and 1,280 fuzzing campaigns, we first show that binary classification models trained on software with known vulnerabilities (CVEs), using lightweight machine learning features derived from the findings of static application security testing tools and proven software metrics, can reliably predict (potentially) vulnerable functions. Second, we show that our proposed stopping criterion terminates 24-hour fuzzing campaigns 6–12 hours earlier than the saturation of crashes and regular function coverage, while missing on average fewer than 0.5 of the 12.5 contained bugs.
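A minimal sketch of the proposed stopping criterion (the fuzzer driver hooks are assumptions, and vulnerable_funcs stands for the output of the vulnerability-prediction model): keep fuzzing only while coverage of predicted-vulnerable functions is still growing.

    import time

    def fuzz_until_saturation(fuzzer, vulnerable_funcs, patience=4 * 3600):
        # `fuzzer.run_for(seconds)` and its `covered_functions` field are
        # hypothetical hooks into the campaign driver.
        covered = set()
        last_progress = time.time()
        while time.time() - last_progress < patience:
            report = fuzzer.run_for(60)      # fuzz for one more minute
            new = (report.covered_functions & vulnerable_funcs) - covered
            if new:                          # still reaching new vulnerable code
                covered |= new
                last_progress = time.time()
        return covered                       # saturated: stop the campaign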

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Testing Graph Database Engines via Query Partitioning
Matteo Kamm, Manuel Rigger, Chengyu Zhang, and Zhendong Su
(ETH Zurich, Switzerland; National University of Singapore, Singapore)
Graph Database Management Systems (GDBMSs) store data as graphs and allow the efficient querying of nodes and their relationships. Logic bugs are bugs that cause a GDBMS to return an incorrect result for a given query (e.g., by returning incorrect nodes or relationships). The impact of such bugs can be severe, as they often go unnoticed. The core insight of this paper is that Query Partitioning, a test oracle that had been proposed for testing relational database systems, is applicable to testing GDBMSs as well. The core idea of Query Partitioning is that, given a query, multiple queries are derived whose results can be combined to reconstruct the given query’s result; any discrepancy indicates a logic bug. We have implemented this approach in a practical tool named GDBMeter, evaluated it on three popular GDBMSs, and found a total of 40 unique, previously unknown bugs. We consider 14 of them to be logic bugs, the others being error or crash bugs. Overall, 35 of the bugs have been confirmed and 27 fixed. We compared our approach to the state-of-the-art approach for testing GDBMSs, which relies on differential testing; we found that differential testing results in a high number of false alarms, while Query Partitioning reported actual logic bugs without any false alarms. Furthermore, despite previous efforts in testing Neo4j and JanusGraph, we found 18 additional bugs. The developers appreciate our work and plan to integrate GDBMeter into their testing processes. We expect that this simple yet effective approach and the practical tool will be used to test other GDBMSs.
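A minimal sketch of Query Partitioning for Cypher (an illustration in the spirit of the paper; the predicate, schema, and connection details are assumptions): under Cypher's ternary logic, every node makes a predicate true, false, or null, so three partitioned queries must reconstruct the unpartitioned result.

    from neo4j import GraphDatabase

    base      = "MATCH (n:Person) RETURN count(n)"
    predicate = "n.age > 30"   # illustrative predicate
    partitions = [
        f"MATCH (n:Person) WHERE {predicate} RETURN count(n)",
        f"MATCH (n:Person) WHERE NOT ({predicate}) RETURN count(n)",
        f"MATCH (n:Person) WHERE ({predicate}) IS NULL RETURN count(n)",
    ]

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    def run(query):
        with driver.session() as session:
            return session.run(query).single()[0]

    # Any discrepancy between the base result and the sum of the three
    # partitions signals a logic bug in the engine under test.
    if run(base) != sum(run(q) for q in partitions):
        print("logic bug: partitions do not add up to the original result")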

Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional
Semantic-Based Neural Network Repair
Richard Schumi and Jun Sun
(Singapore Management University, Singapore)
Recently, neural networks have spread into numerous fields, including many safety-critical systems. Neural networks are built (and trained) by programming in frameworks such as TensorFlow and PyTorch. Developers apply a rich set of pre-defined layers to manually program neural networks or to generate them automatically (e.g., through AutoML). Composing neural networks from different layers is error-prone due to the non-trivial constraints that must be satisfied in order to use those layers. In this work, we propose an approach to automatically repair erroneous neural networks. The challenge lies in identifying a minimal modification to the network so that it becomes valid. Modifying a layer might have cascading effects on subsequent layers, so our approach must search recursively to identify a “globally” minimal modification. Our approach is based on an executable semantics of deep learning layers and focuses on four kinds of errors that are common in practice. We evaluate our approach in two usage scenarios, i.e., repairing automatically generated neural networks and manually written ones suffering from common model bugs. The results show that we are able to repair 100% of a set of randomly generated neural networks (produced with an existing AI framework testing approach) effectively and efficiently (with an average repair time of 21.08 s) and 93.75% of a collection of real neural network bugs (with an average time of 3 min 40 s).

Publisher's Version
GDsmith: Detecting Bugs in Cypher Graph Database Engines
Ziyue Hua, Wei Lin, Luyao Ren, Zongyang Li, Lu Zhang, Wenpin Jiao, and Tao Xie
(Peking University, China)
Graph database engines stand out in the era of big data for their efficiency in modeling and processing linked data. To assure the high quality of graph database engines, it is critical to conduct automated test generation, of which random test generation is the most commonly adopted approach in practice. However, random test generation faces the challenge of generating complex inputs (i.e., property graphs and queries) that produce non-empty query results; generating such inputs is especially important for detecting wrong-result bugs. To address this challenge, in this paper we propose GDsmith, the first approach for testing Cypher graph database engines. GDsmith ensures that each randomly generated query satisfies the semantic requirements. To increase the probability of producing complex queries that return non-empty results, GDsmith includes two new techniques: graph-guided generation of complex pattern combinations and data-guided generation of complex conditions. Our evaluation results demonstrate that GDsmith is effective and efficient at producing complex queries that return non-empty results for bug detection, and that it substantially outperforms the baselines. GDsmith successfully detects 28 bugs in released versions of three highly popular open-source graph database engines and has received positive feedback from their developers.

Publisher's Version
Loop Invariant Inference through SMT Solving Enhanced Reinforcement Learning
Shiwen Yu, Ting Wang, and Ji Wang
(National University of Defense Technology, China)
Inferring loop invariants is one of the most challenging problems in program verification, and it is highly desirable to incorporate machine learning into the inference process. This paper presents a Reinforcement Learning (RL) pruning framework to infer loop invariants over a general nonlinear hypothesis space. The key idea is to synergize RL-based pruning and SMT solving to generate candidate invariants efficiently. To address the sparse-reward problem in learning, we design a novel two-dimensional reward mechanism that enables the RL pruner to recognize the capability boundary of SMT solvers and learn the pruning heuristics within a few rounds. We have implemented our approach with the Z3 SMT solver in a tool called LIPuS and conducted extensive experiments on linear and nonlinear benchmarks. The results show that LIPuS solves the most cases compared to state-of-the-art loop invariant inference tools such as Code2Inv, ICE-DT, GSpacer, SymInfer, ImplCheck, and Eldarica. In particular, LIPuS outperforms them significantly on the nonlinear benchmarks.
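To illustrate the SMT-solving side of this pipeline, here is a minimal sketch (a generic inductiveness check with Z3's Python API, not LIPuS's actual reward machinery): a candidate invariant for x = 0; while (x < n) x = x + 1 is checked for initiation, consecution, and the post-condition.

    from z3 import Int, Implies, And, Not, Solver, unsat

    x, n, x1 = Int("x"), Int("n"), Int("x1")
    inv = lambda v: And(v >= 0, v <= n)    # candidate invariant: 0 <= x <= n

    def valid(formula):
        # A formula is valid iff its negation is unsatisfiable.
        s = Solver()
        s.add(Not(formula))
        return s.check() == unsat

    init   = Implies(And(x == 0, n >= 0), inv(x))                # initiation
    induct = Implies(And(inv(x), x < n, x1 == x + 1), inv(x1))   # consecution
    post   = Implies(And(inv(x), Not(x < n)), x == n)            # property
    print(all(valid(f) for f in (init, induct, post)))  # True: invariant holds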

Publisher's Version Published Artifact Info Artifacts Available
CODEP: Grammatical Seq2Seq Model for General-Purpose Code Generation
Yihong Dong, Ge Li, and Zhi Jin
(Peking University, China)
General-purpose code generation aims to automatically convert natural language descriptions into code snippets in a general-purpose programming language (GPL) such as Python. In the process of code generation, it is essential to guarantee that the generated code satisfies the grammar constraints of the GPL. However, existing sequence-to-sequence (Seq2Seq) approaches neglect grammar rules when generating GPL code. In this paper, we devise a pushdown automaton (PDA)-based methodology that makes the first attempt at grammatical Seq2Seq models for general-purpose code generation, exploiting the principle that a GPL is a subset of the PDA-recognizable languages and code accepted by the PDA is grammatical. Specifically, we construct a PDA module and design an algorithm to constrain the generation of Seq2Seq models so as to ensure grammatical correctness. Guided by this methodology, we further propose CODEP, a code generation framework equipped with a PDA module, to integrate the deduction of the PDA into deep learning. The framework leverages the state of the PDA deduction (including state representation, a state prediction task, and joint prediction with the state) to assist models in learning the deduction. To comprehensively evaluate CODEP, we construct a PDA for Python and conduct extensive experiments on four public benchmark datasets. CODEP can employ existing sequence-based models as base models, and it achieves a 100% grammatical correctness rate on these benchmarks. Consequently, CODEP achieves relative improvements of 17% CodeBLEU on CoNaLa, 8% EM on Django, and 15% CodeBLEU on JuICe-10K over the base models. Moreover, the PDA module also achieves significant improvements on pre-trained models.
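A toy sketch of PDA-constrained decoding (illustrative; the real PDA covers Python's full grammar, while this one only tracks bracket nesting): at each step, tokens the PDA would reject are masked out, so the finished output is always well-formed.

    PAIRS = {")": "(", "]": "[", "}": "{"}

    def allowed(stack, vocab):
        # Return the subset of the vocabulary the PDA accepts next; a
        # decoder would mask the model's logits down to this set.
        ok = []
        for tok in vocab:
            if tok in "([{":
                ok.append(tok)                     # push moves are always legal
            elif tok in PAIRS:
                if stack and stack[-1] == PAIRS[tok]:
                    ok.append(tok)                 # pop only on a matching open
            elif tok == "<eos>":
                if not stack:
                    ok.append(tok)                 # stop only when balanced
            else:
                ok.append(tok)                     # plain tokens ignore the stack
        return ok

    def step(stack, tok):
        # Update the PDA stack after a token is emitted.
        if tok in "([{":
            stack.append(tok)
        elif tok in PAIRS:
            stack.pop()
        return stack

    vocab = ["(", ")", "x", "<eos>"]
    print(allowed([], vocab))     # ['(', 'x', '<eos>']: ')' is masked out
    print(allowed(["("], vocab))  # ['(', ')', 'x']: '<eos>' masked until balanced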

Publisher's Version
Concept-Based Automated Grading of CS-1 Programming Assignments
Zhiyu Fan, Shin Hwei Tan, and Abhik Roychoudhury
(National University of Singapore, Singapore; Concordia University, Canada)
Due to the increasing enrolments in Computer Science programs, the teaching of introductory programming needs to be scaled up. This places significant strain on teaching resources for programming courses, for tasks such as grading submitted programming assignments. Conventional attempts at automated grading of programming assignments rely on test-based grading, which assigns scores based on the number of passing tests in a given test suite. Since test-based grading may not adequately capture a student's understanding of the programming concepts needed to solve a programming task, we propose the notion of a concept graph, which is essentially an abstracted control flow graph. Given the concept graphs extracted from a student's solution and a reference solution, we define concept graph matching and the comparison of differing concepts. Our experiments on 1,540 student submissions from a publicly available dataset show the efficacy of concept-based grading vis-à-vis test-based grading. Specifically, concept-based grading is (experimentally) shown to be closer to the grades manually assigned by the tutor. Apart from grading, the concept graph used by our approach is also useful for providing feedback to struggling students, as confirmed by our user study among tutors.
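A crude sketch of the abstraction step (a multiset of concepts rather than a full concept graph, using Python's ast module; the file names are hypothetical): extract the programming concepts a solution uses and diff them against the reference to drive partial credit and feedback.

    import ast
    from collections import Counter

    CONCEPTS = {ast.For: "loop", ast.While: "loop", ast.If: "branch",
                ast.Return: "return", ast.FunctionDef: "function",
                ast.Try: "exception"}

    def concepts(source):
        # Abstract a solution into the control-flow concepts it uses.
        return Counter(CONCEPTS[type(n)]
                       for n in ast.walk(ast.parse(source))
                       if type(n) in CONCEPTS)

    student   = concepts(open("student.py").read())
    reference = concepts(open("reference.py").read())
    missing = reference - student        # concepts the student never used
    print("missing concepts:", dict(missing))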

Publisher's Version
Beware of the Unexpected: Bimodal Taint Analysis
Yiu Wai Chow, Max Schäfer, and Michael Pradel
(University of Stuttgart, Germany; GitHub, UK)
Static analysis is a powerful tool for detecting security vulnerabilities and other programming problems. Global taint tracking, in particular, can spot vulnerabilities arising from complicated data flow across multiple functions. However, precisely identifying which flows are problematic is challenging, and sometimes depends on factors beyond the reach of pure program analysis, such as conventions and informal knowledge. For example, learning that a parameter named locale of an API function ends up in a file path is surprising and potentially problematic. In contrast, it would be completely unsurprising to find that a parameter named command passed to an API function execaCommand is eventually interpreted as part of an operating-system command. This paper presents Fluffy, a bimodal taint analysis that combines static analysis, which reasons about data flow, with machine learning, which probabilistically determines which flows are potentially problematic. The key idea is to let machine learning models predict, from the natural language information involved in a taint flow, such as API names, whether the flow is expected or unexpected, and to inform developers only about the latter. We present a general framework and instantiate it with four learned models, which offer different trade-offs between the need to annotate training data and the accuracy of predictions. We implement Fluffy on top of the CodeQL analysis framework and apply it to 250K JavaScript projects. Evaluating on five common vulnerability types, we find that Fluffy achieves an F1 score of 0.85 or more on four of them across a variety of datasets.
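A minimal sketch of the bimodal idea (illustrative; Fluffy's actual models and features are its own, and the labeled flows below are fabricated toy examples): a classifier over the natural-language side of a flow (parameter name plus sink kind) predicts whether the flow is expected, and only unexpected flows are reported.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    flows = [   # (parameter name + sink kind, 1 = expected, 0 = unexpected)
        ("command execaCommand command-injection-sink", 1),
        ("locale readFile path-traversal-sink",         0),
        ("path readFile path-traversal-sink",           1),
        ("name exec command-injection-sink",            0),
    ]
    texts, labels = zip(*flows)

    model = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(3, 5)),  # subword features
        LogisticRegression(),
    )
    model.fit(texts, labels)

    p_expected = model.predict_proba(["userInput writeFile path-traversal-sink"])[0][1]
    print("report to developer" if p_expected < 0.5 else "suppress")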

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
DeUEDroid: Detecting Underground Economy Apps Based on UTG Similarity
Zhuo Chen, Jie Liu, Yubo Hu, Lei Wu, Yajin Zhou, Yiling He, Xianhao Liao, Ke Wang, Jinku Li, and Zhan Qin
(Zhejiang University, China; Ant Group, China; Xidian University, China)
In recent years, the underground economy has been proliferating on mobile platforms. These underground economy apps (UEware for short) make profits by providing non-compliant services, especially in sensitive areas (e.g., gambling, porn, loans). Unlike traditional malware, most of them (over 80%) do not have malicious payloads. Due to their unique characteristics, existing detection approaches cannot effectively and efficiently mitigate this emerging threat. To address this problem, we propose a novel approach to effectively and efficiently detect UEware by considering their UI transition graphs (UTGs). Based on the proposed approach, we design and implement a system, named DeUEDroid, to perform the detection. To evaluate DeUEDroid, we collect 25,717 apps and build the first large-scale ground-truth dataset (1,700 apps) of UEware. The evaluation on the ground-truth dataset shows that DeUEDroid can cover new UI features and statically construct precise UTGs. It achieves a 98.22% detection F1-score and 98.97% classification accuracy, significantly better performance than the traditional approaches. The evaluation involving 24,017 apps demonstrates the effectiveness and efficiency of UEware detection in real-world scenarios. It also reveals that UEware are prevalent: 54% of the apps in the wild and 11% of the apps in the app stores are UEware. Our work sheds light on future work in analyzing and detecting UEware. To engage the community, we have made our prototype system and the dataset available online.

Publisher's Version Published Artifact Archive submitted (850 kB) Artifacts Available
Dependency-Aware Metamorphic Testing of Datalog Engines
Muhammad Numair Mansur, Valentin Wüstholz, and Maria Christakis
(MPI-SWS, Germany; ConsenSys, Austria; TU Wien, Austria)
Datalog is a declarative query language with wide applicability, especially in program analysis. Queries are evaluated by Datalog engines, which are complex and thus prone to returning incorrect results. Such bugs, called query bugs, may compromise the soundness of upstream program analyzers, having potentially detrimental consequences in safety-critical settings.
To address this issue, we develop a metamorphic testing approach for detecting query bugs in Datalog engines. In comparison to existing work, our approach is based on rich precedence information capturing dependencies among relations in the program. This enables much more general and effective metamorphic transformations. We implement our approach in DLSmith, which detected 16 previously unknown query bugs in four Datalog engines.

Publisher's Version Info
Fuzzing Deep Learning Compilers with HirGen
Haoyang Ma, Qingchao Shen, Yongqiang Tian, Junjie Chen, and Shing-Chi Cheung
(Hong Kong University of Science and Technology, China; Tianjin University, China; University of Waterloo, Canada)
Deep Learning (DL) compilers are widely adopted to optimize advanced DL models for efficient deployment on diverse hardware. Their quality has a profound effect on the quality of the compiled DL models. A recent bug study shows that the optimization of high-level intermediate representations (IRs) is the most error-prone compilation stage, with bugs in this stage accounting for 44.92% of all collected bugs. However, existing testing techniques do not consider features related to high-level optimization (e.g., the high-level IR) and are therefore weak in exposing bugs at this stage. To bridge this gap, we propose HirGen, an automated testing technique that effectively exposes coding mistakes in the optimization of high-level IRs. The design of HirGen includes (1) three coverage criteria to generate diverse and valid computational graphs; (2) the use of the high-level IR's language features to generate diverse IRs; and (3) three test oracles, two of which are inspired by metamorphic testing and differential testing. HirGen has successfully detected 21 bugs in TVM, with 17 bugs confirmed and 12 fixed. Further, we construct four baselines using state-of-the-art DL compiler fuzzers that can cover the high-level optimization stage. Our experimental results show that HirGen can detect 10 crashes and inconsistencies within 48 hours that the baselines cannot. We also evaluate the usefulness of our proposed coverage criteria and test oracles.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
API2Vec: Learning Representations of API Sequences for Malware Detection
Lei Cui, Jiancong Cui, Yuede Ji, Zhiyu Hao, Lun Li, and Zhenquan Ding
(Zhongguancun Laboratory, China; University of Chinese Academy of Sciences, China; Institute of Information Engineering at Chinese Academy of Sciences, China; University of North Texas, USA)
Analyzing malware based on its API call sequence is an effective approach, as the sequence reflects the dynamic execution behavior of the malware. Recent advancements in deep learning have led to the application of these techniques for mining useful information from API call sequences. However, these methods mainly operate on raw sequences and may not effectively capture important information, especially for multi-process malware, mainly due to the API call interleaving problem.
Motivated by this, this paper presents API2Vec, a graph-based API embedding method for malware detection. First, we build a graph model to represent the raw sequence. In particular, we design the temporal process graph (TPG) to model inter-process behavior and the temporal API graph (TAG) to model intra-process behavior. With these graphs, we design a heuristic random walk algorithm to generate a number of paths that can capture the fine-grained malware behavior. By pre-training the paths using the Doc2Vec model, we can generate the embeddings of paths and APIs, which can further be used for malware detection. Experiments on a real malware dataset demonstrate that API2Vec outperforms state-of-the-art embedding and detection methods in both accuracy and robustness, especially for multi-process malware.

Publisher's Version
June: A Type Testability Transformation for Improved ATG Performance
Dan Bruce, David Kelly, Hector Menendez, Earl T. Barr, and David Clark
(Microsoft, UK; University College London, UK; King’s College London, UK)
Strings are universal containers: they are flexible to use, abundant in code, and difficult to test. String-controlled programs are programs that make branching decisions based on string input. Automatically generating valid test inputs for these programs, considering only character sequences rather than any underlying string-encoded structure, can be prohibitively expensive. We present June, a tool that enables Java developers to expose any latent string structure to test generation tools. June is an annotation-driven testability transformation together with an extensible library, JuneLib, of structured string definitions. The core JuneLib definitions are empirically derived and provide templates for all structured strings in our test set. June takes lightly annotated source code and injects code that lets an automated test generator (ATG) focus on creating the mutable substrings inside a structured string. Using June costs the developer little, with an average of 2.1 annotations per string-controlled class. June uses standard Java build tools and therefore deploys seamlessly within a Java project. By feeding string structure information to an ATG tool, June dramatically reduces wasted effort; branches that would otherwise be extremely difficult, or impossible, to cover are covered effortlessly. This waste reduction both increases and speeds up coverage. EvoSuite, for example, achieves the same coverage on June-ed classes in 1 minute, on average, as it does in 9 minutes on the un-June-ed classes. These gains increase over time: on our corpus, June-ing a program compresses 24 hours of execution time into ca. 2 hours. We show that many ATG tools can reuse the same June-ed code: a few June annotations, a one-off cost, benefit many different testing regimes.

Publisher's Version
A Comprehensive Study on Quality Assurance Tools for Java
Han Liu, Sen Chen, Ruitao Feng, Chengwei Liu, Kaixuan Li, Zhengzi Xu, Liming Nie, Yang Liu, and Yixiang Chen
(East China Normal University, China; Tianjin University, China; UNSW, Australia; Nanyang Technological University, Singapore)
Quality assurance (QA) tools are receiving more and more attention and are widely used by developers. Given the wide range of QA solutions, how to evaluate QA tools remains an open question. Most existing research is limited in the following ways: (i) it compares tools without analyzing their scanning rules; (ii) it disagrees on the effectiveness of tools due to differences in study methodology and benchmark datasets; (iii) it does not separately analyze the role of the reported warnings; and (iv) there is no large-scale study of time performance. To address these problems, in this paper we systematically select 6 free or open-source tools from a list of 148 existing Java QA tools for a comprehensive study. To evaluate the tools along multiple dimensions, we first map their scanning rules to CWE and analyze the coverage and granularity of the rules. We then conduct an experiment on 5 benchmarks, comprising 1,425 bugs, to investigate the effectiveness of these tools. Furthermore, we investigate the effectiveness of the warnings by comparing them with the real labeled bugs and examining their role in bug detection. Finally, we assess the tools' time performance on 1,049 projects. The findings of our comprehensive study can help developers improve their tools and provide users with suggestions for selecting QA tools.

Publisher's Version
Detecting State Inconsistency Bugs in DApps via On-Chain Transaction Replay and Fuzzing
Mingxi Ye, Yuhong Nan, Zibin Zheng, Dongpeng Wu, and Huizhong Li
(Sun Yat-sen University, China; WeBank, China)
Decentralized applications (DApps) consist of multiple smart contracts running on a blockchain. With the increasing popularity of the DApp ecosystem, vulnerabilities in DApps can cause significant impacts, such as financial losses. Identifying vulnerabilities in DApps is by no means trivial, as modern DApps consist of complex interactions across multiple contracts. Previous research suffers from either high false positives or high false negatives, due to the lack of the precise contextual information that is mandatory for confirming smart contract vulnerabilities.
In this paper, we present IcyChecker, a new fuzzing-based framework to effectively identify state-inconsistency (SI) bugs, a specific type of bug that can cause vulnerabilities such as re-entrancy and front-running with complex patterns. Different from prior work, IcyChecker utilizes a set of accurate contextual information for contract fuzzing by replaying the on-chain historical transactions. Besides, instead of requiring specific testing oracles as other fuzzing approaches do, IcyChecker implements novel mechanisms to mutate sets of fuzzing transaction sequences and identifies SI bugs by observing their state differences. An evaluation of IcyChecker on the top 100 popular DApps shows that it identifies a total of 277 SI bugs with a precision of 87%. By comparing IcyChecker with other state-of-the-art tools (i.e., Smartian, Confuzzius, and Sailfish), we show that IcyChecker not only identifies more SI bugs but also has much lower false positives, thanks to its integration of accurate on-chain data and its unique fuzzing strategies. Our research sheds light on new ways of detecting smart contract vulnerabilities in DApps.

Publisher's Version Published Artifact Artifacts Available
FairRec: Fairness Testing for Deep Recommender Systems
Huizhong Guo, Jinfeng Li, Jingyi Wang, Xiangyu Liu, Dongxia Wang, Zehong Hu, Rong Zhang, and Hui Xue
(Zhejiang University, China; Alibaba Group, China)
Deep learning-based recommender systems (DRSs) are increasingly and widely deployed in industry, bringing significant convenience to people’s daily life in different ways. However, recommender systems have also been shown to suffer from multiple issues, e.g., the echo chamber and the Matthew effect, in which the notion of “fairness” plays a core role. For instance, the system may be regarded as unfair to (1) a specific user, if the user gets worse recommendations than other users, or (2) an item (to recommend), if the item is much less likely to be exposed to users than other items. While many fairness notions and corresponding fairness testing approaches have been developed for traditional deep classification models, they are essentially hardly applicable to DRSs. One major challenge is that there is still no systematic understanding of, and mapping between, the existing fairness notions and the diverse testing requirements of deep recommender systems, not to mention further testing or debugging activities. To address this gap, we propose FairRec, a unified framework that supports fairness testing of DRSs from multiple customized perspectives, e.g., model utility, item diversity, and item popularity. We also propose a novel, efficient search-based testing approach, a double-ended discrete particle swarm optimization (DPSO) algorithm, to effectively search for hidden fairness issues in the form of certain disadvantaged groups among a vast number of candidate groups. Given the testing report, adopting a simple re-ranking mitigation strategy on the identified disadvantaged groups significantly improves the fairness of the DRSs. We conducted extensive experiments on multiple industry-level DRSs adopted by leading companies. The results confirm that FairRec is effective and efficient at identifying deeply hidden fairness issues, e.g., achieving ∼95% testing accuracy in roughly one half to one eighth of the time.

Publisher's Version Archive submitted (71 kB)
ItyFuzz: Snapshot-Based Fuzzer for Smart Contract
Chaofan Shou, Shangyin Tan, and Koushik Sen
(University of California at Berkeley, USA)
Smart contracts are critical financial instruments, and their security is of utmost importance. However, smart contract programs are difficult to fuzz due to the persistent blockchain state behind all transactions. Mutating sequences of transactions is complex and often leads to suboptimal exploration of both the input and program spaces. In this paper, we introduce ItyFuzz, a novel snapshot-based fuzzer for testing smart contracts. Instead of storing sequences of transactions and mutating them, ItyFuzz snapshots states and singleton transactions. To explore interesting states, ItyFuzz introduces a dataflow waypoint mechanism to identify states with more potential momentum. ItyFuzz also incorporates comparison waypoints to prune the space of states. By maintaining snapshots of states, ItyFuzz can quickly synthesize concrete exploits such as reentrancy attacks. Because ItyFuzz tests a smart contract with second-level response time, it can be used for on-chain testing, which has many benefits compared to local development testing. Finally, we evaluate ItyFuzz on real-world smart contracts and several hacked on-chain DeFi projects. ItyFuzz outperforms existing fuzzers in terms of instruction coverage and can quickly find and generate realistic exploits for on-chain projects.
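A self-contained toy of the snapshot-based loop (illustrative only; real ItyFuzz executes EVM states, whereas the "contract" here is an integer counter with a deep magic value): the corpus stores states rather than transaction sequences, so a deep state is re-entered in one step and extended with a single mutated transaction.

    import random

    def execute(state, tx):
        # Toy "contract": the state is an integer, a transaction adds its
        # argument, and the bug hides in a deep state.
        new_state = state + tx
        bug = (new_state == 1337)
        coverage = {new_state // 100}     # coarse stand-in for code coverage
        return new_state, coverage, bug

    corpus = [0]                          # snapshots of states, not sequences
    seen = set()
    for _ in range(10_000):
        state = random.choice(corpus)     # resume from any snapshot in O(1)
        tx = random.randint(1, 100)       # mutate a *singleton* transaction
        new_state, cov, bug = execute(state, tx)
        if bug:
            print("exploit = snapshot", state, "+ tx", tx)
            break
        if cov - seen:                    # waypoint: new behavior observed
            seen |= cov
            corpus.append(new_state)      # snapshot the interesting state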

Publisher's Version
Who Judges the Judge: An Empirical Study on Online Judge Tests
Kaibo Liu, Yudong Han, Jie M. Zhang, Zhenpeng Chen, Federica Sarro, Mark Harman, Gang Huang, and Yun Ma
(Peking University, China; King’s College London, UK; University College London, UK; National Key Laboratory of Data Space Technology and System, China)
Online Judge platforms play a pivotal role in education, competitive programming, recruitment, career training, and large language model training. They rely on predefined test suites to judge the correctness of submitted solutions. It is therefore important that the solution judgement is reliable and free from potentially misleading false positives (i.e., incorrect solutions that are judged as correct). In this paper, we conduct an empirical study of 939 coding problems with 541,552 solutions, all of which are judged to be correct according to the test suites used by the platform, and find that 43.4% of the problems include false positive solutions (3,440 bugs revealed in total). We also find that the test suites are, nevertheless, of high quality according to widely studied test effectiveness measurements: 88.2% of the false positives have perfect (100%) line coverage, 78.9% have perfect branch coverage, and 32.5% have a perfect mutation score. Our findings indicate that more work is required to weed out false positive solutions and to further improve test suite effectiveness. We have released the detected false positive solutions and the generated test inputs to facilitate future research.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Precise and Efficient Patch Presence Test for Android Applications against Code Obfuscation
Zifan Xie, Ming Wen, Haoxiang Jia, Xiaochen Guo, Xiaotong Huang, Deqing Zou, and Hai Jin
(Huazhong University of Science and Technology, China)
Third-party libraries (TPLs) are widely used by Android developers to implement new apps. Unfortunately, TPLs often suffer from various vulnerabilities, which attackers could exploit to cause catastrophic consequences for app users. Therefore, testing whether a vulnerability has been patched in target apps is crucial. However, existing techniques are unable to effectively test patch presence in obfuscated apps, even though obfuscation is pervasive in practice. To address the challenges introduced by code obfuscation, this study presents PHunter, a system that captures obfuscation-resilient semantic features of patch-related methods to identify the presence of a patch in target apps. Specifically, PHunter utilizes coarse-grained features to locate patch-related methods and compares fine-grained semantic similarity to determine whether the code has been patched. Extensive evaluations on 94 CVEs and 200 apps show that PHunter outperforms state-of-the-art tools, achieving an average accuracy of 97.1% with high efficiency and low false positive rates. Besides, PHunter is resilient to different obfuscation strategies. More importantly, PHunter is useful for eliminating the false alarms generated by existing TPL detection tools, helping reduce up to 25.2% of the false alarms with an accuracy of 95.3%.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Detecting Vulnerabilities in Linux-Based Embedded Firmware with SSE-Based On-Demand Alias Analysis
Kai Cheng, Yaowen Zheng, Tao Liu, Le Guan, Peng Liu, Hong Li, Hongsong Zhu, Kejiang Ye, and Limin Sun
(Shenzhen Institute of Advanced Technology at Chinese Academy of Sciences, China; Sangfor Technologies, China; Nanyang Technological University, Singapore; Pennsylvania State University, USA; University of Georgia, USA; Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Although the importance of using static taint analysis to detect taint-style vulnerabilities in Linux-based embedded firmware is widely recognized, existing approaches are plagued by the following major limitations: (a) they cannot properly handle indirect calls on the path from attacker-controlled sources to security-sensitive sinks, resulting in many false negatives; and (b) they employ heuristics to identify intermediate taint sources, which are not accurate enough and lead to high false positives.
To address these issues, we propose EmTaint, a novel static approach for the accurate and fast detection of taint-style vulnerabilities in Linux-based embedded firmware. In EmTaint, we first design a structured symbolic expression-based (SSE-based) on-demand alias analysis technique. Based on it, we devise an indirect-call resolution and accurate taint analysis scheme. Combined with sanitization rule checking, EmTaint can discover a large number of taint-style vulnerabilities accurately within a limited time. We evaluated EmTaint on 35 real-world embedded firmware samples from six popular vendors. The results show that EmTaint discovered at least 192 vulnerabilities, including 41 n-day vulnerabilities and 151 0-day vulnerabilities. At least 115 CVE/PSV numbers had been allocated for a subset of the reported vulnerabilities at the time of writing. Compared with state-of-the-art tools such as KARONTE and SaTC, EmTaint found significantly more vulnerabilities on the same dataset in less time.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable
Definition and Detection of Defects in NFT Smart Contracts
Shuo Yang, Jiachi Chen, and Zibin Zheng
(Sun Yat-sen University, China)
Recently, the birth of non-fungible tokens (NFTs) has attracted great attention. NFTs can represent users’ ownership on the blockchain and have experienced tremendous market sales due to their popularity. Unfortunately, the high value of NFTs also makes them a target for attackers. The defects in NFT smart contracts could be exploited by attackers to harm the security and reliability of the NFT ecosystem. Despite the significance of this issue, there is a lack of systematic work focusing on the analysis of NFT smart contracts, which may raise worries about the security of users’ NFTs. To address this gap, in this paper we introduce 5 defects in NFT smart contracts. Each defect is defined and illustrated with a code example highlighting its features and consequences, paired with possible solutions to fix it. Furthermore, we propose a tool named NFTGuard to detect our defined defects based on a symbolic execution framework. Specifically, NFTGuard extracts the information of the state variables from the contract abstract syntax tree (AST), which is critical for identifying variable-loading and storing operations during symbolic execution. Furthermore, NFTGuard recovers source-code-level features from the bytecode to effectively locate defects and reports them based on predefined detection patterns. We ran NFTGuard on 16,527 real-world smart contracts and performed an evaluation based on manually labeled results. We find that 1,331 contracts contain at least one of the 5 defects, and the overall precision achieved by our tool is 92.6%.

Publisher's Version
Eunomia: Enabling User-Specified Fine-Grained Search in Symbolically Executing WebAssembly Binaries
Ningyu He, Zhehao Zhao, Jikai Wang, Yubin Hu, Shengjian Guo, Haoyu Wang, Guangtai Liang, Ding Li, Xiangqun Chen, and Yao Guo
(Peking University, China; Huazhong University of Science and Technology, China; Beijing University of Posts and Telecommunications, China; Baidu Security, USA; Huawei Cloud Computing Technologies, China)
Although existing techniques have proposed automated approaches to alleviate the path explosion problem of symbolic execution, users still need to carefully optimize symbolic execution by applying various search strategies. As existing approaches mainly support only coarse-grained global search strategies, they cannot efficiently traverse complex code structures. In this paper, we propose Eunomia, a symbolic execution technique that supports fine-grained search with local domain knowledge. Eunomia uses Aes, a DSL that lets users specify local search strategies for different parts of the program. Eunomia also isolates the contexts of variables across different local search strategies, avoiding conflicts. We implement Eunomia for WebAssembly, which enables the analysis of applications written in various languages; Eunomia is the first symbolic execution engine to support the full feature set of WebAssembly. We evaluate Eunomia with a microbenchmark suite and six real-world applications. Our evaluation shows that Eunomia improves bug detection by up to three orders of magnitude. We also conduct a user study that shows the benefits of using Aes. Moreover, Eunomia verifies six known bugs and detects two new zero-day bugs in Collections-C.

Publisher's Version
Type Batched Program Reduction
Golnaz Gharachorlu and Nick Sumner
(Simon Fraser University, Canada)
Given a program with a property of interest, program reduction searches for a smaller program that preserves the property and is easier to understand. Domain-agnostic program reducers can reduce programs of multiple languages without extra domain knowledge. Despite their reusability, they may still take hours to run, hindering productivity and scalability. This paper proposes type-batched program reduction, which uses machine learning to suggest portions of a program, or batches, that are most likely to be advantageous to reduce at a particular point in the reduction. We also extend this to jointly reduce multiple portions of a program at once, improving performance further. Suggesting an appropriate order for removing batches from a program, along with their potential simultaneous removal, enables our reducer to outperform state-of-the-art reducers in reduction time over a set of large programs from multiple programming languages. This work lays foundations for further improvements in ML-guided program reduction.
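
The batching idea can be sketched in a few lines of Python; has_property (the oracle that re-runs the triggering tool) and the ML-supplied type ranking are hypothetical stand-ins here.

    def reduce_program(nodes, has_property, rank_types):
        """nodes: (node_id, syntactic_type) pairs; remove whole type batches."""
        current = list(nodes)
        for t in rank_types(current):                  # ML-suggested type order
            candidate = [n for n in current if n[1] != t]
            if len(candidate) < len(current) and has_property(candidate):
                current = candidate                    # batch removal succeeded
        return current

    nodes = [(1, "decl"), (2, "stmt"), (3, "stmt"), (4, "expr")]
    kept = reduce_program(nodes,
                          has_property=lambda ns: any(t == "expr" for _, t in ns),
                          rank_types=lambda ns: ["stmt", "decl"])
    print(kept)   # [(4, 'expr')] -- only the property-relevant node survives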

Publisher's Version Published Artifact Artifacts Available
Automatically Reproducing Android Bug Reports using Natural Language Processing and Reinforcement Learning
Zhaoxu Zhang, Robert Winn, Yu Zhao, Tingting Yu, and William G.J. Halfond
(University of Southern California, USA; University of Central Missouri, USA; University of Cincinnati, USA)
As part of the process of resolving issues submitted by users via bug reports, Android developers attempt to reproduce and observe the crashes described by the bug reports. Due to the low quality of bug reports and the complexity of modern apps, the reproduction process is non-trivial and time-consuming. Therefore, automatic approaches that can help reproduce Android bug reports are in great demand. However, current approaches to help developers automatically reproduce bug reports are only able to handle limited forms of natural language text and struggle to successfully reproduce crashes for which the initial bug report had missing or imprecise steps. In this paper, we introduce a new fully automated approach to reproduce crashes from Android bug reports that addresses these limitations. Our approach accomplishes this by leveraging natural language processing techniques to more holistically and accurately analyze the natural language in Android bug reports and by designing new techniques, based on reinforcement learning, to guide the search for successful reproduction steps. We conducted an empirical evaluation of our approach on 77 real-world bug reports. Our approach achieved 67% precision and 77% recall in accurately extracting reproduction steps from bug reports, reproduced 74% of the total bug reports, and reproduced 64% of the bug reports that contained missing steps, significantly outperforming state-of-the-art techniques.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Large Language Models Are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models
Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang
(University of Illinois at Urbana-Champaign, USA; University of Science and Technology of China, China)
Deep Learning (DL) systems have grown exponentially in popularity and have become ubiquitous in our everyday life. Such systems are built on top of popular DL libraries, e.g., TensorFlow and PyTorch, which provide APIs as building blocks for DL systems. Detecting bugs in these DL libraries is critical for almost all downstream DL systems in ensuring effectiveness/safety for end users. Meanwhile, traditional fuzzing techniques are hardly effective in such a challenging domain, since the input DL programs need to satisfy both the input language (e.g., Python) syntax/semantics and the DL API input/shape constraints for tensor computations. To address these limitations, we propose TitanFuzz – the first approach to directly leveraging Large Language Models (LLMs) to generate input programs for fuzzing DL libraries. LLMs are titanic models trained on billions of code snippets and can autoregressively generate human-like code snippets. Our key insight is that the training corpora of modern LLMs include numerous code snippets invoking DL library APIs, so the models can implicitly learn both language syntax/semantics and intricate DL API constraints for valid DL program generation. More specifically, we use both generative and infilling LLMs (e.g., Codex/InCoder) to generate and mutate valid/diverse input DL programs for fuzzing. Our experimental results demonstrate that TitanFuzz can achieve 30.38%/50.84% higher code coverage than state-of-the-art fuzzers on TensorFlow/PyTorch. Furthermore, TitanFuzz is able to detect 65 bugs, with 44 already confirmed as previously unknown bugs. This paper demonstrates that modern titanic LLMs can be leveraged to directly perform both generation-based and mutation-based fuzzing, studied for decades, while being fully automated, generalizable, and applicable to domains challenging for traditional approaches (such as DL systems). We hope TitanFuzz can stimulate more work in this promising direction of LLMs for fuzzing.
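
The generate-then-mutate loop can be sketched as below. llm_complete and llm_infill are hypothetical stand-ins for the generative and infilling models (the paper uses Codex and InCoder); they return canned strings here so the sketch runs, and the oracle is the simplest possible one: an exception while executing the generated program.

    import random, traceback

    def llm_complete(prompt):               # stand-in for a generative LLM call
        return "import math\nx = math.sqrt(4.0)\nprint(x)"

    def llm_infill(masked):                 # stand-in for an infilling LLM call
        return masked.replace("<INFILL>", "x = math.sqrt(-1.0)")

    def fuzz(rounds=3):
        corpus = [llm_complete("# a short program using math.sqrt\n")]
        for _ in range(rounds):
            lines = random.choice(corpus).splitlines()
            i = random.randrange(len(lines))           # mask one line for infilling
            mutant = llm_infill("\n".join(lines[:i] + ["<INFILL>"] + lines[i + 1:]))
            try:
                exec(mutant, {})            # run the program against the library
                corpus.append(mutant)       # valid program: keep it for mutation
            except Exception:
                traceback.print_exc()       # crash/exception: candidate bug

    fuzz()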

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Exploring Missed Optimizations in WebAssembly Optimizers
Zhibo Liu, Dongwei Xiao, Zongjie Li, Shuai Wang, and Wei Meng
(Hong Kong University of Science and Technology, China; Chinese University of Hong Kong, China)
The prosperous trend of deploying complex applications to web browsers has boosted the development of WebAssembly (wasm) compilation toolchains. Software written in different high-level programming languages is compiled into wasm executables, which can be executed quickly and safely in a virtual machine. The performance of wasm executables depends highly on compiler optimizations. Despite the prosperous use of wasm executables, recent research has indicated that real-world wasm applications are slower than anticipated, suggesting deficiencies in wasm optimizations.
This paper aims to present the first systematic and in-depth understanding of the status quo of wasm optimizations. To do so, we present DITWO, a differential testing framework to uncover missed optimizations (MO) of wasm optimizers. DITWO compiles a C program into both a native x86 executable and a wasm executable, and differentiates optimization indication traces (OITraces) logged by running each executable to uncover MO. Each OITrace is composed of global variable writes and function calls, two performance indicators that practically and systematically reflect the optimization degree across wasm and native executables. Our analysis of the official wasm optimizer, wasm-opt, successfully identifies 1,293 inputs triggering MO of wasm-opt. With extensive manual effort, we identify nine root causes for all MO, and we estimate that fixing the discovered MO can result in a performance improvement of at least 17.15%. We also summarize four lessons from our findings to deliver better wasm optimizations.
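
The trace-differencing step reduces to comparing event counts; a minimal sketch, assuming events are already aligned across the native and wasm builds:

    from collections import Counter

    def diff_oitraces(native_trace, wasm_trace):
        """Traces are lists of events like ("call", "memset") or ("gwrite", "i")."""
        native, wasm = Counter(native_trace), Counter(wasm_trace)
        # events the wasm build performs more often than the native build
        return {ev: wasm[ev] - native[ev] for ev in wasm if wasm[ev] > native[ev]}

    extra = diff_oitraces(
        [("call", "memset")],
        [("call", "memset"), ("call", "memset"), ("gwrite", "i")])
    print(extra)   # surplus wasm events hint at a missed optimization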

Publisher's Version
PhysCov: Physical Test Coverage for Autonomous Vehicles
Carl Hildebrandt, Meriel von Stein, and Sebastian Elbaum
(University of Virginia, USA)
Adequately exercising the behaviors of autonomous vehicles is fundamental to their validation. However, quantifying an autonomous vehicle’s testing adequacy is challenging, as the system’s behavior is influenced both by its state and by its physical environment. To address this challenge, our work builds on two insights. First, data sensed by an autonomous vehicle provides a unique spatial signature of the physical environment inputs. Second, given the vehicle’s current state, inputs residing outside the autonomous vehicle’s physically reachable regions are less relevant to its behavior. Building on those insights, we introduce an abstraction that enables the computation of a physical environment-state coverage metric, PhysCov. The abstraction combines the sensor readings with a physical reachability analysis based on the vehicle’s state and dynamics to determine the region of the environment that may affect the autonomous vehicle. It then characterizes that region through a parameterizable geometric approximation that can trade quality for cost. Tests with the same characterizations are deemed to have had similar internal states and to have been exposed to similar environments, and are thus likely to exercise the same set of behaviors, while tests with distinct characterizations will increase PhysCov. A study on two simulated systems and one real system’s dataset examines PhysCov’s ability to quantify an autonomous vehicle’s test suite, showcases its characterization cost and precision, investigates its correlation with failures found and its potential for test selection, and assesses its ability to distinguish among real-world scenarios.
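
A deliberately simplified sketch of the characterization, with illustrative constants; the real abstraction performs a proper reachability analysis over the vehicle's dynamics rather than the crude radius-and-cone filter below.

    import math

    def characterize(points, speed, horizon=2.0, max_steer=0.6, cells=8):
        """Keep ego-centric LiDAR points inside a crude reachable region, then
        coarsen them into a small polar grid (resolution trades quality for cost)."""
        reach = speed * horizon                  # crude reachable radius
        sig = set()
        for x, y in points:
            r, theta = math.hypot(x, y), math.atan2(y, x)
            if r <= reach and abs(theta) <= max_steer:
                sig.add((int(r / reach * cells), int(theta / max_steer * cells)))
        return frozenset(sig)                    # equal signatures => same coverage cell

    print(characterize([(3.0, 0.5), (50.0, 0.0)], speed=5.0))  # far point ignored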

Publisher's Version Info
Building Critical Testing Scenarios for Autonomous Driving from Real Accidents
Xudong Zhang and Yan Cai
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
One of the aims of developing and spreading autonomous driving technology is to reduce traffic accidents caused by human factors. However, recently reported data on fatal accidents involving autonomous driving systems (ADSs) shows that this important goal has not yet been achieved. There is thus an emerging need for more comprehensive and targeted testing, especially of safe driving. In this paper, we propose an approach to automatically building critical testing scenarios from real-world accident data. First, we propose a new model called M-CPS (Multi-channel Panoptic Segmentation) to extract the effective information from accident records (such as images or videos) and to separate the independent individuals of different traffic participants for further scene recovery. Compared with traditional panoptic segmentation models, the M-CPS model can effectively handle the segmentation challenges posed by accident records, such as shooting angle, image quality, and pixel overlap. The extracted core information is then connected with a virtual testing platform to generate the original scene set. We also design a mutation testing solution on top of the original scene set, greatly enriching the scene library for testing. In our experiments, the M-CPS model reaches 66.1% PQ on the CityScapes test set, showing that our model has only slight fluctuations in performance compared with the best benchmark model on the pure panoptic segmentation task. It also reaches 84.5% IoU for the semantic segmentation branch and 40.3% mAP for the instance segmentation branch on the SHIFT dataset. We then use the UCF-Crime, CADP, and US-Accidents datasets to generate the original and mutated scene sets. The generated scene sets are connected to the Apollo and Carla simulation platforms to test ADS prototypes. We find three types of scenarios that can lead to accidents of ADS prototypes, which indicates that existing ADS prototypes have defects. Our solution provides a new possible direction for recovering key scenarios in ADS testing and can improve efficiency in related fields.

Publisher's Version
CILIATE: Towards Fairer Class-Based Incremental Learning by Dataset and Training Refinement
Xuanqi Gao, Juan Zhai, Shiqing Ma, Chao Shen, Yufei Chen, and Shiwei Wang
(Xi’an Jiaotong University, China; University of Massachusetts, USA; City University of Hong Kong, China)
Due to the model aging problem, Deep Neural Networks (DNNs) need updates to adjust them to new data distributions. The common practice leverages incremental learning (IL), e.g., Class-based Incremental Learning (CIL) that updates output labels, to update the model with new data and a limited number of old data. This avoids heavyweight training (from scratch) using conventional methods and saves storage space by reducing the number of old data to store. However, it also degrades fairness. In this paper, we show that CIL suffers from both dataset and algorithm bias problems, and existing solutions can only partially solve the problem. We propose a novel framework, CILIATE, that fixes both dataset and algorithm bias in CIL. It features a novel differential-analysis-guided dataset and training refinement process that identifies unique and important samples overlooked by existing CIL and enforces the model to learn from them. Through this process, CILIATE improves the fairness of CIL by 17.03%, 22.46%, and 31.79% compared to state-of-the-art methods iCaRL, BiC, and WA, respectively, based on our evaluation on three popular datasets and widely used ResNet models. Our code is available at https://github.com/Antimony5292/CILIATE.

Publisher's Version
BehAVExplor: Behavior Diversity Guided Testing for Autonomous Driving Systems
Mingfei Cheng, Yuan Zhou, and Xiaofei Xie
(Singapore Management University, Singapore; Nanyang Technological University, Singapore)
Testing Autonomous Driving Systems (ADSs) is a critical task for ensuring the reliability and safety of autonomous vehicles. Existing methods mainly focus on searching for safety violations while the diversity of the generated test cases is ignored, which may generate many redundant test cases and failures. Such redundant failures can reduce testing performance and increase failure analysis costs. In this paper, we present a novel behavior-guided fuzzing technique (BehAVExplor) to explore the different behaviors of the ego vehicle (i.e., the vehicle controlled by the ADS under test) and detect diverse violations. Specifically, we design an efficient unsupervised model, called BehaviorMiner, to characterize the behavior of the ego vehicle. BehaviorMiner extracts temporal features from the given scenarios and performs a clustering-based abstraction to group behaviors with similar features into abstract states. A new test case will be added to the seed corpus if it triggers new behaviors (e.g., covers new abstract states). Due to the potential conflict between the behavior diversity and the general violation feedback, we further propose an energy mechanism to guide the seed selection and the mutation. The energy of a seed quantifies how good it is. We evaluated BehAVExplor on Apollo, an industrial-level ADS, and the LGSVL simulation environment. Empirical evaluation results show that BehAVExplor can effectively find more diverse violations than the state-of-the-art.
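
In spirit, the diversity feedback can be sketched as below, with a simple discretization standing in for BehaviorMiner's clustering-based abstraction; the feature choice and energy formula are illustrative only.

    def abstract_states(trace, step=2.0):
        """trace: (speed, accel) samples of the ego vehicle per time step."""
        return {(int(s // step), int(a // step)) for s, a in trace}

    seen, corpus = set(), []          # corpus holds [seed, energy] pairs

    def feedback(seed, trace):
        new = abstract_states(trace) - seen
        if new:                                    # new behavior: keep the seed
            seen.update(new)
            corpus.append([seed, 1.0 + len(new)])  # energy rewards diversity

    def select():
        return max(corpus, key=lambda se: se[1])[0]  # highest-energy seed next

    feedback("scenario-1", [(0.0, 0.0), (4.1, 1.0), (8.3, -2.2)])
    print(select())   # scenario-1 enters the corpus with high energy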

Publisher's Version
In Defense of Simple Techniques for Neural Network Test Case Selection
Shenglin Bao, Chaofeng Sha, Bihuan Chen, Xin Peng, and Wenyun Zhao
(Fudan University, China)
Although deep learning (DL) software has been pervasive in various applications, the brittleness of deep neural networks (DNNs) hinders their deployment in many tasks, especially high-stake ones. To mitigate the risk accompanying DL software faults, a variety of DNN testing techniques have been proposed, such as test case selection. Among test case selection or prioritization methods, uncertainty-based ones such as DeepGini have demonstrated their effectiveness in finding DNNs' faults. Recently, TestRank, a learning-based test ranking method, has outperformed simple uncertainty-based test selection methods. However, this is achieved with a more complicated design that needs to train a graph convolutional network and a multi-layer perceptron. In this paper, we propose a novel and lightweight DNN test selection method to enhance the effectiveness of existing simple ones. Besides the DNN model's uncertainty on the test case itself, we take into account the model's uncertainty on its neighbors. This diversifies the selected test cases and improves the effectiveness of existing uncertainty-based test selection methods. Extensive experiments on 5 datasets demonstrate the effectiveness of our approach.
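
With DeepGini's impurity as the base uncertainty, the neighbor-aware variant can be sketched in a few lines of numpy. The embedding space, k, and the mixing weight alpha are illustrative choices, not the paper's exact formulation.

    import numpy as np

    def gini(probs):                  # probs: (n, classes) softmax outputs
        return 1.0 - np.sum(probs ** 2, axis=1)

    def neighbor_aware_scores(probs, embeddings, k=1, alpha=0.5):
        base = gini(probs)            # the model's uncertainty on each test itself
        d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        nbrs = np.argsort(d, axis=1)[:, :k]            # k nearest neighbors
        return alpha * base + (1 - alpha) * base[nbrs].mean(axis=1)

    probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.6, 0.4]])
    emb = np.array([[0.0], [1.0], [1.1]])
    print(neighbor_aware_scores(probs, emb))   # select the top-scoring tests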

Publisher's Version
ConfFix: Repairing Configuration Compatibility Issues in Android Apps
Huaxun Huang, Chi Xu, Ming Wen, Yepang Liu, and Shing-Chi Cheung
(Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China; Huazhong University of Science and Technology, China)
XML configuration files are widely used to specify the user interfaces (UI) of Android apps. Configuration compatibility (CC) issues are induced by the inconsistent handling of such XML configuration files across different Android framework versions. CC issues can cause software crashes and inconsistent look-and-feels, severely impacting the user experience of Android apps. However, there is no universal solution to resolve CC issues, and app developers need to handle CC issues case by case. Existing tools are designed based on predefined rules or visual features that are possibly manifested by CC issues. Unfortunately, they can fail or generate overfitting patches when the CC issues are beyond their capabilities. To fill the above research gaps, we first empirically studied app developers' common strategies in patching real-world CC issues. Based on the findings, we propose ConfFix, an automatic approach to repair CC issues in Android apps. ConfFix is driven by the knowledge of how an XML element is handled inconsistently in different versions of the Android framework and generates patches to eliminate such inconsistencies. We evaluated ConfFix on a set of 77 reproducible CC issues in 13 open-source Android apps. The results show that ConfFix outperforms baselines, successfully repairing 64 CC issues with high precision. Encouragingly, the patches for 38 CC issues have been confirmed and merged by app developers.

Publisher's Version
Vectorizing Program Ingredients for Better JVM Testing
Tianchang Gao, Junjie Chen, Yingquan Zhao, Yuqun Zhang, and Lingming Zhang
(Tianjin University, China; Southern University of Science and Technology, China; University of Illinois at Urbana-Champaign, USA)
JVM testing is one of the most widely used methodologies for guaranteeing the quality of JVMs. Among various JVM testing techniques, synthesis-based JVM testing, which constructs a test program by synthesizing various code snippets (also called program ingredients), has been demonstrated to be the state of the art. Existing synthesis-based JVM testing work puts more effort into ensuring the validity of synthesized test programs but ignores the influence of the huge ingredient space, which largely limits the ingredient exploration efficiency as well as JVM testing performance. In this work, we propose Vectorized JVM Testing (called VECT) to further promote the performance of synthesis-based JVM testing. Its key insight is to reduce the huge ingredient space by clustering semantically similar ingredients via vectorizing ingredients using state-of-the-art code representation. To make VECT complete and more effective, based on vectorized ingredients, VECT further designs a feedback-driven ingredient selection strategy and an enhanced test oracle. We conducted an extensive study to evaluate VECT on three popular JVMs (i.e., HotSpot, OpenJ9, and Bisheng JDK) involving five OpenJDK versions. The results demonstrate that VECT detects 115.03%~776.92% more unique inconsistencies than the state-of-the-art JVM testing technique during the same testing time. In particular, VECT detects 26 previously unknown bugs, 15 of which have already been confirmed/fixed by developers.

Publisher's Version
What You See Is What You Get? It Is Not the Case! Detecting Misleading Icons for Mobile Applications
Linlin Li, Ruifeng Wang, Xian Zhan, Ying Wang, Cuiyun Gao, Sinan Wang, and Yepang Liu
(Southern University of Science and Technology, China; Northeastern University, China; Harbin Institute of Technology, China)
With the prevalence of smartphones, people nowadays can access a wide variety of services through diverse apps. A good Graphical User Interface (GUI) can make an app more appealing and competitive in app markets. Icon widgets, as an essential part of an app’s GUI, leverage icons to visually convey their functionalities and facilitate user interactions. However, designing intuitive icon widgets is a non-trivial job. Developers should follow a series of guidelines and make appropriate choices from a plethora of possibilities. Inappropriately designed or misused icons may cause user confusion, lead to wrong operations, and even result in security risks (e.g., revenue loss and privacy leakage). To investigate the problem, we manually checked 9,075 icons of 1,111 top-ranked commercial apps from Google Play and found 640 misleading icons in 312 (28%) of these apps. This shows that misleading icons are prevalent among real-world apps, even the top ones.
Manually identifying misleading icons to improve app quality is time-consuming and laborious. In this work, we propose the first framework, IconSeer, to automatically detect misleading icons for mobile apps. Our basic idea is to find the discrepancies between the commonly perceived intentions of an icon and the actual functionality of the corresponding icon widget. IconSeer takes an Android app as input and reports potential misleading icons. It is powered by a comprehensive icon-intention mapping constructed by analyzing 268,353 icons collected from 15,571 popular Android apps in Google Play. The mapping includes 179 icon classes and 852 intention classes. Given an icon widget under analysis, IconSeer first employs a pre-trained open-set deep learning model to infer the possible icon class and the potential intentions. IconSeer then extracts developer-specified text properties of the icon widget, which indicate the widget’s actual functionality. Finally, IconSeer determines whether an icon is misleading by comparing the semantic similarity between the inferred intentions and the extracted text properties of the widget. We have evaluated IconSeer on the 1,111 Android apps with manually established ground truth. IconSeer successfully identified 1,172 inconsistencies (with an accuracy of 0.86), among which we further found 482 real misleading icons.

Publisher's Version
Testing the Compiler for a New-Born Programming Language: An Industrial Case Study (Experience Paper)
Yingquan Zhao, Junjie Chen, Ruifeng Fu, Haojie Ye, and Zan Wang
(Tianjin University, China; Huawei, China)
Due to the critical role of compilers, many compiler testing techniques have been proposed, the two most notable categories of which are grammar-based and metamorphic-based techniques. All of them have been extensively studied for testing mature compilers. However, in practice it is typical to develop a new compiler for a new-born programming language. In this scenario, the existing techniques are hardly applicable for several major reasons: (1) there are no reference compilers to support differential testing, (2) there is a lack of program analysis tools to support most metamorphic-based compiler testing, and (3) different programming language features incur substantial implementation effort. Hence, it is unknown how the existing techniques perform in this new scenario.
In this work, we conduct the first exploration (i.e., an industrial case study) to investigate the performance of the existing techniques in this new scenario with substantial adaptations. We adapted grammar-based compiler testing to this scenario by synthesizing new test programs based on code snippets and using compilation crash as test oracle due to the lack of reference compilers for differential testing. We also adapted metamorphic-based compiler testing to this scenario by constructing equivalent test programs under any inputs to relieve the dependence on program analysis tools. We call the adapted techniques SynFuzz and MetaFuzz, respectively.
We evaluated both SynFuzz and MetaFuzz on two versions of a new compiler for a new-born programming language in a global IT company. By comparing with the testing practice adopted by the testing team and the general fuzzer (AFL), SynFuzz can detect more bugs during the same testing time, and both SynFuzz and MetaFuzz can complement the other two techniques. In particular, SynFuzz and MetaFuzz have detected 11 previously unknown bugs, all of which have been fixed by the developers. From the industrial case study, we summarized a series of lessons and suggestions for practical use and future research.
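
The equivalence construction behind MetaFuzz can be illustrated on Python source (the real target is the new-born language): wrapping each statement in an always-true guard yields a program that must behave identically under any input, so any behavioral divergence between the two compiled programs signals a compiler bug.

    def make_equivalent_variant(src_lines):
        """Guard every simple statement with an opaque-true predicate."""
        guarded = []
        for line in src_lines:
            pad = " " * (len(line) - len(line.lstrip()))
            guarded.append(pad + "if (1 + 1 == 2):")
            guarded.append(pad + "    " + line.lstrip())
        return guarded

    original = ["x = 40", "x = x + 2", "print(x)"]
    variant = make_equivalent_variant(original)
    exec("\n".join(original))   # prints 42
    exec("\n".join(variant))    # must also print 42 under a correct compiler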

Publisher's Version
Quantitative Policy Repair for Access Control on the Cloud
William Eiers, Ganesh Sankaran, and Tevfik Bultan
(University of California at Santa Barbara, USA)
With the growing prevalence of cloud computing, providing secure access to information stored in the cloud has become a critical problem. Due to the complexity of access control policies, administrators may inadvertently allow unintended access to private information, and this is a common source of data breaches in cloud based services. In this paper, we present a quantitative symbolic analysis approach for automated policy repair in order to fix overly permissive policies. We encode the semantics of the access control policies using SMT formulas and assess their permissiveness using model counting. Given a policy, a permissiveness bound, and a set of requests that should be allowed, we iteratively repair the policy through permissiveness reduction and refinement, so that the permissiveness bound is reached while the given set of requests are still allowed. We demonstrate the effectiveness of our automated policy repair technique by applying it to policies written in Amazon's AWS Identity and Access Management (IAM) policy language.
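
The permissiveness measurement can be sketched with z3's Python bindings over a deliberately tiny boolean request space; the encoding of real IAM policies is far richer, and the repair loop is only gestured at in the final comment.

    from z3 import Bools, And, Or, Not, Solver, sat

    read, write, admin = Bools("read write admin")
    policy = Or(read, write, admin)                  # an overly permissive policy
    must_allow = And(read, Not(write), Not(admin))   # request that must stay allowed

    def count_models(formula, vars):
        """Permissiveness = number of satisfying requests (model counting)."""
        s, n = Solver(), 0
        s.add(formula)
        while s.check() == sat:
            m, n = s.model(), n + 1
            s.add(Not(And([v == m.eval(v, model_completion=True) for v in vars])))
        return n

    print(count_models(policy, [read, write, admin]))  # 7 of 8 requests allowed
    s = Solver(); s.add(policy, must_allow)
    assert s.check() == sat   # a repair must keep this request allowed while
                              # strengthening the policy until the bound is met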

Publisher's Version
Validating Multimedia Content Moderation Software via Semantic Fusion
Wenxuan Wang, Jingyuan Huang, Chang Chen, Jiazhen Gu, Jianping Zhang, Weibin Wu, Pinjia He, and Michael Lyu
(Chinese University of Hong Kong, China; Sun Yat-sen University, China)
The exponential growth of social media platforms, such as Facebook, Instagram, YouTube, and TikTok, has revolutionized communication and content publication in human society. Users on these platforms can publish multimedia content that delivers information via the combination of text, audio, images, and video. Meanwhile, the multimedia content release facility has been increasingly exploited to propagate toxic content, such as hate speech, malicious advertisement, and pornography. To this end, content moderation software has been widely deployed on these platforms to detect and block toxic content. However, due to the complexity of content moderation models and the difficulty of understanding information across multiple modalities, existing content moderation software can fail to detect toxic content, which often leads to extremely negative impacts (e.g., harmful effects on teen mental health). We introduce Semantic Fusion, a general, effective methodology for validating multimedia content moderation software. Our key idea is to fuse two or more existing single-modal inputs (e.g., a textual sentence and an image) into a new input that combines the semantics of its ancestors in a novel manner and is toxic by construction. This fused input is then used for validating multimedia content moderation software. We realized Semantic Fusion as DUO, a practical content moderation software testing tool. In our evaluation, we employ DUO to test five commercial content moderation software products and two state-of-the-art models against three kinds of toxic content. The results show that DUO achieves up to 100% error finding rate (EFR) when testing moderation software and up to 94.1% EFR when testing the state-of-the-art models. In addition, we leverage the test cases generated by DUO to retrain the two models we explored, which largely improves model robustness (2.5%∼5.7% EFR) while maintaining the accuracy on the original test set.

Publisher's Version
Towards More Realistic Evaluation for Neural Test Oracle Generation
Zhongxin Liu, Kui Liu, Xin Xia, and Xiaohu Yang
(Zhejiang University, China; Huawei, China)
Unit testing has become an essential practice during software development and maintenance. Effective unit tests can help guard and improve software quality but require a substantial amount of time and effort to write and maintain. A unit test consists of a test prefix and a test oracle. Synthesizing test oracles, especially functional oracles, is a well-known challenging problem. Recent studies proposed to leverage neural models to generate test oracles, i.e., neural test oracle generation (NTOG), and obtained promising results. However, after a systematic inspection, we find there are some inappropriate settings in existing evaluation methods for NTOG. These settings could mislead the understanding of existing NTOG approaches’ performance. We summarize them as 1) generating test prefixes from bug-fixed program versions, 2) evaluating with an unrealistic metric, and 3) lacking a straightforward baseline. In this paper, we first investigate the impacts of these settings on evaluating and understanding the performance of NTOG approaches. We find that 1) unrealistically generating test prefixes from bug-fixed program versions inflates the number of bugs found by the state-of-the-art NTOG approach TOGA by 61.8%, 2) FPR (False Positive Rate) is not a realistic evaluation metric and the Precision of TOGA is only 0.38%, and 3) a straightforward baseline, NoException, which simply expects no exception to be raised, can find 61% of the bugs found by TOGA with twice the Precision. Furthermore, we introduce an additional ranking step to existing evaluation methods and propose an evaluation metric named Found@K to better measure the cost-effectiveness of NTOG approaches in terms of bug-finding. We propose a novel unsupervised ranking method to instantiate this ranking step, significantly improving the cost-effectiveness of TOGA. Eventually, based on our experimental results and observations, we propose a more realistic evaluation method, TEval+, for NTOG and summarize seven rules of thumb to boost NTOG approaches toward practical usage.
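
One plausible reading of the proposed metric, as a sketch (the paper's exact definition may differ): after ranking all generated oracle alarms, Found@K counts the distinct bugs surfaced within the top K alarms a developer would actually inspect.

    def found_at_k(ranked_alarms, k):
        """ranked_alarms: (is_true_positive, bug_id) pairs, best-ranked first."""
        return len({bug for tp, bug in ranked_alarms[:k] if tp})

    ranked = [(True, "b1"), (False, None), (True, "b1"), (True, "b2")]
    print(found_at_k(ranked, 2), found_at_k(ranked, 4))   # 1, then 2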

Publisher's Version
Back Deduction Based Testing for Word Sense Disambiguation Ability of Machine Translation Systems
Jun Wang, Yanhui Li, Xiang Huang, Lin Chen, Xiaofang Zhang, and Yuming Zhou
(Nanjing University, China; Soochow University, China)
Machine translation systems have penetrated our daily lives, providing translation services from source language to target language to millions of users online daily. Word Sense Disambiguation (WSD) is one of the essential functional requirements of machine translation systems, which aims to determine the exact sense of polysemes in the given context. Commercial machine translation systems (e.g., Google Translate) have been shown to fail in identifying the proper sense and consequently cause translation errors. However, to our knowledge, no prior studies focus on testing such WSD bugs for machine translation systems.
To tackle this challenge, we propose a novel testing method, Back Deduction based Testing for Word Sense Disambiguation (BDTD). Our method's main idea is to obtain the hidden senses of source words via back deduction from the target language, i.e., to employ translation words in the target language to deduce the senses of the original words identified in the translation procedure. To evaluate BDTD, we conduct an extensive empirical study with millions of sentences under three popular translators, including Google Translate and Bing Microsoft Translator. The experimental results indicate that BDTD can identify a considerable number of WSD bugs with high accuracy (more than 80%) under all three translators.

Publisher's Version
DyCL: Dynamic Neural Network Compilation Via Program Rewriting and Graph Optimization
Simin Chen, Shiyi Wei, Cong Liu, and Wei Yang
(University of Texas at Dallas, USA; University of California at Riverside, USA)
The deep learning (DL) compiler serves as a vital infrastructure component to enable the deployment of deep neural networks on diverse hardware platforms such as mobile devices and Raspberry Pi. A DL compiler's primary function is to translate DNN programs written in high-level DL frameworks such as PyTorch and TensorFlow into portable executables. These executables can then be flexibly executed by the deployed host programs. However, existing DL compilers rely on a tracing mechanism, which involves feeding a runtime input to a neural network program and tracing the program execution paths to generate the computational graph necessary for compilation. Unfortunately, this mechanism falls short when dealing with modern dynamic neural networks (DyNNs) that possess varying computational graphs depending on the inputs. Consequently, conventional DL compilers struggle to accurately compile DyNNs into executable code. To address this limitation, we propose DyCL, a general approach that enables any existing DL compiler to successfully compile DyNNs. DyCL tackles the dynamic nature of DyNNs by introducing a compilation mechanism that redistributes the control and data flow of the original DNN programs during the compilation process. Specifically, DyCL develops program analysis and program transformation techniques to convert a dynamic neural network into multiple sub-neural networks. Each sub-neural network is devoid of conditional statements and is compiled independently. Furthermore, DyCL synthesizes a host module that models the control flow of the DyNNs and facilitates the invocation of the sub-neural networks. Our evaluation demonstrates the effectiveness of DyCL, achieving a 100% success rate in compiling all dynamic neural networks. Moreover, the compiled executables generated by DyCL exhibit significantly improved performance, running between 1.12× and 20.21× faster than the original DyNNs executed on general-purpose DL frameworks.
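
The decomposition can be sketched with PyTorch modules: a forward pass containing a branch becomes two branch-free sub-networks plus a host function that replays the control flow. In DyCL each sub-network would be compiled ahead of time; the module shapes and the branch condition here are illustrative.

    import torch
    import torch.nn as nn

    class SubA(nn.Module):                      # branch-free: compilable
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(8, 8)
        def forward(self, x):
            return torch.relu(self.fc(x))

    class SubB(nn.Module):                      # branch-free: compilable
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(8, 8)
        def forward(self, x):
            return torch.sigmoid(self.fc(x))

    def host(x, sub_a, sub_b):
        # The host module keeps the original control flow and dispatches to
        # whichever (compiled) sub-network the input selects.
        return sub_a(x) if x.norm() > 1.0 else sub_b(x)

    print(host(torch.randn(1, 8), SubA(), SubB()).shape)   # torch.Size([1, 8])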

Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional
Systematically Producing Test Orders to Detect Order-Dependent Flaky Tests
Chengpeng Li, M. Mahdi Khosravi, Wing Lam, and August Shi
(University of Texas at Austin, USA; Middle East Technical University, Turkey; George Mason University, USA)
Software testing suffers from the presence of flaky tests, which can pass or fail when run on the same version of code. Order-dependent tests (OD tests) are flaky tests whose outcome depends on the order in which they are run. An OD test can be detected if specific tests are run or not run before it, resulting in a difference in test outcome. While prior work has proposed rerunning tests in different random test orders, this approach does not provide guarantees toward detecting all OD tests. Later work that proposed a more systematic approach to ordering tests still fails to account for the relationships between all tests in the test suite. We propose three new techniques to detect OD tests through a more systematic means of producing test orders. Our techniques build upon prior work in Tuscan squares to cover test pairs in a minimal set of test orders while also obeying the constraints of how tests can be positioned in a test order w.r.t. their test classes. Further, as there are many test pairs that need to be covered, we develop a technique that can take a specified set of test pairs to cover and produce test orders that aim to cover just those test pairs. Our evaluation with 289 known OD tests across 47 test suites from open-source projects shows that our most cost-effective technique can detect 97.2% of the known OD tests with 104.7 test orders, on average, per subject. While all techniques produce a relatively large number of test orders, our analysis of the minimal set of test orders needed to detect OD tests shows a tremendous reduction in the test orders needed to detect OD tests – representing an opportunity for future work to prioritize test orders.
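
A simpler pair-covering construction than Tuscan squares illustrates why systematically produced orders can guarantee coverage of ordered test pairs: among the n rotations of a test list, the rotation starting at test a runs a before every other test. The paper's constructions additionally handle class-position constraints, which this sketch ignores.

    def rotation_orders(tests):
        """n orders such that every ordered pair (a, b) has a before b somewhere."""
        return [tests[i:] + tests[:i] for i in range(len(tests))]

    orders = rotation_orders(["t1", "t2", "t3"])
    covered = {(a, b) for order in orders
               for i, a in enumerate(order) for b in order[i + 1:]}
    assert len(covered) == 3 * 2      # all 6 ordered pairs are covered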

Publisher's Version
Security Checking of Trigger-Action-Programming Smart Home Integrations
Lei Bu, Qiuping Zhang, Suwan Li, Jinglin Dai, Guangdong Bai, Kai Chen, and Xuandong Li
(Nanjing University, China; University of Queensland, Australia; Institute of Information Engineering at Chinese Academy of Sciences, China)
The Internet of Things (IoT) has become prevalent in various fields, especially in the context of home automation (HA). To better control HA-IoT devices, and especially to integrate several devices for rich smart functionalities, trigger-action programming, such as If This Then That (IFTTT), has become a popular paradigm. Leveraging it, novice users can easily specify their intent in applets regarding how to control one device/service through another once a specific condition is met. Nevertheless, users may design IFTTT-style integrations inappropriately, due to a lack of security experience or unawareness of the security impact of cyber-attacks against individual devices. This has caused financial loss, privacy leakage, unauthorized access, and other security issues. To address these problems, this work proposes a systematic framework named MEDIC to model smart home integrations and check their security. It automatically generates models incorporating the service/device behaviors and the action rules of the applets, while taking into consideration external attacks and in-device vulnerabilities. Our approach takes around one second to complete the modeling and checking of one integration. We carried out experiments based on 200 integrations created from a user study and a dataset crawled from ifttt.com. To our great surprise, nearly 83% of these integrations have security issues.

Publisher's Version
LiResolver: License Incompatibility Resolution for Open Source Software
Sihan Xu, Ya Gao, Lingling Fan, Linyu Li, Xiangrui Cai, and Zheli Liu
(Nankai University, China)
Open source software (OSS) licenses regulate the conditions under which OSS can be legally reused, distributed, and modified. However, a common issue arises when incorporating third-party OSS accompanied by licenses, i.e., license incompatibility, which occurs when multiple licenses exist in one project and there are conflicts between them. Despite being problematic, fixing license incompatibility issues requires substantial effort due to the lack of license understanding and complex package dependencies. In this paper, we propose LiResolver, a fine-grained, scalable, and flexible tool to resolve license incompatibility issues for open source software. Specifically, it first understands the semantics of licenses through fine-grained entity extraction and relation extraction. Then, it detects and resolves license incompatibility issues by recommending official licenses in priority. When no official licenses can satisfy the constraints, it generates a custom license as an alternative solution. Comprehensive experiments demonstrate the effectiveness of LiResolver, with a 4.09% false positive (FP) rate and a 0.02% false negative (FN) rate for incompatibility issue localization, and 62.61% of 230 real-world incompatible projects resolved by LiResolver. We discuss the feedback from OSS developers and the lessons learned from this work. All the datasets and the replication package of LiResolver have been made publicly available to facilitate follow-up research.

Publisher's Version
More Precise Regression Test Selection via Reasoning about Semantics-Modifying Changes
Yu Liu, Jiyang Zhang, Pengyu Nie, Milos Gligoric, and Owolabi Legunsen
(University of Texas at Austin, USA; Cornell University, USA)
Regression test selection (RTS) speeds up regression testing by re-running only tests that might be affected by code changes. Ideal RTS safely selects all affected tests and precisely selects only affected tests. But aiming for this ideal is often slower than re-running all tests. So, recent RTS techniques use program analysis to trade precision for speed, i.e., lower regression testing time, or even use machine learning to trade safety for speed. We seek to make recent analysis-based RTS techniques more precise, to further speed up regression testing. Independent studies suggest that these techniques have reached a “performance wall” in the speed-ups that they provide. We manually inspect code changes to discover those that do not require re-running tests that are only affected by such changes. We categorize 29 kinds of changes that we find from five projects into 13 findings, 11 of which are semantics-modifying. We enhance two RTS techniques---Ekstazi and STARTS---to reason about our findings. Using 1,150 versions of 23 projects, we evaluate the impact on safety and precision of leveraging such changes. We also evaluate whether our findings from a few projects can speed up regression testing in other projects. The results show that our enhancements are effective and can generalize. On average, they result in selecting 41.7% and 31.8% fewer tests, and take 33.7% and 28.7% less time than Ekstazi and STARTS, respectively, with no loss in safety.

Publisher's Version
Silent Compiler Bug De-duplication via Three-Dimensional Analysis
Chen Yang, Junjie Chen, Xingyu Fan, Jiajun Jiang, and Jun Sun
(Tianjin University, China; Singapore Management University, Singapore)
Compiler testing is an important task for assuring the quality of compilers, but investigating test failures is very time-consuming, because many test failures are caused by the same compiler bug (known as the bug duplication problem). This problem becomes much more challenging for silent compiler bugs (also called wrong code bugs), since these bugs provide little information for bug de-duplication (unlike crash bugs, which produce error messages). In this work, we propose a novel technique (called D3) to solve the duplication problem for silent compiler bugs. Its key insight is to characterize silent bugs from the testing process and identify three-dimensional information (i.e., test program, optimizations, and test execution) for bug de-duplication. However, the three dimensions contain a huge amount of bug-irrelevant detail, so D3 systematically conducts causal analysis to identify bug-causal features from each of the three dimensions for more accurate bug de-duplication. Finally, D3 ranks higher the test failures that are more likely to be caused by different silent bugs, by measuring the distance among test failures based on the three-dimensional bug-causal features. Our experimental results on four datasets (including duplicate bugs of both GCC and LLVM) demonstrate the significant superiority of D3 over two state-of-the-art compiler bug de-duplication techniques, achieving average improvements of 19.36% and 51.43% in identifying unique silent compiler bugs when analyzing the same number of test failures.

Publisher's Version
ACETest: Automated Constraint Extraction for Testing Deep Learning Operators
Jingyi Shi, Yang Xiao, Yuekang Li, Yeting Li, Dongsong Yu, Chendong Yu, Hui Su, Yufeng Chen, and Wei Huo
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia; Zhongguancun Laboratory, China)
Deep learning (DL) applications are prevalent nowadays, as they can help with multiple tasks. DL libraries are essential for building DL applications. Furthermore, DL operators are the important building blocks of DL libraries, which compute the multi-dimensional data (tensors). Therefore, bugs in DL operators can have great impact. Testing is a practical approach for detecting bugs in DL operators. In order to test DL operators effectively, it is essential that the test cases pass the input validity check and are able to reach the core function logic of the operators. Hence, extracting the input validation constraints is required for generating high-quality test cases. Existing techniques rely on either human effort or documentation of DL library APIs to extract the constraints. They cannot extract complex constraints, and the extracted constraints may differ from the actual code implementation. To address this challenge, we propose ACETest, a technique to automatically extract input validation constraints from the code to build valid yet diverse test cases that can effectively unveil bugs in the core function logic of DL operators. For this purpose, ACETest can automatically identify the input validation code in DL operators, extract the related constraints, and generate test cases according to the constraints. The experimental results on popular DL libraries, TensorFlow and PyTorch, demonstrate that ACETest can extract constraints of higher quality than state-of-the-art (SOTA) techniques. Moreover, ACETest is capable of extracting 96.4% more constraints and detecting 1.95 to 55 times more bugs than SOTA techniques. In total, we have used ACETest to detect 108 previously unknown bugs on TensorFlow and PyTorch, with 87 of them confirmed by the developers. Lastly, five of the bugs were assigned CVE IDs due to their security impact.

Publisher's Version
DDLDroid: Efficiently Detecting Data Loss Issues in Android Apps
Yuhao Zhou and Wei Song
(Nanjing University of Science and Technology, China)
Data loss issues in Android apps, triggered by activity restart or app relaunch, significantly degrade the user experience and undermine app quality. While data loss detection has received much attention, state-of-the-art techniques still miss many data loss issues due to the inaccuracy of static analysis or the low coverage of dynamic exploration. To this end, we present DDLDroid, a static analysis approach and an open-source tool, to systematically and efficiently detect data loss issues based on data flow analysis. DDLDroid is bootstrapped by a saving-restoring bipartite graph which correlates variables that need saving with the corresponding variables that need restoring according to their carrier widgets. Missed or broken saving or restoring data flows lead to data loss issues. The experimental evaluation on 66 Android apps demonstrates the effectiveness and efficiency of our approach: DDLDroid successfully detected 302 true data loss issues in 73 minutes, 180 of which were previously unknown.
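
The bipartite check reduces to matching saved and restored variables per carrier widget; a minimal sketch, with dictionaries standing in for the graph extracted by static analysis:

    def find_data_loss(saving, restoring):
        """saving/restoring: widget -> variable maps from the bipartite graph."""
        issues = []
        for widget, var in saving.items():
            if restoring.get(widget) != var:
                issues.append(("missing or broken restore", widget, var))
        for widget, var in restoring.items():
            if widget not in saving:
                issues.append(("restore without save", widget, var))
        return issues

    # The draft text is saved on activity restart but never restored: data loss.
    print(find_data_loss({"editText1": "draft"}, {}))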

Publisher's Version Published Artifact Artifacts Available
To Kill a Mutant: An Empirical Study of Mutation Testing Kills
Hang Du, Vijay Krishna Palepu, and James A. Jones
(University of California at Irvine, USA; Microsoft, USA)
Mutation testing has been used and studied for over four decades as a method to assess the strength of a test suite. This technique adds an artificial bug (i.e., a mutation) to a program to produce a mutant, and the test suite is run to determine if any of its test cases are sufficient to detect this mutation (i.e., kill the mutant). In this situation, a test case that fails is the one that kills the mutant. However, little is known about the nature of these kills. In this paper, we present an empirical study that investigates the nature of these kills. We seek to answer questions, such as: How are test cases failing so that they contribute to mutant kills? How many test cases fail for each killed mutant, given that only a single failure is required to kill that mutant? How do program crashes contribute to kills, and what are the origins and nature of these crashes? We found several revealing results across all subjects, including the substantial contribution of "crashes" to test failures leading to mutant kills, the existence of diverse causes for test failures even for a single mutation, and the specific types of exceptions that commonly instigate crashes. We posit that this study and its results should likely be taken into account for practitioners in their use of mutation testing and interpretation of its mutation score, and for researchers who study and leverage mutation testing in their future work.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
iSyn: Semi-automated Smart Contract Synthesis from Legal Financial Agreements
Pengcheng Fang, Zhenhua Zou, Xusheng Xiao, and Zhuotao Liu
(Case Western Reserve University, USA; Tsinghua University, China; Arizona State University, USA)
Embracing software-driven smart contracts to fulfill legal agreements is a promising direction for digital transformation in the legal sector. Existing solutions mostly consider smart contracts as simple add-ons, without leveraging the programmability of smart contracts to realize the complex semantics of legal agreements. In this paper, we propose iSyn, the first end-to-end system that synthesizes smart contracts to fulfill the semantics of financial legal agreements, with minimal human intervention. The design of iSyn centers around a novel intermediate representation (SmartIR) that closes the gap between natural language sentences and smart contract statements. Specifically, iSyn includes a synergistic pipeline that unifies multiple NLP techniques to accurately construct SmartIR instances from legal agreements, and performs template-based synthesis based on the SmartIR instances to synthesize smart contracts. We also design a validation framework to verify the correctness and detect known vulnerabilities of the synthesized smart contracts. We evaluate iSyn using legal agreements centering around financial transactions. The results show that iSyn-synthesized smart contracts are syntactically similar and semantically correct (or within a few edits), compared with the “ground truth” smart contracts manually developed by inspecting the legal agreements.

Publisher's Version Info
RefBERT: A Two-Stage Pre-trained Framework for Automatic Rename Refactoring
Hao Liu, Yanlin Wang, Zhao Wei, Yong Xu, Juhong Wang, Hui Li, and Rongrong Ji
(Xiamen University, China; Sun Yat-sen University, China; Tencent, China)
Refactoring is an indispensable practice of improving the quality and maintainability of source code in software evolution. Rename refactoring is the most frequently performed refactoring that suggests a new name for an identifier to enhance readability when the identifier is poorly named. However, most existing works only identify renaming activities between two versions of source code, while few works express concern about how to suggest a new name. In this paper, we study automatic rename refactoring on variable names, which is considered more challenging than other rename refactoring activities. We first point out the connections between rename refactoring and various prevalent learning paradigms and the difference between rename refactoring and general text generation in natural language processing. Based on our observations, we propose RefBERT, a two-stage pre-trained framework for rename refactoring on variable names. RefBERT first predicts the number of sub-tokens in the new name and then generates sub-tokens accordingly. Several techniques, including constrained masked language modeling, contrastive learning, and the bag-of-tokens loss, are incorporated into RefBERT to tailor it for automatic rename refactoring on variable names. Through extensive experiments on our constructed refactoring datasets, we show that the generated variable names of RefBERT are more accurate and meaningful than those produced by the existing method. Our implementation and data are available at https://github.com/KDEGroup/RefBERT.

Publisher's Version
CoopHance: Cooperative Enhancement for Robustness of Deep Learning Systems
Quan Zhang, Yongqiang Tian, Yifeng Ding, Shanshan Li, Chengnian Sun, Yu Jiang, and Jiaguang Sun
(Tsinghua University, China; University of Waterloo, Canada; University of Illinois at Urbana-Champaign, USA; National University of Defense Technology, China)
Adversarial attacks have been a threat to Deep Learning (DL) systems to be reckoned with. By adding human-imperceptible perturbation to benign inputs, adversarial attacks can cause the incorrect behavior of DL systems. Considering the popularity of DL systems in the industry, it is critical and urgent for developers to enhance the robustness of DL systems against adversarial attacks.
In this study, we propose a novel enhancement technique for DL systems, namely CoopHance. CoopHance leverages two specifically customized components, Regulator and Inspector, to cooperatively enhance DL systems' robustness against adversarial examples with different distortions. Regulator can purify adversarial examples with low or moderate distortion, while Inspector is responsible for detecting adversarial examples with high distortion by capturing the abnormal status of DL systems. Our evaluation using various attacks shows that, on average, CoopHance can successfully resist 90.62% and 96.56% of the adversarial examples generated for the unprotected systems on the CIFAR-10 and SVHN datasets, respectively, which is 188.14% more effective than five state-of-the-art enhancement techniques: Feature Squeeze, LID, SOAP, Adversarial Training, and MagNet. Meanwhile, when attackers generate new adversarial examples on the enhanced systems, CoopHance can reject 78.06% of attacks, outperforming the best of the five enhancement techniques by 82.71% on average.

Publisher's Version
ROME: Testing Image Captioning Systems via Recursive Object Melting
Boxi Yu, Zhiqing Zhong, Jiaqi Li, Yixing Yang, Shilin He, and Pinjia He
(Chinese University of Hong Kong, China; Microsoft Research, China)
Image captioning (IC) systems aim to generate a text description of the salient objects in an image. In recent years, IC systems have been increasingly integrated into our daily lives, such as assistance for visually impaired people and description generation in Microsoft PowerPoint. However, even the cutting-edge IC systems (e.g., Microsoft Azure Cognitive Services) and algorithms (e.g., OFA) can produce erroneous captions, leading to incorrect captioning of important objects, misunderstanding, and threats to personal safety. Existing testing approaches either fail to handle the complex form of IC system output (i.e., sentences in natural language) or generate unnatural images as test cases. To address these problems, we introduce Recursive Object MElting (ROME), a novel metamorphic testing approach for validating IC systems. Different from existing approaches that generate test cases by inserting objects, which easily makes the generated images unnatural, ROME melts (i.e., removes and inpaints) objects. ROME assumes that the object set in the caption of an image includes the object set in the caption of a generated image after object melting. Given an image, ROME can recursively remove its objects to generate different pairs of images. We use ROME to test one widely adopted image captioning API and four state-of-the-art (SOTA) algorithms. The results show that the test cases generated by ROME look much more natural than those of the SOTA IC testing approach and achieve comparable naturalness to the original images. Meanwhile, by generating test pairs using 226 seed images, ROME reports a total of 9,121 erroneous issues with high precision (86.47%-92.17%). In addition, we further utilize the test cases generated by ROME to retrain Oscar, which improves its performance across multiple evaluation metrics.
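
The metamorphic relation itself is a simple subset check; a sketch with lambda stand-ins for the IC system (caption), the melting step (melt), and object extraction, all hypothetical placeholders for the real components.

    def check_melting(image, obj, caption, melt, extract_objects):
        before = extract_objects(caption(image))
        after = extract_objects(caption(melt(image, obj)))
        return after <= before   # False => the system captioned a phantom object

    holds = check_melting(
        "img", "dog",
        caption=lambda im: "dog cat" if im == "img" else "cat",
        melt=lambda im, o: "img-without-dog",
        extract_objects=lambda text: set(text.split()))
    print(holds)   # True: the melted image's caption mentions no new objects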

Publisher's Version Published Artifact Artifacts Available
GPUHarbor: Testing GPU Memory Consistency at Large (Experience Paper)
Reese Levine, Mingun Cho, Devon McKee, Andrew Quinn, and Tyler Sorensen
(University of California at Santa Cruz, USA; University of California at Davis, USA)
Memory consistency specifications (MCSs) are a difficult, yet critical, part of a concurrent programming framework. Existing MCS testing tools are not immediately accessible and thus have only been applied to a limited number of devices. However, in the post-Dennard-scaling landscape, there has been an explosion of new architectures and frameworks. Studying the shared memory behaviors of these new platforms is important to understand their behavior and ensure conformance to framework specifications.
In this paper, we present GPUHarbor, a widescale GPU MCS testing tool with a web interface and an Android app. Using GPUHarbor, we deployed a testing campaign that checks conformance and characterizes weak behaviors. We advertised GPUHarbor on forums and social media, allowing us to collect testing data from 106 devices, spanning seven vendors. In terms of devices tested, this constitutes the largest study on weak memory behaviors by at least 10×, and our conformance tests identified two new bugs on embedded Arm and NVIDIA devices. Analyzing our characterization data yields many insights, including quantifying and comparing weak behavior occurrence rates (e.g., AMD GPUs show 25.3× more weak behaviors on average than Intel). We conclude with a discussion of the impact our results have on software development for these performance-critical devices.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
COME: Commit Message Generation with Modification Embedding
Yichen He, Liran Wang, Kaiyi Wang, Yupeng Zhang, Hang Zhang, and Zhoujun Li
(Beihang University, China)
Commit messages concisely describe code changes in natural language and are important for program comprehension and maintenance. Previous studies proposed some approaches for automatic commit message generation, but their performance is limited due to inappropriate representation of code changes and improper combination of translation-based and retrieval-based approaches. To address these problems, this paper introduces a novel framework named COME, in which modification embeddings are used to represent code changes in a fine-grained way, a self-supervised generative task is designed to learn contextualized code change representations, and retrieval-based and translation-based methods are combined through a decision algorithm. The average improvement of COME over the state-of-the-art approaches is 9.2% on automatic evaluation metrics and 8.0% on human evaluation metrics. We also analyse the effectiveness of COME's three main components, each of which contributes an improvement of 8.6%, 8.7%, and 5.2%, respectively.
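A hedged sketch of what such a decision rule between retrieval and translation can look like; retrieve_similar, similarity, and translate are hypothetical stand-ins, and the fixed threshold is illustrative rather than COME's actual algorithm.

    def generate_commit_message(diff, corpus, retrieve_similar, similarity,
                                translate, threshold=0.8):
        nearest = retrieve_similar(diff, corpus)   # (past_diff, past_message)
        if nearest and similarity(diff, nearest[0]) >= threshold:
            # The change closely matches a past one: reuse its message.
            return nearest[1]
        # Otherwise fall back to the neural translation model.
        return translate(diff)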

Publisher's Version
OCFI: Make Function Entry Identification Hard Again
Chengbin Pang, Tiantai Zhang, Xuelan Xu, Linzhang Wang, and Bing Mao
(Nanjing University, China)
Function entry identification is a crucial yet challenging task for binary disassemblers that has been the focus of research in the past decades. However, recent research shows that call frame information (CFI) provides accurate and almost complete function entries. With the aid of CFI, disassemblers achieve significant improvements in function entry detection. CFI is specifically designed for efficient stack unwinding, and every function has corresponding CFI on the x64 and aarch64 architectures. Nevertheless, not every function and instruction unwinds the stack at runtime, and this observation opens the door to obfuscating CFI in order to complicate function detection by disassemblers.
We propose a prototype, OCFI, that obfuscates CFI based on this observation. The goal of OCFI is to obstruct function detection by popular disassemblers that use CFI to detect function entries. We evaluated OCFI on a large-scale dataset that includes real-world applications and automatically generated programs, and found that the obfuscated CFI correctly unwinds the stack and makes function entry detection by popular disassemblers more difficult. Furthermore, on average, OCFI incurs a size overhead of only 4% and nearly zero runtime overhead.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Catamaran: Low-Overhead Memory Safety Enforcement via Parallel Acceleration
Yiyu Zhang, Tianyi Liu, Zewen Sun, Zhe Chen, Xuandong Li, and Zhiqiang Zuo
(Nanjing University, China; Nanjing University of Aeronautics and Astronautics, China)
Memory safety issues are intrinsic to C/C++ programs. Dynamic memory safety enforcement, the dominant approach, is highly effective yet suffers from prohibitively high runtime overhead. Existing attempts to reduce the overhead are either labor-intensive, tightly dependent on specific hardware/compiler support, or poorly effective.
In this paper, we propose a novel technique to reduce the time overhead by executing the dynamic checking code in parallel. We leverage static dependence analysis and dynamic profit analysis to identify the checking code that can be safely offloaded and dispatch it to separate threads running simultaneously. We implemented a tool called Catamaran and evaluated it over a rich set of benchmarks. The experimental results validate that Catamaran is able to significantly reduce the runtime overhead of existing dynamic tools, without sacrificing the capability of memory safety enforcement.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Latent Imitator: Generating Natural Individual Discriminatory Instances for Black-Box Fairness Testing
Yisong Xiao, Aishan Liu, Tianlin Li, and Xianglong Liu
(Beihang University, China; Institute of Dataspace, China; Nanyang Technological University, Singapore; Zhongguancun Laboratory, China)
Machine learning (ML) systems have achieved remarkable performance across a wide range of applications. However, they frequently exhibit unfair behaviors in sensitive application domains (e.g., employment and loan), raising severe fairness concerns. To evaluate and test fairness, engineers often generate individual discriminatory instances to expose unfair behaviors before model deployment. However, existing baselines ignore the naturalness of generation and produce instances that deviate from the real data distribution, which may fail to reveal the actual model fairness, since such unnatural discriminatory instances are unlikely to appear in practice. To address the problem, this paper proposes a framework named Latent Imitator (LIMI) to generate more natural individual discriminatory instances with the help of a generative adversarial network (GAN), where we imitate the decision boundary of the target model in the semantic latent space of the GAN and further sample latent instances on it. Specifically, we first derive a surrogate linear boundary to coarsely approximate the decision boundary of the target model, which reflects the nature of the original data distribution. Subsequently, to obtain more natural instances, we move random latent vectors onto the surrogate boundary with a one-step movement, and further conduct vector calculation to probe two potential discriminatory candidates that may lie closer to the real decision boundary. Extensive experiments on various datasets demonstrate that our LIMI outperforms other baselines largely in effectiveness (9.42× more instances), efficiency (8.71× faster), and naturalness (+19.65%) on average. In addition, we empirically demonstrate that retraining on test samples generated by our approach can lead to improvements in both individual fairness (45.67% on IFr and 32.81% on IFo) and group fairness (9.86% on SPD and 28.38% on AOD). Our code can be found on our website.
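The one-step movement onto a linear surrogate boundary is simple vector geometry; here is a minimal numpy sketch, where the boundary parameters (w, b) and the step size eps are assumed inputs, not LIMI's actual code.

    import numpy as np

    def probe_candidates(z, w, b, eps=0.1):
        """Project latent vector z onto the surrogate boundary w.z + b = 0,
        then probe one candidate on each side of the boundary."""
        n = np.linalg.norm(w)
        w_hat = w / n
        distance = (np.dot(w, z) + b) / n      # signed distance to the boundary
        z_on_boundary = z - distance * w_hat   # the one-step movement
        # Two candidates straddling the surrogate boundary, likely to be
        # discriminatory if the model's real decision boundary is nearby.
        return z_on_boundary + eps * w_hat, z_on_boundary - eps * w_hat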

Publisher's Version Info
Simulation-Based Validation for Autonomous Driving Systems
Changwen Li, Joseph Sifakis, Qiang Wang, Rongjie Yan, and Jian Zhang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; University Grenoble Alpes, France; CNRS, France; Grenoble INP, France; VERIMAG, France; Academy of Military Sciences, China)
We investigate a rigorous simulation and testing-based validation method for autonomous driving systems that integrates an existing industrial simulator and a formally defined testing environment. The environment includes a scenario generator that drives the simulation process and a monitor that checks at runtime the observed behavior of the system against a set of system properties to be validated. The validation method consists of extracting from the simulator a semantic model of the simulated system, including a metric graph, which is a mathematical model of the environment in which the vehicles of the system evolve. The monitor can verify properties formalized in a first-order linear temporal logic and provide diagnostics explaining their non-satisfaction. Instead of exploring the system behavior randomly, as many simulators do, we propose a method to systematically generate sets of scenarios that cover potentially risky situations, especially for different types of junctions where specific traffic rules must be respected. We show that the systematic exploration of risky situations has uncovered many flaws in the real simulator that would have been very difficult to discover by a random exploration process.

Publisher's Version Published Artifact Artifacts Available
Automated Program Repair from Fuzzing Perspective
YoungJae Kim, Seungheon Han, Askar Yeltayuly Khamit, and Jooyong Yi
(Ulsan National Institute of Science and Technology, South Korea)
In this work, we present a novel approach that connects two closely-related topics: fuzzing and automated program repair (APR). The paper is divided into two parts. In the first part, we describe the similarities between fuzzing and APR, both of which can be viewed as search problems. In the second part, we introduce a new patch-scheduling algorithm called Casino, which is designed from a fuzzing perspective to enhance search efficiency. Our experiments demonstrate that Casino outperforms existing algorithms. We also promote open science by sharing SimAPR, a simulation tool that can be used to evaluate new patch-scheduling algorithms.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
1dFuzz: Reproduce 1-Day Vulnerabilities with Directed Differential Fuzzing
Songtao Yang, Yubo He, Kaixiang Chen, Zheyu Ma, Xiapu Luo, Yong Xie, Jianjun Chen, and Chao Zhang
(Tsinghua University, China; Information Engineering University, China; Hong Kong Polytechnic University, China; Qinghai University, China)
1-day vulnerabilities are common in practice and have posed severe threats to end users, as adversaries can learn from released patches to find and exploit them. Reproducing 1-day vulnerabilities is also crucial for defenders, e.g., to block attack traffic against 1-day vulnerabilities. A core question that affects the effectiveness of recognizing and triggering 1-day vulnerabilities is: what is the unique feature of a security patch? After conducting a large-scale empirical study, we point out that a common and unique feature of patches is the trailing call sequence (TCS), and in this paper we present a novel directed differential fuzzing solution, 1dFuzz, to efficiently reproduce 1-day vulnerabilities. Based on the TCS feature, we present a locator, 1dLoc, able to find candidate patch locations via static analysis, a novel TCS-based distance metric for directed fuzzing, and a novel sanitizer, 1dSan, able to catch PoCs for 1-day vulnerabilities during fuzzing. We have systematically evaluated 1dFuzz on a set of real-world software vulnerabilities in 11 different settings. Results show that 1dFuzz significantly outperforms state-of-the-art (SOTA) baselines and can find up to 2.26x more 1-day vulnerabilities in 43% less time.

Publisher's Version
A Bayesian Framework for Automated Debugging
Sungmin Kang, Wonkeun Choi, and Shin Yoo
(KAIST, South Korea)
Debugging takes up a significant portion of developer time. As a result, automated debugging techniques including Fault Localization (FL) and Automated Program Repair (APR) have garnered significant attention due to their potential to aid developers in debugging tasks. With the recent advance in techniques that treat the two tasks as closely coupled, such as Unified Debugging, a framework that formally expresses these two tasks together would heighten our understanding of automated debugging and provide a way to formally analyze techniques and approaches. To this end, we propose a Bayesian framework for understanding automated debugging. We find that the Bayesian framework, along with a concrete statement of the objective of automated debugging, can recover maximal fault localization formulae from prior work, as well as analyze existing APR techniques and their underlying assumptions. As a means of empirically demonstrating our framework, we further propose BAPP, a Bayesian Patch Prioritization technique that incorporates intermediate program values to analyze likely patch locations and repair actions, with its core equations derived from our Bayesian framework. We find that incorporating program values allows BAPP to identify correct patches more precisely: the rankings produced by BAPP reduced the number of required patch evaluations by 68% and consequently reduced the repair time by 34 minutes on average. Further, our Bayesian framework suggests a number of changes to the way fault localization information is used in program repair, which we validate are useful for BAPP. These results highlight the potential of value-cognizant automated debugging techniques and further verify our theoretical framework.

Publisher's Version
That’s a Tough Call: Studying the Challenges of Call Graph Construction for WebAssembly
Daniel Lehmann, Michelle Thalakottur, Frank Tip, and Michael Pradel
(University of Stuttgart, Germany; Northeastern University, USA)
WebAssembly is a low-level bytecode format that powers applications and libraries running in browsers, on the server side, and in standalone runtimes. Call graphs are at the core of many interprocedural static analysis and optimization techniques. However, WebAssembly poses some unique challenges for static call graph construction. Currently, these challenges are neither well understood, nor is it clear to what extent existing techniques address them. This paper presents the first systematic study of WebAssembly-specific challenges for static call graph construction and of the state-of-the-art in call graph analysis. We identify and classify 12 challenges, encode them into a suite of 24 executable microbenchmarks, and measure their prevalence in real-world binaries. These challenges reflect idiosyncrasies of WebAssembly, such as indirect calls via a mutable function table, interactions with the host environment, and unmanaged linear memory. We show that they commonly occur across a set of more than 8,000 real-world binaries. Based on our microbenchmarks and a set of executable real-world binaries, we then study the soundness and precision of four existing static analyses. Our findings include that, surprisingly, all of the existing techniques are unsound, without this being documented anywhere. We envision our work to provide guidance for improving static call graph construction for WebAssembly. In particular, the presented microbenchmarks will enable future work to check whether an analysis supports challenging language features, and to quantify its soundness and precision.
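One of the named challenges, indirect calls through a mutable function table, is easy to see in a Python analogue of WebAssembly's call_indirect, which dispatches through exactly such a table; this toy example is our illustration, not taken from the paper.

    # Static analysis sees two possible targets in the table, but the table
    # is mutable at run time, so a call graph computed before execution can
    # miss edges once an entry is overwritten.

    def f(): return "f"
    def g(): return "g"
    def h(): return "h"

    table = [f, g]           # analogue of a WebAssembly funcref table

    def call_indirect(i):
        return table[i]()    # target depends on runtime table contents

    call_indirect(1)         # calls g
    table[1] = h             # runtime mutation of the table
    call_indirect(1)         # now calls h: an edge a static analysis may miss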

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable
GenCoG: A DSL-Based Approach to Generating Computation Graphs for TVM Testing
Zihan Wang, Pengbo Nie, Xinyuan Miao, Yuting Chen, Chengcheng Wan, Lei Bu, and Jianjun Zhao
(Shanghai Jiao Tong University, China; East China Normal University, China; Nanjing University, China; Kyushu University, Japan)
TVM is a popular deep learning (DL) compiler. It is designed to compile DL models, which are naturally computation graphs, while also promoting the efficiency of DL computation. State-of-the-art methods, such as Muffin and NNSmith, allow developers to generate computation graphs for testing DL compilers. However, these techniques are inefficient: their generated computation graphs are either type-invalid or inexpressive, and hence unable to test the core functionalities of a DL compiler.
To tackle this problem, we propose GenCoG, a DSL-based approach to generating computation graphs for TVM testing. GenCoG is composed of (1) GenCoGL, a domain-specific language for specifying the type constraints of operators, and (2) an approach that concolically solves type constraints and incrementally generates computation graphs of high expressivity. We implement and evaluate GenCoG on TVM releases. Our results show that GenCoG is effective in generating valid and expressive computation graphs: all of the GenCoG-generated graphs pass type checking, a critical graph validation stage, and, measuring the graphs' expressivity by their vertex and edge diversities, GenCoG outperforms the state of the art by achieving 1.65–6.93× higher vertex diversity and 1.06–7.08× higher edge diversity, respectively. Furthermore, GenCoG has detected 16 bugs in TVM v0.8 and v0.9, with 14 confirmed and 12 fixed.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Alligator in Vest: A Practical Failure-Diagnosis Framework via Arm Hardware Features
Yiming Zhang, Yuxin Hu, Haonan Li, Wenxuan Shi, Zhenyu Ning, Xiapu Luo, and Fengwei Zhang
(Southern University of Science and Technology, China; Hong Kong Polytechnic University, China; Hunan University, China)
Failure diagnosis in practical systems is difficult, and the main obstacle is that the information a developer has access to is limited. This information is usually not enough to help developers fix or even locate the related bug. Moreover, due to the vast difference between the development and production environments, it is not trivial to reproduce failures from the production environment in the development environment. When failures are caused by non-deterministic events such as race conditions or unforeseen inputs, reproducing them is even more challenging.
In this paper, we present Investigator, a failure diagnosis framework for practical systems running on Arm. At runtime, Investigator leverages the hardware tracing component called Embedded Trace Macrocell (ETM) and a lightweight event capturer to collect information with low overhead. With the collected trace and analysis, Investigator identifies the control and data flow related to the cause of a failure, which helps developers fix the bug. We implemented a prototype of Investigator and evaluated it on real-world bugs. The results show that Investigator diagnoses these bugs effectively and efficiently while introducing a low performance overhead at runtime.

Publisher's Version
Guiding Greybox Fuzzing with Mutation Testing
Vasudev Vikram, Isabella Laybourn, Ao Li, Nicole Nair, Kelton OBrien, Rafaello Sanna, and Rohan Padhye
(Carnegie Mellon University, USA; Swarthmore College, USA; University of Minnesota, USA; University of Rochester, USA)
Greybox fuzzing and mutation testing are two popular but mostly independent fields of software testing research that have so far had limited overlap. Greybox fuzzing, generally geared towards searching for new bugs, predominantly uses code coverage for selecting inputs to save. Mutation testing is primarily used as a stronger alternative to code coverage in assessing the quality of regression tests; the idea is to evaluate tests for their ability to identify artificially injected faults in the target program. But what if we wanted to use greybox fuzzing to synthesize high-quality regression tests?
In this paper, we develop and evaluate Mu2, a Java-based framework for incorporating mutation analysis in the greybox fuzzing loop, with the goal of producing a test-input corpus with a high mutation score. Mu2 makes use of a differential oracle for identifying inputs that exercise interesting program behavior without causing crashes. This paper describes several dynamic optimizations implemented in Mu2 to overcome the high cost of performing mutation analysis with every fuzzer-generated input. These optimizations introduce trade-offs in fuzzing throughput and mutation-killing ability, which we evaluate empirically on five real-world Java benchmarks. Overall, variants of Mu2 are able to synthesize test-input corpora with a higher mutation score than the state-of-the-art Java fuzzer Zest.
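The core feedback loop can be sketched in a few lines of Python (Mu2 itself is a Java framework built on Zest, so this is a simplified illustration; generate_input, run_original, and run_mutant are hypothetical stand-ins): save an input only if it kills a mutant that no previously saved input has killed.

    def fuzz_for_mutation_score(generate_input, run_original, run_mutant,
                                mutants, iterations):
        corpus, killed = [], set()
        for _ in range(iterations):
            x = generate_input()
            reference = run_original(x)          # differential oracle baseline
            newly_killed = {m for m in mutants
                            if m not in killed and run_mutant(m, x) != reference}
            if newly_killed:                     # input exposes new behavior
                corpus.append(x)
                killed |= newly_killed
        return corpus, len(killed) / len(mutants)   # corpus and mutation score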

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Testing Automated Driving Systems by Breaking Many Laws Efficiently
Xiaodong Zhang, Wei Zhao, Yang Sun, Jun Sun, Yulong Shen, Xuewen Dong, and Zijiang Yang
(Xidian University, China; Singapore Management University, Singapore; GuardStrike, China)
An automated driving system (ADS), as the brain of an autonomous vehicle (AV), should be tested thoroughly ahead of deployment. An ADS must satisfy a complex set of rules to ensure road safety, e.g., the existing traffic laws and possibly future laws that are dedicated to AVs. To comprehensively test an ADS, we would like to systematically discover diverse scenarios in which a certain traffic law is violated. The challenge is that (1) there are many traffic laws (e.g., 13 testable articles in Chinese traffic laws and 16 testable articles in Singapore traffic laws, with 81 and 43 violation situations, respectively); and (2) many traffic laws are only relevant in specific, complicated scenarios.
Existing approaches to testing ADSs either focus on simple oracles such as no-collision or have limited capacity for generating diverse law-violating scenarios. In this work, we propose ABLE, a new ADS testing method inspired by the success of GFlowNet, which Aims to Break many Laws Efficiently by generating diverse scenarios. Different from vanilla GFlowNet, ABLE drives the testing process with dynamically updated testing objectives (based on a robustness semantics of signal temporal logic) as well as active learning, so as to effectively explore the vast search space. We evaluate ABLE based on Apollo and LGSVL, and the results show that ABLE outperforms the state of the art by violating 17% and 25% more laws when testing Apollo 6.0 and Apollo 7.0, respectively, most of which are hard-to-violate laws.
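The robustness semantics of signal temporal logic gives each scenario a quantitative score rather than a pass/fail verdict; the following small Python sketch of the standard semantics (our illustration, not ABLE's code) shows the idea: robustness above zero means the property holds, and values near zero mark scenarios close to a violation.

    def robustness_always(trace, margin):
        """rho(G phi) = minimum of phi's margin over all time steps."""
        return min(margin(state) for state in trace)

    def robustness_eventually(trace, margin):
        """rho(F phi) = maximum of phi's margin over all time steps."""
        return max(margin(state) for state in trace)

    # Example: "always keep the speed under the limit" for a simulated trace.
    trace = [{"speed": 8.0}, {"speed": 11.5}, {"speed": 9.0}]
    rho = robustness_always(trace, lambda s: 12.0 - s["speed"])
    # rho == 0.5: the law held, but the scenario came close to violating it.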

Publisher's Version
DeepAtash: Focused Test Generation for Deep Learning Systems
Tahereh Zohdinasab, Vincenzo Riccio, and Paolo Tonella
(USI Lugano, Switzerland)
When deployed in the operation environment, Deep Learning (DL) systems often experience the so-called development-to-operation (dev2op) data shift, which causes lower prediction accuracy on field data compared to that measured on the test set during development. To address the dev2op shift, developers must obtain new data with the newly observed features, as these are under-represented in the train/test set, and must use them to fine-tune the DL model so as to reach the desired accuracy level. In this paper, we address the issue of acquiring new data with the specific features observed in operation, which caused a dev2op shift, by proposing DeepAtash, a novel search-based focused testing approach for DL systems. DeepAtash targets a cell in the feature space, defined as a combination of feature ranges, to generate misbehaviour-inducing inputs with predefined features. Experimental results show that DeepAtash was able to generate up to 29× more targeted, failure-inducing inputs than the baseline approach. The inputs generated by DeepAtash were useful for significantly improving the quality of the original DL systems through fine-tuning, not only on data with the targeted features but, quite surprisingly, also on inputs drawn from the original distribution.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
SBDT: Search-Based Differential Testing of Certificate Parsers in SSL/TLS Implementations
Chu Chen, Pinghong Ren, Zhenhua Duan, Cong Tian, Xu Lu, and Bin Yu
(Qufu Normal University, China; Xidian University, China)
Certificate parsers, which are critical components of Secure Sockets Layer or Transport Layer Security (SSL/TLS) implementations, parse incomprehensible certificates into comprehensible inputs for certificate validators and humans. Thus, certificate parsers profoundly affect the decision-making of validators and humans, which in turn affects security. To guarantee the correctness of certificate parsers, we put forward SBDT, an approach for search-based differential testing of certificate parsers. SBDT begins with modeling certificate structures, mutation operations, and bounds. Based on the initial model, SBDT searches for the most promising model node and mutation operator for triggering discrepancies, and generates a certificate from the node and operator it finds. Then, SBDT feeds the certificate to certificate parsers and searches for multiple types of discrepancies after normalizing the parsers' outputs. Distinct discrepancies are employed as feedback to update and prune the model. SBDT starts the next iteration from the updated and pruned model, unless all nodes and mutation operators have been pruned due to reaching their upper bounds. Our work makes the following contributions: (1) to the best of our knowledge, this is the first time that testing of certificate parsers has been clearly distinguished from testing of certificate validators, which will facilitate accurate testing of certificate parsers and validators; (2) SBDT is the first systematic and efficient approach for differential testing of certificate parsers by searching, updating, and pruning models; and (3) we have implemented an open-source prototype tool of SBDT, and experimental results show that SBDT is effective and efficient in finding new bugs and enhancements of certificate parsers.

Publisher's Version
SmartState: Detecting State-Reverting Vulnerabilities in Smart Contracts via Fine-Grained State-Dependency Analysis
Zeqin Liao, Sicheng Hao, Yuhong Nan, and Zibin Zheng
(Sun Yat-sen University, China)
Smart contracts written in Solidity are widely used on different blockchain platforms such as Ethereum, TRON, and BNB Chain. One of the unique designs in Solidity smart contracts is the state-reverting mechanism for error handling and access control. Unfortunately, a number of recent security incidents showed that adversaries also utilize this mechanism to manipulate critical states of smart contracts and, hence, cause security consequences such as illegal profit gain and Denial-of-Service (DoS). In this paper, we refer to such vulnerabilities as State-Reverting Vulnerabilities (SRVs). Automatically identifying SRVs poses unique challenges, as it requires an in-depth analysis and understanding of the state-dependency relations in smart contracts.
This paper presents SmartState, a new framework for detecting state-reverting vulnerabilities in Solidity smart contracts via fine-grained state-dependency analysis. SmartState integrates a set of novel mechanisms to ensure its effectiveness. Particularly, SmartState extracts state dependencies from both contract bytecode and historical transactions; both are critical for inferring dependencies related to SRVs. Further, SmartState models the generic patterns of SRVs (i.e., profit gain and DoS) as SRV indicators, and hence effectively identifies SRVs based on the constructed state-dependency graph. To evaluate SmartState, we manually annotated a ground-truth dataset that contains 91 real-world SRVs. Evaluation results showed that SmartState achieves a precision of 87.23% and a recall of 89.13%. In addition, SmartState successfully identified 406 new SRVs from 47,351 real-world smart contracts. 11 of these SRVs are from popular smart contracts with high transaction volumes (i.e., top 2000). In total, our reported SRVs affect digital assets worth 428,600 USD.

Publisher's Version
ωTest: WebView-Oriented Testing for Android Applications
Jiajun Hu, Lili Wei, Yepang Liu, and Shing-Chi Cheung
(Hong Kong University of Science and Technology, China; McGill University, Canada; Southern University of Science and Technology, China)
WebView is a UI widget that helps integrate web applications into the native context of Android apps. It provides powerful mechanisms for bi-directional interactions between the native-end (Java) and the web-end (JavaScript) of an Android app. However, these interaction mechanisms are complicated and have induced various types of bugs. To mitigate the problem, various techniques have been proposed to detect WebView-induced bugs via dynamic analysis, which heavily relies on executing tests to explore WebView behaviors. Unfortunately, these techniques either require manual effort or adopt random test generation approaches, which are not able to effectively explore diverse WebView behaviors. In this paper, we study the problem of test generation for WebViews in Android apps. Effective test generation for WebViews requires identifying the essential program properties to be covered by the generated tests. To this end, we propose WebView-specific properties to characterize WebView behaviors, and devise a cross-language dynamic analysis method to identify these properties. We develop ωTest, a test generation technique that searches for event sequences covering the identified WebView-specific properties. An evaluation on 74 real-world open-/closed-source Android apps shows that ωTest can cover diverse WebView behaviors and detect WebView-induced bugs effectively. ωTest detected 36 previously-unknown bugs. Of the 22 bugs that we reported to the app developers, 13 were confirmed, 9 of which were fixed.

Publisher's Version Published Artifact Artifacts Available
ModelObfuscator: Obfuscating Model Information to Protect Deployed ML-Based Systems
Mingyi Zhou, Xiang Gao, Jing Wu, John Grundy, Xiao Chen, Chunyang Chen, and Li Li
(Monash University, Australia; Beihang University, China)
More and more edge devices and mobile apps are leveraging deep learning (DL) capabilities. Deploying such models on devices – referred to as on-device models – rather than as remote cloud-hosted services has gained popularity because it avoids transmitting users' data off the device and achieves fast response times. However, on-device models can be easily attacked, as they can be accessed by unpacking the corresponding apps, leaving the model fully exposed to attackers. Recent studies show that attackers can easily generate white-box-like attacks for an on-device model or even invert its training data. To protect on-device models from white-box attacks, we propose a novel technique called model obfuscation. Specifically, model obfuscation hides and obfuscates the key information – structure, parameters and attributes – of models by renaming, parameter encapsulation, neural structure obfuscation, shortcut injection, and extra layer injection. We have developed a prototype tool, ModelObfuscator, to automatically obfuscate on-device TFLite models. Our experiments show that the proposed approach can dramatically improve model security by significantly increasing the difficulty of parsing the models' inner information, without increasing the latency of the DL models. Our proposed on-device model obfuscation has the potential to become a fundamental technique for on-device model deployment. Our prototype tool is publicly available at https://github.com/zhoumingyi/ModelObfuscator.

Publisher's Version Info
AGORA: Automated Generation of Test Oracles for REST APIs
Juan C. Alonso, Sergio Segura, and Antonio Ruiz-Cortés
(University of Seville, Spain)
Test case generation tools for REST APIs have grown in number and complexity in recent years. However, their advanced capabilities for automated input generation contrast with the simplicity of their test oracles, which limit the types of failures they can detect to crashes, regressions, and violations of the API specification or design best practices. In this paper, we present AGORA, an approach for the automated generation of test oracles for REST APIs through the detection of invariants—properties of the output that should always hold. In practice, AGORA aims to learn the expected behavior of an API by analyzing previous API requests and their corresponding responses. For this, we extended the Daikon tool for dynamic detection of likely invariants, including the definition of new types of invariants and the implementation of an instrumenter called Beet. Beet converts any OpenAPI specification and a collection of API requests and responses to a format processable by Daikon. As a result, AGORA currently supports the detection of up to 105 different types of invariants in REST APIs. AGORA achieved a total precision of 81.2% when tested on a dataset of 11 operations from 7 industrial APIs. More importantly, the test oracles generated by AGORA detected 6 out of every 10 errors systematically seeded in the outputs of the APIs under test. Additionally, AGORA revealed 11 bugs in APIs with millions of users: Amadeus, GitHub, Marvel, OMDb and YouTube. Our reports have guided developers in improving their APIs, including bug fixes and documentation updates in GitHub. Since it operates in black-box mode, AGORA can be seamlessly integrated into existing API testing tools.
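To illustrate the kind of oracle involved, here is a toy Python sketch of checking one candidate response invariant against recorded request/response pairs; the real tool extends Daikon and supports around 105 invariant types, and the field names below are hypothetical.

    def holds_on_observations(invariant, observations):
        """Keep an invariant only if it holds on every observed pair."""
        return all(invariant(req, resp) for req, resp in observations)

    # Candidate invariant: the API never returns more items than the
    # `limit` query parameter asked for.
    candidate = lambda req, resp: len(resp["items"]) <= req["params"]["limit"]

    observations = [
        ({"params": {"limit": 5}}, {"items": [1, 2, 3]}),
        ({"params": {"limit": 2}}, {"items": [1, 2]}),
    ]
    assert holds_on_observations(candidate, observations)
    # Once mined, such an invariant serves as a test oracle for future responses.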

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable
Fuzzing Embedded Systems using Debug Interfaces
Max Eisele, Daniel Ebert, Christopher Huth, and Andreas Zeller
(Robert Bosch, Germany; Saarland University, Germany; CISPA Helmholtz Center for Information Security, Germany)
Fuzzing embedded systems is hard. Their key components – microcontrollers – are highly diverse and cannot be easily virtualized; their software may not be changed or instrumented. However, we observe that many, if not most, microcontrollers feature a debug interface through which a debug probe (typically controllable via GDB, the GNU debugger) can set a limited number of hardware breakpoints. Using these, we extract partial coverage feedback even for uninstrumented binary code; and thus enable effective fuzzing for embedded systems through a generic, widespread mechanism. In its evaluation on four different microcontroller boards, our prototypical implementation GDBFuzz quickly reaches high code coverage and detects known and new vulnerabilities. As it can be applied to any program and system that GDB can debug, GDBFuzz is one of the least demanding and most versatile coverage-guided fuzzers.
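The breakpoint-rotation idea can be sketched as follows; this is our illustration rather than GDBFuzz's code, and debugger is a hypothetical GDB-like controller, not a real API. With only a handful of hardware breakpoints, each fuzzing iteration places them on basic blocks not yet observed and treats every hit as new, partial coverage.

    def fuzz_iteration(debugger, input_bytes, unseen_blocks, covered, max_bps=4):
        # Rotate the scarce hardware breakpoints over code not yet observed.
        targets = sorted(unseen_blocks)[:max_bps]
        for addr in targets:
            debugger.set_breakpoint(addr)
        hits = set(debugger.run(input_bytes))   # breakpoint addresses hit
        for addr in targets:
            debugger.delete_breakpoint(addr)
        covered |= hits
        unseen_blocks -= hits
        return bool(hits)                       # partial coverage feedback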

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Splendor: Static Detection of Stored XSS in Modern Web Applications
He Su, Feng Li, Lili Xu, Wenbo Hu, Yujie Sun, Qing Sun, Huina Chao, and Wei Huo
(Institute of Information Engineering at Chinese Academy of Sciences, China)
In modern websites, stored Cross-Site Scripting (XSS) is the most dangerous kind of XSS vulnerability, as the payload can be stored in the web system and triggered directly by the victim. The database (DB), as the most commonly used storage medium for website data, is therefore also the most common place where stored XSS occurs. Due to the modularity of modern programming architectures, complex underlying database operations are often encapsulated and abstracted as a Data Access Layer (DAL) that provides unified data access services to the business layer. The heavy use of Object-Oriented (OO) and dynamic language features involved in this encapsulation makes it increasingly challenging for static taint analysis tools to understand how tainted data flows between the source code and the exact locations in the database.
In this paper, we propose the first static analysis framework for detecting stored XSS in modern web applications that use a DAL, and we implement a prototype, Splendor, for PHP code analysis. The highlight of the framework is a heuristic but precise token-matching method that locates the flows of tainted data between the database and the source code. The precisions of the identified DB read and write (R/W) locations are 91.3% and 82.6%, respectively. With the identified R/W locations, the disconnected taint paths can be statically stitched together to obtain a complete taint propagation path for stored XSS. Comparisons with existing works on 5 real-world applications and large-scale experiments on PHP web applications from GitHub show that Splendor significantly outperforms both the state-of-the-art static and dynamic approaches on stored-XSS detection, and it detects 17 zero-day vulnerabilities.

Publisher's Version Info
Applying and Extending the Delta Debugging Algorithm for Elevator Dispatching Algorithms (Experience Paper)
Pablo Valle, Aitor Arrieta, and Maite Arratibel
(Mondragon University, Spain; Orona, Spain)
Elevator systems are one kind of Cyber-Physical System (CPS), and as such, their test cases are usually complex and long-running. This is mainly because realistic test scenarios are employed (e.g., for testing elevator dispatching algorithms, typically a full day of passengers traveling through a system of elevators is used). In such a context, however, when a failure needs to be reproduced, it is highly beneficial to provide the software developers with a minimal test input. This way, analyzing and trying to localize the root cause of the failure is easier and more agile. Delta debugging has been found to be an efficient technique for reducing failure-inducing test inputs. In this paper, we enhance this technique by first monitoring the environment in which the CPS operates as well as its physical states. With the monitored information, we search for stable states of the CPS during the execution of the simulation. In a second step, we use the identified stable states to help the delta debugging algorithm isolate the failure-inducing test inputs more efficiently.
We report our experience of applying our approach to an industrial elevator dispatching algorithm. An empirical evaluation carried out with real operational data from a real installation of elevators suggests that the proposed environment-wise delta debugging algorithm is 1.3 to 1.8 times faster than traditional delta debugging, while producing a larger reduction in the failure-inducing test inputs. The results provided by the different delta debugging algorithm versions we implemented are qualitatively assessed with domain experts. This assessment provides new insights and lessons learned, such as potential applications of the delta debugging algorithm beyond debugging.
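For reference, the baseline the paper builds on is the classic ddmin algorithm; below is a minimal complement-based Python sketch of it (the paper's environment-wise extension, which aligns reduction steps with monitored stable states of the CPS, is not shown).

    def ddmin(test_input, fails):
        """Reduce test_input (a list of events) toward a minimal failing input.

        fails(candidate) must return True iff the candidate still triggers
        the failure being debugged.
        """
        n = 2
        while len(test_input) >= 2:
            chunk = len(test_input) // n
            subsets = [test_input[i:i + chunk]
                       for i in range(0, len(test_input), chunk)]
            reduced = False
            for i in range(len(subsets)):
                # Try removing one subset: keep the complement if it still fails.
                complement = [e for j, s in enumerate(subsets) if j != i
                              for e in s]
                if fails(complement):
                    test_input, n = complement, max(n - 1, 2)
                    reduced = True
                    break
            if not reduced:
                if n >= len(test_input):   # finest granularity reached
                    break
                n = min(n * 2, len(test_input))
        return test_input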

Publisher's Version
Finding Short Slow Inputs Faster with Grammar-Based Search
Ziyad Alsaeed and Michal Young
(Qassim University, Saudi Arabia; University of Oregon, USA)
Recent research has shown that mutational search with appropriate instrumentation can generate short inputs that demonstrate performance issues. Another thread of fuzzing research has shown that substituting subtrees from a forest of derivation trees is an effective grammar-based fuzzing technique for finding deep semantic bugs. We combine performance fuzzing with grammar-based search by generating length-limited derivation trees in which each subtree is labeled with its length. In addition, we use performance instrumentation feedback to guide the search. In contrast to fuzzing for security issues, for which fuzzing campaigns of many hours or even weeks can be appropriate, we focus on searches that are short enough (up to an hour with modest computational resources) to be part of a routine incremental test process. We have evaluated combinations of these approaches, with baselines including the best prior performance fuzzer. No single search technique dominates across all examples, but both Monte Carlo tree search and length-limited tree hybridization perform consistently well on example applications in which semantic performance bugs can be found with syntactically correct input. In the course of our evaluation we discovered a hang bug in LunaSVG, which the developers have acknowledged and corrected.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Transforming Test Suites into Croissants
Yang Chen, Alperen Yildiz, Darko Marinov, and Reyhaneh Jabbarvand
(University of Illinois at Urbana-Champaign, USA; Sabanci University, Turkey)
Software developers often rely on regression testing to ensure that recent changes to the source code do not introduce bugs. Flaky tests, which non-deterministically pass or fail regardless of any change to the code, can negatively impact the effectiveness of regression testing. While the state of the art is advancing techniques for test-flakiness detection and mitigation, the community is missing a systematic approach for generating high-quality benchmarks of flaky tests with which to compare the effectiveness of such techniques. Inspired by the power of mutation testing in evaluating the fault-detection ability of tests, this paper proposes Croissant, a framework for injecting flakiness into test suites to assess the effectiveness of test-flakiness detection tools in finding these tests. Croissant implements 18 flakiness-inducing mutation operators. We designed these operators to allow controlling the non-determinism involved in flakiness, i.e., making many mutants deterministically pass or fail in order to observe flaky behavior. Our extensive empirical evaluation of Croissant on the test suites of 15 real-world projects confirms the ability of the designed mutation operators to generate high-quality mutants, and their effectiveness in challenging test-flakiness detection tools to reveal flaky tests.
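As a conceptual illustration of a flakiness-inducing mutation operator (Croissant itself targets Java test suites, so this Python analogue is ours), the operator below injects an order-dependency by routing an assertion through an unordered set:

    def mutate_to_order_dependent(expected_items):
        """Return a 'test' whose pass/fail depends on set iteration order."""
        def flaky_test():
            observed = set(expected_items)
            # A deterministic test would sort; the mutant asserts on raw order.
            assert list(observed)[0] == expected_items[0]
        return flaky_test

    flaky = mutate_to_order_dependent(["alpha", "beta", "gamma"])
    # Across interpreter runs, Python's string-hash randomization changes the
    # set's iteration order, so flaky() passes in some runs and raises
    # AssertionError in others: exactly the nondeterminism a test-flakiness
    # detector must catch.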

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Tai-e: A Developer-Friendly Static Analysis Framework for Java by Harnessing the Good Designs of Classics
Tian Tan and Yue Li
(Nanjing University, China)
Static analysis is a mature field with applications to bug detection, security analysis, program understanding, optimization, and more. To facilitate these applications, static analysis frameworks play an essential role by providing a series of fundamental services such as intermediate representation (IR) generation, control flow graph construction, and points-to/alias information computation. However, although static analysis has made great strides and several well-known frameworks have emerged in this field over the past decades, these frameworks are not that easy to learn and use for the developers who rely on them to create and implement analyses. In that sense, it is far from trivial to build a developer-friendly static analysis framework, because compared to the knowledge required for static analysis itself, we have significantly less knowledge about designing and implementing static analysis frameworks.
In this work, we take a step forward by discussing the design trade-offs for the crucial components of a static analysis framework for Java, and we select the designs by following the HGDC (Harnessing the Good Designs of Classics) principle: for each crucial component of a static analysis framework, we compare the design choices made for it (possibly) by different classic frameworks such as Soot, Wala, Doop, SpotBugs and Checker, and choose an arguably more appropriate one; if none is good enough, we propose a better design. These selected or newly proposed designs finally constitute Tai-e, a new static analysis framework for Java, which has been implemented from scratch. Tai-e is novel in the design of several aspects, such as its IR, pointer analysis, and the development of new analyses, leading to a developer-friendly (easy-to-learn and easy-to-use) analysis framework. To the best of our knowledge, this is the first work that systematically explores the designs and implementations of various static analysis frameworks for Java. We expect it to provide useful material and viewpoints for building better static analysis infrastructures, and we hope that it draws more attention from the community to this challenging but tangible topic.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Improving Binary Code Similarity Transformer Models by Semantics-Driven Instruction Deemphasis
Xiangzhe Xu, Shiwei Feng, Yapeng Ye, Guangyu Shen, Zian Su, Siyuan Cheng, Guanhong Tao, Qingkai Shi, Zhuo Zhang, and Xiangyu Zhang
(Purdue University, USA)
Given a function in binary executable form, binary code similarity analysis determines a set of similar functions from a large pool of candidate functions. These similar functions are usually compiled from the same source code with different compilation setups. Such analysis has a large number of applications, such as malware detection, code clone detection, and automatic software patching. The state-of-the-art methods utilize complex Deep Learning models such as Transformer models. We observe that these models suffer from undesirable instruction distribution biases caused by specific compiler conventions. We develop a novel technique to detect such biases and repair them by removing the corresponding instructions from the dataset and fine-tuning the models. This entails synergy between Deep Learning model analysis and program analysis. Our results show that we can substantially improve the state-of-the-art models' performance by up to 14.4% in the most challenging cases, where test data may be out of the distribution of the training data.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Data Constraint Mining for Automatic Reconciliation Scripts Generation
Tianxiao Wang, Chen Zhi, Xiaoqun Zhou, Jinjie Wu, Jianwei Yin, and Shuiguang Deng
(Zhejiang University, China; Alibaba-Zhejiang University Joint Institute of Frontier Technologies, China; Alibaba Group, China)
Fund loss is an increasingly critical problem caused by the misbehavior of software, especially on fintech and e-commerce platforms. Data reconciliation is one of the most commonly used approaches for detecting and preventing fund loss, by executing reconciliation scripts on data storage systems (e.g., database and cache systems). The core of reconciliation scripts is the data constraints, which can be expressed as implications with two parts: preconditions and assertions. However, due to the complexity and diversity of business, the construction of data constraints and reconciliation scripts usually relies heavily on business experts. To this end, we propose AutoReconciler to mine data constraints from business data and generate reconciliation scripts automatically. It can mine assertions via enhanced symbolic regression, discover preconditions via association rule mining, and generate reconciliation scripts in SQL form. We have performed extensive experiments on synthesized data. The results show that our approach outperforms the baseline by a large margin (an average improvement in precision and recall of 22.1% and 51.6%, respectively), especially for complex data constraints. Our solution has been implemented, deployed, and adopted in production, and we further conducted several case studies to confirm the benefits of our solution in industrial scenarios.
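A toy Python sketch of the final step, turning a mined constraint (precondition implies assertion) into a reconciliation script in SQL; the table and column names are hypothetical, and this is our illustration of the idea rather than AutoReconciler's generator.

    def reconciliation_sql(table, precondition, assertion):
        """Mined constraint: precondition => assertion. Report violating rows."""
        return (f"SELECT * FROM {table} "
                f"WHERE ({precondition}) AND NOT ({assertion});")

    # e.g., a constraint mined from order data: paid orders must reconcile.
    print(reconciliation_sql(
        "orders",
        "status = 'PAID'",
        "paid_amount = item_total - discount + shipping_fee",
    ))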

Publisher's Version
Guided Retraining to Enhance the Detection of Difficult Android Malware
Nadia Daoudi, Kevin Allix, Tegawendé F. Bissyandé, and Jacques Klein
(University of Luxembourg, Luxembourg; CentraleSupélec, France)
The popularity of Android OS has made it an appealing target for malware developers. To evade detection, including by ML-based techniques, attackers invest in creating malware that closely resembles legitimate apps, challenging the state of the art with difficult-to-detect samples. In this paper, we propose Guided Retraining, a supervised representation learning-based method for boosting the performance of malware detectors. To that end, we first split the experimental dataset into subsets of “easy” and “difficult” samples, where difficulty is associated with the prediction probabilities yielded by a malware detector. For the subset of “easy” samples, the base malware detector is used to make the final predictions, since the error rate on that subset is low by construction. Our work targets the second subset, containing “difficult” samples, for which the prediction probabilities indicate that the classifier is not confident and the error rate is high. We apply our Guided Retraining method to these difficult samples to improve their classification. Guided Retraining leverages the correct predictions and the errors made by the base malware detector to guide the retraining process. It learns new embeddings of the difficult samples using Supervised Contrastive Learning and trains an auxiliary classifier for the final predictions. We validate our method on four state-of-the-art Android malware detection approaches using over 265k malware and benign apps. Experimental results show that Guided Retraining can boost state-of-the-art detectors by eliminating up to 45.19% of the prediction errors that they make on difficult samples. We further note that our method is generic and designed to enhance the performance of binary classifiers for tasks beyond Android malware detection.
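The first step, splitting by prediction confidence, is easy to sketch in Python; the confidence band below is illustrative (not the paper's thresholds), and the subsequent supervised contrastive retraining is not shown.

    def split_by_difficulty(samples, predict_proba, low=0.2, high=0.8):
        easy, difficult = [], []
        for x in samples:
            p_malware = predict_proba(x)    # base detector's probability
            if low <= p_malware <= high:
                difficult.append(x)         # low confidence: retrain on these
            else:
                easy.append(x)              # confident: keep base predictions
        return easy, difficult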

Publisher's Version
DeFiTainter: Detecting Price Manipulation Vulnerabilities in DeFi Protocols
Queping Kong, Jiachi Chen, Yanlin Wang, Zigui Jiang, and Zibin Zheng
(Sun Yat-sen University, China)
DeFi protocols are programs that manage high-value digital assets on blockchain. The price manipulation vulnerability is one of the common vulnerabilities in DeFi protocols, which allows attackers to gain excessive profits by manipulating token prices. In this paper, we propose DeFiTainter, an inter-contract taint analysis framework for detecting price manipulation vulnerabilities. DeFiTainter features two innovative mechanisms to ensure its effectiveness. The first mechanism is to construct a call graph for inter-contract taint analysis by restoring call information, not only from code constants but also from contract storage and function parameters. The second mechanism is a high-level semantic induction tailored for detecting price manipulation vulnerabilities, which accurately identifies taint sources and sinks and tracks taint data across contracts. Extensive evaluation of real-world incidents and high-value DeFi protocols shows that DeFiTainter outperforms existing approaches and achieves state-of-the-art performance with a precision of 96% and a recall of 91.3% in detecting price manipulation vulnerabilities. Furthermore, DeFiTainter uncovers three previously undisclosed price manipulation vulnerabilities.

Publisher's Version
Beyond “Protected” and “Private”: An Empirical Security Analysis of Custom Function Modifiers in Smart Contracts
Yuzhou Fang, Daoyuan Wu, Xiao Yi, Shuai Wang, Yufan Chen, Mengjie Chen, Yang Liu, and Lingxiao Jiang
(Hong Kong University of Science and Technology, China; Chinese University of Hong Kong, China; Xidian University, China; Mask Network, China; Nanyang Technological University, Singapore; Singapore Management University, Singapore)
A smart contract is a piece of application-layer code running on blockchain ledgers and it provides programmatic logic via transaction-based execution of pre-defined functions. Smart contract functions are by default invokable by any party. To safeguard them, the mainstream smart contract language, i.e., Solidity of the popular Ethereum blockchain, proposed a unique language-level keyword called “modifier,” which allows developers to define custom function access control policies beyond the traditional “protected” and “private” modifiers in classic programming languages.
In this paper, we aim to conduct a large-scale security analysis of the modifiers used in real-world Ethereum smart contracts. To achieve this, we design and implement a novel smart contract analysis tool called SoMo. Its main objective is to identify insecure modifiers that can be bypassed from one or more unprotected smart contract functions. This is challenging because of the complicated relationship between modifiers and their variables/functions and the ambiguity of attacker-accessible entry functions. To overcome them, we first propose a new structure, the Modifier Dependency Graph (MDG), to connect all the modifier-related control/data flows. Over MDGs, we then model system variables, generate symbolic path constraints, and iteratively test each candidate entry function. Our extensive evaluation shows that SoMo outperforms the state-of-the-art SPCon tool by detecting all its true positives and correctly avoiding 9 out of 11 false positives. It also achieves high precision of 91.2% when analyzing a large dataset of 62,464 contracts, over 400 of which were identified with bypassable modifiers. Our analysis further reveals three interesting security findings about modifiers and nine major types of modifier usage in the wild. SoMo has been integrated into an online security scanning service, MetaScan.

Publisher's Version
Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing
Julia Kaiwen Lau, Kelvin Kai Wen Kong, Julian Hao Yong, Per Hoong Tan, Zhou Yang, Zi Qian Yong, Joshua Chern Wey Low, Chun Yong Chong, Mei Kuan Lim, and David Lo
(Monash University Malaysia, Malaysia; Singapore Management University, Singapore)
Recent studies have proposed the use of Text-To-Speech (TTS) systems to automatically synthesise speech test cases at scale and uncover a large number of failures in ASR systems. However, the failures uncovered by synthetic test cases may not reflect the actual performance of an ASR system when it transcribes human audio; we refer to these as false alarms. Given a failed test case synthesised by a TTS system, which consists of TTS-generated audio and the corresponding ground-truth text, we feed human audio stating the same text to the ASR system. If the human audio is correctly transcribed, an instance of a false alarm is detected.
In this study, we investigate false alarm occurrences in five popular ASR systems using synthetic audio generated from four TTS systems and human audio obtained from two commonly used datasets. Our results show that the fewest false alarms are identified when testing Deepspeech, and the most when testing Wav2vec2. On average, false alarm rates range from 21% to 34% across the five ASR systems. Among the four TTS systems used, Google TTS produces the fewest false alarms (17%) and Espeak TTS the most (32%). Additionally, we build a false alarm estimator that flags potential false alarms, which achieves promising results: a precision of 98.3%, a recall of 96.4%, an accuracy of 98.5%, and an F1 score of 97.3%. Our study provides insight into the appropriate selection of TTS systems for generating high-quality speech to test ASR systems. Additionally, a false alarm estimator can be a way to minimise the impact of false alarms and help developers choose suitable test inputs when evaluating ASR systems. The source code used in this paper is publicly available on GitHub at https://github.com/julianyonghao/FAinASRtest.
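The false-alarm check described above can be sketched in Python with a standard word error rate (WER) computed via word-level edit distance; this is an illustration of the procedure, not the study's actual implementation, and transcribe is a hypothetical stand-in for the ASR system.

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate via Levenshtein distance over words."""
        r, h = reference.lower().split(), hypothesis.lower().split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,       # deletion
                              d[i][j - 1] + 1,       # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(r)][len(h)] / max(len(r), 1)

    def is_false_alarm(ground_truth, transcribe, human_audio):
        # The TTS-based test failed; it is a false alarm if human speech of
        # the same sentence is transcribed perfectly.
        return wer(ground_truth, transcribe(human_audio)) == 0.0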

Publisher's Version Info
A Tale of Two Approximations: Tightening Over-Approximation for DNN Robustness Verification via Under-Approximation
Zhiyi Xue, Si Liu, Zhaodi Zhang, Yiting Wu, and Min Zhang
(East China Normal University, China; ETH Zurich, Switzerland; Chengdu Education Research Institute, China)
The robustness of deep neural networks (DNNs) is crucial to the hosting system's reliability and security. Formal verification has been demonstrated to be effective in providing provable robustness guarantees. To improve its scalability, over-approximating the non-linear activation functions in DNNs by linear constraints has been widely adopted, which transforms the verification problem into an efficiently solvable linear programming problem. Many efforts have been dedicated to defining the so-called tightest approximations to reduce the overestimation imposed by over-approximation. In this paper, we study existing approaches and identify a dominant factor in defining tight approximations, namely the approximation domain of the activation function. We find that tight approximations defined on approximation domains may not be as tight as ones defined on their actual domains, yet existing approaches all rely only on approximation domains. Based on this observation, we propose a novel dual-approximation approach to tighten over-approximations, leveraging an activation function's underestimated domain to define tight approximation bounds. We implement our approach in a tool called DualApp, with two complementary algorithms based respectively on Monte Carlo simulation and gradient descent. We assess it on a comprehensive benchmark of DNNs with different architectures. Our experimental results show that DualApp significantly outperforms the state-of-the-art approaches, with a 100%–1000% improvement on the verified robustness ratio and 10.64% on average (up to 66.53%) on the certified lower bound.
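A sketch of the Monte Carlo half of this idea: sample concrete inputs from the perturbation region and propagate them through the network to under-approximate the actual input range of an activation, which can then guide where approximation bounds are placed. Network access is abstracted behind preactivation, a hypothetical helper; this is our illustration, not DualApp's code.

    import numpy as np

    def sampled_activation_domain(x0, eps, preactivation,
                                  n_samples=1000, seed=0):
        rng = np.random.default_rng(seed)
        lo, hi = np.inf, -np.inf
        for _ in range(n_samples):
            # Sample a point in the L-infinity ball of radius eps around x0.
            x = x0 + rng.uniform(-eps, eps, size=x0.shape)
            z = preactivation(x)        # the neuron's input on this sample
            lo, hi = min(lo, z), max(hi, z)
        return lo, hi   # underestimated (inner) domain of the activation input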

Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional
SlipCover: Near Zero-Overhead Code Coverage for Python
Juan Altmayer Pizzorno and Emery D. Berger
(University of Massachusetts Amherst, USA)
Coverage analysis is widely used but can suffer from high overhead. This overhead is especially acute in the context of Python, which is already notoriously slow (a recent study observes a roughly 30x slowdown vs. native code). We find that the state-of-the-art coverage tool for Python, coverage.py, introduces a median overhead of 180% with the standard Python interpreter. Slowdowns are even more extreme when using PyPy, a JIT-compiled Python implementation, with coverage.py imposing a median overhead of 1,300%. This performance degradation reduces the utility of coverage analysis in most use cases, including testing and fuzzing, and precludes its use in deployment. This paper presents SlipCover, a novel, near-zero-overhead coverage analyzer for Python. SlipCover works without modifications to either the Python interpreter or PyPy. It first processes a program's AST to accurately identify all branches and lines. SlipCover then dynamically rewrites Python bytecodes to add lightweight instrumentation to each identified branch and line. At run time, SlipCover periodically de-instruments already-covered lines and branches. The result is extremely low overhead, a median of just 5%, making SlipCover suitable for use in deployment. We show that its efficiency can translate into significant speedups for coverage-based clients. As a proof of concept, we integrate SlipCover into TPBT, a targeted property-based testing system, and observe a 22x speedup.
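SlipCover itself works by rewriting bytecode, but the same "stop paying for what you have already measured" idea can be sketched with the sys.monitoring API that ships with Python 3.12 and later: a LINE callback records coverage once, then returns DISABLE so the interpreter stops raising that event for the line.

    import sys

    TOOL_ID = sys.monitoring.COVERAGE_ID
    covered = set()

    def on_line(code, line_number):
        covered.add((code.co_filename, line_number))
        # De-instrument this line after its first hit: near-zero overhead
        # on every subsequent execution.
        return sys.monitoring.DISABLE

    sys.monitoring.use_tool_id(TOOL_ID, "sketch-coverage")
    sys.monitoring.register_callback(TOOL_ID, sys.monitoring.events.LINE, on_line)
    sys.monitoring.set_events(TOOL_ID, sys.monitoring.events.LINE)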

Publisher's Version Info
Systematic Testing of the Data-Poisoning Robustness of KNN
Yannan Li, Jingbo Wang, and Chao Wang
(University of Southern California, USA)
Data poisoning aims to compromise a machine learning based software component by contaminating its training set to change its prediction results for test inputs. Existing methods for deciding data-poisoning robustness suffer from either poor accuracy or long running times and, more importantly, they can only certify some of the truly-robust cases, but remain inconclusive when certification fails. In other words, they cannot falsify the truly-non-robust cases. To overcome this limitation, we propose a systematic testing based method, which can falsify as well as certify data-poisoning robustness for a widely used supervised-learning technique named k-nearest neighbors (KNN). Our method is faster and more accurate than the baseline enumeration method, due to a novel over-approximate analysis in the abstract domain that quickly narrows down the search space, combined with systematic testing in the concrete domain that finds the actual violations. We have evaluated our method on a set of supervised-learning datasets. Our results show that the method significantly outperforms state-of-the-art techniques, and can decide data-poisoning robustness of KNN prediction results for most of the test inputs.

Publisher's Version
GrayC: Greybox Fuzzing of Compilers and Analysers for C
Karine Even-Mendoza, Arindam Sharma, Alastair F. Donaldson, and Cristian Cadar
(King’s College London, UK; Imperial College London, UK)
Fuzzing of compilers and code analysers has led to a large number of bugs being found and fixed in widely-used frameworks such as LLVM, GCC and Frama-C. Most such fuzzing techniques have taken a blackbox approach, with compilers and code analysers starting to become relatively immune to such fuzzers.
We propose a coverage-directed, mutation-based approach for fuzzing C compilers and code analysers, inspired by the success of this type of greybox fuzzing in other application domains. The main challenge of applying mutation-based fuzzing in this context is that naive mutations are likely to generate programs that do not compile. Such programs are not useful for finding deep bugs that affect optimisation, analysis, and code generation routines.
We have designed a novel greybox fuzzer for C compilers and analysers by developing a new set of mutations to target common C constructs, and transforming fuzzed programs so that they produce meaningful output, allowing differential testing to be used as a test oracle, and paving the way for fuzzer-generated programs to be integrated into compiler and code analyser regression test suites.
We have implemented our approach in GrayC, a new open-source LibFuzzer-based tool, and present experiments showing that it provides more coverage on the middle- and back-end stages of compilers and analysers compared to other mutation-based approaches, including Clang-Fuzzer, PolyGlot, and a technique similar to LangFuzz.
We have used GrayC to identify 30 confirmed compiler and code analyser bugs: 25 previously unknown bugs (with 22 of them already fixed in response to our reports) and 5 confirmed bugs reported independently shortly before we found them. A further 3 bug reports are under investigation. Apart from the results above, we have contributed 24 simplified versions of coverage-enhancing test cases produced by GrayC to the Clang/LLVM test suite, targeting 78 previously uncovered functions in the LLVM codebase.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Enhancing REST API Testing with NLP Techniques
Myeongsoo Kim, Davide Corradini, Saurabh Sinha, Alessandro Orso, Michele Pasqua, Rachel Tzoref-Brill, and Mariano Ceccato
(Georgia Institute of Technology, USA; University of Verona, Italy; IBM Research, USA; IBM Research, Israel)
RESTful services are commonly documented using OpenAPI specifications. Although numerous automated testing techniques have been proposed that leverage the machine-readable part of these specifications to guide test generation, their human-readable part has been mostly neglected. This is a missed opportunity, as natural language descriptions in the specifications often contain relevant information, including example values and inter-parameter dependencies, that can be used to improve test generation. In this spirit, we propose NLPtoREST, an automated approach that applies natural language processing techniques to assist REST API testing. Given an API and its specification, NLPtoREST extracts additional OpenAPI rules from the human-readable part of the specification. It then enhances the original specification by adding these rules to it. Testing tools can transparently use the enhanced specification to perform better test case generation. Because rule extraction can be inaccurate, due to either the intrinsic ambiguity of natural language or mismatches between documentation and implementation, NLPtoREST also incorporates a validation step aimed at eliminating spurious rules. We performed studies to assess the effectiveness of our rule extraction and validation approach, and the impact of enhanced specifications on the performance of eight state-of-the-art REST API testing tools. Our results are encouraging and show that NLPtoREST can extract many relevant rules with high accuracy, which can in turn significantly improve testing tools’ performance.
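
As a flavour of the rule-extraction step, the sketch below (our illustration, not NLPtoREST's pipeline; the spec fragment and regular expression are assumptions) mines example values from description fields and writes them back into the specification as machine-readable example entries:

    import re

    spec = {"parameters": [
        {"name": "lang", "in": "query",
         "description": "Language code, e.g. 'en' or 'de'."},
        {"name": "limit", "in": "query",
         "description": "Page size, for example 50."},
    ]}

    # Hypothetical pattern for one rule type: example values in prose.
    EXAMPLE_RE = re.compile(r"(?:e\.g\.|for example)\s+'?([\w-]+)'?", re.I)

    for param in spec["parameters"]:
        m = EXAMPLE_RE.search(param.get("description", ""))
        if m:
            param["example"] = m.group(1)   # enhanced, still-valid OpenAPI

    print([(p["name"], p.get("example")) for p in spec["parameters"]])
    # -> [('lang', 'en'), ('limit', '50')]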

Publisher's Version Published Artifact Artifacts Available
Automated Generation of Security-Centric Descriptions for Smart Contract Bytecode
Yu Pan, Zhichao Xu, Levi Taiji Li, Yunhe Yang, and Mu Zhang
(University of Utah, USA)
Smart contract and DApp users are taking great risks, as they do not obtain necessary knowledge that can help them avoid using vulnerable and malicious contract code. In this paper, we develop a novel system Tx2TXT that can automatically create security-centric textual descriptions directly from smart contract bytecode. To capture the security aspect of financial applications, we formally define a funds transfer graph to model critical funds flows in smart contracts. To ensure the expressiveness and conciseness of the descriptions derived from these graphs, we employ a GCN-based model to identify security-related condition statements and selectively add them to our graph models. To convert low-level bytecode instructions to human-readable textual scripts, we leverage robust API signatures to recover bytecode semantics. We have evaluated Tx2TXT on 890 well-labeled vulnerable, malicious and safe contracts where developer-crafted descriptions are available. Our results have shown that Tx2TXT outperforms state-of-the-art solutions and can effectively help end users avoid risky contracts.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Toward Automated Detecting Unanticipated Price Feed in Smart Contract
Yifan Mo, Jiachi Chen, Yanlin Wang, and Zibin Zheng
(Sun Yat-sen University, China)
Decentralized finance (DeFi) based on smart contracts has reached a total value locked (TVL) of over USD 200 billion in 2022. In DeFi ecosystems, price oracles play a critical role in providing real-time price feeds for cryptocurrencies to ensure accurate asset pricing in smart contracts. However, price oracles also face security issues, including the possibility of unanticipated price feeds, which can lead to imbalances in debt and assets in the DeFi protocol. Unfortunately, existing solutions cannot effectively combine transactions and code for real-time monitoring of price oracles.
To address this limitation, we first categorize price oracles as either DON oracles, DEX oracles, or internal oracles based on trusted parties, and analyze their security risks, data sources, price duration, and query fees. Then, we propose VeriOracle, a formal verification framework for the automated detection of unanticipated price feeds in smart contracts. VeriOracle can deploy a formal semantic model of the price oracle on the blockchain to detect the status of smart contracts and identify unanticipated price feed transactions in real time. We apply VeriOracle to verify over 500,000 transactions of 13 vulnerable DeFi protocols in the real world. The experimental results show that (1) VeriOracle is effective and it can detect unanticipated price feeds before DeFi attacks (33,714 blocks ahead of the attacker in the best case); (2) VeriOracle is efficient in that its verification time (about 4s) is less than the block time of Ethereum (about 14s), which means VeriOracle can detect unsafe transactions in real time; and (3) VeriOracle is extendable for verifying defense strategies. Attacks using unanticipated price feeds can only succeed in particular smart contract states. VeriOracle can verify which smart contract states can defend against attacks.

Publisher's Version
Virtual Reality (VR) Automated Testing in the Wild: A Case Study on Unity-Based VR Applications
Dhia Elhaq Rzig, Nafees Iqbal, Isabella Attisano, Xue Qin, and Foyzul Hassan
(University of Michigan at Dearborn, USA; Villanova University, USA)
Virtual Reality (VR) is an emerging technique that provides a unique real-time experience for users. VR technologies have provided revolutionary user experiences in various scenarios (e.g., training, education, gaming, etc.). However, testing VR applications is challenging due to their nature, which necessitates physical interactivity, and their reliance on specific hardware systems. Despite the recent advancements in VR technology and its usage scenarios, we still know little about VR application testing. To fill this knowledge gap, we performed an empirical study on 314 open-source VR applications. Our analysis identified that 79% of the VR projects evaluated did not have any automatic tests, and for the VR projects that did, the median functional-method to test-method ratios were lower than those of other project types. Moreover, we uncovered tool support issues concerning the measurement of VR code coverage, and the assertion density results we were able to generate were relatively low, with an average of 17.63%. Finally, through a manual analysis of 370 test cases, we identified the different categories of test cases being used to validate VR application quality attributes. Furthermore, we identified which of these categories are VR-attention categories, meaning that test writers need to pay special attention to VR characteristics when writing tests in these categories. We believe that our findings constitute a call to action for the VR development community to improve their automatic testing practices and provide directions for software engineering researchers to develop advanced techniques for automatic test case generation and test quality analysis for VR applications. Our replication package containing the dataset we used, software tools we developed, and the results we found, is accessible at https://doi.org/10.6084/m9.figshare.19678938.

Publisher's Version Published Artifact Artifacts Available
How Effective Are Neural Networks for Fixing Security Vulnerabilities
Yi Wu, Nan Jiang, Hung Viet Pham, Thibaud Lutellier, Jordan Davis, Lin Tan, Petr Babkin, and Sameena Shah
(Purdue University, USA; York University, Canada; University of Alberta, Canada; J.P. Morgan AI Research, USA)
Security vulnerability repair is a difficult task that is in dire need of automation. Two groups of techniques have shown promise: (1) large code language models (LLMs) that have been pre-trained on source code for tasks such as code completion, and (2) automated program repair (APR) techniques that use deep learning (DL) models to automatically fix software bugs.
This paper is the first to study and compare Java vulnerability repair capabilities of LLMs and DL-based APR models. The contributions include that we (1) apply and evaluate five LLMs (Codex, CodeGen, CodeT5, PLBART and InCoder), four fine-tuned LLMs, and four DL-based APR techniques on two real-world Java vulnerability benchmarks (Vul4J and VJBench), (2) design code transformations to address the training and test data overlapping threat to Codex, (3) create a new Java vulnerability repair benchmark VJBench, and its transformed version VJBench-trans, to better evaluate LLMs and APR techniques, and (4) evaluate LLMs and APR techniques on the transformed vulnerabilities in VJBench-trans.
Our findings include that (1) existing LLMs and APR models fix very few Java vulnerabilities. Codex fixes 10.2 vulnerabilities (20.4%), the largest number among the studied techniques. Many of the generated patches are uncompilable. (2) Fine-tuning with general APR data improves LLMs’ vulnerability-fixing capabilities. (3) Our new VJBench reveals that LLMs and APR models fail to fix many Common Weakness Enumeration (CWE) types, such as CWE-325 Missing cryptographic step and CWE-444 HTTP request smuggling. (4) Codex still fixes 8.7 transformed vulnerabilities, outperforming all the other LLMs and APR models on transformed vulnerabilities. The results call for innovations to enhance automated Java vulnerability repair, such as creating larger vulnerability repair training data, tuning LLMs with such data, and applying code simplification transformation to facilitate vulnerability repair.

Publisher's Version
Rare Path Guided Fuzzing
Seemanta Saha, Laboni Sarker, Md Shafiuzzaman, Chaofan Shou, Albert Li, Ganesh Sankaran, and Tevfik Bultan
(University of California at Santa Barbara, USA)
Starting with a random initial seed, fuzzers search for inputs that trigger bugs or vulnerabilities. However, fuzzers often fail to generate inputs for program paths guarded by restrictive branch conditions. In this paper, we show that by first identifying rare paths in programs (i.e., program paths with path constraints that are unlikely to be satisfied by random input generation), and then generating inputs/seeds that trigger these rare paths, one can improve the coverage of fuzzing tools. In particular, we present techniques that 1) identify rare paths using quantitative symbolic analysis, and 2) generate inputs that can explore these rare paths using path-guided concolic execution. We provide these inputs as initial seed sets to three state-of-the-art fuzzers. Our experimental evaluation on a set of programs shows that the fuzzers achieve better coverage with the rare-path based seed sets compared to random initial seeds.
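
The notion of path rarity can be illustrated with a toy model-counting computation (our sketch, not the paper's analysis; the constraint encoding is an assumption): score each path by the probability that uniformly random inputs satisfy its branch constraints, and hand the lowest-scoring paths to concolic execution for seed generation:

    from fractions import Fraction

    UINT = 2 ** 32   # domain size of one random 32-bit input

    # Each path: interval constraints lo <= x < hi on independent inputs.
    paths = {
        "common":  [(0, UINT // 2)],              # x < 2**31
        "rare":    [(0xDEADBEEF, 0xDEADBEF0)],    # x == 0xDEADBEEF
        "guarded": [(0, 256), (0, 16)],           # x < 256 and y < 16
    }

    def path_prob(constraints):
        p = Fraction(1)
        for lo, hi in constraints:
            p *= Fraction(hi - lo, UINT)
        return p

    # The least-probable paths are the rare ones worth seeding.
    for name in sorted(paths, key=lambda n: path_prob(paths[n])):
        print(f"{name:8s} prob = {float(path_prob(paths[name])):.3e}")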

Publisher's Version
CGuard: Scalable and Precise Object Bounds Protection for C
Piyus Kedia, Rahul Purandare, Udit Agarwal, and Rishabh
(IIIT Delhi, India; University of Nebraska-Lincoln, USA; University of British Columbia, Canada; GGSIPU, India)
Spatial safety violations are the root cause of many security attacks and unexpected behavior of applications. Existing techniques to enforce spatial safety work broadly at either object or pointer granularity. Object-based approaches tend to incur high CPU overheads, whereas pointer-based approaches incur both high CPU and memory overheads. SGXBounds, an object-based approach, provides precise out-of-bounds protection for objects at a lower overhead compared to other tools with similar precision. However, a major drawback of this approach is that it cannot support address spaces larger than 32 bits.
In this paper, we present CGuard, a tool that provides precise object-bounds protection for C applications with comparable overheads to SGXBounds without restricting the application address space. CGuard stores the bounds information just before the base address of an object and encodes the relative offset of the base address in the spare bits of the virtual address available in x86_64 architecture. For an object that cannot fit in the spare bits, CGuard uses a custom memory layout that enables it to find the base address of the object in just one memory access. Our study revealed spatial safety violations in the gcc and x264 benchmarks from the SPEC CPU2017 benchmark suite and the string_match benchmark from the Phoenix benchmark suite. The execution time overheads for the SPEC CPU2017 and Phoenix benchmark suites were 42% and 26% respectively, whereas the reduction in the throughput for the Apache webserver when the CPUs were fully saturated was 30%. These results indicate that CGuard can be highly effective while maintaining a reasonable degree of efficiency.
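
The pointer-encoding scheme can be sketched arithmetically (our reading of the abstract, not CGuard's C implementation; the bit widths and addresses are assumptions): the offset back to the object's base, where the bounds header lives, is packed into the spare high bits of a canonical x86_64 address, so the base is recovered with one mask and subtraction:

    SPARE_SHIFT = 48             # canonical x86_64 addresses use 48 bits
    SPARE_MASK = (1 << 16) - 1   # 16 spare high bits hold the offset

    def tag(ptr, base):
        off = ptr - base                  # distance back to the bounds header
        assert 0 <= off <= SPARE_MASK, "large objects need a fallback layout"
        return ptr | (off << SPARE_SHIFT)

    def base_of(tagged):
        off = (tagged >> SPARE_SHIFT) & SPARE_MASK
        return (tagged & ((1 << SPARE_SHIFT) - 1)) - off

    base = 0x7F00_DEAD_0000        # hypothetical object base address
    p = tag(base + 0x40, base)     # interior pointer into the object
    assert base_of(p) == base      # one mask and subtraction recovers the base
    print(hex(p), "->", hex(base_of(p)))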

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
An Empirical Study of Functional Bugs in Android Apps
Yiheng Xiong, Mengqian Xu, Ting Su, Jingling Sun, Jue Wang, He Wen, Geguang Pu, Jifeng He, and Zhendong Su
(East China Normal University, China; Nanjing University, China; ETH Zurich, Switzerland)
Android apps are ubiquitous and serve many aspects of our daily lives. Ensuring their functional correctness is crucial for their success. To date, we still lack a general and in-depth understanding of functional bugs, which hinders the development of practices and techniques to tackle them. To fill this gap, we conduct the first systematic study on 399 functional bugs from 8 popular open-source and representative Android apps to investigate the root causes, bug symptoms, test oracles, and the capabilities and limitations of existing testing techniques. This study required substantial effort. It reveals several interesting new findings and implications that help shed light on future research on tackling functional bugs. Furthermore, findings from our study guided the design of a proof-of-concept differential testing tool, RegDroid, to automatically find functional bugs in Android apps. We applied RegDroid on 5 real-world popular apps, and successfully discovered 14 functional bugs, 10 of which were previously unknown and affected the latest released versions—all these 10 bugs have been confirmed and fixed by the app developers. Notably, 10 of these 14 found bugs cannot be detected by existing testing techniques. We have made all the artifacts (including the dataset of 399 functional bugs and RegDroid) in our work publicly available at https://github.com/Android-Functional-bugs-study/home.

Publisher's Version
NodeRT: Detecting Races in Node.js Applications Practically
Jingyao Zhou, Lei Xu, Gongzheng Lu, Weifeng Zhang, and Xiangyu Zhang
(Nanjing University, China; Suzhou City University, China; Nanjing University of Posts and Telecommunications, China; Purdue University, USA)
Node.js has become one of the most popular development platforms due to its superior concurrency support. However, races induced by the nondeterministic execution order of event handlers may occur in Node.js applications, causing serious runtime failures. The state-of-the-art Node.js race detector NRace builds a happens-before (HB) graph before detection with a set of HB relation rules. In detection, NRace utilizes a heavy-weight BFS-based algorithm to query the reachability between resource operations, which introduces substantial overhead in practice and makes NRace inapplicable to real-world Node.js application test processes. This paper proposes a more practical Node.js dynamic race detection approach called NodeRT (Node.js Race Tracker). To reduce unnecessary overhead, NodeRT simplifies the HB relation rules and divides detection into three stages: trace collection, race candidate detection, and false positive removal. In the trace collection stage, NodeRT constructs a partial HB graph called the asynchronous call tree (ACTree), enabling efficient reachability queries between event handlers. In the race candidate detection stage, NodeRT performs detection on the ACTree, which effectively eliminates most non-racing event handlers and outputs race candidates. In the false positive removal stage, NodeRT utilizes matching rules derived from the HB relation rules and features of resources to reduce false positives among the race candidates. In experiments, NodeRT detects all known races and 9 unknown harmful races in real-world applications, whereas NRace only detects 3 of the unknown harmful races while consuming 64× more time on average. Compared with NRace, NodeRT significantly reduces the overhead, making it practical to integrate into real-world test processes.
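
The benefit of the tree-shaped structure is easy to see: in a tree, happens-before reduces to an ancestor check along parent links, instead of a BFS over a general HB graph. A minimal sketch (ours; the handler names are hypothetical):

    parents = {                # child handler -> handler that scheduled it
        "readFile_cb": "main",
        "parse_cb": "readFile_cb",
        "writeFile_cb": "main",
    }

    def happens_before(a, b):
        while b in parents:    # O(tree depth) ancestor walk, no BFS needed
            b = parents[b]
            if b == a:
                return True
        return False

    print(happens_before("main", "parse_cb"))             # True: ordered
    print(happens_before("readFile_cb", "writeFile_cb"))  # False: race candidate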

Publisher's Version
An Empirical Study on Concurrency Bugs in Interrupt-Driven Embedded Software
Chao Li, Rui Chen, Boxiang Wang, Zhixuan Wang, Tingting Yu, Yunsong Jiang, Bin Gu, and Mengfei Yang
(Beijing Institute of Control Engineering, China; Beijing Sunwise Information Technology, China; Xidian University, China; China Academy of Space Technology, China)
Interrupt-driven embedded software is widely used in aerospace, automotive electronics, medical equipment, IoT, and other industrial fields. This type of software is usually programmed with interrupts to interact with hardware and respond to external stimuli in a timely manner. However, uncertain interleaved execution of interrupts may cause concurrency bugs, resulting in task failures or serious safety issues. A deep understanding of real-world concurrency bugs in embedded software can significantly improve techniques for combating them, such as bug detection, testing, and fixing.
This paper performs the first comprehensive and large-scale empirical study on concurrency bugs in industrial interrupt-driven embedded software. A total of 132 real-world concurrency bugs in 102 industrial embedded software systems have been rigorously analyzed. We study not only the root causes, impacts, and fix strategies of these bugs, but also their manifestation, including triggering scopes, racing variables, access interleaving patterns, and variable correlations. This study reveals several significant findings, which can guide future research in developing techniques and tools to combat concurrency bugs in interrupt-driven embedded software.

Publisher's Version
CodeGrid: A Grid Representation of Code
Abdoul Kader Kaboré, Earl T. Barr, Jacques Klein, and Tegawendé F. Bissyandé
(University of Luxembourg, Luxembourg; University College London, UK; Google DeepMind, Canada)
Code representation is a key step in the application of AI in software engineering. Generic NLP representations are effective but do not exploit all the rich structure inherent to code. Recent work has focused on extracting abstract syntax trees (AST) and integrating their structural information into code representations. These AST-enhanced representations advanced the state of the art and accelerated new applications of AI to software engineering. ASTs, however, neglect important aspects of code structure, notably control and data flow, leaving some potentially relevant code signal unexploited. For example, purely image-based representations perform nearly as well as AST-based representations, despite the fact that they must learn to even recognize tokens, let alone their semantics. This result, from prior work, is strong evidence that these new code representations can still be improved; it also raises the question of just what signal image-based approaches are exploiting.
We answer this question. We show that code is spatial and exploit this fact to propose CodeGrid, a new representation that embeds tokens into a grid that preserves code layout. Unlike some of the existing state of the art, CodeGrid is agnostic to the downstream task: whether that task is generation or classification, CodeGrid can complement the learning algorithm with spatial signal. For example, we show that CNNs, which are inherently spatially-aware models, can exploit CodeGrid's outputs to effectively tackle fundamental software engineering tasks, such as code classification, code clone detection and vulnerability detection. PixelCNN leverages CodeGrid's grid representations to achieve code completion. Through extensive experiments, we validate our spatial code hypothesis, quantifying model performance as we vary the degree to which the representation preserves the grid. To demonstrate its generality, we show that CodeGrid augments models, improving their performance on a range of tasks. On clone detection, CodeGrid improves ASTNN's performance by 3.3% F1 score.
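
A minimal sketch of the grid idea (our illustration, with a deliberately coarse whitespace tokenization; the vocabulary and padding scheme are assumptions): each token id is placed at its (row, column) position so that a spatially-aware model such as a CNN sees the code's two-dimensional layout:

    import numpy as np

    code = [
        "def add(a, b):",
        "    return a + b",
    ]

    vocab, rows = {"<pad>": 0}, []
    for line in code:
        rows.append([vocab.setdefault(tok, len(vocab)) for tok in line.split()])

    height, width = len(rows), max(map(len, rows))
    grid = np.zeros((height, width), dtype=np.int64)   # CNN-ready (H, W) input
    for r, row in enumerate(rows):
        grid[r, :len(row)] = row                       # layout is preserved

    print(vocab)
    print(grid)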

Publisher's Version
Detecting Condition-Related Bugs with Control Flow Graph Neural Network
Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Xudong Liu, Chunming Hu, and Yang Liu
(Beihang University, China; Chongqing University, China; Nanyang Technological University, Singapore)
Automated bug detection is essential for high-quality software development and has attracted much attention over the years. Among the various bugs, previous studies show that condition expressions are quite error-prone and condition-related bugs are commonly found in practice. Traditional approaches to automated bug detection are usually limited to compilable code and require tedious manual effort. Recent deep learning-based work tends to learn general syntactic features based on the Abstract Syntax Tree (AST) or to apply existing Graph Neural Networks over program graphs. However, AST-based neural models may miss important control flow information of source code, and existing Graph Neural Networks for bug detection tend to learn local neighbourhood structure information. Since condition-related bugs are highly influenced by control flow knowledge, we propose a novel CFG-based Graph Neural Network (CFGNN) to automatically detect condition-related bugs, which includes a graph-structured LSTM unit to efficiently learn control flow knowledge and long-distance context information. We also adopt an API-usage attention mechanism to leverage API knowledge. To evaluate the proposed approach, we collect real-world bugs in popular GitHub repositories and build a large-scale condition-related bug dataset. The experimental results show that our proposed approach significantly outperforms the state-of-the-art methods for detecting condition-related bugs.

Publisher's Version
Third-Party Library Dependency for Large-Scale SCA in the C/C++ Ecosystem: How Far Are We?
Ling Jiang, Hengchen Yuan, Qiyi Tang, Sen Nie, Shi Wu, and Yuqun Zhang
(Southern University of Science and Technology, China; Tencent Security Keen Lab, China)
Existing software composition analysis (SCA) techniques for the C/C++ ecosystem tend to identify reused components through feature matching between the target software project and collected third-party libraries (TPLs). However, feature duplication caused by internal code clones can lead to inaccurate SCA results. To mitigate this issue, Centris, a state-of-the-art SCA technique for the C/C++ ecosystem, adopts function-level code clone detection to derive TPL dependencies and eliminate redundant features before performing SCA tasks. Although Centris has been shown to be effective in the original paper, the accuracy of the derived TPL dependencies was not evaluated. Additionally, the dataset for evaluating the impact of TPL dependencies on SCA is limited. To further investigate the efficacy and limitations of Centris, we first construct two large-scale ground-truth datasets for evaluating the accuracy of derived TPL dependencies and SCA results, respectively. We then extensively evaluate Centris, and the results suggest that the accuracy of the TPL dependencies derived by Centris may not generalize well to our evaluation dataset. We further infer that the key factors degrading its performance are inaccurate function birth times and the threshold-based recall. In addition, the impact on SCA from the TPL dependencies derived by Centris can be somewhat limited. Inspired by our findings, we propose TPLite, with function-level origin TPL detection and graph-based dependency recall, to enhance the accuracy of TPL reuse detection in the C/C++ ecosystem. Our evaluation results indicate that, compared with Centris, TPLite effectively increases the precision from 35.71% to 88.33% and the recall from 49.44% to 62.65% in deriving TPL dependencies. Moreover, TPLite increases the precision from 21.08% to 75.90% and the recall from 57.62% to 64.17% compared with the SOTA academic SCA tool B2SFinder, and even outperforms the well-adopted commercial SCA tool BDBA, increasing the precision from 72.46% to 75.90% and the recall from 58.55% to 64.17%.

Publisher's Version
Green Fuzzer Benchmarking
Jiradet Ounjai, Valentin Wüstholz, and Maria Christakis
(MPI-SWS, Germany; ConsenSys, Austria; TU Wien, Austria)
Over the last decade, fuzzing has been increasingly gaining traction due to its effectiveness in finding bugs. Nevertheless, fuzzer evaluations have been challenging during this time, mainly due to a lack of standardized benchmarking. Aiming to alleviate this issue, in 2020, Google released FuzzBench, an open-source benchmarking platform that is widely used for accurate fuzzer benchmarking.
However, a typical FuzzBench experiment takes CPU years to run. If we additionally consider that fuzzers under active development evaluate any changes empirically, benchmarking becomes prohibitive both in terms of computational resources and time. In this paper, we propose GreenBench, a greener benchmarking platform that, compared to FuzzBench, significantly speeds up fuzzer evaluations while maintaining very high accuracy.
In contrast to FuzzBench, GreenBench drastically increases the number of benchmarks while drastically decreasing the duration of fuzzing campaigns. As a result, the fuzzer rankings generated by GreenBench are almost as accurate as those by FuzzBench (with very high correlation), but GreenBench is from 18 to 61 times faster. We discuss the implications of these findings for the fuzzing community.

Publisher's Version Info
Interpreters for GNN-Based Vulnerability Detection: Are We There Yet?
Yutao Hu, Suyuan Wang, Wenke Li, Junru Peng, Yueming Wu, Deqing Zou, and Hai Jin
(Huazhong University of Science and Technology, China; Wuhan University, China; Nanyang Technological University, Singapore)
Traditional vulnerability detection methods have limitations due to their need for extensive manual labor. Using automated means for vulnerability detection has attracted research interest, especially deep learning, which has achieved remarkable results. Since graphs can better convey the structural features of code than text, graph neural network (GNN) based vulnerability detection is significantly better than text-based approaches. Therefore, GNN-based vulnerability detection approaches are becoming popular. However, GNN models are close to black boxes for security analysts, so the models cannot provide clear evidence to explain why a code sample is detected as vulnerable or secure. Many GNN interpreters have been proposed to address this issue. However, the explanations provided by these interpreters for vulnerability detection models are highly inconsistent and unconvincing to security experts. To address the above issues, we propose principled guidelines to assess the quality of interpretation approaches for GNN-based vulnerability detectors based on concerns in vulnerability detection, namely stability, robustness, and effectiveness. We conduct extensive experiments to evaluate the interpretation performance of six famous interpreters (GNN-LRP, DeepLIFT, GradCAM, GNNExplainer, PGExplainer, and SubGraphX) on four vulnerability detectors (DeepWukong, Devign, IVDetect, and Reveal). The experimental results show that the target interpreters achieve poor performance in terms of effectiveness, stability, and robustness. For effectiveness, we find that the instance-independent methods outperform others due to their deep insight into the detection model. In terms of stability, the perturbation-based interpretation methods are more resilient to slight changes in model parameters, as they are model-agnostic. For robustness, the instance-independent approaches provide more consistent interpretation results for similar vulnerabilities.

Publisher's Version
An Empirical Study on the Effects of Obfuscation on Static Machine Learning-Based Malicious JavaScript Detectors
Kunlun Ren, Weizhong Qiang, Yueming Wu, Yi Zhou, Deqing Zou, and Hai Jin
(Huazhong University of Science and Technology, China; Nanyang Technological University, Singapore)
Machine learning is increasingly being applied to malicious JavaScript detection in response to the growing number of Web attacks and the attendant costly manual identification. In practice, to hide malicious behaviors or to protect intellectual property, the authors of both malicious and benign scripts tend to obfuscate their code before uploading. While obfuscation is beneficial, it also introduces additional code features (e.g., dead code) into the code. When machine learning is employed to learn a malicious JavaScript detector, these additional features can affect the model and make it less effective. However, there is still a lack of clear understanding of how robust existing machine learning-based detectors are against different obfuscators. In this paper, we conduct the first empirical study to figure out how obfuscation affects static-feature-based machine learning detectors. From the results, we make several observations: 1) Obfuscation has a significant impact on the effectiveness of detectors, causing an increase in both the false negative rate (FNR) and the false positive rate (FPR), and the bias of obfuscation in the training set induces detectors to detect obfuscation rather than malicious behaviors. 2) Common measures such as improving the quality of the training set by adding relevant obfuscated samples and leveraging state-of-the-art deep learning models do not work well. 3) The root cause of these obfuscation effects is that the feature spaces these detectors use reflect only shallow differences in code, not the nature of benign versus malicious behavior, and are thus easily affected by the differences introduced by obfuscation. 4) Obfuscation has a similar effect on realistic detectors in VirusTotal, indicating that this is a common real-world problem.

Publisher's Version Published Artifact Artifacts Available
Understanding Breaking Changes in the Wild
Dhanushka Jayasuriya, Valerio Terragni, Jens Dietrich, Samuel Ou, and Kelly Blincoe
(University of Auckland, New Zealand; Victoria University of Wellington, New Zealand)
Modern software applications rely heavily on the usage of libraries, which provide reusable functionality, to accelerate the development process. As libraries evolve and release new versions, the software systems that depend on those libraries (the clients) should update their dependencies to use these new versions, as a new release could, for example, include critical fixes for security vulnerabilities. However, updating is not always a smooth process, as it can result in software failures in the clients if the new version includes breaking changes. Yet, there is little research on how these breaking changes impact client projects in the wild. To identify whether changes between two library versions cause breaking changes at the client end, we perform an empirical study on Java projects built using Maven. For the analysis, we used 18,415 Maven artifacts, which declared 142,355 direct dependencies, of which 71.60% were not up-to-date. We updated these dependencies and found that 11.58% of the dependency updates contain breaking changes that impact the client. We further analyzed the library changes that impact client projects and examined whether libraries adhere to the semantic versioning scheme when introducing breaking changes in their releases. Our results show that changes in transitive dependencies were a major factor in introducing breaking changes during dependency updates, and that almost half of the detected client-impacting breaking changes violate the semantic versioning scheme by introducing breaking changes in non-major updates.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Improving Spectrum-Based Localization of Multiple Faults by Iterative Test Suite Reduction
Dylan Callaghan and Bernd Fischer
(Stellenbosch University, South Africa)
Spectrum-based fault localization (SBFL) works well for single-fault programs but its accuracy decays for increasing fault numbers. We present FLITSR (Fault Localization by Iterative Test Suite Reduction), a novel SBFL extension that improves the localization of a given base metric specifically in the presence of multiple faults. FLITSR iteratively selects reduced versions of the test suite that better localize the individual faults in the system. This allows it to identify and re-rank faults ranked too low by the base metric because they were masked by other program elements.
We evaluated FLITSR over method-level spectra from an existing large synthetic dataset comprising 75,000 variants of 15 open-source projects with up to 32 injected faults, as well as method- and statement-level spectra from a new dataset with 326 true multi-fault versions from the Defects4J benchmark set containing up to 14 real faults. For all three spectrum types we consistently see substantial reductions of the average wasted efforts at different fault levels, of 30%-90% over the best base metric, and generally similarly large increases in precision and recall, albeit with larger variance across the underlying projects. For the method-level real faults, FLITSR also substantially outperforms GRACE, a state-of-the-art learning-based fault localizer.
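
The reduction loop can be sketched in a few lines (our toy illustration with the Ochiai base metric; the spectrum is hypothetical, and FLITSR's actual reduction is more refined): after the top-ranked element is reported, failing tests it explains are dropped, so a fault it masked can rise to the top of the next ranking:

    from math import sqrt

    # spectrum: test -> (set of covered elements, passed?); hypothetical data
    tests = {
        "t1": ({"e1", "e3"}, False),
        "t2": ({"e1"}, False),
        "t3": ({"e2", "e3"}, False),
        "t4": ({"e3"}, True),
    }

    def ochiai(ts):
        fail_cov = [cov for cov, ok in ts.values() if not ok]
        pass_cov = [cov for cov, ok in ts.values() if ok]
        scores = {}
        for e in set().union(*(cov for cov, _ in ts.values())):
            ef = sum(e in cov for cov in fail_cov)
            ep = sum(e in cov for cov in pass_cov)
            nf = len(fail_cov) - ef
            scores[e] = ef / sqrt((ef + nf) * (ef + ep)) if ef else 0.0
        return scores

    while any(not ok for _, ok in tests.values()):
        scores = ochiai(tests)
        top = max(scores, key=scores.get)
        print(f"localized {top} (score {scores[top]:.2f})")
        # Reduce: drop failing tests the candidate already explains, so a
        # fault it masked (here e2) can top the next ranking.
        tests = {t: (cov, ok) for t, (cov, ok) in tests.items()
                 if ok or top not in cov}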

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable
Extracting Inline Tests from Unit Tests
Yu Liu, Pengyu Nie, Anna Guo, Milos Gligoric, and Owolabi Legunsen
(University of Texas at Austin, USA; Cornell University, USA)
We recently proposed inline tests for validating individual program statements; they allow developers to provide test inputs, expected outputs, and test oracles immediately after a target statement. However, existing code can have many target statements. So, automatic generation of inline tests is an important next step towards increasing their adoption. We propose ExLi, the first technique for automatically generating inline tests. ExLi extracts inline tests from unit tests; it first records all variable values at a target statement while executing unit tests. Then, ExLi uses those values as test inputs and test oracles in an initial set of generated inline tests. Target statements that are executed many times could have redundant initial inline tests. So, ExLi uses a novel coverage-then-mutants based reduction process to remove redundant inline tests. We implement ExLi for Java and use it to generate inline tests for 718 target statements in 31 open-source programs. ExLi reduces 17,273 initially generated inline tests to 905 inline tests. The final set of generated inline tests kills up to 25.1% more mutants on target statements than developer-written and automatically generated unit tests. That is, ExLi generates inline tests that can improve the fault-detection capability of the test suites from which they are extracted.
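
The extraction idea can be sketched as follows (our illustration in Python; ExLi targets Java and uses coverage-then-mutants reduction rather than the simple deduplication shown): variable values observed at the target statement during unit-test runs become the inputs and oracles of inline tests:

    records = []

    def target(x):            # the target statement is the bit trick below
        y = x & (x - 1)       # clears the lowest set bit of x
        records.append((x, y))
        return y

    def existing_unit_test():
        assert target(12) == 8
        assert target(7) == 6

    existing_unit_test()

    # Each recorded (input, output) pair becomes an inline test that sits
    # right next to the statement it documents.
    for x, y in sorted(set(records)):
        print(f"inline test: assert {x} & ({x} - 1) == {y}")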

Publisher's Version

Tool Demonstrations

DDLDroid: A Static Analyzer for Automatically Detecting Data Loss Issues in Android Applications
Yuhao Zhou and Wei Song
(Nanjing University of Science and Technology, China)
DDLDroid is a static analyzer for detecting data loss issues in Android apps during activity restart or app relaunch. It is bootstrapped by a saving-restoring bipartite graph, which correlates variables that need saving to those that need restoring according to their carrier widgets, and is based on the analysis of saving and restoring data flows. It reports data loss issues once missed or broken data flows are identified. DDLDroid detects 302 true data loss issues in 66 Android apps within 73 minutes, including 180 previously unknown issues, demonstrating its effectiveness and efficiency.

Publisher's Version
Behaviorally Typed State Machines in TypeScript for Heterogeneous Swarms
Roland Kuhn and Alan Darmasaputra
(Actyx, Germany)
A heterogeneous swarm system is a distributed system where participants come and go, the communication topology may change at any time, data replication is asynchronous and partial, and local agents behave differently between nodes. These systems are hard to design and reason about, mainly because we desire a particular class of behaviors to emerge from the interplay of heterogeneous individual agents. Nevertheless, mission-critical operations like manufacturing process orchestration in factories use such systems due to the uncompromising availability and resilience of their computing services.
This paper presents a set of TypeScript libraries to model peer-to-peer workflows as state machines, execute them using the Actyx middleware, and check the shape of these machines for conformance to a swarm protocol. The swarm protocol describes an idealized global view of the cooperation of machines of different roles. It directly corresponds to a diagram a product manager would sketch on a whiteboard; this allows for verifying that the coded state machines correctly implement the product specification. A well-formed swarm protocol also guarantees that conforming machines will achieve eventual consensus on the overall state progression even in the absence of further coordination. This tool is for developers of business logic for heterogeneous swarm systems, helping them verify that their protocols and implementations are correct.
Tool repo: https://github.com/Actyx/machines

Publisher's Version
ECSTATIC: Automatic Configuration-Aware Testing and Debugging of Static Analysis Tools
Austin Mordahl, Dakota Soles, Miao Miao, Zenong Zhang, and Shiyi Wei
(University of Texas at Dallas, USA)
Static analyses are powerful tools that can serve as a complement to dynamic approaches such as testing. In order to ensure generality, many static analysis tools are configurable. However, these configurations can make testing and debugging more difficult. To address this issue, we introduce a new tool, ECSTATIC, which leverages partial order relations between analysis configuration options to automatically test and debug static analyzers, even without ground truths. ECSTATIC’s results are reproducible by virtue of running within Docker containers, and ECSTATIC provides clear extension interfaces for users to add their own tools and input programs. We evaluated ECSTATIC on four popular dataflow analysis tools, and found 74 bugs in all four tools. We also found that ECSTATIC’s novel two-staged delta debugging was able to reduce real-world programs by 50%, compared to a baseline of 6%.

Publisher's Version
RustSmith: Random Differential Compiler Testing for Rust
Mayank Sharma, Pingshi Yu, and Alastair F. Donaldson
(Imperial College London, UK)
We present RustSmith, the first Rust randomised program generator for end-to-end testing of Rust compilers. RustSmith generates programs that conform to the advanced type system of Rust, respecting rules related to borrowing and lifetimes, and that are guaranteed to yield a well-defined result. This makes RustSmith suitable for differential testing between compilers or across optimisation levels. By applying RustSmith to a series of versions of the official Rust compiler, rustc, we show that it can detect insidious historical bugs that evaded detection for some time. We have also used RustSmith to find previously-unknown bugs in an alternative Rust compiler implementation, mrustc. In a controlled experiment, we assess statement and mutation coverage achieved by RustSmith vs. the rustc optimisation test suite.

Publisher's Version
KeenTune: Automated Tuning Tool for Cloud Application Performance Testing and Optimization
Qinglong Wang, Runzhe Wang, Yuxi Hu, Xiaohai Shi, Zheng Liu, Tao Ma, Houbing Song, and Heyuan Shi
(Alibaba Group, China; University of Maryland, Baltimore County, USA; Central South University, China)
The performance testing and optimization of cloud applications is challenging, because manual tuning of cloud computing stacks is tedious and automated tuning tools are rarely used for cloud services. To address this issue, we introduce KeenTune, an automated tuning tool designed to optimize application performance and facilitate performance testing. KeenTune is a lightweight and flexible tool that can be deployed alongside the applications to be tuned, with negligible impact on their performance. Specifically, KeenTune uses a surrogate model that can be implemented with machine learning models to filter out less relevant parameters for efficient tuning. Our empirical evaluation shows that KeenTune significantly enhances the throughput performance of Nginx web servers, resulting in performance improvements of up to 90.43% and 117.23% in certain cases. This study highlights the benefits of using KeenTune for achieving efficient and effective performance testing of cloud applications. The video and source code for KeenTune are provided as supplementary materials.

Publisher's Version Video Info
KDAlloc: The KLEE Deterministic Allocator: Deterministic Memory Allocation during Symbolic Execution and Test Case Replay
Daniel Schemmel, Julian Büning, Frank Busse, Martin Nowack, and Cristian Cadar
(Imperial College London, UK; RWTH Aachen University, Germany)
The memory allocator can have an important impact on symbolic execution. Taking a user-centric view, this tool demonstration paper discusses some of the main benefits provided by KLEE's new allocator, KDAlloc, in terms of improved deterministic execution and bug-finding capabilities. We then introduce a new replay tool for KLEE which enables native execution to integrate KDAlloc and receive the same heap addresses as during symbolic execution.

Publisher's Version Info
EDHOC-Fuzzer: An EDHOC Protocol State Fuzzer
Konstantinos Sagonas and Thanasis Typaldos
(Uppsala University, Sweden; National Technical University of Athens, Greece)
EDHOC is a compact and lightweight authenticated key exchange protocol proposed by the IETF, whose design focuses on small message sizes, in order to be suitable for constrained IoT communication technologies. In this tool paper, we overview EDHOC-Fuzzer, a protocol state fuzzer for implementations of EDHOC clients and servers. It employs model learning to generate a state machine model of an EDHOC implementation, capturing its input/output behavior. This model can then be used for model-based testing, for fingerprinting, or can be analyzed for non-conformances, state machine bugs and security vulnerabilities. We overview the architecture and use of EDHOC-Fuzzer, and present some examples of models produced by the tool and our current findings.

Publisher's Version
MetaData262: Automatic Test Suite Selection for Partial JavaScript Implementations
Frederico Ramos, Diogo Costa Reis, Miguel Trigo, António Morgado, and José Fragoso Santos
(Instituto Superior Técnico, Portugal; University of Lleida, Spain; INESC-ID, Portugal)
Despite the large number of partial reference implementations of the JavaScript language, there is currently no automatic mechanism for selecting the appropriate official tests for such implementations. To fill this gap, we introduce a new format for presenting the metadata associated with the tests included in Test262, the official JavaScript test suite, and present MetaData262, a new tool for both computing the metadata of Test262 tests and filtering tests according to their respective metadata properties.
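
For illustration, Test262 files carry their metadata as a YAML frontmatter block between /*--- and ---*/; the sketch below (ours, not MetaData262, with deliberately naive parsing and a hypothetical feature set) filters a test by the features its frontmatter declares:

    import re

    test_file = """
    /*---
    features: [BigInt, Symbol.iterator]
    flags: [onlyStrict]
    ---*/
    var x = 1n;
    """

    SUPPORTED = {"Symbol.iterator"}   # hypothetical partial implementation

    def features_of(src):
        block = re.search(r"/\*---(.*?)---\*/", src, re.S)
        if not block:
            return []
        feats = re.search(r"features:\s*\[([^\]]*)\]", block.group(1))
        return [f.strip() for f in feats.group(1).split(",")] if feats else []

    needed = features_of(test_file)
    verdict = "run" if all(f in SUPPORTED for f in needed) else "skip"
    print(needed, "->", verdict)      # BigInt unsupported -> skip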

Publisher's Version Published Artifact Artifacts Available
RobotBT: Behavior-Tree-Based Test-Case Specification for the Robot Framework
Sven Peldszus, Noubar Akopian, and Thorsten Berger
(Ruhr University Bochum, Germany; Chalmers, Sweden; University of Gothenburg, Sweden)
The Robot Framework is a popular and widely used test automation framework that abstracts test case specifications toward natural language. This makes it well suited for implementing high-level test cases, at least as long as the functions provided by Robot can support the intended functionality. For more complicated test cases, custom and often deeply nested functionality specifications are required, and the readability of Robot test cases tends to decrease. We present RobotBT, a library for the Robot Framework that addresses these shortcomings by adding support for specifying test cases using behavior trees. Behavior trees are a comprehensive method for specifying complex behaviors based on a control flow model that orchestrates the execution of functionality. We evaluated RobotBT on a test suite for GUI testing from G DATA CyberDefense AG and interviewed their engineers about the usability, readability, and applicability of RobotBT. Our results show that behavior trees improve the expressiveness and readability of Robot Framework test cases and are applicable to practical problems.
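
For readers unfamiliar with behavior trees, the sketch below (ours, not RobotBT's implementation; node and keyword names are hypothetical) shows the two classic composite nodes that orchestrate keyword execution:

    SUCCESS, FAILURE = "SUCCESS", "FAILURE"

    class Action:                 # leaf node: runs one test keyword
        def __init__(self, name, fn):
            self.name, self.fn = name, fn
        def tick(self):
            return SUCCESS if self.fn() else FAILURE

    class Sequence:               # succeeds only if all children succeed, in order
        def __init__(self, *children):
            self.children = children
        def tick(self):
            for child in self.children:
                if child.tick() == FAILURE:
                    return FAILURE
            return SUCCESS

    class Fallback:               # tries children until one succeeds
        def __init__(self, *children):
            self.children = children
        def tick(self):
            for child in self.children:
                if child.tick() == SUCCESS:
                    return SUCCESS
            return FAILURE

    tree = Sequence(
        Action("open app", lambda: True),
        Fallback(Action("login via sso", lambda: False),    # fails...
                 Action("login via form", lambda: True)),   # ...fallback masks it
        Action("check dashboard", lambda: True),
    )
    print(tree.tick())   # SUCCESS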

Publisher's Version Video
TreeLine and SlackLine: Grammar-Based Performance Fuzzing on Coffee Break
Ziyad Alsaeed and Michal Young
(Qassim University, Saudi Arabia; University of Oregon, USA)
TreeLine and SlackLine are grammar-based fuzzers for quickly finding performance problems in programs driven by richly structured text that can be described by a context-free grammar. In contrast to long fuzzing campaigns that find (mostly invalid) inputs triggering security vulnerabilities, TreeLine and SlackLine are designed to search for performance problems in the space of valid inputs in minutes rather than hours. The TreeLine and SlackLine front-ends differ in search strategy (Monte Carlo tree search and derivation tree splicing, respectively) but accept the same grammar specifications and rely on a common back-end for instrumented execution. This separation of concerns should facilitate use by other researchers who wish to explore alternatives and extensions of either the front or back ends.

Publisher's Version Video
Oven: Safe and Live Communication Protocols in Scala, using Synthetic Behavioural Type Analysis
Francisco Ferreira and Sung-Shik Jongmans
(Royal Holloway University of London, UK; Open University of the Netherlands, Netherlands; CWI, Netherlands)
We present Oven: a toolset to assure safety and liveness of communication protocols among threads in concurrent programs in Scala.
Oven is the first practical toolset that is built on top of new theoretical foundations of synthetic behavioural type analysis, recently developed by us to improve the expressiveness of existing work. We explain Oven's usage, summarise its design and implementation (main challenge: how to encode the new synthetic behavioural typing rules in Scala's existing type system), and discuss a preliminary evaluation of expressiveness (the results provide first evidence that Oven is an improvement over two state-of-the-art tools).

Publisher's Version
SymRustC: A Hybrid Fuzzer for Rust
Frédéric Tuong, Mohammad Omidvar Tehrani, Marco Gaboardi, and Steven Y. Ko
(Simon Fraser University, Canada; Boston University, USA)
We present SymRustC, a hybrid fuzzer for Rust. SymRustC is hybrid in the sense that it combines fuzzing and concolic execution. SymRustC leverages an existing tool called SymCC for its concolic execution capability and another existing tool called LibAFL for its fuzzing capability. Since SymCC instruments LLVM IR (Intermediate Representation) for concolic execution and the Rust compiler uses LLVM as a backend, we integrate SymCC with the Rust compiler to instrument Rust programs for concolic execution. LibAFL provides a framework to develop a fuzzer, and we use it to develop a hybrid fuzzer that combines fuzzing and our concolic execution. We discuss our implementation as well as four case studies to demonstrate that SymRustC can generate inputs that discover errors in Rust programs.

Publisher's Version
EvoSpex: A Search-Based Tool for Postcondition Inference
Facundo Molina, Pablo Ponzio, Nazareno Aguirre, and Marcelo F. Frias
(IMDEA Software Institute, Spain; University of Rio Cuarto, Argentina; CONICET, Argentina; University of Texas at El Paso, USA)
Postconditions are predicates that specify the intended behavior of a program by capturing properties about the program state when the program finishes its execution. Although postconditions can help to improve many software reliability analyses, they are seldom found accompanying source code. Thus, tools that assist developers in specifying postconditions are useful. This tool demo paper presents EvoSpex, a tool based on evolutionary computation that automatically infers postconditions of Java methods. Given a target Java method and a test suite for it, our tool executes the test suite to obtain valid pre/post state pairs for the method under analysis. Then, these pairs are mutated to obtain (allegedly) invalid ones, and finally a postcondition assertion characterizing the current method behavior is produced, by using an evolutionary algorithm that searches for an assertion that is satisfied by the valid pre/post state pairs and leaves out the invalid ones. EvoSpex implements a classic genetic algorithm that explores the space of candidate postconditions over a JML-like specification language. The algorithm is guided by a fitness function that aims at precisely capturing the valid state pairs, rejecting the invalid ones, and that also favors more succinct assertions.
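
The search is driven by a fitness function over state pairs; the toy sketch below (ours, not EvoSpex's implementation; the method under analysis, the mutations, and the candidate pool are assumptions) scores candidate postconditions for abs() by how many valid pairs they accept and how many mutated pairs they reject:

    # Method under analysis: abs(). States are (pre, post) pairs.
    valid = [(x, abs(x)) for x in range(-5, 6)]
    invalid = [(x, -abs(x) - 1) for x in range(-5, 6)] \
            + [(x, x - 1) for x in range(1, 6)]          # mutated pairs

    candidates = {
        "post >= 0":                 lambda pre, post: post >= 0,
        "post == pre":               lambda pre, post: post == pre,
        "post >= 0 and post >= pre": lambda pre, post: post >= 0 and post >= pre,
    }

    def fitness(pred):
        accepted = sum(pred(*s) for s in valid)        # keep the valid pairs
        rejected = sum(not pred(*s) for s in invalid)  # leave out invalid ones
        return accepted + rejected

    for name, pred in sorted(candidates.items(), key=lambda kv: -fitness(kv[1])):
        print(f"{fitness(pred):3d}  {name}")
    # The conjunctive candidate accepts every valid pair and rejects every
    # mutated one, so the search converges toward it.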

Publisher's Version Video Info
PExReport-Maven: Creating Pruned Executable Cross-Project Failure Reports in Maven Build System
Sunzhou Huang and Xiaoyin Wang
(University of Texas at San Antonio, USA)
Modern Java software development extensively depends on existing libraries written by other developer teams from the same or a different organization. When a developer executes a test, the execution trace may cross the boundaries of multiple dependencies and create cross-project failures (CPFs). A readable, executable, and concise CPF report enables the most effective communication, but creating such a report is often challenging in Java ecosystems. We developed PExReport-Maven to automatically create such ideal CPF reports in the Maven build system. PExReport-Maven leverages the Maven build system to prune source code, dependencies, and the build environment to create a concise stand-alone executable CPF reproduction package from the original CPF project. The reproduction package includes the source code, dependencies, and build environment necessary to reproduce the CPF, making it an ideal CPF report. We performed an evaluation on 74 software project issues with 198 cross-project failures, and the evaluation results show that PExReport can create pruned reproduction packages for 184 out of the 198 test failures in our dataset, with an average reduction of 72.97% in Java classes. A future study will be conducted based on user feedback from using this tool to report real-world CPFs. PExReport-Maven is publicly available at https://github.com/wereHuang/PExReport-Maven. The tool demo is available on the PExReport website: https://sites.google.com/view/pexreport/home.

Publisher's Version

Doctoral Symposium

Late PhD Students

Quantitative Robustness Analysis of Neural Networks
Mara Downing
(University of California at Santa Barbara, USA)
Neural networks are an increasingly common tool for solving problems that require complex analysis and pattern matching, such as identifying stop signs or processing medical imagery. Accordingly, verification of neural networks for safety and correctness is of great importance, as mispredictions can have catastrophic results in safety-critical domains. One metric for verification is robustness, which answers whether a misclassified input exists in a given input neighborhood. I am focusing my research on quantitative robustness: finding not only whether misclassified inputs exist within a given neighborhood but also how many exist as a proportion of the neighborhood size. My overall goal is to expand the research on quantitative neural network robustness verification and create a variety of quantitative verification tools geared towards expanding our understanding of neural network robustness.
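
A Monte Carlo baseline makes the quantitative question concrete; the sketch below (our illustration; the linear classifier, radius, and sample count are assumptions) estimates the misclassified proportion of an L-infinity neighborhood rather than merely checking whether one counterexample exists:

    import numpy as np

    rng = np.random.default_rng(0)
    W = np.array([[2.0, -1.0], [0.0, 1.0]])   # toy 2-class linear classifier

    x0 = np.array([0.5, 0.4])
    label = int(np.argmax(W @ x0))            # class predicted at the center

    eps, n = 0.3, 100_000
    xs = x0 + rng.uniform(-eps, eps, size=(n, 2))
    preds = np.argmax(xs @ W.T, axis=1)
    rate = float(np.mean(preds != label))
    print(f"~{rate:.1%} of the eps-ball is misclassified")   # roughly 35%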

Publisher's Version
Automatic Testing and Benchmarking for Configurable Static Analysis Tools
Austin Mordahl
(University of Texas at Dallas, USA)
Static analysis is an important tool for detecting bugs in real-world software. The advent of numerous analysis algorithms with their own tradeoffs has led to the proliferation of configurable static analysis tools, but their complex, undertested configuration spaces are obstacles to their widespread adoption. To improve the reliability of these tools, my research focuses on developing new approaches to automatically test and debug them. First, I describe an empirical study that helps to understand the performance and behavior of configurable taint analysis tools for Android. The findings of this study motivate the development of ECSTATIC, a framework for testing and debugging that goes beyond taint analysis to test any configurable static analysis tool. The next steps for this research involve the automatic creation of real-world benchmarks for static analysis with associated ground truths and analysis features.

Publisher's Version
Type Automata
Ori Roth
(Technion, Israel)
PL researchers have a profound understanding of automata theory, yet the subtleties and nuances of the type systems used in modern programming languages remain far less well understood. My research pursues new insights into the computational power of type systems by connecting them with well-founded classes of automata through type automata---machines that employ program types as control and storage. In addition, I demonstrate advanced type-level metaprogramming applications of type automata that coerce the compiler into performing computations at compile time.
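As a toy illustration of types acting as automaton control (an example for exposition, not the construction from this research): a two-state DFA over {a, b} that accepts strings ending in b can be encoded as a fluent API whose return types track the automaton state, so a static type checker runs the automaton at compile time:

class StateA:                              # last symbol was not 'b'
    def a(self) -> "StateA": return StateA()
    def b(self) -> "StateB": return StateB()

class StateB:                              # last symbol was 'b' (accepting)
    def a(self) -> "StateA": return StateA()
    def b(self) -> "StateB": return StateB()

def accept(s: StateB) -> None: ...         # only accepting states type-check

accept(StateA().a().b())     # ok: the checker infers StateB
# accept(StateA().b().a())   # rejected statically: StateA is not StateB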

Publisher's Version
Harnessing Large Language Models for Simulink Toolchain Testing and Developing Diverse Open-Source Corpora of Simulink Models for Metric and Evolution Analysis
Sohil Lal Shrestha
(University of Texas at Arlington, USA)
MATLAB/Simulink is a de facto standard tool in several safety-critical industries, such as automotive, aerospace, healthcare, and industrial automation, for system modeling and analysis, compiling models to code, and deploying code to embedded hardware. Testing cyber-physical system (CPS) development tools such as MathWorks' Simulink is important, as a bug in the toolchain may propagate to the artifacts it produces. It is equally important to understand modeling practices and model evolution, since Simulink models are widely used in the design, simulation, and verification of CPS. Existing work in this area is limited by two main factors: (1) the inefficiency of state-of-the-art testing schemes at finding critical toolchain bugs, and (2) the lack of a reusable corpus of public Simulink models. In my thesis, I propose to (1) curate a large reusable corpus of Simulink models to help understand modeling practices and model evolution, and (2) leverage such a corpus with deep-learning-based language models to test the toolchain.

Publisher's Version

Early PhD Students

Fairness Testing for Recommender Systems
Huizhong Guo
(Zhejiang University, China)
The topic of fairness in recommender systems (RSs) is gaining significant attention. However, current fairness metrics and testing approaches primarily target classification systems and are not suitable for RSs. We aim to bridge this gap by addressing the specific challenges of fairness testing for RSs. In this paper, we present a novel testing approach designed specifically for RSs that achieves accurate results while maintaining high efficiency. Additionally, we suggest potential avenues for further research in the realm of fairness testing for RSs.
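For a concrete sense of what recommendation-specific fairness can mean (an illustrative metric from the fairness-in-ranking literature, not the approach proposed here): item-side exposure compares how much ranking exposure different provider groups' items receive, discounting lower ranks:

import math

def group_exposure(rankings, item_group):
    # rankings: list of ranked item-id lists, one per user.
    # item_group: maps an item id to its provider group.
    # Returns total log-discounted exposure accumulated by each group.
    exposure = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            g = item_group[item]
            exposure[g] = exposure.get(g, 0.0) + 1.0 / math.log2(rank + 1)
    return exposure

A large exposure gap between groups with similar relevance is the kind of RS-specific unfairness that classification metrics cannot capture.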

Publisher's Version
Quantitative Symbolic Similarity Analysis
Laboni Sarker
(University of California at Santa Barbara, USA)
Similarity analysis plays a crucial role in various software engineering tasks, such as detecting software changes, merging versions, identifying plagiarism, and analyzing binary code. Equivalence analysis, a stricter form of similarity, determines whether different programs, or versions of the same program, behave identically. While extensive research exists on code and binary similarity as well as on equivalence analysis, these areas lack quantitative reasoning. Non-equivalence is a spectrum that requires deeper exploration, as it can manifest in different ways across the input domain. This paper emphasizes the importance of quantitative reasoning about non-equivalence that arises from semantic differences. By reasoning quantitatively about non-equivalence, it becomes possible to identify the specific input ranges for which programs are equivalent or non-equivalent. We aim to address the gap in quantitative reasoning in symbolic similarity analysis, enabling a more comprehensive understanding of program behavior.
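A minimal illustration of the quantitative view (brute-force enumeration for exposition, not the symbolic technique): over a bounded input domain, the fraction of inputs on which two versions disagree, and the exact inputs where they differ, quantify their non-equivalence:

def quantify_difference(p1, p2, domain):
    # Compare two programs pointwise over a finite input domain.
    diffs = [x for x in domain if p1(x) != p2(x)]
    return len(diffs) / len(domain), diffs

# Example: an off-by-one change is equivalent except on one input.
old = lambda x: 0 if x < 10 else 1
new = lambda x: 0 if x <= 10 else 1
ratio, diffs = quantify_difference(old, new, range(-100, 100))
# ratio == 0.005; diffs == [10], the exact range of semantic difference

Symbolic reasoning replaces the enumeration with constraint solving, so such ranges can be characterized over unbounded domains.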

Publisher's Version
Reasoning about MLIR Semantics through Effects and Handlers
Pingshi Yu
(Imperial College London, UK)
MLIR is a novel framework for developing intermediate representations (IRs) of compilers. At its core, MLIR is a framework for specifying syntax fragments (dialects) and optimisations, which can be combined à la carte to form customised IRs. Through this, MLIR allows IR abstractions to be shared across different domains. With the rapid adoption of MLIR across industry, techniques for formalised semantics that match the flexibility and extensibility of MLIR are urgently needed. We propose a framework for MLIR semantics based on effect handlers, which allows dialect semantics to be specified in a modular and composable way, mirroring MLIR itself. We also describe several research directions that continue from handlers-based MLIR semantics.
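A loose Python analogy to the handlers idea (an assumption for exposition, not the formal semantics): treat each dialect's operations as effects and give them meaning through a per-dialect handler, so composing handlers composes dialect semantics:

def interpret(program, handlers):
    # program: list of (dialect, op, args); handlers: {dialect: handler}.
    # Each handler maps (op, args, env) to an updated environment.
    env = {}
    for dialect, op, args in program:
        env = handlers[dialect](op, args, env)
    return env

def arith_handler(op, args, env):
    # Semantics of a tiny arithmetic "dialect".
    if op == "const":
        dst, value = args
        env = dict(env, **{dst: value})
    elif op == "add":
        dst, a, b = args
        env = dict(env, **{dst: env[a] + env[b]})
    return env

# Supporting a new dialect only requires registering another handler:
env = interpret(
    [("arith", "const", ("x", 2)),
     ("arith", "const", ("y", 3)),
     ("arith", "add", ("z", "x", "y"))],
    {"arith": arith_handler})
# env["z"] == 5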

Publisher's Version
