FSE 2025 – Author Index
Abid, Muhammad Salman
Ali Reza Ibrahimzada, Kaiyao Ke, Mrigank Pawagi, Muhammad Salman Abid, Rangeet Pan, Saurabh Sinha, and Reyhaneh Jabbarvand (University of Illinois at Urbana-Champaign, USA; Indian Institute of Science, India; Cornell University, USA; IBM Research, USA)
Aghili, Roozbeh
![]() Roozbeh Aghili, Heng Li, and Foutse Khomh (Polytechnique Montréal, Canada) Software logs, generated during the runtime of software systems, are essential for various development and analysis activities, such as anomaly detection and failure diagnosis. However, the presence of sensitive information in these logs poses significant privacy concerns, particularly regarding Personally Identifiable Information (PII) and quasi-identifiers that could lead to re-identification risks. While general data privacy has been extensively studied, the specific domain of privacy in software logs remains underexplored, with inconsistent definitions of sensitivity and a lack of standardized guidelines for anonymization. To mitigate this gap, this study offers a comprehensive analysis of privacy in software logs from multiple perspectives. We start by performing an analysis of 25 publicly available log datasets to identify potentially sensitive attributes. Based on the result of this step, we focus on three perspectives: privacy regulations, research literature, and industry practices. We first analyze key data privacy regulations, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), to understand the legal requirements concerning sensitive information in logs. Second, we conduct a systematic literature review to identify common privacy attributes and practices in log anonymization, revealing gaps in existing approaches. Finally, we survey 45 industry professionals to capture practical insights on log anonymization practices. Our findings shed light on various perspectives of log privacy and reveal industry challenges, such as technical and efficiency issues while highlighting the need for standardized guidelines. By combining insights from regulatory, academic, and industry perspectives, our study aims to provide a clearer framework for identifying and protecting sensitive information in software logs. ![]() |
Ahmed, Toufique
Yuvraj Singh Virk, Premkumar Devanbu, and Toufique Ahmed (University of California at Davis, USA; IBM Research, USA)
Akram, Waseem
Waseem Akram, Yanjie Jiang, Yuxia Zhang, Haris Ali Khan, and Hui Liu (Beijing Institute of Technology, China; Peking University, China) Accurate method naming is crucial for code readability and maintainability. However, manually creating concise and meaningful names remains a significant challenge. To this end, in this paper, we propose an approach based on Large Language Models (LLMs) to suggest method names according to function descriptions. The core of the approach is ContextCraft, an automated algorithm for generating context-rich prompts for an LLM that suggests the expected method names according to the prompts. For a given query (functional description), it retrieves the few best examples whose functional descriptions have the greatest similarity with the query. From the examples, it identifies tokens that are likely to appear in the final method name as well as their likely positions, picks pivot words that are semantically related to tokens in the corresponding method names, and specifies the evaluation results of the LLM on the selected examples. All such outputs (tokens with probabilities and position information, pivot words accompanied by associated name tokens and similarity scores, and evaluation results) together with the query and the selected examples are then filled into a predefined prompt template, resulting in a context-rich prompt. This context-rich prompt reduces the randomness of LLMs by focusing the LLM’s attention on relevant contexts, constraining the solution space, and anchoring results to meaningful semantic relationships. Consequently, the LLM leverages this prompt to generate the expected method name, producing a more accurate and relevant suggestion. We evaluated the proposed approach with 43k real-world Java and Python methods accompanied by functional descriptions. Our evaluation results suggested that it significantly outperforms the state-of-the-art approach RNN-att-Copy, improving the chance of an exact match by 52% and decreasing the edit distance between generated and expected method names by 32%. Our evaluation results also suggested that the proposed approach worked well for various LLMs, including ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini-1.5, and Llama-3.
Haris Ali Khan, Yanjie Jiang, Qasim Umer, Yuxia Zhang, Waseem Akram, and Hui Liu (Beijing Institute of Technology, China; Peking University, China; King Fahd University of Petroleum and Minerals, Saudi Arabia) It is often valuable to know whether a given piece of source code has or hasn’t been used to train a given deep learning model. On the one hand, it helps avoid data contamination problems that may exaggerate the performance of evaluated models. On the other hand, it facilitates copyright protection by identifying private or protected code leveraged for model training without permission. To this end, automated approaches have been proposed for this detection task, known as data contamination detection. Such approaches often heavily rely on the confidence of the involved models, assuming that the models should be more confident in handling contaminated data than cleaned data. However, such approaches do not consider the nature of the given data item, i.e., how difficult it is to predict the given item. Consequently, difficult-to-predict contaminated data and easy-to-predict cleaned data are often misclassified.
As an initial attempt to solve this problem, this paper presents a naturalness-based approach, called Natural-DaCoDe, for code-completion models to distinguish contaminated source code from cleaned code. Natural-DaCoDe leverages code naturalness to quantitatively measure the difficulty of a given source code for code-completion models. It then trains a classifier to distinguish contaminated source code according to both code naturalness and the performance of the code-completion models on the given source code. We evaluate Natural-DaCoDe with two pre-trained large language models (ChatGPT and Claude) and two code-completion models that we trained from scratch for detecting contaminated data. Our evaluation results suggest that Natural-DaCoDe substantially outperformed the state-of-the-art approaches in detecting contaminated data, improving the average accuracy by 61.78%. We also evaluate Natural-DaCoDe on a method name suggestion task, and it remains more accurate than the state-of-the-art approaches, improving the accuracy by 54.39%. Furthermore, Natural-DaCoDe was tested on a natural language text benchmark, significantly outperforming the state-of-the-art approaches by 22%. This may suggest that Natural-DaCoDe could be applied to various source-code-related tasks beyond code completion.
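As a rough illustration of the idea of combining a naturalness signal with a completion model's confidence (a minimal sketch, not the authors' implementation): naturalness is approximated here by a character-bigram cross-entropy, and the completion accuracies and labels are hypothetical.

```python
# Illustrative sketch only: approximate "naturalness" with a character-bigram
# cross-entropy and pair it with a completion model's accuracy as two features
# of a contamination classifier. Feature values and labels are hypothetical.
import math
from collections import Counter
from sklearn.linear_model import LogisticRegression

def bigram_cross_entropy(code: str) -> float:
    """Average negative log2 probability of each character given its predecessor."""
    pairs = Counter(zip(code, code[1:]))
    unigrams = Counter(code[:-1])
    total = 0.0
    for (a, b), n in pairs.items():
        p = n / unigrams[a]
        total += -n * math.log2(p)
    return total / max(1, len(code) - 1)

# Hypothetical training data: (source code, completion-model accuracy, contaminated?)
samples = [
    ("def add(a, b):\n    return a + b", 0.95, 1),
    ("def zq(x9, _w):\n    return x9 ^ _w << 3", 0.90, 0),
    ("for i in range(10):\n    print(i)", 0.40, 0),
    ("class Node:\n    def __init__(self, v):\n        self.v = v", 0.97, 1),
]
X = [[bigram_cross_entropy(code), acc] for code, acc, _ in samples]
y = [label for _, _, label in samples]

clf = LogisticRegression().fit(X, y)  # naturalness + confidence -> contaminated?
print(clf.predict([[bigram_cross_entropy("def sub(a, b):\n    return a - b"), 0.93]]))
```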
Aleti, Aldeida
Lam Nguyen Tung, Steven Cho, Xiaoning Du, Neelofar Neelofar, Valerio Terragni, Stefano Ruberto, and Aldeida Aleti (Monash University, Australia; University of Auckland, New Zealand; Royal Melbourne Institute of Technology, Australia; Joint Research Centre at the European Commission, Italy)
Alhamed, Mohammed
![]() Gül Çalıklı and Mohammed Alhamed (University of Glasgow, UK; Applied Behaviour Systems, UK) Software development Effort Estimation (SEE) comprises predicting the most realistic amount of effort (e.g., in work hours) required to develop or maintain software based on incomplete, uncertain, and noisy input. Expert judgment is the dominant SEE strategy used in the industry. Yet, expert-based judgment can provide inaccurate effort estimates, leading to projects’ poor budget planning and cost and time overruns, negatively impacting the world economy. Large Language Models (LLMs) are good candidates to assist software professionals in effort estimation. However, their effective leveraging for SEE requires thoroughly investigating their limitations and to what extent they overlap with those of (human) software practitioners. One primary limitation of LLMs is the sensitivity of their responses to prompt changes. Similarly, empirical studies showed that changes in the request format (e.g., rephrasing) could impact (human) software professionals’ effort estimates. This paper reports the first study that replicates a series of SEE experiments, which were initially carried out with software professionals (humans) in the literature. Our study aims to investigate how LLMs’ effort estimates change due to the transition from the traditional request format (i.e., "How much effort is required to complete X?”) to the alternative request format (i.e., "How much can be completed in Y work hours?”). Our experiments involved three different LLMs (GPT-4, Gemini 1.5 Pro, Llama 3.1) and 88 software project specifications (per treatment in each experiment), resulting in 880 prompts, in total that we prepared using 704 user stories from three large-scale open-source software projects (Hyperledger Fabric, Mulesoft Mule, Spring XD). Our findings align with the original experiments conducted with software professionals: The first four experiments showed that LLMs provide lower effort estimates due to transitioning from the traditional to the alternative request format. The findings of the fifth and first experiments detected that LLMs display patterns analogous to anchoring bias, a human cognitive bias defined as the tendency to stick to the anchor (i.e., the "Y work-hours” in the alternative request format). Our findings provide crucial insights into facilitating future human-AI collaboration and prompt designs for improved effort estimation accuracy. ![]() |
Al-Kaswan, Ali
Ali Al-Kaswan, Sebastian Deatc, Begüm Koç, Arie van Deursen, and Maliheh Izadi (Delft University of Technology, Netherlands)
Al Mamun, Md Afif
![]() Borui Yang, Md Afif Al Mamun, Jie M. Zhang, and Gias Uddin (King's College London, UK; University of Calgary, Canada; York University, Canada) Large Language Models (LLMs) are prone to hallucinations, e.g., factually incorrect information, in their responses. These hallucinations present challenges for LLM-based applications that demand high factual accuracy. Existing hallucination detection methods primarily depend on external resources, which can suffer from issues such as low availability, incomplete coverage, privacy concerns, high latency, low reliability, and poor scalability. There are also methods depending on output probabilities, which are often inaccessible for closed-source LLMs like GPT models. This paper presents MetaQA, a self-contained hallucination detection approach that leverages metamorphic relation and prompt mutation. Unlike existing methods, MetaQA operates without any external resources and is compatible with both open-source and closed-source LLMs. MetaQA is based on the hypothesis that if an LLM’s response is a hallucination, the designed metamorphic relations will be violated. We compare MetaQA with the state-of-the-art zero-resource hallucination detection method, SelfCheckGPT, across multiple datasets, and on two open-source and two closed-source LLMs. Our results reveal that MetaQA outperforms SelfCheckGPT in terms of precision, recall, and f1 score. For the four LLMs we study, MetaQA outperforms SelfCheckGPT with a superiority margin ranging from 0.041 - 0.113 (for precision), 0.143 - 0.430 (for recall), and 0.154 - 0.368 (for F1-score). For instance, with Mistral-7B, MetaQA achieves an average F1-score of 0.435, compared to SelfCheckGPT’s F1-score of 0.205, representing an improvement rate of 112.2%. MetaQA also demonstrates superiority across all different categories of questions. ![]() |
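To picture the metamorphic-relation idea behind this kind of self-contained checking (a hedged sketch, not MetaQA itself): rephrase a question several ways, query the same model, and treat low agreement across paraphrases as a violated relation. The `ask_llm` callable is a hypothetical stand-in for any chat-model API.

```python
# Sketch of metamorphic-relation-style hallucination checking; `ask_llm` is a
# hypothetical callable (prompt -> answer string) standing in for a real LLM API.
from difflib import SequenceMatcher

def mutate_prompts(question: str) -> list[str]:
    # Simple template-based prompt mutations; real systems use richer rewrites.
    return [
        question,
        f"Please answer concisely: {question}",
        f"{question} Answer in one sentence.",
    ]

def consistency(answers: list[str]) -> float:
    # Average pairwise textual similarity; a crude proxy for agreement.
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def looks_like_hallucination(question: str, ask_llm, threshold: float = 0.5) -> bool:
    answers = [ask_llm(p) for p in mutate_prompts(question)]
    return consistency(answers) < threshold  # violated relation -> suspicious

# Usage with a trivial stub model (always answers the same) for demonstration:
print(looks_like_hallucination("Who wrote Dubliners?", lambda p: "James Joyce"))
```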
Alon, Yoav
Yoav Alon and Cristina David (University of Bristol, UK) Large Language Models (LLMs) have been shown to struggle with long-term planning, which may be caused by the limited way in which they explore the space of possible solutions. We propose an architecture where a Reinforcement Learning (RL) Agent guides an LLM's space exploration: (1) the Agent has access to domain-specific information, and can therefore make decisions about the quality of candidate solutions based on specific and relevant metrics, which were not explicitly considered by the LLM's training objective; (2) the LLM can focus on generating immediate next steps, without the need for long-term planning. We allow non-linear reasoning by exploring alternative paths and backtracking. We evaluate this architecture on the program equivalence task, and compare it against Chain of Thought (CoT) and Tree of Thoughts (ToT). We assess both the downstream task, i.e., the binary classification, and the intermediate reasoning steps. Our approach compares favorably against CoT and ToT.
Altmayer Pizzorno, Juan
![]() Juan Altmayer Pizzorno and Emery D. Berger (University of Massachusetts at Amherst, USA; Amazon Web Services, USA) Testing is an essential part of software development. Test generation tools attempt to automate the otherwise labor-intensive task of test creation, but generating high-coverage tests remains challenging. This paper proposes CoverUp, a novel approach to driving the generation of high-coverage Python regression tests. CoverUp combines coverage analysis, code context, and feedback in prompts that iteratively guide the LLM to generate tests that improve line and branch coverage. We evaluate our prototype CoverUp implementation across a benchmark of challenging code derived from open-source Python projects and show that CoverUp substantially improves on the state of the art. Compared to CodaMosa, a hybrid search/LLM-based test generator, CoverUp achieves a per-module median line+branch coverage of 80% (vs. 47%). Compared to MuTAP, a mutation- and LLM-based test generator, CoverUp achieves an overall line+branch coverage of 89% (vs. 77%). We also demonstrate that CoverUp’s performance stems not only from the LLM used but from the combined effectiveness of its components. ![]() |
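To make the iterative, coverage-guided prompting loop concrete, here is a minimal sketch under stated assumptions: `measure_coverage` and `generate_tests_with_llm` are hypothetical helpers, not CoverUp's actual interfaces (the real tool integrates coverage analysis, code context, and feedback into its prompts).

```python
# Sketch of a coverage-feedback loop for LLM test generation. Both helper
# callables are hypothetical placeholders, not CoverUp's actual components.

def coverage_guided_test_generation(module_source: str,
                                    measure_coverage,        # tests -> (percent, uncovered lines)
                                    generate_tests_with_llm, # prompt -> new test code
                                    rounds: int = 5,
                                    target: float = 90.0) -> list[str]:
    tests: list[str] = []
    for _ in range(rounds):
        percent, uncovered = measure_coverage(tests)
        if percent >= target or not uncovered:
            break
        prompt = (
            "Write pytest tests for the module below.\n"
            f"Lines still uncovered: {sorted(uncovered)[:20]}\n\n{module_source}"
        )
        tests.append(generate_tests_with_llm(prompt))
    return tests

# Demo with trivial stubs (no real coverage measurement or LLM involved):
print(coverage_guided_test_generation(
    "def inc(x):\n    return x + 1",
    measure_coverage=lambda tests: (50.0 + 10 * len(tests), {2}),
    generate_tests_with_llm=lambda prompt: "def test_inc():\n    assert inc(1) == 2"))
```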
Apel, Sven
Max Weber, Alina Mailach, Sven Apel, Janet Siegmund, Raimund Dachselt, and Norbert Siegmund (Leipzig University, Germany; ScaDS.AI Dresden-Leipzig, Germany; Saarland Informatics Campus, Germany; Saarland University, Germany; Chemnitz University of Technology, Germany; Dresden University of Technology, Germany) Debugging performance bugs in configurable software systems is a complex and time-consuming task that requires not only fixing a bug, but also understanding its root cause. While there is a vast body of literature on debugging strategies, there is no consensus on general debugging. This makes it difficult to provide concrete guidance for developers, especially for configuration-dependent performance bugs. The goal of our work is to alleviate this situation by providing a framework to describe debugging strategies in a more general, unifying way. We conducted a user study with 12 professional developers who debugged a performance bug in a real-world configurable system. To observe developers in an unobtrusive way, we provided an immersive virtual reality tool, SoftVR, giving them a large degree of freedom to choose their preferred debugging strategy. The results show that the existing documentation of strategies is too coarse-grained and intermixed to identify successful approaches. In a subsequent qualitative analysis, we devised a coding framework to reason about debugging approaches. With this framework, we identified five goal-oriented episodes that developers employ, which they also confirmed in subsequent interviews. Our work provides a unified description of debugging strategies, giving researchers a common foundation for studying debugging, and giving practitioners and teachers guidance on successful debugging strategies.
Arvan, Erfan
Nima Karimipour, Erfan Arvan, Martin Kellogg, and Manu Sridharan (University of California at Riverside, USA; New Jersey Institute of Technology, USA) Null-pointer exceptions are a serious problem for Java, and researchers have developed type-based nullness checking tools to prevent them. These tools, however, have a downside: they require developers to write nullability annotations, which is time-consuming and hinders adoption. Researchers have therefore proposed nullability annotation inference tools, whose goal is to (partially) automate the task of annotating a program for nullability. However, prior works rely on differing theories of what makes a set of nullability annotations good, making it challenging to compare their effectiveness. In this work, we identify a systematic bias in some prior experimental evaluations of these tools: the use of “type reconstruction” experiments to see if a tool can recover erased developer-written annotations. We show that developers make semantic code changes while adding annotations to facilitate typechecking, leading such experiments to overestimate the effectiveness of inference tools on never-annotated code. We propose a new definition of the “best” inferred annotations for a program that avoids this bias, based on a systematic exploration of the design space. With this new definition, we perform the first head-to-head comparison of three extant nullability inference tools. Our evaluation shows the complementary strengths of the tools and the remaining weaknesses that could be addressed in future work.
Bacchelli, Alberto
Francesco Sovrano, Adam Bauer, and Alberto Bacchelli (ETH Zurich, Switzerland; University of Zurich, Switzerland) Traditionally, software vulnerability detection research has focused on individual small functions due to earlier language processing technologies’ limitations in handling larger inputs. However, this function-level approach may miss bugs that span multiple functions and code blocks. Recent advancements in artificial intelligence have enabled processing of larger inputs, leading everyday software developers to increasingly rely on chat-based large language models (LLMs) like GPT-3.5 and GPT-4 to detect vulnerabilities across entire files, not just within functions. This new development practice requires researchers to urgently investigate whether commonly used LLMs can effectively analyze large file-sized inputs, in order to provide timely insights for software developers and engineers about the pros and cons of this emerging technological trend. Hence, the goal of this paper is to evaluate the effectiveness of several state-of-the-art chat-based LLMs, including the GPT models, in detecting in-file vulnerabilities. We conducted a costly investigation into how the performance of LLMs varies based on vulnerability type, input size, and vulnerability location within the file. To give enough statistical power (β ≥ .8) to our study, we could only focus on the three most common (as well as dangerous) vulnerabilities: XSS, SQL injection, and path traversal. Our findings indicate that the effectiveness of LLMs in detecting these vulnerabilities is strongly influenced by both the location of the vulnerability and the overall size of the input. Specifically, regardless of the vulnerability type, LLMs tend to significantly (p < .05) underperform when detecting vulnerabilities located toward the end of larger files—a pattern we call the lost-in-the-end effect. Finally, to further support software developers and practitioners, we also explored the optimal input size for these LLMs and presented a simple strategy for identifying it, which can be applied to other models and vulnerability types. Eventually, we show how adjusting the input size can lead to significant improvements in LLM-based vulnerability detection, with an average recall increase of over 37% across all models.
Valentin Bourcier, Pooja Rani, Maximilian Ignacio Willembrinck Santander, Alberto Bacchelli, and Steven Costiou (University of Lille - Inria - CNRS - Centrale Lille - UMR 9189 CRIStAL, France; University of Zurich, Switzerland) Debugging consists in understanding the behavior of a program to identify and correct its defects. Breakpoints are the most commonly used debugging tool and aim to facilitate the debugging process by allowing developers to interrupt a program’s execution at a source code location of their choice and inspect the state of the program. Researchers suggest that in systems developed using object-oriented programming (OOP), traditional breakpoints may not be an effective method for debugging. In OOP, developers create code in classes, which at runtime are instantiated as objects: entities with their own state and behavior that can interact with one another. Traditional breakpoints are set within the class code, halting execution for every object that shares that class’s code. This leads to unnecessary interruptions for developers who are focused on monitoring the behavior of a specific object.
As an answer to this challenge, researchers proposed object-centric debugging, an approach based on debugging tools that focus on objects rather than classes. In particular, using object-centric breakpoints, developers can select specific objects (rather than classes) for which the execution must be interrupted. Even though it seems reasonable that this approach may ease the debugging process by reducing the time and actions needed for debugging objects, no research has yet verified its actual impact. To investigate the impact of object-centric breakpoints on the debugging process, we devised and conducted a controlled experiment with 81 developers who spent an average of 1 hour and 30 minutes each on the study. The experiment required participants to complete two debugging tasks using debugging tools with vs. without object-centric breakpoints. We found no significant effect from the use of object-centric breakpoints on the number of actions required to debug or the effectiveness in understanding or fixing the bug. However, for one of the two tasks, we measured a statistically significant reduction in debugging time for participants who used object-centric breakpoints, while for the other task, there was a statistically significant increase. Our analysis suggests that the impact of object-centric breakpoints varies depending on the context and the specific nature of the bug being addressed. In particular, our analysis indicates that object-centric breakpoints can speed up the process of locating the root cause of a bug when the bug can be replicated without needing to restart the program. We discuss the implications of these findings for debugging practices and future research. ![]() ![]() |
Badoux, Nicolas
Flavio Toffalini, Nicolas Badoux, Zurab Tsinadze, and Mathias Payer (EPFL, Switzerland; Ruhr University Bochum, Germany)
Bagheri, Hamid
![]() Md Rashedul Hasan, Mohammad Rashedul Hasan, and Hamid Bagheri (University of Nebraska-Lincoln, USA) Optimizing object-relational database mapping (ORM) design is crucial for performance and scalability in modern software systems. However, widely used ORM tools offer limited support for exploring performance tradeoffs, often enforcing a single design and overlooking alternatives, which can lead to suboptimal outcomes. While systematic tradeoff analysis can reveal Pareto-optimal designs, its high computational cost and poor scalability hinder practical adoption. This paper presents DesignTradeoffSculptor, an extensible tool suite for efficient, scalable tradeoff analysis in ORM database design. Leveraging advanced Transformer-based deep learning models—trained and fine-tuned on formally analyzed database designs—and framing design exploration as a Natural Language Processing task, DesignTradeoffSculptor efficiently identifies and removes suboptimal designs, sharply reducing the number of candidates requiring costly tradeoff analysis. Experiments show that DesignTradeoffSculptor uncovers optimal designs missed by leading ORM tools and improves analysis efficiency by over 98.21%, reducing tradeoff analysis time from 15 days to just 18 minutes, demonstrating the transformative potential of integrating formal methods with deep learning. ![]() |
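As a loose illustration of the pruning idea (pre-filtering candidate designs with a learned scorer so that only promising ones undergo expensive tradeoff analysis), consider the hedged sketch below; `score_design` stands in for the paper's Transformer-based model, and the metrics and candidates are made up.

```python
# Sketch: prune ORM design candidates with a cheap learned score, then run the
# expensive analysis only on survivors and keep the Pareto-optimal ones.
# `score_design` and `expensive_tradeoff_analysis` are hypothetical placeholders.

def pareto_front(points):  # points: list of (design, (time_cost, space_cost))
    front = []
    for d, m in points:
        dominated = any(all(o <= v for o, v in zip(om, m)) and om != m
                        for _, om in points)
        if not dominated:
            front.append((d, m))
    return front

def sculpt(candidates, score_design, expensive_tradeoff_analysis, keep_ratio=0.1):
    ranked = sorted(candidates, key=score_design, reverse=True)
    survivors = ranked[: max(1, int(len(ranked) * keep_ratio))]
    measured = [(d, expensive_tradeoff_analysis(d)) for d in survivors]
    return pareto_front(measured)

# Demo with deterministic toy stand-ins for the scorer and the analysis:
designs = [f"design-{i}" for i in range(100)]
front = sculpt(designs,
               score_design=lambda d: sum(map(ord, d)) % 100,
               expensive_tradeoff_analysis=lambda d: (sum(map(ord, d)) % 7,
                                                      sum(map(ord, d)) % 11))
print(front)
```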
Bai, Linxiao
Haoran Liu, Shanshan Li, Zhouyang Jia, Yuanliang Zhang, Linxiao Bai, Si Zheng, Xiaoguang Mao, and Xiangke Liao (National University of Defense Technology, China)
Bartel, Alexandre
Bruno Kreyssig and Alexandre Bartel (Umeå University, Sweden) While multiple recent publications on detecting Java Deserialization Vulnerabilities highlight the increasing relevance of the topic, until now no proper benchmark has been established to evaluate the individual approaches. Hence, it has become increasingly difficult to show improvements over previous tools and the trade-offs that were made. In this work, we synthesize the main challenges in gadget chain detection. More specifically, we unveil the constraints that program analysis faces in the context of gadget chain detection. From there, we develop Gleipner: the first synthetic, large-scale, and systematic benchmark to validate the effectiveness of algorithms for detecting gadget chains in the Java programming language. We then benchmark seven previous publications in the field using Gleipner. The results show that (1) our benchmark provides a transparent, qualitative, and sound measurement of the maturity of gadget chain detection tools, (2) Gleipner alleviates severe benchmarking flaws which were previously common in the field, and (3) state-of-the-art tools still struggle with most challenges in gadget chain detection.
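For readers new to the problem, gadget chain detection can be pictured as reachability search over a call graph from deserialization entry points (such as `readObject`) to dangerous sinks. The toy graph below is hypothetical and omits everything that makes the real problem hard (points-to analysis, dynamic dispatch, reflection); it only conveys the shape of the search.

```python
# Sketch: gadget chain detection as path search over a (toy) call graph.
# The graph, entry points, and sinks below are hypothetical examples.
from collections import deque

call_graph = {
    "A.readObject": ["B.hashCode"],
    "B.hashCode": ["C.invokeHandler"],
    "C.invokeHandler": ["Runtime.exec"],
    "D.readObject": ["D.log"],
}
entry_points = ["A.readObject", "D.readObject"]   # triggered during deserialization
sinks = {"Runtime.exec"}                          # security-sensitive methods

def find_gadget_chains(graph, entries, sinks):
    chains = []
    for entry in entries:
        queue = deque([[entry]])
        while queue:
            path = queue.popleft()
            if path[-1] in sinks:
                chains.append(path)
                continue
            for callee in graph.get(path[-1], []):
                if callee not in path:            # avoid cycles
                    queue.append(path + [callee])
    return chains

print(find_gadget_chains(call_graph, entry_points, sinks))
# [['A.readObject', 'B.hashCode', 'C.invokeHandler', 'Runtime.exec']]
```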
Batole, Fraol
![]() Sayem Mohammad Imtiaz, Astha Singh, Fraol Batole, and Hridesh Rajan (Iowa State University, USA; Tulane University, USA) Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity, harmful responses, and factual inaccuracies. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, drawing inspiration from fault localization via program slicing, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model’s most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model’s overall versatility by altering a smaller portion of the model. Furthermore, dynamic selection allows for a more nuanced and precise model repair compared to a fixed selection strategy. We evaluated our technique on three models from the GPT2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair. ![]() |
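The layer-selection idea (concentrating repair on the most error-sensitive parts of a model) can be sketched roughly as follows, assuming PyTorch is available; this is not IRepair's dynamic-slicing algorithm, just an illustration that ranks layers by gradient magnitude on a toy model and data.

```python
# Rough illustration of "repair only the most sensitive layers": rank layers by
# gradient magnitude on an error-revealing batch, then unfreeze only the top-k.
# The toy model and data are hypothetical; IRepair itself uses dynamic slicing.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16),
                      nn.ReLU(), nn.Linear(16, 2))
x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))   # stand-in "error" batch

loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

# Sensitivity of each parameterized layer = norm of its gradients.
sensitivity = {
    name: sum(p.grad.norm().item() for p in module.parameters())
    for name, module in model.named_children()
    if any(True for _ in module.parameters())
}
top_k = sorted(sensitivity, key=sensitivity.get, reverse=True)[:1]
print("most sensitive layer(s):", top_k)

# Freeze everything except the selected layer(s) before the repair fine-tuning.
for name, module in model.named_children():
    for p in module.parameters():
        p.requires_grad = name in top_k
```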
Bauer, Adam
![]() Francesco Sovrano, Adam Bauer, and Alberto Bacchelli (ETH Zurich, Switzerland; University of Zurich, Switzerland) Traditionally, software vulnerability detection research has focused on individual small functions due to earlier language processing technologies’ limitations in handling larger inputs. However, this function-level approach may miss bugs that span multiple functions and code blocks. Recent advancements in artificial intelligence have enabled processing of larger inputs, leading everyday software developers to increasingly rely on chat-based large language models (LLMs) like GPT-3.5 and GPT-4 to detect vulnerabilities across entire files, not just within functions. This new development practice requires researchers to urgently investigate whether commonly used LLMs can effectively analyze large file-sized inputs, in order to provide timely insights for software developers and engineers about the pros and cons of this emerging technological trend. Hence, the goal of this paper is to evaluate the effectiveness of several state-of-the-art chat-based LLMs, including the GPT models, in detecting in-file vulnerabilities. We conducted a costly investigation into how the performance of LLMs varies based on vulnerability type, input size, and vulnerability location within the file. To give enough statistical power (β ≥ .8) to our study, we could only focus on the three most common (as well as dangerous) vulnerabilities: XSS, SQL injection, and path traversal. Our findings indicate that the effectiveness of LLMs in detecting these vulnerabilities is strongly influenced by both the location of the vulnerability and the overall size of the input. Specifically, regardless of the vulnerability type, LLMs tend to significantly (p < .05) underperform when detecting vulnerabilities located toward the end of larger files—a pattern we call the lost-in-the-end effect. Finally, to further support software developers and practitioners, we also explored the optimal input size for these LLMs and presented a simple strategy for identifying it, which can be applied to other models and vulnerability types. Eventually, we show how adjusting the input size can lead to significant improvements in LLM-based vulnerability detection, with an average recall increase of over 37% across all models. ![]() ![]() |
Bell, Jonathan
![]() Katherine Hough and Jonathan Bell (Northeastern University, USA) Dynamic taint tracking is a program analysis that traces the flow of information through a program. In the Java virtual machine (JVM), there are two prominent approaches for dynamic taint tracking: "shadowing" and "mirroring". Shadowing is able to precisely track information flows, but is also prone to disrupting the semantics of the program under analysis. Mirroring is better able to preserve program semantics, but often inaccurate. The limitations of these approaches are further exacerbated by features introduced in the latest Java versions. In this paper, we propose Galette, an approach for dynamic taint tracking in the JVM that combines aspects of both shadowing and mirroring to provide precise, robust taint tag propagation in modern JVMs. On a benchmark suite of 3,451 synthetic Java programs, we found that Galette was able to propagate taint tags with perfect accuracy while preserving program semantics on all four active long-term support versions of Java. We also found that Galette's runtime and memory overheads were competitive with that of two state-of-the-art dynamic taint tracking systems on a benchmark suite of twenty real-world Java programs. ![]() |
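To give an intuition for the "shadowing" idea (carrying a taint tag alongside each value and propagating it through operations), here is a tiny language-agnostic sketch; Galette itself works at the JVM bytecode level and is far more involved, so nothing below reflects its actual design.

```python
# Minimal illustration of shadow-based taint propagation: each value carries a
# set of taint tags that is merged whenever values are combined.
class Tainted:
    def __init__(self, value, tags=frozenset()):
        self.value, self.tags = value, frozenset(tags)

    def __add__(self, other):
        o_val = other.value if isinstance(other, Tainted) else other
        o_tags = other.tags if isinstance(other, Tainted) else frozenset()
        return Tainted(self.value + o_val, self.tags | o_tags)

    def __repr__(self):
        return f"Tainted({self.value!r}, tags={set(self.tags)})"

user_input = Tainted("alice", {"untrusted"})       # source: tag attached here
greeting = Tainted("Hello, ") + user_input + "!"   # propagation through +
print(greeting)                                     # tag survives the computation

def sink(query):                                    # sink: check tags before use
    if isinstance(query, Tainted) and "untrusted" in query.tags:
        print("warning: tainted data reached a sensitive sink")

sink(greeting)
```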
Berger, Emery D.
![]() Kyla H. Levin, Nicolas van Kempen, Emery D. Berger, and Stephen N. Freund (University of Massachusetts at Amherst, USA; Amazon Web Services, USA; Williams College, USA) ![]() ![]() Juan Altmayer Pizzorno and Emery D. Berger (University of Massachusetts at Amherst, USA; Amazon Web Services, USA) Testing is an essential part of software development. Test generation tools attempt to automate the otherwise labor-intensive task of test creation, but generating high-coverage tests remains challenging. This paper proposes CoverUp, a novel approach to driving the generation of high-coverage Python regression tests. CoverUp combines coverage analysis, code context, and feedback in prompts that iteratively guide the LLM to generate tests that improve line and branch coverage. We evaluate our prototype CoverUp implementation across a benchmark of challenging code derived from open-source Python projects and show that CoverUp substantially improves on the state of the art. Compared to CodaMosa, a hybrid search/LLM-based test generator, CoverUp achieves a per-module median line+branch coverage of 80% (vs. 47%). Compared to MuTAP, a mutation- and LLM-based test generator, CoverUp achieves an overall line+branch coverage of 89% (vs. 77%). We also demonstrate that CoverUp’s performance stems not only from the LLM used but from the combined effectiveness of its components. ![]() |
Beyazıt, Mutlu
![]() Yuqing Wang, Mika V. Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä (University of Helsinki, Finland; University of Oulu, Finland; University of Antwerp, Belgium) Microservice-based systems (MSS) may fail with various fault types, due to their complex and dynamic nature. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), human efforts from practitioners are still required for further root cause analysis to diagnose specific fault types and analyze failure reasons for detected abnormal traces, particularly when abnormal traces do not stem directly from specific services. In this paper, we propose a novel AIOps framework, TraFaultDia, to automatically classify abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for a MSS. TraFaultDia is trained on several abnormal trace classification tasks with a few labeled instances from a MSS using a meta-learning approach. After training, TraFaultDia can quickly adapt to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. TraFaultDia’s use cases are scalable depending on how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two representative MSS, TrainTicket and OnlineBoutique, with open datasets. In these datasets, each fault category is tied to the faulty system component(s) (service/pod) with a root cause. Our TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. Our results show that, within the MSS it is trained on, TraFaultDia achieves an average accuracy of 93.26% and 85.20% across 50 new, unseen abnormal trace classification tasks for TrainTicket and OnlineBoutique respectively, when provided with 10 labeled instances for each fault category per task in each system. In the cross-system context, when TraFaultDia is applied to a MSS different from the one it is trained on, TraFaultDia gets an average accuracy of 92.19% and 84.77% for the same set of 50 new, unseen abnormal trace classification tasks of the respective systems, also with 10 labeled instances provided for each fault category per task in each system. ![]() |
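A rough intuition for the few-shot adaptation step (not TraFaultDia's actual meta-learning procedure): given a handful of labeled trace embeddings per fault category, compute one prototype per category and classify new traces by nearest prototype. The embeddings and category names below are random placeholders.

```python
# Sketch of prototype-based few-shot classification of trace embeddings.
# Real systems learn the embedding with meta-learning; here it is random data.
import numpy as np

rng = np.random.default_rng(0)
support = {                      # 10 labeled embeddings per (hypothetical) fault category
    "pod-failure":   rng.normal(0.0, 1.0, size=(10, 32)),
    "network-delay": rng.normal(3.0, 1.0, size=(10, 32)),
    "cpu-stress":    rng.normal(-3.0, 1.0, size=(10, 32)),
}
prototypes = {label: vecs.mean(axis=0) for label, vecs in support.items()}

def classify(trace_embedding):
    return min(prototypes,
               key=lambda lbl: np.linalg.norm(trace_embedding - prototypes[lbl]))

query = rng.normal(3.0, 1.0, size=32)   # should look like "network-delay"
print(classify(query))
```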
Bhattacharjee, Subhankar
Shaila Sharmin, Anwar Hossain Zahid, Subhankar Bhattacharjee, Chiamaka Igwilo, Miryung Kim, and Wei Le (Iowa State University, USA; University of California at Los Angeles, USA)
Biagiola, Matteo
![]() Matteo Biagiola, Robert Feldt, and Paolo Tonella (USI Lugano, Switzerland; Chalmers University of Technology, Sweden) Adaptive Random Testing (ART) has faced criticism, particularly for its computational inefficiency, as highlighted by Arcuri and Briand. Their analysis clarified how ART requires a quadratic number of distance computations as the number of test executions increases, which limits its scalability in scenarios requiring extensive testing to uncover faults. Simulation results support this, showing that the computational overhead of these distance calculations often outweighs ART’s benefits. While various ART variants have attempted to reduce these costs, they frequently do so at the expense of fault detection, lack complexity guarantees, or are restricted to specific input types, such as numerical or discrete data. In this paper, we introduce a novel framework for adaptive random testing that replaces pairwise distance computations with a compact aggregation of past executions, such as counting the q-grams observed in previous runs. Test case selection then leverages this aggregated data to measure diversity (e.g., entropy of q-grams), allowing us to reduce the computational complexity from quadratic to linear. Experiments with a benchmark of six web applications, show that ART with q-grams covers, on average, 4× more unique targets than random testing, and 3.5×more than ART using traditional distance-based methods. ![]() |
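The key trick (replacing pairwise distance computations with an aggregate over past executions) can be sketched as below; the q-gram extraction over "execution traces" is simplified and `run_and_trace` is a toy target, but the per-test selection cost is linear in trace length rather than quadratic in the number of executed tests.

```python
# Sketch of q-gram-based adaptive random testing: keep one global counter of
# q-grams seen in past executions and pick, from a small candidate pool, the
# input whose trace adds the most unseen q-grams. The target below is a toy.
import random
from collections import Counter

Q = 3
seen = Counter()                       # aggregated q-grams of all past executions

def qgrams(trace, q=Q):
    return [tuple(trace[i:i + q]) for i in range(len(trace) - q + 1)]

def novelty(trace):
    return sum(1 for g in qgrams(trace) if seen[g] == 0)

def select_next(candidates, run_and_trace):
    traced = [(c, run_and_trace(c)) for c in candidates]
    best, trace = max(traced, key=lambda ct: novelty(ct[1]))
    seen.update(qgrams(trace))         # O(len(trace)) update, no pairwise distances
    return best

# Toy target: the "trace" is the sequence of branches taken for an integer input.
def run_and_trace(x):
    return ["pos" if x > 0 else "neg", "even" if x % 2 == 0 else "odd",
            "big" if abs(x) > 50 else "small", f"mod3={x % 3}"]

for _ in range(5):
    pool = [random.randint(-100, 100) for _ in range(10)]
    print(select_next(pool, run_and_trace))
```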
Bissyandé, Tegawendé F.
![]() Micheline Bénédicte Moumoula, Abdoul Kader Kaboré, Jacques Klein, and Tegawendé F. Bissyandé (University of Luxembourg, Luxembourg; Interdisciplinary Centre of Excellence in AI for Development (CITADEL), Burkina Faso) With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We evaluate the performance of five (05) LLMs and, eight prompts (08) for the identification of cross-lingual code clones. Additionally, we compare these results against two baseline methods. Finally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. The studies involving LLMs and Embedding models are evaluated using two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.99, for straightforward programming examples. However, they not only perform less well on programs associated with complex programming challenges but also do not necessarily understand the meaning of “code clones” in a cross-lingual setting. We show that embedding models used to represent code fragments from different programming languages in the same representation space enable the training of a basic classifier that outperforms all LLMs by ∼1 and ∼20 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, embeddings provided by embedding models offer suitable representations to achieve state-of-the-art performance in cross-lingual code clone detection. ![]() |
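The "simple classifier over embeddings" baseline can be pictured as follows. Since the paper's embedding model is not reproduced here, this sketch uses a trivial hashing bag-of-tokens embedding and made-up clone pairs; only the pipeline shape (embed both fragments, combine, classify) is the point.

```python
# Sketch: cross-lingual clone classification from code embeddings.
# The embedding is a toy hashing bag-of-tokens; real systems use learned models.
import re
import numpy as np
from sklearn.linear_model import LogisticRegression

DIM = 64
def embed(code: str) -> np.ndarray:
    v = np.zeros(DIM)
    for tok in re.findall(r"[A-Za-z_]\w*", code.lower()):
        v[hash(tok) % DIM] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

def pair_features(a: str, b: str) -> np.ndarray:
    ea, eb = embed(a), embed(b)
    return np.concatenate([np.abs(ea - eb), ea * eb])   # difference + agreement

pairs = [  # hypothetical (python, java, is_clone) training examples
    ("def add(a,b): return a+b", "int add(int a,int b){return a+b;}", 1),
    ("def add(a,b): return a+b", "void log(String m){System.out.println(m);}", 0),
    ("print('hi')", "System.out.println(\"hi\");", 1),
    ("x = sorted(xs)", "if (a > b) { swap(a, b); }", 0),
]
X = np.stack([pair_features(a, b) for a, b, _ in pairs])
y = [label for _, _, label in pairs]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([pair_features("def mul(a,b): return a*b",
                                 "int mul(int a,int b){return a*b;}")]))
```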
Böhme, Marcel
![]() Han Zheng, Flavio Toffalini, Marcel Böhme, and Mathias Payer (EPFL, Switzerland; Ruhr University Bochum, Germany; MPI-SP, Germany) Can a fuzzer cover more code with minimal corruption of the initial seed? Before a seed is fuzzed, the early greybox fuzzers first systematically enumerated slightly corrupted inputs by applying every mutation operator to every part of the seed, once per generated input. The hope of this so-called “deterministic” stage was that simple changes to the seed would be less likely to break the complex file format; the resulting inputs would find bugs in the program logic well beyond the program’s parser. However, when experiments showed that disabling the deterministic stage achieves more coverage, applying multiple mutation operators at the same time to a single input, most fuzzers disabled the deterministic stage by default. Instead of ignoring the deterministic stage, we analyze its potential and substantially improve deterministic exploration. Our deterministic stage is now the default in AFL++, reverting the earlier decision of dropping deterministic exploration. We start by investigating the overhead and the contribution of the deterministic stage to the discovery of coverage-increasing inputs. While the sheer number of generated inputs explains the overhead, we find that only a few critical seeds (20%), and only a few critical bytes in a seed (0.5%) are responsible for the vast majority of the coverage-increasing inputs (83% and 84%, respectively). Hence, we develop an efficient technique, called , to identify these critical seeds / bytes so as to prune a large number of unnecessary inputs. retains the benefits of the classic deterministic stage by only enumerating a tiny part of the total deterministic state space. We evaluate implementation on two benchmarking frameworks, FuzzBench and Magma. Our evaluation shows that outperforms state-of-the-art fuzzers with and without the (old) deterministic stage enabled, both in terms of coverage and bug finding. also discovered 8 new CVEs on exhaustively fuzzed security-critical applications. Finally, has been independently evaluated and integrated into AFL++ as default option. ![]() |
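To make the "critical bytes" observation concrete (only a small fraction of seed bytes accounts for most coverage-increasing mutants, so deterministic mutation can be restricted to them), here is a toy, hedged sketch; `get_coverage` is a hypothetical stand-in for running the instrumented target, and the probing strategy is simplified.

```python
# Sketch: identify critical bytes of a seed by probing each position once, then
# restrict deterministic bit-flips to those positions. `get_coverage` is a
# hypothetical callable returning the set of edges covered by an input.

def critical_bytes(seed: bytes, get_coverage) -> list[int]:
    baseline = get_coverage(seed)
    critical = []
    for i in range(len(seed)):
        probe = bytearray(seed)
        probe[i] ^= 0xFF                       # one cheap probe per byte
        if get_coverage(bytes(probe)) != baseline:
            critical.append(i)
    return critical

def deterministic_stage(seed: bytes, get_coverage):
    interesting = []
    for i in critical_bytes(seed, get_coverage):
        for bit in range(8):                   # exhaustive flips, but only here
            probe = bytearray(seed)
            probe[i] ^= 1 << bit
            if get_coverage(bytes(probe)) - get_coverage(seed):
                interesting.append(bytes(probe))
    return interesting

# Toy target: coverage depends only on the first byte and on whether byte 3 > 0x7F.
toy = lambda data: {("b0", data[0] // 32), ("b3", data[3] > 0x7F)}
print(critical_bytes(b"\x10abc", toy))   # -> [0, 3]
```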
Bosu, Amiangshu
![]() Jaydeb Sarker, Asif Kamal Turzo, and Amiangshu Bosu (University of Nebraska at Omaha, USA; Wayne State University, USA) Toxicity on GitHub can severely impact Open Source Software (OSS) development communities. To mitigate such behavior, a better understanding of its nature and how various measurable characteristics of project contexts and participants are associated with its prevalence is necessary. To achieve this goal, we conducted a large-scale mixed-method empirical study of 2,828 GitHub-based OSS projects randomly selected based on a stratified sampling strategy. Using ToxiCR, an SE domain-specific toxicity detector, we automatically classified each comment as toxic or non-toxic. Additionally, we manually analyzed a random sample of 600 comments to validate ToxiCR's performance and gain insights into the nature of toxicity within our dataset. The results of our study suggest that profanity is the most frequent toxicity on GitHub, followed by trolling and insults. While a project's popularity is positively associated with the prevalence of toxicity, its issue resolution rate has the opposite association. Corporate-sponsored projects are less toxic, but gaming projects are seven times more toxic than non-gaming ones. OSS contributors who have authored toxic comments in the past are significantly more likely to repeat such behavior. Moreover, such individuals are more likely to become targets of toxic texts. ![]() ![]() |
Bourcier, Valentin
Valentin Bourcier, Pooja Rani, Maximilian Ignacio Willembrinck Santander, Alberto Bacchelli, and Steven Costiou (University of Lille - Inria - CNRS - Centrale Lille - UMR 9189 CRIStAL, France; University of Zurich, Switzerland) Debugging consists in understanding the behavior of a program to identify and correct its defects. Breakpoints are the most commonly used debugging tool and aim to facilitate the debugging process by allowing developers to interrupt a program’s execution at a source code location of their choice and inspect the state of the program. Researchers suggest that in systems developed using object-oriented programming (OOP), traditional breakpoints may not be an effective method for debugging. In OOP, developers create code in classes, which at runtime are instantiated as objects: entities with their own state and behavior that can interact with one another. Traditional breakpoints are set within the class code, halting execution for every object that shares that class’s code. This leads to unnecessary interruptions for developers who are focused on monitoring the behavior of a specific object. As an answer to this challenge, researchers proposed object-centric debugging, an approach based on debugging tools that focus on objects rather than classes. In particular, using object-centric breakpoints, developers can select specific objects (rather than classes) for which the execution must be interrupted. Even though it seems reasonable that this approach may ease the debugging process by reducing the time and actions needed for debugging objects, no research has yet verified its actual impact. To investigate the impact of object-centric breakpoints on the debugging process, we devised and conducted a controlled experiment with 81 developers who spent an average of 1 hour and 30 minutes each on the study. The experiment required participants to complete two debugging tasks using debugging tools with vs. without object-centric breakpoints. We found no significant effect from the use of object-centric breakpoints on the number of actions required to debug or the effectiveness in understanding or fixing the bug. However, for one of the two tasks, we measured a statistically significant reduction in debugging time for participants who used object-centric breakpoints, while for the other task, there was a statistically significant increase. Our analysis suggests that the impact of object-centric breakpoints varies depending on the context and the specific nature of the bug being addressed. In particular, our analysis indicates that object-centric breakpoints can speed up the process of locating the root cause of a bug when the bug can be replicated without needing to restart the program. We discuss the implications of these findings for debugging practices and future research.
Brunswicker, Sabine
Weihao Chen, Jia Lin Cheoh, Manthan Keim, Sabine Brunswicker, and Tianyi Zhang (Purdue University, USA)
Burk, Felix
![]() Aryaz Eghbali, Felix Burk, and Michael Pradel (University of Stuttgart, Germany) Python is a dynamic language with applications in many domains, and one of the most popular languages in recent years. To increase code quality, developers have turned to “linters” that statically analyze the source code and warn about potential programming problems. However, the inherent limitations of static analysis and the dynamic nature of Python make it difficult or even impossible for static linters to detect some problems. This paper presents DyLin, the first dynamic linter for Python. Similar to a traditional linter, the approach has an extensible set of checkers, which, unlike in traditional linters, search for specific programming anti-patterns by analyzing the program as it executes. A key contribution of this paper is a set of 15 Python-specific anti-patterns that are hard to find statically but amenable to dynamic analysis, along with corresponding checkers to detect them. Our evaluation applies DyLin to 37 popular open-source Python projects on GitHub and to a dataset of code submitted to Kaggle machine learning competitions, totaling over 683k lines of Python code. The approach reports a total of 68 problems, 48 of which are previously unknown true positives, i.e., a precision of 70.6%. The detected problems include bugs that cause incorrect values, such as inf, incorrect behavior, e.g., missing out on relevant events, unnecessary computations that slow down the program, and unintended data leakage from test data to the training phase of machine learning pipelines. These issues remained unnoticed in public repositories for more than 3.5 years, on average, despite the fact that the corresponding code has been exercised by the developer-written tests. A comparison with popular static linters and a type checker shows that DyLin complements these tools by detecting problems that are missed statically. Based on our reporting of 42 issues to the developers, 31 issues have so far been fixed. ![]() |
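To convey the flavor of a dynamic check (DyLin's real checkers hook into program execution; this standalone sketch merely wraps a function and reports non-finite results, one of the anti-pattern families the abstract mentions), consider the following hedged example.

```python
# Sketch of a tiny dynamic check: warn when a wrapped function produces a
# non-finite float, an issue that static linters usually cannot see.
import functools
import math

def check_finite(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        if isinstance(result, float) and not math.isfinite(result):
            print(f"dynamic-lint: {fn.__name__}{args} returned {result}")
        return result
    return wrapper

@check_finite
def normalize(value, total):
    return value / total if total else float("inf")   # silently produces inf

normalize(3.0, 0)   # -> dynamic-lint: normalize(3.0, 0) returned inf
```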
Cai, Haipeng
![]() Haoran Yang and Haipeng Cai (Washington State University, USA; SUNY Buffalo, USA) Multilingual systems are prevalent and broadly impactful, but also complex due to the intricate interactions between the heterogeneous programming languages the systems are developed in. This complexity is further aggravated by the diversity of cross-language interoperability across different language combinations, resulting in additional, often stealthy cross-language bugs. Yet despite the growing number of tools aimed to discover cross-language bugs, a systematic understanding of such bugs is still lacking. To fill this gap, we conduct the first comprehensive study of cross-language bugs, characterizing them in 5 aspects including their symptoms, locations, manifestation, root causes, and fixes, as well as their relationships. Through careful identification and detailed analysis of 400 cross-language bugs in real-world multilingual projects classified from 54,356 relevant code commits in their GitHub repositories, we revealed not only bug characteristics of those five aspects but also how they compare between two top language combinations in the multilingual world (Python-C and Java-C). In addition to findings of the study as well as its enabling tools and datasets, we also provide practical recommendations regarding the prevention, detection, and patching of cross-language bugs. ![]() ![]() |
Cai, Yufan
![]() Xiaokun Luan, David Sanan, Zhe Hou, Qiyuan Xu, Chengwei Liu, Yufan Cai, Yang Liu, and Meng Sun (Peking University, China; Singapore Institute of Technology, Singapore; Griffith University, Australia; Nanyang Technological University, Singapore; China-Singapore International Joint Research Institute, China; National University of Singapore, Singapore) Proof assistants are software tools for formal modeling and verification of software, hardware, design, and mathematical proofs. Due to the growing complexity and scale of formal proofs, compatibility issues frequently arise when using different versions of proof assistants. These issues result in broken proofs, disrupting the maintenance of formalized theories and hindering the broader dissemination of results within the community. Although existing works have proposed techniques to address specific types of compatibility issues, the overall characteristics of these issues remain largely unexplored. To address this gap, we conduct the first extensive empirical study to characterize compatibility issues, using Isabelle as a case study. We develop a regression testing framework to automatically collect compatibility issues from the Archive of Formal Proofs, the largest repository of formal proofs in Isabelle. By analyzing 12,079 collected issues, we identify their types and symptoms and further investigate their root causes. We also extract updated proofs that address these issues to understand the applied resolution strategies. Our study provides an in-depth understanding of compatibility issues in proof assistants, offering insights that support the development of effective techniques to mitigate these issues. ![]() ![]() |
Çalıklı, Gül
![]() Gül Çalıklı and Mohammed Alhamed (University of Glasgow, UK; Applied Behaviour Systems, UK) Software development Effort Estimation (SEE) comprises predicting the most realistic amount of effort (e.g., in work hours) required to develop or maintain software based on incomplete, uncertain, and noisy input. Expert judgment is the dominant SEE strategy used in the industry. Yet, expert-based judgment can provide inaccurate effort estimates, leading to projects’ poor budget planning and cost and time overruns, negatively impacting the world economy. Large Language Models (LLMs) are good candidates to assist software professionals in effort estimation. However, their effective leveraging for SEE requires thoroughly investigating their limitations and to what extent they overlap with those of (human) software practitioners. One primary limitation of LLMs is the sensitivity of their responses to prompt changes. Similarly, empirical studies showed that changes in the request format (e.g., rephrasing) could impact (human) software professionals’ effort estimates. This paper reports the first study that replicates a series of SEE experiments, which were initially carried out with software professionals (humans) in the literature. Our study aims to investigate how LLMs’ effort estimates change due to the transition from the traditional request format (i.e., "How much effort is required to complete X?”) to the alternative request format (i.e., "How much can be completed in Y work hours?”). Our experiments involved three different LLMs (GPT-4, Gemini 1.5 Pro, Llama 3.1) and 88 software project specifications (per treatment in each experiment), resulting in 880 prompts, in total that we prepared using 704 user stories from three large-scale open-source software projects (Hyperledger Fabric, Mulesoft Mule, Spring XD). Our findings align with the original experiments conducted with software professionals: The first four experiments showed that LLMs provide lower effort estimates due to transitioning from the traditional to the alternative request format. The findings of the fifth and first experiments detected that LLMs display patterns analogous to anchoring bias, a human cognitive bias defined as the tendency to stick to the anchor (i.e., the "Y work-hours” in the alternative request format). Our findings provide crucial insights into facilitating future human-AI collaboration and prompt designs for improved effort estimation accuracy. ![]() |
Canfora, Gerardo
Gregorio Dalia, Andrea Di Sorbo, Corrado Aaron Visaggio, and Gerardo Canfora (University of Sannio, Italy)
Cao, Jialun
![]() Xiao Chen, Hengcheng Zhu, Jialun Cao, Ming Wen, and Shing-Chi Cheung (Hong Kong University of Science and Technology, China; Huazhong University of Science and Technology, China) Debugging can be much facilitated if one can identify the evolution commit that introduced the bug leading to a detected failure (aka. bug-inducing commit, BIC). Although one may, in theory, locate BICs by executing the detected failing test on various historical commit versions, it is impractical when the test cannot be executed on some of those versions. On the other hand, existing static techniques often assume the availability of additional information such as patches and bug reports, or the applicability of predefined heuristics like commit chronology. However, these approaches are ineffective when such assumptions do not hold, which are often the case in practice. To address these limitations, we propose SEMBIC to identify the BIC of a bug by statically tracking the semantic changes in the execution path prescribed by the failing test across successive historical commit versions. Our insight is that the greater the semantic changes a commit introduces concerning the failing execution path of a target bug, the more likely it is to be the BIC. To distill semantic changes relevant to the failure, we focus on three fine-grained semantic properties. We evaluate the performance of SEMBIC on a benchmark containing 199 real-world bugs from 12 open-source projects. We found that SEMBIC can identify BICs with high accuracy – it ranks the BIC as top 1 for 88 out of 199 bugs, and achieves an MRR of 0.520, outperforming the state-of-the-art technique by 29.4% and 13.6%, respectively. ![]() ![]() |
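The ranking intuition (the more semantic change a commit introduces on the failing execution path, the more likely it is the bug-inducing commit) can be sketched as below; the commit data and the notion of "semantic changes" are toy placeholders, not SEMBIC's three fine-grained properties.

```python
# Sketch: rank candidate commits by how much they change the functions that lie
# on the failing test's execution path. All data below is hypothetical.

failing_path = {"Parser.parse", "Lexer.next", "Cache.get"}   # functions executed

commits = {   # commit -> {changed function: number of semantic changes}
    "c101": {"Docs.render": 7},
    "c102": {"Lexer.next": 3, "Cache.get": 1},
    "c103": {"Parser.parse": 1, "Util.trim": 5},
}

def path_change_score(changes, path):
    return sum(n for fn, n in changes.items() if fn in path)

ranking = sorted(commits,
                 key=lambda c: path_change_score(commits[c], failing_path),
                 reverse=True)
print(ranking)   # ['c102', 'c103', 'c101'] -> c102 is the top BIC suspect
```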
Cao, Junming
Junming Cao, Xuwen Xiang, Mingfei Cheng, Bihuan Chen, Xinyan Wang, You Lu, Chaofeng Sha, Xiaofei Xie, and Xin Peng (Fudan University, China; Singapore Management University, Singapore)
Cao, Liqing
Liqing Cao, Haofeng Li, Chenghang Shi, Jie Lu, Haining Meng, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia)
Cao, Ruochen
![]() Ruichao Liang, Jing Chen, Ruochen Cao, Kun He, Ruiying Du, Shuhua Li, Zheng Lin, and Cong Wu (Wuhan University, China; University of Hong Kong, Hong Kong) Smart contracts, as Turing-complete programs managing billions of assets in decentralized finance, are prime targets for attackers. While fuzz testing seems effective for detecting vulnerabilities in these programs, we identify several significant challenges when targeting smart contracts: (i) the stateful nature of these contracts requires stateful exploration, but current fuzzers rely on transaction sequences to manipulate contract states, making the process inefficient; (ii) contract execution is influenced by the continuously changing blockchain environment, yet current fuzzers are limited to local deployments, failing to test contracts in real-world scenarios. These challenges hinder current fuzzers from uncovering hidden vulnerabilities, i.e., those concealed in deep contract states and specific blockchain environments. In this paper, we present SmartShot, a mutable snapshot-based fuzzer to hunt hidden vulnerabilities within smart contracts. We innovatively formulate contract states and blockchain environments as directly fuzzable elements and design mutable snapshots to quickly restore and mutate these elements. SmartShot features a symbolic taint analysis-based mutation strategy along with double validation to soundly guide the state mutation. SmartShot mutates blockchain environments using contract’s historical on-chain states, providing real-world execution contexts. We propose a snapshot checkpoint mechanism to integrate mutable snapshots into SmartShot’s fuzzing loops. These innovations enable SmartShot to effectively fuzz contract states, test contracts across varied and realistic blockchain environments, and support on-chain fuzzing. Experimental results show that SmartShot is effective to detect hidden vulnerabilities with the highest code coverage and lowest false positive rate. SmartShot is 4.8× to 20.2× faster than state-of-the-art tools, identifying 2,150 vulnerable contracts out of 42,738 real-world contracts which is 2.1× to 13.7× more than other tools. SmartShot has demonstrated its real-world impact by detecting vulnerabilities that are only discoverable on-chain and uncovering 24 0-day vulnerabilities in the latest 10,000 deployed contracts. ![]() |
|
Cao, Shaoheng |
![]() Shaoheng Cao, Renyi Chen, Wenhua Yang, Minxue Pan, and Xuandong Li (Nanjing University, China; Samsung Electronics, China; Nanjing University of Aeronautics and Astronautics, China) Model-based testing (MBT) has been an important methodology in software engineering, attracting extensive research attention for over four decades. However, despite its academic acclaim, studies examining the impact of MBT in industrial environments—particularly regarding its extended effects—remain limited and yield unclear results. This gap may contribute to the challenges in establishing a study environment for implementing and applying MBT in production settings to evaluate its impact over time. To bridge this gap, we collaborated with an industrial partner to undertake a comprehensive, longitudinal empirical study employing mixed methods. Over two months, we implemented our MBT tool within the corporation, assessing the immediate and extended effectiveness and efficiency of MBT compared to script-writing-based testing. Through a mix of quantitative and qualitative methods—spanning controlled experiments, questionnaire surveys, and interviews—our study uncovers several insightful findings. These include differences in effectiveness and maintainability between immediate and extended MBT application, the evolving perceptions and expectations of engineers regarding MBT, and more. Leveraging these insights, we propose actionable implications for both the academic and industrial communities, aimed at bolstering confidence in MBT adoption and investment for software testing purposes. ![]() |
|
Cao, Yiheng |
![]() Susheng Wu, Ruisi Wang, Yiheng Cao, Bihuan Chen, Zhuotong Zhou, Yiheng Huang, JunPeng Zhao, and Xin Peng (Fudan University, China) Branching repositories facilitates efficient software development but can also inadvertently propagate vulnerabilities. When an original branch is patched, other unfixed branches remain vulnerable unless the patch is successfully ported. However, due to inherent discrepancies between branches, many patches cannot be directly applied and require manual intervention, which is time-consuming and leads to delays in patch porting, increasing vulnerability risks. Existing automated patch porting approaches are prone to errors, as they often overlook essential semantic and syntactic context of vulnerability and fail to detect or refine faulty patches. We propose Mystique, a novel LLM-based approach to address these limitations. Mystique first slices the semantic-related statements linked to the vulnerability while ensuring syntactic correctness, allowing it to extract the signatures for both the original patched function and the target vulnerable function. Mystique then utilizes a fine-tuned LLM to generate a fixed function, which is further iteratively checked and refined to ensure successful porting. Our evaluation shows that Mystique achieved a success rate of 0.954 at function level and of 0.924 at CVE level, outperforming state-of-the-art approaches by at least 13.2% at function level and 12.3% at CVE level. Our evaluation also demonstrates Mystique’s superior generality across various projects, bugs, and programming languages. Mystique successfully ported patches for 34 real-world vulnerable branches. ![]() ![]() |
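The "generate, check, refine" loop described above can be sketched as follows. This is not Mystique's implementation; `generate_fix`, `compiles`, and `tests_pass` are assumed stubs standing in for the fine-tuned LLM and its validation steps.

```python
# Hypothetical sketch of iterative patch porting with feedback-driven refinement.

def port_patch(signature, vulnerable_fn, generate_fix, compiles, tests_pass, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        candidate = generate_fix(signature, vulnerable_fn, feedback)
        if not compiles(candidate):
            feedback = "compilation error"
            continue
        if not tests_pass(candidate):
            feedback = "regression or vulnerability test failed"
            continue
        return candidate           # successfully ported patch
    return None                    # give up after max_rounds

if __name__ == "__main__":
    # Trivial stubs: the "fix" succeeds on the second attempt.
    attempts = iter(["broken", "int add(int a, int b) { return a + b; }"])
    result = port_patch(
        signature="add(int,int)",
        vulnerable_fn="int add(int a, int b) { return a - b; }",
        generate_fix=lambda sig, fn, fb: next(attempts),
        compiles=lambda c: c != "broken",
        tests_pass=lambda c: "a + b" in c,
    )
    print(result)
```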
|
Cao, Yuan |
![]() Dezhi Ran, Yuan Cao, Yuzhe Guo, Yuetong Li, Mengzhou Wu, Simin Chen, Wei Yang, and Tao Xie (Peking University, China; Beijing Jiaotong University, China; University of Chicago, USA; University of Texas at Dallas, USA) ![]() |
|
Cao, Yuhang |
![]() Weibin Wu, Yuhang Cao, Ning Yi, Rongyi Ou, and Zibin Zheng (Sun Yat-sen University, China) Question answering (QA) is a fundamental task of large language models (LLMs), which requires LLMs to automatically answer human-posed questions in natural language. However, LLMs are known to distort facts and make non-factual statements (i.e., hallucinations) when dealing with QA tasks, which may affect the deployment of LLMs in real-life situations. In this work, we propose DrHall, a framework for detecting and reducing the factual hallucinations of black-box LLMs with metamorphic testing (MT). We believe that hallucinated answers are unstable. Therefore, when LLMs hallucinate, they are more likely to produce different answers if we use metamorphic testing to make LLMs re-execute the same task with different execution paths, which motivates the design of DrHall. The effectiveness of DrHall is evaluated empirically on three datasets, including a self-built dataset of natural language questions: FactHalluQA, as well as two datasets of programming questions: Refactory and LeetCode. The evaluation results confirm that DrHall can consistently outperform the state-of-the-art baselines, obtaining an average F1 score of over 0.856 for hallucination detection. For hallucination correction, DrHall can also outperform the state-of-the-art baselines, with an average hallucination correction rate of over 53%. We hope that our work can enhance the reliability of LLMs and provide new insights for the research of LLM hallucination mitigation. ![]() |
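A small sketch of the metamorphic idea described above: re-ask the same question along different "execution paths" (here, simple prompt variants) and treat disagreement among the answers as a hallucination signal. The `ask_model` callable and the agreement threshold are assumptions for illustration, not DrHall's actual components.

```python
# Hedged sketch: stability of answers across prompt variants as a hallucination signal.
from collections import Counter

def detect_hallucination(question, ask_model, variants=None, agreement=0.6):
    variants = variants or [
        "{q}",
        "Answer step by step, then give the final answer only: {q}",
        "Answer concisely: {q}",
    ]
    answers = [ask_model(v.format(q=question)).strip().lower() for v in variants]
    top, count = Counter(answers).most_common(1)[0]
    stable = count / len(answers) >= agreement
    return (not stable), top   # (suspected hallucination?, majority answer)

if __name__ == "__main__":
    fake_llm = lambda prompt: "paris" if "concisely" not in prompt else "lyon"
    print(detect_hallucination("What is the capital of France?", fake_llm))
```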
|
Chattaraj, Rajrupa |
![]() Rajrupa Chattaraj and Sridhar Chimalakonda (IIT Tirupati, India) ![]() |
|
Chau, Sze Yiu |
![]() Yuteng Sun, Joyanta Debnath, Wenzheng Hong, Omar Chowdhury, and Sze Yiu Chau (Chinese University of Hong Kong, Hong Kong; Stony Brook University, USA; Independent, China) The X.509 Public Key Infrastructure (PKI) provides a cryptographically verifiable mechanism for authenticating a binding of an entity’s public-key with its identity, presented in a tamper-proof digital certificate. This often serves as a foundational building block for achieving different security guarantees in many critical applications and protocols (e.g., SSL/TLS). Identities in the context of X.509 PKI are often captured as names, which are encoded in certificates as composite records with different optional fields that can store various types of string values (e.g., ASCII, UTF8). Although such flexibility allows for the support of diverse name types (e.g., IP addresses, DNS names) and application scenarios, it imposes on library developers obligations to enforce unnecessarily convoluted requirements. Bugs in enforcing these requirements can lead to potential interoperability and performance issues, and might open doors to impersonation attacks. This paper focuses on analyzing how open-source libraries enforce the constraints regarding the formatting, encoding, and processing of complex name structures on X.509 certificate chains, for the purpose of establishing identities. Our analysis reveals that a portfolio of simplistic testing approaches can expose blatant violations of the requirements in widely used open-source libraries. Although there is a substantial amount of prior work that focused on testing the overall certificate chain validation process of X.509 libraries, the identified violations have escaped their scrutiny. To make matters worse, we observe that almost all the analyzed libraries completely ignore certain pre-processing steps prescribed by the standard. This begs the question of whether it is beneficial to have a standard so flexible but complex that none of the implementations can faithfully adhere to it. With our test results, we argue in the negative, and explain how simpler alternatives (e.g., other forms of identifiers such as Authority and Subject Key Identifiers) can be used to enforce similar constraints with no loss of security. ![]() |
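As a toy illustration of the kind of divergence the study probes (this is deliberately not the X.509/RFC name-matching algorithm): two renderings of "the same" name compare unequal under naive string equality but equal once case and internal whitespace are normalized, which is roughly the kind of pre-processing real libraries are expected to perform under far more involved rules.

```python
# Toy example only; real certificate name comparison follows the standard's
# string-preparation and matching rules, which are much stricter than this.

def naive_equal(a: str, b: str) -> bool:
    return a == b

def normalized_equal(a: str, b: str) -> bool:
    norm = lambda s: " ".join(s.split()).casefold()
    return norm(a) == norm(b)

if __name__ == "__main__":
    issuer_on_record = "CN=Example  CA, O=Example Org"
    issuer_in_cert   = "cn=example ca, o=example org"
    print(naive_equal(issuer_on_record, issuer_in_cert))       # False
    print(normalized_equal(issuer_on_record, issuer_in_cert))  # True
```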
|
Chen, Bihuan |
![]() Junming Cao, Xuwen Xiang, Mingfei Cheng, Bihuan Chen, Xinyan Wang, You Lu, Chaofeng Sha, Xiaofei Xie, and Xin Peng (Fudan University, China; Singapore Management University, Singapore) ![]() ![]() Susheng Wu, Ruisi Wang, Yiheng Cao, Bihuan Chen, Zhuotong Zhou, Yiheng Huang, JunPeng Zhao, and Xin Peng (Fudan University, China) Branching repositories facilitates efficient software development but can also inadvertently propagate vulnerabilities. When an original branch is patched, other unfixed branches remain vulnerable unless the patch is successfully ported. However, due to inherent discrepancies between branches, many patches cannot be directly applied and require manual intervention, which is time-consuming and leads to delays in patch porting, increasing vulnerability risks. Existing automated patch porting approaches are prone to errors, as they often overlook essential semantic and syntactic context of vulnerability and fail to detect or refine faulty patches. We propose Mystique, a novel LLM-based approach to address these limitations. Mystique first slices the semantic-related statements linked to the vulnerability while ensuring syntactic correctness, allowing it to extract the signatures for both the original patched function and the target vulnerable function. Mystique then utilizes a fine-tuned LLM to generate a fixed function, which is further iteratively checked and refined to ensure successful porting. Our evaluation shows that Mystique achieved a success rate of 0.954 at function level and of 0.924 at CVE level, outperforming state-of-the-art approaches by at least 13.2% at function level and 12.3% at CVE level. Our evaluation also demonstrates Mystique’s superior generality across various projects, bugs, and programming languages. Mystique successfully ported patches for 34 real-world vulnerable branches. ![]() ![]() |
|
Chen, Boqi |
![]() Mootez Saad, José Antonio Hernández López, Boqi Chen, Dániel Varró, and Tushar Sharma (Dalhousie University, Canada; University of Murcia, Spain; McGill University, Canada; Linköping University, Sweden) Language models of code have demonstrated remarkable performance across various software engineering and source code analysis tasks. However, their demanding computational resource requirements and consequential environmental footprint remain significant challenges. This work introduces ALPINE, an adaptive programming language-agnostic pruning technique designed to substantially reduce the computational overhead of these models. The proposed method offers a pluggable layer that can be integrated with all Transformer-based models. With ALPINE, input sequences undergo adaptive compression throughout the pipeline, reaching a size that is up to 3× smaller than their initial size, resulting in significantly reduced computational load. Our experiments on two software engineering tasks, defect prediction and code clone detection across three language models CodeBERT, GraphCodeBERT and UniXCoder show that ALPINE achieves up to a 50% reduction in FLOPs, a 58.1% decrease in memory footprint, and a 28.1% improvement in throughput on average. This led to a reduction in CO2 emissions by up to 44.85%. Importantly, it achieves a reduction in computation resources while maintaining up to 98.1% of the original predictive performance. These findings highlight the potential of ALPINE in making language models of code more resource-efficient and accessible while preserving their performance, contributing to the overall sustainability of their adoption in software development. Also, it sheds light on redundant and noisy information in source code analysis corpora, as shown by the substantial sequence compression achieved by ALPINE. ![]() |
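A minimal sketch (assumed, not ALPINE's implementation) of adaptive input pruning for a Transformer layer: score each token, keep only the highest-scoring fraction, and pass the shorter sequence on. The scoring rule here (mean absolute activation) is an illustrative assumption.

```python
# Hedged sketch of token pruning between Transformer layers.
import numpy as np

def prune_tokens(hidden_states: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """hidden_states: (seq_len, dim) activations for one sequence."""
    scores = np.abs(hidden_states).mean(axis=1)      # one importance score per token
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])          # top-k tokens, original order preserved
    return hidden_states[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(128, 768))        # 128 tokens, 768-dim hidden states
    y = prune_tokens(x, keep_ratio=1 / 3)  # roughly the "up to 3x smaller" regime
    print(x.shape, "->", y.shape)          # (128, 768) -> (42, 768)
```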
|
Chen, Chun |
![]() Shoupeng Ren, Lipeng He, Tianyu Tu, Di Wu, Jian Liu, Kui Ren, and Chun Chen (Zhejiang University, China; University of Waterloo, Canada) ![]() |
|
Chen, Chunyang |
![]() Yanqi Su, Zhenchang Xing, Chong Wang, Chunyang Chen, Sherry (Xiwei) Xu, Qinghua Lu, and Liming Zhu (Australian National University, Australia; CSIRO's Data61, Australia; Nanyang Technological University, Singapore; TU Munich, Germany; UNSW, Australia) Exploratory testing (ET) harnesses tester's knowledge, creativity, and experience to create varying tests that uncover unexpected bugs from the end-user's perspective. Although ET has proven effective in system-level testing of interactive systems, the need for manual execution has hindered large-scale adoption. In this work, we explore the feasibility, challenges and road ahead of automated scenario-based ET (a.k.a soap opera testing). We conduct a formative study, identifying key insights for effective manual soap opera testing and challenges in automating the process. We then develop a multi-agent system leveraging LLMs and a Scenario Knowledge Graph (SKG) to automate soap opera testing. The system consists of three multi-modal agents, Planner, Player, and Detector that collaborate to execute tests and identify potential bugs. Experimental results demonstrate the potential of automated soap opera testing, but there remains a significant gap compared to manual execution, especially under-explored scenario boundaries and incorrectly identified bugs. Based on the observation, we envision road ahead for the future of automated soap opera testing, focusing on three key aspects: the synergy of neural and symbolic approaches, human-AI co-learning, and the integration of soap opera testing with broader software engineering practices. These insights aim to guide and inspire the future research. ![]() ![]() Mengzhuo Chen, Zhe Liu, Chunyang Chen, Junjie Wang, Boyu Wu, Jun Hu, and Qing Wang (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TU Munich, Germany) In software development, similar apps often encounter similar bugs due to shared functionalities and implementation methods. However, current automated GUI testing methods mainly focus on generating test scripts to cover more pages by analyzing the internal structure of the app, without targeted exploration of paths that may trigger bugs, resulting in low efficiency in bug discovery. Considering that a large number of bug reports on open source platforms can provide external knowledge for testing, this paper proposes BugHunter, a novel bug-aware automated GUI testing approach that generates exploration paths guided by bug reports from similar apps, utilizing a combination of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Instead of focusing solely on coverage, BugHunter dynamically adapts the testing process to target bug paths, thereby increasing bug detection efficiency. BugHunter first builds a high-quality bug knowledge base from historical bug reports. Then it retrieves relevant reports from this large bug knowledge base using a two-stage retrieval process, and generates test paths based on similar apps’ bug reports. BugHunter also introduces a local and global path-planning mechanism to handle differences in functionality and UI design across apps, and the ambiguous behavior or missing steps in the online bug reports. We evaluate BugHunter on 121 bugs across 71 apps and compare its performance against 16 state-of-the-art baselines. BugHunter achieves 60% improvement in bug detection over the best baseline, with comparable or higher coverage against the baselines. 
Furthermore, BugHunter successfully detects 49 new crash bugs in real-world apps from Google Play, with 33 bugs fixed, 9 confirmed, and 7 pending feedback. ![]() |
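The two-stage retrieval over a bug-report knowledge base described above can be sketched as a cheap keyword filter followed by a finer similarity ranking on the survivors. The reports and the similarity measure below are illustrative assumptions, not BugHunter's actual retrieval pipeline.

```python
# Hedged sketch of coarse-then-fine retrieval over historical bug reports.
from difflib import SequenceMatcher

def two_stage_retrieve(query, reports, top_k=2):
    q_tokens = set(query.lower().split())
    # Stage 1: keep reports sharing at least one keyword with the query.
    coarse = [r for r in reports if q_tokens & set(r.lower().split())]
    # Stage 2: rank the survivors by a finer similarity score.
    ranked = sorted(coarse,
                    key=lambda r: SequenceMatcher(None, query.lower(), r.lower()).ratio(),
                    reverse=True)
    return ranked[:top_k]

if __name__ == "__main__":
    kb = [
        "Crash when rotating screen on the settings page",
        "App freezes after uploading a large photo",
        "Crash on rotating the device while a video plays",
    ]
    print(two_stage_retrieve("crash after rotating the screen", kb))
```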
|
Chen, Jesse |
![]() Tanner Finken, Jesse Chen, and Sazzadur Rahaman (University of Arizona, USA) ![]() |
|
Chen, Jia |
![]() Jia Chen, Yuang He, Peng Wang, Xiaolei Chen, Jie Shi, and Wei Wang (Fudan University, China) ![]() |
|
Chen, Jiachi |
![]() Yanlin Wang, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Mingzhi Mao, Xilin Liu, Yuchi Ma, and Zibin Zheng (Sun Yat-sen University, China; Huawei Cloud Computing Technologies, China) Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation, while coding style differences between LLMs and human developers remain under-explored. In this paper, we empirically analyze the differences in coding style between the code generated by mainstream LLMs and the code written by human developers, and summarize coding style inconsistency taxonomy. Specifically, we first summarize the types of coding style inconsistencies by manually analyzing a large number of generation results. We then compare the code generated by LLMs with the code written by human programmers in terms of readability, conciseness, and robustness. The results reveal that LLMs and developers exhibit differences in coding style. Additionally, we study the possible causes of these inconsistencies and provide some solutions to alleviate the problem. ![]() |
|
Chen, Jing |
![]() Ruichao Liang, Jing Chen, Ruochen Cao, Kun He, Ruiying Du, Shuhua Li, Zheng Lin, and Cong Wu (Wuhan University, China; University of Hong Kong, Hong Kong) Smart contracts, as Turing-complete programs managing billions of assets in decentralized finance, are prime targets for attackers. While fuzz testing seems effective for detecting vulnerabilities in these programs, we identify several significant challenges when targeting smart contracts: (i) the stateful nature of these contracts requires stateful exploration, but current fuzzers rely on transaction sequences to manipulate contract states, making the process inefficient; (ii) contract execution is influenced by the continuously changing blockchain environment, yet current fuzzers are limited to local deployments, failing to test contracts in real-world scenarios. These challenges hinder current fuzzers from uncovering hidden vulnerabilities, i.e., those concealed in deep contract states and specific blockchain environments. In this paper, we present SmartShot, a mutable snapshot-based fuzzer to hunt hidden vulnerabilities within smart contracts. We innovatively formulate contract states and blockchain environments as directly fuzzable elements and design mutable snapshots to quickly restore and mutate these elements. SmartShot features a symbolic taint analysis-based mutation strategy along with double validation to soundly guide the state mutation. SmartShot mutates blockchain environments using contract’s historical on-chain states, providing real-world execution contexts. We propose a snapshot checkpoint mechanism to integrate mutable snapshots into SmartShot’s fuzzing loops. These innovations enable SmartShot to effectively fuzz contract states, test contracts across varied and realistic blockchain environments, and support on-chain fuzzing. Experimental results show that SmartShot is effective to detect hidden vulnerabilities with the highest code coverage and lowest false positive rate. SmartShot is 4.8× to 20.2× faster than state-of-the-art tools, identifying 2,150 vulnerable contracts out of 42,738 real-world contracts which is 2.1× to 13.7× more than other tools. SmartShot has demonstrated its real-world impact by detecting vulnerabilities that are only discoverable on-chain and uncovering 24 0-day vulnerabilities in the latest 10,000 deployed contracts. ![]() |
|
Chen, Jintao |
![]() Siwei Tan, Liqiang Lu, Debin Xiang, Tianyao Chu, Congliang Lang, Jintao Chen, Xing Hu, and Jianwei Yin (Zhejiang University, China) Quantum programs provide exponential speedups compared to classical programs in certain areas, but they also inevitably encounter logical faults. Automatically repairing quantum programs is much more challenging than repairing classical programs due to the non-replicability of data, the vast search space of program inputs, and the new programming paradigm. Existing works based on semantic-based or learning-based program repair techniques are fundamentally limited in repairing efficiency and effectiveness. In this work, we propose HornBro, an efficient framework for automated quantum program repair. The key insight of HornBro lies in the homotopy-like method, which iteratively switches between the classical and quantum parts. This approach allows the repair tasks to be efficiently offloaded to the most suitable platforms, enabling a progressive convergence toward the correct program. We start by designing an implication assertion pragma to enable rigorous specifications of quantum program behavior, which helps to generate a quantum test suite automatically. This suite leverages the orthonormal bases of quantum programs to accommodate different encoding schemes. Given a fixed number of test cases, it allows the maximum input coverage of potential counter-example candidates. Then, we develop a Clifford approximation method with an SMT-based search to transform the patch localization program into a symbolic reasoning problem. Finally, we offload the computationally intensive repair of gate parameters to quantum hardware by leveraging the differentiability of quantum gates. Experiments suggest that HornBro increases the repair success rate by more than 62.5% compared to the existing repair techniques, supporting more types of quantum bugs. It also achieves 35.7× speedup in the repair and 99.9% gate reduction of the patch. ![]() ![]() |
|
Chen, Junjie |
![]() Junjie Chen, Xingyu Fan, Chen Yang, Shuang Liu, and Jun Sun (Tianjin University, China; Renmin University of China, China; Singapore Management University, Singapore) ![]() |
|
Chen, Kai |
![]() Ting Zhou, Yanjie Zhao, Xinyi Hou, Xiaoyu Sun, Kai Chen, and Haoyu Wang (Huazhong University of Science and Technology, China; Australian National University, Australia) Declarative UI frameworks have gained widespread adoption in mobile app development, offering benefits such as improved code readability and easier maintenance. Despite these advantages, the process of translating UI designs into functional code remains challenging and time-consuming. Recent advancements in multimodal large language models (MLLMs) have shown promise in directly generating mobile app code from user interface (UI) designs. However, the direct application of MLLMs to this task is limited by challenges in accurately recognizing UI components and comprehensively capturing interaction logic. To address these challenges, we propose DeclarUI, an automated approach that synergizes computer vision (CV), MLLMs, and iterative compiler-driven optimization to generate and refine declarative UI code from designs. DeclarUI enhances visual fidelity, functional completeness, and code quality through precise component segmentation, Page Transition Graphs (PTGs) for modeling complex inter-page relationships, and iterative optimization. In our evaluation, DeclarUI outperforms baselines on React Native, a widely adopted declarative UI framework, achieving a 96.8% PTG coverage rate and a 98% compilation success rate. Notably, DeclarUI demonstrates significant improvements over state-of-the-art MLLMs, with a 123% increase in PTG coverage rate, up to 55% enhancement in visual similarity scores, and a 29% boost in compilation success rate. We further demonstrate DeclarUI’s generalizability through successful applications to Flutter and ArkUI frameworks. User studies with professional developers confirm that DeclarUI’s generated code meets industrial-grade standards in code availability, modification time, readability, and maintainability. By streamlining app development, improving efficiency, and fostering designer-developer collaboration, DeclarUI offers a practical solution to the persistent challenges in mobile UI development. ![]() |
|
Chen, Liqian |
![]() Xingpei Li, Yan Lei, Zhouyang Jia, Yuanliang Zhang, Haoran Liu, Liqian Chen, Wei Dong, and Shanshan Li (National University of Defense Technology, China; Chongqing University, China) As deep learning (DL) frameworks become widely used, converting models between frameworks is crucial for ecosystem flexibility. However, interestingly, existing model converters commonly focus on syntactic operator API mapping—transpiling operator names and parameters—which results in API compatibility issues (i.e., incompatible parameters, missing operators). These issues arise from semantic inconsistencies due to differences in operator implementation, causing conversion failure or performance degradation. In this paper, we present the first comprehensive study on operator semantic inconsistencies through API mapping analysis and framework source code inspection, revealing that 47% of sampled operators exhibit inconsistencies across DL frameworks, with source code limited to individual layers and no inter-layer interactions. This suggests that layer-wise source code alignment is feasible. Based on this, we propose ModelX, a source-level cross-framework conversion approach that extends operator API mapping by addressing semantic inconsistencies beyond the API level. Experiments on PyTorch-to-Paddle conversion show that ModelX successfully converts 624 out of 686 sampled operators and outperforms two state-of-the-art converters and three popular large language models. Moreover, ModelX achieves minimal metric gaps (avg. all under 3.4%) across 52 models from vision, text, and audio tasks, indicating strong robustness. ![]() |
|
Chen, Mengzhuo |
![]() Mengzhuo Chen, Zhe Liu, Chunyang Chen, Junjie Wang, Boyu Wu, Jun Hu, and Qing Wang (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TU Munich, Germany) In software development, similar apps often encounter similar bugs due to shared functionalities and implementation methods. However, current automated GUI testing methods mainly focus on generating test scripts to cover more pages by analyzing the internal structure of the app, without targeted exploration of paths that may trigger bugs, resulting in low efficiency in bug discovery. Considering that a large number of bug reports on open source platforms can provide external knowledge for testing, this paper proposes BugHunter, a novel bug-aware automated GUI testing approach that generates exploration paths guided by bug reports from similar apps, utilizing a combination of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Instead of focusing solely on coverage, BugHunter dynamically adapts the testing process to target bug paths, thereby increasing bug detection efficiency. BugHunter first builds a high-quality bug knowledge base from historical bug reports. Then it retrieves relevant reports from this large bug knowledge base using a two-stage retrieval process, and generates test paths based on similar apps’ bug reports. BugHunter also introduces a local and global path-planning mechanism to handle differences in functionality and UI design across apps, and the ambiguous behavior or missing steps in the online bug reports. We evaluate BugHunter on 121 bugs across 71 apps and compare its performance against 16 state-of-the-art baselines. BugHunter achieves 60% improvement in bug detection over the best baseline, with comparable or higher coverage against the baselines. Furthermore, BugHunter successfully detects 49 new crash bugs in real-world apps from Google Play, with 33 bugs fixed, 9 confirmed, and 7 pending feedback. ![]() |
|
Chen, Peiran |
![]() Shuaiyu Xie, Jian Wang, Maodong Li, Peiran Chen, Jifeng Xuan, and Bing Li (Wuhan University, China; Zhongguancun Laboratory, China) Distributed tracing is a pivotal technique for software operators to understand and diagnose issues within microservice-based systems, offering a comprehensive view of user requests propagated through various services. However, the unprecedented volume of traces imposes expensive storage and analytical burdens on online systems. Conventional tracing approaches typically rely on random sampling with a fixed probability for each trace, which risks missing valuable traces. Several tail-based sampling methods have thus been proposed to sample traces based on their content. Nevertheless, these methods primarily evaluate traces on an individual basis, neglecting the collective attributes of the sample set in terms of comprehensiveness, balance, and consistency. To address these issues, we propose TracePicker, an optimization-based online sampler designed to enhance the quality of sampled data while mitigating storage burden. TracePicker employs a streaming anomaly detector to capture and retain anomalous traces that are crucial for troubleshooting. For normal traces, the sampling process is segmented into quota allocation and group sampling, both formulated as integer programming problems. By solving these problems using dynamic programming and evolution algorithms, TracePicker selects a high-quality subset of data, minimizing overall information loss. Experimental results demonstrate that TracePicker outperforms existing tail-based sampling methods in terms of both sampling quality and time consumption. ![]() |
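The quota-allocation step described above can be illustrated with a small dynamic program: split a total sampling budget across trace groups so that an assumed per-group utility of retained information is maximized. The utility numbers below are made up for the example; this is not TracePicker's actual objective or solver.

```python
# Hedged sketch: budgeted quota allocation across trace groups via DP.

def allocate_quota(utilities, budget):
    """utilities[g][q] = utility of giving group g a quota of q traces."""
    n = len(utilities)
    best = {0: (0.0, [])}            # best[b] = (total utility, per-group quotas) using budget b
    for g in range(n):
        nxt = {}
        for b, (u, alloc) in best.items():
            for q, gain in enumerate(utilities[g]):
                nb = b + q
                if nb > budget:
                    break
                cand = (u + gain, alloc + [q])
                if nb not in nxt or cand[0] > nxt[nb][0]:
                    nxt[nb] = cand
        best = nxt
    return max(best.values())        # (best utility, per-group quotas)

if __name__ == "__main__":
    # 3 groups; utility grows with quota but with diminishing returns.
    utilities = [
        [0.0, 5.0, 8.0, 9.5],
        [0.0, 3.0, 5.5, 7.0],
        [0.0, 6.0, 9.0, 10.5],
    ]
    print(allocate_quota(utilities, budget=5))  # (20.0, [2, 1, 2])
```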
|
Chen, Qi Alfred |
![]() Yuntianyi Chen, Yuqi Huai, Yirui He, Shilong Li, Changnam Hong, Qi Alfred Chen, and Joshua Garcia (University of California at Irvine, USA) As autonomous driving systems (ADSes) become increasingly complex and integral to daily life, the importance of understanding the nature and mitigation of software bugs in these systems has grown correspondingly. Addressing the challenges of software maintenance in autonomous driving systems (e.g., handling real-time system decisions and ensuring safety-critical reliability) is crucial due to the unique combination of real-time decision-making requirements and the high stakes of operational failures in ADSes. The potential of automated tools in this domain is promising, yet there remains a gap in our comprehension of the challenges faced and the strategies employed during manual debugging and repair of such systems. In this paper, we present an empirical study that investigates bug-fix patterns in ADSes, with the aim of improving reliability and safety. We have analyzed the commit histories and bug reports of two major autonomous driving projects, Apollo and Autoware, from 1,331 bug fixes with the study of bug symptoms, root causes, and bug-fix patterns. Our study reveals several dominant bug-fix patterns, including those related to path planning, data flow, and configuration management. Additionally, we find that the frequency distribution of bug-fix patterns varies significantly depending on their nature and types and that certain categories of bugs are recurrent and more challenging to exterminate. Based on our findings, we propose a hierarchy of ADS bugs and two taxonomies of 15 syntactic bug-fix patterns and 27 semantic bug-fix patterns that offer guidance for bug identification and resolution. We also contribute a benchmark of 1,331 ADS bug-fix instances. ![]() ![]() |
|
Chen, Renyi |
![]() Shaoheng Cao, Renyi Chen, Wenhua Yang, Minxue Pan, and Xuandong Li (Nanjing University, China; Samsung Electronics, China; Nanjing University of Aeronautics and Astronautics, China) Model-based testing (MBT) has been an important methodology in software engineering, attracting extensive research attention for over four decades. However, despite its academic acclaim, studies examining the impact of MBT in industrial environments—particularly regarding its extended effects—remain limited and yield unclear results. This gap may contribute to the challenges in establishing a study environment for implementing and applying MBT in production settings to evaluate its impact over time. To bridge this gap, we collaborated with an industrial partner to undertake a comprehensive, longitudinal empirical study employing mixed methods. Over two months, we implemented our MBT tool within the corporation, assessing the immediate and extended effectiveness and efficiency of MBT compared to script-writing-based testing. Through a mix of quantitative and qualitative methods—spanning controlled experiments, questionnaire surveys, and interviews—our study uncovers several insightful findings. These include differences in effectiveness and maintainability between immediate and extended MBT application, the evolving perceptions and expectations of engineers regarding MBT, and more. Leveraging these insights, we propose actionable implications for both the academic and industrial communities, aimed at bolstering confidence in MBT adoption and investment for software testing purposes. ![]() |
|
Chen, Simin |
![]() Dezhi Ran, Yuan Cao, Yuzhe Guo, Yuetong Li, Mengzhou Wu, Simin Chen, Wei Yang, and Tao Xie (Peking University, China; Beijing Jiaotong University, China; University of Chicago, USA; University of Texas at Dallas, USA) ![]() |
|
Chen, Ting |
![]() Kunsong Zhao, Zihao Li, Weimin Chen, Xiapu Luo, Ting Chen, Guozhu Meng, and Yajin Zhou (Hong Kong Polytechnic University, China; University of Electronic Science and Technology of China, China; Institute of Information Engineering at Chinese Academy of Sciences, China; Zhejiang University, China) WebAssembly has become the preferred smart contract format for various blockchain platforms due to its high portability and near-native execution speed. To effectively understand WebAssembly contracts, it is crucial to recover high-level type signatures because of the limited type information that WebAssembly provides. However, existing studies on type inference for smart contracts primarily center around Ethereum Virtual Machine bytecode, which is not applicable to WebAssembly owing to their differing targets and runtime semantics. This paper introduces WasmHint, a novel solution that leverages deep learning inference to automatically recover high-level parameter and return types from WebAssembly contracts. More specifically, WasmHint constructs a wCFG representation to clarify dependencies within WebAssembly code and simulates its execution to capture type-related operational information. By learning comprehensive code semantics, it infers parameter and return types, with a semantic corrector designed to enhance information coordination. We conduct experiments on a newly constructed dataset containing 77,208 WebAssembly contract functions. The results demonstrate that WasmHint achieves inference accuracies of 80.0% for parameter types and 95.8% for return types, with average improvements of 86.6% and 34.0% over the baseline methods, respectively. ![]() ![]() Shuwei Song, Ting Chen, Ao Qiao, Xiapu Luo, Leqing Wang, Zheyuan He, Ting Wang, Xiaodong Lin, Peng He, Wensheng Zhang, and Xiaosong Zhang (University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada) Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them. ![]() |
|
Chen, Weiguo |
![]() Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Weiguo Chen, Zixuan Dong, Mingyue Leng, and Xiaoguang Mao (National University of Defense Technology, China) The current landscape of binary code summarization predominantly focuses on generating a single summary, which limits its utility and understanding for reverse engineers. Existing approaches often fail to meet the diverse needs of users, such as providing detailed insights into usage patterns, implementation nuances, and design rationale, as observed in the field of source code summarization. This highlights the need for multi-intent binary code summarization to enhance the effectiveness of reverse engineering processes. To address this gap, we propose MiSum, a novel method that leverages multi-modality heterogeneous code graph alignment and learning to integrate both assembly code and pseudo-code. MiSum introduces a unified multi-modality heterogeneous code graph (MM-HCG) that aligns assembly code graphs with pseudo-code graphs, capturing both low-level execution details and high-level structural information. We further propose multi-modality heterogeneous graph learning with heterogeneous mutual attention and message passing, which highlights important code blocks and discovers inter-dependencies across different code forms. Additionally, an intent-aware summary generator with an intent-aware attention mechanism is introduced to produce customized summaries tailored to multiple intents. Extensive experiments, including evaluations across various architectures and optimization levels, demonstrate that MiSum outperforms state-of-the-art baselines in BLEU, METEOR, and ROUGE-L metrics. Human evaluations validate its capability to effectively support reverse engineers in understanding diverse binary code intents, marking a significant advancement in binary code analysis. ![]() ![]() ![]() |
|
Chen, Weihao |
![]() Weihao Chen, Jia Lin Cheoh, Manthan Keim, Sabine Brunswicker, and Tianyi Zhang (Purdue University, USA) ![]() |
|
Chen, Weimin |
![]() Kunsong Zhao, Zihao Li, Weimin Chen, Xiapu Luo, Ting Chen, Guozhu Meng, and Yajin Zhou (Hong Kong Polytechnic University, China; University of Electronic Science and Technology of China, China; Institute of Information Engineering at Chinese Academy of Sciences, China; Zhejiang University, China) WebAssembly has become the preferred smart contract format for various blockchain platforms due to its high portability and near-native execution speed. To effectively understand WebAssembly contracts, it is crucial to recover high-level type signatures because of the limited type information that WebAssembly provides. However, existing studies on type inference for smart contracts primarily center around Ethereum Virtual Machine bytecode, which is not applicable to WebAssembly owing to their differing targets and runtime semantics. This paper introduces WasmHint, a novel solution that leverages deep learning inference to automatically recover high-level parameter and return types from WebAssembly contracts. More specifically, WasmHint constructs a wCFG representation to clarify dependencies within WebAssembly code and simulates its execution to capture type-related operational information. By learning comprehensive code semantics, it infers parameter and return types, with a semantic corrector designed to enhance information coordination. We conduct experiments on a newly constructed dataset containing 77,208 WebAssembly contract functions. The results demonstrate that WasmHint achieves inference accuracies of 80.0% for parameter types and 95.8% for return types, with average improvements of 86.6% and 34.0% over the baseline methods, respectively. ![]() |
|
Chen, Xiao |
![]() Xiao Chen, Hengcheng Zhu, Jialun Cao, Ming Wen, and Shing-Chi Cheung (Hong Kong University of Science and Technology, China; Huazhong University of Science and Technology, China) Debugging can be much facilitated if one can identify the evolution commit that introduced the bug leading to a detected failure (aka. bug-inducing commit, BIC). Although one may, in theory, locate BICs by executing the detected failing test on various historical commit versions, it is impractical when the test cannot be executed on some of those versions. On the other hand, existing static techniques often assume the availability of additional information such as patches and bug reports, or the applicability of predefined heuristics like commit chronology. However, these approaches are ineffective when such assumptions do not hold, which are often the case in practice. To address these limitations, we propose SEMBIC to identify the BIC of a bug by statically tracking the semantic changes in the execution path prescribed by the failing test across successive historical commit versions. Our insight is that the greater the semantic changes a commit introduces concerning the failing execution path of a target bug, the more likely it is to be the BIC. To distill semantic changes relevant to the failure, we focus on three fine-grained semantic properties. We evaluate the performance of SEMBIC on a benchmark containing 199 real-world bugs from 12 open-source projects. We found that SEMBIC can identify BICs with high accuracy – it ranks the BIC as top 1 for 88 out of 199 bugs, and achieves an MRR of 0.520, outperforming the state-of-the-art technique by 29.4% and 13.6%, respectively. ![]() ![]() |
|
Chen, Xiaolei |
![]() Jia Chen, Yuang He, Peng Wang, Xiaolei Chen, Jie Shi, and Wei Wang (Fudan University, China) ![]() |
|
Chen, Yuchen |
![]() Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Zhenyu Chen, Baowen Xu, and Yang Liu (Nanyang Technological University, Singapore; Nanjing University, China) Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. It poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could jeopardize life safety. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper, we innovatively propose a backdoor elimination technique for secure code understanding, called EliBadCode. EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of non-backdoor (adversarial) perturbations for subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. Backdoor triggers can be viewed as backdoor (adversarial) perturbations. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode substantially decreases the average Attack Success Rate (ASR) of the advanced backdoor attack from 99.76% to 2.64%, significantly outperforming the three baselines. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline). ![]() |
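The vocabulary-filtering step mentioned above can be illustrated with a small sketch: restrict the trigger search space to tokens that could plausibly appear as identifiers under a language's naming conventions. The convention check used here (valid Python identifier, not a keyword) is an assumed example, not EliBadCode's exact filter.

```python
# Hedged sketch: shrink the trigger search space by naming conventions.
import keyword

def plausible_trigger_tokens(vocabulary):
    return [
        tok for tok in vocabulary
        if tok.isidentifier() and not keyword.iskeyword(tok)
    ]

if __name__ == "__main__":
    vocab = ["tmp_buf", "for", "==", "init_state", "##ly", "0xdead", "load_cfg"]
    print(plausible_trigger_tokens(vocab))  # ['tmp_buf', 'init_state', 'load_cfg']
```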
|
Chen, Yujia |
![]() Yujia Chen, Yang Ye, Zhongqi Li, Yuchi Ma, and Cuiyun Gao (Harbin Institute of Technology, Shenzhen, China; Huawei Cloud Computing Technologies, China) Large code models (LCMs) have remarkably advanced the field of code generation. Despite their impressive capabilities, they still face practical deployment issues, such as high inference costs, limited accessibility of proprietary LCMs, and adaptability issues of ultra-large LCMs. These issues highlight the critical need for more accessible, lightweight yet effective LCMs. Knowledge distillation (KD) offers a promising solution, which transfers the programming capabilities of larger, advanced LCMs (Teacher) to smaller, less powerful LCMs (Student). However, existing KD methods for code intelligence often lack consideration of fault domain knowledge and rely on static seed knowledge, leading to degraded programming capabilities of student models. In this paper, we propose a novel Self-Paced knOwledge DistillAtion framework, named SODA, aiming at developing lightweight yet effective student LCMs via adaptively transferring the programming capabilities from advanced teacher LCMs. SODA consists of three stages in one cycle: (1) Correct-and-Fault Knowledge Delivery stage aims at improving the student model’s capability to recognize errors while ensuring its basic programming skill during the knowledge transferring, which involves correctness-aware supervised learning and fault-aware contrastive learning methods. (2) Multi-View Feedback stage aims at measuring the quality of results generated by the student model from two views, including model-based and static tool-based measurement, for identifying the difficult questions; (3) Feedback-based Knowledge Update stage aims at updating the student model adaptively by generating new questions at different difficulty levels, in which the difficulty levels are categorized based on the feedback in the second stage. By performing the distillation process iteratively, the student model is continuously refined through learning more advanced programming skills from the teacher model. We compare SODA with four state-of-the-art KD approaches on three widely-used code generation benchmarks with different programming languages. Experimental results show that SODA improves the student model by 65.96% in terms of average Pass@1, outperforming the best baseline PERsD by 29.85%. Based on the proposed SODA framework, we develop SodaCoder, a series of lightweight yet effective LCMs with ∼7B parameters, which outperform 15 LCMs with less than or equal to 16B parameters. Notably, SodaCoder-DS-6.7B, built on DeepseekCoder-6.7B, even surpasses the prominent ChatGPT on average Pass@1 across seven programming languages (66.4 vs. 61.3). ![]() |
|
Chen, Yujing |
![]() Zuobin Wang, Zhiyuan Wan, Yujing Chen, Yun Zhang, David Lo, Difan Xie, and Xiaohu Yang (Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang), China; Hangzhou City University, China; Singapore Management University, Singapore) In smart contract development, practitioners frequently reuse code to reduce development effort and avoid reinventing the wheel. This reused code, whether identical or similar to its original source, is referred to as a code clone. Unintentional code cloning can propagate flaws and vulnerabilities, potentially undermining the reliability and maintainability of software systems. Previous studies have identified a significant prevalence of code clones in Solidity smart contracts on the Ethereum blockchain. To mitigate the risks posed by code clones, clone detection has emerged as an active field of research and practice in software engineering. Recent studies have extended existing techniques or proposed novel techniques tailored to the unique syntactic and semantic features of Solidity. Nonetheless, the evaluations of existing techniques, whether conducted by their original authors or independent researchers, involve codebases in various programming languages and utilize different versions of the corresponding tools. The resulting inconsistency makes direct comparisons of the evaluation results impractical, and hinders the ability to derive meaningful conclusions across the evaluations. There remains a lack of clarity regarding the effectiveness of these techniques in detecting smart contract clones, and whether it is feasible to combine different techniques to achieve scalable yet accurate detection of code clones in smart contracts. To address this gap, we conduct a comprehensive empirical study that evaluates the effectiveness and scalability of five representative clone detection techniques on 33,073 verified Solidity smart contracts, along with a benchmark we curate, in which we manually label 72,010 pairs of Solidity smart contracts with clone tags. Moreover, we explore the potential of combining different techniques to achieve optimal performance of code clone detection for smart contracts, and propose SourceREClone, a framework designed for the refined integration of different techniques, which achieves a 36.9% improvement in F1 score compared to a straightforward combination of the state of the art. Based on our findings, we discuss implications, provide recommendations for practitioners, and outline directions for future research. ![]() |
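As a toy illustration (not SourceREClone or any of the evaluated detectors) of the kind of lightweight signal token-based clone detectors combine, the sketch below computes the Jaccard overlap of token 3-grams between two Solidity-like snippets; the tokenization and any threshold one might apply are assumptions for the example only.

```python
# Hedged sketch: token-shingle Jaccard similarity as a near-clone signal.
import re

def shingles(code, n=3):
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + n]) for i in range(max(0, len(tokens) - n + 1))}

def clone_similarity(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

if __name__ == "__main__":
    f1 = "function transfer(address to, uint256 value) public returns (bool) { balances[msg.sender] -= value; }"
    f2 = "function transfer(address recipient, uint256 amount) public returns (bool) { balances[msg.sender] -= amount; }"
    print(round(clone_similarity(f1, f2), 2))   # high overlap suggests a (near-)clone pair
```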
|
Chen, Yuntianyi |
![]() Yuntianyi Chen, Yuqi Huai, Yirui He, Shilong Li, Changnam Hong, Qi Alfred Chen, and Joshua Garcia (University of California at Irvine, USA) As autonomous driving systems (ADSes) become increasingly complex and integral to daily life, the importance of understanding the nature and mitigation of software bugs in these systems has grown correspondingly. Addressing the challenges of software maintenance in autonomous driving systems (e.g., handling real-time system decisions and ensuring safety-critical reliability) is crucial due to the unique combination of real-time decision-making requirements and the high stakes of operational failures in ADSes. The potential of automated tools in this domain is promising, yet there remains a gap in our comprehension of the challenges faced and the strategies employed during manual debugging and repair of such systems. In this paper, we present an empirical study that investigates bug-fix patterns in ADSes, with the aim of improving reliability and safety. We have analyzed the commit histories and bug reports of two major autonomous driving projects, Apollo and Autoware, from 1,331 bug fixes with the study of bug symptoms, root causes, and bug-fix patterns. Our study reveals several dominant bug-fix patterns, including those related to path planning, data flow, and configuration management. Additionally, we find that the frequency distribution of bug-fix patterns varies significantly depending on their nature and types and that certain categories of bugs are recurrent and more challenging to exterminate. Based on our findings, we propose a hierarchy of ADS bugs and two taxonomies of 15 syntactic bug-fix patterns and 27 semantic bug-fix patterns that offer guidance for bug identification and resolution. We also contribute a benchmark of 1,331 ADS bug-fix instances. ![]() ![]() |
|
Chen, Zhenbang |
![]() Xu Yang, Zhenbang Chen, Wei Dong, and Ji Wang (National University of Defense Technology, China) Floating-point constraint solving is challenging due to the complex representation and non-linear computations. Search-based constraint solving provides an effective method for solving floating-point constraints. In this paper, we propose QSF to improve the efficiency of search-based solving for floating-point constraints. The key idea of QSF is to model the floating-point constraint solving problem as a multi-objective optimization problem. Specifically, QSF considers both the number of unsatisfied constraints and the sum of the violation degrees of unsatisfied constraints as the objectives for search-based optimization. Besides, we propose a new evolutionary algorithm in which the mutation operators are specially designed for floating-point numbers, aiming to solve the multi-objective problem more efficiently. We have implemented QSF and conducted extensive experiments on both the SMT-COMP benchmark and the benchmark from real-world floating-point programs. The results demonstrate that compared to SOTA floating-point solvers, QSF achieved an average speedup of 15.72X under a 60-second timeout and an impressive 87.48X under a 600-second timeout on the first benchmark. Similarly, on the second benchmark, QSF delivered an average speedup of 22.44X and 29.23X, respectively, under the two timeout configurations. Furthermore, QSF has also enhanced the performance of symbolic execution for floating-point programs. ![]() |
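The two search objectives described above can be made concrete with a toy conjunction of floating-point constraints, plus a bit-level mutation operator of the kind a floating-point-aware evolutionary search might use. The constraint encoding and the mutation are simplified stand-ins, not QSF's actual operators.

```python
# Hedged sketch: (unsatisfied count, total violation degree) objectives and a bit-flip mutation.
import random
import struct

# Each constraint returns a violation degree: 0.0 means satisfied.
CONSTRAINTS = [
    lambda x, y: max(0.0, 1.0 - (x * x + y * y)),   # x^2 + y^2 >= 1
    lambda x, y: max(0.0, x - y),                    # x <= y
    lambda x, y: abs(x * y - 2.0),                   # x * y == 2
]

def objectives(x, y):
    degrees = [c(x, y) for c in CONSTRAINTS]
    unsatisfied = sum(d > 0.0 for d in degrees)      # objective 1: unsatisfied constraints
    violation = sum(degrees)                          # objective 2: summed violation degrees
    return unsatisfied, violation

def mutate_float(v):
    """Flip one random bit of the IEEE-754 double representation."""
    bits = struct.unpack("<Q", struct.pack("<d", v))[0]
    bits ^= 1 << random.randrange(64)
    out = struct.unpack("<d", struct.pack("<Q", bits))[0]
    return v if out != out else out                  # keep the old value if the flip yields NaN

if __name__ == "__main__":
    random.seed(1)
    print(objectives(0.5, 0.5))      # both objectives nonzero at the start
    print(objectives(1.0, 2.0))      # (0, 0.0): a satisfying assignment
    print(mutate_float(0.5))         # a bit-level neighbour of 0.5
```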
|
Chen, Zhenpeng |
![]() Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, and Yang Liu (Nanyang Technological University, Singapore; Peking University, China; King's College London, UK; Fudan University, China) Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods. ![]() |
|
Chen, Zhenyu |
![]() Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Zhenyu Chen, Baowen Xu, and Yang Liu (Nanyang Technological University, Singapore; Nanjing University, China) Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. It poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could jeopardize life safety. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper, we innovatively propose a backdoor elimination technique for secure code understanding, called EliBadCode. EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of non-backdoor (adversarial) perturbations for subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. Backdoor triggers can be viewed as backdoor (adversarial) perturbations. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode substantially decreases the average Attack Success Rate (ASR) of the advanced backdoor attack from 99.76% to 2.64%, significantly outperforming the three baselines. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline). ![]() |
|
Chen, Zhixiang |
![]() Anji Li, Neng Zhang, Ying Zou, Zhixiang Chen, Jian Wang, and Zibin Zheng (Sun Yat-sen University, China; Central China Normal University, China; Queen's University, Canada; Wuhan University, China) Code snippets are widely used in technical forums to demonstrate solutions to programming problems. They can be leveraged by developers to accelerate problem-solving. However, code snippets often lack concrete types of the APIs used in them, which impedes their understanding and reuse. To enhance the description of a code snippet, a number of approaches have been proposed to infer the types of APIs. Although existing approaches can achieve good performance, their performance is limited because they ignore information outside the input code snippet (e.g., the descriptions of similar code snippets) that could potentially improve it. In this paper, we propose a novel type inference approach, named CKTyper, by leveraging crowdsourcing knowledge in technical posts. The key idea is to generate a relevant context for a target code snippet from the posts containing similar code snippets and then employ the context to promote the type inference with large language models (e.g., ChatGPT). More specifically, we build a crowdsourcing knowledge base (CKB) by extracting code snippets from a large set of posts and index the CKB using Lucene. An API type dictionary is also built from a set of API libraries. Given a code snippet to be inferred, we first retrieve a list of similar code snippets from the indexed CKB. Then, we generate a crowdsourcing knowledge context (CKC) by extracting and summarizing useful content (e.g., API-related sentences) in the posts that contain the similar code snippets. The CKC is subsequently used to improve the type inference of ChatGPT on the input code snippet. The hallucination of ChatGPT is eliminated by employing the API type dictionary. Evaluation results on two open-source datasets demonstrate the effectiveness and efficiency of CKTyper. CKTyper achieves the optimal precision/recall of 97.80% and 95.54% on the two datasets, respectively, significantly outperforming three state-of-the-art baselines and ChatGPT. ![]() |
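A minimal sketch of the retrieval-then-context idea described above is given below, assuming a toy Jaccard similarity instead of CKTyper's Lucene index; the post structure and field names are hypothetical.

```python
# Illustrative sketch (not CKTyper's implementation): retrieve posts whose code
# snippets are most similar to a query snippet, then concatenate their
# API-related sentences into a context for an LLM prompt.

def tokens(code):
    return set(code.replace("(", " ").replace(")", " ").replace(".", " ").split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def build_context(query_snippet, posts, top_k=2):
    """posts: list of dicts with hypothetical keys 'snippet' and 'api_sentences'."""
    ranked = sorted(posts,
                    key=lambda p: jaccard(tokens(query_snippet), tokens(p["snippet"])),
                    reverse=True)
    context_lines = []
    for post in ranked[:top_k]:
        context_lines.extend(post["api_sentences"])
    return "\n".join(context_lines)

posts = [
    {"snippet": "reader = new BufferedReader(new FileReader(path))",
     "api_sentences": ["BufferedReader here is java.io.BufferedReader."]},
    {"snippet": "list.stream().map(String::trim).collect(Collectors.toList())",
     "api_sentences": ["Collectors refers to java.util.stream.Collectors."]},
]
print(build_context("BufferedReader reader = new BufferedReader(new FileReader(f))", posts))
```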
|
Chen, Zhuangbin |
![]() Junjie Huang, Zhihan Jiang, Zhuangbin Chen, and Michael Lyu (Chinese University of Hong Kong, China; Sun Yat-sen University, China) ![]() |
|
Cheng, Mingfei |
![]() Junming Cao, Xuwen Xiang, Mingfei Cheng, Bihuan Chen, Xinyan Wang, You Lu, Chaofeng Sha, Xiaofei Xie, and Xin Peng (Fudan University, China; Singapore Management University, Singapore) ![]() |
|
Cheng, Xiao |
![]() Haodong Li, Xiao Cheng, Guohan Zhang, Guosheng Xu, Guoai Xu, and Haoyu Wang (Beijing University of Posts and Telecommunications, China; UNSW, Australia; Harbin Institute of Technology, Shenzhen, China; Huazhong University of Science and Technology, China) Learning-based Android malware detection has earned significant recognition across industry and academia, yet its effectiveness hinges on the accuracy of labeled training data. Manual labeling, being prohibitively expensive, has prompted the use of automated methods, such as leveraging anti-virus engines like VirusTotal, which unfortunately introduces mislabeling, aka "label noise". The state-of-the-art label noise reduction approach, MalWhiteout, can effectively reduce random label noise but underperforms in mitigating real-world emergent malware (EM) label noise stemming from newly emerging Android malware variants overlooked by VirusTotal. To tackle this, we conceptualize EM label noise detection as an anomaly detection problem and introduce a novel tool, MalCleanse, that surpasses MalWhiteout's ability to address EM label noise. MalCleanse combines uncertainty estimation with unsupervised anomaly detection, identifying samples with high uncertainty as mislabeled, thereby enhancing its capability to remove EM label noise. Our experimental results demonstrate a significant reduction in EM label noise by approximately 25.25%, achieving an F1 Score of 80.32% for label noise detection at a noise ratio of 40.61%. Notably, MalCleanse outperforms MalWhiteout with an increase of 40.9% in overall F1 score for mitigating EM label noise. This paper pioneers the integration of deep neural network model uncertainty to refine label accuracy, thereby enhancing the reliability of malware detection systems. Our approach represents a significant step forward in addressing the challenges posed by emergent malware in automated labeling systems. ![]() |
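To make the uncertainty-as-anomaly idea above concrete, here is a minimal sketch that flags samples whose prediction uncertainty is unusually high as potentially mislabeled; the median-plus-MAD threshold and the scores are illustrative assumptions, not MalCleanse's detector.

```python
import statistics

# Simplified illustration: treat samples whose prediction uncertainty is
# anomalously high as likely mislabeled. The thresholding rule (median + 3*MAD)
# is an assumption of this sketch.

def flag_suspect_labels(uncertainties, k=3.0):
    med = statistics.median(uncertainties)
    mad = statistics.median(abs(u - med) for u in uncertainties) or 1e-9
    return [i for i, u in enumerate(uncertainties) if u > med + k * mad]

# Hypothetical per-sample uncertainty scores (e.g., predictive entropy).
scores = [0.05, 0.07, 0.06, 0.81, 0.04, 0.09, 0.77, 0.06]
print(flag_suspect_labels(scores))   # [3, 6] -> indices of likely mislabeled samples
```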
|
Cheoh, Jia Lin |
![]() Weihao Chen, Jia Lin Cheoh, Manthan Keim, Sabine Brunswicker, and Tianyi Zhang (Purdue University, USA) ![]() |
|
Cheung, Shing-Chi |
![]() Weiqi Lu, Yongqiang Tian, Xiaohan Zhong, Haoyang Ma, Zhenyang Xu, Shing-Chi Cheung, and Chengnian Sun (Hong Kong University of Science and Technology, China; University of Waterloo, Canada) Data visualization (DataViz) libraries play a crucial role in presentation, data analysis, and application development, underscoring the importance of their accuracy in transforming data into visual representations. Incorrect visualizations can adversely impact user experience, distort information conveyance, and influence user perception and decision-making processes. Visual bugs in these libraries can be particularly insidious as they may not cause obvious errors like crashes, but instead mislead users of the underlying data graphically, resulting in wrong decision making. Consequently, a good understanding of the unique characteristics of bugs in DataViz libraries is essential for researchers and developers to detect and fix bugs in DataViz libraries. This study presents the first comprehensive analysis of bugs in DataViz libraries, examining 564 bugs collected from five widely-used libraries. Our study systematically analyzes their symptoms and root causes, and provides a detailed taxonomy. We found that incorrect/inaccurate plots are pervasive in DataViz libraries and incorrect graphic computation is the major root cause, which necessitates further automated testing methods for DataViz libraries. Moreover, we identified eight key steps to trigger such bugs and two test oracles specific to DataViz libraries, which may inspire future research in designing effective automated testing techniques. Furthermore, with the recent advancements in Vision Language Models (VLMs), we explored the feasibility of applying these models to detect incorrect/inaccurate plots. The results show that the effectiveness of VLMs in bug detection varies from 29% to 57%, depending on the prompts, and adding more information in prompts does not necessarily increase the effectiveness. Our findings offer valuable insights into the nature and patterns of bugs in DataViz libraries, providing a foundation for developers and researchers to improve library reliability, and ultimately benefit more accurate and reliable data visualizations across various domains. ![]() ![]() Hengcheng Zhu, Valerio Terragni, Lili Wei, Shing-Chi Cheung, Jiarong Wu, and Yepang Liu (Hong Kong University of Science and Technology, China; University of Auckland, New Zealand; McGill University, Canada; Southern University of Science and Technology, China) Mock assertions provide developers with a powerful means to validate program behaviors that are unobservable to test assertions. Despite their significance, they are rarely considered by automated test generation techniques. Effective generation of mock assertions requires understanding how they are used in practice. Although previous studies highlighted the importance of mock assertions, none provide insight into their usages. To bridge this gap, we conducted the first empirical study on mock assertions, examining their adoption, the characteristics of the verified method invocations, and their effectiveness in fault detection. Our analysis of 4,652 test cases from 11 popular Java projects reveals that mock assertions are mostly applied to validating specific kinds of method calls, such as those interacting with external resources and those reflecting whether a certain code path was traversed in systems under test. 
Additionally, we find that mock assertions complement traditional test assertions by ensuring the desired side effects have been produced, validating control flow logic, and checking internal computation results. Our findings contribute to a better understanding of mock assertion usages and provide a foundation for future related research such as automated test generation that supports mock assertions. ![]() ![]() ![]() Xiao Chen, Hengcheng Zhu, Jialun Cao, Ming Wen, and Shing-Chi Cheung (Hong Kong University of Science and Technology, China; Huazhong University of Science and Technology, China) Debugging can be greatly facilitated if one can identify the evolution commit that introduced the bug leading to a detected failure (a.k.a. bug-inducing commit, BIC). Although one may, in theory, locate BICs by executing the detected failing test on various historical commit versions, it is impractical when the test cannot be executed on some of those versions. On the other hand, existing static techniques often assume the availability of additional information such as patches and bug reports, or the applicability of predefined heuristics like commit chronology. However, these approaches are ineffective when such assumptions do not hold, which is often the case in practice. To address these limitations, we propose SEMBIC to identify the BIC of a bug by statically tracking the semantic changes in the execution path prescribed by the failing test across successive historical commit versions. Our insight is that the greater the semantic changes a commit introduces concerning the failing execution path of a target bug, the more likely it is to be the BIC. To distill semantic changes relevant to the failure, we focus on three fine-grained semantic properties. We evaluate the performance of SEMBIC on a benchmark containing 199 real-world bugs from 12 open-source projects. We found that SEMBIC can identify BICs with high accuracy – it ranks the BIC as top 1 for 88 out of 199 bugs, and achieves an MRR of 0.520, outperforming the state-of-the-art technique by 29.4% and 13.6%, respectively. ![]() ![]() |
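The mock assertion study above (Zhu et al.) targets Java projects and Mockito-style verification; as a loose Python analogue, the sketch below uses unittest.mock to assert that a side effect on an external resource actually occurred, something a plain return-value assertion cannot observe. The system under test and names are hypothetical.

```python
from unittest.mock import Mock

# Analogous Python sketch of a mock assertion: validate an interaction with an
# external resource rather than a return value.

def sync_report(report, uploader):
    """System under test: uploads a report only when it is non-empty."""
    if report:
        uploader.upload("reports/daily.txt", report)
        return True
    return False

def test_sync_report_uploads_once():
    uploader = Mock()
    assert sync_report("cpu=93%", uploader) is True
    # Mock assertion: the desired side effect (one upload call with these
    # arguments) actually happened.
    uploader.upload.assert_called_once_with("reports/daily.txt", "cpu=93%")

def test_sync_report_skips_empty():
    uploader = Mock()
    assert sync_report("", uploader) is False
    uploader.upload.assert_not_called()

test_sync_report_uploads_once()
test_sync_report_skips_empty()
```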
|
Chimalakonda, Sridhar |
![]() Rajrupa Chattaraj and Sridhar Chimalakonda (IIT Tirupati, India) ![]() |
|
Cho, Steven |
![]() Lam Nguyen Tung, Steven Cho, Xiaoning Du, Neelofar Neelofar, Valerio Terragni, Stefano Ruberto, and Aldeida Aleti (Monash University, Australia; University of Auckland, New Zealand; Royal Melbourne Institute of Technology, Australia; Joint Research Centre at the European Commission, Italy) ![]() |
|
Chowdhury, Omar |
![]() Yuteng Sun, Joyanta Debnath, Wenzheng Hong, Omar Chowdhury, and Sze Yiu Chau (Chinese University of Hong Kong, Hong Kong; Stony Brook University, USA; Independent, China) The X.509 Public Key Infrastructure (PKI) provides a cryptographically verifiable mechanism for authenticating a binding of an entity’s public-key with its identity, presented in a tamper-proof digital certificate. This often serves as a foundational building block for achieving different security guarantees in many critical applications and protocols (e.g., SSL/TLS). Identities in the context of X.509 PKI are often captured as names, which are encoded in certificates as composite records with different optional fields that can store various types of string values (e.g., ASCII, UTF8). Although such flexibility allows for the support of diverse name types (e.g., IP addresses, DNS names) and application scenarios, it imposes on library developers obligations to enforce unnecessarily convoluted requirements. Bugs in enforcing these requirements can lead to potential interoperability and performance issues, and might open doors to impersonation attacks. This paper focuses on analyzing how open-source libraries enforce the constraints regarding the formatting, encoding, and processing of complex name structures on X.509 certificate chains, for the purpose of establishing identities. Our analysis reveals that a portfolio of simplistic testing approaches can expose blatant violations of the requirements in widely used open-source libraries. Although there is a substantial amount of prior work that focused on testing the overall certificate chain validation process of X.509 libraries, the identified violations have escaped their scrutiny. To make matters worse, we observe that almost all the analyzed libraries completely ignore certain pre-processing steps prescribed by the standard. This begs the question of whether it is beneficial to have a standard so flexible but complex that none of the implementations can faithfully adhere to it. With our test results, we argue in the negative, and explain how simpler alternatives (e.g., other forms of identifiers such as Authority and Subject Key Identifiers) can be used to enforce similar constraints with no loss of security. ![]() |
|
Chu, Tianyao |
![]() Siwei Tan, Liqiang Lu, Debin Xiang, Tianyao Chu, Congliang Lang, Jintao Chen, Xing Hu, and Jianwei Yin (Zhejiang University, China) Quantum programs provide exponential speedups compared to classical programs in certain areas, but they also inevitably encounter logical faults. Automatically repairing quantum programs is much more challenging than repairing classical programs due to the non-replicability of data, the vast search space of program inputs, and the new programming paradigm. Existing works based on semantic-based or learning-based program repair techniques are fundamentally limited in repairing efficiency and effectiveness. In this work, we propose HornBro, an efficient framework for automated quantum program repair. The key insight of HornBro lies in the homotopy-like method, which iteratively switches between the classical and quantum parts. This approach allows the repair tasks to be efficiently offloaded to the most suitable platforms, enabling a progressive convergence toward the correct program. We start by designing an implication assertion pragma to enable rigorous specifications of quantum program behavior, which helps to generate a quantum test suite automatically. This suite leverages the orthonormal bases of quantum programs to accommodate different encoding schemes. Given a fixed number of test cases, it allows the maximum input coverage of potential counter-example candidates. Then, we develop a Clifford approximation method with an SMT-based search to transform the patch localization program into a symbolic reasoning problem. Finally, we offload the computationally intensive repair of gate parameters to quantum hardware by leveraging the differentiability of quantum gates. Experiments suggest that HornBro increases the repair success rate by more than 62.5% compared to the existing repair techniques, supporting more types of quantum bugs. It also achieves 35.7× speedup in the repair and 99.9% gate reduction of the patch. ![]() ![]() |
|
Costiou, Steven |
![]() Valentin Bourcier, Pooja Rani, Maximilian Ignacio Willembrinck Santander, Alberto Bacchelli, and Steven Costiou (University of Lille - Inria - CNRS - Centrale Lille - UMR 9189 CRIStAL, France; University of Zurich, Switzerland) Debugging consists in understanding the behavior of a program to identify and correct its defects. Breakpoints are the most commonly used debugging tool and aim to facilitate the debugging process by allowing developers to interrupt a program’s execution at a source code location of their choice and inspect the state of the program. Researchers suggest that in systems developed using object-oriented programming (OOP), traditional breakpoints may not be an effective method for debugging. In OOP, developers create code in classes, which at runtime are instantiated as objects—entities with their own state and behavior that can interact with one another. Traditional breakpoints are set within the class code, halting execution for every object that shares that class’s code. This leads to unnecessary interruptions for developers who are focused on monitoring the behavior of a specific object. As an answer to this challenge, researchers proposed object-centric debugging, an approach based on debugging tools that focus on objects rather than classes. In particular, using object-centric breakpoints, developers can select specific objects (rather than classes) for which the execution must be interrupted. Even though it seems reasonable that this approach may ease the debugging process by reducing the time and actions needed for debugging objects, no research has yet verified its actual impact. To investigate the impact of object-centric breakpoints on the debugging process, we devised and conducted a controlled experiment with 81 developers who spent an average of 1 hour and 30 minutes each on the study. The experiment required participants to complete two debugging tasks using debugging tools with vs. without object-centric breakpoints. We found no significant effect from the use of object-centric breakpoints on the number of actions required to debug or the effectiveness in understanding or fixing the bug. However, for one of the two tasks, we measured a statistically significant reduction in debugging time for participants who used object-centric breakpoints, while for the other task, there was a statistically significant increase. Our analysis suggests that the impact of object-centric breakpoints varies depending on the context and the specific nature of the bug being addressed. In particular, our analysis indicates that object-centric breakpoints can speed up the process of locating the root cause of a bug when the bug can be replicated without needing to restart the program. We discuss the implications of these findings for debugging practices and future research. ![]() ![]() |
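The study above evaluates object-centric breakpoints in a dedicated debugger; purely as a conceptual analogy, the Python sketch below interrupts execution only for one watched object rather than for every instance of a class. The helper names are invented for illustration.

```python
import pdb

# Rough Python analogy of an object-centric breakpoint: halt only when a
# specific, watched object executes the instrumented method.

_WATCHED_IDS = set()

def watch(obj):
    """Mark a single object so that object_centric_break() halts only for it."""
    _WATCHED_IDS.add(id(obj))

def object_centric_break(obj):
    if id(obj) in _WATCHED_IDS:
        pdb.set_trace()   # halts only when `obj` is a watched instance

class Account:
    def __init__(self, owner):
        self.owner = owner
        self.balance = 0

    def deposit(self, amount):
        object_centric_break(self)   # a class-level breakpoint would stop for every Account
        self.balance += amount

a, b = Account("alice"), Account("bob")
# watch(b)            # uncomment to halt only when *b* executes deposit()
a.deposit(10)          # runs without interruption
b.deposit(5)
```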
|
Dachselt, Raimund |
![]() Max Weber, Alina Mailach, Sven Apel, Janet Siegmund, Raimund Dachselt, and Norbert Siegmund (Leipzig University, Germany; ScaDS.AI Dresden-Leipzig, Germany; Saarland Informatics Campus, Germany; Saarland University, Germany; Chemnitz University of Technology, Germany; Dresden University of Technology, Germany) Debugging performance bugs in configurable software systems is a complex and time-consuming task that requires not only fixing a bug, but also understanding its root cause. While there is a vast body of literature on debugging strategies, there is no consensus on general debugging. This makes it difficult to provide concrete guidance for developers, especially for configuration-dependent performance bugs. The goal of our work is to alleviate this situation by providing a framework to describe debugging strategies in a more general, unifying way. We conducted a user study with 12 professional developers who debugged a performance bug in a real-world configurable system. To observe developers in an unobtrusive way, we provided an immersive virtual reality tool, SoftVR, giving them a large degree of freedom to choose their preferred debugging strategy. The results show that the existing documentation of strategies is too coarse-grained and intermixed to identify successful approaches. In a subsequent qualitative analysis, we devised a coding framework to reason about debugging approaches. With this framework, we identified five goal-oriented episodes that developers employ, which they also confirmed in subsequent interviews. Our work provides a unified description of debugging strategies, giving researchers a common foundation for studying debugging, and giving practitioners and teachers guidance on successful debugging strategies. ![]() |
|
Dalia, Gregorio |
![]() Gregorio Dalia, Andrea Di Sorbo, Corrado Aaron Visaggio, and Gerardo Canfora (University of Sannio, Italy) ![]() |
|
Daneshyan, Farbod |
![]() Farbod Daneshyan, Runzhi He, Jianyu Wu, and Minghui Zhou (Peking University, China) The release note is a crucial document outlining changes in new software versions. It plays a key role in helping stakeholders recognise important changes and understand the implications behind them. Despite this fact, many developers view the process of writing software release notes as a tedious and dreadful task. Consequently, numerous tools (e.g., DeepRelease and Conventional Changelog) have been developed by researchers and practitioners to automate the generation of software release notes. However, these tools fail to consider project domain and target audience for personalisation, limiting their relevance and conciseness. Additionally, they suffer from limited applicability, often necessitating significant workflow adjustments and adoption efforts, hindering practical use and stressing developers. Despite recent advancements in natural language processing and the proven capabilities of large language models (LLMs) in various code and text-related tasks, there are no existing studies investigating the integration and utilisation of LLMs in automated release note generation. Therefore, we propose SmartNote, a novel and widely applicable release note generation approach that produces high-quality, contextually personalised release notes by leveraging LLM capabilities to aggregate, describe, and summarise changes based on code, commit, and pull request details. It categorises and scores (for significance) commits to generate structured and concise release notes of prioritised changes. We conduct human and automatic evaluations that reveal SmartNote outperforms or achieves comparable performance to DeepRelease (state-of-the-art), Conventional Changelog (off-the-shelf), and the projects' original release note across four quality metrics: completeness, clarity, conciseness, and organisation. In both evaluations, SmartNote ranked first for completeness and organisation, while clarity ranked first in the human evaluation. Furthermore, our controlled study reveals the significance of contextual awareness, while our applicability analysis confirms SmartNote's effectiveness across diverse projects. ![]() ![]() |
|
David, Cristina |
![]() Yoav Alon and Cristina David (University of Bristol, UK) Large Language Models (LLMs) have been shown to struggle with long-term planning, which may be caused by the limited way in which they explore the space of possible solutions. We propose an architecture where a Reinforcement Learning (RL) Agent guides an LLM's space exploration: (1) the Agent has access to domain-specific information, and can therefore make decisions about the quality of candidate solutions based on specific and relevant metrics, which were not explicitly considered by the LLM's training objective; (2) the LLM can focus on generating immediate next steps, without the need for long-term planning. We allow non-linear reasoning by exploring alternative paths and backtracking. We evaluate this architecture on the program equivalence task, and compare it against Chain of Thought (CoT) and Tree of Thoughts (ToT). We assess both the downstream task, i.e., the binary classification, and the intermediate reasoning steps. Our approach compares favorably against CoT and ToT. ![]() |
|
Davis, Matthew |
![]() Matthew Davis, Amy Wei, Brad Myers, and Joshua Sunshine (Carnegie Mellon University, USA; University of Michigan, USA) Software testing is difficult, tedious, and may consume 28%–50% of software engineering labor. Automatic test generators aim to ease this burden but have important trade-offs. Fuzzers use an implicit oracle that can detect obviously invalid results, but the oracle problem has no general solution, and an implicit oracle cannot automatically evaluate correctness. Test suite generators like EvoSuite use the program under test as the oracle and therefore cannot evaluate correctness. Property-based testing tools evaluate correctness, but users have difficulty coming up with properties to test and understanding whether their properties are correct. Consequently, practitioners create many test suites manually and often use an example-based oracle to tediously specify correct input and output examples. To help bridge the gaps among various oracle and tool types, we present the Composite Oracle, which organizes various oracle types into a hierarchy and renders a single test result per example execution. To understand the Composite Oracle’s practical properties, we built TerzoN, a test suite generator that includes a particular instantiation of the Composite Oracle. TerzoN displays all the test results in an integrated view composed from the results of three types of oracles and finds some types of test assertion inconsistencies that might otherwise lead to misleading test results. We evaluated TerzoN in a randomized controlled trial with 14 professional software engineers with a popular industry tool, fast-check, as the control. Participants using TerzoN elicited 72% more bugs (p < 0.01), accurately described more than twice the number of bugs (p < 0.01) and tested 16% more quickly (p < 0.05) relative to fast-check. ![]() ![]() |
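A minimal sketch of the Composite Oracle idea is shown below, assuming it can be approximated by consulting an implicit oracle, an example-based oracle, and property oracles in turn and reporting a single verdict; TerzoN's actual hierarchy and result view are richer than this.

```python
# Minimal sketch: combine an implicit oracle (no crash), an example-based
# oracle, and property-based oracles into one verdict.

def composite_oracle(func, example_input, expected_output, properties):
    # Implicit oracle: the function should not raise on the example input.
    try:
        actual = func(example_input)
    except Exception as exc:
        return f"FAIL (implicit oracle): raised {exc!r}"
    # Example-based oracle: compare against a developer-specified expected value.
    if actual != expected_output:
        return f"FAIL (example oracle): expected {expected_output!r}, got {actual!r}"
    # Property-based oracles: each property takes (input, output) and returns bool.
    for name, prop in properties.items():
        if not prop(example_input, actual):
            return f"FAIL (property oracle): {name}"
    return "PASS"

def normalize(xs):
    total = sum(xs)
    return [x / total for x in xs]

verdict = composite_oracle(
    normalize,
    example_input=[1, 1, 2],
    expected_output=[0.25, 0.25, 0.5],
    properties={"sums to 1": lambda _inp, out: abs(sum(out) - 1.0) < 1e-9},
)
print(verdict)   # PASS
```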
|
Deatc, Sebastian |
![]() Ali Al-Kaswan, Sebastian Deatc, Begüm Koç, Arie van Deursen, and Maliheh Izadi (Delft University of Technology, Netherlands) ![]() |
|
Debnath, Joyanta |
![]() Yuteng Sun, Joyanta Debnath, Wenzheng Hong, Omar Chowdhury, and Sze Yiu Chau (Chinese University of Hong Kong, Hong Kong; Stony Brook University, USA; Independent, China) The X.509 Public Key Infrastructure (PKI) provides a cryptographically verifiable mechanism for authenticating a binding of an entity’s public-key with its identity, presented in a tamper-proof digital certificate. This often serves as a foundational building block for achieving different security guarantees in many critical applications and protocols (e.g., SSL/TLS). Identities in the context of X.509 PKI are often captured as names, which are encoded in certificates as composite records with different optional fields that can store various types of string values (e.g., ASCII, UTF8). Although such flexibility allows for the support of diverse name types (e.g., IP addresses, DNS names) and application scenarios, it imposes on library developers obligations to enforce unnecessarily convoluted requirements. Bugs in enforcing these requirements can lead to potential interoperability and performance issues, and might open doors to impersonation attacks. This paper focuses on analyzing how open-source libraries enforce the constraints regarding the formatting, encoding, and processing of complex name structures on X.509 certificate chains, for the purpose of establishing identities. Our analysis reveals that a portfolio of simplistic testing approaches can expose blatant violations of the requirements in widely used open-source libraries. Although there is a substantial amount of prior work that focused on testing the overall certificate chain validation process of X.509 libraries, the identified violations have escaped their scrutiny. To make matters worse, we observe that almost all the analyzed libraries completely ignore certain pre-processing steps prescribed by the standard. This begs the question of whether it is beneficial to have a standard so flexible but complex that none of the implementations can faithfully adhere to it. With our test results, we argue in the negative, and explain how simpler alternatives (e.g., other forms of identifiers such as Authority and Subject Key Identifiers) can be used to enforce similar constraints with no loss of security. ![]() |
|
Demeyer, Serge |
![]() Yuqing Wang, Mika V. Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä (University of Helsinki, Finland; University of Oulu, Finland; University of Antwerp, Belgium) Microservice-based systems (MSS) may fail with various fault types, due to their complex and dynamic nature. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), human efforts from practitioners are still required for further root cause analysis to diagnose specific fault types and analyze failure reasons for detected abnormal traces, particularly when abnormal traces do not stem directly from specific services. In this paper, we propose a novel AIOps framework, TraFaultDia, to automatically classify abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for a MSS. TraFaultDia is trained on several abnormal trace classification tasks with a few labeled instances from a MSS using a meta-learning approach. After training, TraFaultDia can quickly adapt to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. TraFaultDia’s use cases are scalable depending on how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two representative MSS, TrainTicket and OnlineBoutique, with open datasets. In these datasets, each fault category is tied to the faulty system component(s) (service/pod) with a root cause. Our TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. Our results show that, within the MSS it is trained on, TraFaultDia achieves an average accuracy of 93.26% and 85.20% across 50 new, unseen abnormal trace classification tasks for TrainTicket and OnlineBoutique respectively, when provided with 10 labeled instances for each fault category per task in each system. In the cross-system context, when TraFaultDia is applied to a MSS different from the one it is trained on, TraFaultDia gets an average accuracy of 92.19% and 84.77% for the same set of 50 new, unseen abnormal trace classification tasks of the respective systems, also with 10 labeled instances provided for each fault category per task in each system. ![]() |
|
Deng, Yinlin |
![]() Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang (University of Illinois at Urbana-Champaign, USA) Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless – an agentless approach to automatically resolve software development issues. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents at the time of paper submission! Agentless also achieves more than 50% solve rate when using Claude 3.5 Sonnet on the new SWE-bench Verified benchmark. In fact, Agentless has already been adopted by OpenAI as the go-to approach to showcase the real-world coding performance of both GPT-4o and the new o1 models; more recently, Agentless has also been used by DeepSeek to evaluate their newest DeepSeek V3 and R1 models. Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patches or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-𝑆 by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the currently overlooked potential of a simplistic, cost-effective technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction. We have open-sourced Agentless at: https://github.com/OpenAutoCoder/Agentless ![]() |
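A heavily simplified skeleton of the three-phase process (localization, repair, patch validation) might look like the sketch below; the real pipeline is available at https://github.com/OpenAutoCoder/Agentless, and the `ask_llm` and `run_tests` stand-ins here are assumptions the reader must supply.

```python
# Highly simplified skeleton of a localization -> repair -> validation pipeline.

def localize(issue_text, repo_files, ask_llm):
    """Phase 1: ask the model which files are relevant to the issue."""
    listing = "\n".join(repo_files)
    answer = ask_llm(f"Issue:\n{issue_text}\n\nFiles:\n{listing}\n\nList suspicious files.")
    return [f for f in repo_files if f in answer]

def repair(issue_text, file_contents, ask_llm, num_samples=4):
    """Phase 2: sample several candidate patches for the localized code."""
    prompt = f"Issue:\n{issue_text}\n\nCode:\n{file_contents}\n\nReturn a unified diff."
    return [ask_llm(prompt) for _ in range(num_samples)]

def validate(candidate_patches, run_tests):
    """Phase 3: keep the first patch that passes the validation tests."""
    for patch in candidate_patches:
        if run_tests(patch):
            return patch
    return None

def agentless_like_pipeline(issue_text, repo, ask_llm, run_tests):
    files = localize(issue_text, list(repo), ask_llm)
    code = "\n\n".join(repo[f] for f in files)
    return validate(repair(issue_text, code, ask_llm), run_tests)

# Toy demo with a fake model and fake test runner.
repo = {"calc.py": "def add(a, b): return a - b"}
fake_llm = lambda prompt: "calc.py\n--- a/calc.py\n+++ b/calc.py\n- return a - b\n+ return a + b"
fake_tests = lambda patch: "+ return a + b" in patch
print(agentless_like_pipeline("add() subtracts instead of adding", repo, fake_llm, fake_tests))
```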
|
Deng, Yuetang |
![]() Chaozheng Wang, Jia Feng, Shuzheng Gao, Cuiyun Gao, Zongjie Li, Ting Peng, Hailiang Huang, Yuetang Deng, and Michael Lyu (Chinese University of Hong Kong, China; University of Electronic Science and Technology of China, China; Hong Kong University of Science and Technology, China; Tencent, China) Large Code Models (LCMs) have demonstrated remarkable effectiveness across various code intelligence tasks. Supervised fine-tuning is essential to optimize their performance for specific downstream tasks. Compared with the traditional full-parameter fine-tuning (FFT) method, Parameter-Efficient Fine-Tuning (PEFT) methods can train LCMs with substantially reduced resource consumption and have gained widespread attention among researchers and practitioners. While existing studies have explored PEFT methods for code intelligence tasks, they have predominantly focused on a limited subset of scenarios, such as code generation with publicly available datasets, leading to constrained generalizability of the findings. To mitigate the limitation, we conduct a comprehensive study on exploring the effectiveness of the PEFT methods, which involves five code intelligence tasks containing both public and private data. Our extensive experiments reveal a considerable performance gap between PEFT methods and FFT, which is contrary to the findings of existing studies. We also find that this disparity is particularly pronounced in tasks involving private data. To improve the tuning performance for LCMs while reducing resource utilization during training, we propose a Layer-Wise Optimization (LWO) strategy in the paper. LWO incrementally updates the parameters of each layer in the whole model architecture, without introducing any additional component and inference overhead. Experiments across five LCMs and five code intelligence tasks demonstrate that LWO trains LCMs more effectively and efficiently compared to previous PEFT methods, with significant improvements in tasks using private data. For instance, in the line-level code completion task using our private code repositories, LWO outperforms the state-of-the-art LoRA method by 22% and 12% in terms of accuracy and BLEU scores, respectively. Furthermore, LWO can enable more efficient LCM tuning, reducing the training time by an average of 42.7% compared to LoRA. ![]() |
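In the spirit of the layer-wise idea described above, the PyTorch sketch below updates one layer's parameters at a time; the schedule, model, and data are toy assumptions, not the paper's exact LWO procedure.

```python
import torch
import torch.nn as nn

# Sketch of a layer-wise tuning loop: freeze everything, then train one layer
# at a time on a toy model and toy data.

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))

layers_with_params = [m for m in model if any(p.requires_grad for p in m.parameters())]
for layer in layers_with_params:
    # Freeze all parameters, then enable gradients only for the current layer.
    for p in model.parameters():
        p.requires_grad = False
    for p in layer.parameters():
        p.requires_grad = True
    optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)
    for _ in range(5):                      # a few steps per layer
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"tuned {layer.__class__.__name__}, loss={loss.item():.4f}")
```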
|
Devanbu, Premkumar |
![]() Yuvraj Singh Virk, Premkumar Devanbu, and Toufique Ahmed (University of California at Davis, USA; IBM Research, USA) ![]() |
|
Dhaouadi, Mouna |
![]() Mouna Dhaouadi, Bentley Oakes, and Michalis Famelis (Université de Montréal, Canada; Polytechnique Montréal, Canada) ![]() |
|
Dhulipala, Hridya |
![]() Hridya Dhulipala, Aashish Yadavally, Smit Soneshbhai Patel, and Tien N. Nguyen (University of Texas at Dallas, USA) While LLMs excel in understanding source code and descriptive texts for tasks like code generation, code completion, etc., they exhibit weaknesses in predicting dynamic program behavior, such as code coverage and runtime error detection, which typically require program execution. Aiming to advance the capability of LLMs in reasoning and predicting the program behavior at runtime, we present CRISPE (short for Coverage Rationalization and Intelligent Selection ProcedurE), a novel approach for code coverage prediction. CRISPE guides an LLM in simulating program execution via an execution plan based on two key factors: (1) program semantics of each statement type, and (2) the observation of the set of covered statements at the current “execution” step relative to all feasible code coverage options. We formulate code coverage prediction as a process of semantic-guided execution-based planning, where feasible coverage options are utilized to assess whether the LLM's reasoning is heading in the right direction. We enhance the traditional generative task with a retrieval-based framework over the feasible code coverage options. Our experimental results show that CRISPE achieves high accuracy in coverage prediction in terms of both exact-match and statement-match coverage metrics, improving over the baselines. We also show that with semantic guidance and dynamic reasoning from CRISPE, the LLM generates more correct planning steps. To demonstrate CRISPE’s usefulness, we used it in the downstream task of statically detecting runtime error(s) in incomplete code snippets with the given inputs. ![]() ![]() Yi Li, Hridya Dhulipala, Aashish Yadavally, Xiaokai Rong, Shaohua Wang, and Tien N. Nguyen (University of Texas at Dallas, USA; Central University of Finance and Economics, China) Although Large Language Models (LLMs) are highly proficient in understanding source code and descriptive texts, they have limitations in reasoning on dynamic program behaviors, such as execution trace and code coverage prediction, and runtime error prediction, which usually require actual program execution. To advance the ability of LLMs in predicting dynamic behaviors, we leverage the strengths of both approaches, Program Analysis (PA) and LLM, in building PredEx, a predictive executor for Python. Our principle is a blended analysis between PA and LLM to use PA to guide the LLM in predicting execution traces. We break down the task of predictive execution into smaller sub-tasks and leverage the deterministic nature when an execution order can be deterministically decided. When it is not certain, we use predictive backward slicing per variable, i.e., slicing the prior trace to only the parts that affect each variable, which breaks the valuation prediction into significantly simpler problems. Our empirical evaluation on real-world datasets shows that PredEx achieves 31.5–47.1% relatively higher accuracy in predicting full execution traces than the state-of-the-art models. It also produces 8.6–53.7% more correct execution trace prefixes than those baselines. In predicting next executed statements, its relative improvement over the baselines is 15.7–102.3%. Finally, we show PredEx’s usefulness in two tasks: static code coverage analysis and static prediction of run-time errors for (in)complete code. ![]() |
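The per-variable backward slicing described for PredEx can be illustrated with the sketch below, which keeps only the trace statements that can affect one target variable; the (statement, defined variable, used variables) trace encoding is an assumption of this sketch.

```python
# Simplified illustration of per-variable backward slicing over a predicted
# trace prefix: keep only statements that can affect one target variable.

def backward_slice(trace, target_var):
    """trace: list of (statement, defined_var, used_vars) in execution order."""
    relevant = {target_var}
    sliced = []
    for stmt, defined, used in reversed(trace):
        if defined in relevant:
            sliced.append(stmt)
            relevant.discard(defined)
            relevant.update(used)
    return list(reversed(sliced))

trace = [
    ("a = 2",        "a", set()),
    ("b = input()",  "b", set()),
    ("c = a * 3",    "c", {"a"}),
    ("d = b + 1",    "d", {"b"}),
    ("e = c - a",    "e", {"c", "a"}),
]
print(backward_slice(trace, "e"))   # ['a = 2', 'c = a * 3', 'e = c - a']
```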
|
Ding, Bo |
![]() Yiwen Wu, Yang Zhang, Tao Wang, Bo Ding, and Huaimin Wang (National University of Defense Technology, China; State Key Laboratory of Complex and Critical Software Environment, China) Docker building is a critical component of containerization in modern software development, automating the process of packaging and converting sources into container images. It is not uncommon to find that Docker build faults (DBFs) occur frequently across Docker-based projects, inducing non-negligible costs in development activities. DBF resolution is a challenging problem, and previous studies have demonstrated that developers spend non-trivial time in resolving encountered build faults. However, the characteristics of DBFs are still under-investigated, hindering practical solutions to build management in the Docker community. In this paper, to bridge this gap, we present a comprehensive study to understand real-world DBFs in practice. We collect and analyze a DBF dataset of 255 issues and 219 posts from GitHub, Stack Overflow, and Docker Forum. We investigate and construct characteristic taxonomies for the DBFs, including 15 symptoms, 23 root causes, and 35 fix patterns. Moreover, we study the fault distributions of symptoms and root causes, in terms of the different build types, i.e., Dockerfile builds and Docker-compose builds. Based on the results, we provide actionable implications and develop a knowledge-based application, which can potentially facilitate research and assist developers in improving Docker build management. ![]() |
|
Di Sorbo, Andrea |
![]() Gregorio Dalia, Andrea Di Sorbo, Corrado Aaron Visaggio, and Gerardo Canfora (University of Sannio, Italy) ![]() |
|
Dong, Wei |
![]() Xingpei Li, Yan Lei, Zhouyang Jia, Yuanliang Zhang, Haoran Liu, Liqian Chen, Wei Dong, and Shanshan Li (National University of Defense Technology, China; Chongqing University, China) As deep learning (DL) frameworks become widely used, converting models between frameworks is crucial for ecosystem flexibility. However, interestingly, existing model converters commonly focus on syntactic operator API mapping—transpiling operator names and parameters—which results in API compatibility issues (i.e., incompatible parameters, missing operators). These issues arise from semantic inconsistencies due to differences in operator implementation, causing conversion failure or performance degradation. In this paper, we present the first comprehensive study on operator semantic inconsistencies through API mapping analysis and framework source code inspection, revealing that 47% of sampled operators exhibit inconsistencies across DL frameworks, with source code limited to individual layers and no inter-layer interactions. This suggests that layer-wise source code alignment is feasible. Based on this, we propose ModelX, a source-level cross-framework conversion approach that extends operator API mapping by addressing semantic inconsistencies beyond the API level. Experiments on PyTorch-to-Paddle conversion show that ModelX successfully converts 624 out of 686 sampled operators and outperforms two state-of-the-art converters and three popular large language models. Moreover, ModelX achieves minimal metric gaps (avg. all under 3.4%) across 52 models from vision, text, and audio tasks, indicating strong robustness. ![]() ![]() Xu Yang, Zhenbang Chen, Wei Dong, and Ji Wang (National University of Defense Technology, China) Floating-point constraint solving is challenging due to the complex representation and non-linear computations. Search-based constraint solving provides an effective method for solving floating-point constraints. In this paper, we propose QSF to improve the efficiency of search-based solving for floating-point constraints. The key idea of QSF is to model the floating-point constraint solving problem as a multi-objective optimization problem. Specifically, QSF considers both the number of unsatisfied constraints and the sum of the violation degrees of unsatisfied constraints as the objectives for search-based optimization. Besides, we propose a new evolutionary algorithm in which the mutation operators are specially designed for floating-point numbers, aiming to solve the multi-objective problem more efficiently. We have implemented QSF and conducted extensive experiments on both the SMT-COMP benchmark and the benchmark from real-world floating-point programs. The results demonstrate that compared to SOTA floating-point solvers, QSF achieved an average speedup of 15.72X under a 60-second timeout and an impressive 87.48X under a 600-second timeout on the first benchmark. Similarly, on the second benchmark, QSF delivered an average speedup of 22.44X and 29.23X, respectively, under the two timeout configurations. Furthermore, QSF has also enhanced the performance of symbolic execution for floating-point programs. ![]() |
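For the QSF work above, the two search objectives (number of unsatisfied constraints and summed violation degrees) can be sketched as follows; the constraint encoding and violation measure are illustrative assumptions.

```python
# Sketch of the two search objectives: for a candidate assignment, count the
# unsatisfied constraints and sum their violation degrees.

def violation(kind, lhs, rhs):
    """Violation degree of a single constraint lhs <kind> rhs (0.0 means satisfied)."""
    if kind == "<=":
        return max(0.0, lhs - rhs)
    if kind == ">=":
        return max(0.0, rhs - lhs)
    if kind == "==":
        return abs(lhs - rhs)
    raise ValueError(kind)

def objectives(assignment, constraints):
    """constraints: list of (kind, f_lhs, f_rhs) where f_* map assignment -> float."""
    degrees = [violation(kind, f_lhs(assignment), f_rhs(assignment))
               for kind, f_lhs, f_rhs in constraints]
    unsat = sum(1 for d in degrees if d > 0.0)
    return unsat, sum(degrees)          # both are minimized by the search

constraints = [
    ("<=", lambda a: a["x"] * a["x"], lambda a: 2.0),     # x^2 <= 2
    ("==", lambda a: a["x"] + a["y"], lambda a: 1.5),     # x + y == 1.5
]
print(objectives({"x": 2.0, "y": 0.0}, constraints))       # (2, 2.5)
```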
|
Dong, Yi |
![]() Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael Lyu (Chinese University of Hong Kong, China; Singapore Management University, Singapore) ![]() |
|
Dong, Zixuan |
![]() Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Weiguo Chen, Zixuan Dong, Mingyue Leng, and Xiaoguang Mao (National University of Defense Technology, China) The current landscape of binary code summarization predominantly focuses on generating a single summary, which limits its utility and understanding for reverse engineers. Existing approaches often fail to meet the diverse needs of users, such as providing detailed insights into usage patterns, implementation nuances, and design rationale, as observed in the field of source code summarization. This highlights the need for multi-intent binary code summarization to enhance the effectiveness of reverse engineering processes. To address this gap, we propose MiSum, a novel method that leverages multi-modality heterogeneous code graph alignment and learning to integrate both assembly code and pseudo-code. MiSum introduces a unified multi-modality heterogeneous code graph (MM-HCG) that aligns assembly code graphs with pseudo-code graphs, capturing both low-level execution details and high-level structural information. We further propose multi-modality heterogeneous graph learning with heterogeneous mutual attention and message passing, which highlights important code blocks and discovers inter-dependencies across different code forms. Additionally, an intent-aware summary generator with an intent-aware attention mechanism is introduced to produce customized summaries tailored to multiple intents. Extensive experiments, including evaluations across various architectures and optimization levels, demonstrate that MiSum outperforms state-of-the-art baselines in BLEU, METEOR, and ROUGE-L metrics. Human evaluations validate its capability to effectively support reverse engineers in understanding diverse binary code intents, marking a significant advancement in binary code analysis. ![]() ![]() ![]() |
|
Du, Ruiying |
![]() Ruichao Liang, Jing Chen, Ruochen Cao, Kun He, Ruiying Du, Shuhua Li, Zheng Lin, and Cong Wu (Wuhan University, China; University of Hong Kong, Hong Kong) Smart contracts, as Turing-complete programs managing billions of assets in decentralized finance, are prime targets for attackers. While fuzz testing seems effective for detecting vulnerabilities in these programs, we identify several significant challenges when targeting smart contracts: (i) the stateful nature of these contracts requires stateful exploration, but current fuzzers rely on transaction sequences to manipulate contract states, making the process inefficient; (ii) contract execution is influenced by the continuously changing blockchain environment, yet current fuzzers are limited to local deployments, failing to test contracts in real-world scenarios. These challenges hinder current fuzzers from uncovering hidden vulnerabilities, i.e., those concealed in deep contract states and specific blockchain environments. In this paper, we present SmartShot, a mutable snapshot-based fuzzer to hunt hidden vulnerabilities within smart contracts. We innovatively formulate contract states and blockchain environments as directly fuzzable elements and design mutable snapshots to quickly restore and mutate these elements. SmartShot features a symbolic taint analysis-based mutation strategy along with double validation to soundly guide the state mutation. SmartShot mutates blockchain environments using contract’s historical on-chain states, providing real-world execution contexts. We propose a snapshot checkpoint mechanism to integrate mutable snapshots into SmartShot’s fuzzing loops. These innovations enable SmartShot to effectively fuzz contract states, test contracts across varied and realistic blockchain environments, and support on-chain fuzzing. Experimental results show that SmartShot is effective to detect hidden vulnerabilities with the highest code coverage and lowest false positive rate. SmartShot is 4.8× to 20.2× faster than state-of-the-art tools, identifying 2,150 vulnerable contracts out of 42,738 real-world contracts which is 2.1× to 13.7× more than other tools. SmartShot has demonstrated its real-world impact by detecting vulnerabilities that are only discoverable on-chain and uncovering 24 0-day vulnerabilities in the latest 10,000 deployed contracts. ![]() |
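The mutable-snapshot loop described for SmartShot can be pictured with the toy sketch below, which snapshots, mutates, and restores a contract state and blockchain environment between executions; the state model, mutation operators, and bug condition are invented for illustration.

```python
import copy, random

# Conceptual sketch: treat contract state and blockchain environment as
# directly fuzzable elements that are snapshotted, mutated, and restored.

def take_snapshot(state, env):
    return copy.deepcopy((state, env))

def restore(snapshot):
    return copy.deepcopy(snapshot)

def mutate(state, env, rng):
    state["balance"] = max(0, state["balance"] + rng.randint(-50, 50))
    env["block_timestamp"] += rng.randint(0, 600)
    return state, env

def execute(state, env):
    # Stand-in for contract execution: report a "bug" in a deep, env-dependent state.
    return state["balance"] == 0 and env["block_timestamp"] % 7 == 0

def fuzz(seed_state, seed_env, iterations=10_000, seed=0):
    rng = random.Random(seed)
    snapshot = take_snapshot(seed_state, seed_env)
    for _ in range(iterations):
        state, env = restore(snapshot)          # fast reset instead of redeploying
        state, env = mutate(state, env, rng)
        if execute(state, env):
            return state, env
    return None

print(fuzz({"balance": 40}, {"block_timestamp": 1_700_000_000}))
```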
|
Du, Xiaohu |
![]() Xiaohu Du, Ming Wen, Haoyu Wang, Zichao Wei, and Hai Jin (Huazhong University of Science and Technology, China) ![]() |
|
Du, Xiaoning |
![]() Lam Nguyen Tung, Steven Cho, Xiaoning Du, Neelofar Neelofar, Valerio Terragni, Stefano Ruberto, and Aldeida Aleti (Monash University, Australia; University of Auckland, New Zealand; Royal Melbourne Institute of Technology, Australia; Joint Research Centre at the European Commission, Italy) ![]() |
|
Dunn, Soren |
![]() Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang (University of Illinois at Urbana-Champaign, USA) Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless – an agentless approach to automatically resolve software development issues. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents at the time of paper submission! Agentless also achieves more than 50% solve rate when using Claude 3.5 Sonnet on the new SWE-bench Verified benchmark. In fact, Agentless has already been adopted by OpenAI as the go-to approach to showcase the real-world coding performance of both GPT-4o and the new o1 models; more recently, Agentless has also been used by DeepSeek to evaluate their newest DeepSeek V3 and R1 models. Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patches or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-𝑆 by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the currently overlooked potential of a simplistic, cost-effective technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction. We have open-sourced Agentless at: https://github.com/OpenAutoCoder/Agentless ![]() |
|
Dwyer, Matthew |
![]() Soneya Binta Hossain, Raygan Taylor, and Matthew Dwyer (University of Virginia, USA; Dillard University, USA) Code documentation is a critical artifact of software development, bridging human understanding and machine- readable code. Beyond aiding developers in code comprehension and maintenance, documentation also plays a critical role in automating various software engineering tasks, such as test oracle generation (TOG). In Java, Javadoc comments offer structured, natural language documentation embedded directly within the source code, typically describing functionality, usage, parameters, return values, and exceptional behavior. While prior research has explored the use of Javadoc comments in TOG alongside other information, such as the method under test (MUT), their potential as a stand-alone input source, the most relevant Javadoc components, and guidelines for writing effective Javadoc comments for automating TOG remain less explored. In this study, we investigate the impact of Javadoc comments on TOG through a comprehensive analysis. We begin by fine-tuning 10 large language models using three different prompt pairs to assess the role of Javadoc comments alongside other contextual information. Next, we systematically analyze the impact of different Javadoc comment’s components on TOG. To evaluate the generalizability of Javadoc comments from various sources, we also generate them using the GPT-3.5 model. We perform a thorough bug detection study using Defects4J dataset to understand their role in real-world bug detection. Our results show that incorporating Javadoc comments improves the accuracy of test oracles in most cases, aligning closely with ground truth. We find that Javadoc comments alone can match or even outperform approaches that utilize the MUT implementation. Additionally, we identify that the description and the return tag are the most valuable components for TOG. Finally, our approach, when using only Javadoc comments, detects between 19% and 94% more real-world bugs in Defects4J than prior methods, establishing a new state-of-the-art. To further guide developers in writing effective documentation, we conduct a detailed qualitative study on when Javadoc comments are helpful or harmful for TOG. ![]() ![]() Trey Woodlief, Felipe Toledo, Matthew Dwyer, and Sebastian Elbaum (University of Virginia, USA) ![]() |
|
Eghbali, Aryaz |
![]() Aryaz Eghbali, Felix Burk, and Michael Pradel (University of Stuttgart, Germany) Python is a dynamic language with applications in many domains, and one of the most popular languages in recent years. To increase code quality, developers have turned to “linters” that statically analyze the source code and warn about potential programming problems. However, the inherent limitations of static analysis and the dynamic nature of Python make it difficult or even impossible for static linters to detect some problems. This paper presents DyLin, the first dynamic linter for Python. Similar to a traditional linter, the approach has an extensible set of checkers, which, unlike in traditional linters, search for specific programming anti-patterns by analyzing the program as it executes. A key contribution of this paper is a set of 15 Python-specific anti-patterns that are hard to find statically but amenable to dynamic analysis, along with corresponding checkers to detect them. Our evaluation applies DyLin to 37 popular open-source Python projects on GitHub and to a dataset of code submitted to Kaggle machine learning competitions, totaling over 683k lines of Python code. The approach reports a total of 68 problems, 48 of which are previously unknown true positives, i.e., a precision of 70.6%. The detected problems include bugs that cause incorrect values, such as inf, incorrect behavior, e.g., missing out on relevant events, unnecessary computations that slow down the program, and unintended data leakage from test data to the training phase of machine learning pipelines. These issues remained unnoticed in public repositories for more than 3.5 years, on average, despite the fact that the corresponding code has been exercised by the developer-written tests. A comparison with popular static linters and a type checker shows that DyLin complements these tools by detecting problems that are missed statically. Based on our reporting of 42 issues to the developers, 31 issues have so far been fixed. ![]() |
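In the spirit of the dynamic checkers described above, the sketch below warns when a function silently produces a non-finite value at runtime; DyLin's real checkers hook into execution differently and cover many more anti-patterns, so this decorator is only an illustration.

```python
import functools, math, warnings

# Toy dynamic check: observe return values as the program runs and warn about
# one specific anti-pattern (silently producing inf/NaN).

def check_finite(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(result, float) and not math.isfinite(result):
            warnings.warn(f"{func.__name__} returned a non-finite value: {result}")
        return result
    return wrapper

@check_finite
def scale(x, total):
    return x / total if total else float("inf")   # questionable fallback

scale(3.0, 0)        # triggers the dynamic check
scale(3.0, 6.0)      # fine
```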
|
Elbaum, Sebastian |
![]() Trey Woodlief, Felipe Toledo, Matthew Dwyer, and Sebastian Elbaum (University of Virginia, USA) ![]() ![]() Kathryn T. Stolee, Tobias Welp, Caitlin Sadowski, and Sebastian Elbaum (North Carolina State University, USA; Google, Germany; Unaffiliated, USA; University of Virginia, USA) Code search is an integral part of a developer’s workflow. In 2015, researchers published a paper reflecting on the code search practices at Google of 27 developers who used the internal Code Search tool. That paper had first-hand accounts for why those developers were using code search and highlighted how often and in what situations developers were searching for code. In the past decade, much has changed in the landscape of developer support. New languages have emerged, artificial intelligence (AI) for code generation has gained traction, auto-complete in the IDE has gotten better, Q&A forums have increased in popularity, and code repositories are larger than ever. It is worth considering whether those observations from almost a decade ago have stood the test of time. In this work, inspired by the prior survey about the Code Search tool, we run a series of three surveys with 1,945 total responses and report overall Code Search usage statistics for over 100,000 users. Unlike the prior work, in our surveys, we include explicit success criteria to understand when code search is meeting their needs, and when it is not. We dive further into two common sub-categories of code search effort: when its users are looking for examples and when they are using code search alongside code review. We find that Code Search users continue to use the tool frequently and the frequency has not changed despite the introduction of AI-enhanced development support. Users continue to turn to Code Search to find examples, but the frequency of example-seeking behavior has decreased. More often than before, users access the tool to learn about and explore code. This has implications for future Code Search support in software development. ![]() |
|
Faisal, Fazle |
![]() Tanakorn Leesatapornwongsa, Fazle Faisal, and Suman Nath (Microsoft Research, USA) ![]() |
|
Famelis, Michalis |
![]() Mouna Dhaouadi, Bentley Oakes, and Michalis Famelis (Université de Montréal, Canada; Polytechnique Montréal, Canada) ![]() |
|
Fan, Meng |
![]() Meng Fan, Yuxia Zhang, Klaas-Jan Stol, and Hui Liu (Beijing Institute of Technology, China; University College Cork, Ireland; Lero, Ireland; SINTEF, Norway) Continued contributions of core developers in open source software (OSS) projects are key for sustaining and maintaining successful OSS projects. A major risk to the sustainability of OSS projects is developer turnover. Prior studies have explored developer turnover at the level of individual projects. A shortcoming of such studies is that they ignore the impact of developer turnover on downstream projects. Yet, an awareness of the turnover of core developers offers useful insights to the rest of an open source ecosystem. This study performs a large-scale empirical analysis of code developer turnover in the Rust package ecosystem. We find that the turnover of core developers is quite common in the whole Rust ecosystem with 36,991 packages. This is particularly worrying as a vast majority of Rust packages only have a single core developer. We found that core developer turnover can significantly decrease the quality and efficiency of software development and maintenance, even leading to deprecation. This is a major source of concern for those Rust packages that are widely used. We surveyed developers' perspectives on the turnover of core developers in upstream packages. We found that developers widely agreed that core developer turnover can affect project stability and sustainability. They also emphasized the importance of transparency and timely notifications regarding the health status of upstream dependencies. This study provides unique insights to help communities focus on building reliable software dependency networks. ![]() |
|
Fan, Ming |
![]() Songyang Yan, Xiaodong Zhang, Kunkun Hao, Haojie Xin, Yonggang Luo, Jucheng Yang, Ming Fan, Chao Yang, Jun Sun, and Zijiang Yang (Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China) The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS's reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulators using various ADSs. We verify OSG's ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels. ![]() ![]() |
|
Fan, Wei |
![]() Zhi Tu, Liangkun Niu, Wei Fan, and Tianyi Zhang (Purdue University, USA) ![]() |
|
Fan, Xingyu |
![]() Junjie Chen, Xingyu Fan, Chen Yang, Shuang Liu, and Jun Sun (Tianjin University, China; Renmin University of China, China; Singapore Management University, Singapore) ![]() |
|
Fan, Zhaoji |
![]() Weibin Wu, Haoxuan Hu, Zhaoji Fan, Yitong Qiao, Yizhan Huang, Yichen Li, Zibin Zheng, and Michael Lyu (Sun Yat-sen University, China; Chinese University of Hong Kong, China) Deep learning (DL) has revolutionized various software engineering tasks. Particularly, the emergence of AI code generators has pushed the boundaries of automatic programming to synthesize entire programs based on user-defined specifications in natural language. However, it remains a mystery if these AI code generators rely on the copy-and-paste programming practice, resulting in code clone concerns. In this work, to comprehensively study the code cloning behavior of AI code generators, we conduct an empirical study on three state-of-the-art commercial AI code generators to investigate the existence of all types of clones, which remains underexplored. Our experimental results show that the total Type-1 and Type-2 clone rates of the state-of-the-art commercial AI code generators can reach up to 7.50%, indicating marked code clone issues. Furthermore, it is observed that AI code generators risk infringing copyrights and propagating buggy and vulnerable code resulting from cloning code and show a certain degree of stability in generating code clones. ![]() |
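For readers unfamiliar with the clone terminology used above, the following sketch (not the paper's tooling) approximates Type-1 and Type-2 clone checks on Python snippets: Type-1 ignores only layout and comments, while Type-2 additionally abstracts identifiers and literals.

```python
# Rough clone-check sketch; the fingerprinting scheme is an assumption for
# illustration, not the detector used in the study.
import io
import keyword
import token
import tokenize


def fingerprint(code: str, abstract_names: bool) -> tuple:
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in (token.COMMENT, token.NL, token.NEWLINE, token.INDENT,
                        token.DEDENT, token.ENDMARKER):
            continue  # layout and comments never count
        if (abstract_names and tok.type in (token.NAME, token.NUMBER, token.STRING)
                and not keyword.iskeyword(tok.string)):
            out.append(token.tok_name[tok.type])  # e.g. every identifier -> "NAME"
        else:
            out.append(tok.string)
    return tuple(out)


a = "def add(x, y):\n    return x + y  # sum\n"
b = "def plus(a, b):\n    return a + b\n"

print("Type-1 clone:", fingerprint(a, False) == fingerprint(b, False))  # False
print("Type-2 clone:", fingerprint(a, True) == fingerprint(b, True))    # True
```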
|
Fang, Chunrong |
![]() Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Zhenyu Chen, Baowen Xu, and Yang Liu (Nanyang Technological University, Singapore; Nanjing University, China) Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. This poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could jeopardize life safety. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper, we innovatively propose a backdoor elimination technique for secure code understanding, called EliBadCode. EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of non-backdoor (adversarial) perturbations for subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. Backdoor triggers can be viewed as backdoor (adversarial) perturbations. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode substantially decreases the average Attack Success Rate (ASR) of the advanced backdoor attack from 99.76% to 2.64%, significantly outperforming the three baselines. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline). ![]() |
|
Fang, Jianbin |
![]() Weiyuan Tong, Zixu Wang, Zhanyong Tang, Jianbin Fang, Yuqun Zhang, and Guixin Ye (NorthWest University, China; National University of Defense Technology, China; Southern University of Science and Technology, China) ![]() |
|
Feldt, Robert |
![]() Matteo Biagiola, Robert Feldt, and Paolo Tonella (USI Lugano, Switzerland; Chalmers University of Technology, Sweden) Adaptive Random Testing (ART) has faced criticism, particularly for its computational inefficiency, as highlighted by Arcuri and Briand. Their analysis clarified how ART requires a quadratic number of distance computations as the number of test executions increases, which limits its scalability in scenarios requiring extensive testing to uncover faults. Simulation results support this, showing that the computational overhead of these distance calculations often outweighs ART’s benefits. While various ART variants have attempted to reduce these costs, they frequently do so at the expense of fault detection, lack complexity guarantees, or are restricted to specific input types, such as numerical or discrete data. In this paper, we introduce a novel framework for adaptive random testing that replaces pairwise distance computations with a compact aggregation of past executions, such as counting the q-grams observed in previous runs. Test case selection then leverages this aggregated data to measure diversity (e.g., entropy of q-grams), allowing us to reduce the computational complexity from quadratic to linear. Experiments with a benchmark of six web applications show that ART with q-grams covers, on average, 4× more unique targets than random testing, and 3.5× more than ART using traditional distance-based methods. ![]() |
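A condensed sketch of the q-gram idea described in the abstract follows: instead of pairwise distances, past executions are aggregated into q-gram counts, and the next test is the candidate whose (simulated) coverage trace adds the most diversity, measured here as Shannon entropy of the updated q-gram distribution. The function names and the toy trace model are illustrative assumptions, not the authors' implementation.

```python
# Aggregation-based test selection sketch, assuming traces are sequences of
# covered targets; everything below is a simplified stand-in.
import math
import random
from collections import Counter


def qgrams(trace, q=2):
    return [tuple(trace[i:i + q]) for i in range(len(trace) - q + 1)]


def entropy(counts: Counter) -> float:
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pick_next(candidates, history: Counter, q=2):
    """Linear scan: choose the candidate maximising entropy of history + candidate."""
    best, best_gain = None, -1.0
    for trace in candidates:
        merged = history + Counter(qgrams(trace, q))
        gain = entropy(merged)
        if gain > best_gain:
            best, best_gain = trace, gain
    return best


if __name__ == "__main__":
    random.seed(0)
    history = Counter()
    for step in range(5):
        # Candidate "traces" stand in for the targets covered by freshly generated tests.
        candidates = [[random.choice("ABCD") for _ in range(8)] for _ in range(10)]
        chosen = pick_next(candidates, history)
        history.update(qgrams(chosen))
        print(f"step {step}: chose trace {''.join(chosen)}")
```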
|
Feng, Jia |
![]() Chaozheng Wang, Jia Feng, Shuzheng Gao, Cuiyun Gao, Zongjie Li, Ting Peng, Hailiang Huang, Yuetang Deng, and Michael Lyu (Chinese University of Hong Kong, China; University of Electronic Science and Technology of China, China; Hong Kong University of Science and Technology, China; Tencent, China) Large Code Models (LCMs) have demonstrated remarkable effectiveness across various code intelligence tasks. Supervised fine-tuning is essential to optimize their performance for specific downstream tasks. Compared with the traditional full-parameter fine-tuning (FFT) method, Parameter-Efficient Fine-Tuning (PEFT) methods can train LCMs with substantially reduced resource consumption and have gained widespread attention among researchers and practitioners. While existing studies have explored PEFT methods for code intelligence tasks, they have predominantly focused on a limited subset of scenarios, such as code generation with publicly available datasets, leading to constrained generalizability of the findings. To mitigate the limitation, we conduct a comprehensive study on exploring the effectiveness of the PEFT methods, which involves five code intelligence tasks containing both public and private data. Our extensive experiments reveal a considerable performance gap between PEFT methods and FFT, which is contrary to the findings of existing studies. We also find that this disparity is particularly pronounced in tasks involving private data. To improve the tuning performance for LCMs while reducing resource utilization during training, we propose a Layer-Wise Optimization (LWO) strategy in the paper. LWO incrementally updates the parameters of each layer in the whole model architecture, without introducing any additional component and inference overhead. Experiments across five LCMs and five code intelligence tasks demonstrate that LWO trains LCMs more effectively and efficiently compared to previous PEFT methods, with significant improvements in tasks using private data. For instance, in the line-level code completion task using our private code repositories, LWO outperforms the state-of-the-art LoRA method by 22% and 12% in terms of accuracy and BLEU scores, respectively. Furthermore, LWO can enable more efficient LCM tuning, reducing the training time by an average of 42.7% compared to LoRA. ![]() |
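As a highly simplified sketch of the layer-wise optimization (LWO) idea described above, the PyTorch snippet below updates one layer at a time while the rest stay frozen, adding no extra modules or inference overhead. The round-robin schedule and the toy model are assumptions for illustration, not the paper's exact procedure.

```python
# Layer-wise tuning sketch (assumed round-robin schedule, toy model and data).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32),
                      nn.ReLU(), nn.Linear(32, 4))
layers = [m for m in model if isinstance(m, nn.Linear)]
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(64, 16)             # stand-in for a fine-tuning batch
y = torch.randint(0, 4, (64,))

for epoch in range(3):
    for i, active in enumerate(layers):
        # Freeze everything except the layer currently being tuned.
        for layer in layers:
            for p in layer.parameters():
                p.requires_grad_(layer is active)
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        print(f"epoch {epoch} layer {i}: loss={loss.item():.4f}")
```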
|
Feng, Ruitao |
![]() Jiongchi Yu, Xie Xiaofei, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, and Frank Liau (Singapore Management University, Singapore; Tianjin University, China; Zhejiang University, China; Shanghai Jiao Tong University, China; University of Tokyo, Japan; University of Alberta, Canada; Southern Cross University, Australia; Government Technology Agency of Singapore, Singapore) ![]() |
|
Feng, Shiwei |
![]() Sayali Kate, Yifei Gao, Shiwei Feng, and Xiangyu Zhang (Purdue University, USA) The increasingly popular Robot Operating System (ROS) framework allows building robotic systems by integrating newly developed and/or reused modules, where the modules can use different versions of the framework (e.g., ROS1 or ROS2) and programming languages (e.g., C++ or Python). The majority of such robotic systems' work happens in callbacks. The framework provides various elements for initializing callbacks and for setting up the execution of callbacks. It is the responsibility of developers to compose callbacks and their execution setup elements, and a developer's incomplete knowledge of the semantics of these elements in various versions of the framework can therefore lead to inconsistencies in the setup of callback execution. Some of these inconsistencies do not throw errors at runtime, making their detection difficult for developers. We propose a static approach to detecting such inconsistencies by extracting a static view of the composition of robotic system's callbacks and their execution setup, and then checking it against the composition conventions based on the elements' semantics. We evaluate our ROSCallBaX prototype on a dataset created from publicly available posts on developer forums and ROS projects. The evaluation results show that our approach can detect real inconsistencies. ![]() |
|
Feng, Yebo |
![]() Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Zhenyu Chen, Baowen Xu, and Yang Liu (Nanyang Technological University, Singapore; Nanjing University, China) Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. This poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could jeopardize life safety. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper, we innovatively propose a backdoor elimination technique for secure code understanding, called EliBadCode. EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of non-backdoor (adversarial) perturbations for subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. Backdoor triggers can be viewed as backdoor (adversarial) perturbations. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode substantially decreases the average Attack Success Rate (ASR) of the advanced backdoor attack from 99.76% to 2.64%, significantly outperforming the three baselines. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline). ![]() |
|
Feng, Yuan |
![]() Ranindya Paramitha, Yuan Feng, and Fabio Massacci (University of Trento, Italy; Vrije Universiteit Amsterdam, Netherlands) Vulnerability datasets used for ML testing implicitly contain retrospective information. When tested in the field, one can only use the labels available at the time of training and testing (e.g. seen and assumed negatives). As vulnerabilities are discovered across calendar time, labels change and past performance is not necessarily aligned with future performance. Past works only considered slices of the whole history (e.g. DiverseVul) or individual differences between releases (e.g. Jimenez et al. ESEC/FSE 2019). Such approaches are either too optimistic in training (e.g. the whole history) or too conservative (e.g. consecutive releases). We propose a method to restructure a dataset into a series of datasets in which both training and testing labels change to account for the knowledge available at the time. If the model is actually learning, it should improve its performance over time as more data becomes available and data becomes more stable, an effect that can be checked with the Mann-Kendall test. We validate our methodology for vulnerability detection with 4 time-based datasets (3 projects from the BigVul dataset + Vuldeepecker’s NVD) and 5 ML models (Code2Vec, CodeBERT, LineVul, ReGVD, and Vuldeepecker). In contrast to the intuitive expectation (more retrospective information, better performance), the trend results show that performance changes inconsistently across the years, showing that most models are not learning. ![]() |
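A compact sketch of the evaluation idea above: restructure a labelled dataset into calendar-time snapshots (only labels known by the cut-off year are visible), score a model per snapshot, and apply the Mann-Kendall test to the resulting performance series. The data layout, scoring values, and significance threshold below are illustrative assumptions.

```python
# Time-aware snapshotting plus a standard Mann-Kendall trend check
# (no-ties normal approximation); toy records and F1 values are fabricated.
import math


def visible_labels(records, cutoff_year):
    """A record counts as positive only if its vulnerability was disclosed by the cut-off."""
    return [(r["code"], 1 if r["disclosed"] <= cutoff_year else 0) for r in records]


def mann_kendall(series):
    """Return (S, z): S > 0 and large z indicate an upward trend."""
    n = len(series)
    s = sum((series[j] > series[i]) - (series[j] < series[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var = n * (n - 1) * (2 * n + 5) / 18          # no-ties approximation
    z = 0.0 if s == 0 else (s - math.copysign(1, s)) / math.sqrt(var)
    return s, z


if __name__ == "__main__":
    records = [{"code": f"func_{i}", "disclosed": 2018 + (i % 5)} for i in range(10)]
    for year in (2019, 2021):
        positives = sum(label for _, label in visible_labels(records, year))
        print(f"cut-off {year}: {positives} known-vulnerable samples")

    # Toy per-year F1 scores of a detector retrained on each snapshot.
    f1_by_year = [0.42, 0.45, 0.44, 0.47, 0.43, 0.46]
    s, z = mann_kendall(f1_by_year)
    print(f"S={s}, z={z:.2f} -> {'learning' if z > 1.645 else 'no clear upward trend'}")
```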
|
Finken, Tanner |
![]() Tanner Finken, Jesse Chen, and Sazzadur Rahaman (University of Arizona, USA) ![]() |
|
Fraser, Gordon |
![]() Sebastian Schweikl and Gordon Fraser (University of Passau, Germany) Programming is increasingly taught using dedicated block-based programming environments such as Scratch. While the use of blocks instead of text prevents syntax errors, learners can still make semantic mistakes implying a need for feedback and help. Since teachers may be overwhelmed by help requests in a classroom, may not have the required programming education themselves, and may simply not be available in independent learning scenarios, automated hint generation is desirable. Automated program repair can provide the foundation for automated hints, but relies on multiple assumptions: (1) Program repair usually aims to produce localized patches for fixing single bugs, but learners may fundamentally misunderstand programming concepts and tasks or request help for substantially incomplete programs. (2) Software tests are required to guide the search and to localize broken statements, but test suites for block-based programs are different to those considered in past research on fault localization and repair: They consist of system tests, where very few tests are sufficient to fully cover the code. At the same time, these tests have vastly longer execution times caused by the use of animations and interactions on Scratch programs, thus inhibiting the applicability of metaheuristic search. (3) The plastic surgery hypothesis assumes that the code necessary for repairs already exists in the codebase. Block-based programs tend to be small and may lack this necessary redundancy. In order to study whether automated program repair of block-based programs is nevertheless feasible, in this paper we introduce, to the best of our knowledge, the first automated program repair approach for Scratch programs based on evolutionary search. Our RePurr prototype includes novel refinements of fault localization to improve the lack of guidance of the test suites, recovers the plastic surgery hypothesis by exploiting that a learning scenario provides model and student solutions as alternatives, and uses parallelization and accelerated executions to reduce the costs of fitness evaluations. Empirical evaluation of RePurr on a set of real learners' programs confirms the anticipated challenges, but also demonstrates that the repair can nonetheless effectively improve and fix learners' programs, thus enabling automated generation of hints and feedback for learners. ![]() |
|
Freund, Stephen N. |
![]() Kyla H. Levin, Nicolas van Kempen, Emery D. Berger, and Stephen N. Freund (University of Massachusetts at Amherst, USA; Amazon Web Services, USA; Williams College, USA) ![]() |
|
Fruntke, Lukas |
![]() Lukas Fruntke and Jens Krinke (University College London, UK) Breaking changes in dependencies are a common challenge in software development, requiring manual intervention to resolve. This study examines how well LLMs automate the repair of breaking changes caused by dependency updates in Java projects. Although earlier methods have mostly concentrated on detecting breaking changes or investigating their impact, they have not been able to completely automate the repair process. We introduce and compare two new approaches: an agentic system that combines automated tool usage with an LLM, and a recursive zero-shot approach employing iterative prompt refinement. Our experimental framework assesses the repair success of both approaches, using the BUMP dataset of curated breaking changes. We also investigate the impact of variables such as dependency popularity and prompt configuration on repair outcomes. Our results demonstrate a substantial difference in test suite success rates, with the agentic approach achieving a repair success rate of up to 23% and the zero-shot prompting approach achieving up to 19%. We show that automated program repair of breaking dependencies with LLMs is feasible and can be optimised to achieve better repair outcomes. ![]() ![]() |
|
Gao, Amiao |
![]() Amiao Gao, Zenong Zhang, Simin Wang, LiGuo Huang, Shiyi Wei, and Vincent Ng (Southern Methodist University, Dallas, USA; University of Texas at Dallas, USA) ![]() |
|
Gao, Cuiyun |
![]() Chaozheng Wang, Jia Feng, Shuzheng Gao, Cuiyun Gao, Zongjie Li, Ting Peng, Hailiang Huang, Yuetang Deng, and Michael Lyu (Chinese University of Hong Kong, China; University of Electronic Science and Technology of China, China; Hong Kong University of Science and Technology, China; Tencent, China) Large Code Models (LCMs) have demonstrated remarkable effectiveness across various code intelligence tasks. Supervised fine-tuning is essential to optimize their performance for specific downstream tasks. Compared with the traditional full-parameter fine-tuning (FFT) method, Parameter-Efficient Fine-Tuning (PEFT) methods can train LCMs with substantially reduced resource consumption and have gained widespread attention among researchers and practitioners. While existing studies have explored PEFT methods for code intelligence tasks, they have predominantly focused on a limited subset of scenarios, such as code generation with publicly available datasets, leading to constrained generalizability of the findings. To mitigate the limitation, we conduct a comprehensive study on exploring the effectiveness of the PEFT methods, which involves five code intelligence tasks containing both public and private data. Our extensive experiments reveal a considerable performance gap between PEFT methods and FFT, which is contrary to the findings of existing studies. We also find that this disparity is particularly pronounced in tasks involving private data. To improve the tuning performance for LCMs while reducing resource utilization during training, we propose a Layer-Wise Optimization (LWO) strategy in the paper. LWO incrementally updates the parameters of each layer in the whole model architecture, without introducing any additional component and inference overhead. Experiments across five LCMs and five code intelligence tasks demonstrate that LWO trains LCMs more effectively and efficiently compared to previous PEFT methods, with significant improvements in tasks using private data. For instance, in the line-level code completion task using our private code repositories, LWO outperforms the state-of-the-art LoRA method by 22% and 12% in terms of accuracy and BLEU scores, respectively. Furthermore, LWO can enable more efficient LCM tuning, reducing the training time by an average of 42.7% compared to LoRA. ![]() ![]() Yujia Chen, Yang Ye, Zhongqi Li, Yuchi Ma, and Cuiyun Gao (Harbin Institute of Technology, Shenzhen, China; Huawei Cloud Computing Technologies, China) Large code models (LCMs) have remarkably advanced the field of code generation. Despite their impressive capabilities, they still face practical deployment issues, such as high inference costs, limited accessibility of proprietary LCMs, and adaptability issues of ultra-large LCMs. These issues highlight the critical need for more accessible, lightweight yet effective LCMs. Knowledge distillation (KD) offers a promising solution, which transfers the programming capabilities of larger, advanced LCMs (Teacher) to smaller, less powerful LCMs (Student). However, existing KD methods for code intelligence often lack consideration of fault domain knowledge and rely on static seed knowledge, leading to degraded programming capabilities of student models. In this paper, we propose a novel Self-Paced knOwledge DistillAtion framework, named SODA, aiming at developing lightweight yet effective student LCMs via adaptively transferring the programming capabilities from advanced teacher LCMs. 
SODA consists of three stages in one cycle: (1) Correct-and-Fault Knowledge Delivery stage aims at improving the student model’s capability to recognize errors while ensuring its basic programming skill during the knowledge transferring, which involves correctness-aware supervised learning and fault-aware contrastive learning methods. (2) Multi-View Feedback stage aims at measuring the quality of results generated by the student model from two views, including model-based and static tool-based measurement, for identifying the difficult questions; (3) Feedback-based Knowledge Update stage aims at updating the student model adaptively by generating new questions at different difficulty levels, in which the difficulty levels are categorized based on the feedback in the second stage. By performing the distillation process iteratively, the student model is continuously refined through learning more advanced programming skills from the teacher model. We compare SODA with four state-of-the-art KD approaches on three widely-used code generation benchmarks with different programming languages. Experimental results show that SODA improves the student model by 65.96% in terms of average Pass@1, outperforming the best baseline PERsD by 29.85%. Based on the proposed SODA framework, we develop SodaCoder, a series of lightweight yet effective LCMs with ∼7B parameters, which outperform 15 LCMs with less than or equal to 16B parameters. Notably, SodaCoder-DS-6.7B, built on DeepseekCoder-6.7B, even surpasses the prominent ChatGPT on average Pass@1 across seven programming languages (66.4 vs. 61.3). ![]() |
|
Gao, Shan |
![]() Junwei Zhang, Xing Hu, Shan Gao, Xin Xia, David Lo, and Shanping Li (Zhejiang University, China; Huawei, China; Singapore Management University, Singapore) Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, to categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest to two widely used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of the datasets contain noise, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models. Fine-tuning the four LLMs with the filtered Methods2Test dataset, on average, improves its performance by 67% in branch coverage, using the Defects4J benchmark. For the Atlas dataset, the four LLMs improve branch coverage by 39%. Additionally, filtering noise improves bug detection performance, resulting in a 21.42% increase in bugs detected by the generated tests. ![]() |
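To make the filter pipeline concrete, the sketch below shows how rule-based syntax and relevance filters could prune noisy (focal method, test) pairs: the syntax filter drops structurally broken or assertion-free tests, and the relevance filter drops tests that never call the focal method. The heuristics are illustrative assumptions, not CleanTest's actual rules.

```python
# Toy rule-based filtering over Java test snippets represented as strings.
import re


def syntax_ok(test_src: str) -> bool:
    balanced = test_src.count("{") == test_src.count("}")
    has_assertion = bool(re.search(r"\bassert\w*\s*\(", test_src))
    return balanced and has_assertion


def relevant(test_src: str, focal_name: str) -> bool:
    return re.search(rf"\b{re.escape(focal_name)}\s*\(", test_src) is not None


def clean(pairs):
    return [(m, t) for m, t in pairs if syntax_ok(t) and relevant(t, m)]


pairs = [
    ("add", "@Test void t() { assertEquals(3, calc.add(1, 2)); }"),
    ("add", "@Test void t() { calc.add(1, 2); }"),                   # no assertion
    ("add", "@Test void t() { assertTrue(calc.sub(1, 2) < 0); }"),   # irrelevant to focal method
]
print(f"kept {len(clean(pairs))} of {len(pairs)} pairs")  # -> kept 1 of 3
```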
|
Gao, Shuzheng |
![]() Chaozheng Wang, Jia Feng, Shuzheng Gao, Cuiyun Gao, Zongjie Li, Ting Peng, Hailiang Huang, Yuetang Deng, and Michael Lyu (Chinese University of Hong Kong, China; University of Electronic Science and Technology of China, China; Hong Kong University of Science and Technology, China; Tencent, China) Large Code Models (LCMs) have demonstrated remarkable effectiveness across various code intelligence tasks. Supervised fine-tuning is essential to optimize their performance for specific downstream tasks. Compared with the traditional full-parameter fine-tuning (FFT) method, Parameter-Efficient Fine-Tuning (PEFT) methods can train LCMs with substantially reduced resource consumption and have gained widespread attention among researchers and practitioners. While existing studies have explored PEFT methods for code intelligence tasks, they have predominantly focused on a limited subset of scenarios, such as code generation with publicly available datasets, leading to constrained generalizability of the findings. To mitigate the limitation, we conduct a comprehensive study on exploring the effectiveness of the PEFT methods, which involves five code intelligence tasks containing both public and private data. Our extensive experiments reveal a considerable performance gap between PEFT methods and FFT, which is contrary to the findings of existing studies. We also find that this disparity is particularly pronounced in tasks involving private data. To improve the tuning performance for LCMs while reducing resource utilization during training, we propose a Layer-Wise Optimization (LWO) strategy in the paper. LWO incrementally updates the parameters of each layer in the whole model architecture, without introducing any additional component and inference overhead. Experiments across five LCMs and five code intelligence tasks demonstrate that LWO trains LCMs more effectively and efficiently compared to previous PEFT methods, with significant improvements in tasks using private data. For instance, in the line-level code completion task using our private code repositories, LWO outperforms the state-of-the-art LoRA method by 22% and 12% in terms of accuracy and BLEU scores, respectively. Furthermore, LWO can enable more efficient LCM tuning, reducing the training time by an average of 42.7% compared to LoRA. ![]() |
|
Gao, Yi |
![]() Yi Gao, Xing Hu, Xiaohu Yang, and Xin Xia (Zhejiang University, China) Test smells arise from poor design practices and insufficient domain knowledge, which can lower the quality of test code and make it harder to maintain and update. Manually refactoring of test smells is time-consuming and error-prone, highlighting the necessity for automated approaches. Current rule-based refactoring methods often struggle in scenarios not covered by predefined rules and lack the flexibility needed to handle diverse cases effectively. In this paper, we propose a novel approach called UTRefactor, a context-enhanced, LLM-based framework for automatic test refactoring in Java projects. UTRefactor extracts relevant context from test code and leverages an external knowledge base that includes test smell definitions, descriptions, and DSL-based refactoring rules. By simulating the manual refactoring process through a chain-of-thought approach, UTRefactor guides the LLM to eliminate test smells in a step-by-step process, ensuring both accuracy and consistency throughout the refactoring. Additionally, we implement a checkpoint mechanism to facilitate comprehensive refactoring, particularly when multiple smells are present. We evaluate UTRefactor on 879 tests from six open-source Java projects, reducing the number of test smells from 2,375 to 265, achieving an 89% reduction. UTRefactor outperforms direct LLM-based refactoring methods by 61.82% in smell elimination and significantly surpasses the performance of a rule-based test smell refactoring tool. Our results demonstrate the effectiveness of UTRefactor in enhancing test code quality while minimizing manual involvement. ![]() |
|
Gao, Yifei |
![]() Sayali Kate, Yifei Gao, Shiwei Feng, and Xiangyu Zhang (Purdue University, USA) The increasingly popular Robot Operating System (ROS) framework allows building robotic systems by integrating newly developed and/or reused modules, where the modules can use different versions of the framework (e.g., ROS1 or ROS2) and programming languages (e.g., C++ or Python). The majority of such robotic systems' work happens in callbacks. The framework provides various elements for initializing callbacks and for setting up the execution of callbacks. It is the responsibility of developers to compose callbacks and their execution setup elements, and a developer's incomplete knowledge of the semantics of these elements in various versions of the framework can therefore lead to inconsistencies in the setup of callback execution. Some of these inconsistencies do not throw errors at runtime, making their detection difficult for developers. We propose a static approach to detecting such inconsistencies by extracting a static view of the composition of robotic system's callbacks and their execution setup, and then checking it against the composition conventions based on the elements' semantics. We evaluate our ROSCallBaX prototype on a dataset created from publicly available posts on developer forums and ROS projects. The evaluation results show that our approach can detect real inconsistencies. ![]() |
|
Garcia, Joshua |
![]() Yuntianyi Chen, Yuqi Huai, Yirui He, Shilong Li, Changnam Hong, Qi Alfred Chen, and Joshua Garcia (University of California at Irvine, USA) As autonomous driving systems (ADSes) become increasingly complex and integral to daily life, the importance of understanding the nature and mitigation of software bugs in these systems has grown correspondingly. Addressing the challenges of software maintenance in autonomous driving systems (e.g., handling real-time system decisions and ensuring safety-critical reliability) is crucial due to the unique combination of real-time decision-making requirements and the high stakes of operational failures in ADSes. The potential of automated tools in this domain is promising, yet there remains a gap in our comprehension of the challenges faced and the strategies employed during manual debugging and repair of such systems. In this paper, we present an empirical study that investigates bug-fix patterns in ADSes, with the aim of improving reliability and safety. We have analyzed the commit histories and bug reports of two major autonomous driving projects, Apollo and Autoware, from 1,331 bug fixes with the study of bug symptoms, root causes, and bug-fix patterns. Our study reveals several dominant bug-fix patterns, including those related to path planning, data flow, and configuration management. Additionally, we find that the frequency distribution of bug-fix patterns varies significantly depending on their nature and types and that certain categories of bugs are recurrent and more challenging to exterminate. Based on our findings, we propose a hierarchy of ADS bugs and two taxonomies of 15 syntactic bug-fix patterns and 27 semantic bug-fix patterns that offer guidance for bug identification and resolution. We also contribute a benchmark of 1,331 ADS bug-fix instances. ![]() ![]() |
|
Gröninger, Lars |
![]() Lars Gröninger, Beatriz Souza, and Michael Pradel (University of Stuttgart, Germany) Code changes are an integral part of the software development process. Many code changes are meant to improve the code without changing its functional behavior, e.g., refactorings and performance improvements. Unfortunately, validating whether a code change preserves the behavior is non-trivial, particularly when the code change is performed deep inside a complex project. This paper presents ChangeGuard, an approach that uses learning-guided execution to compare the runtime behavior of a modified function. The approach is enabled by the novel concept of pairwise learning-guided execution and by a set of techniques that improve the robustness and coverage of the state-of-the-art learning-guided execution technique. Our evaluation applies ChangeGuard to a dataset of 224 manually annotated code changes from popular Python open-source projects and to three datasets of code changes obtained by applying automated code transformations. Our results show that the approach identifies semantics-changing code changes with a precision of 77.1% and a recall of 69.5%, and that it detects unexpected behavioral changes introduced by automatic code refactoring tools. In contrast, the existing regression tests of the analyzed projects miss the vast majority of semantics-changing code changes, with a recall of only 7.6%. We envision our approach being useful for detecting unintended behavioral changes early in the development process and for improving the quality of automated code transformations. ![]() |
|
Guo, An |
![]() Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Zhenyu Chen, Baowen Xu, and Yang Liu (Nanyang Technological University, Singapore; Nanjing University, China) Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. This poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could jeopardize life safety. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper, we innovatively propose a backdoor elimination technique for secure code understanding, called EliBadCode. EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of non-backdoor (adversarial) perturbations for subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. Backdoor triggers can be viewed as backdoor (adversarial) perturbations. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode substantially decreases the average Attack Success Rate (ASR) of the advanced backdoor attack from 99.76% to 2.64%, significantly outperforming the three baselines. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline). ![]() |
|
Guo, Lihua |
![]() Lihua Guo, Yiwei Hou, Chijin Zhou, Quan Zhang, and Yu Jiang (Tsinghua University, China) In recent years, the increase in ransomware attacks has significantly impacted individuals and organizations. Many strategies have been proposed to detect ransomware’s file disruption operation. However, they rely on certain assumptions that gradually fail in the face of evolving ransomware attacks, which use more stealthy encryption methods or benign-imitation-based I/O orchestration. To mitigate this, we propose an approach to detect ransomware attacks through temporal correlation between encryption and I/O behaviors. Its key idea is that there is a strong temporal correlation inherent in ransomware’s encryption and I/O behaviors. To disrupt files, ransomware must first read the file data from the disk into memory, encrypt it, and then write the encrypted data back to the disk. This process creates a pronounced temporal correlation between the computation load of the encryption operations and the size of the files being encrypted. Based on this invariant, we implement a prototype called RansomRadar and evaluate its effectiveness against 411 of the latest ransomware samples from 89 families. Experimental results show that it achieves a detection rate of 100.00% with 2.68% false alarms. Its F1-score is 96.03, which is 31.82 higher than that of existing detectors on average. ![]() |
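The invariant described above can be illustrated with a toy correlation check: when a process is encrypting files, its per-window encryption workload and the size of its disk writes rise and fall together, so their correlation is high. The synthetic workload numbers and the threshold below are assumptions, not RansomRadar's actual detection logic.

```python
# Toy temporal-correlation check between encryption load and bytes written.
import math
import random


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def looks_like_ransomware(crypto_load, bytes_written, threshold=0.9):
    return pearson(crypto_load, bytes_written) >= threshold


random.seed(1)
file_sizes = [random.randint(50, 5000) for _ in range(20)]            # KB written per window
ransom_crypto = [s * 1.8 + random.gauss(0, 40) for s in file_sizes]   # tracks file sizes
benign_crypto = [random.uniform(0, 300) for _ in file_sizes]          # unrelated to writes

print("ransomware-like:", looks_like_ransomware(ransom_crypto, file_sizes))  # expected: True
print("benign-like:    ", looks_like_ransomware(benign_crypto, file_sizes))  # expected: False
```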
|
Guo, Yuzhe |
![]() Dezhi Ran, Yuan Cao, Yuzhe Guo, Yuetong Li, Mengzhou Wu, Simin Chen, Wei Yang, and Tao Xie (Peking University, China; Beijing Jiaotong University, China; University of Chicago, USA; University of Texas at Dallas, USA) ![]() |
|
Halfond, William G.J. |
![]() Zhaoxu Zhang, Komei Ryu, Tingting Yu, and William G.J. Halfond (University of Southern California, USA; University of Connecticut, USA) Bug report reproduction is a crucial but time-consuming task to be carried out during mobile app maintenance. To accelerate this process, researchers have developed automated techniques for reproducing mobile app bug reports. However, due to the lack of an effective mechanism to recognize different buggy behaviors described in the report, existing work is limited to reproducing crash bug reports, or requires developers to manually analyze execution traces to determine if a bug was successfully reproduced. To address this limitation, we introduce a novel technique to automatically identify and extract the buggy behavior from the bug report and detect it during the automated reproduction process. To accommodate various buggy behaviors of mobile app bugs, we conducted an empirical study and created a standardized representation for expressing the bug behavior identified from our study. Given a report, our approach first transforms the documented buggy behavior into this standardized representation, then matches it against real-time device and UI information during the reproduction to recognize the bug. Our empirical evaluation demonstrated that our approach achieved over 90% precision and recall in generating the standardized representation of buggy behaviors. It correctly identified bugs in 83% of the bug reports and enhanced existing reproduction techniques, allowing them to reproduce four times more bug reports. ![]() |
|
Hao, Dan |
![]() Mingxuan Zhu, Zeyu Sun, and Dan Hao (Peking University, China; Institute of Software at Chinese Academy of Sciences, China) Compilers are crucial software tools that usually convert programs in high-level languages into machine code. A compiler provides hundreds of optimizations to improve the performance of the compiled code, which are enabled or disabled via optimization flags. However, the vast number of combinations of these flags makes it extremely challenging to select the desired settings for compiler optimization flags (i.e., an optimization sequence) for a given target program. In the literature, many auto-tuning techniques have been proposed to select a desired optimization sequence via different strategies across the entire optimization space. However, due to the huge optimization space, these techniques commonly suffer from the widely recognized efficiency problem. To address this problem, in this paper, we propose a preference-driven selection approach PDCAT, which reduces the search space of optimization sequences through three components. In particular, PDCAT first identifies combined optimizations based on compiler documentation to exclude optimization sequences violating the combined constraints, and then categorizes the optimizations into a common optimization set (whose optimization flags are fixed) and an exploration set containing the remaining optimizations. Finally, within the search process, PDCAT assigns distinct enable probabilities to the explored optimization flags and ultimately selects a desired optimization sequence. The former two components reduce the search space by removing invalid optimization sequences and fixing some optimization flags, whereas the latter performs a biased search in the search space. To evaluate the performance of the proposed approach PDCAT, we conducted an extensive experimental study on the latest version of the GCC compiler with two widely used benchmarks, cBench and PolyBench. The results show that PDCAT significantly outperforms the four compared techniques, including the state-of-the-art technique SRTuner. Moreover, each component of PDCAT not only contributes to its performance, but also improves the acceleration performance of the compared techniques. ![]() |
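A schematic sketch of preference-driven sampling in the spirit of the description above: a common set of flags is fixed, each explored flag gets its own enable probability, and candidate sequences violating a combined-flag constraint are rejected before any costly measurement. The probabilities, the flag selection, the constraint, and the stubbed measure() function are illustrative assumptions, not PDCAT's tuned values or search procedure.

```python
# Biased sampling over GCC optimization flags; measure() is a placeholder.
import random

COMMON = ["-O2"]                      # always-on part of the sequence
EXPLORE = {                           # flag -> assumed enable probability (biased search)
    "-funroll-loops": 0.7,
    "-finline-functions": 0.6,
    "-ftree-vectorize": 0.5,
    "-fgcse": 0.4,
    "-funroll-all-loops": 0.2,
}


def valid(flags):
    # Illustrative combined constraint: GCC documents -funroll-all-loops as
    # implying -funroll-loops, so reject sequences enabling only the former.
    return not ("-funroll-all-loops" in flags and "-funroll-loops" not in flags)


def sample_sequence():
    while True:
        picked = [f for f, p in EXPLORE.items() if random.random() < p]
        if valid(picked):
            return COMMON + picked


def measure(sequence):
    """Placeholder: in practice, compile and run the benchmark with `sequence`
    (e.g. via subprocess) and return its runtime; here we fake a score."""
    return random.uniform(0.8, 1.2) - 0.01 * len(sequence)


random.seed(7)
best = min((sample_sequence() for _ in range(50)), key=measure)
print("best sequence found:", " ".join(best))
```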
|
Hao, Kunkun |
![]() Songyang Yan, Xiaodong Zhang, Kunkun Hao, Haojie Xin, Yonggang Luo, Jucheng Yang, Ming Fan, Chao Yang, Jun Sun, and Zijiang Yang (Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China) The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS's reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulators using various ADSs. We verify OSG's ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels. ![]() ![]() |
|
Hasan, Md Rashedul |
![]() Md Rashedul Hasan, Mohammad Rashedul Hasan, and Hamid Bagheri (University of Nebraska-Lincoln, USA) Optimizing object-relational database mapping (ORM) design is crucial for performance and scalability in modern software systems. However, widely used ORM tools offer limited support for exploring performance tradeoffs, often enforcing a single design and overlooking alternatives, which can lead to suboptimal outcomes. While systematic tradeoff analysis can reveal Pareto-optimal designs, its high computational cost and poor scalability hinder practical adoption. This paper presents DesignTradeoffSculptor, an extensible tool suite for efficient, scalable tradeoff analysis in ORM database design. Leveraging advanced Transformer-based deep learning models—trained and fine-tuned on formally analyzed database designs—and framing design exploration as a Natural Language Processing task, DesignTradeoffSculptor efficiently identifies and removes suboptimal designs, sharply reducing the number of candidates requiring costly tradeoff analysis. Experiments show that DesignTradeoffSculptor uncovers optimal designs missed by leading ORM tools and improves analysis efficiency by over 98.21%, reducing tradeoff analysis time from 15 days to just 18 minutes, demonstrating the transformative potential of integrating formal methods with deep learning. ![]() |
|
Hasan, Mohammad Rashedul |
![]() Md Rashedul Hasan, Mohammad Rashedul Hasan, and Hamid Bagheri (University of Nebraska-Lincoln, USA) Optimizing object-relational database mapping (ORM) design is crucial for performance and scalability in modern software systems. However, widely used ORM tools offer limited support for exploring performance tradeoffs, often enforcing a single design and overlooking alternatives, which can lead to suboptimal outcomes. While systematic tradeoff analysis can reveal Pareto-optimal designs, its high computational cost and poor scalability hinder practical adoption. This paper presents DesignTradeoffSculptor, an extensible tool suite for efficient, scalable tradeoff analysis in ORM database design. Leveraging advanced Transformer-based deep learning models—trained and fine-tuned on formally analyzed database designs—and framing design exploration as a Natural Language Processing task, DesignTradeoffSculptor efficiently identifies and removes suboptimal designs, sharply reducing the number of candidates requiring costly tradeoff analysis. Experiments show that DesignTradeoffSculptor uncovers optimal designs missed by leading ORM tools and improves analysis efficiency by over 98.21%, reducing tradeoff analysis time from 15 days to just 18 minutes, demonstrating the transformative potential of integrating formal methods with deep learning. ![]() |
|
He, Hao |
![]() Hao He, Bogdan Vasilescu, and Christian Kästner (Carnegie Mellon University, USA) Recent high-profile incidents in open-source software have greatly raised practitioner attention on software supply chain attacks. To guard against potential malicious package updates, security practitioners advocate pinning dependency to specific versions rather than floating in version ranges. However, it remains controversial whether pinning carries a meaningful security benefit that outweighs the cost of maintaining outdated and possibly vulnerable dependencies. In this paper, we quantify, through counterfactual analysis and simulations, the security and maintenance impact of version constraints in the npm ecosystem. By simulating dependency resolutions over historical time points, we find that pinning direct dependencies not only (as expected) increases the cost of maintaining vulnerable and outdated dependencies, but also (surprisingly) even increases the risk of exposure to malicious package updates in larger dependency graphs due to the specifics of npm’s dependency resolution mechanism. Finally, we explore collective pinning strategies to secure the ecosystem against supply chain attacks, suggesting specific changes to npm to enable such interventions. Our study provides guidance for practitioners and tool designers to manage their supply chains more securely. ![]() |
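The trade-off quantified above can be illustrated with a toy counterfactual model (not npm's actual resolver): floating in a caret-like range picks up the newest compatible release, including a malicious one, while pinning keeps an aging, possibly vulnerable version. All package data below is fabricated.

```python
# Toy pin-vs-float resolution over a fabricated registry timeline.
from dataclasses import dataclass


@dataclass
class Release:
    version: tuple      # (major, minor, patch)
    date: int           # days since some epoch
    malicious: bool = False
    vulnerable: bool = False


REGISTRY = {
    "left-pad-ish": [
        Release((1, 0, 0), date=0, vulnerable=True),
        Release((1, 1, 0), date=100),
        Release((1, 1, 1), date=200, malicious=True),   # hijacked update
    ]
}


def resolve(pkg, constraint, today):
    """constraint: ('pin', version) or ('caret', major) -- newest match available today."""
    candidates = [r for r in REGISTRY[pkg] if r.date <= today]
    if constraint[0] == "pin":
        return next(r for r in candidates if r.version == constraint[1])
    return max((r for r in candidates if r.version[0] == constraint[1]),
               key=lambda r: r.version)


for day in (50, 150, 250):
    floating = resolve("left-pad-ish", ("caret", 1), day)
    pinned = resolve("left-pad-ish", ("pin", (1, 0, 0)), day)
    print(f"day {day}: floating -> {floating.version} (malicious={floating.malicious}), "
          f"pinned -> {pinned.version} (vulnerable={pinned.vulnerable})")
```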
|
He, Kun |
![]() Ruichao Liang, Jing Chen, Ruochen Cao, Kun He, Ruiying Du, Shuhua Li, Zheng Lin, and Cong Wu (Wuhan University, China; University of Hong Kong, Hong Kong) Smart contracts, as Turing-complete programs managing billions of assets in decentralized finance, are prime targets for attackers. While fuzz testing seems effective for detecting vulnerabilities in these programs, we identify several significant challenges when targeting smart contracts: (i) the stateful nature of these contracts requires stateful exploration, but current fuzzers rely on transaction sequences to manipulate contract states, making the process inefficient; (ii) contract execution is influenced by the continuously changing blockchain environment, yet current fuzzers are limited to local deployments, failing to test contracts in real-world scenarios. These challenges hinder current fuzzers from uncovering hidden vulnerabilities, i.e., those concealed in deep contract states and specific blockchain environments. In this paper, we present SmartShot, a mutable snapshot-based fuzzer to hunt hidden vulnerabilities within smart contracts. We innovatively formulate contract states and blockchain environments as directly fuzzable elements and design mutable snapshots to quickly restore and mutate these elements. SmartShot features a symbolic taint analysis-based mutation strategy along with double validation to soundly guide the state mutation. SmartShot mutates blockchain environments using contract’s historical on-chain states, providing real-world execution contexts. We propose a snapshot checkpoint mechanism to integrate mutable snapshots into SmartShot’s fuzzing loops. These innovations enable SmartShot to effectively fuzz contract states, test contracts across varied and realistic blockchain environments, and support on-chain fuzzing. Experimental results show that SmartShot is effective to detect hidden vulnerabilities with the highest code coverage and lowest false positive rate. SmartShot is 4.8× to 20.2× faster than state-of-the-art tools, identifying 2,150 vulnerable contracts out of 42,738 real-world contracts which is 2.1× to 13.7× more than other tools. SmartShot has demonstrated its real-world impact by detecting vulnerabilities that are only discoverable on-chain and uncovering 24 0-day vulnerabilities in the latest 10,000 deployed contracts. ![]() |
|
He, Lipeng |
![]() Shoupeng Ren, Lipeng He, Tianyu Tu, Di Wu, Jian Liu, Kui Ren, and Chun Chen (Zhejiang University, China; University of Waterloo, Canada) ![]() |
|
He, Peng |
![]() Shuwei Song, Ting Chen, Ao Qiao, Xiapu Luo, Leqing Wang, Zheyuan He, Ting Wang, Xiaodong Lin, Peng He, Wensheng Zhang, and Xiaosong Zhang (University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada) Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them. ![]() |
|
He, Pinjia |
![]() Haowen Yang, Zhengda Li, Zhiqing Zhong, Xiaoying Tang, and Pinjia He (Chinese University of Hong Kong, Shenzhen, China) With the increasing demand for handling large-scale and complex data, data science (DS) applications often suffer from long execution times and rapid RAM exhaustion, which leads to many serious issues like unbearable delays and crashes in financial transactions. As popular DS libraries are frequently used in these applications, their performance bugs (PBs) are a major contributing factor to these issues, making it crucial to address them for improving overall application performance. However, PBs in popular DS libraries remain largely unexplored. To address this gap, we conducted a study of 202 PBs collected from seven popular DS libraries. Our study examined the impact of PBs and proposed a taxonomy for common root causes. We found over half of the PBs arise from inefficient data processing operations, especially within data structures. We also explored the effort required to locate their root causes and fix these bugs, along with the challenges involved. Notably, around 20% of the PBs could be fixed using simple strategies (e.g., Conditions Optimizing), suggesting the potential for automated repair approaches. Our findings highlight the severity of PBs in core DS libraries and offer insights for developing high-performance libraries and detecting PBs. Furthermore, we derived test rules from our identified root causes, identifying eight PBs, of which four were confirmed, demonstrating the practical utility of our findings. ![]() |
|
He, Runzhi |
![]() Farbod Daneshyan, Runzhi He, Jianyu Wu, and Minghui Zhou (Peking University, China) The release note is a crucial document outlining changes in new software versions. It plays a key role in helping stakeholders recognise important changes and understand the implications behind them. Despite this fact, many developers view the process of writing software release notes as a tedious and dreadful task. Consequently, numerous tools (e.g., DeepRelease and Conventional Changelog) have been developed by researchers and practitioners to automate the generation of software release notes. However, these tools fail to consider project domain and target audience for personalisation, limiting their relevance and conciseness. Additionally, they suffer from limited applicability, often necessitating significant workflow adjustments and adoption efforts, hindering practical use and stressing developers. Despite recent advancements in natural language processing and the proven capabilities of large language models (LLMs) in various code and text-related tasks, there are no existing studies investigating the integration and utilisation of LLMs in automated release note generation. Therefore, we propose SmartNote, a novel and widely applicable release note generation approach that produces high-quality, contextually personalised release notes by leveraging LLM capabilities to aggregate, describe, and summarise changes based on code, commit, and pull request details. It categorises and scores (for significance) commits to generate structured and concise release notes of prioritised changes. We conduct human and automatic evaluations that reveal SmartNote outperforms or achieves comparable performance to DeepRelease (state-of-the-art), Conventional Changelog (off-the-shelf), and the projects' original release note across four quality metrics: completeness, clarity, conciseness, and organisation. In both evaluations, SmartNote ranked first for completeness and organisation, while clarity ranked first in the human evaluation. Furthermore, our controlled study reveals the significance of contextual awareness, while our applicability analysis confirms SmartNote's effectiveness across diverse projects. ![]() ![]() |
|
He, Yirui |
![]() Yuntianyi Chen, Yuqi Huai, Yirui He, Shilong Li, Changnam Hong, Qi Alfred Chen, and Joshua Garcia (University of California at Irvine, USA) As autonomous driving systems (ADSes) become increasingly complex and integral to daily life, the importance of understanding the nature and mitigation of software bugs in these systems has grown correspondingly. Addressing the challenges of software maintenance in autonomous driving systems (e.g., handling real-time system decisions and ensuring safety-critical reliability) is crucial due to the unique combination of real-time decision-making requirements and the high stakes of operational failures in ADSes. The potential of automated tools in this domain is promising, yet there remains a gap in our comprehension of the challenges faced and the strategies employed during manual debugging and repair of such systems. In this paper, we present an empirical study that investigates bug-fix patterns in ADSes, with the aim of improving reliability and safety. We have analyzed the commit histories and bug reports of two major autonomous driving projects, Apollo and Autoware, from 1,331 bug fixes with the study of bug symptoms, root causes, and bug-fix patterns. Our study reveals several dominant bug-fix patterns, including those related to path planning, data flow, and configuration management. Additionally, we find that the frequency distribution of bug-fix patterns varies significantly depending on their nature and types and that certain categories of bugs are recurrent and more challenging to exterminate. Based on our findings, we propose a hierarchy of ADS bugs and two taxonomies of 15 syntactic bug-fix patterns and 27 semantic bug-fix patterns that offer guidance for bug identification and resolution. We also contribute a benchmark of 1,331 ADS bug-fix instances. ![]() ![]() |
|
He, Yuang |
![]() Jia Chen, Yuang He, Peng Wang, Xiaolei Chen, Jie Shi, and Wei Wang (Fudan University, China) ![]() |
|
He, Zheyuan |
![]() Shuwei Song, Ting Chen, Ao Qiao, Xiapu Luo, Leqing Wang, Zheyuan He, Ting Wang, Xiaodong Lin, Peng He, Wensheng Zhang, and Xiaosong Zhang (University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada) Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them. ![]() |
|
He, Ziyao |
![]() Ziyao He, Syed Fatiul Huq, and Sam Malek (University of California at Irvine, USA) ![]() |
|
Heo, Kihong |
![]() Sujin Jang, Yeonhee Ryou, Heewon Lee, and Kihong Heo (KAIST, Republic of Korea) ![]() |
|
Hong, Changnam |
![]() Yuntianyi Chen, Yuqi Huai, Yirui He, Shilong Li, Changnam Hong, Qi Alfred Chen, and Joshua Garcia (University of California at Irvine, USA) As autonomous driving systems (ADSes) become increasingly complex and integral to daily life, the importance of understanding the nature and mitigation of software bugs in these systems has grown correspondingly. Addressing the challenges of software maintenance in autonomous driving systems (e.g., handling real-time system decisions and ensuring safety-critical reliability) is crucial due to the unique combination of real-time decision-making requirements and the high stakes of operational failures in ADSes. The potential of automated tools in this domain is promising, yet there remains a gap in our comprehension of the challenges faced and the strategies employed during manual debugging and repair of such systems. In this paper, we present an empirical study that investigates bug-fix patterns in ADSes, with the aim of improving reliability and safety. We have analyzed the commit histories and bug reports of two major autonomous driving projects, Apollo and Autoware, from 1,331 bug fixes with the study of bug symptoms, root causes, and bug-fix patterns. Our study reveals several dominant bug-fix patterns, including those related to path planning, data flow, and configuration management. Additionally, we find that the frequency distribution of bug-fix patterns varies significantly depending on their nature and types and that certain categories of bugs are recurrent and more challenging to exterminate. Based on our findings, we propose a hierarchy of ADS bugs and two taxonomies of 15 syntactic bug-fix patterns and 27 semantic bug-fix patterns that offer guidance for bug identification and resolution. We also contribute a benchmark of 1,331 ADS bug-fix instances. ![]() ![]() |
|
Hong, Wenzheng |
![]() Yuteng Sun, Joyanta Debnath, Wenzheng Hong, Omar Chowdhury, and Sze Yiu Chau (Chinese University of Hong Kong, Hong Kong; Stony Brook University, USA; Independent, China) The X.509 Public Key Infrastructure (PKI) provides a cryptographically verifiable mechanism for authenticating a binding of an entity’s public key with its identity, presented in a tamper-proof digital certificate. This often serves as a foundational building block for achieving different security guarantees in many critical applications and protocols (e.g., SSL/TLS). Identities in the context of X.509 PKI are often captured as names, which are encoded in certificates as composite records with different optional fields that can store various types of string values (e.g., ASCII, UTF8). Although such flexibility allows for the support of diverse name types (e.g., IP addresses, DNS names) and application scenarios, it imposes on library developers obligations to enforce unnecessarily convoluted requirements. Bugs in enforcing these requirements can lead to potential interoperability and performance issues, and might open doors to impersonation attacks. This paper focuses on analyzing how open-source libraries enforce the constraints regarding the formatting, encoding, and processing of complex name structures on X.509 certificate chains, for the purpose of establishing identities. Our analysis reveals that a portfolio of simplistic testing approaches can expose blatant violations of the requirements in widely used open-source libraries. Although there is a substantial amount of prior work that focused on testing the overall certificate chain validation process of X.509 libraries, the identified violations have escaped their scrutiny. To make matters worse, we observe that almost all the analyzed libraries completely ignore certain pre-processing steps prescribed by the standard. This begs the question of whether it is beneficial to have a standard so flexible but complex that none of the implementations can faithfully adhere to it. With our test results, we argue in the negative, and explain how simpler alternatives (e.g., other forms of identifiers such as Authority and Subject Key Identifiers) can be used to enforce similar constraints with no loss of security. ![]() |
|
Hossain, Soneya Binta |
![]() Soneya Binta Hossain, Raygan Taylor, and Matthew Dwyer (University of Virginia, USA; Dillard University, USA) Code documentation is a critical artifact of software development, bridging human understanding and machine-readable code. Beyond aiding developers in code comprehension and maintenance, documentation also plays a critical role in automating various software engineering tasks, such as test oracle generation (TOG). In Java, Javadoc comments offer structured, natural language documentation embedded directly within the source code, typically describing functionality, usage, parameters, return values, and exceptional behavior. While prior research has explored the use of Javadoc comments in TOG alongside other information, such as the method under test (MUT), their potential as a stand-alone input source, the most relevant Javadoc components, and guidelines for writing effective Javadoc comments for automating TOG remain less explored. In this study, we investigate the impact of Javadoc comments on TOG through a comprehensive analysis. We begin by fine-tuning 10 large language models using three different prompt pairs to assess the role of Javadoc comments alongside other contextual information. Next, we systematically analyze the impact of different Javadoc comment components on TOG. To evaluate the generalizability of Javadoc comments from various sources, we also generate them using the GPT-3.5 model. We perform a thorough bug detection study using the Defects4J dataset to understand their role in real-world bug detection. Our results show that incorporating Javadoc comments improves the accuracy of test oracles in most cases, aligning closely with ground truth. We find that Javadoc comments alone can match or even outperform approaches that utilize the MUT implementation. Additionally, we identify that the description and the return tag are the most valuable components for TOG. Finally, our approach, when using only Javadoc comments, detects between 19% and 94% more real-world bugs in Defects4J than prior methods, establishing a new state-of-the-art. To further guide developers in writing effective documentation, we conduct a detailed qualitative study on when Javadoc comments are helpful or harmful for TOG. ![]() |
|
Hou, Xinyi |
![]() Ting Zhou, Yanjie Zhao, Xinyi Hou, Xiaoyu Sun, Kai Chen, and Haoyu Wang (Huazhong University of Science and Technology, China; Australian National University, Australia) Declarative UI frameworks have gained widespread adoption in mobile app development, offering benefits such as improved code readability and easier maintenance. Despite these advantages, the process of translating UI designs into functional code remains challenging and time-consuming. Recent advancements in multimodal large language models (MLLMs) have shown promise in directly generating mobile app code from user interface (UI) designs. However, the direct application of MLLMs to this task is limited by challenges in accurately recognizing UI components and comprehensively capturing interaction logic. To address these challenges, we propose DeclarUI, an automated approach that synergizes computer vision (CV), MLLMs, and iterative compiler-driven optimization to generate and refine declarative UI code from designs. DeclarUI enhances visual fidelity, functional completeness, and code quality through precise component segmentation, Page Transition Graphs (PTGs) for modeling complex inter-page relationships, and iterative optimization. In our evaluation, DeclarUI outperforms baselines on React Native, a widely adopted declarative UI framework, achieving a 96.8% PTG coverage rate and a 98% compilation success rate. Notably, DeclarUI demonstrates significant improvements over state-of-the-art MLLMs, with a 123% increase in PTG coverage rate, up to 55% enhancement in visual similarity scores, and a 29% boost in compilation success rate. We further demonstrate DeclarUI’s generalizability through successful applications to Flutter and ArkUI frameworks. User studies with professional developers confirm that DeclarUI’s generated code meets industrial-grade standards in code availability, modification time, readability, and maintainability. By streamlining app development, improving efficiency, and fostering designer-developer collaboration, DeclarUI offers a practical solution to the persistent challenges in mobile UI development. ![]() |
|
Hou, Yiwei |
![]() Lihua Guo, Yiwei Hou, Chijin Zhou, Quan Zhang, and Yu Jiang (Tsinghua University, China) In recent years, the increase in ransomware attacks has significantly impacted individuals and organizations. Many strategies have been proposed to detect ransomware’s file disruption operation. However, they rely on certain assumptions that gradually fail in the face of evolving ransomware attacks, which use more stealthy encryption methods or benign-imitation-based I/O orchestration. To mitigate this, we propose an approach to detect ransomware attacks through temporal correlation between encryption and I/O behaviors. Its key idea is that there is a strong temporal correlation inherent in ransomware’s encryption and I/O behaviors. To disrupt files, ransomware must first read the file data from the disk into memory, encrypt it, and then write the encrypted data back to the disk. This process creates a pronounced temporal correlation between the computation load of the encryption operations and the size of the files being encrypted. Based on this invariant, we implement a prototype called RansomRadar and evaluate its effectiveness against 411 of the latest ransomware samples from 89 families. Experimental results show that it achieves a detection rate of 100.00% with 2.68% false alarms. Its F1-score is 96.03, which is 31.82 higher than that of existing detectors on average. ![]() |
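To make the temporal-correlation invariant described above concrete, here is a minimal, hypothetical Python sketch (not RansomRadar's actual implementation): it flags a process when its per-window encryption workload correlates strongly with the volume of file data it writes back. The function name, inputs, and threshold are illustrative assumptions only.

```python
# Illustrative sketch only; RansomRadar's real detector is more sophisticated.
from statistics import correlation  # Pearson correlation, Python 3.10+


def looks_like_ransomware(crypto_load, write_bytes, threshold=0.9):
    """crypto_load[i]: encryption compute load observed in time window i.
    write_bytes[i]: bytes written to disk in the same window."""
    if len(crypto_load) < 3 or len(set(crypto_load)) < 2 or len(set(write_bytes)) < 2:
        return False  # too little (or constant) signal to correlate
    # A strong positive correlation between encryption effort and written volume
    # mirrors the read -> encrypt -> write-back pattern described above.
    return correlation(crypto_load, write_bytes) >= threshold


# Example: encryption effort tracks the size of the files being rewritten.
print(looks_like_ransomware([10, 52, 48, 200, 5], [12, 50, 45, 210, 6]))  # True
```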
|
Hou, Zhe |
![]() Xiaokun Luan, David Sanan, Zhe Hou, Qiyuan Xu, Chengwei Liu, Yufan Cai, Yang Liu, and Meng Sun (Peking University, China; Singapore Institute of Technology, Singapore; Griffith University, Australia; Nanyang Technological University, Singapore; China-Singapore International Joint Research Institute, China; National University of Singapore, Singapore) Proof assistants are software tools for formal modeling and verification of software, hardware, design, and mathematical proofs. Due to the growing complexity and scale of formal proofs, compatibility issues frequently arise when using different versions of proof assistants. These issues result in broken proofs, disrupting the maintenance of formalized theories and hindering the broader dissemination of results within the community. Although existing works have proposed techniques to address specific types of compatibility issues, the overall characteristics of these issues remain largely unexplored. To address this gap, we conduct the first extensive empirical study to characterize compatibility issues, using Isabelle as a case study. We develop a regression testing framework to automatically collect compatibility issues from the Archive of Formal Proofs, the largest repository of formal proofs in Isabelle. By analyzing 12,079 collected issues, we identify their types and symptoms and further investigate their root causes. We also extract updated proofs that address these issues to understand the applied resolution strategies. Our study provides an in-depth understanding of compatibility issues in proof assistants, offering insights that support the development of effective techniques to mitigate these issues. ![]() ![]() |
|
Hough, Katherine |
![]() Katherine Hough and Jonathan Bell (Northeastern University, USA) Dynamic taint tracking is a program analysis that traces the flow of information through a program. In the Java virtual machine (JVM), there are two prominent approaches for dynamic taint tracking: "shadowing" and "mirroring". Shadowing is able to precisely track information flows, but is also prone to disrupting the semantics of the program under analysis. Mirroring is better able to preserve program semantics, but often inaccurate. The limitations of these approaches are further exacerbated by features introduced in the latest Java versions. In this paper, we propose Galette, an approach for dynamic taint tracking in the JVM that combines aspects of both shadowing and mirroring to provide precise, robust taint tag propagation in modern JVMs. On a benchmark suite of 3,451 synthetic Java programs, we found that Galette was able to propagate taint tags with perfect accuracy while preserving program semantics on all four active long-term support versions of Java. We also found that Galette's runtime and memory overheads were competitive with that of two state-of-the-art dynamic taint tracking systems on a benchmark suite of twenty real-world Java programs. ![]() |
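As a rough intuition for the "shadowing" style of taint tracking discussed above, the toy Python sketch below pairs each value with a shadow set of taint labels that propagates through operations. It only illustrates the general idea, not Galette's JVM-level implementation; all names are hypothetical.

```python
# Toy illustration of value "shadowing" for taint tracking (not Galette itself).
class Tainted:
    def __init__(self, value, labels=frozenset()):
        self.value = value
        self.labels = frozenset(labels)  # the shadow: taint tags for this value

    def __add__(self, other):
        other_value = other.value if isinstance(other, Tainted) else other
        other_labels = other.labels if isinstance(other, Tainted) else frozenset()
        # Propagation rule: the result's shadow is the union of both operands' shadows.
        return Tainted(self.value + other_value, self.labels | other_labels)


user_input = Tainted(5, {"source:http-request"})   # tag data from an untrusted source
total = user_input + 37                            # taint flows through the addition
print(total.value, sorted(total.labels))           # 42 ['source:http-request']
```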
|
Hu, Chunming |
![]() Xu Wang, Mingming Zhang, Xiangxin Meng, Jian Zhang, Yang Liu, and Chunming Hu (Beihang University, China; Zhongguancun Laboratory, China; Ministry of Education, China; Nanyang Technological University, Singapore) Deep Neural Networks (DNNs) are prevalent across a wide range of applications. Despite their success, the complexity and opaque nature of DNNs pose significant challenges in debugging and repairing DNN models, limiting their reliability and broader adoption. In this paper, we propose MLM4DNN, an element-based automated DNN repair method. Unlike previous techniques that focus on post-training adjustments or rely heavily on predefined bug patterns, MLM4DNN repairs DNNs by leveraging a fine-tuned Masked Language Model (MLM) to predict correct fixes for nine predefined key elements in DNNs. We construct a large-scale dataset by masking nine key elements from the correct DNN source code and then force the MLM to restore the correct elements to learn the deep semantics that ensure the normal functionalities of DNNs. Afterwards, a light-weight static analysis tool is designed to filter out low-quality patches to enhance the repair efficiency. We introduce a patch validation method specifically for DNN repair tasks, which consists of three evaluation metrics from different aspects to model the effectiveness of generated patches. We construct a benchmark, BenchmarkAPR4DNN, including 51 buggy DNN models and an evaluation tool that outputs the three metrics. We evaluate MLM4DNN against six baselines on BenchmarkAPR4DNN, and results show that MLM4DNN outperforms all state-of-the-art baselines, including two dynamic-based and four zero-shot learning-based methods. After applying the fine-tuned MLM design to several prevalent Large Language Models (LLMs), we consistently observe improved performance in DNN repair tasks compared to the original LLMs, which demonstrates the effectiveness of the method proposed in this paper. ![]() |
|
Hu, Haoxuan |
![]() Weibin Wu, Haoxuan Hu, Zhaoji Fan, Yitong Qiao, Yizhan Huang, Yichen Li, Zibin Zheng, and Michael Lyu (Sun Yat-sen University, China; Chinese University of Hong Kong, China) Deep learning (DL) has revolutionized various software engineering tasks. Particularly, the emergence of AI code generators has pushed the boundaries of automatic programming to synthesize entire programs based on user-defined specifications in natural language. However, it remains a mystery if these AI code generators rely on the copy-and-paste programming practice, resulting in code clone concerns. In this work, to comprehensively study the code cloning behavior of AI code generators, we conduct an empirical study on three state-of-the-art commercial AI code generators to investigate the existence of all types of clones, which remains underexplored. Our experimental results show that the total Type-1 and Type-2 clone rates of the state-of-the-art commercial AI code generators can reach up to 7.50%, indicating marked code clone issues. Furthermore, it is observed that AI code generators risk infringing copyrights and propagating buggy and vulnerable code resulting from cloning code and show a certain degree of stability in generating code clones. ![]() |
|
Hu, Huimin |
![]() Huimin Hu, Yingying Wang, Julia Rubin, and Michael Pradel (University of Stuttgart, Germany; University of British Columbia, Canada) Scalable static analyzers are popular tools for finding incorrect, inefficient, insecure, and hard-to-maintain code early during the development process. Because not all warnings reported by a static analyzer are immediately useful to developers, many static analyzers provide a way to suppress warnings, e.g., in the form of special comments added into the code. Such suppressions are an important mechanism at the interface between static analyzers and software developers, but little is currently known about them. This paper presents the first in-depth empirical study of suppressions of static analysis warnings, addressing questions about the prevalence of suppressions, their evolution over time, the relationship between suppressions and warnings, and the reasons for using suppressions. We answer these questions by studying projects written in three popular languages and suppressions for warnings by four popular static analyzers. Our findings show that (i) suppressions are relatively common, e.g., with a total of 7,357 suppressions in 46 Python projects, (ii) the number of suppressions in a project tends to continuously increase over time, (iii) surprisingly, 50.8% of all suppressions do not affect any warning and hence are practically useless, (iv) some suppressions, including useless ones, may unintentionally hide future warnings, and (v) common reasons for introducing suppressions include false positives, suboptimal configurations of the static analyzer, and misleading warning messages. These results have actionable implications, e.g., that developers should be made aware of useless suppressions and the potential risk of unintentional suppressing, that static analyzers should provide better warning messages, and that static analyzers should separately categorize warnings from third-party libraries. ![]() |
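For readers unfamiliar with the suppression mechanism studied above, the short Python snippet below shows what such special comments typically look like. The analyzers and rule names are common real-world examples (flake8, Pylint, mypy) chosen for illustration and are not necessarily the four analyzers examined in the paper.

```python
# Each line below carries a suppression pragma recognized by a popular Python
# static analyzer; the surrounding code is a hypothetical illustration only.

import os  # noqa: F401
# ^ flake8: silence "F401 imported but unused" for this line only.

# pylint: disable=invalid-name
# ^ Pylint: turn off the naming-convention check from this point in the module.
badlyNamedVariable = 42


def half(x):
    return x / 2  # type: ignore
    # ^ mypy: ignore any type error reported on this line.
```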
|
Hu, Jun |
![]() Mengzhuo Chen, Zhe Liu, Chunyang Chen, Junjie Wang, Boyu Wu, Jun Hu, and Qing Wang (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TU Munich, Germany) In software development, similar apps often encounter similar bugs due to shared functionalities and implementation methods. However, current automated GUI testing methods mainly focus on generating test scripts to cover more pages by analyzing the internal structure of the app, without targeted exploration of paths that may trigger bugs, resulting in low efficiency in bug discovery. Considering that a large number of bug reports on open source platforms can provide external knowledge for testing, this paper proposes BugHunter, a novel bug-aware automated GUI testing approach that generates exploration paths guided by bug reports from similar apps, utilizing a combination of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Instead of focusing solely on coverage, BugHunter dynamically adapts the testing process to target bug paths, thereby increasing bug detection efficiency. BugHunter first builds a high-quality bug knowledge base from historical bug reports. Then it retrieves relevant reports from this large bug knowledge base using a two-stage retrieval process, and generates test paths based on similar apps’ bug reports. BugHunter also introduces a local and global path-planning mechanism to handle differences in functionality and UI design across apps, and the ambiguous behavior or missing steps in the online bug reports. We evaluate BugHunter on 121 bugs across 71 apps and compare its performance against 16 state-of-the-art baselines. BugHunter achieves 60% improvement in bug detection over the best baseline, with comparable or higher coverage against the baselines. Furthermore, BugHunter successfully detects 49 new crash bugs in real-world apps from Google Play, with 33 bugs fixed, 9 confirmed, and 7 pending feedback. ![]() |
|
Hu, Kun |
![]() Bingkun Sun, Shiqi Sun, Jialin Ren, Mingming Hu, Kun Hu, Liwei Shen, and Xin Peng (Fudan University, China; Northwestern Polytechnical University, China) The Web of Things (WoT) system standardizes the integration of ubiquitous IoT devices in physical environments, enabling various software applications to automatically sense and regulate the physical environment. While providing convenience, the complex interactions among software applications and the physical environment make the WoT system vulnerable to violations caused by improper actuator operations, which may cause undesirable or even harmful results, posing serious risks to user safety and security. In response to this critical concern, many previous efforts have been made. However, existing works primarily focus on analyzing software application behaviors, with insufficient consideration of the physical interactions, multi-source violations, and environmental dynamics in such ubiquitous software systems. As a result, they fail to comprehensively detect the impact of actuator operations on the dynamic environment, thus limiting their effectiveness. To address these limitations, we propose SysGuard, a violation detection and handling approach. SysGuard employs the dynamic probabilistic graphical model (DPGM) to model the physical interactions as the physical interaction graph (PIG). In the offline phase, SysGuard takes device description models and historical device logs as input to capture physical interactions by learning the PIG. In this process, a large language model (LLM) based causal analysis method is further introduced to filter out the device dependencies unrelated to physical interaction by analyzing the device interaction scenarios recorded in device logs. In the online phase, SysGuard processes user-customized violation rules, and monitors runtime device logs to predict violation states and generates handling policies by inferring the PIG. Evaluation on two real-world WoT systems shows that SysGuard significantly outperforms existing state-of-the-art works, achieving high performance in both violation detection and handling. It also confirms the runtime efficiency and scalability of SysGuard. An ablation experiment on our constructed dataset demonstrates that the LLM-based causal analysis significantly improves the performance of SysGuard, with accuracy increasing in both violation detection and handling. ![]() ![]() |
|
Hu, Ming |
![]() Ziqiao Kong, Cen Zhang, Maoyi Xie, Ming Hu, Yue Xue, Ye Liu, Haijun Wang, and Yang Liu (Nanyang Technological University, Singapore; Singapore Management University, Singapore; MetaTrust Labs, Singapore; Xi'an Jiaotong University, China) Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: 1. a lack of profit-centric techniques for expediting detection and, 2. insufficient automation in maximizing the profitability of discovered vulnerabilities, leaving the analysis to human experts. To address these gaps, we have developed VERITE, a profit-centric smart contract fuzzing framework that not only effectively detects those profitable vulnerabilities but also maximizes the exploited profits. VERITE has three key features: 1. DeFi action-based mutators for boosting the exploration of transactions with different fund flows; 2. potentially profitable candidates identification criteria, which checks whether the input has caused abnormal fund flow properties during testing; 3. a gradient descent-based profit maximization strategy for these identified candidates. VERITE is fully developed from scratch and evaluated on a dataset consisting of 61 exploited real-world DeFi projects with an average of over 1.1 million dollars loss. The results show that VERITE can automatically extract more than 18 million dollars in total and is significantly better than state-of-the-art fuzzer ItyFuzz in both detection (29/10) and exploitation (134 times more profits gained on average). Remarkably, in 12 targets, it gains more profits than real-world attacking exploits (1.01 to 11.45 times more). VERITE is also applied by auditors in contract auditing, where 6 (5 high severity) zero-day vulnerabilities are found with over $2,500 bounty rewards. ![]() ![]() ![]() |
|
Hu, Mingming |
![]() Bingkun Sun, Shiqi Sun, Jialin Ren, Mingming Hu, Kun Hu, Liwei Shen, and Xin Peng (Fudan University, China; Northwestern Polytechnical University, China) The Web of Things (WoT) system standardizes the integration of ubiquitous IoT devices in physical environments, enabling various software applications to automatically sense and regulate the physical environment. While providing convenience, the complex interactions among software applications and the physical environment make the WoT system vulnerable to violations caused by improper actuator operations, which may cause undesirable or even harmful results, posing serious risks to user safety and security. In response to this critical concern, many previous efforts have been made. However, existing works primarily focus on analyzing software application behaviors, with insufficient consideration of the physical interactions, multi-source violations, and environmental dynamics in such ubiquitous software systems. As a result, they fail to comprehensively detect the impact of actuator operations on the dynamic environment, thus limiting their effectiveness. To address these limitations, we propose SysGuard, a violation detection and handling approach. SysGuard employs the dynamic probabilistic graphical model (DPGM) to model the physical interactions as the physical interaction graph (PIG). In the offline phase, SysGuard takes device description models and historical device logs as input to capture physical interactions by learning the PIG. In this process, a large language model (LLM) based causal analysis method is further introduced to filter out the device dependencies unrelated to physical interaction by analyzing the device interaction scenarios recorded in device logs. In the online phase, SysGuard processes user-customized violation rules, and monitors runtime device logs to predict violation states and generates handling policies by inferring the PIG. Evaluation on two real-world WoT systems shows that SysGuard significantly outperforms existing state-of-the-art works, achieving high performance in both violation detection and handling. It also confirms the runtime efficiency and scalability of SysGuard. An ablation experiment on our constructed dataset demonstrates that the LLM-based causal analysis significantly improves the performance of SysGuard, with accuracy increasing in both violation detection and handling. ![]() ![]() |
|
Hu, Qiang |
![]() Jiongchi Yu, Xie Xiaofei, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, and Frank Liau (Singapore Management University, Singapore; Tianjin University, China; Zhejiang University, China; Shanghai Jiao Tong University, China; University of Tokyo, Japan; University of Alberta, Canada; Southern Cross University, Australia; Government Technology Agency of Singapore, Singapore) ![]() |
|
Hu, Xing |
![]() Xu Yang, Wenhan Zhu, Michael Pacheco, Jiayuan Zhou, Shaowei Wang, Xing Hu, and Kui Liu (University of Manitoba, Canada; Huawei, Canada; Zhejiang University, China; Huawei Software Engineering Application Technology Lab, China) Detecting vulnerability fix commits in open-source software is crucial for maintaining software security. To help OSS identify vulnerability fix commits, several automated approaches have been developed. However, existing approaches like VulFixMiner and CoLeFunDa focus solely on code changes, neglecting essential context from development artifacts. Tools like Vulcurator, which integrates issue reports, fail to leverage semantic associations between different development artifacts (e.g., pull requests and historical vulnerability fixes). Moreover, they miss vulnerability fixes in tangled commits and lack explanations, limiting practical use. Hence, to address those limitations, we propose LLM4VFD, a novel framework that leverages Large Language Models (LLMs) enhanced with Chain-of-Thought reasoning and In-Context Learning to improve the accuracy of vulnerability fix detection. LLM4VFD comprises three components: (1) Code Change Intention, which analyzes commit summaries, purposes, and implications using Chain-of-Thought reasoning; (2) Development Artifact, which incorporates context from related issue reports and pull requests; (3) Historical Vulnerability, which retrieves similar past vulnerability fixes to enrich context. More importantly, on top of the prediction, LLM4VFD also provides a detailed analysis and explanation to help security experts understand the rationale behind the decision. We evaluated LLM4VFD against state-of-the-art techniques, including Pre-trained Language Model-based approaches and vanilla LLMs, using a newly collected dataset, BigVulFixes. Experimental results demonstrate that LLM4VFD significantly outperforms the best-performing existing approach by 68.1%–145.4%. Furthermore, we conducted a user study with security experts, showing that the analysis generated by LLM4VFD improves the efficiency of vulnerability fix identification. ![]() ![]() Yi Gao, Xing Hu, Xiaohu Yang, and Xin Xia (Zhejiang University, China) Test smells arise from poor design practices and insufficient domain knowledge, which can lower the quality of test code and make it harder to maintain and update. Manual refactoring of test smells is time-consuming and error-prone, highlighting the necessity for automated approaches. Current rule-based refactoring methods often struggle in scenarios not covered by predefined rules and lack the flexibility needed to handle diverse cases effectively. In this paper, we propose a novel approach called UTRefactor, a context-enhanced, LLM-based framework for automatic test refactoring in Java projects. UTRefactor extracts relevant context from test code and leverages an external knowledge base that includes test smell definitions, descriptions, and DSL-based refactoring rules. By simulating the manual refactoring process through a chain-of-thought approach, UTRefactor guides the LLM to eliminate test smells in a step-by-step process, ensuring both accuracy and consistency throughout the refactoring. Additionally, we implement a checkpoint mechanism to facilitate comprehensive refactoring, particularly when multiple smells are present. We evaluate UTRefactor on 879 tests from six open-source Java projects, reducing the number of test smells from 2,375 to 265, achieving an 89% reduction.
UTRefactor outperforms direct LLM-based refactoring methods by 61.82% in smell elimination and significantly surpasses the performance of a rule-based test smell refactoring tool. Our results demonstrate the effectiveness of UTRefactor in enhancing test code quality while minimizing manual involvement. ![]() ![]() Siwei Tan, Liqiang Lu, Debin Xiang, Tianyao Chu, Congliang Lang, Jintao Chen, Xing Hu, and Jianwei Yin (Zhejiang University, China) Quantum programs provide exponential speedups compared to classical programs in certain areas, but they also inevitably encounter logical faults. Automatically repairing quantum programs is much more challenging than repairing classical programs due to the non-replicability of data, the vast search space of program inputs, and the new programming paradigm. Existing works based on semantic-based or learning-based program repair techniques are fundamentally limited in repairing efficiency and effectiveness. In this work, we propose HornBro, an efficient framework for automated quantum program repair. The key insight of HornBro lies in the homotopy-like method, which iteratively switches between the classical and quantum parts. This approach allows the repair tasks to be efficiently offloaded to the most suitable platforms, enabling a progressive convergence toward the correct program. We start by designing an implication assertion pragma to enable rigorous specifications of quantum program behavior, which helps to generate a quantum test suite automatically. This suite leverages the orthonormal bases of quantum programs to accommodate different encoding schemes. Given a fixed number of test cases, it allows the maximum input coverage of potential counter-example candidates. Then, we develop a Clifford approximation method with an SMT-based search to transform the patch localization program into a symbolic reasoning problem. Finally, we offload the computationally intensive repair of gate parameters to quantum hardware by leveraging the differentiability of quantum gates. Experiments suggest that HornBro increases the repair success rate by more than 62.5% compared to the existing repair techniques, supporting more types of quantum bugs. It also achieves 35.7× speedup in the repair and 99.9% gate reduction of the patch. ![]() ![]() ![]() Junwei Zhang, Xing Hu, Shan Gao, Xin Xia, David Lo, and Shanping Li (Zhejiang University, China; Huawei, China; Singapore Management University, Singapore) Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, to categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. 
Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest on two widely-used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of these datasets, respectively, contain noise, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models. Fine-tuning the four LLMs with the filtered Methods2Test dataset, on average, improves their performance by 67% in branch coverage, using the Defects4J benchmark. For the Atlas dataset, the four LLMs improve branch coverage by 39%. Additionally, filtering noise improves bug detection performance, resulting in a 21.42% increase in bugs detected by the generated tests. ![]() |
|
Huai, Yuqi |
![]() Yuntianyi Chen, Yuqi Huai, Yirui He, Shilong Li, Changnam Hong, Qi Alfred Chen, and Joshua Garcia (University of California at Irvine, USA) As autonomous driving systems (ADSes) become increasingly complex and integral to daily life, the importance of understanding the nature and mitigation of software bugs in these systems has grown correspondingly. Addressing the challenges of software maintenance in autonomous driving systems (e.g., handling real-time system decisions and ensuring safety-critical reliability) is crucial due to the unique combination of real-time decision-making requirements and the high stakes of operational failures in ADSes. The potential of automated tools in this domain is promising, yet there remains a gap in our comprehension of the challenges faced and the strategies employed during manual debugging and repair of such systems. In this paper, we present an empirical study that investigates bug-fix patterns in ADSes, with the aim of improving reliability and safety. We have analyzed the commit histories and bug reports of two major autonomous driving projects, Apollo and Autoware, from 1,331 bug fixes with the study of bug symptoms, root causes, and bug-fix patterns. Our study reveals several dominant bug-fix patterns, including those related to path planning, data flow, and configuration management. Additionally, we find that the frequency distribution of bug-fix patterns varies significantly depending on their nature and types and that certain categories of bugs are recurrent and more challenging to exterminate. Based on our findings, we propose a hierarchy of ADS bugs and two taxonomies of 15 syntactic bug-fix patterns and 27 semantic bug-fix patterns that offer guidance for bug identification and resolution. We also contribute a benchmark of 1,331 ADS bug-fix instances. ![]() ![]() |
|
Huang, Hailiang |
![]() Chaozheng Wang, Jia Feng, Shuzheng Gao, Cuiyun Gao, Zongjie Li, Ting Peng, Hailiang Huang, Yuetang Deng, and Michael Lyu (Chinese University of Hong Kong, China; University of Electronic Science and Technology of China, China; Hong Kong University of Science and Technology, China; Tencent, China) Large Code Models (LCMs) have demonstrated remarkable effectiveness across various code intelligence tasks. Supervised fine-tuning is essential to optimize their performance for specific downstream tasks. Compared with the traditional full-parameter fine-tuning (FFT) method, Parameter-Efficient Fine-Tuning (PEFT) methods can train LCMs with substantially reduced resource consumption and have gained widespread attention among researchers and practitioners. While existing studies have explored PEFT methods for code intelligence tasks, they have predominantly focused on a limited subset of scenarios, such as code generation with publicly available datasets, leading to constrained generalizability of the findings. To mitigate the limitation, we conduct a comprehensive study on exploring the effectiveness of the PEFT methods, which involves five code intelligence tasks containing both public and private data. Our extensive experiments reveal a considerable performance gap between PEFT methods and FFT, which is contrary to the findings of existing studies. We also find that this disparity is particularly pronounced in tasks involving private data. To improve the tuning performance for LCMs while reducing resource utilization during training, we propose a Layer-Wise Optimization (LWO) strategy in the paper. LWO incrementally updates the parameters of each layer in the whole model architecture, without introducing any additional component and inference overhead. Experiments across five LCMs and five code intelligence tasks demonstrate that LWO trains LCMs more effectively and efficiently compared to previous PEFT methods, with significant improvements in tasks using private data. For instance, in the line-level code completion task using our private code repositories, LWO outperforms the state-of-the-art LoRA method by 22% and 12% in terms of accuracy and BLEU scores, respectively. Furthermore, LWO can enable more efficient LCM tuning, reducing the training time by an average of 42.7% compared to LoRA. ![]() |
|
Huang, Junjie |
![]() Junjie Huang, Zhihan Jiang, Zhuangbin Chen, and Michael Lyu (Chinese University of Hong Kong, China; Sun Yat-sen University, China) ![]() |
|
Huang, LiGuo |
![]() Amiao Gao, Zenong Zhang, Simin Wang, LiGuo Huang, Shiyi Wei, and Vincent Ng (Southern Methodist University, Dallas, USA; University of Texas at Dallas, USA) ![]() |
|
Huang, Yiheng |
![]() Susheng Wu, Ruisi Wang, Yiheng Cao, Bihuan Chen, Zhuotong Zhou, Yiheng Huang, JunPeng Zhao, and Xin Peng (Fudan University, China) Branching repositories facilitates efficient software development but can also inadvertently propagate vulnerabilities. When an original branch is patched, other unfixed branches remain vulnerable unless the patch is successfully ported. However, due to inherent discrepancies between branches, many patches cannot be directly applied and require manual intervention, which is time-consuming and leads to delays in patch porting, increasing vulnerability risks. Existing automated patch porting approaches are prone to errors, as they often overlook essential semantic and syntactic context of vulnerability and fail to detect or refine faulty patches. We propose Mystique, a novel LLM-based approach to address these limitations. Mystique first slices the semantic-related statements linked to the vulnerability while ensuring syntactic correctness, allowing it to extract the signatures for both the original patched function and the target vulnerable function. Mystique then utilizes a fine-tuned LLM to generate a fixed function, which is further iteratively checked and refined to ensure successful porting. Our evaluation shows that Mystique achieved a success rate of 0.954 at function level and of 0.924 at CVE level, outperforming state-of-the-art approaches by at least 13.2% at function level and 12.3% at CVE level. Our evaluation also demonstrates Mystique’s superior generality across various projects, bugs, and programming languages. Mystique successfully ported patches for 34 real-world vulnerable branches. ![]() ![]() |
|
Huang, Yizhan |
![]() Weibin Wu, Haoxuan Hu, Zhaoji Fan, Yitong Qiao, Yizhan Huang, Yichen Li, Zibin Zheng, and Michael Lyu (Sun Yat-sen University, China; Chinese University of Hong Kong, China) Deep learning (DL) has revolutionized various software engineering tasks. Particularly, the emergence of AI code generators has pushed the boundaries of automatic programming to synthesize entire programs based on user-defined specifications in natural language. However, it remains a mystery if these AI code generators rely on the copy-and-paste programming practice, resulting in code clone concerns. In this work, to comprehensively study the code cloning behavior of AI code generators, we conduct an empirical study on three state-of-the-art commercial AI code generators to investigate the existence of all types of clones, which remains underexplored. Our experimental results show that the total Type-1 and Type-2 clone rates of the state-of-the-art commercial AI code generators can reach up to 7.50%, indicating marked code clone issues. Furthermore, it is observed that AI code generators risk infringing copyrights and propagating buggy and vulnerable code resulting from cloning code and show a certain degree of stability in generating code clones. ![]() |
|
Huang, Yuheng |
![]() Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, and Lei Ma (University of Alberta, Canada; University of Tokyo, Japan) ![]() |
|
Huo, Yintong |
![]() Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael Lyu (Chinese University of Hong Kong, China; Singapore Management University, Singapore) ![]() |
|
Huq, Syed Fatiul |
![]() Ziyao He, Syed Fatiul Huq, and Sam Malek (University of California at Irvine, USA) ![]() |
|
Ibrahimzada, Ali Reza |
![]() Ali Reza Ibrahimzada, Kaiyao Ke, Mrigank Pawagi, Muhammad Salman Abid, Rangeet Pan, Saurabh Sinha, and Reyhaneh Jabbarvand (University of Illinois at Urbana-Champaign, USA; Indian Institute of Science, India; Cornell University, USA; IBM Research, USA) ![]() |
|
Igwilo, Chiamaka |
![]() Shaila Sharmin, Anwar Hossain Zahid, Subhankar Bhattacharjee, Chiamaka Igwilo, Miryung Kim, and Wei Le (Iowa State University, USA; University of California at Los Angeles, USA) ![]() |
|
Imtiaz, Sayem Mohammad |
![]() Sayem Mohammad Imtiaz, Astha Singh, Fraol Batole, and Hridesh Rajan (Iowa State University, USA; Tulane University, USA) Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity, harmful responses, and factual inaccuracies. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, drawing inspiration from fault localization via program slicing, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model’s most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model’s overall versatility by altering a smaller portion of the model. Furthermore, dynamic selection allows for a more nuanced and precise model repair compared to a fixed selection strategy. We evaluated our technique on three models from the GPT2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair. ![]() |
|
Izadi, Maliheh |
![]() Ali Al-Kaswan, Sebastian Deatc, Begüm Koç, Arie van Deursen, and Maliheh Izadi (Delft University of Technology, Netherlands) ![]() |
|
Jabbarvand, Reyhaneh |
![]() Ali Reza Ibrahimzada, Kaiyao Ke, Mrigank Pawagi, Muhammad Salman Abid, Rangeet Pan, Saurabh Sinha, and Reyhaneh Jabbarvand (University of Illinois at Urbana-Champaign, USA; Indian Institute of Science, India; Cornell University, USA; IBM Research, USA) ![]() |
|
Jahanshahi, Mahmoud |
![]() Addi Malviya Thakur, Reed Milewicz, Mahmoud Jahanshahi, Lavínia Paganini, Bogdan Vasilescu, and Audris Mockus (University of Tennessee at Knoxville, USA; Oak Ridge National Laboratory, USA; Sandia National Laboratories, USA; Eindhoven University of Technology, Netherlands; Carnegie Mellon University, USA) ![]() |
|
Jang, Sujin |
![]() Sujin Jang, Yeonhee Ryou, Heewon Lee, and Kihong Heo (KAIST, Republic of Korea) ![]() |
|
Jia, Zhouyang |
![]() Xingpei Li, Yan Lei, Zhouyang Jia, Yuanliang Zhang, Haoran Liu, Liqian Chen, Wei Dong, and Shanshan Li (National University of Defense Technology, China; Chongqing University, China) As deep learning (DL) frameworks become widely used, converting models between frameworks is crucial for ecosystem flexibility. However, interestingly, existing model converters commonly focus on syntactic operator API mapping—transpiling operator names and parameters—which results in API compatibility issues (i.e., incompatible parameters, missing operators). These issues arise from semantic inconsistencies due to differences in operator implementation, causing conversion failure or performance degradation. In this paper, we present the first comprehensive study on operator semantic inconsistencies through API mapping analysis and framework source code inspection, revealing that 47% of sampled operators exhibit inconsistencies across DL frameworks, with source code limited to individual layers and no inter-layer interactions. This suggests that layer-wise source code alignment is feasible. Based on this, we propose ModelX, a source-level cross-framework conversion approach that extends operator API mapping by addressing semantic inconsistencies beyond the API level. Experiments on PyTorch-to-Paddle conversion show that ModelX successfully converts 624 out of 686 sampled operators and outperforms two state-of-the-art converters and three popular large language models. Moreover, ModelX achieves minimal metric gaps (avg. all under 3.4%) across 52 models from vision, text, and audio tasks, indicating strong robustness. ![]() ![]() Haoran Liu, Shanshan Li, Zhouyang Jia, Yuanliang Zhang, Linxiao Bai, Si Zheng, Xiaoguang Mao, and Xiangke Liao (National University of Defense Technology, China) ![]() |
|
Jiang, Tianyue |
![]() Yanlin Wang, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Mingzhi Mao, Xilin Liu, Yuchi Ma, and Zibin Zheng (Sun Yat-sen University, China; Huawei Cloud Computing Technologies, China) Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation, while coding style differences between LLMs and human developers remain under-explored. In this paper, we empirically analyze the differences in coding style between the code generated by mainstream LLMs and the code written by human developers, and summarize coding style inconsistency taxonomy. Specifically, we first summarize the types of coding style inconsistencies by manually analyzing a large number of generation results. We then compare the code generated by LLMs with the code written by human programmers in terms of readability, conciseness, and robustness. The results reveal that LLMs and developers exhibit differences in coding style. Additionally, we study the possible causes of these inconsistencies and provide some solutions to alleviate the problem. ![]() |
|
Jiang, Yanjie |
![]() Waseem Akram, Yanjie Jiang, Yuxia Zhang, Haris Ali Khan, and Hui Liu (Beijing Institute of Technology, China; Peking University, China) Accurate method naming is crucial for code readability and maintainability. However, manually creating concise and meaningful names remains a significant challenge. To this end, in this paper, we propose an approach based on Large Language Model (LLMs) to suggest method names according to function descriptions. The key of the approach is ContextCraft, an automated algorithm for generating context-rich prompts for LLM that suggests the expected method names according to the prompts. For a given query (functional description), it retrieves a few best examples whose functional descriptions have the greatest similarity with the query. From the examples, it identifies tokens that are likely to appear in the final method name as well as their likely positions, picks up pivot words that are semantically related to tokens in the according method names, and specifies the evaluation results of the LLM on the selected examples. All such outputs (tokens with probabilities and position information, pivot words accompanied by associated name tokens and similarity scores, and evaluation results) together with the query and the selected examples are then filled in a predefined prompt template, resulting in a context-rich prompt. This context-rich prompt reduces the randomness of LLMs by focusing the LLM’s attention on relevant contexts, constraining the solution space, and anchoring results to meaningful semantic relationships. Consequently, the LLM leverages this prompt to generate the expected method name, producing a more accurate and relevant suggestion. We evaluated the proposed approach with 43k real-world Java and Python methods accompanied by functional descriptions. Our evaluation results suggested that it significantly outperforms the state-of-the-art approach RNN-att-Copy, improving the chance of exact match by 52% and decreasing the edit distance between generated and expected method names by 32%. Our evaluation results also suggested that the proposed approach worked well for various LLMs, including ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini-1.5, and Llama-3. ![]() ![]() Haris Ali Khan, Yanjie Jiang, Qasim Umer, Yuxia Zhang, Waseem Akram, and Hui Liu (Beijing Institute of Technology, China; Peking University, China; King Fahd University of Petroleum and Minerals, Saudi Arabia) It is often valuable to know whether a given piece of source code has or hasn’t been used to train a given deep learning model. On one side, it helps avoid data contamination problems that may exaggerate the performance of evaluated models. Conversely, it facilitates copyright protection by identifying private or protected code leveraged for model training without permission. To this end, automated approaches have been proposed for the detection, known as data contamination detection. Such approaches often heavily rely on the confidence of the involved models, assuming that the models should be more confident in handling contaminated data than cleaned data. However, such approaches do not consider the nature of the given data item, i.e., how difficult it is to predict the given item. Consequently, difficult-to-predict contaminated data and easy-to-predict cleaned data are often misclassified. 
As an initial attempt to solve this problem, this paper presents a naturalness-based approach, called Natural-DaCoDe, for code-completion models to distinguish contaminated source code from clean code. Natural-DaCoDe leverages code naturalness to quantitatively measure the difficulty of a given piece of source code for code-completion models. It then trains a classifier to distinguish contaminated source code according to both code naturalness and the performance of the code-completion models on the given source code. We evaluate Natural-DaCoDe with two pre-trained large language models (i.e., ChatGPT and Claude) and two code-completion models that we trained from scratch for detecting contaminated data. Our evaluation results suggest that Natural-DaCoDe substantially outperformed the state-of-the-art approaches in detecting contaminated data, improving the average accuracy by 61.78%. We also evaluate Natural-DaCoDe on the method name suggestion task, where it remains more accurate than the state-of-the-art approaches, improving the accuracy by 54.39%. Furthermore, Natural-DaCoDe was tested on a natural language text benchmark, significantly outperforming the state-of-the-art approaches by 22%. These results suggest that Natural-DaCoDe could be applied to various source-code-related tasks beyond code completion. ![]() |
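The two signals that Natural-DaCoDe combines, how "natural" a snippet is and how well the model performs on it, can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the authors' tool: `token_logprobs` and `completion_accuracy` are hypothetical stand-ins for the scoring model and the code-completion measurement.

```python
# Minimal sketch (not Natural-DaCoDe itself): combine a naturalness score
# (cross-entropy under some code language model) with the completion model's
# accuracy on the same snippet, then let a classifier separate contaminated
# from clean code. `token_logprobs` is a hypothetical stand-in for whatever
# model is used to score naturalness.
import math
from sklearn.linear_model import LogisticRegression

def naturalness(snippet, token_logprobs):
    """Average negative log-probability per token: lower = more natural."""
    lps = token_logprobs(snippet)
    return -sum(lps) / max(len(lps), 1)

def features(snippet, token_logprobs, completion_accuracy):
    return [naturalness(snippet, token_logprobs), completion_accuracy(snippet)]

def train_detector(snippets, labels, token_logprobs, completion_accuracy):
    X = [features(s, token_logprobs, completion_accuracy) for s in snippets]
    return LogisticRegression().fit(X, labels)  # labels: 1 = contaminated

# Usage sketch with toy stand-ins for the two measurement functions.
fake_logprobs = lambda s: [math.log(0.5)] * len(s.split())
fake_accuracy = lambda s: 0.9 if "sort" in s else 0.4
clf = train_detector(
    ["def sort(xs): return sorted(xs)", "def blend(a, b): return a ^ b"],
    [1, 0], fake_logprobs, fake_accuracy)
print(clf.predict([features("def sort(ys): return sorted(ys)", fake_logprobs, fake_accuracy)]))
```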
|
Jiang, Yu |
![]() Lihua Guo, Yiwei Hou, Chijin Zhou, Quan Zhang, and Yu Jiang (Tsinghua University, China) In recent years, the increase in ransomware attacks has significantly impacted individuals and organizations. Many strategies have been proposed to detect ransomware’s file disruption operations. However, they rely on certain assumptions that gradually fail in the face of evolving ransomware attacks, which use more stealthy encryption methods or benign-imitation-based I/O orchestration. To mitigate this, we propose an approach to detect ransomware attacks through the temporal correlation between encryption and I/O behaviors. Its key idea is that there is a strong temporal correlation inherent in ransomware’s encryption and I/O behaviors. To disrupt files, ransomware must first read the file data from the disk into memory, encrypt it, and then write the encrypted data back to the disk. This process creates a pronounced temporal correlation between the computational load of the encryption operations and the size of the files being encrypted. Based on this invariant, we implement a prototype called RansomRadar and evaluate its effectiveness against 411 of the latest ransomware samples from 89 families. Experimental results show that it achieves a detection rate of 100.00% with 2.68% false alarms. Its F1-score is 96.03, which is 31.82 higher than that of existing detectors on average. ![]() |
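The temporal-correlation invariant is easy to illustrate. The sketch below is not RansomRadar's detector; it assumes a hypothetical per-window event stream of (timestamp, crypto-operation count, bytes written) and simply checks whether encryption load tracks I/O volume.

```python
# Sketch of the temporal-correlation invariant (not RansomRadar's detector):
# within sliding windows, ransomware-like activity shows encryption load that
# tracks the volume of file I/O. The event stream below is hypothetical.
import numpy as np

def window_series(events, window=1.0):
    """events: (timestamp, crypto_ops, bytes_written) tuples."""
    if not events:
        return np.array([]), np.array([])
    end = max(t for t, _, _ in events)
    n = int(end // window) + 1
    crypto = np.zeros(n)
    io = np.zeros(n)
    for t, ops, nbytes in events:
        idx = int(t // window)
        crypto[idx] += ops
        io[idx] += nbytes
    return crypto, io

def looks_like_ransomware(events, threshold=0.9):
    crypto, io = window_series(events)
    if len(crypto) < 3 or crypto.std() == 0 or io.std() == 0:
        return False
    corr = np.corrcoef(crypto, io)[0, 1]
    return corr >= threshold

# Encryption load rises and falls with the size of the files being written back.
suspicious = [(0.1, 500, 4096), (1.2, 4000, 32768), (2.3, 900, 8192), (3.4, 6000, 65536)]
print(looks_like_ransomware(suspicious))
```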
|
Jiang, Zhihan |
![]() Junjie Huang, Zhihan Jiang, Zhuangbin Chen, and Michael Lyu (Chinese University of Hong Kong, China; Sun Yat-sen University, China) ![]() |
|
Jin, Hai |
![]() Junyao Ye, Zhen Li, Xi Tang, Deqing Zou, Shouhuai Xu, Weizhong Qiang, and Hai Jin (Huazhong University of Science and Technology, China; University of Colorado at Colorado Springs, USA) ![]() ![]() Xiaohu Du, Ming Wen, Haoyu Wang, Zichao Wei, and Hai Jin (Huazhong University of Science and Technology, China) ![]() |
|
Kabir, Md Mahir Asef |
![]() Md Mahir Asef Kabir, Xiaoyin Wang, and Na Meng (Virginia Tech, USA; University of Texas at San Antonio, USA) When building enterprise applications (EAs) on Java frameworks (e.g., Spring), developers often configure application components via metadata (i.e., Java annotations and XML files). It is challenging for developers to correctly use metadata, because the usage rules can be complex and existing tools provide limited assistance. When developers misuse metadata, EAs become misconfigured, and these defects can trigger erroneous runtime behaviors or introduce security vulnerabilities. To help developers correctly use metadata, this paper presents (1) RSL, a domain-specific language that domain experts can adopt to prescribe metadata checking rules, and (2) MeCheck, a tool that takes in RSL rules and EAs to check for rule violations. With RSL, domain experts (e.g., developers of a Java framework) can specify metadata checking rules by defining content consistency among XML files, annotations, and Java code. Given such RSL rules and a program to scan, MeCheck interprets the rules as cross-file static analyzers, which scan Java and/or XML files to gather information and look for consistency violations. For evaluation, we studied the Spring and JUnit documentation to manually define 15 rules, and created 2 datasets with 115 open-source EAs. The first dataset includes 45 EAs, and the ground truth of 45 manually injected bugs. The second dataset includes multiple versions of 70 EAs. We observed that MeCheck identified bugs in the first dataset with 100% precision, 96% recall, and 98% F-score. It reported 152 bugs in the second dataset, 49 of which were already fixed by developers. Our evaluation shows that MeCheck helps ensure the correct usage of metadata. ![]() |
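As a rough illustration of the kind of cross-file consistency rule MeCheck enforces (not the RSL language or the tool itself), the following sketch checks one Spring-style rule: every `class` attribute in an XML bean definition should name a class that actually exists in the Java sources. The file layout and the regex-based class scan are simplifying assumptions.

```python
# Toy cross-file consistency check in the spirit of MeCheck (not the RSL
# language or the tool itself): every `class` attribute in a Spring-style
# XML bean definition should name a class that exists in the Java sources.
import re
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

CLASS_DECL = re.compile(r"\bclass\s+(\w+)\b")

def declared_classes(src_root):
    names = {}
    for java_file in Path(src_root).rglob("*.java"):
        text = java_file.read_text()
        pkg = re.search(r"^\s*package\s+([\w.]+);", text, re.M)
        for cls in CLASS_DECL.findall(text):
            fqn = f"{pkg.group(1)}.{cls}" if pkg else cls
            names[fqn] = java_file
    return names

def check_beans(xml_file, src_root):
    known = declared_classes(src_root)
    violations = []
    for bean in ET.parse(xml_file).getroot().iter():
        if bean.tag.endswith("bean") and bean.get("class") and bean.get("class") not in known:
            violations.append((bean.get("id"), bean.get("class")))
    return violations

if __name__ == "__main__":
    for bean_id, cls in check_beans(sys.argv[1], sys.argv[2]):
        print(f"bean '{bean_id}' refers to missing class {cls}")
```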
|
Kaboré, Abdoul Kader |
![]() Micheline Bénédicte Moumoula, Abdoul Kader Kaboré, Jacques Klein, and Tegawendé F. Bissyandé (University of Luxembourg, Luxembourg; Interdisciplinary Centre of Excellence in AI for Development (CITADEL), Burkina Faso) With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We evaluate the performance of five (05) LLMs and eight (08) prompts for the identification of cross-lingual code clones. Additionally, we compare these results against two baseline methods. Finally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. The studies involving LLMs and embedding models are evaluated using two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.99, for straightforward programming examples. However, they not only perform less well on programs associated with complex programming challenges but also do not necessarily understand the meaning of “code clones” in a cross-lingual setting. We show that embedding models used to represent code fragments from different programming languages in the same representation space enable the training of a basic classifier that outperforms all LLMs by ∼1 and ∼20 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, embeddings provided by embedding models offer suitable representations to achieve state-of-the-art performance in cross-lingual code clone detection. ![]() |
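The embedding-plus-classifier baseline described above can be sketched compactly. This is not the authors' exact setup: `embed` is a hypothetical stand-in for a pre-trained code embedding model, and the pair features (element-wise differences, products, cosine similarity) are one common choice rather than the paper's.

```python
# Sketch of the embedding-plus-classifier baseline described above (not the
# authors' exact setup): represent two code fragments, possibly in different
# languages, as vectors and train a simple classifier on pair features.
# `embed` is a hypothetical stand-in for a pre-trained code embedding model.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(a_vec, b_vec):
    """Combine two fragment embeddings into one feature vector."""
    a, b = np.asarray(a_vec), np.asarray(b_vec)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return np.concatenate([np.abs(a - b), a * b, [cos]])

def train_clone_classifier(pairs, labels, embed):
    X = [pair_features(embed(x), embed(y)) for x, y in pairs]
    return LogisticRegression(max_iter=1000).fit(X, labels)  # 1 = clone pair

# Toy usage with a fake embedding (hash-based), just to show the data flow.
def fake_embed(code, dim=16):
    rng = np.random.default_rng(abs(hash(code)) % (2**32))
    return rng.normal(size=dim)

pairs = [("def add(a,b): return a+b", "int add(int a,int b){return a+b;}"),
         ("def add(a,b): return a+b", "void log(String m){System.out.println(m);}")]
clf = train_clone_classifier(pairs, [1, 0], fake_embed)
print(clf.predict([pair_features(fake_embed(pairs[0][0]), fake_embed(pairs[0][1]))]))
```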
|
Kang, Hong Jin |
![]() Jiyuan Wang, Yuxin Qiu, Ben Limpanukorn, Hong Jin Kang, Qian Zhang, and Miryung Kim (University of California at Los Angeles, USA; University of California at Riverside, USA) In recent years, the MLIR framework has had explosive growth due to the need for extensible deep learning compilers for hardware accelerators. Such examples include Triton, CIRCT, and ONNX-MLIR. MLIR compilers introduce significant complexities in localizing bugs or inefficiencies because of their layered optimization and transformation process with compilation passes. While existing delta debugging techniques can be used to identify a minimum subset of IR code that reproduces a given bug symptom, their naive application to MLIR is time-consuming because real-world MLIR compilers usually involve a large number of compilation passes. Compiler developers must identify a minimized set of relevant compilation passes to reduce the footprint of MLIR compiler code to be inspected for a bug fix. We propose DuoReduce, a dual-dimensional reduction approach for MLIR bug localization. DuoReduce leverages three key ideas in tandem to design an efficient MLIR delta debugger. First, DuoReduce reduces compiler passes that are irrelevant to the bug by identifying ordering dependencies among the different compilation passes. Second, DuoReduce uses MLIR-semantics-aware transformations to expedite IR code reduction. Finally, DuoReduce leverages cross-dependence between the IR code dimension and the compilation pass dimension by accounting for which IR code segments are related to which compilation passes to reduce unused passes. Experiments with three large-scale MLIR compiler projects find that DuoReduce outperforms syntax-aware reducers such as Perses and Vulcan in terms of IR code reduction by 31.6% and 21.5% respectively. If one uses these reducers by enumerating all possible compilation passes (on average 18 passes), it could take up to 145 hours. By identifying ordering dependencies among compilation passes, DuoReduce reduces this time to 9.5 minutes. By identifying which compilation passes are unused for compiling reduced IR code, DuoReduce reduces the number of passes by 14.6%. This translates to not needing to examine 281 lines of MLIR compiler code on average to fix the bugs. DuoReduce has the potential to significantly reduce debugging effort in MLIR compilers, which serves as the foundation for the current landscape of machine learning and hardware accelerators. ![]() |
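A much-simplified view of reducing the compilation-pass dimension is shown below. It is not DuoReduce itself: `reproduces_bug` is a hypothetical oracle that reruns the pipeline on the already-reduced IR, the dependency map is assumed to be known, and the reduction is a plain one-pass-at-a-time greedy loop rather than the paper's algorithm.

```python
# Simplified sketch of reducing the compilation-pass dimension (not DuoReduce
# itself): greedily drop passes whose removal still reproduces the bug, while
# respecting known ordering dependencies. `reproduces_bug` is a hypothetical
# oracle that runs the MLIR pipeline on the (already reduced) IR.
def respects_dependencies(passes, deps):
    """deps maps a pass to the set of passes that must run before it."""
    seen = set()
    for p in passes:
        if not deps.get(p, set()) <= seen:
            return False
        seen.add(p)
    return True

def reduce_passes(passes, deps, reproduces_bug):
    current = list(passes)
    changed = True
    while changed:
        changed = False
        for p in list(current):
            candidate = [q for q in current if q != p]
            if respects_dependencies(candidate, deps) and reproduces_bug(candidate):
                current = candidate
                changed = True
    return current

# Toy usage: the bug only needs canonicalize followed by lower-to-llvm.
deps = {"lower-to-llvm": {"canonicalize"}}
oracle = lambda ps: {"canonicalize", "lower-to-llvm"} <= set(ps)
print(reduce_passes(["canonicalize", "cse", "inline", "lower-to-llvm"], deps, oracle))
```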
|
Karimipour, Nima |
![]() Nima Karimipour, Erfan Arvan, Martin Kellogg, and Manu Sridharan (University of California at Riverside, USA; New Jersey Institute of Technology, USA) Null-pointer exceptions are a serious problem for Java, and researchers have developed type-based nullness checking tools to prevent them. These tools, however, have a downside: they require developers to write nullability annotations, which is time-consuming and hinders adoption. Researchers have therefore proposed nullability annotation inference tools, whose goal is to (partially) automate the task of annotating a program for nullability. However, prior works rely on differing theories of what makes a set of nullability annotations good, making it challenging to compare their effectiveness. In this work, we identify a systematic bias in some prior experimental evaluations of these tools: the use of “type reconstruction” experiments to see if a tool can recover erased developer-written annotations. We show that developers make semantic code changes while adding annotations to facilitate typechecking, leading such experiments to overestimate the effectiveness of inference tools on never-annotated code. We propose a new definition of the “best” inferred annotations for a program that avoids this bias, based on a systematic exploration of the design space. With this new definition, we perform the first head-to-head comparison of three extant nullability inference tools. Our evaluation showed the complementary strengths of the tools and remaining weaknesses that could be addressed in future work. ![]() |
|
Kästner, Christian |
![]() Hao He, Bogdan Vasilescu, and Christian Kästner (Carnegie Mellon University, USA) Recent high-profile incidents in open-source software have greatly raised practitioner attention on software supply chain attacks. To guard against potential malicious package updates, security practitioners advocate pinning dependency to specific versions rather than floating in version ranges. However, it remains controversial whether pinning carries a meaningful security benefit that outweighs the cost of maintaining outdated and possibly vulnerable dependencies. In this paper, we quantify, through counterfactual analysis and simulations, the security and maintenance impact of version constraints in the npm ecosystem. By simulating dependency resolutions over historical time points, we find that pinning direct dependencies not only (as expected) increases the cost of maintaining vulnerable and outdated dependencies, but also (surprisingly) even increases the risk of exposure to malicious package updates in larger dependency graphs due to the specifics of npm’s dependency resolution mechanism. Finally, we explore collective pinning strategies to secure the ecosystem against supply chain attacks, suggesting specific changes to npm to enable such interventions. Our study provides guidance for practitioners and tool designers to manage their supply chains more securely. ![]() |
|
Kate, Sayali |
![]() Sayali Kate, Yifei Gao, Shiwei Feng, and Xiangyu Zhang (Purdue University, USA) Increasingly popular Robot Operating System (ROS) framework allows building robotic systems by integrating newly developed and/or reused modules, where the modules can use different versions of the framework (e.g., ROS1 or ROS2) and programming language (e.g. C++ or Python). The majority of such robotic systems' work happens in callbacks. The framework provides various elements for initializing callbacks and for setting up the execution of callbacks. It is the responsibility of developers to compose callbacks and their execution setup elements, and hence can lead to inconsistencies related to the setup of callback execution due to developer's incomplete knowledge of the semantics of elements in various versions of the framework. Some of these inconsistencies do not throw errors at runtime, making their detection difficult for developers. We propose a static approach to detecting such inconsistencies by extracting a static view of the composition of robotic system's callbacks and their execution setup, and then checking it against the composition conventions based on the elements' semantics. We evaluate our ROSCallBaX prototype on the dataset created from the posts on developer forums and ROS projects that are publicly available. The evaluation results show that our approach can detect real inconsistencies. ![]() |
|
Ke, Kaiyao |
![]() Ali Reza Ibrahimzada, Kaiyao Ke, Mrigank Pawagi, Muhammad Salman Abid, Rangeet Pan, Saurabh Sinha, and Reyhaneh Jabbarvand (University of Illinois at Urbana-Champaign, USA; Indian Institute of Science, India; Cornell University, USA; IBM Research, USA) ![]() |
|
Keim, Manthan |
![]() Weihao Chen, Jia Lin Cheoh, Manthan Keim, Sabine Brunswicker, and Tianyi Zhang (Purdue University, USA) ![]() |
|
Kellogg, Martin |
![]() Nima Karimipour, Erfan Arvan, Martin Kellogg, and Manu Sridharan (University of California at Riverside, USA; New Jersey Institute of Technology, USA) Null-pointer exceptions are a serious problem for Java, and researchers have developed type-based nullness checking tools to prevent them. These tools, however, have a downside: they require developers to write nullability annotations, which is time-consuming and hinders adoption. Researchers have therefore proposed nullability annotation inference tools, whose goal is to (partially) automate the task of annotating a program for nullability. However, prior works rely on differing theories of what makes a set of nullability annotations good, making it challenging to compare their effectiveness. In this work, we identify a systematic bias in some prior experimental evaluations of these tools: the use of “type reconstruction” experiments to see if a tool can recover erased developer-written annotations. We show that developers make semantic code changes while adding annotations to facilitate typechecking, leading such experiments to overestimate the effectiveness of inference tools on never-annotated code. We propose a new definition of the “best” inferred annotations for a program that avoids this bias, based on a systematic exploration of the design space. With this new definition, we perform the first head-to-head comparison of three extant nullability inference tools. Our evaluation showed the complementary strengths of the tools and remaining weaknesses that could be addressed in future work. ![]() |
|
Khan, Haris Ali |
![]() Waseem Akram, Yanjie Jiang, Yuxia Zhang, Haris Ali Khan, and Hui Liu (Beijing Institute of Technology, China; Peking University, China) Accurate method naming is crucial for code readability and maintainability. However, manually creating concise and meaningful names remains a significant challenge. To this end, in this paper, we propose an approach based on Large Language Model (LLMs) to suggest method names according to function descriptions. The key of the approach is ContextCraft, an automated algorithm for generating context-rich prompts for LLM that suggests the expected method names according to the prompts. For a given query (functional description), it retrieves a few best examples whose functional descriptions have the greatest similarity with the query. From the examples, it identifies tokens that are likely to appear in the final method name as well as their likely positions, picks up pivot words that are semantically related to tokens in the according method names, and specifies the evaluation results of the LLM on the selected examples. All such outputs (tokens with probabilities and position information, pivot words accompanied by associated name tokens and similarity scores, and evaluation results) together with the query and the selected examples are then filled in a predefined prompt template, resulting in a context-rich prompt. This context-rich prompt reduces the randomness of LLMs by focusing the LLM’s attention on relevant contexts, constraining the solution space, and anchoring results to meaningful semantic relationships. Consequently, the LLM leverages this prompt to generate the expected method name, producing a more accurate and relevant suggestion. We evaluated the proposed approach with 43k real-world Java and Python methods accompanied by functional descriptions. Our evaluation results suggested that it significantly outperforms the state-of-the-art approach RNN-att-Copy, improving the chance of exact match by 52% and decreasing the edit distance between generated and expected method names by 32%. Our evaluation results also suggested that the proposed approach worked well for various LLMs, including ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini-1.5, and Llama-3. ![]() ![]() Haris Ali Khan, Yanjie Jiang, Qasim Umer, Yuxia Zhang, Waseem Akram, and Hui Liu (Beijing Institute of Technology, China; Peking University, China; King Fahd University of Petroleum and Minerals, Saudi Arabia) It is often valuable to know whether a given piece of source code has or hasn’t been used to train a given deep learning model. On one side, it helps avoid data contamination problems that may exaggerate the performance of evaluated models. Conversely, it facilitates copyright protection by identifying private or protected code leveraged for model training without permission. To this end, automated approaches have been proposed for the detection, known as data contamination detection. Such approaches often heavily rely on the confidence of the involved models, assuming that the models should be more confident in handling contaminated data than cleaned data. However, such approaches do not consider the nature of the given data item, i.e., how difficult it is to predict the given item. Consequently, difficult-to-predict contaminated data and easy-to-predict cleaned data are often misclassified. 
As an initial attempt to solve this problem, this paper presents a naturalness-based approach, called Natural-DaCoDe, for code-completion models to distinguish contaminated source code from clean code. Natural-DaCoDe leverages code naturalness to quantitatively measure the difficulty of a given piece of source code for code-completion models. It then trains a classifier to distinguish contaminated source code according to both code naturalness and the performance of the code-completion models on the given source code. We evaluate Natural-DaCoDe with two pre-trained large language models (i.e., ChatGPT and Claude) and two code-completion models that we trained from scratch for detecting contaminated data. Our evaluation results suggest that Natural-DaCoDe substantially outperformed the state-of-the-art approaches in detecting contaminated data, improving the average accuracy by 61.78%. We also evaluate Natural-DaCoDe on the method name suggestion task, where it remains more accurate than the state-of-the-art approaches, improving the accuracy by 54.39%. Furthermore, Natural-DaCoDe was tested on a natural language text benchmark, significantly outperforming the state-of-the-art approaches by 22%. These results suggest that Natural-DaCoDe could be applied to various source-code-related tasks beyond code completion. ![]() |
|
Khomh, Foutse |
![]() Roozbeh Aghili, Heng Li, and Foutse Khomh (Polytechnique Montréal, Canada) Software logs, generated during the runtime of software systems, are essential for various development and analysis activities, such as anomaly detection and failure diagnosis. However, the presence of sensitive information in these logs poses significant privacy concerns, particularly regarding Personally Identifiable Information (PII) and quasi-identifiers that could lead to re-identification risks. While general data privacy has been extensively studied, the specific domain of privacy in software logs remains underexplored, with inconsistent definitions of sensitivity and a lack of standardized guidelines for anonymization. To mitigate this gap, this study offers a comprehensive analysis of privacy in software logs from multiple perspectives. We start by performing an analysis of 25 publicly available log datasets to identify potentially sensitive attributes. Based on the result of this step, we focus on three perspectives: privacy regulations, research literature, and industry practices. We first analyze key data privacy regulations, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), to understand the legal requirements concerning sensitive information in logs. Second, we conduct a systematic literature review to identify common privacy attributes and practices in log anonymization, revealing gaps in existing approaches. Finally, we survey 45 industry professionals to capture practical insights on log anonymization practices. Our findings shed light on various perspectives of log privacy and reveal industry challenges, such as technical and efficiency issues while highlighting the need for standardized guidelines. By combining insights from regulatory, academic, and industry perspectives, our study aims to provide a clearer framework for identifying and protecting sensitive information in software logs. ![]() |
|
Kim, Miryung |
![]() Shaila Sharmin, Anwar Hossain Zahid, Subhankar Bhattacharjee, Chiamaka Igwilo, Miryung Kim, and Wei Le (Iowa State University, USA; University of California at Los Angeles, USA) ![]() ![]() Jiyuan Wang, Yuxin Qiu, Ben Limpanukorn, Hong Jin Kang, Qian Zhang, and Miryung Kim (University of California at Los Angeles, USA; University of California at Riverside, USA) In recent years, the MLIR framework has had explosive growth due to the need for extensible deep learning compilers for hardware accelerators. Such examples include Triton, CIRCT, and ONNX-MLIR. MLIR compilers introduce significant complexities in localizing bugs or inefficiencies because of their layered optimization and transformation process with compilation passes. While existing delta debugging techniques can be used to identify a minimum subset of IR code that reproduces a given bug symptom, their naive application to MLIR is time-consuming because real-world MLIR compilers usually involve a large number of compilation passes. Compiler developers must identify a minimized set of relevant compilation passes to reduce the footprint of MLIR compiler code to be inspected for a bug fix. We propose DuoReduce, a dual-dimensional reduction approach for MLIR bug localization. DuoReduce leverages three key ideas in tandem to design an efficient MLIR delta debugger. First, DuoReduce reduces compiler passes that are irrelevant to the bug by identifying ordering dependencies among the different compilation passes. Second, DuoReduce uses MLIR-semantics-aware transformations to expedite IR code reduction. Finally, DuoReduce leverages cross-dependence between the IR code dimension and the compilation pass dimension by accounting for which IR code segments are related to which compilation passes to reduce unused passes. Experiments with three large-scale MLIR compiler projects find that DuoReduce outperforms syntax-aware reducers such as Perses and Vulcan in terms of IR code reduction by 31.6% and 21.5% respectively. If one uses these reducers by enumerating all possible compilation passes (on average 18 passes), it could take up to 145 hours. By identifying ordering dependencies among compilation passes, DuoReduce reduces this time to 9.5 minutes. By identifying which compilation passes are unused for compiling reduced IR code, DuoReduce reduces the number of passes by 14.6%. This translates to not needing to examine 281 lines of MLIR compiler code on average to fix the bugs. DuoReduce has the potential to significantly reduce debugging effort in MLIR compilers, which serves as the foundation for the current landscape of machine learning and hardware accelerators. ![]() |
|
Kim, Myeongsoo |
![]() Myeongsoo Kim, Saurabh Sinha, and Alessandro Orso (Georgia Institute of Technology, USA; IBM Research, USA) Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. The widespread adoption of this standard has resulted in the development of many black-box testing tools that generate tests based on OpenAPI specifications. Although Large Language Models (LLMs) have shown promising test-generation abilities, their application to REST API testing remains mostly unexplored. We present LlamaRestTest, a novel approach that employs two custom LLMs, created by fine-tuning and quantizing the Llama3-8B model using mined datasets of REST API example values and inter-parameter dependencies, to generate realistic test inputs and uncover inter-parameter dependencies during the testing process by analyzing server responses. We evaluated LlamaRestTest on 12 real-world services (including popular services such as Spotify), comparing it against RESTGPT, a GPT-powered specification-enhancement tool, as well as several state-of-the-art REST API testing tools, including RESTler, MoRest, EvoMaster, and ARAT-RL. Our results demonstrate that fine-tuning enables smaller models to outperform much larger models in detecting actionable parameter-dependency rules and generating valid inputs for REST API testing. We also evaluated different tool configurations, ranging from the base Llama3-8B model to fine-tuned versions, and explored multiple quantization techniques, including 2-bit, 4-bit, and 8-bit integer formats. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing, balancing effectiveness and efficiency. Furthermore, LlamaRestTest outperforms state-of-the-art REST API testing tools in code coverage achieved and internal server errors identified, even when those tools use RESTGPT-enhanced specifications. Finally, through an ablation study, we show that each component of LlamaRestTest contributes to its overall performance. ![]() |
|
Kisaakye, Joanna |
![]() Yuqing Wang, Mika V. Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä (University of Helsinki, Finland; University of Oulu, Finland; University of Antwerp, Belgium) Microservice-based systems (MSS) may fail with various fault types, due to their complex and dynamic nature. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), human efforts from practitioners are still required for further root cause analysis to diagnose specific fault types and analyze failure reasons for detected abnormal traces, particularly when abnormal traces do not stem directly from specific services. In this paper, we propose a novel AIOps framework, TraFaultDia, to automatically classify abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for a MSS. TraFaultDia is trained on several abnormal trace classification tasks with a few labeled instances from a MSS using a meta-learning approach. After training, TraFaultDia can quickly adapt to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. TraFaultDia’s use cases are scalable depending on how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two representative MSS, TrainTicket and OnlineBoutique, with open datasets. In these datasets, each fault category is tied to the faulty system component(s) (service/pod) with a root cause. Our TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. Our results show that, within the MSS it is trained on, TraFaultDia achieves an average accuracy of 93.26% and 85.20% across 50 new, unseen abnormal trace classification tasks for TrainTicket and OnlineBoutique respectively, when provided with 10 labeled instances for each fault category per task in each system. In the cross-system context, when TraFaultDia is applied to a MSS different from the one it is trained on, TraFaultDia gets an average accuracy of 92.19% and 84.77% for the same set of 50 new, unseen abnormal trace classification tasks of the respective systems, also with 10 labeled instances provided for each fault category per task in each system. ![]() |
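The few-shot adaptation step can be illustrated with a nearest-prototype classifier, which is a simplification of the meta-learning approach the paper actually uses. `embed_trace` is a hypothetical encoder for an abnormal trace; the fault-category names and toy traces are made up.

```python
# Simplified illustration of few-shot fault-category classification (the paper
# uses meta-learning; this nearest-prototype version only shows the adaptation
# step). `embed_trace` is a hypothetical encoder for an abnormal trace.
import numpy as np

def build_prototypes(support_set, embed_trace):
    """support_set: {fault_category: [trace, ...]} with ~10 traces each."""
    return {cat: np.mean([embed_trace(t) for t in traces], axis=0)
            for cat, traces in support_set.items()}

def classify(trace, prototypes, embed_trace):
    v = embed_trace(trace)
    return min(prototypes, key=lambda cat: np.linalg.norm(v - prototypes[cat]))

# Toy usage with a fake embedding keyed on which service/pod the trace touches.
def fake_embed(trace):
    services = ["order", "payment", "shipping"]
    return np.array([trace.count(s) for s in services], dtype=float)

support = {"payment-pod-failure": ["payment timeout payment", "payment error"],
           "order-service-bug":   ["order exception", "order order retry"]}
prototypes = build_prototypes(support, fake_embed)
print(classify("payment retry payment timeout", prototypes, fake_embed))
```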
|
Klein, Jacques |
![]() Micheline Bénédicte Moumoula, Abdoul Kader Kaboré, Jacques Klein, and Tegawendé F. Bissyandé (University of Luxembourg, Luxembourg; Interdisciplinary Centre of Excellence in AI for Development (CITADEL), Burkina Faso) With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We evaluate the performance of five (05) LLMs and eight (08) prompts for the identification of cross-lingual code clones. Additionally, we compare these results against two baseline methods. Finally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. The studies involving LLMs and embedding models are evaluated using two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.99, for straightforward programming examples. However, they not only perform less well on programs associated with complex programming challenges but also do not necessarily understand the meaning of “code clones” in a cross-lingual setting. We show that embedding models used to represent code fragments from different programming languages in the same representation space enable the training of a basic classifier that outperforms all LLMs by ∼1 and ∼20 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, embeddings provided by embedding models offer suitable representations to achieve state-of-the-art performance in cross-lingual code clone detection. ![]() |
|
Koç, Begüm |
![]() Ali Al-Kaswan, Sebastian Deatc, Begüm Koç, Arie van Deursen, and Maliheh Izadi (Delft University of Technology, Netherlands) ![]() |
|
Kong, Jiaolong |
![]() Jiaolong Kong, Xie Xiaofei, and Shangqing Liu (Singapore Management University, Singapore; Nanyang Technological University, Singapore) ![]() |
|
Kong, Ziqiao |
![]() Ziqiao Kong, Cen Zhang, Maoyi Xie, Ming Hu, Yue Xue, Ye Liu, Haijun Wang, and Yang Liu (Nanyang Technological University, Singapore; Singapore Management University, Singapore; MetaTrust Labs, Singapore; Xi'an Jiaotong University, China) Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: 1. a lack of profit-centric techniques for expediting detection, and 2. insufficient automation in maximizing the profitability of discovered vulnerabilities, leaving the analysis to human experts. To address these gaps, we have developed VERITE, a profit-centric smart contract fuzzing framework that not only effectively detects those profitable vulnerabilities but also maximizes the exploited profits. VERITE has three key features: 1. DeFi action-based mutators for boosting the exploration of transactions with different fund flows; 2. identification criteria for potentially profitable candidates, which check whether the input has caused abnormal fund flow properties during testing; 3. a gradient descent-based profit maximization strategy for these identified candidates. VERITE is fully developed from scratch and evaluated on a dataset consisting of 61 exploited real-world DeFi projects with an average loss of over 1.1 million dollars. The results show that VERITE can automatically extract more than 18 million dollars in total and is significantly better than the state-of-the-art fuzzer ItyFuzz in both detection (29/10) and exploitation (134 times more profits gained on average). Remarkably, in 12 targets, it gains more profits than real-world attacking exploits (1.01 to 11.45 times more). VERITE has also been applied by auditors in contract auditing, where 6 zero-day vulnerabilities (5 of high severity) were found with over $2,500 in bounty rewards. ![]() ![]() ![]() |
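The gradient-descent-style profit maximization can be sketched as a numerical hill climb over continuous transaction parameters. This is not VERITE's implementation: `simulate_profit` is a hypothetical stand-in for replaying a candidate transaction sequence against a forked chain state, and the toy profit surface exists only to show the loop converging.

```python
# Sketch of gradient-based profit maximization for an exploit candidate (not
# VERITE's implementation): estimate the gradient of simulated profit with
# respect to continuous transaction parameters and step uphill.
# `simulate_profit` is a hypothetical stand-in for replaying the candidate
# transaction sequence against a forked chain state.
def maximize_profit(simulate_profit, params, lr=0.05, eps=1e-3, steps=200):
    params = list(params)
    for _ in range(steps):
        grad = []
        for i in range(len(params)):
            up, down = list(params), list(params)
            up[i] += eps
            down[i] -= eps
            grad.append((simulate_profit(up) - simulate_profit(down)) / (2 * eps))
        params = [p + lr * g for p, g in zip(params, grad)]
    return params, simulate_profit(params)

# Toy profit surface: the best swap amount is 3.0, profit falls off on both sides.
toy_profit = lambda ps: 10.0 - (ps[0] - 3.0) ** 2
best, profit = maximize_profit(toy_profit, [0.5])
print(round(best[0], 2), round(profit, 2))  # ~3.0, ~10.0
```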
|
Kreyssig, Bruno |
![]() Bruno Kreyssig and Alexandre Bartel (Umeå University, Sweden) While multiple recent publications on detecting Java Deserialization Vulnerabilities highlight the increasing relevance of the topic, until now no proper benchmark has been established to evaluate the individual approaches. Hence, it has become increasingly difficult to show improvements over previous tools and the trade-offs that were made. In this work, we synthesize the main challenges in gadget chain detection. More specifically, this unveils the constraints program analysis faces in the context of gadget chain detection. From there, we develop Gleipner: the first synthetic, large-scale and systematic benchmark to validate the effectiveness of algorithms for detecting gadget chains in the Java programming language. We then benchmark seven previous publications in the field using Gleipner. The results show that (1) our benchmark provides a transparent, qualitative, and sound measurement of the maturity of gadget chain detection tools, (2) Gleipner alleviates severe benchmarking flaws that were previously common in the field, and (3) state-of-the-art tools still struggle with most challenges in gadget chain detection. ![]() |
|
Krinke, Jens |
![]() Lukas Fruntke and Jens Krinke (University College London, UK) Breaking changes in dependencies are a common challenge in software development, requiring manual intervention to resolve. This study examines how well LLMs automate the repair of breaking changes caused by dependency updates in Java projects. Although earlier methods have mostly concentrated on detecting breaking changes or investigating their impact, they have not been able to completely automate the repair process. We introduce and compare two new approaches: an agentic system that combines automated tool usage with an LLM, and a recursive zero-shot approach employing iterative prompt refinement. Our experimental framework assesses the repair success of both approaches, using the BUMP dataset of curated breaking changes. We also investigate the impact of variables such as dependency popularity and prompt configuration on repair outcomes. Our results demonstrate a substantial difference in test suite success rates, with the agentic approach achieving a repair success rate of up to 23%, while the zero-shot prompting approach achieved a repair success rate of up to 19%. We show that automated program repair of breaking dependencies with LLMs is feasible and can be optimised to achieve better repair outcomes. ![]() ![]() |
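The recursive zero-shot approach boils down to a feedback loop between the build and the model. The sketch below captures that shape only; `ask_llm` and `run_build` are hypothetical stand-ins, and the prompt wording is illustrative rather than the paper's.

```python
# Sketch of the recursive zero-shot loop for repairing a breaking dependency
# update (not the paper's exact prompts or harness). `ask_llm` and `run_build`
# are hypothetical stand-ins for the model call and for compiling/testing the
# Java project after applying the suggested patch.
def repair_breaking_change(client_code, old_dep, new_dep, ask_llm, run_build,
                           max_rounds=5):
    errors = run_build(client_code)          # build output after the dep bump
    for _ in range(max_rounds):
        if not errors:
            return client_code               # build and tests pass
        prompt = (
            f"The project was updated from {old_dep} to {new_dep} and no longer "
            f"compiles.\nBuild errors:\n{errors}\n\nCurrent code:\n{client_code}\n"
            "Return the full corrected file only."
        )
        client_code = ask_llm(prompt)        # refine using the latest errors
        errors = run_build(client_code)
    return None                              # give up after max_rounds

# Toy usage: the "build" succeeds once the renamed API is used.
fake_build = lambda code: "" if "newApi(" in code else "error: cannot find symbol oldApi"
fake_llm = lambda prompt: "client.newApi(42);"
print(repair_breaking_change("client.oldApi(42);", "lib:1.0", "lib:2.0",
                             fake_llm, fake_build))
```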
|
Lang, Congliang |
![]() Siwei Tan, Liqiang Lu, Debin Xiang, Tianyao Chu, Congliang Lang, Jintao Chen, Xing Hu, and Jianwei Yin (Zhejiang University, China) Quantum programs provide exponential speedups compared to classical programs in certain areas, but they also inevitably encounter logical faults. Automatically repairing quantum programs is much more challenging than repairing classical programs due to the non-replicability of data, the vast search space of program inputs, and the new programming paradigm. Existing works based on semantic-based or learning-based program repair techniques are fundamentally limited in repairing efficiency and effectiveness. In this work, we propose HornBro, an efficient framework for automated quantum program repair. The key insight of HornBro lies in the homotopy-like method, which iteratively switches between the classical and quantum parts. This approach allows the repair tasks to be efficiently offloaded to the most suitable platforms, enabling a progressive convergence toward the correct program. We start by designing an implication assertion pragma to enable rigorous specifications of quantum program behavior, which helps to generate a quantum test suite automatically. This suite leverages the orthonormal bases of quantum programs to accommodate different encoding schemes. Given a fixed number of test cases, it allows the maximum input coverage of potential counter-example candidates. Then, we develop a Clifford approximation method with an SMT-based search to transform the patch localization program into a symbolic reasoning problem. Finally, we offload the computationally intensive repair of gate parameters to quantum hardware by leveraging the differentiability of quantum gates. Experiments suggest that HornBro increases the repair success rate by more than 62.5% compared to the existing repair techniques, supporting more types of quantum bugs. It also achieves 35.7× speedup in the repair and 99.9% gate reduction of the patch. ![]() ![]() |
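One ingredient mentioned above, exploiting the differentiability of quantum gates to repair gate parameters, can be shown in isolation. The sketch below is a NumPy simulation, not HornBro's hardware-offloaded pipeline: it tunes the angle of a single RY gate toward a target state using the parameter-shift rule, which gives exact gradients for this gate.

```python
# Illustration of repairing a gate parameter via differentiability (not
# HornBro's pipeline): tune the angle of a single RY gate so the prepared
# state matches a target state, using the parameter-shift rule to get exact
# gradients of the fidelity. Simulated with NumPy instead of real hardware.
import numpy as np

def ry_state(theta):
    """RY(theta)|0> = [cos(theta/2), sin(theta/2)]."""
    return np.array([np.cos(theta / 2), np.sin(theta / 2)])

def fidelity(theta, target):
    return abs(np.dot(target, ry_state(theta))) ** 2

def parameter_shift_grad(theta, target, shift=np.pi / 2):
    return (fidelity(theta + shift, target) - fidelity(theta - shift, target)) / 2

target = ry_state(1.1)   # the "correct" program prepares this state
theta = 2.8              # buggy parameter to be repaired
for _ in range(100):
    theta += 0.5 * parameter_shift_grad(theta, target)  # gradient ascent
print(round(theta, 3), round(fidelity(theta, target), 4))  # ~1.1, ~1.0
```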
|
Le, Wei |
![]() Shaila Sharmin, Anwar Hossain Zahid, Subhankar Bhattacharjee, Chiamaka Igwilo, Miryung Kim, and Wei Le (Iowa State University, USA; University of California at Los Angeles, USA) ![]() |
|
Lee, Heewon |
![]() Sujin Jang, Yeonhee Ryou, Heewon Lee, and Kihong Heo (KAIST, Republic of Korea) ![]() |
|
Leesatapornwongsa, Tanakorn |
![]() Tanakorn Leesatapornwongsa, Fazle Faisal, and Suman Nath (Microsoft Research, USA) ![]() |
|
Lei, Yan |
![]() Xingpei Li, Yan Lei, Zhouyang Jia, Yuanliang Zhang, Haoran Liu, Liqian Chen, Wei Dong, and Shanshan Li (National University of Defense Technology, China; Chongqing University, China) As deep learning (DL) frameworks become widely used, converting models between frameworks is crucial for ecosystem flexibility. However, interestingly, existing model converters commonly focus on syntactic operator API mapping—transpiling operator names and parameters—which results in API compatibility issues (i.e., incompatible parameters, missing operators). These issues arise from semantic inconsistencies due to differences in operator implementation, causing conversion failure or performance degradation. In this paper, we present the first comprehensive study on operator semantic inconsistencies through API mapping analysis and framework source code inspection, revealing that 47% of sampled operators exhibit inconsistencies across DL frameworks, with source code limited to individual layers and no inter-layer interactions. This suggests that layer-wise source code alignment is feasible. Based on this, we propose ModelX, a source-level cross-framework conversion approach that extends operator API mapping by addressing semantic inconsistencies beyond the API level. Experiments on PyTorch-to-Paddle conversion show that ModelX successfully converts 624 out of 686 sampled operators and outperforms two state-of-the-art converters and three popular large language models. Moreover, ModelX achieves minimal metric gaps (avg. all under 3.4%) across 52 models from vision, text, and audio tasks, indicating strong robustness. ![]() |
|
Lei, Yu |
![]() Xiaolei Ren, Mengfei Ren, Yu Lei, and Jiang Ming (Macau University of Science and Technology, China; University of Alabama in Huntsville, USA; University of Texas at Arlington, USA; Tulane University, USA) ![]() |
|
Leng, Mingyue |
![]() Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Weiguo Chen, Zixuan Dong, Mingyue Leng, and Xiaoguang Mao (National University of Defense Technology, China) The current landscape of binary code summarization predominantly focuses on generating a single summary, which limits its utility and understanding for reverse engineers. Existing approaches often fail to meet the diverse needs of users, such as providing detailed insights into usage patterns, implementation nuances, and design rationale, as observed in the field of source code summarization. This highlights the need for multi-intent binary code summarization to enhance the effectiveness of reverse engineering processes. To address this gap, we propose MiSum, a novel method that leverages multi-modality heterogeneous code graph alignment and learning to integrate both assembly code and pseudo-code. MiSum introduces a unified multi-modality heterogeneous code graph (MM-HCG) that aligns assembly code graphs with pseudo-code graphs, capturing both low-level execution details and high-level structural information. We further propose multi-modality heterogeneous graph learning with heterogeneous mutual attention and message passing, which highlights important code blocks and discovers inter-dependencies across different code forms. Additionally, an intent-aware summary generator with an intent-aware attention mechanism is introduced to produce customized summaries tailored to multiple intents. Extensive experiments, including evaluations across various architectures and optimization levels, demonstrate that MiSum outperforms state-of-the-art baselines in BLEU, METEOR, and ROUGE-L metrics. Human evaluations validate its capability to effectively support reverse engineers in understanding diverse binary code intents, marking a significant advancement in binary code analysis. ![]() ![]() ![]() |
|
Levin, Kyla H. |
![]() Kyla H. Levin, Nicolas van Kempen, Emery D. Berger, and Stephen N. Freund (University of Massachusetts at Amherst, USA; Amazon Web Services, USA; Williams College, USA) ![]() |
|
Li, Anji |
![]() Anji Li, Neng Zhang, Ying Zou, Zhixiang Chen, Jian Wang, and Zibin Zheng (Sun Yat-sen University, China; Central China Normal University, China; Queen's University, Canada; Wuhan University, China) Code snippets are widely used in technical forums to demonstrate solutions to programming problems. They can be leveraged by developers to accelerate problem-solving. However, code snippets often lack concrete types of the APIs used in them, which impedes their understanding and reuse. To enhance the description of a code snippet, a number of approaches have been proposed to infer the types of APIs. Although existing approaches can achieve good performance, they are limited by ignoring information outside the input code snippet (e.g., the descriptions of similar code snippets) that could potentially improve their performance. In this paper, we propose a novel type inference approach, named CKTyper, by leveraging crowdsourcing knowledge in technical posts. The key idea is to generate a relevant context for a target code snippet from the posts containing similar code snippets and then employ the context to promote the type inference with large language models (e.g., ChatGPT). More specifically, we build a crowdsourcing knowledge base (CKB) by extracting code snippets from a large set of posts and index the CKB using Lucene. An API type dictionary is also built from a set of API libraries. Given a code snippet to be inferred, we first retrieve a list of similar code snippets from the indexed CKB. Then, we generate a crowdsourcing knowledge context (CKC) by extracting and summarizing useful content (e.g., API-related sentences) in the posts that contain the similar code snippets. The CKC is subsequently used to improve the type inference of ChatGPT on the input code snippet. Hallucinations of ChatGPT are eliminated by employing the API type dictionary. Evaluation results on two open-source datasets demonstrate the effectiveness and efficiency of CKTyper. CKTyper achieves the best precision/recall of 97.80% and 95.54% on the two datasets, respectively, significantly outperforming three state-of-the-art baselines and ChatGPT. ![]() |
|
Li, Bing |
![]() Shuaiyu Xie, Jian Wang, Maodong Li, Peiran Chen, Jifeng Xuan, and Bing Li (Wuhan University, China; Zhongguancun Laboratory, China) Distributed tracing is a pivotal technique for software operators to understand and diagnose issues within microservice-based systems, offering a comprehensive view of user requests propagated through various services. However, the unprecedented volume of traces imposes expensive storage and analytical burdens on online systems. Conventional tracing approaches typically rely on random sampling with a fixed probability for each trace, which risks missing valuable traces. Several tail-based sampling methods have thus been proposed to sample traces based on their content. Nevertheless, these methods primarily evaluate traces on an individual basis, neglecting the collective attributes of the sample set in terms of comprehensiveness, balance, and consistency. To address these issues, we propose TracePicker, an optimization-based online sampler designed to enhance the quality of sampled data while mitigating storage burden. TracePicker employs a streaming anomaly detector to capture and retain anomalous traces that are crucial for troubleshooting. For normal traces, the sampling process is segmented into quota allocation and group sampling, both formulated as integer programming problems. By solving these problems using dynamic programming and evolution algorithms, TracePicker selects a high-quality subset of data, minimizing overall information loss. Experimental results demonstrate that TracePicker outperforms existing tail-based sampling methods in terms of both sampling quality and time consumption. ![]() |
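A stripped-down version of quota allocation can be written as a small dynamic program: split a sampling budget across trace groups so quotas stay close to proportional shares without exceeding group sizes. The paper's integer-programming formulation is richer; the squared-deviation objective here is an illustrative assumption.

```python
# Simplified sketch of quota allocation as a small dynamic program (the paper's
# integer-programming formulation is richer): split a sampling budget across
# trace groups so that quotas stay close to proportional shares without
# exceeding each group's size.
def allocate_quotas(group_sizes, budget):
    total = sum(group_sizes)
    ideal = [budget * s / total for s in group_sizes]
    # best[b] = (cost, allocation) using the groups processed so far with budget b
    best = {0: (0.0, [])}
    for g, size in enumerate(group_sizes):
        nxt = {}
        for spent, (cost, alloc) in best.items():
            for q in range(0, min(size, budget - spent) + 1):
                c = cost + (q - ideal[g]) ** 2
                key = spent + q
                if key not in nxt or c < nxt[key][0]:
                    nxt[key] = (c, alloc + [q])
        best = nxt
    # pick the cheapest allocation that uses the whole budget (if reachable)
    usable = {b: v for b, v in best.items() if b == budget} or best
    return min(usable.values())[1]

print(allocate_quotas([50, 30, 20], budget=10))  # roughly proportional, e.g. [5, 3, 2]
```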
|
Li, Chun |
![]() Chun Li, Hui Li, Zhong Li, Minxue Pan, and Xuandong Li (Nanjing University, China; Samsung Electronics, China) Due to advancements in graph neural networks, graph learning-based fault localization (GBFL) methods have achieved promising results. However, as these methods are supervised learning paradigms and deep learning is typically data-hungry, they can only be trained on fully labeled large-scale datasets. This is impractical because labeling failed tests is similar to manual fault localization, which is time-consuming and labor-intensive, so only a small portion of failed tests can be labeled within limited budgets. These data labeling limitations lead to sub-optimal effectiveness of supervised GBFL techniques. Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance and address data labeling limitations. However, as these methods are not specifically designed for fault localization, directly utilizing them might lead to sub-optimal effectiveness. In response, we propose a novel semi-supervised GBFL framework, LEGATO. LEGATO first leverages the attention mechanism to identify and augment likely fault-unrelated sub-graphs in unlabeled graphs and then quantifies the suspiciousness distribution of unlabeled graphs to estimate pseudo-labels. Through training the model on augmented unlabeled graphs and pseudo-labels, LEGATO can utilize the unlabeled data to improve the effectiveness of fault localization and address the restrictions in data labeling. In extensive evaluations against 3 baseline SSL methods, LEGATO demonstrates superior performance, outperforming all compared methods. ![]() |
|
Li, Haodong |
![]() Haodong Li, Xiao Cheng, Guohan Zhang, Guosheng Xu, Guoai Xu, and Haoyu Wang (Beijing University of Posts and Telecommunications, China; UNSW, Australia; Harbin Institute of Technology, Shenzhen, China; Huazhong University of Science and Technology, China) Learning-based Android malware detection has earned significant recognition across industry and academia, yet its effectiveness hinges on the accuracy of labeled training data. Manual labeling, being prohibitively expensive, has prompted the use of automated methods, such as leveraging anti-virus engines like VirusTotal, which unfortunately introduces mislabeling, aka "label noise". The state-of-the-art label noise reduction approach, MalWhiteout, can effectively reduce random label noise but underperforms in mitigating real-world emergent malware (EM) label noise stemming from newly emerging Android malware variants overlooked by VirusTotal. To tackle this, we conceptualize EM label noise detection as an anomaly detection problem and introduce a novel tool, MalCleanse, that surpasses MalWhiteout's ability to address EM label noise. MalCleanse combines uncertainty estimation with unsupervised anomaly detection, identifying samples with high uncertainty as mislabeled, thereby enhancing its capability to remove EM label noise. Our experimental results demonstrate a significant reduction in EM label noise by approximately 25.25%, achieving an F1 Score of 80.32% for label noise detection at a noise ratio of 40.61%. Notably, MalCleanse outperforms MalWhiteout with an increase of 40.9% in overall F1 score for mitigating EM label noise. This paper pioneers the integration of deep neural network model uncertainty to refine label accuracy, thereby enhancing the reliability of malware detection systems. Our approach represents a significant step forward in addressing the challenges posed by emergent malware in automated labeling systems. ![]() |
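The combination of uncertainty estimation and unsupervised anomaly detection can be sketched with standard components. This is not MalCleanse's exact pipeline: ensemble disagreement stands in for the paper's model-uncertainty estimate, the two hand-crafted features are assumptions, and scikit-learn's IsolationForest plays the anomaly detector.

```python
# Simplified sketch of combining prediction uncertainty with unsupervised
# anomaly detection to flag likely-mislabeled samples (not MalCleanse's exact
# pipeline): disagreement across an ensemble stands in for model uncertainty,
# and IsolationForest marks the most uncertain samples as anomalous.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

def uncertainty_features(X, y_noisy, n_members=5, seed=0):
    rng = np.random.default_rng(seed)
    probs = []
    for m in range(n_members):
        idx = rng.choice(len(X), size=len(X), replace=True)     # bootstrap member
        clf = RandomForestClassifier(n_estimators=30, random_state=m)
        clf.fit(X[idx], y_noisy[idx])
        probs.append(clf.predict_proba(X)[:, 1])
    probs = np.stack(probs)
    # per-sample features: ensemble disagreement and distance from the given label
    return np.column_stack([probs.std(axis=0), np.abs(probs.mean(axis=0) - y_noisy)])

def flag_label_noise(X, y_noisy, contamination=0.1):
    feats = uncertainty_features(X, y_noisy)
    detector = IsolationForest(contamination=contamination, random_state=0)
    return detector.fit_predict(feats) == -1    # True = suspected mislabeled sample

# Toy usage with random features and a few flipped labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(int)
y_noisy = y.copy()
y_noisy[:10] = 1 - y_noisy[:10]                 # inject label noise
print(flag_label_noise(X, y_noisy).sum(), "samples flagged")
```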
|
Li, Haofeng |
![]() Liqing Cao, Haofeng Li, Chenghang Shi, Jie Lu, Haining Meng, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) ![]() |
|
Li, Heng |
![]() Shayan Noei, Heng Li, and Ying Zou (Queen's University, Canada; Polytechnique Montréal, Canada) Refactoring is a technical approach to increase the internal quality of software without altering its external functionalities. Developers often invest significant effort in refactoring. With the increased adoption of continuous integration and deployment (CI/CD), refactoring activities may vary within and across different releases and be influenced by various release goals. For example, developers may consistently allocate refactoring activities throughout a release, or prioritize new features early on in a release and only pick up refactoring late in a release. Different approaches to allocating refactoring tasks may have different implications for code quality. However, there is a lack of existing research on how practitioners allocate their refactoring activities within a release and their impact on code quality. Therefore, we first empirically study the frequent release-wise refactoring patterns in 207 open-source Java projects and their characteristics. Then, we analyze how these patterns and their transitions affect code quality. We identify four major release-wise refactoring patterns: early active, late active, steady active, and steady inactive. We find that adopting the late active pattern—characterized by gradually increasing refactoring activities as the release approaches—leads to the best code quality. We observe that as projects mature, refactoring becomes more active, reflected in the increasing use of the steady active release-wise refactoring pattern and the decreasing utilization of the steady inactive release-wise refactoring pattern. While the steady active pattern shows improvement in quality-related code metrics (e.g., cohesion), it can also lead to more architectural problems. Additionally, we observe that developers tend to adhere to a single refactoring pattern rather than switching between different patterns. The late active pattern, in particular, can be a safe release-wise refactoring pattern that is used repeatedly. Our results can help practitioners understand existing release-wise refactoring patterns and their effects on code quality, enabling them to utilize the most effective pattern to enhance release quality. ![]() ![]() Roozbeh Aghili, Heng Li, and Foutse Khomh (Polytechnique Montréal, Canada) Software logs, generated during the runtime of software systems, are essential for various development and analysis activities, such as anomaly detection and failure diagnosis. However, the presence of sensitive information in these logs poses significant privacy concerns, particularly regarding Personally Identifiable Information (PII) and quasi-identifiers that could lead to re-identification risks. While general data privacy has been extensively studied, the specific domain of privacy in software logs remains underexplored, with inconsistent definitions of sensitivity and a lack of standardized guidelines for anonymization. To mitigate this gap, this study offers a comprehensive analysis of privacy in software logs from multiple perspectives. We start by performing an analysis of 25 publicly available log datasets to identify potentially sensitive attributes. Based on the result of this step, we focus on three perspectives: privacy regulations, research literature, and industry practices. 
We first analyze key data privacy regulations, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), to understand the legal requirements concerning sensitive information in logs. Second, we conduct a systematic literature review to identify common privacy attributes and practices in log anonymization, revealing gaps in existing approaches. Finally, we survey 45 industry professionals to capture practical insights on log anonymization practices. Our findings shed light on various perspectives of log privacy and reveal industry challenges, such as technical and efficiency issues, while highlighting the need for standardized guidelines. By combining insights from regulatory, academic, and industry perspectives, our study aims to provide a clearer framework for identifying and protecting sensitive information in software logs. ![]() |
|
Li, Hui |
![]() Chun Li, Hui Li, Zhong Li, Minxue Pan, and Xuandong Li (Nanjing University, China; Samsung Electronics, China) Due to advancements in graph neural networks, graph learning-based fault localization (GBFL) methods have achieved promising results. However, as these methods are supervised learning paradigms and deep learning is typically data-hungry, they can only be trained on fully labeled large-scale datasets. This is impractical because labeling failed tests is similar to manual fault localization, which is time-consuming and labor-intensive, so only a small portion of failed tests can be labeled within limited budgets. These data labeling limitations would lead to the sub-optimal effectiveness of supervised GBFL techniques. Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance and address data labeling limitations. However, as these methods are not specifically designed for fault localization, directly utilizing them might lead to sub-optimal effectiveness. In response, we propose a novel semi-supervised GBFL framework, LEGATO. LEGATO first leverages the attention mechanism to identify and augment likely fault-unrelated sub-graphs in unlabeled graphs and then quantifies the suspiciousness distribution of unlabeled graphs to estimate pseudo-labels. Through training the model on augmented unlabeled graphs and pseudo-labels, LEGATO can utilize the unlabeled data to improve the effectiveness of fault localization and address the restrictions in data labeling. In extensive evaluations against three baseline SSL methods, LEGATO demonstrates superior performance, outperforming all compared methods. ![]() |
|
Li, Lian |
![]() Liqing Cao, Haofeng Li, Chenghang Shi, Jie Lu, Haining Meng, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) ![]() |
|
Li, Maodong |
![]() Shuaiyu Xie, Jian Wang, Maodong Li, Peiran Chen, Jifeng Xuan, and Bing Li (Wuhan University, China; Zhongguancun Laboratory, China) Distributed tracing is a pivotal technique for software operators to understand and diagnose issues within microservice-based systems, offering a comprehensive view of user requests propagated through various services. However, the unprecedented volume of traces imposes expensive storage and analytical burdens on online systems. Conventional tracing approaches typically rely on random sampling with a fixed probability for each trace, which risks missing valuable traces. Several tail-based sampling methods have thus been proposed to sample traces based on their content. Nevertheless, these methods primarily evaluate traces on an individual basis, neglecting the collective attributes of the sample set in terms of comprehensiveness, balance, and consistency. To address these issues, we propose TracePicker, an optimization-based online sampler designed to enhance the quality of sampled data while mitigating the storage burden. TracePicker employs a streaming anomaly detector to capture and retain anomalous traces that are crucial for troubleshooting. For normal traces, the sampling process is segmented into quota allocation and group sampling, both formulated as integer programming problems. By solving these problems using dynamic programming and evolutionary algorithms, TracePicker selects a high-quality subset of data, minimizing overall information loss. Experimental results demonstrate that TracePicker outperforms existing tail-based sampling methods in terms of both sampling quality and time consumption. ![]() |
|
Li, Shanping |
![]() Junwei Zhang, Xing Hu, Shan Gao, Xin Xia, David Lo, and Shanping Li (Zhejiang University, China; Huawei, China; Singapore Management University, Singapore) Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, to categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest on two widely-used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of the two datasets, respectively, contain noise, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models. Fine-tuning the four LLMs with the filtered Methods2Test dataset improves their performance by 67% in branch coverage, on average, on the Defects4J benchmark. For the Atlas dataset, the four LLMs improve branch coverage by 39%. Additionally, filtering noise improves bug detection performance, resulting in a 21.42% increase in bugs detected by the generated tests. ![]() |
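The entry above describes rule-based filters for cleaning test generation datasets. The minimal Python sketch below shows what such a filter could look like; the field names and the specific rules (non-trivial body, at least one assertion, the test mentions the focal method's name) are illustrative assumptions, not the paper's actual filter set.

```python
import re

def toy_syntax_relevance_filter(samples):
    """Toy rule-based filter in the spirit of CleanTest's syntax/relevance filters.

    Each sample is a dict with 'focal_method' and 'test_method' source strings
    (field names are assumptions made for this sketch).
    """
    kept = []
    for sample in samples:
        test = sample["test_method"]
        focal_match = re.search(r"\b(\w+)\s*\(", sample["focal_method"])
        has_assert = "assert" in test.lower()          # at least one assertion
        non_trivial = test.count("\n") >= 2            # more than a one-liner
        mentions_focal = bool(focal_match) and focal_match.group(1) in test
        if has_assert and non_trivial and mentions_focal:
            kept.append(sample)  # sample passes all toy rules
    return kept
```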
|
Li, Shanshan |
![]() Xingpei Li, Yan Lei, Zhouyang Jia, Yuanliang Zhang, Haoran Liu, Liqian Chen, Wei Dong, and Shanshan Li (National University of Defense Technology, China; Chongqing University, China) As deep learning (DL) frameworks become widely used, converting models between frameworks is crucial for ecosystem flexibility. However, interestingly, existing model converters commonly focus on syntactic operator API mapping—transpiling operator names and parameters—which results in API compatibility issues (i.e., incompatible parameters, missing operators). These issues arise from semantic inconsistencies due to differences in operator implementation, causing conversion failure or performance degradation. In this paper, we present the first comprehensive study on operator semantic inconsistencies through API mapping analysis and framework source code inspection, revealing that 47% of sampled operators exhibit inconsistencies across DL frameworks, with source code limited to individual layers and no inter-layer interactions. This suggests that layer-wise source code alignment is feasible. Based on this, we propose ModelX, a source-level cross-framework conversion approach that extends operator API mapping by addressing semantic inconsistencies beyond the API level. Experiments on PyTorch-to-Paddle conversion show that ModelX successfully converts 624 out of 686 sampled operators and outperforms two state-of-the-art converters and three popular large language models. Moreover, ModelX achieves minimal metric gaps (avg. all under 3.4%) across 52 models from vision, text, and audio tasks, indicating strong robustness. ![]() ![]() Haoran Liu, Shanshan Li, Zhouyang Jia, Yuanliang Zhang, Linxiao Bai, Si Zheng, Xiaoguang Mao, and Xiangke Liao (National University of Defense Technology, China) ![]() |
|
Li, Shilong |
![]() Yuntianyi Chen, Yuqi Huai, Yirui He, Shilong Li, Changnam Hong, Qi Alfred Chen, and Joshua Garcia (University of California at Irvine, USA) As autonomous driving systems (ADSes) become increasingly complex and integral to daily life, the importance of understanding the nature and mitigation of software bugs in these systems has grown correspondingly. Addressing the challenges of software maintenance in autonomous driving systems (e.g., handling real-time system decisions and ensuring safety-critical reliability) is crucial due to the unique combination of real-time decision-making requirements and the high stakes of operational failures in ADSes. The potential of automated tools in this domain is promising, yet there remains a gap in our comprehension of the challenges faced and the strategies employed during manual debugging and repair of such systems. In this paper, we present an empirical study that investigates bug-fix patterns in ADSes, with the aim of improving reliability and safety. We analyzed the commit histories and bug reports of two major autonomous driving projects, Apollo and Autoware, studying 1,331 bug fixes in terms of their symptoms, root causes, and bug-fix patterns. Our study reveals several dominant bug-fix patterns, including those related to path planning, data flow, and configuration management. Additionally, we find that the frequency distribution of bug-fix patterns varies significantly depending on their nature and types, and that certain categories of bugs are recurrent and more challenging to eliminate. Based on our findings, we propose a hierarchy of ADS bugs and two taxonomies of 15 syntactic bug-fix patterns and 27 semantic bug-fix patterns that offer guidance for bug identification and resolution. We also contribute a benchmark of 1,331 ADS bug-fix instances. ![]() ![]() |
|
Li, Shuhua |
![]() Ruichao Liang, Jing Chen, Ruochen Cao, Kun He, Ruiying Du, Shuhua Li, Zheng Lin, and Cong Wu (Wuhan University, China; University of Hong Kong, Hong Kong) Smart contracts, as Turing-complete programs managing billions of assets in decentralized finance, are prime targets for attackers. While fuzz testing seems effective for detecting vulnerabilities in these programs, we identify several significant challenges when targeting smart contracts: (i) the stateful nature of these contracts requires stateful exploration, but current fuzzers rely on transaction sequences to manipulate contract states, making the process inefficient; (ii) contract execution is influenced by the continuously changing blockchain environment, yet current fuzzers are limited to local deployments, failing to test contracts in real-world scenarios. These challenges hinder current fuzzers from uncovering hidden vulnerabilities, i.e., those concealed in deep contract states and specific blockchain environments. In this paper, we present SmartShot, a mutable snapshot-based fuzzer to hunt hidden vulnerabilities within smart contracts. We innovatively formulate contract states and blockchain environments as directly fuzzable elements and design mutable snapshots to quickly restore and mutate these elements. SmartShot features a symbolic taint analysis-based mutation strategy along with double validation to soundly guide the state mutation. SmartShot mutates blockchain environments using contracts’ historical on-chain states, providing real-world execution contexts. We propose a snapshot checkpoint mechanism to integrate mutable snapshots into SmartShot’s fuzzing loops. These innovations enable SmartShot to effectively fuzz contract states, test contracts across varied and realistic blockchain environments, and support on-chain fuzzing. Experimental results show that SmartShot is effective in detecting hidden vulnerabilities, achieving the highest code coverage and the lowest false positive rate. SmartShot is 4.8× to 20.2× faster than state-of-the-art tools, identifying 2,150 vulnerable contracts out of 42,738 real-world contracts, which is 2.1× to 13.7× more than other tools. SmartShot has demonstrated its real-world impact by detecting vulnerabilities that are only discoverable on-chain and uncovering 24 zero-day vulnerabilities in the latest 10,000 deployed contracts. ![]() |
|
Li, Shuqing |
![]() Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael Lyu (Chinese University of Hong Kong, China; Singapore Management University, Singapore) ![]() |
|
Li, Song |
![]() Yuan Li, Peisen Yao, Kan Yu, Chengpeng Wang, Yaoyang Ye, Song Li, Meng Luo, Yepang Liu, and Kui Ren (Zhejiang University, China; Ant Group, China; Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) ![]() |
|
Li, Tianlin |
![]() Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, and Yang Liu (Nanyang Technological University, Singapore; Peking University, China; King's College London, UK; Fudan University, China) Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods. ![]() |
|
Li, Xingpei |
![]() Xingpei Li, Yan Lei, Zhouyang Jia, Yuanliang Zhang, Haoran Liu, Liqian Chen, Wei Dong, and Shanshan Li (National University of Defense Technology, China; Chongqing University, China) As deep learning (DL) frameworks become widely used, converting models between frameworks is crucial for ecosystem flexibility. However, interestingly, existing model converters commonly focus on syntactic operator API mapping—transpiling operator names and parameters—which results in API compatibility issues (i.e., incompatible parameters, missing operators). These issues arise from semantic inconsistencies due to differences in operator implementation, causing conversion failure or performance degradation. In this paper, we present the first comprehensive study on operator semantic inconsistencies through API mapping analysis and framework source code inspection, revealing that 47% of sampled operators exhibit inconsistencies across DL frameworks, with source code limited to individual layers and no inter-layer interactions. This suggests that layer-wise source code alignment is feasible. Based on this, we propose ModelX, a source-level cross-framework conversion approach that extends operator API mapping by addressing semantic inconsistencies beyond the API level. Experiments on PyTorch-to-Paddle conversion show that ModelX successfully converts 624 out of 686 sampled operators and outperforms two state-of-the-art converters and three popular large language models. Moreover, ModelX achieves minimal metric gaps (avg. all under 3.4%) across 52 models from vision, text, and audio tasks, indicating strong robustness. ![]() |
|
Li, Xinyue |
![]() Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, and Yang Liu (Nanyang Technological University, Singapore; Peking University, China; King's College London, UK; Fudan University, China) Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods. ![]() |
|
Li, Xuandong |
![]() Chun Li, Hui Li, Zhong Li, Minxue Pan, and Xuandong Li (Nanjing University, China; Samsung Electronics, China) Due to advancements in graph neural networks, graph learning-based fault localization (GBFL) methods have achieved promising results. However, as these methods are supervised learning paradigms and deep learning is typically data-hungry, they can only be trained on fully labeled large-scale datasets. This is impractical because labeling failed tests is similar to manual fault localization, which is time-consuming and labor-intensive, so only a small portion of failed tests can be labeled within limited budgets. These data labeling limitations would lead to the sub-optimal effectiveness of supervised GBFL techniques. Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance and address data labeling limitations. However, as these methods are not specifically designed for fault localization, directly utilizing them might lead to sub-optimal effectiveness. In response, we propose a novel semi-supervised GBFL framework, LEGATO. LEGATO first leverages the attention mechanism to identify and augment likely fault-unrelated sub-graphs in unlabeled graphs and then quantifies the suspiciousness distribution of unlabeled graphs to estimate pseudo-labels. Through training the model on augmented unlabeled graphs and pseudo-labels, LEGATO can utilize the unlabeled data to improve the effectiveness of fault localization and address the restrictions in data labeling. In extensive evaluations against three baseline SSL methods, LEGATO demonstrates superior performance, outperforming all compared methods. ![]() ![]() Shaoheng Cao, Renyi Chen, Wenhua Yang, Minxue Pan, and Xuandong Li (Nanjing University, China; Samsung Electronics, China; Nanjing University of Aeronautics and Astronautics, China) Model-based testing (MBT) has been an important methodology in software engineering, attracting extensive research attention for over four decades. However, despite its academic acclaim, studies examining the impact of MBT in industrial environments—particularly regarding its extended effects—remain limited and yield unclear results. This gap may contribute to the challenges in establishing a study environment for implementing and applying MBT in production settings to evaluate its impact over time. To bridge this gap, we collaborated with an industrial partner to undertake a comprehensive, longitudinal empirical study employing mixed methods. Over two months, we implemented our MBT tool within the corporation, assessing the immediate and extended effectiveness and efficiency of MBT compared to script-writing-based testing. Through a mix of quantitative and qualitative methods—spanning controlled experiments, questionnaire surveys, and interviews—our study uncovers several insightful findings. These include differences in effectiveness and maintainability between immediate and extended MBT application, the evolving perceptions and expectations of engineers regarding MBT, and more. Leveraging these insights, we propose actionable implications for both the academic and industrial communities, aimed at bolstering confidence in MBT adoption and investment for software testing purposes. ![]() |
|
Li, Yi |
![]() Yi Li, Hridya Dhulipala, Aashish Yadavally, Xiaokai Rong, Shaohua Wang, and Tien N. Nguyen (University of Texas at Dallas, USA; Central University of Finance and Economics, China) Although Large Language Models (LLMs) are highly proficient in understanding source code and descriptive texts, they have limitations in reasoning about dynamic program behaviors, such as execution trace and code coverage prediction, and runtime error prediction, which usually require actual program execution. To advance the ability of LLMs in predicting dynamic behaviors, we leverage the strengths of both approaches, Program Analysis (PA) and LLM, in building PredEx, a predictive executor for Python. Our principle is a blended analysis between PA and LLM to use PA to guide the LLM in predicting execution traces. We break down the task of predictive execution into smaller sub-tasks and leverage the deterministic nature when an execution order can be deterministically decided. When it is not certain, we use predictive backward slicing per variable, i.e., slicing the prior trace down to only the parts that affect each variable; handling each variable separately breaks the valuation prediction up into significantly simpler problems. Our empirical evaluation on real-world datasets shows that PredEx achieves 31.5–47.1% relatively higher accuracy in predicting full execution traces than the state-of-the-art models. It also produces 8.6–53.7% more correct execution trace prefixes than those baselines. In predicting next executed statements, its relative improvement over the baselines is 15.7–102.3%. Finally, we show PredEx’s usefulness in two tasks: static code coverage analysis and static prediction of run-time errors for (in)complete code. ![]() |
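The per-variable backward slicing idea described above can be illustrated with a small sketch. The trace representation here (a list of line number, assigned variable, used variables tuples for straight-line execution) is a simplified stand-in chosen for this example, not PredEx's actual data structure.

```python
def backward_slice(trace, variable):
    """Keep only the trace entries that can affect `variable`.

    `trace` is a list of (line_no, assigned_var, used_vars) tuples in execution
    order; the function walks the trace backwards, tracking which variables are
    still relevant to the variable of interest.
    """
    relevant = {variable}
    kept = []
    for line_no, assigned, used in reversed(trace):
        if assigned in relevant:
            kept.append((line_no, assigned, used))
            relevant.discard(assigned)   # this definition satisfies the dependency
            relevant.update(used)        # ...but its inputs become relevant
    return list(reversed(kept))

# Example: only lines defining `c` and its inputs are kept when slicing on `c`.
trace = [(1, "a", set()), (2, "b", {"a"}), (3, "x", set()), (4, "c", {"b"})]
print(backward_slice(trace, "c"))  # [(1, 'a', set()), (2, 'b', {'a'}), (4, 'c', {'b'})]
```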
|
Li, Yichen |
![]() Weibin Wu, Haoxuan Hu, Zhaoji Fan, Yitong Qiao, Yizhan Huang, Yichen Li, Zibin Zheng, and Michael Lyu (Sun Yat-sen University, China; Chinese University of Hong Kong, China) Deep learning (DL) has revolutionized various software engineering tasks. Particularly, the emergence of AI code generators has pushed the boundaries of automatic programming to synthesize entire programs based on user-defined specifications in natural language. However, it remains unclear whether these AI code generators rely on the copy-and-paste programming practice, raising code clone concerns. In this work, to comprehensively study the code cloning behavior of AI code generators, we conduct an empirical study on three state-of-the-art commercial AI code generators to investigate the existence of all types of clones, which remains underexplored. Our experimental results show that the total Type-1 and Type-2 clone rates of the state-of-the-art commercial AI code generators can reach up to 7.50%, indicating marked code clone issues. Furthermore, we observe that AI code generators risk infringing copyrights and propagating buggy and vulnerable code as a result of cloning, and that they show a certain degree of stability in generating code clones. ![]() ![]() Yun Peng, Jun Wan, Yichen Li, and Xiaoxue Ren (Chinese University of Hong Kong, China; Zhejiang University, China) Code generation has greatly improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions given detailed descriptions in natural language. Many research efforts are being devoted to improving the correctness of LLM-generated code, and many benchmarks are proposed to evaluate the correctness comprehensively. Despite the focus on correctness, the time efficiency of LLM-generated code solutions is under-explored. Current correctness benchmarks are not suitable for time efficiency evaluation since their test cases cannot effectively distinguish the time efficiency of different code solutions. In addition, current execution time measurement is neither stable nor comprehensive, threatening the validity of the time efficiency evaluation. To address the challenges in the time efficiency evaluation of code generation, we propose COFFE, a code generation benchmark for evaluating the time efficiency of LLM-generated code solutions. COFFE contains 398 and 358 problems for function-level and file-level code generation, respectively. To improve distinguishability, we design a novel stressful test case generation approach with contracts and two new formats of test cases to improve the accuracy of generation. For the time evaluation metric, we propose efficient@k, based on CPU instruction count, to ensure a stable and sound comparison between different solutions. We evaluate 14 popular LLMs on COFFE and identify four findings. Based on these findings, we draw some implications for LLM researchers and software practitioners to facilitate future research and usage of LLMs in code generation. ![]() |
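As a rough illustration of an instruction-count-based efficiency metric in the spirit of the efficient@k metric mentioned above, the sketch below takes a pass@k-style reading of the idea; this is not the paper's exact definition, and the instruction counts and budget are assumed to come from some external measurement harness.

```python
import math

def efficiency_at_k(samples, k, instruction_budget):
    """Pass@k-style estimator where a sampled solution only counts if it is
    both correct and within an instruction-count budget.

    `samples` is a list of (is_correct, instruction_count) pairs; uses the
    unbiased estimator 1 - C(n - c, k) / C(n, k), with c the number of
    correct-and-efficient samples among n.
    """
    n = len(samples)
    c = sum(1 for ok, insns in samples if ok and insns <= instruction_budget)
    if n - c < k:
        return 1.0  # every size-k draw contains a correct, efficient solution
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 5 sampled solutions for one problem; 2 are correct and efficient.
samples = [(True, 9_200), (True, 41_000), (False, 8_100), (True, 10_500), (False, 30_000)]
print(efficiency_at_k(samples, k=2, instruction_budget=12_000))  # 0.7
```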
|
Li, Yuan |
![]() Yuan Li, Peisen Yao, Kan Yu, Chengpeng Wang, Yaoyang Ye, Song Li, Meng Luo, Yepang Liu, and Kui Ren (Zhejiang University, China; Ant Group, China; Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) ![]() |
|
Li, Yue |
![]() Yue Li, Bohan Liu, Ting Zhang, Zhiqi Wang, David Lo, Lanxin Yang, Jun Lyu, and He Zhang (Nanjing University, China; Singapore Management University, Singapore) ![]() |
|
Li, Yuetong |
![]() Dezhi Ran, Yuan Cao, Yuzhe Guo, Yuetong Li, Mengzhou Wu, Simin Chen, Wei Yang, and Tao Xie (Peking University, China; Beijing Jiaotong University, China; University of Chicago, USA; University of Texas at Dallas, USA) ![]() |
|
Li, Zhen |
![]() Junyao Ye, Zhen Li, Xi Tang, Deqing Zou, Shouhuai Xu, Weizhong Qiang, and Hai Jin (Huazhong University of Science and Technology, China; University of Colorado at Colorado Springs, USA) ![]() |
|
Li, Zhengda |
![]() Haowen Yang, Zhengda Li, Zhiqing Zhong, Xiaoying Tang, and Pinjia He (Chinese University of Hong Kong, Shenzhen, China) With the increasing demand for handling large-scale and complex data, data science (DS) applications often suffer from long execution time and rapid RAM exhaustion, which leads to many serious issues like unbearable delays and crashes in financial transactions. As popular DS libraries are frequently used in these applications, their performance bugs (PBs) are a major contributing factor to these issues, making it crucial to address them for improving overall application performance. However, PBs in popular DS libraries remain largely unexplored. To address this gap, we conducted a study of 202 PBs collected from seven popular DS libraries. Our study examined the impact of PBs and proposed a taxonomy for common root causes. We found over half of the PBs arise from inefficient data processing operations, especially within data structure. We also explored the effort required to locate their root causes and fix these bugs, along with the challenges involved. Notably, around 20% of the PBs could be fixed using simple strategies (e.g. Conditions Optimizing), suggesting the potential for automated repair approaches. Our findings highlight the severity of PBs in core DS libraries and offer insights for developing high-performance libraries and detecting PBs. Furthermore, we derived test rules from our identified root causes, identifying eight PBs, of which four were confirmed, demonstrating the practical utility of our findings. ![]() |
|
Li, Zhong |
![]() Chun Li, Hui Li, Zhong Li, Minxue Pan, and Xuandong Li (Nanjing University, China; Samsung Electronics, China) Due to advancements in graph neural networks, graph learning-based fault localization (GBFL) methods have achieved promising results. However, as these methods are supervised learning paradigms and deep learning is typically data-hungry, they can only be trained on fully labeled large-scale datasets. This is impractical because labeling failed tests is similar to manual fault localization, which is time-consuming and labor-intensive, so only a small portion of failed tests can be labeled within limited budgets. These data labeling limitations would lead to the sub-optimal effectiveness of supervised GBFL techniques. Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance and address data labeling limitations. However, as these methods are not specifically designed for fault localization, directly utilizing them might lead to sub-optimal effectiveness. In response, we propose a novel semi-supervised GBFL framework, LEGATO. LEGATO first leverages the attention mechanism to identify and augment likely fault-unrelated sub-graphs in unlabeled graphs and then quantifies the suspiciousness distribution of unlabeled graphs to estimate pseudo-labels. Through training the model on augmented unlabeled graphs and pseudo-labels, LEGATO can utilize the unlabeled data to improve the effectiveness of fault localization and address the restrictions in data labeling. In extensive evaluations against three baseline SSL methods, LEGATO demonstrates superior performance, outperforming all compared methods. ![]() |
|
Li, Zhongqi |
![]() Yujia Chen, Yang Ye, Zhongqi Li, Yuchi Ma, and Cuiyun Gao (Harbin Institute of Technology, Shenzhen, China; Huawei Cloud Computing Technologies, China) Large code models (LCMs) have remarkably advanced the field of code generation. Despite their impressive capabilities, they still face practical deployment issues, such as high inference costs, limited accessibility of proprietary LCMs, and adaptability issues of ultra-large LCMs. These issues highlight the critical need for more accessible, lightweight yet effective LCMs. Knowledge distillation (KD) offers a promising solution, which transfers the programming capabilities of larger, advanced LCMs (Teacher) to smaller, less powerful LCMs (Student). However, existing KD methods for code intelligence often lack consideration of fault domain knowledge and rely on static seed knowledge, leading to degraded programming capabilities of student models. In this paper, we propose a novel Self-Paced knOwledge DistillAtion framework, named SODA, aiming at developing lightweight yet effective student LCMs via adaptively transferring the programming capabilities from advanced teacher LCMs. SODA consists of three stages in one cycle: (1) Correct-and-Fault Knowledge Delivery stage aims at improving the student model’s capability to recognize errors while ensuring its basic programming skill during the knowledge transferring, which involves correctness-aware supervised learning and fault-aware contrastive learning methods. (2) Multi-View Feedback stage aims at measuring the quality of results generated by the student model from two views, including model-based and static tool-based measurement, for identifying the difficult questions; (3) Feedback-based Knowledge Update stage aims at updating the student model adaptively by generating new questions at different difficulty levels, in which the difficulty levels are categorized based on the feedback in the second stage. By performing the distillation process iteratively, the student model is continuously refined through learning more advanced programming skills from the teacher model. We compare SODA with four state-of-the-art KD approaches on three widely-used code generation benchmarks with different programming languages. Experimental results show that SODA improves the student model by 65.96% in terms of average Pass@1, outperforming the best baseline PERsD by 29.85%. Based on the proposed SODA framework, we develop SodaCoder, a series of lightweight yet effective LCMs with ∼7B parameters, which outperform 15 LCMs with less than or equal to 16B parameters. Notably, SodaCoder-DS-6.7B, built on DeepseekCoder-6.7B, even surpasses the prominent ChatGPT on average Pass@1 across seven programming languages (66.4 vs. 61.3). ![]() |
|
Li, Zihao |
![]() Kunsong Zhao, Zihao Li, Weimin Chen, Xiapu Luo, Ting Chen, Guozhu Meng, and Yajin Zhou (Hong Kong Polytechnic University, China; University of Electronic Science and Technology of China, China; Institute of Information Engineering at Chinese Academy of Sciences, China; Zhejiang University, China) WebAssembly has become the preferred smart contract format for various blockchain platforms due to its high portability and near-native execution speed. To effectively understand WebAssembly contracts, it is crucial to recover high-level type signatures because of the limited type information that WebAssembly provides. However, existing studies on type inference for smart contracts primarily center around Ethereum Virtual Machine bytecode, which is not applicable to WebAssembly owing to their differing targets and runtime semantics. This paper introduces WasmHint, a novel solution that leverages deep learning inference to automatically recover high-level parameter and return types from WebAssembly contracts. More specifically, WasmHint constructs a wCFG representation to clarify dependencies within WebAssembly code and simulates its execution to capture type-related operational information. By learning comprehensive code semantics, it infers parameter and return types, with a semantic corrector designed to enhance information coordination. We conduct experiments on a newly constructed dataset containing 77,208 WebAssembly contract functions. The results demonstrate that WasmHint achieves inference accuracies of 80.0% for parameter types and 95.8% for return types, with average improvements of 86.6% and 34.0% over the baseline methods, respectively. ![]() |
|
Li, Zongjie |
![]() Chaozheng Wang, Jia Feng, Shuzheng Gao, Cuiyun Gao, Zongjie Li, Ting Peng, Hailiang Huang, Yuetang Deng, and Michael Lyu (Chinese University of Hong Kong, China; University of Electronic Science and Technology of China, China; Hong Kong University of Science and Technology, China; Tencent, China) Large Code Models (LCMs) have demonstrated remarkable effectiveness across various code intelligence tasks. Supervised fine-tuning is essential to optimize their performance for specific downstream tasks. Compared with the traditional full-parameter fine-tuning (FFT) method, Parameter-Efficient Fine-Tuning (PEFT) methods can train LCMs with substantially reduced resource consumption and have gained widespread attention among researchers and practitioners. While existing studies have explored PEFT methods for code intelligence tasks, they have predominantly focused on a limited subset of scenarios, such as code generation with publicly available datasets, leading to constrained generalizability of the findings. To mitigate this limitation, we conduct a comprehensive study of the effectiveness of PEFT methods, involving five code intelligence tasks that contain both public and private data. Our extensive experiments reveal a considerable performance gap between PEFT methods and FFT, which is contrary to the findings of existing studies. We also find that this disparity is particularly pronounced in tasks involving private data. To improve the tuning performance for LCMs while reducing resource utilization during training, we propose a Layer-Wise Optimization (LWO) strategy in this paper. LWO incrementally updates the parameters of each layer in the whole model architecture, without introducing any additional components or inference overhead. Experiments across five LCMs and five code intelligence tasks demonstrate that LWO trains LCMs more effectively and efficiently compared to previous PEFT methods, with significant improvements in tasks using private data. For instance, in the line-level code completion task using our private code repositories, LWO outperforms the state-of-the-art LoRA method by 22% and 12% in terms of accuracy and BLEU scores, respectively. Furthermore, LWO can enable more efficient LCM tuning, reducing the training time by an average of 42.7% compared to LoRA. ![]() |
|
Lian, Xiaoli |
![]() Xiaoli Lian, Shuaisong Wang, Hanyu Zou, Fang Liu, Jiajun Wu, and Li Zhang (Beihang University, China) In the current software-driven era, ensuring privacy and security is critical. Despite this, the specification of security requirements for software is still largely a manual and labor-intensive process. Engineers are tasked with analyzing potential security threats based on functional requirements (FRs), a procedure prone to omissions and errors due to the expertise gap between cybersecurity experts and software engineers. To bridge this gap, we introduce F2SRD (Function-to-Security Requirements Derivation), an automated approach that proactively derives security requirements (SRs) from functional specifications under the guidance of relevant security verification requirements (VRs) drawn from the well recognized OWASP Application Security Verification Standard (ASVS). F2SRD operates in two main phases: Initially, we develop a VR retriever trained on a custom database of FR-VR pairs, enabling it to adeptly select applicable VRs from ASVS. This targeted retrieval informs the precise and actionable formulation of SRs. Subsequently, these VRs are used to construct structured prompts that direct GPT-4 in generating SRs. Our comparative analysis against two established models demonstrates F2SRD's enhanced performance in producing SRs that excel in inspiration, diversity, and specificity—essential attributes for effective security requirement generation. By leveraging security verification standards, we believe that the generated SRs are not only more focused but also resonate stronger with the needs of engineers. ![]() ![]() Xin Tan, Xiao Long, Yinghao Zhu, Lin Shi, Xiaoli Lian, and Li Zhang (Beihang University, China) Onboarding newcomers is vital for the sustainability of open-source software (OSS) projects. To lower barriers and increase engagement, OSS projects have dedicated experts who provide guidance for newcomers. However, timely responses are often hindered by experts' busy schedules. The recent rapid advancements of AI in software engineering have brought opportunities to leverage AI as a substitute for expert mentoring. However, the potential role of AI as a comprehensive mentor throughout the entire onboarding process remains unexplored. To identify design strategies of this "AI mentor", we applied Design Fiction as a participatory method with 19 OSS newcomers. We investigated their current onboarding experience and elicited 32 design strategies for future AI mentor. Participants envisioned AI mentor being integrated into OSS platforms like GitHub, where it could offer assistance to newcomers, such as "recommending projects based on personalized requirements" and "assessing and categorizing project issues by difficulty". We also collected participants' perceptions of a prototype, named "OSSerCopilot", that implemented the envisioned strategies. They found the interface useful and user-friendly, showing a willingness to use it in the future, which suggests the design strategies are effective. Finally, in order to identify the gaps between our design strategies and current research, we conducted a comprehensive literature review, evaluating the extent of existing research support for this concept. We find that research is relatively scarce in certain areas where newcomers highly anticipate AI mentor assistance, such as "discovering an interested project". 
Our study has the potential to revolutionize the current newcomer-expert mentorship and provides valuable insights for researchers and tool designers aiming to develop and enhance AI mentor systems. ![]() |
|
Liang, Hanzhong |
![]() Xing Su, Hanzhong Liang, Hao Wu, Ben Niu, Fengyuan Xu, and Sheng Zhong (Nanjing University, China; Institute of Information Engineering at Chinese Academy of Sciences, China) Understanding Ethereum smart contract bytecode is essential for ensuring cryptoeconomics security. However, existing decompilers primarily convert bytecode into pseudocode, which is not easily comprehensible for general users, potentially leading to misunderstanding of contract behavior and increased vulnerability to scams or exploits. In this paper, we propose DiSCo, the first LLM-based EVM decompilation pipeline, which aims to enable LLMs to understand the opaque bytecode and lift it into smart contract code. DiSCo introduces three core technologies. First, a logic-invariant intermediate representation is proposed to reproject the low-level bytecode into high-level abstracted units. The second technique involves semantic enhancement based on a novel type-aware graph model to infer variables stripped during compilation, enhancing the lifting effect. The third technology is a flexible method incorporating code specifications to construct LLM-comprehensible prompts for source code generation. Extensive experiments illustrate that our generated code achieves a high compilability rate of 75%, with a differential fuzzing pass rate averaging 50%. Manual validation results further indicate that the generated Solidity contracts significantly outperform baseline methods in tasks such as code comprehension and attack reproduction. ![]() |
|
Liang, Jenny T. |
![]() Jenny T. Liang, Melissa Lin, Nikitha Rao, and Brad Myers (Carnegie Mellon University, USA) ![]() |
|
Liang, Keyu |
![]() Keyu Liang, Zhongxin Liu, Chao Liu, Zhiyuan Wan, David Lo, and Xiaohu Yang (Zhejiang University, China; Chongqing University, China; Singapore Management University, Singapore) Code search is a crucial task in software engineering, aiming to retrieve code snippets that are semantically relevant to a natural language query. Recently, Pre-trained Language Models (PLMs) have shown remarkable success and are widely adopted for code search tasks. However, PLM-based methods often struggle in cross-domain scenarios. When applied to a new domain, they typically require extensive fine-tuning with substantial data. Even worse, the data scarcity problem in new domains often forces these methods to operate in a zero-shot setting, resulting in a significant decline in performance. RAPID, which generates synthetic data for model fine-tuning, is currently the only effective method for zero-shot cross-domain code search. Despite its effectiveness, RAPID demands substantial computational resources for fine-tuning and needs to maintain specialized models for each domain, underscoring the need for a zero-shot, fine-tuning-free approach for cross-domain code search. The key to tackling zero-shot cross-domain code search lies in bridging the gaps among domains. In this work, we propose to break the query-code matching process of code search into two simpler tasks: query-comment matching and code-code matching. We first conduct an empirical study to investigate the effectiveness of these two matching schemas in zero-shot cross-domain code search. Our findings highlight the strong complementarity among the three matching schemas, i.e., query-code, query-comment, and code-code matching. Based on the findings, we propose CodeBridge, a zero-shot, fine-tuning-free approach for cross-domain code search. Specifically, CodeBridge first employs zero-shot prompting to guide Large Language Models (LLMs) to generate a comment for each code snippet in the codebase and produce a code for each query. Subsequently, it encodes queries, code snippets, comments, and the generated code using PLMs and assesses similarities through three matching schemas: query-code, query-comment, and generated code-code. Lastly, CodeBridge leverages a sampling-based fusion approach that combines these three similarity scores to rank the final search outcomes. Experimental results show that our approach outperforms the state-of-the-art PLM-based code search approaches, i.e., CoCoSoDa and UniXcoder, by an average of 21.4% and 24.9% in MRR, respectively, across three datasets. Our approach also yields results that are better than or comparable to those of the zero-shot cross-domain code search approach RAPID, which requires costly fine-tuning. ![]() ![]() |
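The entry above describes fusing three matching schemas into a single ranking. The sketch below shows one simple way such a fusion could be wired up with a fixed weighted sum of cosine similarities; CodeBridge itself uses a sampling-based fusion, and the weights and field names here are assumptions for illustration only.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rank_candidates(query_vec, generated_code_vec, candidates, weights=(1/3, 1/3, 1/3)):
    """Rank codebase entries by combining query-code, query-comment, and
    generated code-code similarities with a simple weighted sum.

    Each candidate is a dict holding PLM embeddings of its code snippet and of
    the LLM-generated comment for that snippet (names are illustrative).
    """
    w_qc, w_qm, w_cc = weights
    scored = []
    for cand in candidates:
        s_query_code = cosine(query_vec, cand["code_vec"])            # query-code
        s_query_comment = cosine(query_vec, cand["comment_vec"])      # query-comment
        s_code_code = cosine(generated_code_vec, cand["code_vec"])    # generated code-code
        fused = w_qc * s_query_code + w_qm * s_query_comment + w_cc * s_code_code
        scored.append((fused, cand["id"]))
    return [cid for _, cid in sorted(scored, reverse=True)]
```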
|
Liang, Ruichao |
![]() Ruichao Liang, Jing Chen, Ruochen Cao, Kun He, Ruiying Du, Shuhua Li, Zheng Lin, and Cong Wu (Wuhan University, China; University of Hong Kong, Hong Kong) Smart contracts, as Turing-complete programs managing billions of assets in decentralized finance, are prime targets for attackers. While fuzz testing seems effective for detecting vulnerabilities in these programs, we identify several significant challenges when targeting smart contracts: (i) the stateful nature of these contracts requires stateful exploration, but current fuzzers rely on transaction sequences to manipulate contract states, making the process inefficient; (ii) contract execution is influenced by the continuously changing blockchain environment, yet current fuzzers are limited to local deployments, failing to test contracts in real-world scenarios. These challenges hinder current fuzzers from uncovering hidden vulnerabilities, i.e., those concealed in deep contract states and specific blockchain environments. In this paper, we present SmartShot, a mutable snapshot-based fuzzer to hunt hidden vulnerabilities within smart contracts. We innovatively formulate contract states and blockchain environments as directly fuzzable elements and design mutable snapshots to quickly restore and mutate these elements. SmartShot features a symbolic taint analysis-based mutation strategy along with double validation to soundly guide the state mutation. SmartShot mutates blockchain environments using contracts’ historical on-chain states, providing real-world execution contexts. We propose a snapshot checkpoint mechanism to integrate mutable snapshots into SmartShot’s fuzzing loops. These innovations enable SmartShot to effectively fuzz contract states, test contracts across varied and realistic blockchain environments, and support on-chain fuzzing. Experimental results show that SmartShot is effective in detecting hidden vulnerabilities, achieving the highest code coverage and the lowest false positive rate. SmartShot is 4.8× to 20.2× faster than state-of-the-art tools, identifying 2,150 vulnerable contracts out of 42,738 real-world contracts, which is 2.1× to 13.7× more than other tools. SmartShot has demonstrated its real-world impact by detecting vulnerabilities that are only discoverable on-chain and uncovering 24 zero-day vulnerabilities in the latest 10,000 deployed contracts. ![]() |
|
Liao, Xiangke |
![]() Haoran Liu, Shanshan Li, Zhouyang Jia, Yuanliang Zhang, Linxiao Bai, Si Zheng, Xiaoguang Mao, and Xiangke Liao (National University of Defense Technology, China) ![]() |
|
Liau, Frank |
![]() Jiongchi Yu, Xie Xiaofei, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, and Frank Liau (Singapore Management University, Singapore; Tianjin University, China; Zhejiang University, China; Shanghai Jiao Tong University, China; University of Tokyo, Japan; University of Alberta, Canada; Southern Cross University, Australia; Government Technology Agency of Singapore, Singapore) ![]() |
|
Limpanukorn, Ben |
![]() Jiyuan Wang, Yuxin Qiu, Ben Limpanukorn, Hong Jin Kang, Qian Zhang, and Miryung Kim (University of California at Los Angeles, USA; University of California at Riverside, USA) In recent years, the MLIR framework has had explosive growth due to the need for extensible deep learning compilers for hardware accelerators. Such examples include Triton, CIRCT, and ONNX-MLIR. MLIR compilers introduce significant complexities in localizing bugs or inefficiencies because of their layered optimization and transformation process with compilation passes. While existing delta debugging techniques can be used to identify a minimum subset of IR code that reproduces a given bug symptom, their naive application to MLIR is time-consuming because real-world MLIR compilers usually involve a large number of compilation passes. Compiler developers must identify a minimized set of relevant compilation passes to reduce the footprint of MLIR compiler code to be inspected for a bug fix. We propose DuoReduce, a dual-dimensional reduction approach for MLIR bug localization. DuoReduce leverages three key ideas in tandem to design an efficient MLIR delta debugger. First, DuoReduce reduces compiler passes that are irrelevant to the bug by identifying ordering dependencies among the different compilation passes. Second, DuoReduce uses MLIR-semantics-aware transformations to expedite IR code reduction. Finally, DuoReduce leverages cross-dependence between the IR code dimension and the compilation pass dimension by accounting for which IR code segments are related to which compilation passes to reduce unused passes. Experiments with three large-scale MLIR compiler projects find that DuoReduce outperforms syntax-aware reducers such as Perses and Vulcan in terms of IR code reduction by 31.6% and 21.5% respectively. If one uses these reducers by enumerating all possible compilation passes (on average 18 passes), it could take up to 145 hours. By identifying ordering dependencies among compilation passes, DuoReduce reduces this time to 9.5 minutes. By identifying which compilation passes are unused for compiling reduced IR code, DuoReduce reduces the number of passes by 14.6%. This translates to not needing to examine 281 lines of MLIR compiler code on average to fix the bugs. DuoReduce has the potential to significantly reduce debugging effort in MLIR compilers, which serves as the foundation for the current landscape of machine learning and hardware accelerators. ![]() |
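The pass-dimension reduction that motivates the entry above can be illustrated with a plain greedy one-at-a-time reduction; DuoReduce additionally exploits ordering dependencies among passes and the IR-code dimension, which this sketch omits, and the oracle function is an assumed helper.

```python
def reduce_passes(passes, still_fails):
    """Greedily drop compilation passes that are not needed to reproduce a bug.

    `passes` is an ordered list of pass names; `still_fails(pass_list)` is a
    user-supplied oracle that re-runs the compiler pipeline with that pass list
    and reports whether the bug symptom still shows up (hypothetical helper).
    Returns a pass list that is 1-minimal with respect to single-pass removal.
    """
    reduced = list(passes)
    changed = True
    while changed:
        changed = False
        for i in range(len(reduced)):
            candidate = reduced[:i] + reduced[i + 1:]
            if still_fails(candidate):  # pass i is irrelevant to the bug
                reduced = candidate
                changed = True
                break
    return reduced
```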
|
Lin, Melissa |
![]() Jenny T. Liang, Melissa Lin, Nikitha Rao, and Brad Myers (Carnegie Mellon University, USA) ![]() |
|
Lin, Xiaodong |
![]() Shuwei Song, Ting Chen, Ao Qiao, Xiapu Luo, Leqing Wang, Zheyuan He, Ting Wang, Xiaodong Lin, Peng He, Wensheng Zhang, and Xiaosong Zhang (University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada) Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them. ![]() |
|
Lin, Yun |
![]() Jiongchi Yu, Xie Xiaofei, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, and Frank Liau (Singapore Management University, Singapore; Tianjin University, China; Zhejiang University, China; Shanghai Jiao Tong University, China; University of Tokyo, Japan; University of Alberta, Canada; Southern Cross University, Australia; Government Technology Agency of Singapore, Singapore) ![]() |
|
Lin, Zheng |
![]() Ruichao Liang, Jing Chen, Ruochen Cao, Kun He, Ruiying Du, Shuhua Li, Zheng Lin, and Cong Wu (Wuhan University, China; University of Hong Kong, Hong Kong) Smart contracts, as Turing-complete programs managing billions of assets in decentralized finance, are prime targets for attackers. While fuzz testing seems effective for detecting vulnerabilities in these programs, we identify several significant challenges when targeting smart contracts: (i) the stateful nature of these contracts requires stateful exploration, but current fuzzers rely on transaction sequences to manipulate contract states, making the process inefficient; (ii) contract execution is influenced by the continuously changing blockchain environment, yet current fuzzers are limited to local deployments, failing to test contracts in real-world scenarios. These challenges hinder current fuzzers from uncovering hidden vulnerabilities, i.e., those concealed in deep contract states and specific blockchain environments. In this paper, we present SmartShot, a mutable snapshot-based fuzzer to hunt hidden vulnerabilities within smart contracts. We innovatively formulate contract states and blockchain environments as directly fuzzable elements and design mutable snapshots to quickly restore and mutate these elements. SmartShot features a symbolic taint analysis-based mutation strategy along with double validation to soundly guide the state mutation. SmartShot mutates blockchain environments using contracts’ historical on-chain states, providing real-world execution contexts. We propose a snapshot checkpoint mechanism to integrate mutable snapshots into SmartShot’s fuzzing loops. These innovations enable SmartShot to effectively fuzz contract states, test contracts across varied and realistic blockchain environments, and support on-chain fuzzing. Experimental results show that SmartShot is effective in detecting hidden vulnerabilities, achieving the highest code coverage and the lowest false positive rate. SmartShot is 4.8× to 20.2× faster than state-of-the-art tools, identifying 2,150 vulnerable contracts out of 42,738 real-world contracts, which is 2.1× to 13.7× more than other tools. SmartShot has demonstrated its real-world impact by detecting vulnerabilities that are only discoverable on-chain and uncovering 24 zero-day vulnerabilities in the latest 10,000 deployed contracts. ![]() |
|
Liu, Bohan |
![]() Yue Li, Bohan Liu, Ting Zhang, Zhiqi Wang, David Lo, Lanxin Yang, Jun Lyu, and He Zhang (Nanjing University, China; Singapore Management University, Singapore) ![]() |
|
Liu, Chao |
![]() Keyu Liang, Zhongxin Liu, Chao Liu, Zhiyuan Wan, David Lo, and Xiaohu Yang (Zhejiang University, China; Chongqing University, China; Singapore Management University, Singapore) Code search is a crucial task in software engineering, aiming to retrieve code snippets that are semantically relevant to a natural language query. Recently, Pre-trained Language Models (PLMs) have shown remarkable success and are widely adopted for code search tasks. However, PLM-based methods often struggle in cross-domain scenarios. When applied to a new domain, they typically require extensive fine-tuning with substantial data. Even worse, the data scarcity problem in new domains often forces these methods to operate in a zero-shot setting, resulting in a significant decline in performance. RAPID, which generates synthetic data for model fine-tuning, is currently the only effective method for zero-shot cross-domain code search. Despite its effectiveness, RAPID demands substantial computational resources for fine-tuning and needs to maintain specialized models for each domain, underscoring the need for a zero-shot, fine-tuning-free approach for cross-domain code search. The key to tackling zero-shot cross-domain code search lies in bridging the gaps among domains. In this work, we propose to break the query-code matching process of code search into two simpler tasks: query-comment matching and code-code matching. We first conduct an empirical study to investigate the effectiveness of these two matching schemas in zero-shot cross-domain code search. Our findings highlight the strong complementarity among the three matching schemas, i.e., query-code, query-comment, and code-code matching. Based on the findings, we propose CodeBridge, a zero-shot, fine-tuning-free approach for cross-domain code search. Specifically, CodeBridge first employs zero-shot prompting to guide Large Language Models (LLMs) to generate a comment for each code snippet in the codebase and produce a code for each query. Subsequently, it encodes queries, code snippets, comments, and the generated code using PLMs and assesses similarities through three matching schemas: query-code, query-comment, and generated code-code. Lastly, CodeBridge leverages a sampling-based fusion approach that combines these three similarity scores to rank the final search outcomes. Experimental results show that our approach outperforms the state-of-the-art PLM-based code search approaches, i.e., CoCoSoDa and UniXcoder, by an average of 21.4% and 24.9% in MRR, respectively, across three datasets. Our approach also yields results that are better than or comparable to those of the zero-shot cross-domain code search approach RAPID, which requires costly fine-tuning. ![]() ![]() |
|
Liu, Chengwei |
![]() Xiaokun Luan, David Sanan, Zhe Hou, Qiyuan Xu, Chengwei Liu, Yufan Cai, Yang Liu, and Meng Sun (Peking University, China; Singapore Institute of Technology, Singapore; Griffith University, Australia; Nanyang Technological University, Singapore; China-Singapore International Joint Research Institute, China; National University of Singapore, Singapore) Proof assistants are software tools for formal modeling and verification of software, hardware, design, and mathematical proofs. Due to the growing complexity and scale of formal proofs, compatibility issues frequently arise when using different versions of proof assistants. These issues result in broken proofs, disrupting the maintenance of formalized theories and hindering the broader dissemination of results within the community. Although existing works have proposed techniques to address specific types of compatibility issues, the overall characteristics of these issues remain largely unexplored. To address this gap, we conduct the first extensive empirical study to characterize compatibility issues, using Isabelle as a case study. We develop a regression testing framework to automatically collect compatibility issues from the Archive of Formal Proofs, the largest repository of formal proofs in Isabelle. By analyzing 12,079 collected issues, we identify their types and symptoms and further investigate their root causes. We also extract updated proofs that address these issues to understand the applied resolution strategies. Our study provides an in-depth understanding of compatibility issues in proof assistants, offering insights that support the development of effective techniques to mitigate these issues. ![]() ![]() |
|
Liu, Fang |
![]() Xiaoli Lian, Shuaisong Wang, Hanyu Zou, Fang Liu, Jiajun Wu, and Li Zhang (Beihang University, China) In the current software-driven era, ensuring privacy and security is critical. Despite this, the specification of security requirements for software is still largely a manual and labor-intensive process. Engineers are tasked with analyzing potential security threats based on functional requirements (FRs), a procedure prone to omissions and errors due to the expertise gap between cybersecurity experts and software engineers. To bridge this gap, we introduce F2SRD (Function-to-Security Requirements Derivation), an automated approach that proactively derives security requirements (SRs) from functional specifications under the guidance of relevant security verification requirements (VRs) drawn from the well recognized OWASP Application Security Verification Standard (ASVS). F2SRD operates in two main phases: Initially, we develop a VR retriever trained on a custom database of FR-VR pairs, enabling it to adeptly select applicable VRs from ASVS. This targeted retrieval informs the precise and actionable formulation of SRs. Subsequently, these VRs are used to construct structured prompts that direct GPT-4 in generating SRs. Our comparative analysis against two established models demonstrates F2SRD's enhanced performance in producing SRs that excel in inspiration, diversity, and specificity—essential attributes for effective security requirement generation. By leveraging security verification standards, we believe that the generated SRs are not only more focused but also resonate stronger with the needs of engineers. ![]() |
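Editorial note: the retrieve-then-prompt pipeline sketched in the abstract above can be illustrated with a few lines that select the most relevant verification requirements for a functional requirement and fill a structured prompt. The ASVS_SNIPPETS list, the Jaccard scorer, and the prompt wording are invented for illustration; F2SRD trains a dedicated VR retriever on FR-VR pairs and prompts GPT-4.

```python
# Illustrative sketch only, not F2SRD's retriever or prompt template.
ASVS_SNIPPETS = [
    "Verify that passwords are stored using an approved one-way key derivation function.",
    "Verify that all authentication controls are enforced on the server side.",
    "Verify that file uploads validate type, size, and content before storage.",
]

def score(fr: str, vr: str) -> float:
    a, b = set(fr.lower().split()), set(vr.lower().split())
    return len(a & b) / (len(a | b) or 1)     # lexical overlap as a stand-in retriever

def build_prompt(functional_requirement: str, k: int = 2) -> str:
    top_vrs = sorted(ASVS_SNIPPETS, key=lambda vr: score(functional_requirement, vr),
                     reverse=True)[:k]
    guidance = "\n".join(f"- {vr}" for vr in top_vrs)
    return (
        "Functional requirement:\n"
        f"{functional_requirement}\n\n"
        "Relevant verification requirements:\n"
        f"{guidance}\n\n"
        "Derive specific, testable security requirements for this function."
    )

print(build_prompt("Users shall be able to upload a profile picture."))
```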
|
Liu, Haoran |
![]() Xingpei Li, Yan Lei, Zhouyang Jia, Yuanliang Zhang, Haoran Liu, Liqian Chen, Wei Dong, and Shanshan Li (National University of Defense Technology, China; Chongqing University, China) As deep learning (DL) frameworks become widely used, converting models between frameworks is crucial for ecosystem flexibility. However, interestingly, existing model converters commonly focus on syntactic operator API mapping—transpiling operator names and parameters—which results in API compatibility issues (i.e., incompatible parameters, missing operators). These issues arise from semantic inconsistencies due to differences in operator implementation, causing conversion failure or performance degradation. In this paper, we present the first comprehensive study on operator semantic inconsistencies through API mapping analysis and framework source code inspection, revealing that 47% of sampled operators exhibit inconsistencies across DL frameworks, with source code limited to individual layers and no inter-layer interactions. This suggests that layer-wise source code alignment is feasible. Based on this, we propose ModelX, a source-level cross-framework conversion approach that extends operator API mapping by addressing semantic inconsistencies beyond the API level. Experiments on PyTorch-to-Paddle conversion show that ModelX successfully converts 624 out of 686 sampled operators and outperforms two state-of-the-art converters and three popular large language models. Moreover, ModelX achieves minimal metric gaps (avg. all under 3.4%) across 52 models from vision, text, and audio tasks, indicating strong robustness. ![]() ![]() Haoran Liu, Shanshan Li, Zhouyang Jia, Yuanliang Zhang, Linxiao Bai, Si Zheng, Xiaoguang Mao, and Xiangke Liao (National University of Defense Technology, China) ![]() |
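Editorial note: the distinction drawn above between purely syntactic operator mapping and semantic adaptation can be shown with a toy rule table in which each mapping also carries a parameter-rewriting function. The framework prefixes, operator names, and the keep-probability example are invented; they are not ModelX's actual mapping rules.

```python
# Hypothetical example of a mapping rule that also repairs a semantic inconsistency
# (here: the source framework's dropout takes a *keep* probability, the target a *drop*
# probability). Purely illustrative.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class OpRule:
    target_name: str
    adapt_params: Callable[[dict], dict]   # semantic fix-up, not just renaming

RULES: Dict[str, OpRule] = {
    "src.Dropout": OpRule("dst.Dropout", lambda p: {"p": 1.0 - p["keep_prob"]}),
    "src.Conv2d":  OpRule("dst.Conv2D",  lambda p: p),   # no semantic gap: rename only
}

def convert(op_name, params):
    rule = RULES[op_name]
    return rule.target_name, rule.adapt_params(params)

print(convert("src.Dropout", {"keep_prob": 0.8}))   # drop probability ~= 0.2
```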
|
Liu, Hui |
![]() Meng Fan, Yuxia Zhang, Klaas-Jan Stol, and Hui Liu (Beijing Institute of Technology, China; University College Cork, Ireland; Lero, Ireland; SINTEF, Norway) Continued contributions of core developers in open source software (OSS) projects are key for sustaining and maintaining successful OSS projects. A major risk to the sustainability of OSS projects is developer turnover. Prior studies have explored developer turnover at the level of individual projects. A shortcoming of such studies is that they ignore the impact of developer turnover on downstream projects. Yet, an awareness of the turnover of core developers offers useful insights to the rest of an open source ecosystem. This study performs a large-scale empirical analysis of code developer turnover in the Rust package ecosystem. We find that the turnover of core developers is quite common in the whole Rust ecosystem with 36,991 packages. This is particularly worrying as a vast majority of Rust packages only have a single core developer. We found that core developer turnover can significantly decrease the quality and efficiency of software development and maintenance, even leading to deprecation. This is a major source of concern for those Rust packages that are widely used. We surveyed developers' perspectives on the turnover of core developers in upstream packages. We found that developers widely agreed that core developer turnover can affect project stability and sustainability. They also emphasized the importance of transparency and timely notifications regarding the health status of upstream dependencies. This study provides unique insights to help communities focus on building reliable software dependency networks. ![]() ![]() Mian Qin, Yuxia Zhang, Klaas-Jan Stol, and Hui Liu (Beijing Institute of Technology, China; University College Cork, Ireland; Lero, Ireland; SINTEF, Norway) Open Source Software (OSS) projects are no longer only developed by volunteers. Instead, many organizations, from early-stage startups to large global enterprises, actively participate in many well-known projects. The survival and success of OSS projects rely on long-term contributors, who have extensive experience and knowledge. While prior literature has explored volunteer turnover in OSS, there is a paucity of research on company turnover in OSS ecosystems. Given the intensive involvement of companies in OSS and the different nature of corporate contributors vis-a-vis volunteers, it is important to investigate company turnover in OSS projects. This study first explores the prevalence and characteristics of companies that discontinue contributing to OSS projects, and then develops models to predict companies’ turnover. Based on a study of the Linux kernel, we analyze the early-stage behavior of 1,322 companies that have contributed to the project. We find that approximately 12% of companies discontinue contributing each year; one-sixth of those used to be core contributing companies (those that ranked in the top 20% by commit volume). Furthermore, withdrawing companies tend to have a lower intensity and scope of contributions, make primarily perfective changes, collaborate less, and operate on a smaller scale. We propose a Temporal Convolutional Network (TCN) deep learning model based on these indicators to predict whether companies will discontinue. The evaluation results show that the model achieves an AUC metric of .76 and an accuracy of .71. 
We evaluated the model in two other OSS projects, Rust and OpenStack, and the performance remains stable. ![]() ![]() Waseem Akram, Yanjie Jiang, Yuxia Zhang, Haris Ali Khan, and Hui Liu (Beijing Institute of Technology, China; Peking University, China) Accurate method naming is crucial for code readability and maintainability. However, manually creating concise and meaningful names remains a significant challenge. To this end, in this paper, we propose an approach based on Large Language Model (LLMs) to suggest method names according to function descriptions. The key of the approach is ContextCraft, an automated algorithm for generating context-rich prompts for LLM that suggests the expected method names according to the prompts. For a given query (functional description), it retrieves a few best examples whose functional descriptions have the greatest similarity with the query. From the examples, it identifies tokens that are likely to appear in the final method name as well as their likely positions, picks up pivot words that are semantically related to tokens in the according method names, and specifies the evaluation results of the LLM on the selected examples. All such outputs (tokens with probabilities and position information, pivot words accompanied by associated name tokens and similarity scores, and evaluation results) together with the query and the selected examples are then filled in a predefined prompt template, resulting in a context-rich prompt. This context-rich prompt reduces the randomness of LLMs by focusing the LLM’s attention on relevant contexts, constraining the solution space, and anchoring results to meaningful semantic relationships. Consequently, the LLM leverages this prompt to generate the expected method name, producing a more accurate and relevant suggestion. We evaluated the proposed approach with 43k real-world Java and Python methods accompanied by functional descriptions. Our evaluation results suggested that it significantly outperforms the state-of-the-art approach RNN-att-Copy, improving the chance of exact match by 52% and decreasing the edit distance between generated and expected method names by 32%. Our evaluation results also suggested that the proposed approach worked well for various LLMs, including ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini-1.5, and Llama-3. ![]() ![]() Haris Ali Khan, Yanjie Jiang, Qasim Umer, Yuxia Zhang, Waseem Akram, and Hui Liu (Beijing Institute of Technology, China; Peking University, China; King Fahd University of Petroleum and Minerals, Saudi Arabia) It is often valuable to know whether a given piece of source code has or hasn’t been used to train a given deep learning model. On one side, it helps avoid data contamination problems that may exaggerate the performance of evaluated models. Conversely, it facilitates copyright protection by identifying private or protected code leveraged for model training without permission. To this end, automated approaches have been proposed for the detection, known as data contamination detection. Such approaches often heavily rely on the confidence of the involved models, assuming that the models should be more confident in handling contaminated data than cleaned data. However, such approaches do not consider the nature of the given data item, i.e., how difficult it is to predict the given item. Consequently, difficult-to-predict contaminated data and easy-to-predict cleaned data are often misclassified. 
As an initial attempt to solve this problem, this paper presents a naturalness-based approach, called Natural-DaCoDe, for code-completion models to distinguish contaminated source code from cleaned ones. Natural-DaCoDe leverages code naturalness to quantitatively measure the difficulty of a given source code for code-completion models. It then trains a classifier to distinguish contaminated source code according to both code naturalness and the performance of the code-completion models on the given source code. We evaluate Natural-DaCoDe with two pre-trained large language models (e.g., ChatGPT and Claude) and two code-completion models that we trained from scratch for detecting contaminated data. Our evaluation results suggest that Natural-DaCoDe substantially outperformed the state-of-the-art approaches in detecting contaminated data, improving the average accuracy by 61.78%. We also evaluate Natural-DaCoDe on the method name suggestion task, where it remains more accurate than the state-of-the-art approaches, improving the accuracy by 54.39%. Furthermore, Natural-DaCoDe was tested on a natural language text benchmark, significantly outperforming the state-of-the-art approaches by 22%. This suggests that Natural-DaCoDe could be applied to various source-code-related tasks beyond code completion. ![]() |
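Editorial note: a toy rendering of the two signals combined above, a naturalness score for the snippet and the completion model's measured performance on it, fed to a classifier. The character-bigram entropy and the stubbed completion score are crude placeholders for the paper's measures, used here only to make the pairing of features explicit.

```python
# Illustrative feature pairing for contamination detection; not Natural-DaCoDe itself.
import math
from collections import Counter
from sklearn.linear_model import LogisticRegression

def bigram_cross_entropy(code: str) -> float:
    """Crude naturalness proxy: character-bigram cross-entropy of the snippet."""
    pairs = Counter(zip(code, code[1:]))
    unig = Counter(code)
    total = sum(pairs.values()) or 1
    return -sum(c * math.log(c / unig[a]) for (a, _), c in pairs.items()) / total

def completion_accuracy(model, code: str) -> float:
    """Placeholder for running the code-completion model and scoring its output."""
    return model(code)

def train_detector(snippets, labels, model):
    X = [[bigram_cross_entropy(s), completion_accuracy(model, s)] for s in snippets]
    return LogisticRegression().fit(X, labels)

# toy usage: 'contaminated' snippets labelled 1, 'clean' snippets labelled 0
det = train_detector(
    ["def add(a,b): return a+b", "x=1\ny=2\nprint(x+y)", "qq zz 17 @@", "??!!"],
    [1, 1, 0, 0],
    model=lambda s: 0.9 if "def" in s or "print" in s else 0.2,   # stub completion score
)
```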
|
Liu, Jian |
![]() Shoupeng Ren, Lipeng He, Tianyu Tu, Di Wu, Jian Liu, Kui Ren, and Chun Chen (Zhejiang University, China; University of Waterloo, Canada) ![]() |
|
Liu, Kui |
![]() Xu Yang, Wenhan Zhu, Michael Pacheco, Jiayuan Zhou, Shaowei Wang, Xing Hu, and Kui Liu (University of Manitoba, Canada; Huawei, Canada; Zhejiang University, China; Huawei Software Engineering Application Technology Lab, China) Detecting vulnerability fix commits in open-source software is crucial for maintaining software security. To help OSS projects identify vulnerability fix commits, several automated approaches have been developed. However, existing approaches like VulFixMiner and CoLeFunDa focus solely on code changes, neglecting essential context from development artifacts. Tools like Vulcurator, which integrates issue reports, fail to leverage semantic associations between different development artifacts (e.g., pull requests and historical vulnerability fixes). Moreover, they miss vulnerability fixes in tangled commits and lack explanations, limiting practical use. Hence, to address these limitations, we propose LLM4VFD, a novel framework that leverages Large Language Models (LLMs) enhanced with Chain-of-Thought reasoning and In-Context Learning to improve the accuracy of vulnerability fix detection. LLM4VFD comprises three components: (1) Code Change Intention, which analyzes commit summaries, purposes, and implications using Chain-of-Thought reasoning; (2) Development Artifact, which incorporates context from related issue reports and pull requests; (3) Historical Vulnerability, which retrieves similar past vulnerability fixes to enrich context. More importantly, on top of the prediction, LLM4VFD also provides a detailed analysis and explanation to help security experts understand the rationale behind the decision. We evaluated LLM4VFD against state-of-the-art techniques, including Pre-trained Language Model-based approaches and vanilla LLMs, using a newly collected dataset, BigVulFixes. Experimental results demonstrate that LLM4VFD significantly outperforms the best-performing existing approach by 68.1%--145.4%. Furthermore, we conducted a user study with security experts, showing that the analysis generated by LLM4VFD improves the efficiency of vulnerability fix identification. ![]() |
|
Liu, Mingwei |
![]() Yanlin Wang, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Mingzhi Mao, Xilin Liu, Yuchi Ma, and Zibin Zheng (Sun Yat-sen University, China; Huawei Cloud Computing Technologies, China) Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation, while coding style differences between LLMs and human developers remain under-explored. In this paper, we empirically analyze the differences in coding style between the code generated by mainstream LLMs and the code written by human developers, and summarize coding style inconsistency taxonomy. Specifically, we first summarize the types of coding style inconsistencies by manually analyzing a large number of generation results. We then compare the code generated by LLMs with the code written by human programmers in terms of readability, conciseness, and robustness. The results reveal that LLMs and developers exhibit differences in coding style. Additionally, we study the possible causes of these inconsistencies and provide some solutions to alleviate the problem. ![]() |
|
Liu, Shangqing |
![]() Jiaolong Kong, Xiaofei Xie, and Shangqing Liu (Singapore Management University, Singapore; Nanyang Technological University, Singapore) ![]() |
|
Liu, Shuang |
![]() Junjie Chen, Xingyu Fan, Chen Yang, Shuang Liu, and Jun Sun (Tianjin University, China; Renmin University of China, China; Singapore Management University, Singapore) ![]() |
|
Liu, Tianming |
![]() Chenxu Wang, Tianming Liu, Yanjie Zhao, Minghui Yang, and Haoyu Wang (Huazhong University of Science and Technology, China; Monash University, Australia; OPPO, China) With the rapid development of Large Language Models (LLMs), their integration into automated mobile GUI testing has emerged as a promising research direction. However, existing LLM-based testing approaches face significant challenges, including time inefficiency and high costs due to constant LLM querying. To address these issues, this paper introduces LLMDroid, a novel testing framework designed to enhance existing automated mobile GUI testing tools by leveraging LLMs more efficiently. The workflow of LLMDroid comprises two main stages: Autonomous Exploration and LLM Guidance. During Autonomous Exploration, LLMDroid utilizes existing testing tools while leveraging LLMs to summarize explored pages. When code coverage growth slows, it transitions to LLM Guidance to strategically direct testing towards unexplored functionalities. This approach minimizes LLM interactions while maximizing their impact on test coverage. We applied LLMDroid to three popular open-source Android testing tools and evaluated it on 14 top-listed apps from Google Play. Results demonstrate an average increase of 26.16% in code coverage and 29.31% in activity coverage. Furthermore, our evaluation under different LLMs reveals that LLMDroid outperform existing step-wise approaches with significant cost efficiency, achieving optimal performance at $0.49 per hour using GPT-4o among tested models, with a cost-effective alternative achieving 94% of this performance at just $0.03 per hour. These findings highlight LLMDroid’s effectiveness in enhancing automated mobile app testing and its potential for widespread adoption. ![]() |
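Editorial note: the two-stage workflow described above can be sketched as a loop that stays in cheap autonomous exploration while coverage keeps growing and falls back to LLM guidance when growth stalls. The GUI testing tool and the LLM are stubbed, and the switching rule (coverage growth over a sliding window) is an assumption for illustration, not LLMDroid's exact criterion.

```python
# Schematic mode-switching loop; all components are stubs.
import random
from collections import deque

def autonomous_step(covered: set) -> None:
    """Placeholder for one step of an existing automated GUI testing tool."""
    covered.add(random.randint(0, 300))

def llm_guided_step(covered: set, page_summaries: list) -> None:
    """Placeholder for asking the LLM to steer towards under-explored functionality."""
    covered.update(random.sample(range(300, 600), k=3))

def run(budget=400, window=50, min_growth=0.02):
    covered, history = set(), deque(maxlen=window)
    page_summaries = []          # populated by the LLM's page summaries in the real tool
    for _ in range(budget):
        history.append(len(covered))
        growth = ((history[-1] - history[0]) / max(history[0], 1)
                  if len(history) == window else 1.0)
        if growth < min_growth:
            llm_guided_step(covered, page_summaries)   # coverage plateaued: ask the LLM
        else:
            autonomous_step(covered)                   # keep cheap exploration while it works
    return len(covered)

print("covered elements:", run())
```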
|
Liu, Xilin |
![]() Yanlin Wang, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Mingzhi Mao, Xilin Liu, Yuchi Ma, and Zibin Zheng (Sun Yat-sen University, China; Huawei Cloud Computing Technologies, China) Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation, while coding style differences between LLMs and human developers remain under-explored. In this paper, we empirically analyze the differences in coding style between the code generated by mainstream LLMs and the code written by human developers, and summarize coding style inconsistency taxonomy. Specifically, we first summarize the types of coding style inconsistencies by manually analyzing a large number of generation results. We then compare the code generated by LLMs with the code written by human programmers in terms of readability, conciseness, and robustness. The results reveal that LLMs and developers exhibit differences in coding style. Additionally, we study the possible causes of these inconsistencies and provide some solutions to alleviate the problem. ![]() |
|
Liu, Yang |
![]() Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, and Yang Liu (Nanyang Technological University, Singapore; Peking University, China; King's College London, UK; Fudan University, China) Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods. ![]() ![]() Xu Wang, Mingming Zhang, Xiangxin Meng, Jian Zhang, Yang Liu, and Chunming Hu (Beihang University, China; Zhongguancun Laboratory, China; Ministry of Education, China; Nanyang Technological University, Singapore) Deep Neural Networks (DNNs) are prevalent across a wide range of applications. Despite their success, the complexity and opaque nature of DNNs pose significant challenges in debugging and repairing DNN models, limiting their reliability and broader adoption. In this paper, we propose MLM4DNN, an element-based automated DNN repair method. Unlike previous techniques that focus on post-training adjustments or rely heavily on predefined bug patterns, MLM4DNN repairs DNNs by leveraging a fine-tuned Masked Language Model (MLM) to predict correct fixes for nine predefined key elements in DNNs. We construct a large-scale dataset by masking nine key elements from the correct DNN source code and then force the MLM to restore the correct elements to learn the deep semantics that ensure the normal functionalities of DNNs. Afterwards, a light-weight static analysis tool is designed to filter out low-quality patches to enhance the repair efficiency. We introduce a patch validation method specifically for DNN repair tasks, which consists of three evaluation metrics from different aspects to model the effectiveness of generated patches. We construct a benchmark, BenchmarkAPR4DNN, including 51 buggy DNN models and an evaluation tool that outputs the three metrics. We evaluate MLM4DNN against six baselines on BenchmarkAPR4DNN, and results show that MLM4DNN outperforms all state-of-the-art baselines, including two dynamic-based and four zero-shot learning-based methods. 
After applying the fine-tuned MLM design to several prevalent Large Language Models (LLMs), we consistently observe improved performance in DNN repair tasks compared to the original LLMs, which demonstrates the effectiveness of the method proposed in this paper. ![]() ![]() Ziqiao Kong, Cen Zhang, Maoyi Xie, Ming Hu, Yue Xue, Ye Liu, Haijun Wang, and Yang Liu (Nanyang Technological University, Singapore; Singapore Management University, Singapore; MetaTrust Labs, Singapore; Xi'an Jiaotong University, China) Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: 1. a lack of profit-centric techniques for expediting detection and, 2. insufficient automation in maximizing the profitability of discovered vulnerabilities, leaving the analysis to human experts. To address these gaps, we have developed VERITE, a profit-centric smart contract fuzzing framework that not only effectively detects those profitable vulnerabilities but also maximizes the exploited profits. VERITE has three key features: 1. DeFi action-based mutators for boosting the exploration of transactions with different fund flows; 2. potentially profitable candidates identification criteria, which checks whether the input has caused abnormal fund flow properties during testing; 3. a gradient descent-based profit maximization strategy for these identified candidates. VERITE is fully developed from scratch and evaluated on a dataset consisting of 61 exploited real-world DeFi projects with an average of over 1.1 million dollars loss. The results show that VERITE can automatically extract more than 18 million dollars in total and is significantly better than state-of-the-art fuzzer ItyFuzz in both detection (29/10) and exploitation (134 times more profits gained on average). Remarkably, in 12 targets, it gains more profits than real-world attacking exploits (1.01 to 11.45 times more). VERITE is also applied by auditors in contract auditing, where 6 (5 high severity) zero-day vulnerabilities are found with over $2,500 bounty rewards. ![]() ![]() ![]() ![]() Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Zhenyu Chen, Baowen Xu, and Yang Liu (Nanyang Technological University, Singapore; Nanjing University, China) Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. It poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could jeopardize life safety. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper, we innovatively propose a backdoor elimination technique for secure code understanding, called EliBadCode. 
EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of non-backdoor (adversarial) perturbations for subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. Backdoor triggers can be viewed as backdoor (adversarial) perturbations. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode substantially decreases the average Attack Success Rate (ASR) of the advanced backdoor attack from 99.76% to 2.64%, significantly outperforming the three baselines. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline). ![]() ![]() Xiaokun Luan, David Sanan, Zhe Hou, Qiyuan Xu, Chengwei Liu, Yufan Cai, Yang Liu, and Meng Sun (Peking University, China; Singapore Institute of Technology, Singapore; Griffith University, Australia; Nanyang Technological University, Singapore; China-Singapore International Joint Research Institute, China; National University of Singapore, Singapore) Proof assistants are software tools for formal modeling and verification of software, hardware, design, and mathematical proofs. Due to the growing complexity and scale of formal proofs, compatibility issues frequently arise when using different versions of proof assistants. These issues result in broken proofs, disrupting the maintenance of formalized theories and hindering the broader dissemination of results within the community. Although existing works have proposed techniques to address specific types of compatibility issues, the overall characteristics of these issues remain largely unexplored. To address this gap, we conduct the first extensive empirical study to characterize compatibility issues, using Isabelle as a case study. We develop a regression testing framework to automatically collect compatibility issues from the Archive of Formal Proofs, the largest repository of formal proofs in Isabelle. By analyzing 12,079 collected issues, we identify their types and symptoms and further investigate their root causes. We also extract updated proofs that address these issues to understand the applied resolution strategies. Our study provides an in-depth understanding of compatibility issues in proof assistants, offering insights that support the development of effective techniques to mitigate these issues. ![]() ![]() |
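Editorial note: the first EliBadCode step mentioned above, filtering the model vocabulary by programming-language naming conventions to shrink the trigger search space, can be illustrated in a few lines. The vocabulary and the identifier regex are toy placeholders, not the actual model vocabulary or the paper's filter.

```python
# Toy vocabulary filter: keep only tokens that look like legal identifiers.
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")   # e.g., Python/Java-style names

def filter_trigger_candidates(vocab):
    return [tok for tok in vocab if IDENTIFIER.fullmatch(tok)]

vocab = ["useful_fn", "x1", "<mask>", "##ing", "@@", "return", "fooBar", "42abc"]
print(filter_trigger_candidates(vocab))   # ['useful_fn', 'x1', 'return', 'fooBar']
```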
|
Liu, Ye |
![]() Ziqiao Kong, Cen Zhang, Maoyi Xie, Ming Hu, Yue Xue, Ye Liu, Haijun Wang, and Yang Liu (Nanyang Technological University, Singapore; Singapore Management University, Singapore; MetaTrust Labs, Singapore; Xi'an Jiaotong University, China) Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: 1. a lack of profit-centric techniques for expediting detection and, 2. insufficient automation in maximizing the profitability of discovered vulnerabilities, leaving the analysis to human experts. To address these gaps, we have developed VERITE, a profit-centric smart contract fuzzing framework that not only effectively detects those profitable vulnerabilities but also maximizes the exploited profits. VERITE has three key features: 1. DeFi action-based mutators for boosting the exploration of transactions with different fund flows; 2. potentially profitable candidates identification criteria, which checks whether the input has caused abnormal fund flow properties during testing; 3. a gradient descent-based profit maximization strategy for these identified candidates. VERITE is fully developed from scratch and evaluated on a dataset consisting of 61 exploited real-world DeFi projects with an average of over 1.1 million dollars loss. The results show that VERITE can automatically extract more than 18 million dollars in total and is significantly better than state-of-the-art fuzzer ItyFuzz in both detection (29/10) and exploitation (134 times more profits gained on average). Remarkably, in 12 targets, it gains more profits than real-world attacking exploits (1.01 to 11.45 times more). VERITE is also applied by auditors in contract auditing, where 6 (5 high severity) zero-day vulnerabilities are found with over $2,500 bounty rewards. ![]() ![]() ![]() |
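Editorial note: the "gradient descent-based profit maximization" idea above can be pictured with a toy numeric example: once a candidate transaction parameter yields a positive profit, tune it by following an estimated gradient of the profit. The simulate_profit function is an invented stand-in for replaying the exploit against a forked chain; it is not VERITE's implementation.

```python
# Toy profit-maximization loop over a single exploit parameter.
def simulate_profit(amount: float) -> float:
    # concave toy profit curve: profit peaks at some interior flash-loan size
    return -0.000001 * (amount - 40_000) ** 2 + 1_600

def maximize_profit(amount: float, lr: float = 50_000.0, eps: float = 1.0, steps: int = 200):
    for _ in range(steps):
        # finite-difference gradient of the (black-box) profit
        grad = (simulate_profit(amount + eps) - simulate_profit(amount - eps)) / (2 * eps)
        amount = max(0.0, amount + lr * grad)     # ascend the profit surface
    return amount, simulate_profit(amount)

best_amount, best_profit = maximize_profit(amount=1_000.0)
print(f"best input ~ {best_amount:.0f}, profit ~ {best_profit:.2f}")
```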
|
Liu, Yepang |
![]() Yuan Li, Peisen Yao, Kan Yu, Chengpeng Wang, Yaoyang Ye, Song Li, Meng Luo, Yepang Liu, and Kui Ren (Zhejiang University, China; Ant Group, China; Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) ![]() ![]() Hengcheng Zhu, Valerio Terragni, Lili Wei, Shing-Chi Cheung, Jiarong Wu, and Yepang Liu (Hong Kong University of Science and Technology, China; University of Auckland, New Zealand; McGill University, Canada; Southern University of Science and Technology, China) Mock assertions provide developers with a powerful means to validate program behaviors that are unobservable to test assertions. Despite their significance, they are rarely considered by automated test generation techniques. Effective generation of mock assertions requires understanding how they are used in practice. Although previous studies highlighted the importance of mock assertions, none provide insight into their usages. To bridge this gap, we conducted the first empirical study on mock assertions, examining their adoption, the characteristics of the verified method invocations, and their effectiveness in fault detection. Our analysis of 4,652 test cases from 11 popular Java projects reveals that mock assertions are mostly applied to validating specific kinds of method calls, such as those interacting with external resources and those reflecting whether a certain code path was traversed in systems under test. Additionally, we find that mock assertions complement traditional test assertions by ensuring the desired side effects have been produced, validating control flow logic, and checking internal computation results. Our findings contribute to a better understanding of mock assertion usages and provide a foundation for future related research such as automated test generation that support mock assertions. ![]() ![]() |
|
Liu, Yinxi |
![]() Yinxi Liu, Wei Meng, and Yinqian Zhang (Rochester Institute of Technology, USA; Chinese University of Hong Kong, China; Southern University of Science and Technology, China) Ethereum smart contracts determine state transition results not only by the previous states, but also by a mutable global state consisting of storage variables. This has resulted in state-inconsistency bugs, which grant an attacker the ability to modify contract states either through recursive function calls to a contract (reentrancy), or by exploiting transaction order dependence (TOD). Current studies have determined that identifying data races on global storage variables can capture all state-inconsistency bugs. Nevertheless, eliminating false positives poses a significant challenge, given the extensive number of execution paths that could potentially cause a data race. For simplicity, existing research considers a data race to be vulnerable as long as the variable involved could have inconsistent values under different execution orders. However, such a data race could be benign when the inconsistent value does not affect any critical computation or decision-making process in the program. Besides, the data race could also be infeasible when there is no valid state in the contract that allows the execution of both orders. In this paper, we aim to appreciably reduce these false positives without introducing false negatives. We present DivertScan, a precise framework to detect exploitable state-inconsistency bugs in smart contracts. We first introduce the use of flow divergence to check where the involved variable may flow to. This allows DivertScan to precisely infer the potential effects of a data race and determine whether it can be exploited for inducing unexpected program behaviors. We also propose multiplex symbolic execution to examine different execution orders in one time of solving. This helps DivertScan to determine whether a common starting state could potentially exist. To address the scalability issue in symbolic execution, DivertScan utilizes an overapproximated pre-checking and a selective exploration strategy. As a result, it only needs to explore a limited state space. DivertScan significantly outperformed state-of-the-art tools by improving the precision rate by 20.72% to 74.93% while introducing no false negatives. It also identified five exploitable real-world vulnerabilities that other tools missed. The detected vulnerabilities could potentially lead to a loss of up to $68.2M, based on trading records and rate limits. ![]() |
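Editorial note: the benign-versus-exploitable distinction drawn above can be illustrated with a toy transaction-order check: two orders are executed on a tiny contract model, and a divergence is reported only when the diverging storage variable flows into a critical operation (here, the payout computation). This is a simplified rendering of the idea, not DivertScan's flow-divergence or symbolic-execution machinery.

```python
# Toy TOD check: flag a divergence only if it reaches a critical sink.
import copy

def set_fee(state):      state["fee"] = 5                      # front-runnable fee change
def log_a(state):        state["last_caller"] = "A"            # bookkeeping only
def log_b(state):        state["last_caller"] = "B"
def withdraw(state):     state["payout"] = state["balance"] - state["fee"]   # critical sink

def run(order, init):
    state = copy.deepcopy(init)
    for tx in order:
        tx(state)
    return state

def check(tx_a, tx_b, init, critical={"payout"}):
    s1, s2 = run([tx_a, tx_b], init), run([tx_b, tx_a], init)
    diverging = {k for k in s1 if s1[k] != s2[k]}
    return {"diverging": diverging, "exploitable": bool(diverging & critical)}

init = {"balance": 100, "fee": 20, "payout": 0, "last_caller": ""}
print(check(set_fee, withdraw, init))   # payout differs across orders -> exploitable TOD
print(check(log_a, log_b, init))        # last_caller differs, but no critical flow -> benign
```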
|
Liu, Zhe |
![]() Mengzhuo Chen, Zhe Liu, Chunyang Chen, Junjie Wang, Boyu Wu, Jun Hu, and Qing Wang (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TU Munich, Germany) In software development, similar apps often encounter similar bugs due to shared functionalities and implementation methods. However, current automated GUI testing methods mainly focus on generating test scripts to cover more pages by analyzing the internal structure of the app, without targeted exploration of paths that may trigger bugs, resulting in low efficiency in bug discovery. Considering that a large number of bug reports on open source platforms can provide external knowledge for testing, this paper proposes BugHunter, a novel bug-aware automated GUI testing approach that generates exploration paths guided by bug reports from similar apps, utilizing a combination of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Instead of focusing solely on coverage, BugHunter dynamically adapts the testing process to target bug paths, thereby increasing bug detection efficiency. BugHunter first builds a high-quality bug knowledge base from historical bug reports. Then it retrieves relevant reports from this large bug knowledge base using a two-stage retrieval process, and generates test paths based on similar apps’ bug reports. BugHunter also introduces a local and global path-planning mechanism to handle differences in functionality and UI design across apps, and the ambiguous behavior or missing steps in the online bug reports. We evaluate BugHunter on 121 bugs across 71 apps and compare its performance against 16 state-of-the-art baselines. BugHunter achieves 60% improvement in bug detection over the best baseline, with comparable or higher coverage against the baselines. Furthermore, BugHunter successfully detects 49 new crash bugs in real-world apps from Google Play, with 33 bugs fixed, 9 confirmed, and 7 pending feedback. ![]() |
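Editorial note: the two-stage retrieval described above can be sketched as a coarse keyword filter over a bug-report knowledge base followed by a finer overlap-based rerank, returning reproduction steps that would seed path generation. The knowledge-base entries, fields, and scoring are invented placeholders; the real system retrieves from a large report corpus and uses an LLM with RAG to turn retrieved steps into executable GUI test paths.

```python
# Toy two-stage retrieval over a bug-report knowledge base.
BUG_KB = [
    {"app": "MailApp",  "summary": "crash when attaching a large file",
     "steps": ["open composer", "attach 1GB file"]},
    {"app": "ChatApp",  "summary": "crash on rotating screen during video call",
     "steps": ["start call", "rotate device"]},
    {"app": "NotesApp", "summary": "data loss when saving note while offline",
     "steps": ["disable network", "save note"]},
]

def coarse_filter(query: str, kb, keyword_hits: int = 1):
    q = set(query.lower().split())
    return [r for r in kb if len(q & set(r["summary"].lower().split())) >= keyword_hits]

def rerank(query: str, candidates):
    q = set(query.lower().split())
    def overlap(r):
        s = set(r["summary"].lower().split())
        return len(q & s) / len(q | s)
    return sorted(candidates, key=overlap, reverse=True)

def retrieve_paths(feature_description: str, top_k: int = 2):
    candidates = coarse_filter(feature_description, BUG_KB)
    return [r["steps"] for r in rerank(feature_description, candidates)[:top_k]]

print(retrieve_paths("video call screen rotate crash"))
```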
|
Liu, Zhongxin |
![]() Keyu Liang, Zhongxin Liu, Chao Liu, Zhiyuan Wan, David Lo, and Xiaohu Yang (Zhejiang University, China; Chongqing University, China; Singapore Management University, Singapore) Code search is a crucial task in software engineering, aiming to retrieve code snippets that are semantically relevant to a natural language query. Recently, Pre-trained Language Models (PLMs) have shown remarkable success and are widely adopted for code search tasks. However, PLM-based methods often struggle in cross-domain scenarios. When applied to a new domain, they typically require extensive fine-tuning with substantial data. Even worse, the data scarcity problem in new domains often forces these methods to operate in a zero-shot setting, resulting in a significant decline in performance. RAPID, which generates synthetic data for model fine-tuning, is currently the only effective method for zero-shot cross-domain code search. Despite its effectiveness, RAPID demands substantial computational resources for fine-tuning and needs to maintain specialized models for each domain, underscoring the need for a zero-shot, fine-tuning-free approach for cross-domain code search. The key to tackling zero-shot cross-domain code search lies in bridging the gaps among domains. In this work, we propose to break the query-code matching process of code search into two simpler tasks: query-comment matching and code-code matching. We first conduct an empirical study to investigate the effectiveness of these two matching schemas in zero-shot cross-domain code search. Our findings highlight the strong complementarity among the three matching schemas, i.e., query-code, query-comment, and code-code matching. Based on the findings, we propose CodeBridge, a zero-shot, fine-tuning-free approach for cross-domain code search. Specifically, CodeBridge first employs zero-shot prompting to guide Large Language Models (LLMs) to generate a comment for each code snippet in the codebase and produce a code for each query. Subsequently, it encodes queries, code snippets, comments, and the generated code using PLMs and assesses similarities through three matching schemas: query-code, query-comment, and generated code-code. Lastly, CodeBridge leverages a sampling-based fusion approach that combines these three similarity scores to rank the final search outcomes. Experimental results show that our approach outperforms the state-of-the-art PLM-based code search approaches, i.e., CoCoSoDa and UniXcoder, by an average of 21.4% and 24.9% in MRR, respectively, across three datasets. Our approach also yields results that are better than or comparable to those of the zero-shot cross-domain code search approach RAPID, which requires costly fine-tuning. ![]() ![]() |
|
Lo, David |
![]() Yue Li, Bohan Liu, Ting Zhang, Zhiqi Wang, David Lo, Lanxin Yang, Jun Lyu, and He Zhang (Nanjing University, China; Singapore Management University, Singapore) ![]() ![]() Keyu Liang, Zhongxin Liu, Chao Liu, Zhiyuan Wan, David Lo, and Xiaohu Yang (Zhejiang University, China; Chongqing University, China; Singapore Management University, Singapore) Code search is a crucial task in software engineering, aiming to retrieve code snippets that are semantically relevant to a natural language query. Recently, Pre-trained Language Models (PLMs) have shown remarkable success and are widely adopted for code search tasks. However, PLM-based methods often struggle in cross-domain scenarios. When applied to a new domain, they typically require extensive fine-tuning with substantial data. Even worse, the data scarcity problem in new domains often forces these methods to operate in a zero-shot setting, resulting in a significant decline in performance. RAPID, which generates synthetic data for model fine-tuning, is currently the only effective method for zero-shot cross-domain code search. Despite its effectiveness, RAPID demands substantial computational resources for fine-tuning and needs to maintain specialized models for each domain, underscoring the need for a zero-shot, fine-tuning-free approach for cross-domain code search. The key to tackling zero-shot cross-domain code search lies in bridging the gaps among domains. In this work, we propose to break the query-code matching process of code search into two simpler tasks: query-comment matching and code-code matching. We first conduct an empirical study to investigate the effectiveness of these two matching schemas in zero-shot cross-domain code search. Our findings highlight the strong complementarity among the three matching schemas, i.e., query-code, query-comment, and code-code matching. Based on the findings, we propose CodeBridge, a zero-shot, fine-tuning-free approach for cross-domain code search. Specifically, CodeBridge first employs zero-shot prompting to guide Large Language Models (LLMs) to generate a comment for each code snippet in the codebase and produce a code for each query. Subsequently, it encodes queries, code snippets, comments, and the generated code using PLMs and assesses similarities through three matching schemas: query-code, query-comment, and generated code-code. Lastly, CodeBridge leverages a sampling-based fusion approach that combines these three similarity scores to rank the final search outcomes. Experimental results show that our approach outperforms the state-of-the-art PLM-based code search approaches, i.e., CoCoSoDa and UniXcoder, by an average of 21.4% and 24.9% in MRR, respectively, across three datasets. Our approach also yields results that are better than or comparable to those of the zero-shot cross-domain code search approach RAPID, which requires costly fine-tuning. ![]() ![]() ![]() Zuobin Wang, Zhiyuan Wan, Yujing Chen, Yun Zhang, David Lo, Difan Xie, and Xiaohu Yang (Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang), China; Hangzhou City University, China; Singapore Management University, Singapore) In smart contract development, practitioners frequently reuse code to reduce development effort and avoid reinventing the wheel. This reused code, whether identical or similar to its original source, is referred to as a code clone. Unintentional code cloning can propagate flaws and vulnerabilities, potentially undermining the reliability and maintainability of software systems. 
Previous studies have identified a significant prevalence of code clones in Solidity smart contracts on the Ethereum blockchain. To mitigate the risks posed by code clones, clone detection has emerged as an active field of research and practice in software engineering. Recent studies have extended existing techniques or proposed novel techniques tailored to the unique syntactic and semantic features of Solidity. Nonetheless, the evaluations of existing techniques, whether conducted by their original authors or independent researchers, involve codebases in various programming languages and utilize different versions of the corresponding tools. The resulting inconsistency makes direct comparisons of the evaluation results impractical, and hinders the ability to derive meaningful conclusions across the evaluations. There remains a lack of clarity regarding the effectiveness of these techniques in detecting smart contract clones, and whether it is feasible to combine different techniques to achieve scalable yet accurate detection of code clones in smart contracts. To address this gap, we conduct a comprehensive empirical study that evaluates the effectiveness and scalability of five representative clone detection techniques on 33,073 verified Solidity smart contracts, along with a benchmark we curate, in which we manually label 72,010 pairs of Solidity smart contracts with clone tags. Moreover, we explore the potential of combining different techniques to achieve optimal performance of code clone detection for smart contracts, and propose SourceREClone, a framework designed for the refined integration of different techniques, which achieves a 36.9% improvement in F1 score compared to a straightforward combination of the state of the art. Based on our findings, we discuss implications, provide recommendations for practitioners, and outline directions for future research. ![]() ![]() Junwei Zhang, Xing Hu, Shan Gao, Xin Xia, David Lo, and Shanping Li (Zhejiang University, China; Huawei, China; Singapore Management University, Singapore) Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, to categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest on two widely-used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of datasets contain noise, highlighting its prevalence. 
Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models. Fine-tuning the four LLMs with the filtered Methods2Test dataset, on average, improves its performance by 67% in branch coverage, using the Defects4J benchmark. For the Atlas dataset, the four LLMs improve branch coverage by 39%. Additionally, filtering noise improves bug detection performance, resulting in a 21.42% increase in bugs detected by the generated tests. ![]() |
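Editorial note: the three-filter cleaning pipeline described above (rule-based syntax, rule-based relevance, model-based coverage) can be sketched as a chain of predicates over (focal method, test) pairs. The real datasets are Java; Python snippets, the `ast`-based checks, and the stubbed coverage filter are used here purely for illustration and are not CleanTest's implementation.

```python
# Schematic three-filter cleaning pipeline over (focal method, test) pairs.
import ast

def syntax_ok(code: str) -> bool:
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def relevant(focal: str, test: str) -> bool:
    """Rule-based relevance: the test must actually reference the focal method."""
    name = ast.parse(focal).body[0].name
    return name in test

def covered(focal: str, test: str) -> bool:
    """Placeholder for the model/coverage-based filter (e.g., run the test, check coverage)."""
    return True

def clean(pairs):
    return [(f, t) for f, t in pairs
            if syntax_ok(f) and syntax_ok(t) and relevant(f, t) and covered(f, t)]

pairs = [
    ("def add(a, b): return a + b", "def test_add(): assert add(1, 2) == 3"),
    ("def sub(a, b): return a - b", "def test_noise(): assert 1 == 1"),       # irrelevant test
    ("def mul(a, b) return a * b",  "def test_mul(): assert mul(2, 3) == 6"), # syntax noise
]
print("pairs kept:", len(clean(pairs)))
```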
|
Long, Xiao |
![]() Xin Tan, Xiao Long, Yinghao Zhu, Lin Shi, Xiaoli Lian, and Li Zhang (Beihang University, China) Onboarding newcomers is vital for the sustainability of open-source software (OSS) projects. To lower barriers and increase engagement, OSS projects have dedicated experts who provide guidance for newcomers. However, timely responses are often hindered by experts' busy schedules. The recent rapid advancements of AI in software engineering have brought opportunities to leverage AI as a substitute for expert mentoring. However, the potential role of AI as a comprehensive mentor throughout the entire onboarding process remains unexplored. To identify design strategies of this "AI mentor", we applied Design Fiction as a participatory method with 19 OSS newcomers. We investigated their current onboarding experience and elicited 32 design strategies for future AI mentor. Participants envisioned AI mentor being integrated into OSS platforms like GitHub, where it could offer assistance to newcomers, such as "recommending projects based on personalized requirements" and "assessing and categorizing project issues by difficulty". We also collected participants' perceptions of a prototype, named "OSSerCopilot", that implemented the envisioned strategies. They found the interface useful and user-friendly, showing a willingness to use it in the future, which suggests the design strategies are effective. Finally, in order to identify the gaps between our design strategies and current research, we conducted a comprehensive literature review, evaluating the extent of existing research support for this concept. We find that research is relatively scarce in certain areas where newcomers highly anticipate AI mentor assistance, such as "discovering an interested project". Our study has the potential to revolutionize the current newcomer-expert mentorship and provides valuable insights for researchers and tool designers aiming to develop and enhance AI mentor systems. ![]() |
|
López, José Antonio Hernández |
![]() Mootez Saad, José Antonio Hernández López, Boqi Chen, Dániel Varró, and Tushar Sharma (Dalhousie University, Canada; University of Murcia, Spain; McGill University, Canada; Linköping University, Sweden) Language models of code have demonstrated remarkable performance across various software engineering and source code analysis tasks. However, their demanding computational resource requirements and consequential environmental footprint remain significant challenges. This work introduces ALPINE, an adaptive programming language-agnostic pruning technique designed to substantially reduce the computational overhead of these models. The proposed method offers a pluggable layer that can be integrated with all Transformer-based models. With ALPINE, input sequences undergo adaptive compression throughout the pipeline, shrinking to as little as one third of their initial size and resulting in a significantly reduced computational load. Our experiments on two software engineering tasks, defect prediction and code clone detection, across three language models (CodeBERT, GraphCodeBERT, and UniXCoder) show that ALPINE achieves up to a 50% reduction in FLOPs, a 58.1% decrease in memory footprint, and a 28.1% improvement in throughput on average. This led to a reduction in CO2 emissions by up to 44.85%. Importantly, it achieves a reduction in computation resources while maintaining up to 98.1% of the original predictive performance. These findings highlight the potential of ALPINE in making language models of code more resource-efficient and accessible while preserving their performance, contributing to the overall sustainability of their adoption in software development. Also, it sheds light on redundant and noisy information in source code analysis corpora, as shown by the substantial sequence compression achieved by ALPINE. ![]() |
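Editorial note: the general idea of a pluggable layer that adaptively shortens the token sequence between Transformer blocks can be sketched in NumPy. The importance score used here (mean absolute activation) is only a stand-in for the paper's learned, language-agnostic pruning; this is not ALPINE's implementation.

```python
# Sketch of interleaved token pruning between Transformer layers.
import numpy as np

def prune_tokens(hidden: np.ndarray, keep_ratio: float = 0.7) -> np.ndarray:
    """hidden: (seq_len, dim) activations; keep only the most 'important' tokens."""
    importance = np.abs(hidden).mean(axis=1)
    k = max(1, int(len(hidden) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])   # keep top-k tokens, preserve original order
    return hidden[keep]

rng = np.random.default_rng(0)
x = rng.normal(size=(512, 768))                   # e.g., a 512-token code sequence
for layer in range(3):                            # compression accumulates across layers,
    x = prune_tokens(x)                           # interleaved with the real Transformer blocks
print("final sequence length:", x.shape[0])       # roughly 512 * 0.7**3 tokens
```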
|
Lou, Yiling |
![]() Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, and Yang Liu (Nanyang Technological University, Singapore; Peking University, China; King's College London, UK; Fudan University, China) Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods. ![]() |
|
Lu, Jie |
![]() Liqing Cao, Haofeng Li, Chenghang Shi, Jie Lu, Haining Meng, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) ![]() |
|
Lu, Liqiang |
![]() Siwei Tan, Liqiang Lu, Debin Xiang, Tianyao Chu, Congliang Lang, Jintao Chen, Xing Hu, and Jianwei Yin (Zhejiang University, China) Quantum programs provide exponential speedups compared to classical programs in certain areas, but they also inevitably encounter logical faults. Automatically repairing quantum programs is much more challenging than repairing classical programs due to the non-replicability of data, the vast search space of program inputs, and the new programming paradigm. Existing works based on semantic-based or learning-based program repair techniques are fundamentally limited in repairing efficiency and effectiveness. In this work, we propose HornBro, an efficient framework for automated quantum program repair. The key insight of HornBro lies in the homotopy-like method, which iteratively switches between the classical and quantum parts. This approach allows the repair tasks to be efficiently offloaded to the most suitable platforms, enabling a progressive convergence toward the correct program. We start by designing an implication assertion pragma to enable rigorous specifications of quantum program behavior, which helps to generate a quantum test suite automatically. This suite leverages the orthonormal bases of quantum programs to accommodate different encoding schemes. Given a fixed number of test cases, it allows the maximum input coverage of potential counter-example candidates. Then, we develop a Clifford approximation method with an SMT-based search to transform the patch localization program into a symbolic reasoning problem. Finally, we offload the computationally intensive repair of gate parameters to quantum hardware by leveraging the differentiability of quantum gates. Experiments suggest that HornBro increases the repair success rate by more than 62.5% compared to the existing repair techniques, supporting more types of quantum bugs. It also achieves 35.7× speedup in the repair and 99.9% gate reduction of the patch. ![]() ![]() |
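Editorial note: the final stage described above, repairing gate parameters by exploiting the differentiability of quantum gates, can be illustrated with a classical single-qubit simulation that tunes a rotation angle by gradient descent on the state infidelity using the parameter-shift rule. The one-gate program and target are invented for illustration; the paper offloads this step to quantum hardware rather than simulating it.

```python
# Toy gate-parameter repair via the parameter-shift rule (classical simulation).
import numpy as np

def rz(theta: float) -> np.ndarray:
    return np.array([[np.exp(-1j * theta / 2), 0], [0, np.exp(1j * theta / 2)]])

plus = np.array([1.0, 1.0]) / np.sqrt(2)        # |+> input state
target = rz(np.pi / 3) @ plus                   # behaviour demanded by the assertion

def infidelity(theta: float) -> float:
    return 1.0 - abs(np.vdot(target, rz(theta) @ plus)) ** 2

theta, lr = 2.5, 1.0                            # buggy parameter to be repaired
for _ in range(100):
    # parameter-shift rule: exact gradient for rotation gates from two evaluations
    grad = (infidelity(theta + np.pi / 2) - infidelity(theta - np.pi / 2)) / 2.0
    theta -= lr * grad

print(f"repaired angle ~ {theta:.4f} rad (expected ~ {np.pi/3:.4f})")
```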
|
Lu, Qinghua |
![]() Yanqi Su, Zhenchang Xing, Chong Wang, Chunyang Chen, Sherry (Xiwei) Xu, Qinghua Lu, and Liming Zhu (Australian National University, Australia; CSIRO's Data61, Australia; Nanyang Technological University, Singapore; TU Munich, Germany; UNSW, Australia) Exploratory testing (ET) harnesses testers' knowledge, creativity, and experience to create varying tests that uncover unexpected bugs from the end-user's perspective. Although ET has proven effective in system-level testing of interactive systems, the need for manual execution has hindered large-scale adoption. In this work, we explore the feasibility, challenges, and road ahead of automated scenario-based ET (a.k.a. soap opera testing). We conduct a formative study, identifying key insights for effective manual soap opera testing and challenges in automating the process. We then develop a multi-agent system leveraging LLMs and a Scenario Knowledge Graph (SKG) to automate soap opera testing. The system consists of three multi-modal agents (Planner, Player, and Detector) that collaborate to execute tests and identify potential bugs. Experimental results demonstrate the potential of automated soap opera testing, but there remains a significant gap compared to manual execution, especially regarding under-explored scenario boundaries and incorrectly identified bugs. Based on these observations, we envision the road ahead for automated soap opera testing, focusing on three key aspects: the synergy of neural and symbolic approaches, human-AI co-learning, and the integration of soap opera testing with broader software engineering practices. These insights aim to guide and inspire future research. ![]() |
|
Lu, Weiqi |
![]() Weiqi Lu, Yongqiang Tian, Xiaohan Zhong, Haoyang Ma, Zhenyang Xu, Shing-Chi Cheung, and Chengnian Sun (Hong Kong University of Science and Technology, China; University of Waterloo, Canada) Data visualization (DataViz) libraries play a crucial role in presentation, data analysis, and application development, underscoring the importance of their accuracy in transforming data into visual representations. Incorrect visualizations can adversely impact user experience, distort information conveyance, and influence user perception and decision-making processes. Visual bugs in these libraries can be particularly insidious as they may not cause obvious errors like crashes, but instead mislead users of the underlying data graphically, resulting in wrong decision making. Consequently, a good understanding of the unique characteristics of bugs in DataViz libraries is essential for researchers and developers to detect and fix bugs in DataViz libraries. This study presents the first comprehensive analysis of bugs in DataViz libraries, examining 564 bugs collected from five widely-used libraries. Our study systematically analyzes their symptoms and root causes, and provides a detailed taxonomy. We found that incorrect/inaccurate plots are pervasive in DataViz libraries and incorrect graphic computation is the major root cause, which necessitates further automated testing methods for DataViz libraries. Moreover, we identified eight key steps to trigger such bugs and two test oracles specific to DataViz libraries, which may inspire future research in designing effective automated testing techniques. Furthermore, with the recent advancements in Vision Language Models (VLMs), we explored the feasibility of applying these models to detect incorrect/inaccurate plots. The results show that the effectiveness of VLMs in bug detection varies from 29% to 57%, depending on the prompts, and adding more information in prompts does not necessarily increase the effectiveness. Our findings offer valuable insights into the nature and patterns of bugs in DataViz libraries, providing a foundation for developers and researchers to improve library reliability, and ultimately benefit more accurate and reliable data visualizations across various domains. ![]() |
|
Lu, You |
![]() Junming Cao, Xuwen Xiang, Mingfei Cheng, Bihuan Chen, Xinyan Wang, You Lu, Chaofeng Sha, Xiaofei Xie, and Xin Peng (Fudan University, China; Singapore Management University, Singapore) ![]() |
|
Luan, Xiaokun |
![]() Xiaokun Luan, David Sanan, Zhe Hou, Qiyuan Xu, Chengwei Liu, Yufan Cai, Yang Liu, and Meng Sun (Peking University, China; Singapore Institute of Technology, Singapore; Griffith University, Australia; Nanyang Technological University, Singapore; China-Singapore International Joint Research Institute, China; National University of Singapore, Singapore) Proof assistants are software tools for formal modeling and verification of software, hardware, design, and mathematical proofs. Due to the growing complexity and scale of formal proofs, compatibility issues frequently arise when using different versions of proof assistants. These issues result in broken proofs, disrupting the maintenance of formalized theories and hindering the broader dissemination of results within the community. Although existing works have proposed techniques to address specific types of compatibility issues, the overall characteristics of these issues remain largely unexplored. To address this gap, we conduct the first extensive empirical study to characterize compatibility issues, using Isabelle as a case study. We develop a regression testing framework to automatically collect compatibility issues from the Archive of Formal Proofs, the largest repository of formal proofs in Isabelle. By analyzing 12,079 collected issues, we identify their types and symptoms and further investigate their root causes. We also extract updated proofs that address these issues to understand the applied resolution strategies. Our study provides an in-depth understanding of compatibility issues in proof assistants, offering insights that support the development of effective techniques to mitigate these issues. ![]() ![]() |
|
Luo, Meng |
![]() Yuan Li, Peisen Yao, Kan Yu, Chengpeng Wang, Yaoyang Ye, Song Li, Meng Luo, Yepang Liu, and Kui Ren (Zhejiang University, China; Ant Group, China; Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) ![]() |
|
Luo, Xiapu |
![]() Kunsong Zhao, Zihao Li, Weimin Chen, Xiapu Luo, Ting Chen, Guozhu Meng, and Yajin Zhou (Hong Kong Polytechnic University, China; University of Electronic Science and Technology of China, China; Institute of Information Engineering at Chinese Academy of Sciences, China; Zhejiang University, China) WebAssembly has become the preferred smart contract format for various blockchain platforms due to its high portability and near-native execution speed. To effectively understand WebAssembly contracts, it is crucial to recover high-level type signatures because of the limited type information that WebAssembly provides. However, existing studies on type inference for smart contracts primarily center around Ethereum Virtual Machine bytecode, which is not applicable to WebAssembly owing to their differing targets and runtime semantics. This paper introduces WasmHint, a novel solution that leverages deep learning inference to automatically recover high-level parameter and return types from WebAssembly contracts. More specifically, WasmHint constructs a wCFG representation to clarify dependencies within WebAssembly code and simulates its execution to capture type-related operational information. By learning comprehensive code semantics, it infers parameter and return types, with a semantic corrector designed to enhance information coordination. We conduct experiments on a newly constructed dataset containing 77,208 WebAssembly contract functions. The results demonstrate that WasmHint achieves inference accuracies of 80.0% for parameter types and 95.8% for return types, with average improvements of 86.6% and 34.0% over the baseline methods, respectively. ![]() ![]() Shuwei Song, Ting Chen, Ao Qiao, Xiapu Luo, Leqing Wang, Zheyuan He, Ting Wang, Xiaodong Lin, Peng He, Wensheng Zhang, and Xiaosong Zhang (University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada) Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them. ![]() |
|
Luo, Yonggang |
![]() Songyang Yan, Xiaodong Zhang, Kunkun Hao, Haojie Xin, Yonggang Luo, Jucheng Yang, Ming Fan, Chao Yang, Jun Sun, and Zijiang Yang (Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China) The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS's reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulators using various ADSs. We verify OSG's ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels. ![]() ![]() |
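A minimal sketch of the risk-regulation idea behind OSG: scenario parameters are drawn from a distribution fit on natural traffic and pulled toward a critical corner case in proportion to a scalar risk level. The parameters, distributions, and criticality proxy below are illustrative assumptions and do not reflect OSG's actual scenario model.

```python
import random

# Illustrative on-demand scenario sampling with a scalar risk level. A scenario
# is reduced to two parameters: the lead vehicle's initial gap (m) and its
# deceleration (m/s^2). Higher risk pulls samples toward smaller gaps and
# harder braking; risk_level = 0 reproduces the "natural" distribution.
def sample_scenario(risk_level, rng=random.Random(0)):
    assert 0.0 <= risk_level <= 1.0
    natural_gap = rng.gauss(mu=40.0, sigma=10.0)      # as if fit on real traffic data
    natural_decel = rng.gauss(mu=2.0, sigma=0.8)
    critical_gap, critical_decel = 8.0, 7.5           # a hand-picked critical corner
    gap = (1 - risk_level) * natural_gap + risk_level * critical_gap
    decel = (1 - risk_level) * natural_decel + risk_level * critical_decel
    return {"gap_m": max(gap, 2.0), "decel_mps2": max(decel, 0.5)}

if __name__ == "__main__":
    for level in (0.0, 0.5, 1.0):
        print(level, sample_scenario(level))
```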
|
Lyu, Jun |
![]() Yue Li, Bohan Liu, Ting Zhang, Zhiqi Wang, David Lo, Lanxin Yang, Jun Lyu, and He Zhang (Nanjing University, China; Singapore Management University, Singapore) ![]() |
|
Lyu, Michael |
![]() Chaozheng Wang, Jia Feng, Shuzheng Gao, Cuiyun Gao, Zongjie Li, Ting Peng, Hailiang Huang, Yuetang Deng, and Michael Lyu (Chinese University of Hong Kong, China; University of Electronic Science and Technology of China, China; Hong Kong University of Science and Technology, China; Tencent, China) Large Code Models (LCMs) have demonstrated remarkable effectiveness across various code intelligence tasks. Supervised fine-tuning is essential to optimize their performance for specific downstream tasks. Compared with the traditional full-parameter fine-tuning (FFT) method, Parameter-Efficient Fine-Tuning (PEFT) methods can train LCMs with substantially reduced resource consumption and have gained widespread attention among researchers and practitioners. While existing studies have explored PEFT methods for code intelligence tasks, they have predominantly focused on a limited subset of scenarios, such as code generation with publicly available datasets, leading to constrained generalizability of the findings. To mitigate the limitation, we conduct a comprehensive study on exploring the effectiveness of the PEFT methods, which involves five code intelligence tasks containing both public and private data. Our extensive experiments reveal a considerable performance gap between PEFT methods and FFT, which is contrary to the findings of existing studies. We also find that this disparity is particularly pronounced in tasks involving private data. To improve the tuning performance for LCMs while reducing resource utilization during training, we propose a Layer-Wise Optimization (LWO) strategy in the paper. LWO incrementally updates the parameters of each layer in the whole model architecture, without introducing any additional component and inference overhead. Experiments across five LCMs and five code intelligence tasks demonstrate that LWO trains LCMs more effectively and efficiently compared to previous PEFT methods, with significant improvements in tasks using private data. For instance, in the line-level code completion task using our private code repositories, LWO outperforms the state-of-the-art LoRA method by 22% and 12% in terms of accuracy and BLEU scores, respectively. Furthermore, LWO can enable more efficient LCM tuning, reducing the training time by an average of 42.7% compared to LoRA. ![]() ![]() Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael Lyu (Chinese University of Hong Kong, China; Singapore Management University, Singapore) ![]() ![]() Junjie Huang, Zhihan Jiang, Zhuangbin Chen, and Michael Lyu (Chinese University of Hong Kong, China; Sun Yat-sen University, China) ![]() ![]() Weibin Wu, Haoxuan Hu, Zhaoji Fan, Yitong Qiao, Yizhan Huang, Yichen Li, Zibin Zheng, and Michael Lyu (Sun Yat-sen University, China; Chinese University of Hong Kong, China) Deep learning (DL) has revolutionized various software engineering tasks. Particularly, the emergence of AI code generators has pushed the boundaries of automatic programming to synthesize entire programs based on user-defined specifications in natural language. However, it remains a mystery if these AI code generators rely on the copy-and-paste programming practice, resulting in code clone concerns. In this work, to comprehensively study the code cloning behavior of AI code generators, we conduct an empirical study on three state-of-the-art commercial AI code generators to investigate the existence of all types of clones, which remains underexplored. 
Our experimental results show that the total Type-1 and Type-2 clone rates of the state-of-the-art commercial AI code generators can reach up to 7.50%, indicating marked code clone issues. Furthermore, it is observed that AI code generators risk infringing copyrights and propagating buggy and vulnerable code resulting from cloning code and show a certain degree of stability in generating code clones. ![]() |
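A minimal sketch of the layer-wise optimization (LWO) idea from the first entry above: the model's layers are updated one at a time, cycling across training steps, without adding adapter modules. The toy model, data, and optimizer settings are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

# Layer-wise optimization sketch: parameters are updated one layer at a time,
# cycling over layers across steps, so no extra components (adapters, LoRA
# matrices) are introduced. Toy model and data only.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32),
                      nn.ReLU(), nn.Linear(32, 2))
layers = [m for m in model if any(True for _ in m.parameters(recurse=False))]
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 16)
y = torch.randint(0, 2, (64,))

for step in range(30):
    active = layers[step % len(layers)]        # the only layer trained this step
    for p in model.parameters():
        p.requires_grad_(False)
    for p in active.parameters():
        p.requires_grad_(True)
    optimizer = torch.optim.AdamW(active.parameters(), lr=1e-3)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                            # gradients flow only into `active`
    optimizer.step()
```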
|
Ma, Haoyang |
![]() Weiqi Lu, Yongqiang Tian, Xiaohan Zhong, Haoyang Ma, Zhenyang Xu, Shing-Chi Cheung, and Chengnian Sun (Hong Kong University of Science and Technology, China; University of Waterloo, Canada) Data visualization (DataViz) libraries play a crucial role in presentation, data analysis, and application development, underscoring the importance of their accuracy in transforming data into visual representations. Incorrect visualizations can adversely impact user experience, distort information conveyance, and influence user perception and decision-making processes. Visual bugs in these libraries can be particularly insidious as they may not cause obvious errors like crashes, but instead mislead users of the underlying data graphically, resulting in wrong decision making. Consequently, a good understanding of the unique characteristics of bugs in DataViz libraries is essential for researchers and developers to detect and fix bugs in DataViz libraries. This study presents the first comprehensive analysis of bugs in DataViz libraries, examining 564 bugs collected from five widely-used libraries. Our study systematically analyzes their symptoms and root causes, and provides a detailed taxonomy. We found that incorrect/inaccurate plots are pervasive in DataViz libraries and incorrect graphic computation is the major root cause, which necessitates further automated testing methods for DataViz libraries. Moreover, we identified eight key steps to trigger such bugs and two test oracles specific to DataViz libraries, which may inspire future research in designing effective automated testing techniques. Furthermore, with the recent advancements in Vision Language Models (VLMs), we explored the feasibility of applying these models to detect incorrect/inaccurate plots. The results show that the effectiveness of VLMs in bug detection varies from 29% to 57%, depending on the prompts, and adding more information in prompts does not necessarily increase the effectiveness. Our findings offer valuable insights into the nature and patterns of bugs in DataViz libraries, providing a foundation for developers and researchers to improve library reliability, and ultimately benefit more accurate and reliable data visualizations across various domains. ![]() |
|
Ma, Lei |
![]() Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, and Lei Ma (University of Alberta, Canada; University of Tokyo, Japan) ![]() ![]() Jiongchi Yu, Xie Xiaofei, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, and Frank Liau (Singapore Management University, Singapore; Tianjin University, China; Zhejiang University, China; Shanghai Jiao Tong University, China; University of Tokyo, Japan; University of Alberta, Canada; Southern Cross University, Australia; Government Technology Agency of Singapore, Singapore) ![]() |
|
Ma, Yuchi |
![]() Yujia Chen, Yang Ye, Zhongqi Li, Yuchi Ma, and Cuiyun Gao (Harbin Institute of Technology, Shenzhen, China; Huawei Cloud Computing Technologies, China) Large code models (LCMs) have remarkably advanced the field of code generation. Despite their impressive capabilities, they still face practical deployment issues, such as high inference costs, limited accessibility of proprietary LCMs, and adaptability issues of ultra-large LCMs. These issues highlight the critical need for more accessible, lightweight yet effective LCMs. Knowledge distillation (KD) offers a promising solution, which transfers the programming capabilities of larger, advanced LCMs (Teacher) to smaller, less powerful LCMs (Student). However, existing KD methods for code intelligence often lack consideration of fault domain knowledge and rely on static seed knowledge, leading to degraded programming capabilities of student models. In this paper, we propose a novel Self-Paced knOwledge DistillAtion framework, named SODA, aiming at developing lightweight yet effective student LCMs via adaptively transferring the programming capabilities from advanced teacher LCMs. SODA consists of three stages in one cycle: (1) Correct-and-Fault Knowledge Delivery stage aims at improving the student model’s capability to recognize errors while ensuring its basic programming skill during the knowledge transferring, which involves correctness-aware supervised learning and fault-aware contrastive learning methods. (2) Multi-View Feedback stage aims at measuring the quality of results generated by the student model from two views, including model-based and static tool-based measurement, for identifying the difficult questions; (3) Feedback-based Knowledge Update stage aims at updating the student model adaptively by generating new questions at different difficulty levels, in which the difficulty levels are categorized based on the feedback in the second stage. By performing the distillation process iteratively, the student model is continuously refined through learning more advanced programming skills from the teacher model. We compare SODA with four state-of-the-art KD approaches on three widely-used code generation benchmarks with different programming languages. Experimental results show that SODA improves the student model by 65.96% in terms of average Pass@1, outperforming the best baseline PERsD by 29.85%. Based on the proposed SODA framework, we develop SodaCoder, a series of lightweight yet effective LCMs with ∼7B parameters, which outperform 15 LCMs with less than or equal to 16B parameters. Notably, SodaCoder-DS-6.7B, built on DeepseekCoder-6.7B, even surpasses the prominent ChatGPT on average Pass@1 across seven programming languages (66.4 vs. 61.3). ![]() ![]() Yanlin Wang, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Mingzhi Mao, Xilin Liu, Yuchi Ma, and Zibin Zheng (Sun Yat-sen University, China; Huawei Cloud Computing Technologies, China) Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation, while coding style differences between LLMs and human developers remain under-explored. In this paper, we empirically analyze the differences in coding style between the code generated by mainstream LLMs and the code written by human developers, and summarize coding style inconsistency taxonomy. 
Specifically, we first summarize the types of coding style inconsistencies by manually analyzing a large number of generation results. We then compare the code generated by LLMs with the code written by human programmers in terms of readability, conciseness, and robustness. The results reveal that LLMs and developers exhibit differences in coding style. Additionally, we study the possible causes of these inconsistencies and provide some solutions to alleviate the problem. ![]() |
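A rough sketch of the "correct-and-fault knowledge delivery" idea from the SODA entry above, assuming toy tensors in place of a real student model: a supervised loss on the teacher's correct solution is combined with a margin-based contrastive term that keeps the student's representation of the problem closer to correct code than to a faulty variant. The loss form and names are illustrative assumptions, not SODA's actual objective.

```python
import torch
import torch.nn.functional as F

# Toy combination of correctness-aware supervised learning with a fault-aware
# contrastive term. `student_logits` stands in for the student's next-token
# predictions on the correct solution; h_* are pooled hidden states.
def correct_and_fault_loss(student_logits, correct_token_ids,
                           h_anchor, h_correct, h_faulty, margin=0.2):
    # (1) correctness-aware supervised loss on the correct solution
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         correct_token_ids.view(-1))
    # (2) fault-aware contrastive loss: the anchor (problem description) should
    # sit closer to the correct solution than to the faulty one, by a margin
    sim_pos = F.cosine_similarity(h_anchor, h_correct, dim=-1)
    sim_neg = F.cosine_similarity(h_anchor, h_faulty, dim=-1)
    contrastive = F.relu(margin + sim_neg - sim_pos).mean()
    return ce + contrastive

if __name__ == "__main__":
    logits = torch.randn(4, 12, 100)              # batch x seq x vocab
    targets = torch.randint(0, 100, (4, 12))
    h = [torch.randn(4, 256) for _ in range(3)]
    print(correct_and_fault_loss(logits, targets, *h))
```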
|
Maalej, Walid |
![]() Christian Rahe and Walid Maalej (University of Hamburg, Germany) Programming students have widespread access to powerful Generative AI tools like ChatGPT. While this can help them understand the learning material and assist with exercises, educators are voicing more and more concerns about overreliance on generated outputs and a lack of critical thinking skills. It is thus important to understand how students actually use generative AI and what impact this could have on their learning behavior. To this end, we conducted a study including an exploratory experiment with 37 programming students, giving them monitored access to ChatGPT while solving a code authoring exercise. The task was not directly solvable by ChatGPT and required code comprehension and reasoning. While only 23 of the students actually opted to use the chatbot, the majority of those eventually prompted it to simply generate a full solution. We observed two prevalent usage strategies: to seek knowledge about general concepts and to directly generate solutions. Instead of using the bot to comprehend the code and their own mistakes, students often got trapped in a vicious cycle of submitting wrong generated code and then asking the bot for a fix. Those who self-reported using generative AI regularly were more likely to prompt the bot to generate a solution. Our findings indicate that concerns about a potential decrease in programmers' agency and productivity with Generative AI are justified. We discuss how researchers and educators can respond to the potential risk of students uncritically over-relying on Generative AI. We also discuss potential modifications to our study design for large-scale replications. ![]() |
|
Mailach, Alina |
![]() Max Weber, Alina Mailach, Sven Apel, Janet Siegmund, Raimund Dachselt, and Norbert Siegmund (Leipzig University, Germany; ScaDS.AI Dresden-Leipzig, Germany; Saarland Informatics Campus, Germany; Saarland University, Germany; Chemnitz University of Technology, Germany; Dresden University of Technology, Germany) Debugging performance bugs in configurable software systems is a complex and time-consuming task that requires not only fixing a bug, but also understanding its root cause. While there is a vast body of literature on debugging strategies, there is no consensus on general debugging. This makes it difficult to provide concrete guidance for developers, especially for configuration-dependent performance bugs. The goal of our work is to alleviate this situation by providing a framework to describe debugging strategies in a more general, unifying way. We conducted a user study with 12 professional developers who debugged a performance bug in a real-world configurable system. To observe developers in an unobtrusive way, we provided an immersive virtual reality tool, SoftVR, giving them a large degree of freedom to choose their preferred debugging strategy. The results show that the existing documentation of strategies is too coarse-grained and intermixed to identify successful approaches. In a subsequent qualitative analysis, we devised a coding framework to reason about debugging approaches. With this framework, we identified five goal-oriented episodes that developers employ, which they also confirmed in subsequent interviews. Our work provides a unified description of debugging strategies, giving researchers a common foundation for studying debugging, and giving practitioners and teachers guidance on successful debugging strategies. ![]() |
|
Malek, Sam |
![]() Ziyao He, Syed Fatiul Huq, and Sam Malek (University of California at Irvine, USA) ![]() |
|
Mäntylä, Mika V. |
![]() Yuqing Wang, Mika V. Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä (University of Helsinki, Finland; University of Oulu, Finland; University of Antwerp, Belgium) Microservice-based systems (MSS) may fail with various fault types, due to their complex and dynamic nature. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), human efforts from practitioners are still required for further root cause analysis to diagnose specific fault types and analyze failure reasons for detected abnormal traces, particularly when abnormal traces do not stem directly from specific services. In this paper, we propose a novel AIOps framework, TraFaultDia, to automatically classify abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for a MSS. TraFaultDia is trained on several abnormal trace classification tasks with a few labeled instances from a MSS using a meta-learning approach. After training, TraFaultDia can quickly adapt to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. TraFaultDia’s use cases are scalable depending on how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two representative MSS, TrainTicket and OnlineBoutique, with open datasets. In these datasets, each fault category is tied to the faulty system component(s) (service/pod) with a root cause. Our TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. Our results show that, within the MSS it is trained on, TraFaultDia achieves an average accuracy of 93.26% and 85.20% across 50 new, unseen abnormal trace classification tasks for TrainTicket and OnlineBoutique respectively, when provided with 10 labeled instances for each fault category per task in each system. In the cross-system context, when TraFaultDia is applied to a MSS different from the one it is trained on, TraFaultDia gets an average accuracy of 92.19% and 84.77% for the same set of 50 new, unseen abnormal trace classification tasks of the respective systems, also with 10 labeled instances provided for each fault category per task in each system. ![]() |
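TraFaultDia is trained on few-shot classification tasks; the sketch below shows one common way such N-way K-shot episodes can be assembled from labeled abnormal traces. The trace representation and episode sizes are assumptions for illustration, not the paper's pipeline.

```python
import random
from collections import defaultdict

# Minimal sketch of building N-way K-shot episodes for meta-learning over
# labeled abnormal traces. `traces` is a list of (feature_vector, fault_label)
# pairs; the features and labels below are purely illustrative.
def make_episode(traces, n_way=3, k_shot=10, q_query=5, rng=random.Random(42)):
    by_label = defaultdict(list)
    for features, label in traces:
        by_label[label].append(features)
    eligible = [l for l, xs in by_label.items() if len(xs) >= k_shot + q_query]
    labels = rng.sample(eligible, n_way)
    support, query = [], []
    for label in labels:
        picked = rng.sample(by_label[label], k_shot + q_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

if __name__ == "__main__":
    toy = [([random.random()] * 8, f"fault_{i % 4}") for i in range(200)]
    s, q = make_episode(toy)
    print(len(s), len(q))   # -> 30 15
```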
|
Mao, Mingzhi |
![]() Yanlin Wang, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Mingzhi Mao, Xilin Liu, Yuchi Ma, and Zibin Zheng (Sun Yat-sen University, China; Huawei Cloud Computing Technologies, China) Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation, while coding style differences between LLMs and human developers remain under-explored. In this paper, we empirically analyze the differences in coding style between the code generated by mainstream LLMs and the code written by human developers, and summarize coding style inconsistency taxonomy. Specifically, we first summarize the types of coding style inconsistencies by manually analyzing a large number of generation results. We then compare the code generated by LLMs with the code written by human programmers in terms of readability, conciseness, and robustness. The results reveal that LLMs and developers exhibit differences in coding style. Additionally, we study the possible causes of these inconsistencies and provide some solutions to alleviate the problem. ![]() |
|
Mao, Xiaoguang |
![]() Haoran Liu, Shanshan Li, Zhouyang Jia, Yuanliang Zhang, Linxiao Bai, Si Zheng, Xiaoguang Mao, and Xiangke Liao (National University of Defense Technology, China) ![]() ![]() Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Weiguo Chen, Zixuan Dong, Mingyue Leng, and Xiaoguang Mao (National University of Defense Technology, China) The current landscape of binary code summarization predominantly focuses on generating a single summary, which limits its utility and understanding for reverse engineers. Existing approaches often fail to meet the diverse needs of users, such as providing detailed insights into usage patterns, implementation nuances, and design rationale, as observed in the field of source code summarization. This highlights the need for multi-intent binary code summarization to enhance the effectiveness of reverse engineering processes. To address this gap, we propose MiSum, a novel method that leverages multi-modality heterogeneous code graph alignment and learning to integrate both assembly code and pseudo-code. MiSum introduces a unified multi-modality heterogeneous code graph (MM-HCG) that aligns assembly code graphs with pseudo-code graphs, capturing both low-level execution details and high-level structural information. We further propose multi-modality heterogeneous graph learning with heterogeneous mutual attention and message passing, which highlights important code blocks and discovers inter-dependencies across different code forms. Additionally, an intent-aware summary generator with an intent-aware attention mechanism is introduced to produce customized summaries tailored to multiple intents. Extensive experiments, including evaluations across various architectures and optimization levels, demonstrate that MiSum outperforms state-of-the-art baselines in BLEU, METEOR, and ROUGE-L metrics. Human evaluations validate its capability to effectively support reverse engineers in understanding diverse binary code intents, marking a significant advancement in binary code analysis. ![]() ![]() ![]() |
|
Massacci, Fabio |
![]() Ranindya Paramitha, Yuan Feng, and Fabio Massacci (University of Trento, Italy; Vrije Universiteit Amsterdam, Netherlands) Vulnerability datasets used for ML testing implicitly contain retrospective information. When tested on the field, one can only use the labels available at the time of training and testing (e.g. seen and assumed negatives). As vulnerabilities are discovered across calendar time, labels change and past performance is not necessarily aligned with future performance. Past works only considered the slices of the whole history (e.g. DiverseVUl) or individual differences between releases (e.g. Jimenez et al. ESEC/FSE 2019). Such approaches are either too optimistic in training (e.g. the whole history) or too conservative (e.g. consecutive releases). We propose a method to restructure a dataset into a series of datasets in which both training and testing labels change to account for the knowledge available at the time. If the model is actually learning, it should improve its performance over time as more data becomes available and data becomes more stable, an effect that can be checked with the Mann-Kendall test. We validate our methodology for vulnerability detection with 4 time-based datasets (3 projects from BigVul dataset + Vuldeepecker’s NVD) and 5 ML models (Code2Vec, CodeBERT, LineVul, ReGVD, and Vuldeepecker). In contrast to the intuitive expectation (more retrospective information, better performance), the trend results show that performance changes inconsistently across the years, showing that most models are not learning. ![]() |
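The learning-trend check mentioned above can be reproduced with a plain Mann-Kendall test over per-year scores. The sketch below uses the standard S statistic with a normal approximation and made-up yearly F1 values; it omits the tie correction.

```python
import math

# Minimal Mann-Kendall trend test on a sequence of yearly scores (no tie
# correction, illustrative scores). S > 0 suggests an upward trend; the
# normal approximation gives a two-sided p-value.
def mann_kendall(values):
    n = len(values)
    s = sum((values[j] > values[i]) - (values[j] < values[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return s, z, p

if __name__ == "__main__":
    yearly_f1 = [0.42, 0.45, 0.41, 0.47, 0.44, 0.46, 0.43]   # made-up values
    print(mann_kendall(yearly_f1))   # small |z|, large p: no clear learning trend
```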
|
Meng, Guozhu |
![]() Kunsong Zhao, Zihao Li, Weimin Chen, Xiapu Luo, Ting Chen, Guozhu Meng, and Yajin Zhou (Hong Kong Polytechnic University, China; University of Electronic Science and Technology of China, China; Institute of Information Engineering at Chinese Academy of Sciences, China; Zhejiang University, China) WebAssembly has become the preferred smart contract format for various blockchain platforms due to its high portability and near-native execution speed. To effectively understand WebAssembly contracts, it is crucial to recover high-level type signatures because of the limited type information that WebAssembly provides. However, existing studies on type inference for smart contracts primarily center around Ethereum Virtual Machine bytecode, which is not applicable to WebAssembly owing to their differing targets and runtime semantics. This paper introduces WasmHint, a novel solution that leverages deep learning inference to automatically recover high-level parameter and return types from WebAssembly contracts. More specifically, WasmHint constructs a wCFG representation to clarify dependencies within WebAssembly code and simulates its execution to capture type-related operational information. By learning comprehensive code semantics, it infers parameter and return types, with a semantic corrector designed to enhance information coordination. We conduct experiments on a newly constructed dataset containing 77,208 WebAssembly contract functions. The results demonstrate that WasmHint achieves inference accuracies of 80.0% for parameter types and 95.8% for return types, with average improvements of 86.6% and 34.0% over the baseline methods, respectively. ![]() |
|
Meng, Haining |
![]() Liqing Cao, Haofeng Li, Chenghang Shi, Jie Lu, Haining Meng, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) ![]() |
|
Meng, Na |
![]() Md Mahir Asef Kabir, Xiaoyin Wang, and Na Meng (Virginia Tech, USA; University of Texas at San Antonio, USA) When building enterprise applications (EAs) on Java frameworks (e.g., Spring), developers often configure application components via metadata (i.e., Java annotations and XML files). It is challenging for developers to correctly use metadata, because the usage rules can be complex and existing tools provide limited assistance. When developers misuse metadata, EAs become misconfigured, and the resulting defects can trigger erroneous runtime behaviors or introduce security vulnerabilities. To help developers correctly use metadata, this paper presents (1) RSL, a domain-specific language that domain experts can adopt to prescribe metadata checking rules, and (2) MeCheck, a tool that takes in RSL rules and EAs to check for rule violations. With RSL, domain experts (e.g., developers of a Java framework) can specify metadata checking rules by defining content consistency among XML files, annotations, and Java code. Given such RSL rules and a program to scan, MeCheck interprets the rules as cross-file static analyzers, which scan Java and/or XML files to gather information and look for consistency violations. For evaluation, we studied the Spring and JUnit documentation to manually define 15 rules, and created 2 datasets with 115 open-source EAs. The first dataset includes 45 EAs, and the ground truth of 45 manually injected bugs. The second dataset includes multiple versions of 70 EAs. We observed that MeCheck identified bugs in the first dataset with 100% precision, 96% recall, and 98% F-score. It reported 152 bugs in the second dataset, 49 of which were already fixed by developers. Our evaluation shows that MeCheck helps ensure the correct usage of metadata. ![]() |
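For illustration only, the sketch below shows the flavor of cross-file consistency rule such a checker evaluates, not RSL or MeCheck itself: every bean name referenced via @Qualifier in Java sources should be declared as a bean id in the Spring XML configuration. The rule, file layout, and regex are simplified assumptions.

```python
import re
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

# Illustrative cross-file metadata consistency check (not MeCheck/RSL itself):
# every @Qualifier("name") used in Java sources should match a bean id
# declared in the Spring XML configuration. The rule is deliberately simplified.
QUALIFIER = re.compile(r'@Qualifier\s*\(\s*"([^"]+)"\s*\)')

def declared_bean_ids(xml_path):
    root = ET.parse(xml_path).getroot()
    # ignore namespaces so the check works for both plain and namespaced configs
    return {el.attrib["id"] for el in root.iter()
            if el.tag.split("}")[-1] == "bean" and "id" in el.attrib}

def check(project_dir):
    project = Path(project_dir)
    beans = set()
    for xml_file in project.rglob("*.xml"):
        try:
            beans |= declared_bean_ids(xml_file)
        except ET.ParseError:
            continue
    violations = []
    for java_file in project.rglob("*.java"):
        for name in QUALIFIER.findall(java_file.read_text(errors="ignore")):
            if name not in beans:
                violations.append((str(java_file), name))
    return violations

if __name__ == "__main__":
    for path, name in check(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(f"{path}: @Qualifier refers to undeclared bean '{name}'")
```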
|
Meng, Wei |
![]() Yinxi Liu, Wei Meng, and Yinqian Zhang (Rochester Institute of Technology, USA; Chinese University of Hong Kong, China; Southern University of Science and Technology, China) Ethereum smart contracts determine state transition results not only by the previous states, but also by a mutable global state consisting of storage variables. This has resulted in state-inconsistency bugs, which grant an attacker the ability to modify contract states either through recursive function calls to a contract (reentrancy), or by exploiting transaction order dependence (TOD). Current studies have determined that identifying data races on global storage variables can capture all state-inconsistency bugs. Nevertheless, eliminating false positives poses a significant challenge, given the extensive number of execution paths that could potentially cause a data race. For simplicity, existing research considers a data race to be vulnerable as long as the variable involved could have inconsistent values under different execution orders. However, such a data race could be benign when the inconsistent value does not affect any critical computation or decision-making process in the program. Besides, the data race could also be infeasible when there is no valid state in the contract that allows the execution of both orders. In this paper, we aim to appreciably reduce these false positives without introducing false negatives. We present DivertScan, a precise framework to detect exploitable state-inconsistency bugs in smart contracts. We first introduce the use of flow divergence to check where the involved variable may flow to. This allows DivertScan to precisely infer the potential effects of a data race and determine whether it can be exploited for inducing unexpected program behaviors. We also propose multiplex symbolic execution to examine different execution orders in one time of solving. This helps DivertScan to determine whether a common starting state could potentially exist. To address the scalability issue in symbolic execution, DivertScan utilizes an overapproximated pre-checking and a selective exploration strategy. As a result, it only needs to explore a limited state space. DivertScan significantly outperformed state-of-the-art tools by improving the precision rate by 20.72% to 74.93% while introducing no false negatives. It also identified five exploitable real-world vulnerabilities that other tools missed. The detected vulnerabilities could potentially lead to a loss of up to $68.2M, based on trading records and rate limits. ![]() |
|
Meng, Xiangxin |
![]() Xu Wang, Mingming Zhang, Xiangxin Meng, Jian Zhang, Yang Liu, and Chunming Hu (Beihang University, China; Zhongguancun Laboratory, China; Ministry of Education, China; Nanyang Technological University, Singapore) Deep Neural Networks (DNNs) are prevalent across a wide range of applications. Despite their success, the complexity and opaque nature of DNNs pose significant challenges in debugging and repairing DNN models, limiting their reliability and broader adoption. In this paper, we propose MLM4DNN, an element-based automated DNN repair method. Unlike previous techniques that focus on post-training adjustments or rely heavily on predefined bug patterns, MLM4DNN repairs DNNs by leveraging a fine-tuned Masked Language Model (MLM) to predict correct fixes for nine predefined key elements in DNNs. We construct a large-scale dataset by masking nine key elements from the correct DNN source code and then force the MLM to restore the correct elements to learn the deep semantics that ensure the normal functionalities of DNNs. Afterwards, a light-weight static analysis tool is designed to filter out low-quality patches to enhance the repair efficiency. We introduce a patch validation method specifically for DNN repair tasks, which consists of three evaluation metrics from different aspects to model the effectiveness of generated patches. We construct a benchmark, BenchmarkAPR4DNN, including 51 buggy DNN models and an evaluation tool that outputs the three metrics. We evaluate MLM4DNN against six baselines on BenchmarkAPR4DNN, and results show that MLM4DNN outperforms all state-of-the-art baselines, including two dynamic-based and four zero-shot learning-based methods. After applying the fine-tuned MLM design to several prevalent Large Language Models (LLMs), we consistently observe improved performance in DNN repair tasks compared to the original LLMs, which demonstrates the effectiveness of the method proposed in this paper. ![]() |
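The core fill-mask step behind an MLM-based repair can be sketched with the Hugging Face transformers pipeline, as below. The model name is an assumption (any RoBERTa-style code MLM with a `<mask>` token would do), the masked element is a toy example, and MLM4DNN's patch filtering and validation are omitted.

```python
from transformers import pipeline

# Illustrative fill-mask step (not MLM4DNN itself): mask one key element of a
# DNN training snippet and let a masked language model propose replacements.
# The model name below is an assumption; candidate filtering is omitted.
fill = pipeline("fill-mask", model="microsoft/codebert-base-mlm")

buggy_line = "model.compile(optimizer='sgd', loss='<mask>', metrics=['accuracy'])"
for candidate in fill(buggy_line, top_k=5):
    print(f"{candidate['score']:.3f}  {candidate['sequence']}")
```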
|
Milewicz, Reed |
![]() Addi Malviya Thakur, Reed Milewicz, Mahmoud Jahanshahi, Lavínia Paganini, Bogdan Vasilescu, and Audris Mockus (University of Tennessee at Knoxville, USA; Oak Ridge National Laboratory, USA; Sandia National Laboratories, USA; Eindhoven University of Technology, Netherlands; Carnegie Mellon University, USA) ![]() |
|
Ming, Jiang |
![]() Xiaolei Ren, Mengfei Ren, Yu Lei, and Jiang Ming (Macau University of Science and Technology, China; University of Alabama in Huntsville, USA; University of Texas at Arlington, USA; Tulane University, USA) ![]() |
|
Mockus, Audris |
![]() Addi Malviya Thakur, Reed Milewicz, Mahmoud Jahanshahi, Lavínia Paganini, Bogdan Vasilescu, and Audris Mockus (University of Tennessee at Knoxville, USA; Oak Ridge National Laboratory, USA; Sandia National Laboratories, USA; Eindhoven University of Technology, Netherlands; Carnegie Mellon University, USA) ![]() |
|
Moumoula, Micheline Bénédicte |
![]() Micheline Bénédicte Moumoula, Abdoul Kader Kaboré, Jacques Klein, and Tegawendé F. Bissyandé (University of Luxembourg, Luxembourg; Interdisciplinary Centre of Excellence in AI for Development (CITADEL), Burkina Faso) With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We evaluate the performance of five LLMs and eight prompts for the identification of cross-lingual code clones. Additionally, we compare these results against two baseline methods. Finally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. The studies involving LLMs and embedding models are evaluated using two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.99, for straightforward programming examples. However, they not only perform less well on programs associated with complex programming challenges but also do not necessarily understand the meaning of “code clones” in a cross-lingual setting. We show that embedding models used to represent code fragments from different programming languages in the same representation space enable the training of a basic classifier that outperforms all LLMs by ∼1 and ∼20 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, embeddings provided by embedding models offer suitable representations to achieve state-of-the-art performance in cross-lingual code clone detection. ![]() |
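The paper's embedding-based baseline can be illustrated with a basic classifier over code-pair embeddings. In the sketch below, random vectors stand in for real cross-lingual code embeddings, and the pairing features and classifier are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of a basic clone/non-clone classifier over code-pair embeddings.
# Random vectors stand in for real embeddings; in practice each pair would be
# (embedding of a Java snippet, embedding of its counterpart in another
# language) produced by a pretrained code embedding model.
rng = np.random.default_rng(0)
n_pairs, dim = 400, 128
emb_a = rng.normal(size=(n_pairs, dim))
noise = rng.normal(scale=0.3, size=(n_pairs, dim))
labels = rng.integers(0, 2, size=n_pairs)                  # 1 = clone pair
# clones get a second embedding close to the first; non-clones get a fresh one
emb_b = np.where(labels[:, None] == 1, emb_a + noise, rng.normal(size=(n_pairs, dim)))

def pair_features(a, b):
    # element-wise interactions are a common, simple pairing scheme
    return np.concatenate([np.abs(a - b), a * b], axis=1)

X = pair_features(emb_a, emb_b)
clf = LogisticRegression(max_iter=1000).fit(X[:300], labels[:300])
print("held-out accuracy:", clf.score(X[300:], labels[300:]))
```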
|
Myers, Brad |
![]() Jenny T. Liang, Melissa Lin, Nikitha Rao, and Brad Myers (Carnegie Mellon University, USA) ![]() ![]() Matthew Davis, Amy Wei, Brad Myers, and Joshua Sunshine (Carnegie Mellon University, USA; University of Michigan, USA) Software testing is difficult, tedious, and may consume 28%–50% of software engineering labor. Automatic test generators aim to ease this burden but have important trade-offs. Fuzzers use an implicit oracle that can detect obviously invalid results, but the oracle problem has no general solution, and an implicit oracle cannot automatically evaluate correctness. Test suite generators like EvoSuite use the program under test as the oracle and therefore cannot evaluate correctness. Property-based testing tools evaluate correctness, but users have difficulty coming up with properties to test and understanding whether their properties are correct. Consequently, practitioners create many test suites manually and often use an example-based oracle to tediously specify correct input and output examples. To help bridge the gaps among various oracle and tool types, we present the Composite Oracle, which organizes various oracle types into a hierarchy and renders a single test result per example execution. To understand the Composite Oracle’s practical properties, we built TerzoN, a test suite generator that includes a particular instantiation of the Composite Oracle. TerzoN displays all the test results in an integrated view composed from the results of three types of oracles and finds some types of test assertion inconsistencies that might otherwise lead to misleading test results. We evaluated TerzoN in a randomized controlled trial with 14 professional software engineers with a popular industry tool, fast-check, as the control. Participants using TerzoN elicited 72% more bugs (p < 0.01), accurately described more than twice the number of bugs (p < 0.01) and tested 16% more quickly (p < 0.05) relative to fast-check. ![]() ![]() |
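A minimal sketch of the composite-oracle idea behind TerzoN, assuming a simplified API: an implicit oracle (no crash), an example-based oracle, and property-based oracles are consulted in turn and rendered as a single verdict per execution. The combination policy and function names are illustrative, not TerzoN's design.

```python
# Illustrative composite oracle (not TerzoN's actual design): an implicit
# oracle (no crash), an example-based oracle (expected input/output pairs),
# and property-based oracles are combined into a single verdict per call.
def composite_oracle(fn, arg, examples=None, properties=()):
    try:
        result = fn(arg)
    except Exception as exc:                    # implicit oracle: crashes fail
        return "FAIL", f"implicit oracle: {type(exc).__name__}: {exc}"
    if examples and arg in examples and examples[arg] != result:
        return "FAIL", f"example oracle: expected {examples[arg]!r}, got {result!r}"
    for prop in properties:                     # property oracle: invariants
        if not prop(arg, result):
            return "FAIL", f"property oracle: {prop.__name__} violated"
    return "PASS", result

if __name__ == "__main__":
    def buggy_abs(x):
        return x if x > 0 else x                # bug: negative inputs unchanged

    verdict = composite_oracle(
        buggy_abs, -3,
        examples={-3: 3, 4: 4},
        properties=(lambda a, r: r >= 0,),
    )
    print(verdict)    # -> ('FAIL', "example oracle: expected 3, got -3")
```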
|
Nath, Suman |
![]() Tanakorn Leesatapornwongsa, Fazle Faisal, and Suman Nath (Microsoft Research, USA) ![]() |
|
Neelofar, Neelofar |
![]() Lam Nguyen Tung, Steven Cho, Xiaoning Du, Neelofar Neelofar, Valerio Terragni, Stefano Ruberto, and Aldeida Aleti (Monash University, Australia; University of Auckland, New Zealand; Royal Melbourne Institute of Technology, Australia; Joint Research Centre at the European Commission, Italy) ![]() |
|
Ng, Vincent |
![]() Amiao Gao, Zenong Zhang, Simin Wang, LiGuo Huang, Shiyi Wei, and Vincent Ng (Southern Methodist University, Dallas, USA; University of Texas at Dallas, USA) ![]() |
|
Nguyen, Tien N. |
![]() Hridya Dhulipala, Aashish Yadavally, Smit Soneshbhai Patel, and Tien N. Nguyen (University of Texas at Dallas, USA) While LLMs excel in understanding source code and descriptive texts for tasks like code generation, code completion, etc., they exhibit weaknesses in predicting dynamic program behavior, such as code coverage and runtime error detection, which typically require program execution. Aiming to advance the capability of LLMs in reasoning and predicting the program behavior at runtime, we present CRISPE (short for Coverage Rationalization and Intelligent Selection ProcedurE), a novel approach for code coverage prediction. CRISPE guides an LLM in simulating program execution via an execution plan based on two key factors: (1) program semantics of each statement type, and (2) the observation of the set of covered statements at the current “execution” step relative to all feasible code coverage options. We formulate code coverage prediction as a process of semantic-guided execution-based planning, where feasible coverage options are utilized to assess whether the LLM is heading in the correct reasoning. We enhance the traditional generative task with the retrieval-based framework on feasible options of code coverage. Our experimental results show that CRISPE achieves high accuracy in coverage prediction in terms of both exact-match and statement-match coverage metrics, improving over the baselines. We also show that with semantic-guiding and dynamic reasoning from CRISPE, the LLM generates more correct planning steps. To demonstrate CRISPE’s usefulness, we used it in the downstream task of statically detecting runtime error(s) in incomplete code snippets with the given inputs. ![]() ![]() Yi Li, Hridya Dhulipala, Aashish Yadavally, Xiaokai Rong, Shaohua Wang, and Tien N. Nguyen (University of Texas at Dallas, USA; Central University of Finance and Economics, China) Although Large Language Models (LLMs) are highly proficient in understanding source code and descriptive texts, they have limitations in reasoning on dynamic program behaviors, such as execution trace and code coverage prediction, and runtime error prediction, which usually require actual program execution. To advance the ability of LLMs in predicting dynamic behaviors, we leverage the strengths of both approaches, Program Analysis (PA) and LLM, in building PredEx, a predictive executor for Python. Our principle is a blended analysis between PA and LLM to use PA to guide the LLM in predicting execution traces. We break down the task of predictive execution into smaller sub-tasks and leverage the deterministic nature when an execution order can be deterministically decided. When it is not certain, we use predictive backward slicing per variable, i.e., slicing the prior trace to only the parts that affect each variable separately breaks up the valuation prediction into significantly simpler problems. Our empirical evaluation on real-world datasets shows that PredEx achieves 31.5–47.1% relatively higher accuracy in predicting full execution traces than the state-of-the-art models. It also produces 8.6–53.7% more correct execution trace prefixes than those baselines. In predicting next executed statements, its relative improvement over the baselines is 15.7–102.3%. Finally, we show PredEx’s usefulness in two tasks: static code coverage analysis and static prediction of run-time errors for (in)complete code. ![]() |
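PredEx's per-variable backward slicing can be illustrated on a straight-line trace: keep only the steps whose definitions transitively feed the variable of interest. The trace format and example below are assumptions for illustration, not PredEx's representation.

```python
# Simplified per-variable backward slice over a straight-line trace. Each step
# records the executed line, the variable it defines, and the variables it
# uses. The trace format and example are illustrative.
def backward_slice(trace, target_var):
    needed, kept = {target_var}, []
    for step in reversed(trace):
        if step["defines"] in needed:
            kept.append(step)
            needed.discard(step["defines"])
            needed.update(step["uses"])
    return list(reversed(kept))

if __name__ == "__main__":
    trace = [
        {"line": 1, "defines": "a", "uses": []},
        {"line": 2, "defines": "b", "uses": ["a"]},
        {"line": 3, "defines": "c", "uses": []},
        {"line": 4, "defines": "d", "uses": ["b"]},
        {"line": 5, "defines": "e", "uses": ["c"]},
    ]
    print([s["line"] for s in backward_slice(trace, "d")])   # -> [1, 2, 4]
```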
|
Nguyen Tung, Lam |
![]() Lam Nguyen Tung, Steven Cho, Xiaoning Du, Neelofar Neelofar, Valerio Terragni, Stefano Ruberto, and Aldeida Aleti (Monash University, Australia; University of Auckland, New Zealand; Royal Melbourne Institute of Technology, Australia; Joint Research Centre at the European Commission, Italy) ![]() |
|
Nie, Yinan |
![]() Zhengmin Yu, Yuan Zhang, Ming Wen, Yinan Nie, Wenhui Zhang, and Min Yang (Fudan University, China; Huazhong University of Science and Technology, China) ![]() |
|
Niu, Ben |
![]() Xing Su, Hanzhong Liang, Hao Wu, Ben Niu, Fengyuan Xu, and Sheng Zhong (Nanjing University, China; Institute of Information Engineering at Chinese Academy of Sciences, China) Understanding Ethereum smart contract bytecode is essential for ensuring cryptoeconomics security. However, existing decompilers primarily convert bytecode into pseudocode, which is not easily comprehensible for general users, potentially leading to misunderstanding of contract behavior and increased vulnerability to scams or exploits. In this paper, we propose DiSCo, the first LLM-based EVM decompilation pipeline, which aims to enable LLMs to understand the opaque bytecode and lift it into smart contract code. DiSCo introduces three core technologies. First, a logic-invariant intermediate representation is proposed to reproject the low-level bytecode into high-level abstracted units. The second technique involves semantic enhancement based on a novel type-aware graph model to infer variables stripped during compilation, enhancing the lifting effect. The third technology is a flexible method incorporating code specifications to construct LLM-comprehensible prompts for source code generation. Extensive experiments show that the generated code achieves a high compilability rate of 75%, with a differential fuzzing pass rate averaging 50%. Manual validation results further indicate that the generated Solidity contracts significantly outperform baseline methods in tasks such as code comprehension and attack reproduction. ![]() |
|
Niu, Liangkun |
![]() Zhi Tu, Liangkun Niu, Wei Fan, and Tianyi Zhang (Purdue University, USA) ![]() |
|
Noei, Shayan |
![]() Shayan Noei, Heng Li, and Ying Zou (Queen's University, Canada; Polytechnique Montréal, Canada) Refactoring is a technical approach to increase the internal quality of software without altering its external functionalities. Developers often invest significant effort in refactoring. With the increased adoption of continuous integration and deployment (CI/CD), refactoring activities may vary within and across different releases and be influenced by various release goals. For example, developers may consistently allocate refactoring activities throughout a release, or prioritize new features early on in a release and only pick up refactoring late in a release. Different approaches to allocating refactoring tasks may have different implications for code quality. However, there is a lack of existing research on how practitioners allocate their refactoring activities within a release and their impact on code quality. Therefore, we first empirically study the frequent release-wise refactoring patterns in 207 open-source Java projects and their characteristics. Then, we analyze how these patterns and their transitions affect code quality. We identify four major release-wise refactoring patterns: early active, late active, steady active, and steady inactive. We find that adopting the late active pattern—characterized by gradually increasing refactoring activities as the release approaches—leads to the best code quality. We observe that as projects mature, refactoring becomes more active, reflected in the increasing use of the steady active release-wise refactoring pattern and the decreasing utilization of the steady inactive release-wise refactoring pattern. While the steady active pattern shows improvement in quality-related code metrics (e.g., cohesion), it can also lead to more architectural problems. Additionally, we observe that developers tend to adhere to a single refactoring pattern rather than switching between different patterns. The late active pattern, in particular, can be a safe release-wise refactoring pattern that is used repeatedly. Our results can help practitioners understand existing release-wise refactoring patterns and their effects on code quality, enabling them to utilize the most effective pattern to enhance release quality. ![]() |
|
Nyyssölä, Jesse |
![]() Yuqing Wang, Mika V. Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä (University of Helsinki, Finland; University of Oulu, Finland; University of Antwerp, Belgium) Microservice-based systems (MSS) may fail with various fault types, due to their complex and dynamic nature. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), human efforts from practitioners are still required for further root cause analysis to diagnose specific fault types and analyze failure reasons for detected abnormal traces, particularly when abnormal traces do not stem directly from specific services. In this paper, we propose a novel AIOps framework, TraFaultDia, to automatically classify abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for a MSS. TraFaultDia is trained on several abnormal trace classification tasks with a few labeled instances from a MSS using a meta-learning approach. After training, TraFaultDia can quickly adapt to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. TraFaultDia’s use cases are scalable depending on how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two representative MSS, TrainTicket and OnlineBoutique, with open datasets. In these datasets, each fault category is tied to the faulty system component(s) (service/pod) with a root cause. Our TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. Our results show that, within the MSS it is trained on, TraFaultDia achieves an average accuracy of 93.26% and 85.20% across 50 new, unseen abnormal trace classification tasks for TrainTicket and OnlineBoutique respectively, when provided with 10 labeled instances for each fault category per task in each system. In the cross-system context, when TraFaultDia is applied to a MSS different from the one it is trained on, TraFaultDia gets an average accuracy of 92.19% and 84.77% for the same set of 50 new, unseen abnormal trace classification tasks of the respective systems, also with 10 labeled instances provided for each fault category per task in each system. ![]() |
|
Oakes, Bentley |
![]() Mouna Dhaouadi, Bentley Oakes, and Michalis Famelis (Université de Montréal, Canada; Polytechnique Montréal, Canada) ![]() |
|
Orso, Alessandro |
![]() Myeongsoo Kim, Saurabh Sinha, and Alessandro Orso (Georgia Institute of Technology, USA; IBM Research, USA) Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. The widespread adoption of this standard has resulted in the development of many black-box testing tools that generate tests based on OpenAPI specifications. Although Large Language Models (LLMs) have shown promising test-generation abilities, their application to REST API testing remains mostly unexplored. We present LlamaRestTest, a novel approach that employs two custom LLMs (created by fine-tuning and quantizing the Llama3-8B model using mined datasets of REST API example values and inter-parameter dependencies) to generate realistic test inputs and uncover inter-parameter dependencies during the testing process by analyzing server responses. We evaluated LlamaRestTest on 12 real-world services (including popular services such as Spotify), comparing it against RESTGPT, a GPT-powered specification-enhancement tool, as well as several state-of-the-art REST API testing tools, including RESTler, MoRest, EvoMaster, and ARAT-RL. Our results demonstrate that fine-tuning enables smaller models to outperform much larger models in detecting actionable parameter-dependency rules and generating valid inputs for REST API testing. We also evaluated different tool configurations, ranging from the base Llama3-8B model to fine-tuned versions, and explored multiple quantization techniques, including 2-bit, 4-bit, and 8-bit integer formats. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing, balancing effectiveness and efficiency. Furthermore, LlamaRestTest outperforms state-of-the-art REST API testing tools in code coverage achieved and internal server errors identified, even when those tools use RESTGPT-enhanced specifications. Finally, through an ablation study, we show that each component of LlamaRestTest contributes to its overall performance. ![]() |
|
Ou, Rongyi |
![]() Weibin Wu, Yuhang Cao, Ning Yi, Rongyi Ou, and Zibin Zheng (Sun Yat-sen University, China) Question answering (QA) is a fundamental task of large language models (LLMs), which requires LLMs to automatically answer human-posed questions in natural language. However, LLMs are known to distort facts and make non-factual statements (i.e., hallucinations) when dealing with QA tasks, which may affect the deployment of LLMs in real-life situations. In this work, we propose DrHall, a framework for detecting and reducing the factual hallucinations of black-box LLMs with metamorphic testing (MT). We believe that hallucinated answers are unstable. Therefore, when LLMs hallucinate, they are more likely to produce different answers if we use metamorphic testing to make LLMs re-execute the same task with different execution paths, which motivates the design of DrHall. The effectiveness of DrHall is evaluated empirically on three datasets, including a self-built dataset of natural language questions: FactHalluQA, as well as two datasets of programming questions: Refactory and LeetCode. The evaluation results confirm that DrHall can consistently outperform the state-of-the-art baselines, obtaining an average F1 score of over 0.856 for hallucination detection. For hallucination correction, DrHall can also outperform the state-of-the-art baselines, with an average hallucination correction rate of over 53%. We hope that our work can enhance the reliability of LLMs and provide new insights for the research of LLM hallucination mitigation. ![]() |
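As a rough illustration of the metamorphic-stability idea behind DrHall, the sketch below re-asks the same question along differently phrased paths and flags unstable answers; the `ask` callback and the variant prompts are assumptions of this sketch, not the paper's interface.

```python
# Minimal sketch: re-execute the same QA task along different "paths"
# (here, reworded prompts) and treat unstable answers as likely hallucinations.
from collections import Counter
from typing import Callable, Iterable

def detect_hallucination(question: str,
                         ask: Callable[[str], str],
                         variants: Iterable[str],
                         threshold: float = 0.6) -> tuple[str, bool]:
    """Return (majority answer, flagged-as-likely-hallucination)."""
    answers = [ask(question)] + [ask(v) for v in variants]
    majority, support = Counter(a.strip().lower() for a in answers).most_common(1)[0]
    return majority, support / len(answers) < threshold

# Toy usage with a scripted stand-in model that answers inconsistently:
scripted = iter(["42", "42", "41", "43"])
print(detect_hallucination("What is 6 * 7?",
                           ask=lambda q: next(scripted),
                           variants=["Compute 6 times 7.", "Multiply 6 by 7.", "6*7 = ?"]))
# ('42', True)  -> only half of the answers agree, so the response is flagged
```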
|
Pacheco, Michael |
![]() Xu Yang, Wenhan Zhu, Michael Pacheco, Jiayuan Zhou, Shaowei Wang, Xing Hu, and Kui Liu (University of Manitoba, Canada; Huawei, Canada; Zhejiang University, China; Huawei Software Engineering Application Technology Lab, China) Detecting vulnerability fix commits in open-source software is crucial for maintaining software security. To help identify vulnerability fix commits in OSS, several automated approaches have been developed. However, existing approaches like VulFixMiner and CoLeFunDa focus solely on code changes, neglecting essential context from development artifacts. Tools like Vulcurator, which integrates issue reports, fail to leverage semantic associations between different development artifacts (e.g., pull requests and historical vulnerability fixes). Moreover, they miss vulnerability fixes in tangled commits and lack explanations, limiting practical use. Hence, to address those limitations, we propose LLM4VFD, a novel framework that leverages Large Language Models (LLMs) enhanced with Chain-of-Thought reasoning and In-Context Learning to improve the accuracy of vulnerability fix detection. LLM4VFD comprises three components: (1) Code Change Intention, which analyzes commit summaries, purposes, and implications using Chain-of-Thought reasoning; (2) Development Artifact, which incorporates context from related issue reports and pull requests; (3) Historical Vulnerability, which retrieves similar past vulnerability fixes to enrich context. More importantly, on top of the prediction, LLM4VFD also provides a detailed analysis and explanation to help security experts understand the rationale behind the decision. We evaluated LLM4VFD against state-of-the-art techniques, including Pre-trained Language Model-based approaches and vanilla LLMs, using a newly collected dataset, BigVulFixes. Experimental results demonstrate that LLM4VFD significantly outperforms the best-performing existing approach by 68.1%--145.4%. Furthermore, we conducted a user study with security experts, showing that the analysis generated by LLM4VFD improves the efficiency of vulnerability fix identification. ![]() |
|
Paganini, Lavínia |
![]() Addi Malviya Thakur, Reed Milewicz, Mahmoud Jahanshahi, Lavínia Paganini, Bogdan Vasilescu, and Audris Mockus (University of Tennessee at Knoxville, USA; Oak Ridge National Laboratory, USA; Sandia National Laboratories, USA; Eindhoven University of Technology, Netherlands; Carnegie Mellon University, USA) ![]() |
|
Pan, Minxue |
![]() Chun Li, Hui Li, Zhong Li, Minxue Pan, and Xuandong Li (Nanjing University, China; Samsung Electronics, China) Due to advancements in graph neural networks, graph learning-based fault localization (GBFL) methods have achieved promising results. However, as these methods are supervised learning paradigms and deep learning is typically data-hungry, they can only be trained on fully labeled large-scale datasets. This is impractical because labeling failed tests is similar to manual fault localization, which is time-consuming and labor-intensive, meaning that only a small portion of failed tests can be labeled within limited budgets. These data labeling limitations would lead to the sub-optimal effectiveness of supervised GBFL techniques. Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance and address data labeling limitations. However, as these methods are not specifically designed for fault localization, directly utilizing them might lead to sub-optimal effectiveness. In response, we propose a novel semi-supervised GBFL framework, LEGATO. LEGATO first leverages the attention mechanism to identify and augment likely fault-unrelated sub-graphs in unlabeled graphs and then quantifies the suspiciousness distribution of unlabeled graphs to estimate pseudo-labels. Through training the model on augmented unlabeled graphs and pseudo-labels, LEGATO can utilize the unlabeled data to improve the effectiveness of fault localization and address the restrictions in data labeling. In extensive evaluations against three baseline SSL methods, LEGATO demonstrates superior performance, outperforming all of them. ![]() ![]() Shaoheng Cao, Renyi Chen, Wenhua Yang, Minxue Pan, and Xuandong Li (Nanjing University, China; Samsung Electronics, China; Nanjing University of Aeronautics and Astronautics, China) Model-based testing (MBT) has been an important methodology in software engineering, attracting extensive research attention for over four decades. However, despite its academic acclaim, studies examining the impact of MBT in industrial environments—particularly regarding its extended effects—remain limited and yield unclear results. This gap may contribute to the challenges in establishing a study environment for implementing and applying MBT in production settings to evaluate its impact over time. To bridge this gap, we collaborated with an industrial partner to undertake a comprehensive, longitudinal empirical study employing mixed methods. Over two months, we implemented our MBT tool within the corporation, assessing the immediate and extended effectiveness and efficiency of MBT compared to script-writing-based testing. Through a mix of quantitative and qualitative methods—spanning controlled experiments, questionnaire surveys, and interviews—our study uncovers several insightful findings. These include differences in effectiveness and maintainability between immediate and extended MBT application, the evolving perceptions and expectations of engineers regarding MBT, and more. Leveraging these insights, we propose actionable implications for both the academic and industrial communities, aimed at bolstering confidence in MBT adoption and investment for software testing purposes. ![]() |
|
Pan, Rangeet |
![]() Ali Reza Ibrahimzada, Kaiyao Ke, Mrigank Pawagi, Muhammad Salman Abid, Rangeet Pan, Saurabh Sinha, and Reyhaneh Jabbarvand (University of Illinois at Urbana-Champaign, USA; Indian Institute of Science, India; Cornell University, USA; IBM Research, USA) ![]() |
|
Paramitha, Ranindya |
![]() Ranindya Paramitha, Yuan Feng, and Fabio Massacci (University of Trento, Italy; Vrije Universiteit Amsterdam, Netherlands) Vulnerability datasets used for ML testing implicitly contain retrospective information. When tested in the field, one can only use the labels available at the time of training and testing (e.g. seen and assumed negatives). As vulnerabilities are discovered across calendar time, labels change and past performance is not necessarily aligned with future performance. Past works only considered slices of the whole history (e.g. DiverseVul) or individual differences between releases (e.g. Jimenez et al. ESEC/FSE 2019). Such approaches are either too optimistic in training (e.g. the whole history) or too conservative (e.g. consecutive releases). We propose a method to restructure a dataset into a series of datasets in which both training and testing labels change to account for the knowledge available at the time. If the model is actually learning, it should improve its performance over time as more data becomes available and data becomes more stable, an effect that can be checked with the Mann-Kendall test. We validate our methodology for vulnerability detection with 4 time-based datasets (3 projects from the BigVul dataset + Vuldeepecker’s NVD) and 5 ML models (Code2Vec, CodeBERT, LineVul, ReGVD, and Vuldeepecker). In contrast to the intuitive expectation (more retrospective information, better performance), the trend results show that performance changes inconsistently across the years, indicating that most models are not learning. ![]() |
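Since the methodology above relies on the Mann-Kendall test to decide whether performance actually trends upward as more data accumulates, here is a minimal, tie-free implementation of that test; it is an illustrative sketch, not the authors' statistical setup.

```python
# Minimal Mann-Kendall trend test (no tie correction): S sums the signs of all
# pairwise differences; |Z| > 1.96 indicates a significant monotonic trend at
# the 5% level.
import math

def mann_kendall(xs: list[float]) -> tuple[float, str]:
    n = len(xs)
    s = sum((xs[j] > xs[i]) - (xs[j] < xs[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18      # variance of S without ties
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    if abs(z) <= 1.96:
        return z, "no significant trend"
    return z, "increasing" if z > 0 else "decreasing"

# Example: yearly F1 scores of a detector; an upward trend would suggest the
# model benefits from the additional, more stable data.
print(mann_kendall([0.42, 0.45, 0.47, 0.52, 0.55, 0.58, 0.61, 0.66]))
```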
|
Patel, Smit Soneshbhai |
![]() Hridya Dhulipala, Aashish Yadavally, Smit Soneshbhai Patel, and Tien N. Nguyen (University of Texas at Dallas, USA) While LLMs excel in understanding source code and descriptive texts for tasks like code generation, code completion, etc., they exhibit weaknesses in predicting dynamic program behavior, such as code coverage and runtime error detection, which typically require program execution. Aiming to advance the capability of LLMs in reasoning about and predicting program behavior at runtime, we present CRISPE (short for Coverage Rationalization and Intelligent Selection ProcedurE), a novel approach for code coverage prediction. CRISPE guides an LLM in simulating program execution via an execution plan based on two key factors: (1) program semantics of each statement type, and (2) the observation of the set of covered statements at the current “execution” step relative to all feasible code coverage options. We formulate code coverage prediction as a process of semantic-guided execution-based planning, where feasible coverage options are utilized to assess whether the LLM's reasoning is heading in the right direction. We enhance the traditional generative task with a retrieval-based framework over feasible code coverage options. Our experimental results show that CRISPE achieves high accuracy in coverage prediction in terms of both exact-match and statement-match coverage metrics, improving over the baselines. We also show that with semantic guidance and dynamic reasoning from CRISPE, the LLM generates more correct planning steps. To demonstrate CRISPE’s usefulness, we used it in the downstream task of statically detecting runtime error(s) in incomplete code snippets with the given inputs. ![]() |
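For intuition, one plausible way to compute the two coverage metrics mentioned above is sketched below: exact-match compares whole covered-line sets, while statement-match scores per-line agreement. The paper's precise definitions may differ.

```python
# Illustrative formulations of exact-match and statement-match coverage scores.
def exact_match(predicted: set[int], actual: set[int]) -> bool:
    return predicted == actual

def statement_match(predicted: set[int], actual: set[int], all_lines: set[int]) -> float:
    agree = sum(1 for ln in all_lines if (ln in predicted) == (ln in actual))
    return agree / len(all_lines) if all_lines else 1.0

# Example: a 6-line snippet where the model misses one covered line.
lines = set(range(1, 7))
truth = {1, 2, 3, 5}
pred = {1, 2, 3}
print(exact_match(pred, truth))             # False
print(statement_match(pred, truth, lines))  # 0.833...
```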
|
Pawagi, Mrigank |
![]() Ali Reza Ibrahimzada, Kaiyao Ke, Mrigank Pawagi, Muhammad Salman Abid, Rangeet Pan, Saurabh Sinha, and Reyhaneh Jabbarvand (University of Illinois at Urbana-Champaign, USA; Indian Institute of Science, India; Cornell University, USA; IBM Research, USA) ![]() |
|
Payer, Mathias |
![]() Flavio Toffalini, Nicolas Badoux, Zurab Tsinadze, and Mathias Payer (EPFL, Switzerland; Ruhr University Bochum, Germany) ![]() ![]() Han Zheng, Flavio Toffalini, Marcel Böhme, and Mathias Payer (EPFL, Switzerland; Ruhr University Bochum, Germany; MPI-SP, Germany) Can a fuzzer cover more code with minimal corruption of the initial seed? Before a seed is fuzzed, the early greybox fuzzers first systematically enumerated slightly corrupted inputs by applying every mutation operator to every part of the seed, once per generated input. The hope of this so-called “deterministic” stage was that simple changes to the seed would be less likely to break the complex file format; the resulting inputs would find bugs in the program logic well beyond the program’s parser. However, when experiments showed that disabling the deterministic stage achieves more coverage, applying multiple mutation operators at the same time to a single input, most fuzzers disabled the deterministic stage by default. Instead of ignoring the deterministic stage, we analyze its potential and substantially improve deterministic exploration. Our deterministic stage is now the default in AFL++, reverting the earlier decision of dropping deterministic exploration. We start by investigating the overhead and the contribution of the deterministic stage to the discovery of coverage-increasing inputs. While the sheer number of generated inputs explains the overhead, we find that only a few critical seeds (20%), and only a few critical bytes in a seed (0.5%) are responsible for the vast majority of the coverage-increasing inputs (83% and 84%, respectively). Hence, we develop an efficient technique, called , to identify these critical seeds / bytes so as to prune a large number of unnecessary inputs. retains the benefits of the classic deterministic stage by only enumerating a tiny part of the total deterministic state space. We evaluate implementation on two benchmarking frameworks, FuzzBench and Magma. Our evaluation shows that outperforms state-of-the-art fuzzers with and without the (old) deterministic stage enabled, both in terms of coverage and bug finding. also discovered 8 new CVEs on exhaustively fuzzed security-critical applications. Finally, has been independently evaluated and integrated into AFL++ as default option. ![]() |
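To make the "deterministic stage" discussed above concrete, the sketch below enumerates single-change mutants of a seed, applying each mutation operator to every position exactly once; the operators are simplified stand-ins, not AFL++'s actual mutation set.

```python
# Simplified deterministic stage: every generated input differs from the seed
# by exactly one small change, which keeps complex file formats mostly intact.
from typing import Iterator

def flip_bit(seed: bytes, pos: int, bit: int) -> bytes:
    out = bytearray(seed)
    out[pos] ^= 1 << bit
    return bytes(out)

def add_one(seed: bytes, pos: int) -> bytes:
    out = bytearray(seed)
    out[pos] = (out[pos] + 1) % 256
    return bytes(out)

def deterministic_inputs(seed: bytes) -> Iterator[bytes]:
    for pos in range(len(seed)):
        for bit in range(8):
            yield flip_bit(seed, pos, bit)   # walking bit flips
        yield add_one(seed, pos)             # simple arithmetic mutation

# Even for a 4-byte seed this enumerates 4 * (8 + 1) = 36 inputs, which
# illustrates why the overhead grows quickly for realistic seed sizes.
print(sum(1 for _ in deterministic_inputs(b"SEED")))  # 36
```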
|
Peng, Ting |
![]() Chaozheng Wang, Jia Feng, Shuzheng Gao, Cuiyun Gao, Zongjie Li, Ting Peng, Hailiang Huang, Yuetang Deng, and Michael Lyu (Chinese University of Hong Kong, China; University of Electronic Science and Technology of China, China; Hong Kong University of Science and Technology, China; Tencent, China) Large Code Models (LCMs) have demonstrated remarkable effectiveness across various code intelligence tasks. Supervised fine-tuning is essential to optimize their performance for specific downstream tasks. Compared with the traditional full-parameter fine-tuning (FFT) method, Parameter-Efficient Fine-Tuning (PEFT) methods can train LCMs with substantially reduced resource consumption and have gained widespread attention among researchers and practitioners. While existing studies have explored PEFT methods for code intelligence tasks, they have predominantly focused on a limited subset of scenarios, such as code generation with publicly available datasets, leading to constrained generalizability of the findings. To mitigate the limitation, we conduct a comprehensive study on exploring the effectiveness of the PEFT methods, which involves five code intelligence tasks containing both public and private data. Our extensive experiments reveal a considerable performance gap between PEFT methods and FFT, which is contrary to the findings of existing studies. We also find that this disparity is particularly pronounced in tasks involving private data. To improve the tuning performance for LCMs while reducing resource utilization during training, we propose a Layer-Wise Optimization (LWO) strategy in the paper. LWO incrementally updates the parameters of each layer in the whole model architecture, without introducing any additional component and inference overhead. Experiments across five LCMs and five code intelligence tasks demonstrate that LWO trains LCMs more effectively and efficiently compared to previous PEFT methods, with significant improvements in tasks using private data. For instance, in the line-level code completion task using our private code repositories, LWO outperforms the state-of-the-art LoRA method by 22% and 12% in terms of accuracy and BLEU scores, respectively. Furthermore, LWO can enable more efficient LCM tuning, reducing the training time by an average of 42.7% compared to LoRA. ![]() |
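The layer-wise idea behind LWO can be pictured in PyTorch roughly as follows: only one layer's parameters are unfrozen and updated at a time while the full model still performs the forward pass. This is an illustrative sketch under simplified assumptions (a purely sequential stack of layers and a generic training loop), not the authors' implementation.

```python
# Illustrative layer-wise tuning loop: freeze everything, then unfreeze and
# update one layer at a time, re-freezing it before moving to the next.
import torch
import torch.nn as nn

def layer_wise_tune(layers: nn.ModuleList, batches, loss_fn, lr=1e-4, steps_per_layer=100):
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad_(False)                 # start fully frozen
    for active in layers:                           # incrementally visit each layer
        params = list(active.parameters())
        if not params:
            continue                                # skip parameter-free layers
        for p in params:
            p.requires_grad_(True)
        opt = torch.optim.AdamW(params, lr=lr)
        for _, (x, y) in zip(range(steps_per_layer), batches):
            opt.zero_grad()
            out = x
            for layer in layers:                    # full forward pass through the stack
                out = layer(out)
            loss_fn(out, y).backward()              # gradients only reach the active layer
            opt.step()
        for p in params:
            p.requires_grad_(False)                 # re-freeze before moving on
```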
|
Peng, Xin |
![]() Junming Cao, Xuwen Xiang, Mingfei Cheng, Bihuan Chen, Xinyan Wang, You Lu, Chaofeng Sha, Xiaofei Xie, and Xin Peng (Fudan University, China; Singapore Management University, Singapore) ![]() ![]() Susheng Wu, Ruisi Wang, Yiheng Cao, Bihuan Chen, Zhuotong Zhou, Yiheng Huang, JunPeng Zhao, and Xin Peng (Fudan University, China) Branching repositories facilitates efficient software development but can also inadvertently propagate vulnerabilities. When an original branch is patched, other unfixed branches remain vulnerable unless the patch is successfully ported. However, due to inherent discrepancies between branches, many patches cannot be directly applied and require manual intervention, which is time-consuming and leads to delays in patch porting, increasing vulnerability risks. Existing automated patch porting approaches are prone to errors, as they often overlook essential semantic and syntactic context of the vulnerability and fail to detect or refine faulty patches. We propose Mystique, a novel LLM-based approach to address these limitations. Mystique first slices the semantic-related statements linked to the vulnerability while ensuring syntactic correctness, allowing it to extract the signatures for both the original patched function and the target vulnerable function. Mystique then utilizes a fine-tuned LLM to generate a fixed function, which is further iteratively checked and refined to ensure successful porting. Our evaluation shows that Mystique achieved a success rate of 0.954 at function level and of 0.924 at CVE level, outperforming state-of-the-art approaches by at least 13.2% at function level and 12.3% at CVE level. Our evaluation also demonstrates Mystique’s superior generality across various projects, bugs, and programming languages. Mystique successfully ported patches for 34 real-world vulnerable branches. ![]() ![]() ![]() Bingkun Sun, Shiqi Sun, Jialin Ren, Mingming Hu, Kun Hu, Liwei Shen, and Xin Peng (Fudan University, China; Northwestern Polytechnical University, China) The Web of Things (WoT) system standardizes the integration of ubiquitous IoT devices in physical environments, enabling various software applications to automatically sense and regulate the physical environment. While providing convenience, the complex interactions among software applications and the physical environment make the WoT system vulnerable to violations caused by improper actuator operations, which may cause undesirable or even harmful results, posing serious risks to user safety and security. In response to this critical concern, many previous efforts have been made. However, existing works primarily focus on analyzing software application behaviors, with insufficient consideration of the physical interactions, multi-source violations, and environmental dynamics in such ubiquitous software systems. As a result, they fail to comprehensively detect the impact of actuator operations on the dynamic environment, thus limiting their effectiveness. To address these limitations, we propose SysGuard, a violation detection and handling approach. SysGuard employs the dynamic probabilistic graphical model (DPGM) to model the physical interactions as the physical interaction graph (PIG). In the offline phase, SysGuard takes device description models and historical device logs as input to capture physical interactions by learning the PIG.
In this process, a large language model (LLM) based causal analysis method is further introduced to filter out the device dependencies unrelated to physical interaction by analyzing the device interaction scenarios recorded in device logs. In the online phase, SysGuard processes user-customized violation rules, monitors runtime device logs to predict violation states, and generates handling policies by inferring the PIG. Evaluation on two real-world WoT systems shows that SysGuard significantly outperforms existing state-of-the-art works, achieving high performance in both violation detection and handling. It also confirms the runtime efficiency and scalability of SysGuard. An ablation experiment on our constructed dataset demonstrates that the LLM-based causal analysis significantly improves the performance of SysGuard, with accuracy increasing in both violation detection and handling. ![]() ![]() |
|
Peng, Yun |
![]() Yun Peng, Jun Wan, Yichen Li, and Xiaoxue Ren (Chinese University of Hong Kong, China; Zhejiang University, China) Code generation has largely improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions given detailed descriptions in natural language. Many research efforts are being devoted to improving the correctness of LLM-generated code, and many benchmarks have been proposed to evaluate the correctness comprehensively. Despite the focus on correctness, the time efficiency of LLM-generated code solutions is under-explored. Current correctness benchmarks are not suitable for time efficiency evaluation since their test cases cannot effectively distinguish the time efficiency of different code solutions. Besides, the current execution time measurement is neither stable nor comprehensive, threatening the validity of the time efficiency evaluation. To address the challenges in the time efficiency evaluation of code generation, we propose COFFE, a code generation benchmark for evaluating the time efficiency of LLM-generated code solutions. COFFE contains 398 and 358 problems for function-level and file-level code generation, respectively. To improve the distinguishability, we design a novel stressful test case generation approach with contracts and two new formats of test cases to improve the accuracy of generation. For the time evaluation metric, we propose efficient@k based on CPU instruction count to ensure a stable and solid comparison between different solutions. We evaluate 14 popular LLMs on COFFE and identify four findings. Based on the findings, we draw some implications for LLM researchers and software practitioners to facilitate future research and usage of LLMs in code generation. ![]() |
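Because efficient@k above is grounded in CPU instruction counts rather than wall-clock time, a harness needs a way to obtain such counts; the sketch below uses Linux `perf stat` as one possible measurement backend. This assumes a Linux host with perf installed and is not COFFE's own harness.

```python
# Count retired CPU instructions for a command via `perf stat`; fewer
# instructions on the same stressful input means a more time-efficient solution,
# independent of machine load.
import subprocess

def instruction_count(cmd: list[str]) -> int:
    proc = subprocess.run(
        ["perf", "stat", "-e", "instructions", "-x", ","] + cmd,
        capture_output=True, text=True, check=True,
    )
    # perf writes CSV rows like "123456789,,instructions,..." to stderr.
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[2].startswith("instructions"):
            return int(fields[0])
    raise RuntimeError("no instruction counter found in perf output")

# Hypothetical usage: compare two candidate solutions on the same test input.
# print(instruction_count(["python3", "solution_a.py"]))
# print(instruction_count(["python3", "solution_b.py"]))
```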
|
Pradel, Michael |
![]() Aryaz Eghbali, Felix Burk, and Michael Pradel (University of Stuttgart, Germany) Python is a dynamic language with applications in many domains, and one of the most popular languages in recent years. To increase code quality, developers have turned to “linters” that statically analyze the source code and warn about potential programming problems. However, the inherent limitations of static analysis and the dynamic nature of Python make it difficult or even impossible for static linters to detect some problems. This paper presents DyLin, the first dynamic linter for Python. Similar to a traditional linter, the approach has an extensible set of checkers, which, unlike in traditional linters, search for specific programming anti-patterns by analyzing the program as it executes. A key contribution of this paper is a set of 15 Python-specific anti-patterns that are hard to find statically but amenable to dynamic analysis, along with corresponding checkers to detect them. Our evaluation applies DyLin to 37 popular open-source Python projects on GitHub and to a dataset of code submitted to Kaggle machine learning competitions, totaling over 683k lines of Python code. The approach reports a total of 68 problems, 48 of which are previously unknown true positives, i.e., a precision of 70.6%. The detected problems include bugs that cause incorrect values, such as inf, incorrect behavior, e.g., missing out on relevant events, unnecessary computations that slow down the program, and unintended data leakage from test data to the training phase of machine learning pipelines. These issues remained unnoticed in public repositories for more than 3.5 years, on average, despite the fact that the corresponding code has been exercised by the developer-written tests. A comparison with popular static linters and a type checker shows that DyLin complements these tools by detecting problems that are missed statically. Based on our reporting of 42 issues to the developers, 31 issues have so far been fixed. ![]() ![]() Huimin Hu, Yingying Wang, Julia Rubin, and Michael Pradel (University of Stuttgart, Germany; University of British Columbia, Canada) Scalable static analyzers are popular tools for finding incorrect, inefficient, insecure, and hard-to-maintain code early during the development process. Because not all warnings reported by a static analyzer are immediately useful to developers, many static analyzers provide a way to suppress warnings, e.g., in the form of special comments added into the code. Such suppressions are an important mechanism at the interface between static analyzers and software developers, but little is currently known about them. This paper presents the first in-depth empirical study of suppressions of static analysis warnings, addressing questions about the prevalence of suppressions, their evolution over time, the relationship between suppressions and warnings, and the reasons for using suppressions. We answer these questions by studying projects written in three popular languages and suppressions for warnings by four popular static analyzers. 
Our findings show that (i) suppressions are relatively common, e.g., with a total of 7,357 suppressions in 46 Python projects, (ii) the number of suppressions in a project tends to continuously increase over time, (iii) surprisingly, 50.8% of all suppressions do not affect any warning and hence are practically useless, (iv) some suppressions, including useless ones, may unintentionally hide future warnings, and (v) common reasons for introducing suppressions include false positives, suboptimal configurations of the static analyzer, and misleading warning messages. These results have actionable implications, e.g., that developers should be made aware of useless suppressions and the potential risk of unintentional suppressing, that static analyzers should provide better warning messages, and that static analyzers should separately categorize warnings from third-party libraries. ![]() ![]() Lars Gröninger, Beatriz Souza, and Michael Pradel (University of Stuttgart, Germany) Code changes are an integral part of the software development process. Many code changes are meant to improve the code without changing its functional behavior, e.g., refactorings and performance improvements. Unfortunately, validating whether a code change preserves the behavior is non-trivial, particularly when the code change is performed deep inside a complex project. This paper presents ChangeGuard, an approach that uses learning-guided execution to compare the runtime behavior of a modified function. The approach is enabled by the novel concept of pairwise learning-guided execution and by a set of techniques that improve the robustness and coverage of the state-of-the-art learning-guided execution technique. Our evaluation applies ChangeGuard to a dataset of 224 manually annotated code changes from popular Python open-source projects and to three datasets of code changes obtained by applying automated code transformations. Our results show that the approach identifies semantics-changing code changes with a precision of 77.1% and a recall of 69.5%, and that it detects unexpected behavioral changes introduced by automatic code refactoring tools. In contrast, the existing regression tests of the analyzed projects miss the vast majority of semantics-changing code changes, with a recall of only 7.6%. We envision our approach being useful for detecting unintended behavioral changes early in the development process and for improving the quality of automated code transformations. ![]() |
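For readers unfamiliar with the suppression mechanism examined in the warning-suppression study above, these are typical Python suppression comments for flake8, pylint, and mypy, including one that is "useless" in the study's sense:

```python
# Each comment silences one analyzer at one location instead of fixing the warning.
import os, sys  # noqa: E401        # flake8: allow multiple imports on one line

password = os.environ.get("DB_PASSWORD")  # pylint: disable=invalid-name

def parse(raw):  # type: ignore[no-untyped-def]
    return raw.strip()

# A suppression that is "useless" in the study's sense: E501 (line too long)
# does not fire on this short line, so the comment silences nothing today but
# would hide a future line-length warning here.
retry_limit = 3  # noqa: E501
```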
|
Qiang, Weizhong |
![]() Junyao Ye, Zhen Li, Xi Tang, Deqing Zou, Shouhuai Xu, Weizhong Qiang, and Hai Jin (Huazhong University of Science and Technology, China; University of Colorado at Colorado Springs, USA) ![]() |
|
Qiao, Ao |
![]() Shuwei Song, Ting Chen, Ao Qiao, Xiapu Luo, Leqing Wang, Zheyuan He, Ting Wang, Xiaodong Lin, Peng He, Wensheng Zhang, and Xiaosong Zhang (University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada) Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them. ![]() |
|
Qiao, Yitong |
![]() Weibin Wu, Haoxuan Hu, Zhaoji Fan, Yitong Qiao, Yizhan Huang, Yichen Li, Zibin Zheng, and Michael Lyu (Sun Yat-sen University, China; Chinese University of Hong Kong, China) Deep learning (DL) has revolutionized various software engineering tasks. Particularly, the emergence of AI code generators has pushed the boundaries of automatic programming to synthesize entire programs based on user-defined specifications in natural language. However, it remains unclear whether these AI code generators rely on the copy-and-paste programming practice, raising code clone concerns. In this work, we conduct an empirical study on three state-of-the-art commercial AI code generators to comprehensively investigate the existence of all types of clones, which remains underexplored. Our experimental results show that the total Type-1 and Type-2 clone rates of the state-of-the-art commercial AI code generators can reach up to 7.50%, indicating marked code clone issues. Furthermore, we observe that AI code generators risk infringing copyrights and propagating buggy and vulnerable code as a result of code cloning, and that they exhibit a certain degree of stability in generating code clones. ![]() |
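As a reminder of what Type-1 and Type-2 clones mean in the study above, the sketch below compares two Python snippets at the token level: Type-1 ignores whitespace and comments, Type-2 additionally abstracts identifiers and literals. Real clone detectors are considerably more sophisticated; this is only an illustration.

```python
# Token-level illustration of Type-1 and Type-2 clone checks.
import io
import keyword
import tokenize

SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}

def tokens(src: str, normalize: bool = False) -> list[str]:
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type in SKIP:
            continue
        if normalize and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")                 # abstract identifiers
        elif normalize and tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")                # abstract literals
        else:
            out.append(tok.string)
    return out

def is_type1(a: str, b: str) -> bool:
    return tokens(a) == tokens(b)

def is_type2(a: str, b: str) -> bool:
    return tokens(a, normalize=True) == tokens(b, normalize=True)

a = "def total(xs):\n    return sum(xs) + 1\n"
b = "def add_up(values):\n    return sum(values) + 2\n"
print(is_type1(a, b), is_type2(a, b))  # False True
```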
|
Qin, Mian |
![]() Mian Qin, Yuxia Zhang, Klaas-Jan Stol, and Hui Liu (Beijing Institute of Technology, China; University College Cork, Ireland; Lero, Ireland; SINTEF, Norway) Open Source Software (OSS) projects are no longer only developed by volunteers. Instead, many organizations, from early-stage startups to large global enterprises, actively participate in many well-known projects. The survival and success of OSS projects rely on long-term contributors, who have extensive experience and knowledge. While prior literature has explored volunteer turnover in OSS, there is a paucity of research on company turnover in OSS ecosystems. Given the intensive involvement of companies in OSS and the different nature of corporate contributors vis-a-vis volunteers, it is important to investigate company turnover in OSS projects. This study first explores the prevalence and characteristics of companies that discontinue contributing to OSS projects, and then develops models to predict companies’ turnover. Based on a study of the Linux kernel, we analyze the early-stage behavior of 1,322 companies that have contributed to the project. We find that approximately 12% of companies discontinue contributing each year; one-sixth of those used to be core contributing companies (those that ranked in the top 20% by commit volume). Furthermore, withdrawing companies tend to have a lower intensity and scope of contributions, make primarily perfective changes, collaborate less, and operate on a smaller scale. We propose a Temporal Convolutional Network (TCN) deep learning model based on these indicators to predict whether companies will discontinue. The evaluation results show that the model achieves an AUC metric of .76 and an accuracy of .71. We evaluated the model in two other OSS projects, Rust and OpenStack, and the performance remains stable. ![]() |
|
Qiu, Yuxin |
![]() Jiyuan Wang, Yuxin Qiu, Ben Limpanukorn, Hong Jin Kang, Qian Zhang, and Miryung Kim (University of California at Los Angeles, USA; University of California at Riverside, USA) In recent years, the MLIR framework has had explosive growth due to the need for extensible deep learning compilers for hardware accelerators. Such examples include Triton, CIRCT, and ONNX-MLIR. MLIR compilers introduce significant complexities in localizing bugs or inefficiencies because of their layered optimization and transformation process with compilation passes. While existing delta debugging techniques can be used to identify a minimum subset of IR code that reproduces a given bug symptom, their naive application to MLIR is time-consuming because real-world MLIR compilers usually involve a large number of compilation passes. Compiler developers must identify a minimized set of relevant compilation passes to reduce the footprint of MLIR compiler code to be inspected for a bug fix. We propose DuoReduce, a dual-dimensional reduction approach for MLIR bug localization. DuoReduce leverages three key ideas in tandem to design an efficient MLIR delta debugger. First, DuoReduce reduces compiler passes that are irrelevant to the bug by identifying ordering dependencies among the different compilation passes. Second, DuoReduce uses MLIR-semantics-aware transformations to expedite IR code reduction. Finally, DuoReduce leverages cross-dependence between the IR code dimension and the compilation pass dimension by accounting for which IR code segments are related to which compilation passes to reduce unused passes. Experiments with three large-scale MLIR compiler projects find that DuoReduce outperforms syntax-aware reducers such as Perses and Vulcan in terms of IR code reduction by 31.6% and 21.5% respectively. If one uses these reducers by enumerating all possible compilation passes (on average 18 passes), it could take up to 145 hours. By identifying ordering dependencies among compilation passes, DuoReduce reduces this time to 9.5 minutes. By identifying which compilation passes are unused for compiling reduced IR code, DuoReduce reduces the number of passes by 14.6%. This translates to not needing to examine 281 lines of MLIR compiler code on average to fix the bugs. DuoReduce has the potential to significantly reduce debugging effort in MLIR compilers, which serves as the foundation for the current landscape of machine learning and hardware accelerators. ![]() |
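The pass-dimension reduction described above can be pictured as a delta-debugging-style loop over the pass pipeline; the greedy sketch below conveys the idea (DuoReduce itself also reduces the IR and exploits ordering dependencies among passes). The `still_fails` predicate is a stand-in for re-running the compiler and checking whether the bug symptom persists.

```python
# Greedy chunk-removal reducer over a list of compilation passes.
from typing import Callable, Sequence

def reduce_passes(passes: Sequence[str],
                  still_fails: Callable[[Sequence[str]], bool]) -> list[str]:
    current = list(passes)
    chunk = max(1, len(current) // 2)
    while chunk >= 1:
        i, shrunk = 0, False
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]   # drop one chunk
            if candidate and still_fails(candidate):
                current, shrunk = candidate, True           # keep the smaller pipeline
            else:
                i += chunk                                   # chunk is needed, move on
        if not shrunk:
            chunk //= 2                                      # try finer-grained removal
    return current

# Toy usage: the bug only needs the "canonicalize" and "cse" passes.
pipeline = ["inline", "canonicalize", "loop-unroll", "cse", "mem2reg", "dce"]
print(reduce_passes(pipeline, lambda ps: {"canonicalize", "cse"} <= set(ps)))
# ['canonicalize', 'cse']
```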
|
Rahaman, Sazzadur |
![]() Tanner Finken, Jesse Chen, and Sazzadur Rahaman (University of Arizona, USA) ![]() |
|
Rahe, Christian |
![]() Christian Rahe and Walid Maalej (University of Hamburg, Germany) Programming students have widespread access to powerful Generative AI tools like ChatGPT. While this can help them understand the learning material and assist with exercises, educators are voicing more and more concerns about an overreliance on generated outputs and a lack of critical thinking skills. It is thus important to understand how students actually use generative AI and what impact this could have on their learning behavior. To this end, we conducted a study including an exploratory experiment with 37 programming students, giving them monitored access to ChatGPT while solving a code authoring exercise. The task was not directly solvable by ChatGPT and required code comprehension and reasoning. While only 23 of the students actually opted to use the chatbot, the majority of those eventually prompted it to simply generate a full solution. We observed two prevalent usage strategies: to seek knowledge about general concepts and to directly generate solutions. Instead of using the bot to comprehend the code and their own mistakes, students often got trapped in a vicious cycle of submitting wrong generated code and then asking the bot for a fix. Those who self-reported using generative AI regularly were more likely to prompt the bot to generate a solution. Our findings indicate that concerns about a potential decrease in programmers' agency and productivity with Generative AI are justified. We discuss how researchers and educators can respond to the potential risk of students uncritically over-relying on Generative AI. We also discuss potential modifications to our study design for large-scale replications. ![]() |
|
Rajan, Hridesh |
![]() Sayem Mohammad Imtiaz, Astha Singh, Fraol Batole, and Hridesh Rajan (Iowa State University, USA; Tulane University, USA) Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity, harmful responses, and factual inaccuracies. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, drawing inspiration from fault localization via program slicing, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model’s most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model’s overall versatility by altering a smaller portion of the model. Furthermore, dynamic selection allows for a more nuanced and precise model repair compared to a fixed selection strategy. We evaluated our technique on three models from the GPT2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair. ![]() |
|
Ran, Dezhi |
![]() Dezhi Ran, Yuan Cao, Yuzhe Guo, Yuetong Li, Mengzhou Wu, Simin Chen, Wei Yang, and Tao Xie (Peking University, China; Beijing Jiaotong University, China; University of Chicago, USA; University of Texas at Dallas, USA) ![]() |
|
Rani, Pooja |
![]() Valentin Bourcier, Pooja Rani, Maximilian Ignacio Willembrinck Santander, Alberto Bacchelli, and Steven Costiou (University of Lille - Inria - CNRS - Centrale Lille - UMR 9189 CRIStAL, France; University of Zurich, Switzerland) Debugging consists in understanding the behavior of a program to identify and correct its defects. Breakpoints are the most commonly used debugging tool and aim to facilitate the debugging process by allowing developers to interrupt a program’s execution at a source code location of their choice and inspect the state of the program. Researchers suggest that in systems developed using object-oriented programming (OOP), traditional breakpoints may not be an effective method for debugging. In OOP, developers create code in classes, which at runtime are instantiated as objects—entities with their own state and behavior that can interact with one another. Traditional breakpoints are set within the class code, halting execution for every object that shares that class’s code. This leads to unnecessary interruptions for developers who are focused on monitoring the behavior of a specific object. As an answer to this challenge, researchers proposed object-centric debugging, an approach based on debugging tools that focus on objects rather than classes. In particular, using object-centric breakpoints, developers can select specific objects (rather than classes) for which the execution must be interrupted. Even though it seems reasonable that this approach may ease the debugging process by reducing the time and actions needed for debugging objects, no research has yet verified its actual impact. To investigate the impact of object-centric breakpoints on the debugging process, we devised and conducted a controlled experiment with 81 developers who spent an average of 1 hour and 30 minutes each on the study. The experiment required participants to complete two debugging tasks using debugging tools with vs. without object-centric breakpoints. We found no significant effect from the use of object-centric breakpoints on the number of actions required to debug or the effectiveness in understanding or fixing the bug. However, for one of the two tasks, we measured a statistically significant reduction in debugging time for participants who used object-centric breakpoints, while for the other task, there was a statistically significant increase. Our analysis suggests that the impact of object-centric breakpoints varies depending on the context and the specific nature of the bug being addressed. In particular, our analysis indicates that object-centric breakpoints can speed up the process of locating the root cause of a bug when the bug can be replicated without needing to restart the program. We discuss the implications of these findings for debugging practices and future research. ![]() ![]() |
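The study's tooling lives inside an object-centric debugger, but the core idea, halting only when a method runs on one watched instance rather than on every instance of the class, can be approximated in plain Python as a rough illustration:

```python
# Rough approximation of an object-centric breakpoint: the debugger pauses only
# for the watched instance, while other instances run through undisturbed.
import pdb

def break_on(target):
    """Decorate a method so execution pauses only for `target`."""
    def decorator(method):
        def wrapper(self, *args, **kwargs):
            if self is target:
                pdb.set_trace()      # a traditional breakpoint would stop for every instance
            return method(self, *args, **kwargs)
        return wrapper
    return decorator

class Account:
    def __init__(self, owner):
        self.owner, self.balance = owner, 0
    def deposit(self, amount):
        self.balance += amount

suspicious = Account("mallory")
Account.deposit = break_on(suspicious)(Account.deposit)  # watch only this instance
Account("alice").deposit(10)   # runs through without stopping
```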
|
Rao, Nikitha |
![]() Jenny T. Liang, Melissa Lin, Nikitha Rao, and Brad Myers (Carnegie Mellon University, USA) ![]() |
|
Ren, Jialin |
![]() Bingkun Sun, Shiqi Sun, Jialin Ren, Mingming Hu, Kun Hu, Liwei Shen, and Xin Peng (Fudan University, China; Northwestern Polytechnical University, China) The Web of Things (WoT) system standardizes the integration of ubiquitous IoT devices in physical environments, enabling various software applications to automatically sense and regulate the physical environment. While providing convenience, the complex interactions among software applications and the physical environment make the WoT system vulnerable to violations caused by improper actuator operations, which may cause undesirable or even harmful results, posing serious risks to user safety and security. In response to this critical concern, many previous efforts have been made. However, existing works primarily focus on analyzing software application behaviors, with insufficient consideration of the physical interactions, multi-source violations, and environmental dynamics in such ubiquitous software systems. As a result, they fail to comprehensively detect the impact of actuator operations on the dynamic environment, thus limiting their effectiveness. To address these limitations, we propose SysGuard, a violation detection and handling approach. SysGuard employs the dynamic probabilistic graphical model (DPGM) to model the physical interactions as the physical interaction graph (PIG). In the offline phase, SysGuard takes device description models and historical device logs as input to capture physical interactions by learning the PIG. In this process, a large language model (LLM) based causal analysis method is further introduced to filter out the device dependencies unrelated to physical interaction by analyzing the device interaction scenarios recorded in device logs. In the online phase, SysGuard processes user-customized violation rules, monitors runtime device logs to predict violation states, and generates handling policies by inferring the PIG. Evaluation on two real-world WoT systems shows that SysGuard significantly outperforms existing state-of-the-art works, achieving high performance in both violation detection and handling. It also confirms the runtime efficiency and scalability of SysGuard. An ablation experiment on our constructed dataset demonstrates that the LLM-based causal analysis significantly improves the performance of SysGuard, with accuracy increasing in both violation detection and handling. ![]() ![]() |
|
Ren, Kui |
![]() Shoupeng Ren, Lipeng He, Tianyu Tu, Di Wu, Jian Liu, Kui Ren, and Chun Chen (Zhejiang University, China; University of Waterloo, Canada) ![]() ![]() Yuan Li, Peisen Yao, Kan Yu, Chengpeng Wang, Yaoyang Ye, Song Li, Meng Luo, Yepang Liu, and Kui Ren (Zhejiang University, China; Ant Group, China; Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) ![]() |
|
Ren, Mengfei |
![]() Xiaolei Ren, Mengfei Ren, Yu Lei, and Jiang Ming (Macau University of Science and Technology, China; University of Alabama in Huntsville, USA; University of Texas at Arlington, USA; Tulane University, USA) ![]() |
|
Ren, Shoupeng |
![]() Shoupeng Ren, Lipeng He, Tianyu Tu, Di Wu, Jian Liu, Kui Ren, and Chun Chen (Zhejiang University, China; University of Waterloo, Canada) ![]() |
|
Ren, Xiaolei |
![]() Xiaolei Ren, Mengfei Ren, Yu Lei, and Jiang Ming (Macau University of Science and Technology, China; University of Alabama in Huntsville, USA; University of Texas at Arlington, USA; Tulane University, USA) ![]() |
|
Ren, Xiaoxue |
![]() Yun Peng, Jun Wan, Yichen Li, and Xiaoxue Ren (Chinese University of Hong Kong, China; Zhejiang University, China) Code generation has largely improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions given detailed descriptions in natural language. Many research efforts are being devoted to improving the correctness of LLM-generated code, and many benchmarks have been proposed to evaluate the correctness comprehensively. Despite the focus on correctness, the time efficiency of LLM-generated code solutions is under-explored. Current correctness benchmarks are not suitable for time efficiency evaluation since their test cases cannot effectively distinguish the time efficiency of different code solutions. Besides, the current execution time measurement is neither stable nor comprehensive, threatening the validity of the time efficiency evaluation. To address the challenges in the time efficiency evaluation of code generation, we propose COFFE, a code generation benchmark for evaluating the time efficiency of LLM-generated code solutions. COFFE contains 398 and 358 problems for function-level and file-level code generation, respectively. To improve the distinguishability, we design a novel stressful test case generation approach with contracts and two new formats of test cases to improve the accuracy of generation. For the time evaluation metric, we propose efficient@k based on CPU instruction count to ensure a stable and solid comparison between different solutions. We evaluate 14 popular LLMs on COFFE and identify four findings. Based on the findings, we draw some implications for LLM researchers and software practitioners to facilitate future research and usage of LLMs in code generation. ![]() |
|
Rong, Xiaokai |
![]() Yi Li, Hridya Dhulipala, Aashish Yadavally, Xiaokai Rong, Shaohua Wang, and Tien N. Nguyen (University of Texas at Dallas, USA; Central University of Finance and Economics, China) Although Large Language Models (LLMs) are highly proficient in understanding source code and descriptive texts, they have limitations in reasoning on dynamic program behaviors, such as execution trace and code coverage prediction, and runtime error prediction, which usually require actual program execution. To advance the ability of LLMs in predicting dynamic behaviors, we leverage the strengths of both approaches, Program Analysis (PA) and LLM, in building PredEx, a predictive executor for Python. Our principle is a blended analysis between PA and LLM, in which PA guides the LLM in predicting execution traces. We break down the task of predictive execution into smaller sub-tasks and leverage the deterministic nature when an execution order can be deterministically decided. When it is not certain, we use predictive backward slicing per variable, i.e., slicing the prior trace down to only the parts that affect each variable; handling each variable separately breaks the valuation prediction into significantly simpler problems. Our empirical evaluation on real-world datasets shows that PredEx achieves 31.5–47.1% relatively higher accuracy in predicting full execution traces than the state-of-the-art models. It also produces 8.6–53.7% more correct execution trace prefixes than those baselines. In predicting the next executed statements, its relative improvement over the baselines is 15.7–102.3%. Finally, we show PredEx’s usefulness in two tasks: static code coverage analysis and static prediction of run-time errors for (in)complete code. ![]() |
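The per-variable backward slicing step described above can be illustrated with a minimal dynamic slice over def/use sets; the trace format here is an assumption made for illustration, not PredEx's internal representation.

```python
# Minimal per-variable backward slice over an execution trace.
# Each trace entry is (line, defined_vars, used_vars).
def backward_slice(trace, var):
    """Keep only the trace entries that (transitively) affect `var`."""
    relevant, kept = {var}, []
    for line, defs, uses in reversed(trace):
        if defs & relevant:              # this step writes something we still care about
            kept.append((line, defs, uses))
            relevant = (relevant - defs) | uses
    return list(reversed(kept))

# 'total' only depends on lines 1, 3, and 4; line 2 can be ignored when
# predicting its value, which is what makes per-variable prediction simpler.
trace = [
    (1, {"x"}, set()),           # x = 3
    (2, {"log"}, {"x"}),         # log = f"x={x}"
    (3, {"y"}, {"x"}),           # y = x * 2
    (4, {"total"}, {"x", "y"}),  # total = x + y
]
print(backward_slice(trace, "total"))
```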
|
Ruberto, Stefano |
![]() Lam Nguyen Tung, Steven Cho, Xiaoning Du, Neelofar Neelofar, Valerio Terragni, Stefano Ruberto, and Aldeida Aleti (Monash University, Australia; University of Auckland, New Zealand; Royal Melbourne Institute of Technology, Australia; Joint Research Centre at the European Commission, Italy) ![]() |
|
Rubin, Julia |
![]() Huimin Hu, Yingying Wang, Julia Rubin, and Michael Pradel (University of Stuttgart, Germany; University of British Columbia, Canada) Scalable static analyzers are popular tools for finding incorrect, inefficient, insecure, and hard-to-maintain code early during the development process. Because not all warnings reported by a static analyzer are immediately useful to developers, many static analyzers provide a way to suppress warnings, e.g., in the form of special comments added into the code. Such suppressions are an important mechanism at the interface between static analyzers and software developers, but little is currently known about them. This paper presents the first in-depth empirical study of suppressions of static analysis warnings, addressing questions about the prevalence of suppressions, their evolution over time, the relationship between suppressions and warnings, and the reasons for using suppressions. We answer these questions by studying projects written in three popular languages and suppressions for warnings by four popular static analyzers. Our findings show that (i) suppressions are relatively common, e.g., with a total of 7,357 suppressions in 46 Python projects, (ii) the number of suppressions in a project tends to continuously increase over time, (iii) surprisingly, 50.8% of all suppressions do not affect any warning and hence are practically useless, (iv) some suppressions, including useless ones, may unintentionally hide future warnings, and (v) common reasons for introducing suppressions include false positives, suboptimal configurations of the static analyzer, and misleading warning messages. These results have actionable implications, e.g., that developers should be made aware of useless suppressions and the potential risk of unintentional suppressing, that static analyzers should provide better warning messages, and that static analyzers should separately categorize warnings from third-party libraries. ![]() |
|
Ryou, Yeonhee |
![]() Sujin Jang, Yeonhee Ryou, Heewon Lee, and Kihong Heo (KAIST, Republic of Korea) ![]() |
|
Ryu, Komei |
![]() Zhaoxu Zhang, Komei Ryu, Tingting Yu, and William G.J. Halfond (University of Southern California, USA; University of Connecticut, USA) Bug report reproduction is a crucial but time-consuming task to be carried out during mobile app maintenance. To accelerate this process, researchers have developed automated techniques for reproducing mobile app bug reports. However, due to the lack of an effective mechanism to recognize different buggy behaviors described in the report, existing work is limited to reproducing crash bug reports, or requires developers to manually analyze execution traces to determine if a bug was successfully reproduced. To address this limitation, we introduce a novel technique to automatically identify and extract the buggy behavior from the bug report and detect it during the automated reproduction process. To accommodate various buggy behaviors of mobile app bugs, we conducted an empirical study and created a standardized representation for expressing the bug behavior identified from our study. Given a report, our approach first transforms the documented buggy behavior into this standardized representation, then matches it against real-time device and UI information during the reproduction to recognize the bug. Our empirical evaluation demonstrated that our approach achieved over 90% precision and recall in generating the standardized representation of buggy behaviors. It correctly identified bugs in 83% of the bug reports and enhanced existing reproduction techniques, allowing them to reproduce four times more bug reports. ![]() |
|
Saad, Mootez |
![]() Mootez Saad, José Antonio Hernández López, Boqi Chen, Dániel Varró, and Tushar Sharma (Dalhousie University, Canada; University of Murcia, Spain; McGill University, Canada; Linköping University, Sweden) Language models of code have demonstrated remarkable performance across various software engineering and source code analysis tasks. However, their demanding computational resource requirements and consequential environmental footprint remain significant challenges. This work introduces ALPINE, an adaptive programming language-agnostic pruning technique designed to substantially reduce the computational overhead of these models. The proposed method offers a pluggable layer that can be integrated with all Transformer-based models. With ALPINE, input sequences undergo adaptive compression throughout the pipeline, reaching a size up to 3x smaller than their initial size, resulting in significantly reduced computational load. Our experiments on two software engineering tasks, defect prediction and code clone detection, across three language models (CodeBERT, GraphCodeBERT, and UniXCoder) show that ALPINE achieves up to a 50% reduction in FLOPs, a 58.1% decrease in memory footprint, and a 28.1% improvement in throughput on average. This led to a reduction in CO2 emissions by up to 44.85%. Importantly, it achieves a reduction in computation resources while maintaining up to 98.1% of the original predictive performance. These findings highlight the potential of ALPINE in making language models of code more resource-efficient and accessible while preserving their performance, contributing to the overall sustainability of their adoption in software development. Also, it sheds light on redundant and noisy information in source code analysis corpora, as shown by the substantial sequence compression achieved by ALPINE. ![]() |
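The adaptive sequence compression described above can be pictured as dropping low-importance tokens between Transformer layers; the NumPy sketch below uses mean received attention as an illustrative importance score and a fixed keep ratio, which are assumptions of this sketch rather than ALPINE's exact mechanism.

```python
# Keep only the highest-scoring tokens so later layers process a shorter sequence.
import numpy as np

def prune_tokens(hidden, attention, keep_ratio=0.5):
    """hidden: [seq, dim]; attention: [heads, seq, seq] weights from the previous layer."""
    importance = attention.mean(axis=(0, 1))        # mean attention each token receives
    importance[0] = np.inf                          # always keep the [CLS]-style first token
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])     # top-k tokens, original order preserved
    return hidden[keep], keep

rng = np.random.default_rng(0)
seq, dim, heads = 128, 768, 12
hidden = rng.standard_normal((seq, dim))
attn = rng.random((heads, seq, seq))
pruned, kept = prune_tokens(hidden, attn, keep_ratio=1 / 3)  # roughly the 3x compression reported
print(hidden.shape, "->", pruned.shape)             # (128, 768) -> (42, 768)
```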
|
Sadowski, Caitlin |
![]() Kathryn T. Stolee, Tobias Welp, Caitlin Sadowski, and Sebastian Elbaum (North Carolina State University, USA; Google, Germany; Unaffiliated, USA; University of Virginia, USA) Code search is an integral part of a developer’s workflow. In 2015, researchers published a paper reflecting on the code search practices at Google of 27 developers who used the internal Code Search tool. That paper had first-hand accounts for why those developers were using code search and highlighted how often and in what situations developers were searching for code. In the past decade, much has changed in the landscape of developer support. New languages have emerged, artificial intelligence (AI) for code generation has gained traction, auto-complete in the IDE has gotten better, Q&A forums have increased in popularity, and code repositories are larger than ever. It is worth considering whether those observations from almost a decade ago have stood the test of time. In this work, inspired by the prior survey about the Code Search tool, we run a series of three surveys with 1,945 total responses and report overall Code Search usage statistics for over 100,000 users. Unlike the prior work, in our surveys, we include explicit success criteria to understand when code search is meeting their needs, and when it is not. We dive further into two common sub-categories of code search effort: when its users are looking for examples and when they are using code search alongside code review. We find that Code Search users continue to use the tool frequently and the frequency has not changed despite the introduction of AI-enhanced development support. Users continue to turn to Code Search to find examples, but the frequency of example-seeking behavior has decreased. More often than before, users access the tool to learn about and explore code. This has implications for future Code Search support in software development. ![]() |
|
Sanan, David |
![]() Xiaokun Luan, David Sanan, Zhe Hou, Qiyuan Xu, Chengwei Liu, Yufan Cai, Yang Liu, and Meng Sun (Peking University, China; Singapore Institute of Technology, Singapore; Griffith University, Australia; Nanyang Technological University, Singapore; China-Singapore International Joint Research Institute, China; National University of Singapore, Singapore) Proof assistants are software tools for formal modeling and verification of software, hardware, design, and mathematical proofs. Due to the growing complexity and scale of formal proofs, compatibility issues frequently arise when using different versions of proof assistants. These issues result in broken proofs, disrupting the maintenance of formalized theories and hindering the broader dissemination of results within the community. Although existing works have proposed techniques to address specific types of compatibility issues, the overall characteristics of these issues remain largely unexplored. To address this gap, we conduct the first extensive empirical study to characterize compatibility issues, using Isabelle as a case study. We develop a regression testing framework to automatically collect compatibility issues from the Archive of Formal Proofs, the largest repository of formal proofs in Isabelle. By analyzing 12,079 collected issues, we identify their types and symptoms and further investigate their root causes. We also extract updated proofs that address these issues to understand the applied resolution strategies. Our study provides an in-depth understanding of compatibility issues in proof assistants, offering insights that support the development of effective techniques to mitigate these issues. ![]() ![]() |
|
Santander, Maximilian Ignacio Willembrinck |
![]() Valentin Bourcier, Pooja Rani, Maximilian Ignacio Willembrinck Santander, Alberto Bacchelli, and Steven Costiou (University of Lille - Inria - CNRS - Centrale Lille - UMR 9189 CRIStAL, France; University of Zurich, Switzerland) Debugging consists in understanding the behavior of a program to identify and correct its defects. Breakpoints are the most commonly used debugging tool and aim to facilitate the debugging process by allowing developers to interrupt a program’s execution at a source code location of their choice and inspect the state of the program. Researchers suggest that in systems developed using object-oriented programming (OOP), traditional breakpoints may not be an effective method for debugging. In OOP, developers create code in classes, which at runtime are instantiated as objects: entities with their own state and behavior that can interact with one another. Traditional breakpoints are set within the class code, halting execution for every object that shares that class’s code. This leads to unnecessary interruptions for developers who are focused on monitoring the behavior of a specific object. As an answer to this challenge, researchers proposed object-centric debugging, an approach based on debugging tools that focus on objects rather than classes. In particular, using object-centric breakpoints, developers can select specific objects (rather than classes) for which the execution must be interrupted. Even though it seems reasonable that this approach may ease the debugging process by reducing the time and actions needed for debugging objects, no research has yet verified its actual impact. To investigate the impact of object-centric breakpoints on the debugging process, we devised and conducted a controlled experiment with 81 developers who spent an average of 1 hour and 30 minutes each on the study. The experiment required participants to complete two debugging tasks using debugging tools with vs. without object-centric breakpoints. We found no significant effect from the use of object-centric breakpoints on the number of actions required to debug or the effectiveness in understanding or fixing the bug. However, for one of the two tasks, we measured a statistically significant reduction in debugging time for participants who used object-centric breakpoints, while for the other task, there was a statistically significant increase. Our analysis suggests that the impact of object-centric breakpoints varies depending on the context and the specific nature of the bug being addressed. In particular, our analysis indicates that object-centric breakpoints can speed up the process of locating the root cause of a bug when the bug can be replicated without needing to restart the program. We discuss the implications of these findings for debugging practices and future research. ![]() ![]() |
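The contrast between class-level and object-centric breakpoints can be illustrated with a tiny, hedged Python emulation: instead of halting for every instance of a class, the check fires only when one chosen object reaches the method. Real object-centric debuggers operate at the debugger level; the `watched_object` variable and the `Account` class below are purely illustrative.

```python
# Emulating an object-centric breakpoint with a conditional check on identity.
class Account:
    def __init__(self, owner):
        self.owner = owner
        self.balance = 0

    def deposit(self, amount):
        # A class-level breakpoint here would halt for *every* Account instance.
        if watched_object is not None and self is watched_object:
            print(f"[object-centric breakpoint] deposit() reached by {self.owner}")
            # breakpoint()  # in an interactive session, interrupt only for this object
        self.balance += amount

watched_object = None
a, b = Account("alice"), Account("bob")
watched_object = b   # focus debugging on this single instance
a.deposit(10)        # runs through without interruption
b.deposit(10)        # fires only for the watched object
```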
|
Sarker, Jaydeb |
![]() Jaydeb Sarker, Asif Kamal Turzo, and Amiangshu Bosu (University of Nebraska at Omaha, USA; Wayne State University, USA) Toxicity on GitHub can severely impact Open Source Software (OSS) development communities. To mitigate such behavior, a better understanding of its nature and how various measurable characteristics of project contexts and participants are associated with its prevalence is necessary. To achieve this goal, we conducted a large-scale mixed-method empirical study of 2,828 GitHub-based OSS projects randomly selected based on a stratified sampling strategy. Using ToxiCR, an SE domain-specific toxicity detector, we automatically classified each comment as toxic or non-toxic. Additionally, we manually analyzed a random sample of 600 comments to validate ToxiCR's performance and gain insights into the nature of toxicity within our dataset. The results of our study suggest that profanity is the most frequent toxicity on GitHub, followed by trolling and insults. While a project's popularity is positively associated with the prevalence of toxicity, its issue resolution rate has the opposite association. Corporate-sponsored projects are less toxic, but gaming projects are seven times more toxic than non-gaming ones. OSS contributors who have authored toxic comments in the past are significantly more likely to repeat such behavior. Moreover, such individuals are more likely to become targets of toxic texts. ![]() ![]() |
|
Schweikl, Sebastian |
![]() Sebastian Schweikl and Gordon Fraser (University of Passau, Germany) Programming is increasingly taught using dedicated block-based programming environments such as Scratch. While the use of blocks instead of text prevents syntax errors, learners can still make semantic mistakes implying a need for feedback and help. Since teachers may be overwhelmed by help requests in a classroom, may not have the required programming education themselves, and may simply not be available in independent learning scenarios, automated hint generation is desirable. Automated program repair can provide the foundation for automated hints, but relies on multiple assumptions: (1) Program repair usually aims to produce localized patches for fixing single bugs, but learners may fundamentally misunderstand programming concepts and tasks or request help for substantially incomplete programs. (2) Software tests are required to guide the search and to localize broken statements, but test suites for block-based programs are different to those considered in past research on fault localization and repair: They consist of system tests, where very few tests are sufficient to fully cover the code. At the same time, these tests have vastly longer execution times caused by the use of animations and interactions on Scratch programs, thus inhibiting the applicability of metaheuristic search. (3) The plastic surgery hypothesis assumes that the code necessary for repairs already exists in the codebase. Block-based programs tend to be small and may lack this necessary redundancy. In order to study whether automated program repair of block-based programs is nevertheless feasible, in this paper we introduce, to the best of our knowledge, the first automated program repair approach for Scratch programs based on evolutionary search. Our RePurr prototype includes novel refinements of fault localization to improve the lack of guidance of the test suites, recovers the plastic surgery hypothesis by exploiting that a learning scenario provides model and student solutions as alternatives, and uses parallelization and accelerated executions to reduce the costs of fitness evaluations. Empirical evaluation of RePurr on a set of real learners' programs confirms the anticipated challenges, but also demonstrates that the repair can nonetheless effectively improve and fix learners' programs, thus enabling automated generation of hints and feedback for learners. ![]() |
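A highly simplified sketch of the kind of evolutionary repair loop described above: a population of candidate patches is scored by test results, the fittest survive, and mutants of survivors refill the population. The `mutate` and `run_tests` callables, population size, and selection scheme are placeholders, not RePurr's actual operators.

```python
# Generic search-based repair loop (illustrative only).
import random

def repair(program, mutate, run_tests, population_size=20, generations=50):
    """mutate(program) -> candidate; run_tests(candidate) -> fraction of passing tests."""
    population = [mutate(program) for _ in range(population_size)]
    for _ in range(generations):
        scored = sorted(population, key=run_tests, reverse=True)
        if run_tests(scored[0]) == 1.0:
            return scored[0]                        # all tests pass: plausible repair
        survivors = scored[: population_size // 2]  # truncation selection
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(population_size - len(survivors))]
    return max(population, key=run_tests)           # best effort if no full fix found
```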
|
Sha, Chaofeng |
![]() Junming Cao, Xuwen Xiang, Mingfei Cheng, Bihuan Chen, Xinyan Wang, You Lu, Chaofeng Sha, Xiaofei Xie, and Xin Peng (Fudan University, China; Singapore Management University, Singapore) ![]() |
|
Sharma, Tushar |
![]() Mootez Saad, José Antonio Hernández López, Boqi Chen, Dániel Varró, and Tushar Sharma (Dalhousie University, Canada; University of Murcia, Spain; McGill University, Canada; Linköping University, Sweden) Language models of code have demonstrated remarkable performance across various software engineering and source code analysis tasks. However, their demanding computational resource requirements and consequential environmental footprint remain significant challenges. This work introduces ALPINE, an adaptive programming language-agnostic pruning technique designed to substantially reduce the computational overhead of these models. The proposed method offers a pluggable layer that can be integrated with all Transformer-based models. With ALPINE, input sequences undergo adaptive compression throughout the pipeline, reaching a size up to three times smaller than their initial size, resulting in significantly reduced computational load. Our experiments on two software engineering tasks, defect prediction and code clone detection, across three language models (CodeBERT, GraphCodeBERT, and UniXCoder) show that ALPINE achieves up to a 50% reduction in FLOPs, a 58.1% decrease in memory footprint, and a 28.1% improvement in throughput on average. This led to a reduction in CO2 emissions by up to 44.85%. Importantly, it achieves this reduction in computation resources while maintaining up to 98.1% of the original predictive performance. These findings highlight the potential of ALPINE in making language models of code more resource-efficient and accessible while preserving their performance, contributing to the overall sustainability of their adoption in software development. It also sheds light on redundant and noisy information in source code analysis corpora, as shown by the substantial sequence compression achieved by ALPINE. ![]() |
|
Sharmin, Shaila |
![]() Shaila Sharmin, Anwar Hossain Zahid, Subhankar Bhattacharjee, Chiamaka Igwilo, Miryung Kim, and Wei Le (Iowa State University, USA; University of California at Los Angeles, USA) ![]() |
|
Shen, Liwei |
![]() Bingkun Sun, Shiqi Sun, Jialin Ren, Mingming Hu, Kun Hu, Liwei Shen, and Xin Peng (Fudan University, China; Northwestern Polytechnical University, China) The Web of Things (WoT) system standardizes the integration of ubiquitous IoT devices in physical environments, enabling various software applications to automatically sense and regulate the physical environment. While providing convenience, the complex interactions among software applications and the physical environment make the WoT system vulnerable to violations caused by improper actuator operations, which may cause undesirable or even harmful results, posing serious risks to user safety and security. In response to this critical concern, many previous efforts have been made. However, existing works primarily focus on analyzing software application behaviors, with insufficient consideration of the physical interactions, multi-source violations, and environmental dynamics in such ubiquitous software systems. As a result, they fail to comprehensively detect the impact of actuator operations on the dynamic environment, thus limiting their effectiveness. To address these limitations, we propose SysGuard, a violation detection and handling approach. SysGuard employs the dynamic probabilistic graphical model (DPGM) to model the physical interactions as the physical interaction graph (PIG). In the offline phase, SysGuard takes device description models and historical device logs as input to capture physical interactions by learning the PIG. In this process, a large language model (LLM)-based causal analysis method is further introduced to filter out the device dependencies unrelated to physical interaction by analyzing the device interaction scenarios recorded in device logs. In the online phase, SysGuard processes user-customized violation rules, and monitors runtime device logs to predict violation states and generates handling policies by inferring the PIG. Evaluation on two real-world WoT systems shows that SysGuard significantly outperforms existing state-of-the-art works, achieving high performance in both violation detection and handling. It also confirms the runtime efficiency and scalability of SysGuard. An ablation experiment on our constructed dataset demonstrates that the LLM-based causal analysis significantly improves the performance of SysGuard, with accuracy increasing in both violation detection and handling. ![]() ![]() |
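As a rough illustration of the online rule-checking step, the sketch below evaluates user-customized violation rules against a predicted device/environment state. The rule format and the prediction step (which SysGuard derives from its physical interaction graph) are stand-ins, not the paper's actual interfaces.

```python
# Illustrative violation-rule check over a predicted state.
def check_violations(predicted_state: dict, rules: list) -> list:
    """Each rule is (name, predicate over the predicted state)."""
    return [name for name, predicate in rules if predicate(predicted_state)]

rules = [
    ("window-open-while-heating", lambda s: s["window"] == "open" and s["heater"] == "on"),
    ("temperature-too-high",      lambda s: s["temperature"] > 30),
]
predicted = {"window": "open", "heater": "on", "temperature": 26}
print(check_violations(predicted, rules))  # ['window-open-while-heating']
```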
|
Shi, Chenghang |
![]() Liqing Cao, Haofeng Li, Chenghang Shi, Jie Lu, Haining Meng, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) ![]() |
|
Shi, Jie |
![]() Jia Chen, Yuang He, Peng Wang, Xiaolei Chen, Jie Shi, and Wei Wang (Fudan University, China) ![]() |
|
Shi, Lin |
![]() Xin Tan, Xiao Long, Yinghao Zhu, Lin Shi, Xiaoli Lian, and Li Zhang (Beihang University, China) Onboarding newcomers is vital for the sustainability of open-source software (OSS) projects. To lower barriers and increase engagement, OSS projects have dedicated experts who provide guidance for newcomers. However, timely responses are often hindered by experts' busy schedules. The recent rapid advancements of AI in software engineering have brought opportunities to leverage AI as a substitute for expert mentoring. However, the potential role of AI as a comprehensive mentor throughout the entire onboarding process remains unexplored. To identify design strategies of this "AI mentor", we applied Design Fiction as a participatory method with 19 OSS newcomers. We investigated their current onboarding experience and elicited 32 design strategies for a future AI mentor. Participants envisioned the AI mentor being integrated into OSS platforms like GitHub, where it could offer assistance to newcomers, such as "recommending projects based on personalized requirements" and "assessing and categorizing project issues by difficulty". We also collected participants' perceptions of a prototype, named "OSSerCopilot", that implemented the envisioned strategies. They found the interface useful and user-friendly, showing a willingness to use it in the future, which suggests the design strategies are effective. Finally, in order to identify the gaps between our design strategies and current research, we conducted a comprehensive literature review, evaluating the extent of existing research support for this concept. We find that research is relatively scarce in certain areas where newcomers highly anticipate AI mentor assistance, such as "discovering an interested project". Our study has the potential to revolutionize the current newcomer-expert mentorship and provides valuable insights for researchers and tool designers aiming to develop and enhance AI mentor systems. ![]() |
|
Shu, Zhan |
![]() Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, and Lei Ma (University of Alberta, Canada; University of Tokyo, Japan) ![]() |
|
Siegmund, Janet |
![]() Max Weber, Alina Mailach, Sven Apel, Janet Siegmund, Raimund Dachselt, and Norbert Siegmund (Leipzig University, Germany; ScaDS.AI Dresden-Leipzig, Germany; Saarland Informatics Campus, Germany; Saarland University, Germany; Chemnitz University of Technology, Germany; Dresden University of Technology, Germany) Debugging performance bugs in configurable software systems is a complex and time-consuming task that requires not only fixing a bug, but also understanding its root cause. While there is a vast body of literature on debugging strategies, there is no consensus on general debugging. This makes it difficult to provide concrete guidance for developers, especially for configuration-dependent performance bugs. The goal of our work is to alleviate this situation by providing a framework to describe debugging strategies in a more general, unifying way. We conducted a user study with 12 professional developers who debugged a performance bug in a real-world configurable system. To observe developers in an unobtrusive way, we provided an immersive virtual reality tool, SoftVR, giving them a large degree of freedom to choose their preferred debugging strategy. The results show that the existing documentation of strategies is too coarse-grained and intermixed to identify successful approaches. In a subsequent qualitative analysis, we devised a coding framework to reason about debugging approaches. With this framework, we identified five goal-oriented episodes that developers employ, which they also confirmed in subsequent interviews. Our work provides a unified description of debugging strategies, giving researchers a common foundation for studying debugging, and practitioners and teachers guidance on successful debugging strategies. ![]() |
|
Siegmund, Norbert |
![]() Max Weber, Alina Mailach, Sven Apel, Janet Siegmund, Raimund Dachselt, and Norbert Siegmund (Leipzig University, Germany; ScaDS.AI Dresden-Leipzig, Germany; Saarland Informatics Campus, Germany; Saarland University, Germany; Chemnitz University of Technology, Germany; Dresden University of Technology, Germany) Debugging performance bugs in configurable software systems is a complex and time-consuming task that requires not only fixing a bug, but also understanding its root cause. While there is a vast body of literature on debugging strategies, there is no consensus on general debugging. This makes it difficult to provide concrete guidance for developers, especially for configuration-dependent performance bugs. The goal of our work is to alleviate this situation by providing a framework to describe debugging strategies in a more general, unifying way. We conducted a user study with 12 professional developers who debugged a performance bug in a real-world configurable system. To observe developers in an unobtrusive way, we provided an immersive virtual reality tool, SoftVR, giving them a large degree of freedom to choose their preferred debugging strategy. The results show that the existing documentation of strategies is too coarse-grained and intermixed to identify successful approaches. In a subsequent qualitative analysis, we devised a coding framework to reason about debugging approaches. With this framework, we identified five goal-oriented episodes that developers employ, which they also confirmed in subsequent interviews. Our work provides a unified description of debugging strategies, giving researchers a common foundation for studying debugging, and practitioners and teachers guidance on successful debugging strategies. ![]() |
|
Singh, Astha |
![]() Sayem Mohammad Imtiaz, Astha Singh, Fraol Batole, and Hridesh Rajan (Iowa State University, USA; Tulane University, USA) Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity, harmful responses, and factual inaccuracies. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, drawing inspiration from fault localization via program slicing, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model’s most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model’s overall versatility by altering a smaller portion of the model. Furthermore, dynamic selection allows for a more nuanced and precise model repair compared to a fixed selection strategy. We evaluated our technique on three models from the GPT2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair. ![]() |
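The layer-slicing idea can be sketched generically: rank layers by some sensitivity signal on an error-exposing batch and restrict updates to the top slice. The gradient-magnitude criterion and top fraction below are assumptions for illustration, not IRepair's actual dynamic slicing method.

```python
# Select the most sensitive parameters for targeted repair; freeze the rest (PyTorch).
import torch

def select_sensitive_layers(model, loss, top_fraction=0.2):
    """loss: scalar tensor computed on a batch that exposes the erroneous behavior."""
    loss.backward()  # gradients w.r.t. the error-exposing batch
    sensitivity = {
        name: p.grad.abs().mean().item()
        for name, p in model.named_parameters() if p.grad is not None
    }
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    selected = set(ranked[: max(1, int(len(ranked) * top_fraction))])
    for name, p in model.named_parameters():
        p.requires_grad = name in selected   # repair only the sensitive slice
    return selected
```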
|
Sinha, Saurabh |
![]() Ali Reza Ibrahimzada, Kaiyao Ke, Mrigank Pawagi, Muhammad Salman Abid, Rangeet Pan, Saurabh Sinha, and Reyhaneh Jabbarvand (University of Illinois at Urbana-Champaign, USA; Indian Institute of Science, India; Cornell University, USA; IBM Research, USA) ![]() ![]() Myeongsoo Kim, Saurabh Sinha, and Alessandro Orso (Georgia Institute of Technology, USA; IBM Research, USA) Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. The widespread adoption of this standard has resulted in the development of many black-box testing tools that generate tests based on OpenAPI specifications. Although Large Language Models (LLMs) have shown promising test-generation abilities, their application to REST API testing remains mostly unexplored. We present LlamaRestTest, a novel approach that employs two custom LLMs, created by fine-tuning and quantizing the Llama3-8B model using mined datasets of REST API example values and inter-parameter dependencies, to generate realistic test inputs and uncover inter-parameter dependencies during the testing process by analyzing server responses. We evaluated LlamaRestTest on 12 real-world services (including popular services such as Spotify), comparing it against RESTGPT, a GPT-powered specification-enhancement tool, as well as several state-of-the-art REST API testing tools, including RESTler, MoRest, EvoMaster, and ARAT-RL. Our results demonstrate that fine-tuning enables smaller models to outperform much larger models in detecting actionable parameter-dependency rules and generating valid inputs for REST API testing. We also evaluated different tool configurations, ranging from the base Llama3-8B model to fine-tuned versions, and explored multiple quantization techniques, including 2-bit, 4-bit, and 8-bit integer formats. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing, balancing effectiveness and efficiency. Furthermore, LlamaRestTest outperforms state-of-the-art REST API testing tools in code coverage achieved and internal server errors identified, even when those tools use RESTGPT-enhanced specifications. Finally, through an ablation study, we show that each component of LlamaRestTest contributes to its overall performance. ![]() |
|
Song, Jiayang |
![]() Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, and Lei Ma (University of Alberta, Canada; University of Tokyo, Japan) ![]() |
|
Song, Shuwei |
![]() Shuwei Song, Ting Chen, Ao Qiao, Xiapu Luo, Leqing Wang, Zheyuan He, Ting Wang, Xiaodong Lin, Peng He, Wensheng Zhang, and Xiaosong Zhang (University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada) Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them. ![]() |
|
Souza, Beatriz |
![]() Lars Gröninger, Beatriz Souza, and Michael Pradel (University of Stuttgart, Germany) Code changes are an integral part of the software development process. Many code changes are meant to improve the code without changing its functional behavior, e.g., refactorings and performance improvements. Unfortunately, validating whether a code change preserves the behavior is non-trivial, particularly when the code change is performed deep inside a complex project. This paper presents ChangeGuard, an approach that uses learning-guided execution to compare the runtime behavior of a modified function. The approach is enabled by the novel concept of pairwise learning-guided execution and by a set of techniques that improve the robustness and coverage of the state-of-the-art learning-guided execution technique. Our evaluation applies ChangeGuard to a dataset of 224 manually annotated code changes from popular Python open-source projects and to three datasets of code changes obtained by applying automated code transformations. Our results show that the approach identifies semantics-changing code changes with a precision of 77.1% and a recall of 69.5%, and that it detects unexpected behavioral changes introduced by automatic code refactoring tools. In contrast, the existing regression tests of the analyzed projects miss the vast majority of semantics-changing code changes, with a recall of only 7.6%. We envision our approach being useful for detecting unintended behavioral changes early in the development process and for improving the quality of automated code transformations. ![]() |
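The underlying differential idea can be shown with a small, hedged sketch: run the old and new versions of a function on the same inputs and flag any divergence in results or raised exceptions. ChangeGuard itself obtains such executions via learning-guided execution; here the inputs are simply given.

```python
# Differential check of a function before and after a code change (illustrative).
def behavior_changed(old_fn, new_fn, inputs) -> bool:
    for args in inputs:
        try:
            old_out = old_fn(*args)
        except Exception as e:
            old_out = ("raises", type(e).__name__)
        try:
            new_out = new_fn(*args)
        except Exception as e:
            new_out = ("raises", type(e).__name__)
        if old_out != new_out:
            return True        # semantics-changing code change detected
    return False               # no divergence observed on these inputs

# A "refactoring" that silently changes behavior for negative numbers:
print(behavior_changed(lambda x: abs(x) * 2, lambda x: x * 2, [(3,), (-1,)]))  # True
```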
|
Sovrano, Francesco |
![]() Francesco Sovrano, Adam Bauer, and Alberto Bacchelli (ETH Zurich, Switzerland; University of Zurich, Switzerland) Traditionally, software vulnerability detection research has focused on individual small functions due to earlier language processing technologies’ limitations in handling larger inputs. However, this function-level approach may miss bugs that span multiple functions and code blocks. Recent advancements in artificial intelligence have enabled processing of larger inputs, leading everyday software developers to increasingly rely on chat-based large language models (LLMs) like GPT-3.5 and GPT-4 to detect vulnerabilities across entire files, not just within functions. This new development practice requires researchers to urgently investigate whether commonly used LLMs can effectively analyze large file-sized inputs, in order to provide timely insights for software developers and engineers about the pros and cons of this emerging technological trend. Hence, the goal of this paper is to evaluate the effectiveness of several state-of-the-art chat-based LLMs, including the GPT models, in detecting in-file vulnerabilities. We conducted a costly investigation into how the performance of LLMs varies based on vulnerability type, input size, and vulnerability location within the file. To give enough statistical power (β ≥ .8) to our study, we could only focus on the three most common (as well as dangerous) vulnerabilities: XSS, SQL injection, and path traversal. Our findings indicate that the effectiveness of LLMs in detecting these vulnerabilities is strongly influenced by both the location of the vulnerability and the overall size of the input. Specifically, regardless of the vulnerability type, LLMs tend to significantly (p < .05) underperform when detecting vulnerabilities located toward the end of larger files—a pattern we call the lost-in-the-end effect. Finally, to further support software developers and practitioners, we also explored the optimal input size for these LLMs and presented a simple strategy for identifying it, which can be applied to other models and vulnerability types. Eventually, we show how adjusting the input size can lead to significant improvements in LLM-based vulnerability detection, with an average recall increase of over 37% across all models. ![]() ![]() |
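A simple way to act on the "lost-in-the-end" finding is to scan large files in chunks no larger than a chosen optimal size, with some overlap so issues near chunk borders are not missed. The chunk size, overlap, and the `ask_llm` callable below are placeholders, not the paper's exact strategy.

```python
# Chunked LLM-based vulnerability scanning of a large source file (illustrative).
def scan_file_in_chunks(source: str, ask_llm, max_lines: int = 300, overlap: int = 30):
    lines = source.splitlines()
    findings, start = [], 0
    while start < len(lines):
        chunk = "\n".join(lines[start:start + max_lines])
        findings.append(ask_llm(f"Report any XSS, SQLi, or path traversal in:\n{chunk}"))
        start += max_lines - overlap   # overlap keeps border-spanning code in view
    return findings
```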
|
Sridharan, Manu |
![]() Nima Karimipour, Erfan Arvan, Martin Kellogg, and Manu Sridharan (University of California at Riverside, USA; New Jersey Institute of Technology, USA) Null-pointer exceptions are serious problem for Java, and researchers have developed type-based nullness checking tools to prevent them. These tools, however, have a downside: they require developers to write nullability annotations, which is time-consuming and hinders adoption. Researchers have therefore proposed nullability annotation inference tools, whose goal is to (partially) automate the task of annotating a program for nullability. However, prior works rely on differing theories of what makes a set of nullability annotations good, making comparing their effectiveness challenging. In this work, we identify a systematic bias in some prior experimental evaluation of these tools: the use of “type reconstruction” experiments to see if a tool can recover erased developer-written annotations. We show that developers make semantic code changes while adding annotations to facilitate typechecking, leading such experiments to overestimate the effectiveness of inference tools on never-annotated code. We propose a new definition of the “best” inferred annotations for a program that avoids this bias, based on a systematic exploration of the design space. With this new definition, we perform the first head-to-head comparison of three extant nullability inference tools. Our evaluation showed the complementary strengths of the tools and remaining weaknesses that could be addressed in future work. ![]() |
|
Stol, Klaas-Jan |
![]() Meng Fan, Yuxia Zhang, Klaas-Jan Stol, and Hui Liu (Beijing Institute of Technology, China; University College Cork, Ireland; Lero, Ireland; SINTEF, Norway) Continued contributions of core developers in open source software (OSS) projects are key for sustaining and maintaining successful OSS projects. A major risk to the sustainability of OSS projects is developer turnover. Prior studies have explored developer turnover at the level of individual projects. A shortcoming of such studies is that they ignore the impact of developer turnover on downstream projects. Yet, an awareness of the turnover of core developers offers useful insights to the rest of an open source ecosystem. This study performs a large-scale empirical analysis of code developer turnover in the Rust package ecosystem. We find that the turnover of core developers is quite common in the whole Rust ecosystem with 36,991 packages. This is particularly worrying as a vast majority of Rust packages only have a single core developer. We found that core developer turnover can significantly decrease the quality and efficiency of software development and maintenance, even leading to deprecation. This is a major source of concern for those Rust packages that are widely used. We surveyed developers' perspectives on the turnover of core developers in upstream packages. We found that developers widely agreed that core developer turnover can affect project stability and sustainability. They also emphasized the importance of transparency and timely notifications regarding the health status of upstream dependencies. This study provides unique insights to help communities focus on building reliable software dependency networks. ![]() ![]() Mian Qin, Yuxia Zhang, Klaas-Jan Stol, and Hui Liu (Beijing Institute of Technology, China; University College Cork, Ireland; Lero, Ireland; SINTEF, Norway) Open Source Software (OSS) projects are no longer only developed by volunteers. Instead, many organizations, from early-stage startups to large global enterprises, actively participate in many well-known projects. The survival and success of OSS projects rely on long-term contributors, who have extensive experience and knowledge. While prior literature has explored volunteer turnover in OSS, there is a paucity of research on company turnover in OSS ecosystems. Given the intensive involvement of companies in OSS and the different nature of corporate contributors vis-a-vis volunteers, it is important to investigate company turnover in OSS projects. This study first explores the prevalence and characteristics of companies that discontinue contributing to OSS projects, and then develops models to predict companies’ turnover. Based on a study of the Linux kernel, we analyze the early-stage behavior of 1,322 companies that have contributed to the project. We find that approximately 12% of companies discontinue contributing each year; one-sixth of those used to be core contributing companies (those that ranked in the top 20% by commit volume). Furthermore, withdrawing companies tend to have a lower intensity and scope of contributions, make primarily perfective changes, collaborate less, and operate on a smaller scale. We propose a Temporal Convolutional Network (TCN) deep learning model based on these indicators to predict whether companies will discontinue. The evaluation results show that the model achieves an AUC metric of .76 and an accuracy of .71. 
We evaluated the model in two other OSS projects, Rust and OpenStack, and the performance remains stable. ![]() |
|
Stolee, Kathryn T. |
![]() Kathryn T. Stolee, Tobias Welp, Caitlin Sadowski, and Sebastian Elbaum (North Carolina State University, USA; Google, Germany; Unaffiliated, USA; University of Virginia, USA) Code search is an integral part of a developer’s workflow. In 2015, researchers published a paper reflecting on the code search practices at Google of 27 developers who used the internal Code Search tool. That paper had first-hand accounts for why those developers were using code search and highlighted how often and in what situations developers were searching for code. In the past decade, much has changed in the landscape of developer support. New languages have emerged, artificial intelligence (AI) for code generation has gained traction, auto-complete in the IDE has gotten better, Q&A forums have increased in popularity, and code repositories are larger than ever. It is worth considering whether those observations from almost a decade ago have stood the test of time. In this work, inspired by the prior survey about the Code Search tool, we run a series of three surveys with 1,945 total responses and report overall Code Search usage statistics for over 100,000 users. Unlike the prior work, in our surveys, we include explicit success criteria to understand when code search is meeting their needs, and when it is not. We dive further into two common sub-categories of code search effort: when its users are looking for examples and when they are using code search alongside code review. We find that Code Search users continue to use the tool frequently and the frequency has not changed despite the introduction of AI-enhanced development support. Users continue to turn to Code Search to find examples, but the frequency of example-seeking behavior has decreased. More often than before, users access the tool to learn about and explore code. This has implications for future Code Search support in software development. ![]() |
|
Su, Xing |
![]() Xing Su, Hanzhong Liang, Hao Wu, Ben Niu, Fengyuan Xu, and Sheng Zhong (Nanjing University, China; Institute of Information Engineering at Chinese Academy of Sciences, China) Understanding Ethereum smart contract bytecode is essential for ensuring cryptoeconomics security. However, existing decompilers primarily convert bytecode into pseudocode, which is not easily comprehensible for general users, potentially leading to misunderstanding of contract behavior and increased vulnerability to scams or exploits. In this paper, we propose DiSCo, the first LLM-based EVM decompilation pipeline, which aims to enable LLMs to understand the opaque bytecode and lift it into smart contract code. DiSCo introduces three core technologies. First, a logic-invariant intermediate representation is proposed to reproject the low-level bytecode into high-level abstracted units. The second technique involves semantic enhancement based on a novel type-aware graph model to infer variables stripped during compilation, enhancing the lifting effect. The third technology is a flexible method incorporating code specifications to construct LLM-comprehensible prompts for source code generation. Extensive experiments show that the generated code achieves a high compilability rate of 75%, with a differential fuzzing pass rate averaging 50%. Manual validation results further indicate that the generated Solidity contracts significantly outperform baseline methods in tasks such as code comprehension and attack reproduction. ![]() |
|
Su, Yanqi |
![]() Yanqi Su, Zhenchang Xing, Chong Wang, Chunyang Chen, Sherry (Xiwei) Xu, Qinghua Lu, and Liming Zhu (Australian National University, Australia; CSIRO's Data61, Australia; Nanyang Technological University, Singapore; TU Munich, Germany; UNSW, Australia) Exploratory testing (ET) harnesses testers' knowledge, creativity, and experience to create varying tests that uncover unexpected bugs from the end-user's perspective. Although ET has proven effective in system-level testing of interactive systems, the need for manual execution has hindered large-scale adoption. In this work, we explore the feasibility, challenges, and road ahead of automated scenario-based ET (a.k.a. soap opera testing). We conduct a formative study, identifying key insights for effective manual soap opera testing and challenges in automating the process. We then develop a multi-agent system leveraging LLMs and a Scenario Knowledge Graph (SKG) to automate soap opera testing. The system consists of three multi-modal agents, Planner, Player, and Detector, that collaborate to execute tests and identify potential bugs. Experimental results demonstrate the potential of automated soap opera testing, but there remains a significant gap compared to manual execution, especially regarding under-explored scenario boundaries and incorrectly identified bugs. Based on these observations, we envision the road ahead for automated soap opera testing, focusing on three key aspects: the synergy of neural and symbolic approaches, human-AI co-learning, and the integration of soap opera testing with broader software engineering practices. These insights aim to guide and inspire future research. ![]() |
|
Sun, Bingkun |
![]() Bingkun Sun, Shiqi Sun, Jialin Ren, Mingming Hu, Kun Hu, Liwei Shen, and Xin Peng (Fudan University, China; Northwestern Polytechnical University, China) The Web of Things (WoT) system standardizes the integration of ubiquitous IoT devices in physical environments, enabling various software applications to automatically sense and regulate the physical environment. While providing convenience, the complex interactions among software applications and the physical environment make the WoT system vulnerable to violations caused by improper actuator operations, which may cause undesirable or even harmful results, posing serious risks to user safety and security. In response to this critical concern, many previous efforts have been made. However, existing works primarily focus on analyzing software application behaviors, with insufficient consideration of the physical interactions, multi-source violations, and environmental dynamics in such ubiquitous software systems. As a result, they fail to comprehensively detect the impact of actuator operations on the dynamic environment, thus limiting their effectiveness. To address these limitations, we propose SysGuard, a violation detection and handling approach. SysGuard employs the dynamic probabilistic graphical model (DPGM) to model the physical interactions as the physical interaction graph (PIG). In the offline phase, SysGuard takes device description models and historical device logs as input to capture physical interactions by learning the PIG. In this process, a large language model (LLM)-based causal analysis method is further introduced to filter out the device dependencies unrelated to physical interaction by analyzing the device interaction scenarios recorded in device logs. In the online phase, SysGuard processes user-customized violation rules, and monitors runtime device logs to predict violation states and generates handling policies by inferring the PIG. Evaluation on two real-world WoT systems shows that SysGuard significantly outperforms existing state-of-the-art works, achieving high performance in both violation detection and handling. It also confirms the runtime efficiency and scalability of SysGuard. An ablation experiment on our constructed dataset demonstrates that the LLM-based causal analysis significantly improves the performance of SysGuard, with accuracy increasing in both violation detection and handling. ![]() ![]() |
|
Sun, Chengnian |
![]() Weiqi Lu, Yongqiang Tian, Xiaohan Zhong, Haoyang Ma, Zhenyang Xu, Shing-Chi Cheung, and Chengnian Sun (Hong Kong University of Science and Technology, China; University of Waterloo, Canada) Data visualization (DataViz) libraries play a crucial role in presentation, data analysis, and application development, underscoring the importance of their accuracy in transforming data into visual representations. Incorrect visualizations can adversely impact user experience, distort information conveyance, and influence user perception and decision-making processes. Visual bugs in these libraries can be particularly insidious as they may not cause obvious errors like crashes, but instead mislead users of the underlying data graphically, resulting in wrong decision making. Consequently, a good understanding of the unique characteristics of bugs in DataViz libraries is essential for researchers and developers to detect and fix bugs in DataViz libraries. This study presents the first comprehensive analysis of bugs in DataViz libraries, examining 564 bugs collected from five widely-used libraries. Our study systematically analyzes their symptoms and root causes, and provides a detailed taxonomy. We found that incorrect/inaccurate plots are pervasive in DataViz libraries and incorrect graphic computation is the major root cause, which necessitates further automated testing methods for DataViz libraries. Moreover, we identified eight key steps to trigger such bugs and two test oracles specific to DataViz libraries, which may inspire future research in designing effective automated testing techniques. Furthermore, with the recent advancements in Vision Language Models (VLMs), we explored the feasibility of applying these models to detect incorrect/inaccurate plots. The results show that the effectiveness of VLMs in bug detection varies from 29% to 57%, depending on the prompts, and adding more information in prompts does not necessarily increase the effectiveness. Our findings offer valuable insights into the nature and patterns of bugs in DataViz libraries, providing a foundation for developers and researchers to improve library reliability, and ultimately benefit more accurate and reliable data visualizations across various domains. ![]() ![]() Haibo Wang, Zezhong Xing, Chengnian Sun, Zheng Wang, and Shin Hwei Tan (Concordia University, Canada; Southern University of Science and Technology, China; University of Waterloo, Canada; University of Leeds, UK) By reducing the number of lines of code, program simplification reduces code complexity, improving software maintainability and code comprehension. While several existing techniques can be used for automatic program simplification, there is no consensus on the effectiveness of these approaches. We present the first study on how real-world developers simplify programs in open-source software projects. By analyzing 382 pull requests from 296 projects, we summarize the types of program transformations used, the motivations behind simplifications, and the set of program transformations that have not been covered by existing refactoring types. As a result of our study, we submitted eight bug reports to a widely used refactoring detection tool, RefactoringMiner, where seven were fixed. Our study also identifies gaps in applying existing approaches for automating program simplification and outlines the criteria for designing automatic program simplification techniques. 
In light of these observations, we propose SimpT5, a tool to automatically produce simplified programs that are semantically equivalent programs with reduced lines of code. SimpT5 is trained on our collected dataset of 92,485 simplified programs with two heuristics: (1) modified line localization that encodes lines changed in simplified programs, and (2) checkers that measure the quality of generated programs. Experimental results show that SimpT5 outperforms prior approaches in automating developer-induced program simplification. ![]() ![]() |
|
Sun, Jun |
![]() Junjie Chen, Xingyu Fan, Chen Yang, Shuang Liu, and Jun Sun (Tianjin University, China; Renmin University of China, China; Singapore Management University, Singapore) ![]() ![]() Songyang Yan, Xiaodong Zhang, Kunkun Hao, Haojie Xin, Yonggang Luo, Jucheng Yang, Ming Fan, Chao Yang, Jun Sun, and Zijiang Yang (Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China) The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS's reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulators using various ADSs. We verify OSG's ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels. ![]() ![]() |
|
Sun, Meng |
![]() Xiaokun Luan, David Sanan, Zhe Hou, Qiyuan Xu, Chengwei Liu, Yufan Cai, Yang Liu, and Meng Sun (Peking University, China; Singapore Institute of Technology, Singapore; Griffith University, Australia; Nanyang Technological University, Singapore; China-Singapore International Joint Research Institute, China; National University of Singapore, Singapore) Proof assistants are software tools for formal modeling and verification of software, hardware, design, and mathematical proofs. Due to the growing complexity and scale of formal proofs, compatibility issues frequently arise when using different versions of proof assistants. These issues result in broken proofs, disrupting the maintenance of formalized theories and hindering the broader dissemination of results within the community. Although existing works have proposed techniques to address specific types of compatibility issues, the overall characteristics of these issues remain largely unexplored. To address this gap, we conduct the first extensive empirical study to characterize compatibility issues, using Isabelle as a case study. We develop a regression testing framework to automatically collect compatibility issues from the Archive of Formal Proofs, the largest repository of formal proofs in Isabelle. By analyzing 12,079 collected issues, we identify their types and symptoms and further investigate their root causes. We also extract updated proofs that address these issues to understand the applied resolution strategies. Our study provides an in-depth understanding of compatibility issues in proof assistants, offering insights that support the development of effective techniques to mitigate these issues. ![]() ![]() |
|
Sun, Shiqi |
![]() Bingkun Sun, Shiqi Sun, Jialin Ren, Mingming Hu, Kun Hu, Liwei Shen, and Xin Peng (Fudan University, China; Northwestern Polytechnical University, China) The Web of Things (WoT) system standardizes the integration of ubiquitous IoT devices in physical environments, enabling various software applications to automatically sense and regulate the physical environment. While providing convenience, the complex interactions among software applications and the physical environment make the WoT system vulnerable to violations caused by improper actuator operations, which may cause undesirable or even harmful results, posing serious risks to user safety and security. In response to this critical concern, many previous efforts have been made. However, existing works primarily focus on analyzing software application behaviors, with insufficient consideration of the physical interactions, multi-source violations, and environmental dynamics in such ubiquitous software systems. As a result, they fail to comprehensively detect the impact of actuator operations on the dynamic environment, thus limiting their effectiveness. To address these limitations, we propose SysGuard, a violation detection and handling approach. SysGuard employs the dynamic probabilistic graphical model (DPGM) to model the physical interactions as the physical interaction graph (PIG). In the offline phase, SysGuard takes device description models and historical device logs as input to capture physical interactions by learning the PIG. In this process, a large language model (LLM)-based causal analysis method is further introduced to filter out the device dependencies unrelated to physical interaction by analyzing the device interaction scenarios recorded in device logs. In the online phase, SysGuard processes user-customized violation rules, and monitors runtime device logs to predict violation states and generates handling policies by inferring the PIG. Evaluation on two real-world WoT systems shows that SysGuard significantly outperforms existing state-of-the-art works, achieving high performance in both violation detection and handling. It also confirms the runtime efficiency and scalability of SysGuard. An ablation experiment on our constructed dataset demonstrates that the LLM-based causal analysis significantly improves the performance of SysGuard, with accuracy increasing in both violation detection and handling. ![]() ![]() |
|
Sun, Weisong |
![]() Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, and Yang Liu (Nanyang Technological University, Singapore; Peking University, China; King's College London, UK; Fudan University, China) Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods. ![]() ![]() Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Zhenyu Chen, Baowen Xu, and Yang Liu (Nanyang Technological University, Singapore; Nanjing University, China) Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. It poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could jeopardize life safety. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper, we innovatively propose a backdoor elimination technique for secure code understanding, called EliBadCode. EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of non-backdoor (adversarial) perturbations for subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. 
Backdoor triggers can be viewed as backdoor (adversarial) perturbations. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode substantially decreases the average Attack Success Rate (ASR) of the advanced backdoor attack from 99.76% to 2.64%, significantly outperforming the three baselines. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline). ![]() |
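The vocabulary-filtering step described in this entry (restricting the trigger search space to tokens that fit a language's naming conventions) can be illustrated with a minimal sketch. The `vocab` list and the regular expression below are illustrative assumptions, not EliBadCode's actual tokenizer or implementation.

```python
import re

# Hypothetical model vocabulary (in practice this comes from the NCM's tokenizer).
vocab = ["testFoo", "_tmp", "42", "+=", "<pad>", "renderHtml", "x1", "##ly", "@@"]

# Java-style identifier convention: letters, digits, '_', '$', not starting with a digit.
JAVA_IDENTIFIER = re.compile(r"^[A-Za-z_$][A-Za-z0-9_$]*$")

def filter_trigger_candidates(tokens):
    """Keep only tokens that could legally occur as identifiers, shrinking
    the search space (and cost) of trigger inversion."""
    return [t for t in tokens if JAVA_IDENTIFIER.match(t)]

print(filter_trigger_candidates(vocab))  # ['testFoo', '_tmp', 'renderHtml', 'x1']
```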
|
Sun, Xiaoyu |
![]() Ting Zhou, Yanjie Zhao, Xinyi Hou, Xiaoyu Sun, Kai Chen, and Haoyu Wang (Huazhong University of Science and Technology, China; Australian National University, Australia) Declarative UI frameworks have gained widespread adoption in mobile app development, offering benefits such as improved code readability and easier maintenance. Despite these advantages, the process of translating UI designs into functional code remains challenging and time-consuming. Recent advancements in multimodal large language models (MLLMs) have shown promise in directly generating mobile app code from user interface (UI) designs. However, the direct application of MLLMs to this task is limited by challenges in accurately recognizing UI components and comprehensively capturing interaction logic. To address these challenges, we propose DeclarUI, an automated approach that synergizes computer vision (CV), MLLMs, and iterative compiler-driven optimization to generate and refine declarative UI code from designs. DeclarUI enhances visual fidelity, functional completeness, and code quality through precise component segmentation, Page Transition Graphs (PTGs) for modeling complex inter-page relationships, and iterative optimization. In our evaluation, DeclarUI outperforms baselines on React Native, a widely adopted declarative UI framework, achieving a 96.8% PTG coverage rate and a 98% compilation success rate. Notably, DeclarUI demonstrates significant improvements over state-of-the-art MLLMs, with a 123% increase in PTG coverage rate, up to 55% enhancement in visual similarity scores, and a 29% boost in compilation success rate. We further demonstrate DeclarUI’s generalizability through successful applications to Flutter and ArkUI frameworks. User studies with professional developers confirm that DeclarUI’s generated code meets industrial-grade standards in code availability, modification time, readability, and maintainability. By streamlining app development, improving efficiency, and fostering designer-developer collaboration, DeclarUI offers a practical solution to the persistent challenges in mobile UI development. ![]() |
|
Sun, Yuteng |
![]() Yuteng Sun, Joyanta Debnath, Wenzheng Hong, Omar Chowdhury, and Sze Yiu Chau (Chinese University of Hong Kong, Hong Kong; Stony Brook University, USA; Independent, China) The X.509 Public Key Infrastructure (PKI) provides a cryptographically verifiable mechanism for authenticating a binding of an entity’s public-key with its identity, presented in a tamper-proof digital certificate. This often serves as a foundational building block for achieving different security guarantees in many critical applications and protocols (e.g., SSL/TLS). Identities in the context of X.509 PKI are often captured as names, which are encoded in certificates as composite records with different optional fields that can store various types of string values (e.g., ASCII, UTF8). Although such flexibility allows for the support of diverse name types (e.g., IP addresses, DNS names) and application scenarios, it imposes on library developers obligations to enforce unnecessarily convoluted requirements. Bugs in enforcing these requirements can lead to potential interoperability and performance issues, and might open doors to impersonation attacks. This paper focuses on analyzing how open-source libraries enforce the constraints regarding the formatting, encoding, and processing of complex name structures on X.509 certificate chains, for the purpose of establishing identities. Our analysis reveals that a portfolio of simplistic testing approaches can expose blatant violations of the requirements in widely used open-source libraries. Although there is a substantial amount of prior work that focused on testing the overall certificate chain validation process of X.509 libraries, the identified violations have escaped their scrutiny. To make matters worse, we observe that almost all the analyzed libraries completely ignore certain pre-processing steps prescribed by the standard. This begs the question of whether it is beneficial to have a standard so flexible but complex that none of the implementations can faithfully adhere to it. With our test results, we argue in the negative, and explain how simpler alternatives (e.g., other forms of identifiers such as Authority and Subject Key Identifiers) can be used to enforce similar constraints with no loss of security. ![]() |
|
Sun, Zeyu |
![]() Mingxuan Zhu, Zeyu Sun, and Dan Hao (Peking University, China; Institute of Software at Chinese Academy of Sciences, China) Compilers are crucial software tools that usually convert programs in high-level languages into machine code. A compiler provides hundreds of optimizations to improve the performance of the compiled code, which are controlled by enabled or disabled optimization flags. However, the vast number of combinations of these flags makes it extremely challenging to select the desired settings for compiler optimization flags (i.e., an optimization sequence) for a given target program. In the literature, many auto-tuning techniques have been proposed to select a desired optimization sequence via different strategies across the entire optimization space. However, due to the huge optimization space, these techniques commonly suffer from the widely recognized efficiency problem. To address this problem, in this paper, we propose a preference-driven selection approach, PDCAT, which reduces the search space of optimization sequences through three components. In particular, PDCAT first identifies combined optimizations based on compiler documentation to exclude optimization sequences violating the combined constraints, and then categorizes the optimizations into a common optimization set (whose optimization flags are fixed) and an exploration set containing the remaining optimizations. Finally, within the search process, PDCAT assigns distinct enable probabilities to the explored optimization flags and then selects a desired optimization sequence. The former two components reduce the search space by removing invalid optimization sequences and fixing some optimization flags, whereas the latter performs a biased search in the search space. To evaluate the performance of the proposed approach PDCAT, we conducted an extensive experimental study on the latest version of the GCC compiler with two widely used benchmarks, cBench and PolyBench. The results show that PDCAT significantly outperforms the four compared techniques, including the state-of-the-art technique SRTuner. Moreover, each component of PDCAT not only contributes to its performance, but also improves the acceleration performance of the compared techniques. ![]() |
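The core sampling idea in this entry (fix a common flag set, reject sequences that violate combined constraints, and draw the remaining flags with biased enable probabilities) can be sketched as follows. The flag names, probabilities, and constraint are illustrative placeholders, not GCC's real optimization space or PDCAT's implementation.

```python
import random

# Illustrative placeholder flags and probabilities; not GCC's real optimization space.
common_set = {"-fopt-a": True, "-fopt-b": True}                      # fixed by the common set
exploration_set = {"-fopt-c": 0.7, "-fopt-d": 0.4, "-fopt-e": 0.6}   # biased enable probabilities

def violates_constraints(seq):
    # Made-up combined constraint: pretend -fopt-e is only valid when -fopt-c is enabled.
    return seq.get("-fopt-e") and not seq.get("-fopt-c")

def sample_sequence(rng):
    """Draw one candidate optimization sequence, rejecting invalid combinations."""
    while True:
        seq = dict(common_set)
        for flag, prob in exploration_set.items():
            seq[flag] = rng.random() < prob
        if not violates_constraints(seq):
            return seq

rng = random.Random(0)
print(sample_sequence(rng))
```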
|
Sunshine, Joshua |
![]() Matthew Davis, Amy Wei, Brad Myers, and Joshua Sunshine (Carnegie Mellon University, USA; University of Michigan, USA) Software testing is difficult, tedious, and may consume 28%–50% of software engineering labor. Automatic test generators aim to ease this burden but have important trade-offs. Fuzzers use an implicit oracle that can detect obviously invalid results, but the oracle problem has no general solution, and an implicit oracle cannot automatically evaluate correctness. Test suite generators like EvoSuite use the program under test as the oracle and therefore cannot evaluate correctness. Property-based testing tools evaluate correctness, but users have difficulty coming up with properties to test and understanding whether their properties are correct. Consequently, practitioners create many test suites manually and often use an example-based oracle to tediously specify correct input and output examples. To help bridge the gaps among various oracle and tool types, we present the Composite Oracle, which organizes various oracle types into a hierarchy and renders a single test result per example execution. To understand the Composite Oracle’s practical properties, we built TerzoN, a test suite generator that includes a particular instantiation of the Composite Oracle. TerzoN displays all the test results in an integrated view composed from the results of three types of oracles and finds some types of test assertion inconsistencies that might otherwise lead to misleading test results. We evaluated TerzoN in a randomized controlled trial with 14 professional software engineers with a popular industry tool, fast-check, as the control. Participants using TerzoN elicited 72% more bugs (p < 0.01), accurately described more than twice the number of bugs (p < 0.01) and tested 16% more quickly (p < 0.05) relative to fast-check. ![]() ![]() |
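The Composite Oracle idea in this entry (organize several oracle types into a hierarchy and return a single verdict per example execution) can be sketched minimally as below. The oracle ordering and return values are assumptions for illustration, not TerzoN's actual design.

```python
def composite_oracle(func, example_input, expected=None, properties=()):
    """Run one example through a hierarchy of oracles and return a single verdict.
    Assumed order: implicit oracle -> example-based oracle -> property-based oracles."""
    try:
        result = func(example_input)            # implicit oracle: crashes are failures
    except Exception as exc:
        return f"FAIL (implicit oracle): {exc!r}"
    if expected is not None and result != expected:
        return f"FAIL (example oracle): got {result!r}, expected {expected!r}"
    for prop in properties:                     # property-based oracles
        if not prop(example_input, result):
            return f"FAIL (property oracle): {prop.__name__}"
    return "PASS"

# Usage: a toy absolute-value function that forgets to negate negatives.
buggy_abs = lambda x: x
print(composite_oracle(buggy_abs, -3, expected=3, properties=[lambda x, r: r >= 0]))
```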
|
Tan, Shin Hwei |
![]() Haibo Wang, Zezhong Xing, Chengnian Sun, Zheng Wang, and Shin Hwei Tan (Concordia University, Canada; Southern University of Science and Technology, China; University of Waterloo, Canada; University of Leeds, UK) By reducing the number of lines of code, program simplification reduces code complexity, improving software maintainability and code comprehension. While several existing techniques can be used for automatic program simplification, there is no consensus on the effectiveness of these approaches. We present the first study on how real-world developers simplify programs in open-source software projects. By analyzing 382 pull requests from 296 projects, we summarize the types of program transformations used, the motivations behind simplifications, and the set of program transformations that have not been covered by existing refactoring types. As a result of our study, we submitted eight bug reports to a widely used refactoring detection tool, RefactoringMiner, where seven were fixed. Our study also identifies gaps in applying existing approaches for automating program simplification and outlines the criteria for designing automatic program simplification techniques. In light of these observations, we propose SimpT5, a tool to automatically produce simplified programs that are semantically equivalent programs with reduced lines of code. SimpT5 is trained on our collected dataset of 92,485 simplified programs with two heuristics: (1) modified line localization that encodes lines changed in simplified programs, and (2) checkers that measure the quality of generated programs. Experimental results show that SimpT5 outperforms prior approaches in automating developer-induced program simplification. ![]() ![]() |
|
Tan, Siwei |
![]() Siwei Tan, Liqiang Lu, Debin Xiang, Tianyao Chu, Congliang Lang, Jintao Chen, Xing Hu, and Jianwei Yin (Zhejiang University, China) Quantum programs provide exponential speedups compared to classical programs in certain areas, but they also inevitably encounter logical faults. Automatically repairing quantum programs is much more challenging than repairing classical programs due to the non-replicability of data, the vast search space of program inputs, and the new programming paradigm. Existing works based on semantic-based or learning-based program repair techniques are fundamentally limited in repairing efficiency and effectiveness. In this work, we propose HornBro, an efficient framework for automated quantum program repair. The key insight of HornBro lies in the homotopy-like method, which iteratively switches between the classical and quantum parts. This approach allows the repair tasks to be efficiently offloaded to the most suitable platforms, enabling a progressive convergence toward the correct program. We start by designing an implication assertion pragma to enable rigorous specifications of quantum program behavior, which helps to generate a quantum test suite automatically. This suite leverages the orthonormal bases of quantum programs to accommodate different encoding schemes. Given a fixed number of test cases, it allows the maximum input coverage of potential counter-example candidates. Then, we develop a Clifford approximation method with an SMT-based search to transform the patch localization program into a symbolic reasoning problem. Finally, we offload the computationally intensive repair of gate parameters to quantum hardware by leveraging the differentiability of quantum gates. Experiments suggest that HornBro increases the repair success rate by more than 62.5% compared to the existing repair techniques, supporting more types of quantum bugs. It also achieves 35.7× speedup in the repair and 99.9% gate reduction of the patch. ![]() ![]() |
|
Tan, Xin |
![]() Xin Tan, Xiao Long, Yinghao Zhu, Lin Shi, Xiaoli Lian, and Li Zhang (Beihang University, China) Onboarding newcomers is vital for the sustainability of open-source software (OSS) projects. To lower barriers and increase engagement, OSS projects have dedicated experts who provide guidance for newcomers. However, timely responses are often hindered by experts' busy schedules. The recent rapid advancements of AI in software engineering have brought opportunities to leverage AI as a substitute for expert mentoring. However, the potential role of AI as a comprehensive mentor throughout the entire onboarding process remains unexplored. To identify design strategies of this "AI mentor", we applied Design Fiction as a participatory method with 19 OSS newcomers. We investigated their current onboarding experience and elicited 32 design strategies for a future AI mentor. Participants envisioned the AI mentor being integrated into OSS platforms like GitHub, where it could offer assistance to newcomers, such as "recommending projects based on personalized requirements" and "assessing and categorizing project issues by difficulty". We also collected participants' perceptions of a prototype, named "OSSerCopilot", that implemented the envisioned strategies. They found the interface useful and user-friendly, showing a willingness to use it in the future, which suggests the design strategies are effective. Finally, in order to identify the gaps between our design strategies and current research, we conducted a comprehensive literature review, evaluating the extent of existing research support for this concept. We find that research is relatively scarce in certain areas where newcomers highly anticipate AI mentor assistance, such as "discovering an interested project". Our study has the potential to revolutionize the current newcomer-expert mentorship and provides valuable insights for researchers and tool designers aiming to develop and enhance AI mentor systems. ![]() |
|
Tang, Xi |
![]() Junyao Ye, Zhen Li, Xi Tang, Deqing Zou, Shouhuai Xu, Weizhong Qiang, and Hai Jin (Huazhong University of Science and Technology, China; University of Colorado at Colorado Springs, USA) ![]() |
|
Tang, Xiaoying |
![]() Haowen Yang, Zhengda Li, Zhiqing Zhong, Xiaoying Tang, and Pinjia He (Chinese University of Hong Kong, Shenzhen, China) With the increasing demand for handling large-scale and complex data, data science (DS) applications often suffer from long execution times and rapid RAM exhaustion, which leads to many serious issues like unbearable delays and crashes in financial transactions. As popular DS libraries are frequently used in these applications, their performance bugs (PBs) are a major contributing factor to these issues, making it crucial to address them for improving overall application performance. However, PBs in popular DS libraries remain largely unexplored. To address this gap, we conducted a study of 202 PBs collected from seven popular DS libraries. Our study examined the impact of PBs and proposed a taxonomy for common root causes. We found that over half of the PBs arise from inefficient data processing operations, especially within data structures. We also explored the effort required to locate their root causes and fix these bugs, along with the challenges involved. Notably, around 20% of the PBs could be fixed using simple strategies (e.g., Conditions Optimizing), suggesting the potential for automated repair approaches. Our findings highlight the severity of PBs in core DS libraries and offer insights for developing high-performance libraries and detecting PBs. Furthermore, we derived test rules from our identified root causes, identifying eight PBs, of which four were confirmed, demonstrating the practical utility of our findings. ![]() |
|
Tang, Zhanyong |
![]() Weiyuan Tong, Zixu Wang, Zhanyong Tang, Jianbin Fang, Yuqun Zhang, and Guixin Ye (NorthWest University, China; National University of Defense Technology, China; Southern University of Science and Technology, China) ![]() |
|
Taylor, Raygan |
![]() Soneya Binta Hossain, Raygan Taylor, and Matthew Dwyer (University of Virginia, USA; Dillard University, USA) Code documentation is a critical artifact of software development, bridging human understanding and machine-readable code. Beyond aiding developers in code comprehension and maintenance, documentation also plays a critical role in automating various software engineering tasks, such as test oracle generation (TOG). In Java, Javadoc comments offer structured, natural language documentation embedded directly within the source code, typically describing functionality, usage, parameters, return values, and exceptional behavior. While prior research has explored the use of Javadoc comments in TOG alongside other information, such as the method under test (MUT), their potential as a stand-alone input source, the most relevant Javadoc components, and guidelines for writing effective Javadoc comments for automating TOG remain less explored. In this study, we investigate the impact of Javadoc comments on TOG through a comprehensive analysis. We begin by fine-tuning 10 large language models using three different prompt pairs to assess the role of Javadoc comments alongside other contextual information. Next, we systematically analyze the impact of different Javadoc comment components on TOG. To evaluate the generalizability of Javadoc comments from various sources, we also generate them using the GPT-3.5 model. We perform a thorough bug detection study using the Defects4J dataset to understand their role in real-world bug detection. Our results show that incorporating Javadoc comments improves the accuracy of test oracles in most cases, aligning closely with ground truth. We find that Javadoc comments alone can match or even outperform approaches that utilize the MUT implementation. Additionally, we identify that the description and the return tag are the most valuable components for TOG. Finally, our approach, when using only Javadoc comments, detects between 19% and 94% more real-world bugs in Defects4J than prior methods, establishing a new state-of-the-art. To further guide developers in writing effective documentation, we conduct a detailed qualitative study on when Javadoc comments are helpful or harmful for TOG. ![]() |
|
Terragni, Valerio |
![]() Lam Nguyen Tung, Steven Cho, Xiaoning Du, Neelofar Neelofar, Valerio Terragni, Stefano Ruberto, and Aldeida Aleti (Monash University, Australia; University of Auckland, New Zealand; Royal Melbourne Institute of Technology, Australia; Joint Research Centre at the European Commission, Italy) ![]() ![]() Hengcheng Zhu, Valerio Terragni, Lili Wei, Shing-Chi Cheung, Jiarong Wu, and Yepang Liu (Hong Kong University of Science and Technology, China; University of Auckland, New Zealand; McGill University, Canada; Southern University of Science and Technology, China) Mock assertions provide developers with a powerful means to validate program behaviors that are unobservable to test assertions. Despite their significance, they are rarely considered by automated test generation techniques. Effective generation of mock assertions requires understanding how they are used in practice. Although previous studies highlighted the importance of mock assertions, none provide insight into their usages. To bridge this gap, we conducted the first empirical study on mock assertions, examining their adoption, the characteristics of the verified method invocations, and their effectiveness in fault detection. Our analysis of 4,652 test cases from 11 popular Java projects reveals that mock assertions are mostly applied to validating specific kinds of method calls, such as those interacting with external resources and those reflecting whether a certain code path was traversed in systems under test. Additionally, we find that mock assertions complement traditional test assertions by ensuring the desired side effects have been produced, validating control flow logic, and checking internal computation results. Our findings contribute to a better understanding of mock assertion usages and provide a foundation for future related research such as automated test generation that supports mock assertions. ![]() ![]() |
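The kinds of mock assertions described in this entry (checking that an interaction with an external resource happened, with the expected arguments) can be illustrated with Python's unittest.mock; the `register_user` unit under test is a hypothetical example, not taken from the studied projects.

```python
from unittest.mock import Mock

def register_user(db, mailer, email):
    """Hypothetical unit under test: stores the user, then sends a welcome mail."""
    db.insert({"email": email})
    mailer.send(to=email, subject="Welcome")

def test_register_user_sends_welcome_mail():
    db, mailer = Mock(), Mock()
    register_user(db, mailer, "a@example.com")
    # Mock assertions: validate side effects that ordinary test assertions cannot observe.
    db.insert.assert_called_once_with({"email": "a@example.com"})
    mailer.send.assert_called_once_with(to="a@example.com", subject="Welcome")

test_register_user_sends_welcome_mail()
print("mock assertions passed")
```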
|
Thakur, Addi Malviya |
![]() Addi Malviya Thakur, Reed Milewicz, Mahmoud Jahanshahi, Lavínia Paganini, Bogdan Vasilescu, and Audris Mockus (University of Tennessee at Knoxville, USA; Oak Ridge National Laboratory, USA; Sandia National Laboratories, USA; Eindhoven University of Technology, Netherlands; Carnegie Mellon University, USA) ![]() |
|
Tian, Yongqiang |
![]() Weiqi Lu, Yongqiang Tian, Xiaohan Zhong, Haoyang Ma, Zhenyang Xu, Shing-Chi Cheung, and Chengnian Sun (Hong Kong University of Science and Technology, China; University of Waterloo, Canada) Data visualization (DataViz) libraries play a crucial role in presentation, data analysis, and application development, underscoring the importance of their accuracy in transforming data into visual representations. Incorrect visualizations can adversely impact user experience, distort information conveyance, and influence user perception and decision-making processes. Visual bugs in these libraries can be particularly insidious as they may not cause obvious errors like crashes, but instead graphically mislead users about the underlying data, resulting in wrong decision-making. Consequently, a good understanding of the unique characteristics of bugs in DataViz libraries is essential for researchers and developers to detect and fix bugs in DataViz libraries. This study presents the first comprehensive analysis of bugs in DataViz libraries, examining 564 bugs collected from five widely-used libraries. Our study systematically analyzes their symptoms and root causes, and provides a detailed taxonomy. We found that incorrect/inaccurate plots are pervasive in DataViz libraries and incorrect graphic computation is the major root cause, which necessitates further automated testing methods for DataViz libraries. Moreover, we identified eight key steps to trigger such bugs and two test oracles specific to DataViz libraries, which may inspire future research in designing effective automated testing techniques. Furthermore, with the recent advancements in Vision Language Models (VLMs), we explored the feasibility of applying these models to detect incorrect/inaccurate plots. The results show that the effectiveness of VLMs in bug detection varies from 29% to 57%, depending on the prompts, and adding more information in prompts does not necessarily increase the effectiveness. Our findings offer valuable insights into the nature and patterns of bugs in DataViz libraries, providing a foundation for developers and researchers to improve library reliability, and ultimately benefit more accurate and reliable data visualizations across various domains. ![]() |
|
Tian, Zhiliang |
![]() Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Weiguo Chen, Zixuan Dong, Mingyue Leng, and Xiaoguang Mao (National University of Defense Technology, China) The current landscape of binary code summarization predominantly focuses on generating a single summary, which limits its utility and understanding for reverse engineers. Existing approaches often fail to meet the diverse needs of users, such as providing detailed insights into usage patterns, implementation nuances, and design rationale, as observed in the field of source code summarization. This highlights the need for multi-intent binary code summarization to enhance the effectiveness of reverse engineering processes. To address this gap, we propose MiSum, a novel method that leverages multi-modality heterogeneous code graph alignment and learning to integrate both assembly code and pseudo-code. MiSum introduces a unified multi-modality heterogeneous code graph (MM-HCG) that aligns assembly code graphs with pseudo-code graphs, capturing both low-level execution details and high-level structural information. We further propose multi-modality heterogeneous graph learning with heterogeneous mutual attention and message passing, which highlights important code blocks and discovers inter-dependencies across different code forms. Additionally, an intent-aware summary generator with an intent-aware attention mechanism is introduced to produce customized summaries tailored to multiple intents. Extensive experiments, including evaluations across various architectures and optimization levels, demonstrate that MiSum outperforms state-of-the-art baselines in BLEU, METEOR, and ROUGE-L metrics. Human evaluations validate its capability to effectively support reverse engineers in understanding diverse binary code intents, marking a significant advancement in binary code analysis. ![]() ![]() ![]() |
|
Toffalini, Flavio |
![]() Flavio Toffalini, Nicolas Badoux, Zurab Tsinadze, and Mathias Payer (EPFL, Switzerland; Ruhr University Bochum, Germany) ![]() ![]() Han Zheng, Flavio Toffalini, Marcel Böhme, and Mathias Payer (EPFL, Switzerland; Ruhr University Bochum, Germany; MPI-SP, Germany) Can a fuzzer cover more code with minimal corruption of the initial seed? Before a seed is fuzzed, the early greybox fuzzers first systematically enumerated slightly corrupted inputs by applying every mutation operator to every part of the seed, once per generated input. The hope of this so-called “deterministic” stage was that simple changes to the seed would be less likely to break the complex file format; the resulting inputs would find bugs in the program logic well beyond the program’s parser. However, when experiments showed that disabling the deterministic stage achieves more coverage, applying multiple mutation operators at the same time to a single input, most fuzzers disabled the deterministic stage by default. Instead of ignoring the deterministic stage, we analyze its potential and substantially improve deterministic exploration. Our deterministic stage is now the default in AFL++, reverting the earlier decision of dropping deterministic exploration. We start by investigating the overhead and the contribution of the deterministic stage to the discovery of coverage-increasing inputs. While the sheer number of generated inputs explains the overhead, we find that only a few critical seeds (20%), and only a few critical bytes in a seed (0.5%) are responsible for the vast majority of the coverage-increasing inputs (83% and 84%, respectively). Hence, we develop an efficient technique, called , to identify these critical seeds / bytes so as to prune a large number of unnecessary inputs. retains the benefits of the classic deterministic stage by only enumerating a tiny part of the total deterministic state space. We evaluate implementation on two benchmarking frameworks, FuzzBench and Magma. Our evaluation shows that outperforms state-of-the-art fuzzers with and without the (old) deterministic stage enabled, both in terms of coverage and bug finding. also discovered 8 new CVEs on exhaustively fuzzed security-critical applications. Finally, has been independently evaluated and integrated into AFL++ as default option. ![]() |
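The key observation in the fuzzing entry above (only a few critical seed bytes account for most coverage-increasing inputs, so deterministic exploration can be restricted to them) can be sketched as below. The `get_coverage` function is a hypothetical stand-in for running the instrumented target; it is not the paper's tooling.

```python
def get_coverage(data: bytes) -> set:
    """Hypothetical stand-in for executing the instrumented target and
    returning the set of covered edges."""
    cov = {0}
    if len(data) > 2 and data[2] == 0xFF:
        cov.add(1)            # toy branch that depends only on byte 2
    return cov

def find_critical_bytes(seed: bytes, baseline: set) -> list:
    """Flip each byte once; a byte is 'critical' if some flip increases coverage.
    Later deterministic mutation would then be restricted to these positions."""
    critical = []
    for i in range(len(seed)):
        mutated = bytearray(seed)
        mutated[i] ^= 0xFF
        if get_coverage(bytes(mutated)) - baseline:
            critical.append(i)
    return critical

seed = b"\x00\x00\x00\x00"
print(find_critical_bytes(seed, get_coverage(seed)))  # [2] in this toy example
```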
|
Toledo, Felipe |
![]() Trey Woodlief, Felipe Toledo, Matthew Dwyer, and Sebastian Elbaum (University of Virginia, USA) ![]() |
|
Tonella, Paolo |
![]() Matteo Biagiola, Robert Feldt, and Paolo Tonella (USI Lugano, Switzerland; Chalmers University of Technology, Sweden) Adaptive Random Testing (ART) has faced criticism, particularly for its computational inefficiency, as highlighted by Arcuri and Briand. Their analysis clarified how ART requires a quadratic number of distance computations as the number of test executions increases, which limits its scalability in scenarios requiring extensive testing to uncover faults. Simulation results support this, showing that the computational overhead of these distance calculations often outweighs ART’s benefits. While various ART variants have attempted to reduce these costs, they frequently do so at the expense of fault detection, lack complexity guarantees, or are restricted to specific input types, such as numerical or discrete data. In this paper, we introduce a novel framework for adaptive random testing that replaces pairwise distance computations with a compact aggregation of past executions, such as counting the q-grams observed in previous runs. Test case selection then leverages this aggregated data to measure diversity (e.g., entropy of q-grams), allowing us to reduce the computational complexity from quadratic to linear. Experiments with a benchmark of six web applications show that ART with q-grams covers, on average, 4× more unique targets than random testing, and 3.5× more than ART using traditional distance-based methods. ![]() |
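The aggregation idea in this entry (keep a running summary of q-grams seen so far instead of pairwise distances, then pick the candidate that adds the most diversity) can be sketched minimally. The novelty score below is a simplification for illustration, not the paper's exact entropy-based measure.

```python
from collections import Counter

def qgrams(seq, q=2):
    return [tuple(seq[i:i + q]) for i in range(len(seq) - q + 1)]

def pick_next(candidates, seen: Counter, q=2):
    """Select the candidate contributing the most q-grams unseen in past executions.
    Cost is linear in the candidates, not quadratic in all past test cases."""
    def novelty(c):
        return sum(1 for g in qgrams(c, q) if seen[g] == 0)
    return max(candidates, key=novelty)

seen = Counter(qgrams("ABAB"))           # compact aggregate of past executions
candidates = ["ABAB", "ABCD", "AACC"]
chosen = pick_next(candidates, seen)
seen.update(qgrams(chosen))              # update the aggregate; no pairwise distances are kept
print(chosen)                            # 'AACC' in this toy run
```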
|
Tong, Weiyuan |
![]() Weiyuan Tong, Zixu Wang, Zhanyong Tang, Jianbin Fang, Yuqun Zhang, and Guixin Ye (NorthWest University, China; National University of Defense Technology, China; Southern University of Science and Technology, China) ![]() |
|
Tsinadze, Zurab |
![]() Flavio Toffalini, Nicolas Badoux, Zurab Tsinadze, and Mathias Payer (EPFL, Switzerland; Ruhr University Bochum, Germany) ![]() |
|
Tu, Tianyu |
![]() Shoupeng Ren, Lipeng He, Tianyu Tu, Di Wu, Jian Liu, Kui Ren, and Chun Chen (Zhejiang University, China; University of Waterloo, Canada) ![]() |
|
Tu, Zhi |
![]() Zhi Tu, Liangkun Niu, Wei Fan, and Tianyi Zhang (Purdue University, USA) ![]() |
|
Turcotte, Alexi |
![]() Alexi Turcotte and Zheyuan Wu (CISPA, Germany; Saarland University, Germany) ![]() |
|
Turzo, Asif Kamal |
![]() Jaydeb Sarker, Asif Kamal Turzo, and Amiangshu Bosu (University of Nebraska at Omaha, USA; Wayne State University, USA) Toxicity on GitHub can severely impact Open Source Software (OSS) development communities. To mitigate such behavior, a better understanding of its nature and how various measurable characteristics of project contexts and participants are associated with its prevalence is necessary. To achieve this goal, we conducted a large-scale mixed-method empirical study of 2,828 GitHub-based OSS projects randomly selected based on a stratified sampling strategy. Using ToxiCR, an SE domain-specific toxicity detector, we automatically classified each comment as toxic or non-toxic. Additionally, we manually analyzed a random sample of 600 comments to validate ToxiCR's performance and gain insights into the nature of toxicity within our dataset. The results of our study suggest that profanity is the most frequent toxicity on GitHub, followed by trolling and insults. While a project's popularity is positively associated with the prevalence of toxicity, its issue resolution rate has the opposite association. Corporate-sponsored projects are less toxic, but gaming projects are seven times more toxic than non-gaming ones. OSS contributors who have authored toxic comments in the past are significantly more likely to repeat such behavior. Moreover, such individuals are more likely to become targets of toxic texts. ![]() ![]() |
|
Uddin, Gias |
![]() Borui Yang, Md Afif Al Mamun, Jie M. Zhang, and Gias Uddin (King's College London, UK; University of Calgary, Canada; York University, Canada) Large Language Models (LLMs) are prone to hallucinations, e.g., factually incorrect information, in their responses. These hallucinations present challenges for LLM-based applications that demand high factual accuracy. Existing hallucination detection methods primarily depend on external resources, which can suffer from issues such as low availability, incomplete coverage, privacy concerns, high latency, low reliability, and poor scalability. There are also methods depending on output probabilities, which are often inaccessible for closed-source LLMs like GPT models. This paper presents MetaQA, a self-contained hallucination detection approach that leverages metamorphic relations and prompt mutation. Unlike existing methods, MetaQA operates without any external resources and is compatible with both open-source and closed-source LLMs. MetaQA is based on the hypothesis that if an LLM’s response is a hallucination, the designed metamorphic relations will be violated. We compare MetaQA with the state-of-the-art zero-resource hallucination detection method, SelfCheckGPT, across multiple datasets, and on two open-source and two closed-source LLMs. Our results reveal that MetaQA outperforms SelfCheckGPT in terms of precision, recall, and F1 score. For the four LLMs we study, MetaQA outperforms SelfCheckGPT with a superiority margin ranging from 0.041 to 0.113 (precision), 0.143 to 0.430 (recall), and 0.154 to 0.368 (F1-score). For instance, with Mistral-7B, MetaQA achieves an average F1-score of 0.435, compared to SelfCheckGPT’s F1-score of 0.205, representing an improvement rate of 112.2%. MetaQA also demonstrates superiority across all different categories of questions. ![]() |
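The central hypothesis in this entry (a hallucinated answer tends to violate metamorphic relations under prompt mutation) can be sketched as follows. `ask_llm` is a hypothetical stand-in for a real model call with canned answers, and the single paraphrase rule is illustrative rather than MetaQA's actual mutation strategy.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; answers are canned for the demo."""
    canned = {
        "Who wrote 'Dubliners'?": "James Joyce",
        "'Dubliners' was written by whom?": "Samuel Beckett",   # an inconsistent answer
    }
    return canned.get(prompt, "unknown")

def paraphrase(question: str) -> str:
    """Illustrative prompt mutation; real mutations would themselves be generated."""
    subject = question[len("Who wrote "):-1]      # e.g. "'Dubliners'"
    return f"{subject} was written by whom?"

def looks_like_hallucination(question: str) -> bool:
    """Metamorphic relation: semantically equivalent prompts should yield consistent answers."""
    return ask_llm(question) != ask_llm(paraphrase(question))

print(looks_like_hallucination("Who wrote 'Dubliners'?"))   # True: the two answers disagree
```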
|
Umer, Qasim |
![]() Haris Ali Khan, Yanjie Jiang, Qasim Umer, Yuxia Zhang, Waseem Akram, and Hui Liu (Beijing Institute of Technology, China; Peking University, China; King Fahd University of Petroleum and Minerals, Saudi Arabia) It is often valuable to know whether a given piece of source code has or hasn’t been used to train a given deep learning model. On one side, it helps avoid data contamination problems that may exaggerate the performance of evaluated models. Conversely, it facilitates copyright protection by identifying private or protected code leveraged for model training without permission. To this end, automated approaches have been proposed for the detection, known as data contamination detection. Such approaches often heavily rely on the confidence of the involved models, assuming that the models should be more confident in handling contaminated data than cleaned data. However, such approaches do not consider the nature of the given data item, i.e., how difficult it is to predict the given item. Consequently, difficult-to-predict contaminated data and easy-to-predict cleaned data are often misclassified. As an initial attempt to solve this problem, this paper presents a naturalness-based approach, called Natural-DaCoDe, for code-completion models to distinguish contaminated source code from cleaned ones. Natural-DaCoDe leverages code naturalness to quantitatively measure the difficulty of a given source code for code-completion models. It then trains a classifier to distinguish contaminated source code according to both code naturalness and the performance of the code-completion models on the given source code. We evaluate Natural-DaCoDe with two pre-trained large language models (e.g., ChatGPT and Claude) and two code-completion models that we trained from scratch for detecting contaminated data. Our evaluation results suggest that Natural-DaCoDe substantially outperformed the state-of-the-art approaches in detecting contaminated data, improving the average accuracy by 61.78%. We also evaluate Natural-DaCoDe with the method name suggestion task, and it remains more accurate than the state-of-the-art approaches, improving the accuracy by 54.39%. Furthermore, Natural-DaCoDe was tested on a natural language text benchmark, significantly outperforming the state-of-the-art approaches by 22%. This may suggest that Natural-DaCoDe could be applied to various source-code-related tasks besides code completion. ![]() |
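The "code naturalness" signal used in this entry can be illustrated with a tiny bigram language model that assigns lower cross-entropy to more predictable token sequences; the token streams below are toy data, and the actual approach pairs such a difficulty signal with the code-completion model's own performance in a trained classifier.

```python
import math
from collections import Counter

def train_bigram(corpus_tokens):
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return unigrams, bigrams

def cross_entropy(tokens, unigrams, bigrams, vocab_size):
    """Average negative log2 probability per token under an add-one-smoothed bigram model.
    Lower values roughly mean more 'natural' (easier to predict) code."""
    h = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        h -= math.log2(p)
    return h / max(len(tokens) - 1, 1)

corpus = "for i in range ( n ) : total += i".split()
uni, bi = train_bigram(corpus)
V = len(set(corpus))
print(cross_entropy("for i in range ( n )".split(), uni, bi, V))   # relatively low
print(cross_entropy("n += ( for : i total".split(), uni, bi, V))   # relatively high
```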
|
Van Deursen, Arie |
![]() Ali Al-Kaswan, Sebastian Deatc, Begüm Koç, Arie van Deursen, and Maliheh Izadi (Delft University of Technology, Netherlands) ![]() |
|
Van Kempen, Nicolas |
![]() Kyla H. Levin, Nicolas van Kempen, Emery D. Berger, and Stephen N. Freund (University of Massachusetts at Amherst, USA; Amazon Web Services, USA; Williams College, USA) ![]() |
|
Varró, Dániel |
![]() Mootez Saad, José Antonio Hernández López, Boqi Chen, Dániel Varró, and Tushar Sharma (Dalhousie University, Canada; University of Murcia, Spain; McGill University, Canada; Linköping University, Sweden) Language models of code have demonstrated remarkable performance across various software engineering and source code analysis tasks. However, their demanding computational resource requirements and consequential environmental footprint remain significant challenges. This work introduces ALPINE, an adaptive programming language-agnostic pruning technique designed to substantially reduce the computational overhead of these models. The proposed method offers a pluggable layer that can be integrated with all Transformer-based models. With ALPINE, input sequences undergo adaptive compression throughout the pipeline, reaching a size up to 3× smaller than their initial size, resulting in significantly reduced computational load. Our experiments on two software engineering tasks, defect prediction and code clone detection, across three language models (CodeBERT, GraphCodeBERT, and UniXCoder) show that ALPINE achieves up to a 50% reduction in FLOPs, a 58.1% decrease in memory footprint, and a 28.1% improvement in throughput on average. This led to a reduction in CO2 emissions by up to 44.85%. Importantly, it achieves a reduction in computation resources while maintaining up to 98.1% of the original predictive performance. These findings highlight the potential of ALPINE in making language models of code more resource-efficient and accessible while preserving their performance, contributing to the overall sustainability of their adoption in software development. Also, it sheds light on redundant and noisy information in source code analysis corpora, as shown by the substantial sequence compression achieved by ALPINE. ![]() |
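The general pruning idea in this entry (progressively compress the input sequence between layers by keeping only the highest-scoring token positions) can be sketched as below. The scoring, keep ratio, and random data are placeholders, not ALPINE's actual pluggable layer.

```python
import numpy as np

def prune_tokens(hidden, scores, keep_ratio=0.7):
    """Keep the top-scoring fraction of token positions (original order preserved).
    hidden: (seq_len, dim) token representations; scores: (seq_len,) importance values."""
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])
    return hidden[keep], scores[keep]

rng = np.random.default_rng(0)
hidden, scores = rng.normal(size=(12, 8)), rng.random(12)
for layer in range(3):                     # pretend these are successive transformer layers
    hidden, scores = prune_tokens(hidden, scores, keep_ratio=0.7)
    print(f"after layer {layer}: {hidden.shape[0]} tokens kept")
# 12 -> 8 -> 5 -> 3 tokens across three pruning steps in this toy run.
```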
|
Vasilescu, Bogdan |
![]() Addi Malviya Thakur, Reed Milewicz, Mahmoud Jahanshahi, Lavínia Paganini, Bogdan Vasilescu, and Audris Mockus (University of Tennessee at Knoxville, USA; Oak Ridge National Laboratory, USA; Sandia National Laboratories, USA; Eindhoven University of Technology, Netherlands; Carnegie Mellon University, USA) ![]() ![]() Hao He, Bogdan Vasilescu, and Christian Kästner (Carnegie Mellon University, USA) Recent high-profile incidents in open-source software have greatly raised practitioner attention on software supply chain attacks. To guard against potential malicious package updates, security practitioners advocate pinning dependency to specific versions rather than floating in version ranges. However, it remains controversial whether pinning carries a meaningful security benefit that outweighs the cost of maintaining outdated and possibly vulnerable dependencies. In this paper, we quantify, through counterfactual analysis and simulations, the security and maintenance impact of version constraints in the npm ecosystem. By simulating dependency resolutions over historical time points, we find that pinning direct dependencies not only (as expected) increases the cost of maintaining vulnerable and outdated dependencies, but also (surprisingly) even increases the risk of exposure to malicious package updates in larger dependency graphs due to the specifics of npm’s dependency resolution mechanism. Finally, we explore collective pinning strategies to secure the ecosystem against supply chain attacks, suggesting specific changes to npm to enable such interventions. Our study provides guidance for practitioners and tool designers to manage their supply chains more securely. ![]() |
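The pinning-versus-floating trade-off quantified in the dependency entry above can be illustrated with a toy caret-range resolver; the registry contents and the simplified range semantics below are assumptions for illustration, not npm's full semver implementation.

```python
def satisfies_caret(version, base):
    """Simplified ^base semantics: same major version, and version >= base."""
    v, b = [tuple(map(int, x.split("."))) for x in (version, base)]
    return v[0] == b[0] and v >= b

def resolve(available, constraint):
    """Pick the highest available version satisfying the constraint."""
    if constraint.startswith("^"):
        ok = [v for v in available if satisfies_caret(v, constraint[1:])]
    else:                                   # pinned: exact match only
        ok = [v for v in available if v == constraint]
    return max(ok, key=lambda v: tuple(map(int, v.split("."))), default=None)

registry = ["1.2.0", "1.3.0", "1.3.1"]      # 1.3.1 is a hypothetical malicious release
print(resolve(registry, "1.2.0"))           # pinned   -> '1.2.0' (misses fixes, avoids 1.3.1)
print(resolve(registry, "^1.2.0"))          # floating -> '1.3.1' (gets fixes and the bad release)
```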
|
Virk, Yuvraj Singh |
![]() Yuvraj Singh Virk, Premkumar Devanbu, and Toufique Ahmed (University of California at Davis, USA; IBM Research, USA) ![]() |
|
Visaggio, Corrado Aaron |
![]() Gregorio Dalia, Andrea Di Sorbo, Corrado Aaron Visaggio, and Gerardo Canfora (University of Sannio, Italy) ![]() |
|
Wan, Jun |
![]() Yun Peng, Jun Wan, Yichen Li, and Xiaoxue Ren (Chinese University of Hong Kong, China; Zhejiang University, China) Code generation has largely improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions given detailed descriptions in natural language. Many research efforts are being devoted to improving the correctness of LLM-generated code, and many benchmarks are proposed to evaluate the correctness comprehensively. Despite the focus on correctness, the time efficiency of LLM-generated code solutions is under-explored. Current correctness benchmarks are not suitable for time efficiency evaluation since their test cases cannot effectively distinguish the time efficiency of different code solutions. Besides, the current execution time measurement is not stable and comprehensive, threatening the validity of the time efficiency evaluation. To address the challenges in the time efficiency evaluation of code generation, we propose COFFE, a code generation benchmark for evaluating the time efficiency of LLM-generated code solutions. COFFE contains 398 and 358 problems for function-level and file-level code generation, respectively. To improve the distinguishability, we design a novel stressful test case generation approach with contracts and two new formats of test cases to increase the accuracy of generation. For the time evaluation metric, we propose efficient@k based on CPU instruction count to ensure a stable and solid comparison between different solutions. We evaluate 14 popular LLMs on COFFE and identify four findings. Based on the findings, we draw some implications for LLM researchers and software practitioners to facilitate future research and usage of LLMs in code generation. ![]() |
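The motivation for an instruction-count-based metric (stable comparisons that do not depend on wall-clock noise) can be illustrated with a rough analogue: counting executed Python bytecode instructions. This is only an illustration of the counting idea; COFFE's efficient@k is defined over CPU instruction counts of the generated solutions, and the helper below is not its implementation.

```python
import sys

def count_instructions(fn, *args):
    """Count executed Python bytecode instructions as a wall-clock-independent cost proxy."""
    counter = 0
    def tracer(frame, event, arg):
        nonlocal counter
        frame.f_trace_opcodes = True       # request per-opcode trace events (Python 3.7+)
        if event == "opcode":
            counter += 1
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return counter

quadratic = lambda xs: [x for x in xs for _ in xs]   # O(n^2) work
linear = lambda xs: list(xs)                         # O(n) work
print(count_instructions(quadratic, range(50)) > count_instructions(linear, range(50)))  # True
```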
|
Wan, Yuxuan |
![]() Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael Lyu (Chinese University of Hong Kong, China; Singapore Management University, Singapore) ![]() |
|
Wan, Zhiyuan |
![]() Keyu Liang, Zhongxin Liu, Chao Liu, Zhiyuan Wan, David Lo, and Xiaohu Yang (Zhejiang University, China; Chongqing University, China; Singapore Management University, Singapore) Code search is a crucial task in software engineering, aiming to retrieve code snippets that are semantically relevant to a natural language query. Recently, Pre-trained Language Models (PLMs) have shown remarkable success and are widely adopted for code search tasks. However, PLM-based methods often struggle in cross-domain scenarios. When applied to a new domain, they typically require extensive fine-tuning with substantial data. Even worse, the data scarcity problem in new domains often forces these methods to operate in a zero-shot setting, resulting in a significant decline in performance. RAPID, which generates synthetic data for model fine-tuning, is currently the only effective method for zero-shot cross-domain code search. Despite its effectiveness, RAPID demands substantial computational resources for fine-tuning and needs to maintain specialized models for each domain, underscoring the need for a zero-shot, fine-tuning-free approach for cross-domain code search. The key to tackling zero-shot cross-domain code search lies in bridging the gaps among domains. In this work, we propose to break the query-code matching process of code search into two simpler tasks: query-comment matching and code-code matching. We first conduct an empirical study to investigate the effectiveness of these two matching schemas in zero-shot cross-domain code search. Our findings highlight the strong complementarity among the three matching schemas, i.e., query-code, query-comment, and code-code matching. Based on the findings, we propose CodeBridge, a zero-shot, fine-tuning-free approach for cross-domain code search. Specifically, CodeBridge first employs zero-shot prompting to guide Large Language Models (LLMs) to generate a comment for each code snippet in the codebase and produce a code for each query. Subsequently, it encodes queries, code snippets, comments, and the generated code using PLMs and assesses similarities through three matching schemas: query-code, query-comment, and generated code-code. Lastly, CodeBridge leverages a sampling-based fusion approach that combines these three similarity scores to rank the final search outcomes. Experimental results show that our approach outperforms the state-of-the-art PLM-based code search approaches, i.e., CoCoSoDa and UniXcoder, by an average of 21.4% and 24.9% in MRR, respectively, across three datasets. Our approach also yields results that are better than or comparable to those of the zero-shot cross-domain code search approach RAPID, which requires costly fine-tuning. ![]() ![]() ![]() Zuobin Wang, Zhiyuan Wan, Yujing Chen, Yun Zhang, David Lo, Difan Xie, and Xiaohu Yang (Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang), China; Hangzhou City University, China; Singapore Management University, Singapore) In smart contract development, practitioners frequently reuse code to reduce development effort and avoid reinventing the wheel. This reused code, whether identical or similar to its original source, is referred to as a code clone. Unintentional code cloning can propagate flaws and vulnerabilities, potentially undermining the reliability and maintainability of software systems. Previous studies have identified a significant prevalence of code clones in Solidity smart contracts on the Ethereum blockchain. 
To mitigate the risks posed by code clones, clone detection has emerged as an active field of research and practice in software engineering. Recent studies have extended existing techniques or proposed novel techniques tailored to the unique syntactic and semantic features of Solidity. Nonetheless, the evaluations of existing techniques, whether conducted by their original authors or independent researchers, involve codebases in various programming languages and utilize different versions of the corresponding tools. The resulting inconsistency makes direct comparisons of the evaluation results impractical, and hinders the ability to derive meaningful conclusions across the evaluations. There remains a lack of clarity regarding the effectiveness of these techniques in detecting smart contract clones, and whether it is feasible to combine different techniques to achieve scalable yet accurate detection of code clones in smart contracts. To address this gap, we conduct a comprehensive empirical study that evaluates the effectiveness and scalability of five representative clone detection techniques on 33,073 verified Solidity smart contracts, along with a benchmark we curate, in which we manually label 72,010 pairs of Solidity smart contracts with clone tags. Moreover, we explore the potential of combining different techniques to achieve optimal performance of code clone detection for smart contracts, and propose SourceREClone, a framework designed for the refined integration of different techniques, which achieves a 36.9% improvement in F1 score compared to a straightforward combination of the state of the art. Based on our findings, we discuss implications, provide recommendations for practitioners, and outline directions for future research. ![]() |
|
Wang, Chaozheng |
![]() Chaozheng Wang, Jia Feng, Shuzheng Gao, Cuiyun Gao, Zongjie Li, Ting Peng, Hailiang Huang, Yuetang Deng, and Michael Lyu (Chinese University of Hong Kong, China; University of Electronic Science and Technology of China, China; Hong Kong University of Science and Technology, China; Tencent, China) Large Code Models (LCMs) have demonstrated remarkable effectiveness across various code intelligence tasks. Supervised fine-tuning is essential to optimize their performance for specific downstream tasks. Compared with the traditional full-parameter fine-tuning (FFT) method, Parameter-Efficient Fine-Tuning (PEFT) methods can train LCMs with substantially reduced resource consumption and have gained widespread attention among researchers and practitioners. While existing studies have explored PEFT methods for code intelligence tasks, they have predominantly focused on a limited subset of scenarios, such as code generation with publicly available datasets, leading to constrained generalizability of the findings. To mitigate the limitation, we conduct a comprehensive study on exploring the effectiveness of the PEFT methods, which involves five code intelligence tasks containing both public and private data. Our extensive experiments reveal a considerable performance gap between PEFT methods and FFT, which is contrary to the findings of existing studies. We also find that this disparity is particularly pronounced in tasks involving private data. To improve the tuning performance for LCMs while reducing resource utilization during training, we propose a Layer-Wise Optimization (LWO) strategy in the paper. LWO incrementally updates the parameters of each layer in the whole model architecture, without introducing any additional component and inference overhead. Experiments across five LCMs and five code intelligence tasks demonstrate that LWO trains LCMs more effectively and efficiently compared to previous PEFT methods, with significant improvements in tasks using private data. For instance, in the line-level code completion task using our private code repositories, LWO outperforms the state-of-the-art LoRA method by 22% and 12% in terms of accuracy and BLEU scores, respectively. Furthermore, LWO can enable more efficient LCM tuning, reducing the training time by an average of 42.7% compared to LoRA. ![]() ![]() Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael Lyu (Chinese University of Hong Kong, China; Singapore Management University, Singapore) ![]() |
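The layer-wise tuning idea described in the entry above (update the parameters of one layer per step instead of all parameters at once) can be sketched with a toy PyTorch model; the architecture, data, and schedule below are assumptions for illustration, not LWO's actual strategy.

```python
import torch
from torch import nn

# Toy stand-in for a large code model: just a stack of layers (an assumption for illustration).
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))
layers = [m for m in model if list(m.parameters(recurse=False))]   # layers that own parameters
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))

for step in range(6):
    active = layers[step % len(layers)]              # layer-wise: one layer trained per step
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad_(layer is active)
    opt.zero_grad(set_to_none=True)
    loss = loss_fn(model(x), y)
    loss.backward()                                  # gradients exist only for the active layer
    opt.step()                                       # SGD skips parameters without gradients
    print(f"step {step}: trained layer {layers.index(active)}, loss {loss.item():.3f}")
```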
|
Wang, Chengpeng |
![]() Yuan Li, Peisen Yao, Kan Yu, Chengpeng Wang, Yaoyang Ye, Song Li, Meng Luo, Yepang Liu, and Kui Ren (Zhejiang University, China; Ant Group, China; Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) ![]() |
|
Wang, Chenxu |
![]() Chenxu Wang, Tianming Liu, Yanjie Zhao, Minghui Yang, and Haoyu Wang (Huazhong University of Science and Technology, China; Monash University, Australia; OPPO, China) With the rapid development of Large Language Models (LLMs), their integration into automated mobile GUI testing has emerged as a promising research direction. However, existing LLM-based testing approaches face significant challenges, including time inefficiency and high costs due to constant LLM querying. To address these issues, this paper introduces LLMDroid, a novel testing framework designed to enhance existing automated mobile GUI testing tools by leveraging LLMs more efficiently. The workflow of LLMDroid comprises two main stages: Autonomous Exploration and LLM Guidance. During Autonomous Exploration, LLMDroid utilizes existing testing tools while leveraging LLMs to summarize explored pages. When code coverage growth slows, it transitions to LLM Guidance to strategically direct testing towards unexplored functionalities. This approach minimizes LLM interactions while maximizing their impact on test coverage. We applied LLMDroid to three popular open-source Android testing tools and evaluated it on 14 top-listed apps from Google Play. Results demonstrate an average increase of 26.16% in code coverage and 29.31% in activity coverage. Furthermore, our evaluation under different LLMs reveals that LLMDroid outperforms existing step-wise approaches with significant cost efficiency, achieving optimal performance at $0.49 per hour using GPT-4o among tested models, with a cost-effective alternative achieving 94% of this performance at just $0.03 per hour. These findings highlight LLMDroid’s effectiveness in enhancing automated mobile app testing and its potential for widespread adoption. ![]() |
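The stage-switching behavior in this entry (explore autonomously, and query the LLM only when coverage growth stalls) can be sketched as below. The `run_step` and `ask_llm` callbacks, the plateau threshold, and the toy coverage curve are assumptions for illustration, not LLMDroid's implementation.

```python
def testing_loop(run_step, ask_llm, total_steps=100, window=10, min_growth=0.5):
    """Alternate autonomous exploration with occasional LLM guidance.
    run_step performs one exploration step with an existing testing tool and returns
    code coverage (%); ask_llm is only called when growth over `window` steps stalls."""
    history = []
    for _ in range(total_steps):
        history.append(run_step())
        if len(history) > window and history[-1] - history[-window - 1] < min_growth:
            ask_llm()           # strategic, infrequent LLM query toward unexplored features
            history.clear()     # measure growth afresh after redirection
    return history[-1] if history else None

# Toy usage: coverage grows quickly, then plateaus, triggering guidance.
state = {"t": 0, "cov": 0.0}
def fake_step():
    state["t"] += 1
    state["cov"] += 0.6 if state["t"] <= 20 else 0.01
    return state["cov"]

print(testing_loop(fake_step, lambda: print("LLM guidance requested")))
```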
|
Wang, Chong |
![]() Yanqi Su, Zhenchang Xing, Chong Wang, Chunyang Chen, Sherry (Xiwei) Xu, Qinghua Lu, and Liming Zhu (Australian National University, Australia; CSIRO's Data61, Australia; Nanyang Technological University, Singapore; TU Munich, Germany; UNSW, Australia) Exploratory testing (ET) harnesses a tester's knowledge, creativity, and experience to create varying tests that uncover unexpected bugs from the end-user's perspective. Although ET has proven effective in system-level testing of interactive systems, the need for manual execution has hindered large-scale adoption. In this work, we explore the feasibility, challenges, and road ahead of automated scenario-based ET (a.k.a. soap opera testing). We conduct a formative study, identifying key insights for effective manual soap opera testing and challenges in automating the process. We then develop a multi-agent system leveraging LLMs and a Scenario Knowledge Graph (SKG) to automate soap opera testing. The system consists of three multi-modal agents, Planner, Player, and Detector, that collaborate to execute tests and identify potential bugs. Experimental results demonstrate the potential of automated soap opera testing, but there remains a significant gap compared to manual execution, especially regarding under-explored scenario boundaries and incorrectly identified bugs. Based on these observations, we envision the road ahead for automated soap opera testing, focusing on three key aspects: the synergy of neural and symbolic approaches, human-AI co-learning, and the integration of soap opera testing with broader software engineering practices. These insights aim to guide and inspire future research. ![]() |
|
Wang, Haibo |
![]() Haibo Wang, Zezhong Xing, Chengnian Sun, Zheng Wang, and Shin Hwei Tan (Concordia University, Canada; Southern University of Science and Technology, China; University of Waterloo, Canada; University of Leeds, UK) By reducing the number of lines of code, program simplification reduces code complexity, improving software maintainability and code comprehension. While several existing techniques can be used for automatic program simplification, there is no consensus on the effectiveness of these approaches. We present the first study on how real-world developers simplify programs in open-source software projects. By analyzing 382 pull requests from 296 projects, we summarize the types of program transformations used, the motivations behind simplifications, and the set of program transformations that have not been covered by existing refactoring types. As a result of our study, we submitted eight bug reports to a widely used refactoring detection tool, RefactoringMiner, where seven were fixed. Our study also identifies gaps in applying existing approaches for automating program simplification and outlines the criteria for designing automatic program simplification techniques. In light of these observations, we propose SimpT5, a tool to automatically produce simplified programs that are semantically equivalent programs with reduced lines of code. SimpT5 is trained on our collected dataset of 92,485 simplified programs with two heuristics: (1) modified line localization that encodes lines changed in simplified programs, and (2) checkers that measure the quality of generated programs. Experimental results show that SimpT5 outperforms prior approaches in automating developer-induced program simplification. ![]() ![]() |
|
Wang, Haijun |
![]() Ziqiao Kong, Cen Zhang, Maoyi Xie, Ming Hu, Yue Xue, Ye Liu, Haijun Wang, and Yang Liu (Nanyang Technological University, Singapore; Singapore Management University, Singapore; MetaTrust Labs, Singapore; Xi'an Jiaotong University, China) Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: 1. a lack of profit-centric techniques for expediting detection, and 2. insufficient automation in maximizing the profitability of discovered vulnerabilities, leaving the analysis to human experts. To address these gaps, we have developed VERITE, a profit-centric smart contract fuzzing framework that not only effectively detects those profitable vulnerabilities but also maximizes the exploited profits. VERITE has three key features: 1. DeFi action-based mutators for boosting the exploration of transactions with different fund flows; 2. identification criteria for potentially profitable candidates, which check whether the input has caused abnormal fund flow properties during testing; 3. a gradient descent-based profit maximization strategy for these identified candidates. VERITE is fully developed from scratch and evaluated on a dataset consisting of 61 exploited real-world DeFi projects with an average loss of over 1.1 million dollars. The results show that VERITE can automatically extract more than 18 million dollars in total and is significantly better than the state-of-the-art fuzzer ItyFuzz in both detection (29/10) and exploitation (134 times more profits gained on average). Remarkably, in 12 targets, it gains more profits than real-world attacking exploits (1.01 to 11.45 times more). VERITE has also been applied by auditors in contract auditing, where 6 zero-day vulnerabilities (5 of high severity) were found, earning over $2,500 in bounty rewards. ![]() ![]() ![]() |
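The profit maximization step in this entry can be illustrated with a simple numeric gradient ascent over a single exploit parameter; the `profit` callback and the toy profit curve are hypothetical stand-ins, not VERITE's on-chain replay machinery.

```python
def maximize_profit(profit, x0, lr=0.1, eps=1e-3, iters=200):
    """Numeric gradient ascent over one exploit parameter (e.g., a swap amount).
    `profit` is a hypothetical stand-in for replaying the candidate exploit in a
    forked chain state and measuring the attacker's balance change."""
    x = x0
    for _ in range(iters):
        grad = (profit(x + eps) - profit(x - eps)) / (2 * eps)
        x += lr * grad
    return x, profit(x)

# Toy concave profit curve: tiny amounts earn little, huge amounts move the price too much.
toy_profit = lambda amount: -0.5 * (amount - 7.0) ** 2 + 24.5
best_amount, best_profit = maximize_profit(toy_profit, x0=1.0)
print(round(best_amount, 3), round(best_profit, 3))   # converges near amount = 7
```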
|
Wang, Haoyu |
![]() Xiaohu Du, Ming Wen, Haoyu Wang, Zichao Wei, and Hai Jin (Huazhong University of Science and Technology, China) ![]() ![]() Ting Zhou, Yanjie Zhao, Xinyi Hou, Xiaoyu Sun, Kai Chen, and Haoyu Wang (Huazhong University of Science and Technology, China; Australian National University, Australia) Declarative UI frameworks have gained widespread adoption in mobile app development, offering benefits such as improved code readability and easier maintenance. Despite these advantages, the process of translating UI designs into functional code remains challenging and time-consuming. Recent advancements in multimodal large language models (MLLMs) have shown promise in directly generating mobile app code from user interface (UI) designs. However, the direct application of MLLMs to this task is limited by challenges in accurately recognizing UI components and comprehensively capturing interaction logic. To address these challenges, we propose DeclarUI, an automated approach that synergizes computer vision (CV), MLLMs, and iterative compiler-driven optimization to generate and refine declarative UI code from designs. DeclarUI enhances visual fidelity, functional completeness, and code quality through precise component segmentation, Page Transition Graphs (PTGs) for modeling complex inter-page relationships, and iterative optimization. In our evaluation, DeclarUI outperforms baselines on React Native, a widely adopted declarative UI framework, achieving a 96.8% PTG coverage rate and a 98% compilation success rate. Notably, DeclarUI demonstrates significant improvements over state-of-the-art MLLMs, with a 123% increase in PTG coverage rate, up to 55% enhancement in visual similarity scores, and a 29% boost in compilation success rate. We further demonstrate DeclarUI’s generalizability through successful applications to Flutter and ArkUI frameworks. User studies with professional developers confirm that DeclarUI’s generated code meets industrial-grade standards in code availability, modification time, readability, and maintainability. By streamlining app development, improving efficiency, and fostering designer-developer collaboration, DeclarUI offers a practical solution to the persistent challenges in mobile UI development. ![]() ![]() Chenxu Wang, Tianming Liu, Yanjie Zhao, Minghui Yang, and Haoyu Wang (Huazhong University of Science and Technology, China; Monash University, Australia; OPPO, China) With the rapid development of Large Language Models (LLMs), their integration into automated mobile GUI testing has emerged as a promising research direction. However, existing LLM-based testing approaches face significant challenges, including time inefficiency and high costs due to constant LLM querying. To address these issues, this paper introduces LLMDroid, a novel testing framework designed to enhance existing automated mobile GUI testing tools by leveraging LLMs more efficiently. The workflow of LLMDroid comprises two main stages: Autonomous Exploration and LLM Guidance. During Autonomous Exploration, LLMDroid utilizes existing testing tools while leveraging LLMs to summarize explored pages. When code coverage growth slows, it transitions to LLM Guidance to strategically direct testing towards unexplored functionalities. This approach minimizes LLM interactions while maximizing their impact on test coverage. We applied LLMDroid to three popular open-source Android testing tools and evaluated it on 14 top-listed apps from Google Play. 
Results demonstrate an average increase of 26.16% in code coverage and 29.31% in activity coverage. Furthermore, our evaluation under different LLMs reveals that LLMDroid outperforms existing step-wise approaches with significant cost efficiency, achieving the best performance among the tested models at $0.49 per hour using GPT-4o, with a cost-effective alternative achieving 94% of this performance at just $0.03 per hour. These findings highlight LLMDroid’s effectiveness in enhancing automated mobile app testing and its potential for widespread adoption. ![]() ![]() Haodong Li, Xiao Cheng, Guohan Zhang, Guosheng Xu, Guoai Xu, and Haoyu Wang (Beijing University of Posts and Telecommunications, China; UNSW, Australia; Harbin Institute of Technology, Shenzhen, China; Huazhong University of Science and Technology, China) Learning-based Android malware detection has earned significant recognition across industry and academia, yet its effectiveness hinges on the accuracy of labeled training data. Manual labeling, being prohibitively expensive, has prompted the use of automated methods, such as leveraging anti-virus engines like VirusTotal, which unfortunately introduces mislabeling, aka "label noise". The state-of-the-art label noise reduction approach, MalWhiteout, can effectively reduce random label noise but underperforms in mitigating real-world emergent malware (EM) label noise stemming from newly emerging Android malware variants overlooked by VirusTotal. To tackle this, we conceptualize EM label noise detection as an anomaly detection problem and introduce a novel tool, MalCleanse, that surpasses MalWhiteout's ability to address EM label noise. MalCleanse combines uncertainty estimation with unsupervised anomaly detection, identifying samples with high uncertainty as mislabeled, thereby enhancing its capability to remove EM label noise. Our experimental results demonstrate a significant reduction in EM label noise by approximately 25.25%, achieving an F1 Score of 80.32% for label noise detection at a noise ratio of 40.61%. Notably, MalCleanse outperforms MalWhiteout with an increase of 40.9% in overall F1 score for mitigating EM label noise. This paper pioneers the integration of deep neural network model uncertainty to refine label accuracy, thereby enhancing the reliability of malware detection systems. Our approach represents a significant step forward in addressing the challenges posed by emergent malware in automated labeling systems. ![]() |
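The LLMDroid workflow summarized earlier in this entry alternates between autonomous exploration and LLM guidance once coverage growth slows. The sketch below shows only that control loop; `explore_step`, `coverage`, and `ask_llm_for_target` are hypothetical placeholders for the base testing tool, a coverage probe, and the LLM query, and the thresholds are arbitrary.

```python
# Minimal sketch of the two-stage loop (Autonomous Exploration, then
# LLM Guidance once coverage plateaus); not LLMDroid's real interfaces.

def run_testing(explore_step, coverage, ask_llm_for_target,
                budget=200, window=10, min_growth=0.005):
    history = []
    for _ in range(budget):
        explore_step(None)             # let the base testing tool drive the app
        history.append(coverage())     # cumulative code coverage so far
        if len(history) > window:
            growth = history[-1] - history[-1 - window]
            if growth < min_growth:    # plateau detected: ask the LLM where to go
                explore_step(ask_llm_for_target(history[-1]))
    return history[-1] if history else 0.0

if __name__ == "__main__":
    # Toy stand-ins: coverage creeps up slowly unless a "target" is explored.
    state = {"cov": 0.0}
    def explore_step(target):
        state["cov"] += 0.02 if target else 0.0003
    print(run_testing(explore_step, lambda: state["cov"],
                      lambda cov: "unexplored-activity", budget=100))
```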
|
Wang, Huaimin |
![]() Yiwen Wu, Yang Zhang, Tao Wang, Bo Ding, and Huaimin Wang (National University of Defense Technology, China; State Key Laboratory of Complex and Critical Software Environment, China) Docker building is a critical component of containerization in modern software development, automating the process of packaging and converting sources into container images. It is not uncommon to find that Docker build faults (DBFs) occur frequently across Docker-based projects, inducing non-negligible costs in development activities. DBF resolution is a challenging problem, and previous studies have demonstrated that developers spend non-trivial time resolving encountered build faults. However, the characteristics of DBFs are still under-investigated, hindering practical solutions to build management in the Docker community. In this paper, to bridge this gap, we present a comprehensive study to understand real-world DBFs in practice. We collect and analyze a DBF dataset of 255 issues and 219 posts from GitHub, Stack Overflow, and Docker Forum. We investigate and construct characteristic taxonomies for the DBFs, including 15 symptoms, 23 root causes, and 35 fix patterns. Moreover, we study the fault distributions of symptoms and root causes in terms of the different build types, i.e., Dockerfile builds and Docker-compose builds. Based on the results, we provide actionable implications and develop a knowledge-based application, which can potentially facilitate research and assist developers in improving Docker build management. ![]() |
|
Wang, Ji |
![]() Xu Yang, Zhenbang Chen, Wei Dong, and Ji Wang (National University of Defense Technology, China) Floating-point constraint solving is challenging due to the complex representation and non-linear computations. Search-based constraint solving provides an effective method for solving floating-point constraints. In this paper, we propose QSF to improve the efficiency of search-based solving for floating-point constraints. The key idea of QSF is to model the floating-point constraint solving problem as a multi-objective optimization problem. Specifically, QSF considers both the number of unsatisfied constraints and the sum of the violation degrees of unsatisfied constraints as the objectives for search-based optimization. Besides, we propose a new evolutionary algorithm in which the mutation operators are specially designed for floating-point numbers, aiming to solve the multi-objective problem more efficiently. We have implemented QSF and conducted extensive experiments on both the SMT-COMP benchmark and the benchmark from real-world floating-point programs. The results demonstrate that compared to SOTA floating-point solvers, QSF achieved an average speedup of 15.72X under a 60-second timeout and an impressive 87.48X under a 600-second timeout on the first benchmark. Similarly, on the second benchmark, QSF delivered an average speedup of 22.44X and 29.23X, respectively, under the two timeout configurations. Furthermore, QSF has also enhanced the performance of symbolic execution for floating-point programs. ![]() |
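As a rough illustration of the two search objectives named above (the number of unsatisfied constraints and the sum of their violation degrees), the sketch below evaluates both for a toy constraint set. The violation-degree measure is a common choice in search-based solving and is assumed here; it is not necessarily QSF's exact definition.

```python
import math

# Evaluate the two objectives of the multi-objective formulation for a
# candidate floating-point assignment over a toy constraint set.

def violation(kind, lhs, rhs):
    """How far a single constraint is from being satisfied (0.0 if it holds)."""
    if kind == "<=":
        return max(0.0, lhs - rhs)
    if kind == "==":
        return abs(lhs - rhs)
    raise ValueError(f"unsupported constraint kind: {kind}")

def objectives(assignment, constraints):
    """Return (number of unsatisfied constraints, total violation degree),
    the two objectives minimized together by the search."""
    degrees = [violation(kind, lhs(assignment), rhs(assignment))
               for kind, lhs, rhs in constraints]
    return sum(1 for d in degrees if d > 0.0), sum(degrees)

if __name__ == "__main__":
    # Two constraints over one float variable x:  x*x <= 2.0  and  sin(x) == 0.5
    cs = [("<=", lambda a: a["x"] * a["x"], lambda a: 2.0),
          ("==", lambda a: math.sin(a["x"]), lambda a: 0.5)]
    print(objectives({"x": 1.0}, cs))   # -> (1, 0.3414709848078965)
```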
|
Wang, Jian |
![]() Shuaiyu Xie, Jian Wang, Maodong Li, Peiran Chen, Jifeng Xuan, and Bing Li (Wuhan University, China; Zhongguancun Laboratory, China) Distributed tracing is a pivotal technique for software operators to understand and diagnose issues within microservice-based systems, offering a comprehensive view of user requests propagated through various services. However, the unprecedented volume of traces imposes expensive storage and analytical burdens on online systems. Conventional tracing approaches typically rely on random sampling with a fixed probability for each trace, which risks missing valuable traces. Several tail-based sampling methods have thus been proposed to sample traces based on their content. Nevertheless, these methods primarily evaluate traces on an individual basis, neglecting the collective attributes of the sample set in terms of comprehensiveness, balance, and consistency. To address these issues, we propose TracePicker, an optimization-based online sampler designed to enhance the quality of sampled data while mitigating storage burden. TracePicker employs a streaming anomaly detector to capture and retain anomalous traces that are crucial for troubleshooting. For normal traces, the sampling process is segmented into quota allocation and group sampling, both formulated as integer programming problems. By solving these problems using dynamic programming and evolutionary algorithms, TracePicker selects a high-quality subset of data, minimizing overall information loss. Experimental results demonstrate that TracePicker outperforms existing tail-based sampling methods in terms of both sampling quality and time consumption. ![]() ![]() Anji Li, Neng Zhang, Ying Zou, Zhixiang Chen, Jian Wang, and Zibin Zheng (Sun Yat-sen University, China; Central China Normal University, China; Queen's University, Canada; Wuhan University, China) Code snippets are widely used in technical forums to demonstrate solutions to programming problems. They can be leveraged by developers to accelerate problem-solving. However, code snippets often lack concrete types of the APIs used in them, which impedes their understanding and reuse. To enhance the description of a code snippet, a number of approaches have been proposed to infer the types of APIs. Although existing approaches can achieve good performance, their performance is limited because they ignore information outside the input code snippet (e.g., the descriptions of similar code snippets) that could potentially improve it. In this paper, we propose a novel type inference approach, named CKTyper, by leveraging crowdsourcing knowledge in technical posts. The key idea is to generate a relevant context for a target code snippet from the posts containing similar code snippets and then employ the context to improve the type inference with large language models (e.g., ChatGPT). More specifically, we build a crowdsourcing knowledge base (CKB) by extracting code snippets from a large set of posts and index the CKB using Lucene. An API type dictionary is also built from a set of API libraries. Given a code snippet to be inferred, we first retrieve a list of similar code snippets from the indexed CKB. Then, we generate a crowdsourcing knowledge context (CKC) by extracting and summarizing useful content (e.g., API-related sentences) in the posts that contain the similar code snippets. The CKC is subsequently used to improve the type inference of ChatGPT on the input code snippet. 
The hallucination of ChatGPT is eliminated by employing the API type dictionary. Evaluation results on two open-source datasets demonstrate the effectiveness and efficiency of CKTyper. CKTyper achieves the optimal precision/recall of 97.80% and 95.54% on both datasets, respectively, significantly outperforming three state-of-the-art baselines and ChatGPT. ![]() |
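A minimal sketch of the context-assembly step described for CKTyper, assuming a toy token-overlap similarity in place of the Lucene index; the snippet corpus, API-related sentences, and prompt wording are invented for illustration.

```python
# Pick the most similar archived snippets and fold their API-related
# sentences into an LLM prompt; everything below is illustrative.

def similarity(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(1, len(ta | tb))     # Jaccard over tokens

def build_prompt(query_snippet, corpus, top_k=2):
    """corpus: list of (snippet, api_related_sentence) pairs."""
    ranked = sorted(corpus, key=lambda e: similarity(query_snippet, e[0]),
                    reverse=True)[:top_k]
    context = "\n".join(f"- {sentence}" for _, sentence in ranked)
    return (
        "Infer the fully qualified types of the APIs used in this snippet.\n"
        f"Snippet:\n{query_snippet}\n"
        f"Hints from similar crowdsourced snippets:\n{context}\n"
    )

if __name__ == "__main__":
    corpus = [
        ("reader = new BufferedReader(new FileReader(path));",
         "BufferedReader and FileReader come from java.io."),
        ("list.sort(Comparator.comparing(User::getName));",
         "Comparator is java.util.Comparator."),
    ]
    print(build_prompt("BufferedReader in = new BufferedReader(r);", corpus, top_k=1))
```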
|
Wang, Jiyuan |
![]() Jiyuan Wang, Yuxin Qiu, Ben Limpanukorn, Hong Jin Kang, Qian Zhang, and Miryung Kim (University of California at Los Angeles, USA; University of California at Riverside, USA) In recent years, the MLIR framework has had explosive growth due to the need for extensible deep learning compilers for hardware accelerators. Such examples include Triton, CIRCT, and ONNX-MLIR. MLIR compilers introduce significant complexities in localizing bugs or inefficiencies because of their layered optimization and transformation process with compilation passes. While existing delta debugging techniques can be used to identify a minimum subset of IR code that reproduces a given bug symptom, their naive application to MLIR is time-consuming because real-world MLIR compilers usually involve a large number of compilation passes. Compiler developers must identify a minimized set of relevant compilation passes to reduce the footprint of MLIR compiler code to be inspected for a bug fix. We propose DuoReduce, a dual-dimensional reduction approach for MLIR bug localization. DuoReduce leverages three key ideas in tandem to design an efficient MLIR delta debugger. First, DuoReduce reduces compiler passes that are irrelevant to the bug by identifying ordering dependencies among the different compilation passes. Second, DuoReduce uses MLIR-semantics-aware transformations to expedite IR code reduction. Finally, DuoReduce leverages cross-dependence between the IR code dimension and the compilation pass dimension by accounting for which IR code segments are related to which compilation passes to reduce unused passes. Experiments with three large-scale MLIR compiler projects find that DuoReduce outperforms syntax-aware reducers such as Perses and Vulcan in terms of IR code reduction by 31.6% and 21.5% respectively. If one uses these reducers by enumerating all possible compilation passes (on average 18 passes), it could take up to 145 hours. By identifying ordering dependencies among compilation passes, DuoReduce reduces this time to 9.5 minutes. By identifying which compilation passes are unused for compiling reduced IR code, DuoReduce reduces the number of passes by 14.6%. This translates to not needing to examine 281 lines of MLIR compiler code on average to fix the bugs. DuoReduce has the potential to significantly reduce debugging effort in MLIR compilers, which serves as the foundation for the current landscape of machine learning and hardware accelerators. ![]() |
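The pass-dimension side of the reduction described above can be pictured as a simple fixed-point loop that drops any compilation pass whose removal still reproduces the bug. The `reproduces` oracle below (re-running the pipeline and checking the symptom) is a hypothetical stand-in, and the sketch omits DuoReduce's ordering-dependency analysis and the IR-code dimension.

```python
# Greedily drop passes whose removal still reproduces the bug symptom.

def reduce_passes(passes, reproduces):
    reduced = list(passes)
    changed = True
    while changed:                      # fixed point: no single pass removable
        changed = False
        for i in range(len(reduced)):
            trial = reduced[:i] + reduced[i + 1:]
            if reproduces(trial):       # bug still triggers without this pass
                reduced = trial
                changed = True
                break
    return reduced

if __name__ == "__main__":
    # Toy oracle: the bug needs 'canonicalize' followed (eventually) by 'cse'.
    def oracle(ps):
        return ("canonicalize" in ps and "cse" in ps
                and ps.index("canonicalize") < ps.index("cse"))
    print(reduce_passes(["inline", "canonicalize", "loop-fusion", "cse"], oracle))
    # -> ['canonicalize', 'cse']
```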
|
Wang, Junjie |
![]() Mengzhuo Chen, Zhe Liu, Chunyang Chen, Junjie Wang, Boyu Wu, Jun Hu, and Qing Wang (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TU Munich, Germany) In software development, similar apps often encounter similar bugs due to shared functionalities and implementation methods. However, current automated GUI testing methods mainly focus on generating test scripts to cover more pages by analyzing the internal structure of the app, without targeted exploration of paths that may trigger bugs, resulting in low efficiency in bug discovery. Considering that a large number of bug reports on open source platforms can provide external knowledge for testing, this paper proposes BugHunter, a novel bug-aware automated GUI testing approach that generates exploration paths guided by bug reports from similar apps, utilizing a combination of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Instead of focusing solely on coverage, BugHunter dynamically adapts the testing process to target bug paths, thereby increasing bug detection efficiency. BugHunter first builds a high-quality bug knowledge base from historical bug reports. Then it retrieves relevant reports from this large bug knowledge base using a two-stage retrieval process, and generates test paths based on similar apps’ bug reports. BugHunter also introduces a local and global path-planning mechanism to handle differences in functionality and UI design across apps, and the ambiguous behavior or missing steps in the online bug reports. We evaluate BugHunter on 121 bugs across 71 apps and compare its performance against 16 state-of-the-art baselines. BugHunter achieves 60% improvement in bug detection over the best baseline, with comparable or higher coverage against the baselines. Furthermore, BugHunter successfully detects 49 new crash bugs in real-world apps from Google Play, with 33 bugs fixed, 9 confirmed, and 7 pending feedback. ![]() |
|
Wang, Leqing |
![]() Shuwei Song, Ting Chen, Ao Qiao, Xiapu Luo, Leqing Wang, Zheyuan He, Ting Wang, Xiaodong Lin, Peng He, Wensheng Zhang, and Xiaosong Zhang (University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada) Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them. ![]() |
|
Wang, Peng |
![]() Jia Chen, Yuang He, Peng Wang, Xiaolei Chen, Jie Shi, and Wei Wang (Fudan University, China) ![]() |
|
Wang, Qing |
![]() Mengzhuo Chen, Zhe Liu, Chunyang Chen, Junjie Wang, Boyu Wu, Jun Hu, and Qing Wang (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TU Munich, Germany) In software development, similar apps often encounter similar bugs due to shared functionalities and implementation methods. However, current automated GUI testing methods mainly focus on generating test scripts to cover more pages by analyzing the internal structure of the app, without targeted exploration of paths that may trigger bugs, resulting in low efficiency in bug discovery. Considering that a large number of bug reports on open source platforms can provide external knowledge for testing, this paper proposes BugHunter, a novel bug-aware automated GUI testing approach that generates exploration paths guided by bug reports from similar apps, utilizing a combination of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Instead of focusing solely on coverage, BugHunter dynamically adapts the testing process to target bug paths, thereby increasing bug detection efficiency. BugHunter first builds a high-quality bug knowledge base from historical bug reports. Then it retrieves relevant reports from this large bug knowledge base using a two-stage retrieval process, and generates test paths based on similar apps’ bug reports. BugHunter also introduces a local and global path-planning mechanism to handle differences in functionality and UI design across apps, and the ambiguous behavior or missing steps in the online bug reports. We evaluate BugHunter on 121 bugs across 71 apps and compare its performance against 16 state-of-the-art baselines. BugHunter achieves 60% improvement in bug detection over the best baseline, with comparable or higher coverage against the baselines. Furthermore, BugHunter successfully detects 49 new crash bugs in real-world apps from Google Play, with 33 bugs fixed, 9 confirmed, and 7 pending feedback. ![]() |
|
Wang, Ruisi |
![]() Susheng Wu, Ruisi Wang, Yiheng Cao, Bihuan Chen, Zhuotong Zhou, Yiheng Huang, JunPeng Zhao, and Xin Peng (Fudan University, China) Branching repositories facilitates efficient software development but can also inadvertently propagate vulnerabilities. When an original branch is patched, other unfixed branches remain vulnerable unless the patch is successfully ported. However, due to inherent discrepancies between branches, many patches cannot be directly applied and require manual intervention, which is time-consuming and leads to delays in patch porting, increasing vulnerability risks. Existing automated patch porting approaches are prone to errors, as they often overlook essential semantic and syntactic context of vulnerability and fail to detect or refine faulty patches. We propose Mystique, a novel LLM-based approach to address these limitations. Mystique first slices the semantic-related statements linked to the vulnerability while ensuring syntactic correctness, allowing it to extract the signatures for both the original patched function and the target vulnerable function. Mystique then utilizes a fine-tuned LLM to generate a fixed function, which is further iteratively checked and refined to ensure successful porting. Our evaluation shows that Mystique achieved a success rate of 0.954 at function level and of 0.924 at CVE level, outperforming state-of-the-art approaches by at least 13.2% at function level and 12.3% at CVE level. Our evaluation also demonstrates Mystique’s superior generality across various projects, bugs, and programming languages. Mystique successfully ported patches for 34 real-world vulnerable branches. ![]() ![]() |
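Only the iterative check-and-refine control flow described above is sketched below; the generator (a fine-tuned LLM), the checker (for example, compiling and running the project's tests), and the feedback format are hypothetical placeholders.

```python
# Generic generate -> check -> refine loop for porting a patch.

def port_patch(signatures, generate, check, max_rounds=3):
    """signatures: extracted from the original patched function and the
    target vulnerable function; generate(signatures, feedback) -> candidate
    fixed function; check(candidate) -> (ok, feedback)."""
    feedback = None
    for _ in range(max_rounds):
        candidate = generate(signatures, feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate              # successfully ported patch
    return None                           # give up after max_rounds

if __name__ == "__main__":
    # Toy generator/checker: succeeds once feedback has been seen.
    gen = lambda sigs, fb: "fixed_v2" if fb else "fixed_v1"
    chk = lambda cand: (cand == "fixed_v2", "tests failed on fixed_v1")
    print(port_patch({"original": "...", "target": "..."}, gen, chk))  # fixed_v2
```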
|
Wang, Shangwen |
![]() Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Weiguo Chen, Zixuan Dong, Mingyue Leng, and Xiaoguang Mao (National University of Defense Technology, China) The current landscape of binary code summarization predominantly focuses on generating a single summary, which limits its utility and understanding for reverse engineers. Existing approaches often fail to meet the diverse needs of users, such as providing detailed insights into usage patterns, implementation nuances, and design rationale, as observed in the field of source code summarization. This highlights the need for multi-intent binary code summarization to enhance the effectiveness of reverse engineering processes. To address this gap, we propose MiSum, a novel method that leverages multi-modality heterogeneous code graph alignment and learning to integrate both assembly code and pseudo-code. MiSum introduces a unified multi-modality heterogeneous code graph (MM-HCG) that aligns assembly code graphs with pseudo-code graphs, capturing both low-level execution details and high-level structural information. We further propose multi-modality heterogeneous graph learning with heterogeneous mutual attention and message passing, which highlights important code blocks and discovers inter-dependencies across different code forms. Additionally, an intent-aware summary generator with an intent-aware attention mechanism is introduced to produce customized summaries tailored to multiple intents. Extensive experiments, including evaluations across various architectures and optimization levels, demonstrate that MiSum outperforms state-of-the-art baselines in BLEU, METEOR, and ROUGE-L metrics. Human evaluations validate its capability to effectively support reverse engineers in understanding diverse binary code intents, marking a significant advancement in binary code analysis. ![]() ![]() ![]() |
|
Wang, Shaohua |
![]() Yi Li, Hridya Dhulipala, Aashish Yadavally, Xiaokai Rong, Shaohua Wang, and Tien N. Nguyen (University of Texas at Dallas, USA; Central University of Finance and Economics, China) Although Large Language Models (LLMs) are highly proficient in understanding source code and descriptive texts, they have limitations in reasoning on dynamic program behaviors, such as execution trace and code coverage prediction, and runtime error prediction, which usually require actual program execution. To advance the ability of LLMs in predicting dynamic behaviors, we leverage the strengths of both approaches, Program Analysis (PA) and LLM, in building PredEx, a predictive executor for Python. Our principle is a blended analysis between PA and LLM, using PA to guide the LLM in predicting execution traces. We break down the task of predictive execution into smaller sub-tasks and leverage determinism whenever the execution order can be decided statically. When it cannot, we use predictive backward slicing per variable, i.e., slicing the prior trace down to only the parts that affect each variable; handling each variable separately breaks the valuation prediction into significantly simpler problems. Our empirical evaluation on real-world datasets shows that PredEx achieves 31.5–47.1% relatively higher accuracy in predicting full execution traces than the state-of-the-art models. It also produces 8.6–53.7% more correct execution trace prefixes than those baselines. In predicting next executed statements, its relative improvement over the baselines is 15.7–102.3%. Finally, we show PredEx’s usefulness in two tasks: static code coverage analysis and static prediction of run-time errors for (in)complete code. ![]() |
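The per-variable backward slicing mentioned above can be illustrated on a simplified straight-line trace where each entry records the assigned variable and the variables it reads; PredEx itself works on real Python executions guided by program analysis, so this is only a toy rendering.

```python
# Keep only the statements that (transitively) affect one variable.

def backward_slice(trace, variable):
    """Each trace entry is (statement_text, defined_var, used_vars)."""
    needed = {variable}
    kept = []
    for stmt, defined, used in reversed(trace):
        if defined in needed:
            kept.append(stmt)
            needed.discard(defined)
            needed.update(used)
    return list(reversed(kept))

if __name__ == "__main__":
    trace = [("a = 1", "a", []),
             ("b = 2", "b", []),
             ("c = a + 3", "c", ["a"]),
             ("d = b * b", "d", ["b"]),
             ("e = c + 1", "e", ["c"])]
    print(backward_slice(trace, "e"))   # ['a = 1', 'c = a + 3', 'e = c + 1']
```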
|
Wang, Shaowei |
![]() Xu Yang, Shaowei Wang, Jiayuan Zhou, and Wenhan Zhu (University of Manitoba, Canada; Huawei, Canada) Deep Learning-based Vulnerability Detection (DLVD) techniques have garnered significant interest due to their ability to automatically learn vulnerability patterns from previously compromised code. Despite the notable accuracy demonstrated by pioneering tools, the broader application of DLVD methods in real-world scenarios is hindered by significant challenges. A primary issue is the “one-for-all” design, where a single model is trained to handle all types of vulnerabilities. This approach fails to capture the patterns of different vulnerability types, resulting in suboptimal performance, particularly for less common vulnerabilities that are often underrepresented in training datasets. To address these challenges, we propose MoEVD, which adopts the Mixture-of-Experts (MoE) framework for vulnerability detection. MoEVD decomposes vulnerability detection into two tasks, CWE type classification and CWE-specific vulnerability detection. By splitting the task, in vulnerability detection, MoEVD allows specific experts to handle distinct types of vulnerabilities instead of handling all vulnerabilities within one model. Our results show that MoEVD achieves an F1-score of 0.44, significantly outperforming all studied state-of-the-art (SOTA) baselines by at least 12.8%. MoEVD excels across almost all CWE types, improving recall over the best SOTA baseline by 9% to 77.8%. Notably, MoEVD does not sacrifice performance on long-tailed CWE types; instead, its MoE design enhances performance (F1-score) on these by at least 7.3%, addressing long-tailed issues effectively. ![]() ![]() Xu Yang, Wenhan Zhu, Michael Pacheco, Jiayuan Zhou, Shaowei Wang, Xing Hu, and Kui Liu (University of Manitoba, Canada; Huawei, Canada; Zhejiang University, China; Huawei Software Engineering Application Technology Lab, China) Detecting vulnerability fix commits in open-source software is crucial for maintaining software security. To help OSS identify vulnerability fix commits, several automated approaches are developed. However, existing approaches like VulFixMiner and CoLeFunDa, focus solely on code changes, neglecting essential context from development artifacts. Tools like Vulcurator, which integrates issue reports, fail to leverage semantic associations between different development artifacts (e.g., pull requests and history vulnerability fixes). Moreover, they miss vulnerability fixes in tangled commits and lack explanations, limiting practical use. Hence to address those limitations, we propose LLM4VFD, a novel framework that leverages Large Language Models (LLMs) enhanced with Chain-of-Thought reasoning and In-Context Learning to improve the accuracy of vulnerability fix detection. LLM4VFD comprises three components: (1) Code Change Intention, which analyzes commit summaries, purposes, and implications using Chain-of-Thought reasoning; (2) Development Artifact, which incorporates context from related issue reports and pull requests; (3) Historical Vulnerability, which retrieves similar past vulnerability fixes to enrich context. More importantly, on top of the prediction, LLM4VFD also provides a detailed analysis and explanation to help security experts understand the rationale behind the decision. We evaluated LLM4VFD against state-of-the-art techniques, including Pre-trained Language Model-based approaches and vanilla LLMs, using a newly collected dataset, BigVulFixes. 
Experimental results demonstrate that LLM4VFD significantly outperforms the best-performed existing approach by 68.1%--145.4%. Furthermore, We conducted a user study with security experts, showing that the analysis generated by LLM4VFD improves the efficiency of vulnerability fix identification. ![]() |
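Earlier in this entry, the MoEVD abstract describes a two-step Mixture-of-Experts design: route a sample by its likely CWE type, then let a CWE-specific expert decide whether it is vulnerable. The sketch below shows just that dispatch; the router and experts are hypothetical callables standing in for the trained models.

```python
# Route by predicted CWE type, then apply the CWE-specific expert.

def moe_detect(code, router, experts, default_expert):
    """router(code) -> ranked list of (cwe_id, prob);
    experts[cwe_id](code) -> vulnerability probability."""
    cwe, _ = router(code)[0]
    expert = experts.get(cwe, default_expert)   # fall back for unseen CWEs
    return cwe, expert(code)

if __name__ == "__main__":
    router = lambda code: [("CWE-787", 0.7), ("CWE-79", 0.3)]
    experts = {"CWE-787": lambda code: 0.91, "CWE-79": lambda code: 0.05}
    print(moe_detect("void f(char *p) { ... }", router, experts, lambda c: 0.5))
```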
|
Wang, Shuaisong |
![]() Xiaoli Lian, Shuaisong Wang, Hanyu Zou, Fang Liu, Jiajun Wu, and Li Zhang (Beihang University, China) In the current software-driven era, ensuring privacy and security is critical. Despite this, the specification of security requirements for software is still largely a manual and labor-intensive process. Engineers are tasked with analyzing potential security threats based on functional requirements (FRs), a procedure prone to omissions and errors due to the expertise gap between cybersecurity experts and software engineers. To bridge this gap, we introduce F2SRD (Function-to-Security Requirements Derivation), an automated approach that proactively derives security requirements (SRs) from functional specifications under the guidance of relevant security verification requirements (VRs) drawn from the well recognized OWASP Application Security Verification Standard (ASVS). F2SRD operates in two main phases: Initially, we develop a VR retriever trained on a custom database of FR-VR pairs, enabling it to adeptly select applicable VRs from ASVS. This targeted retrieval informs the precise and actionable formulation of SRs. Subsequently, these VRs are used to construct structured prompts that direct GPT-4 in generating SRs. Our comparative analysis against two established models demonstrates F2SRD's enhanced performance in producing SRs that excel in inspiration, diversity, and specificity—essential attributes for effective security requirement generation. By leveraging security verification standards, we believe that the generated SRs are not only more focused but also resonate stronger with the needs of engineers. ![]() |
|
Wang, Simin |
![]() Amiao Gao, Zenong Zhang, Simin Wang, LiGuo Huang, Shiyi Wei, and Vincent Ng (Southern Methodist University, Dallas, USA; University of Texas at Dallas, USA) ![]() |
|
Wang, Tao |
![]() Yiwen Wu, Yang Zhang, Tao Wang, Bo Ding, and Huaimin Wang (National University of Defense Technology, China; State Key Laboratory of Complex and Critical Software Environment, China) Docker building is a critical component of containerization in modern software development, automating the process of packaging and converting sources into container images. It is not uncommon to find that Docker build faults (DBFs) occur frequently across Docker-based projects, inducing non-negligible costs in development activities. DBF resolution is a challenging problem, and previous studies have demonstrated that developers spend non-trivial time resolving encountered build faults. However, the characteristics of DBFs are still under-investigated, hindering practical solutions to build management in the Docker community. In this paper, to bridge this gap, we present a comprehensive study to understand real-world DBFs in practice. We collect and analyze a DBF dataset of 255 issues and 219 posts from GitHub, Stack Overflow, and Docker Forum. We investigate and construct characteristic taxonomies for the DBFs, including 15 symptoms, 23 root causes, and 35 fix patterns. Moreover, we study the fault distributions of symptoms and root causes in terms of the different build types, i.e., Dockerfile builds and Docker-compose builds. Based on the results, we provide actionable implications and develop a knowledge-based application, which can potentially facilitate research and assist developers in improving Docker build management. ![]() |
|
Wang, Ting |
![]() Shuwei Song, Ting Chen, Ao Qiao, Xiapu Luo, Leqing Wang, Zheyuan He, Ting Wang, Xiaodong Lin, Peng He, Wensheng Zhang, and Xiaosong Zhang (University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada) Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them. ![]() |
|
Wang, Wei |
![]() Jia Chen, Yuang He, Peng Wang, Xiaolei Chen, Jie Shi, and Wei Wang (Fudan University, China) ![]() |
|
Wang, Wenxuan |
![]() Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael Lyu (Chinese University of Hong Kong, China; Singapore Management University, Singapore) ![]() |
|
Wang, Xiaoyin |
![]() Md Mahir Asef Kabir, Xiaoyin Wang, and Na Meng (Virginia Tech, USA; University of Texas at San Antonio, USA) When building enterprise applications (EAs) on Java frameworks (e.g., Spring), developers often configure application components via metadata (i.e., Java annotations and XML files). It is challenging for developers to correctly use metadata, because the usage rules can be complex and existing tools provide limited assistance. When developers misuse metadata, EAs become misconfigured, and these defects can trigger erroneous runtime behaviors or introduce security vulnerabilities. To help developers correctly use metadata, this paper presents (1) RSL---a domain-specific language that domain experts can adopt to prescribe metadata checking rules, and (2) MeCheck---a tool that takes in RSL rules and EAs to check for rule violations. With RSL, domain experts (e.g., developers of a Java framework) can specify metadata checking rules by defining content consistency among XML files, annotations, and Java code. Given such RSL rules and a program to scan, MeCheck interprets the rules as cross-file static analyzers, which scan Java and/or XML files to gather information and look for consistency violations. For evaluation, we studied the Spring and JUnit documentation to manually define 15 rules, and created 2 datasets with 115 open-source EAs. The first dataset includes 45 EAs, and the ground truth of 45 manually injected bugs. The second dataset includes multiple versions of 70 EAs. We observed that MeCheck identified bugs in the first dataset with 100% precision, 96% recall, and 98% F-score. It reported 152 bugs in the second dataset, 49 of which were already fixed by developers. Our evaluation shows that MeCheck helps ensure the correct usage of metadata. ![]() |
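As a toy flavor of the cross-file consistency rules MeCheck evaluates, the sketch below checks that every class registered as a Spring bean in an XML file also appears in the scanned Java sources. The rule, the regular expressions, and the inputs are made up for illustration; RSL's actual syntax and MeCheck's analyzers are far richer.

```python
import re

# Toy cross-file check: XML bean classes must exist in the Java sources.

def beans_in_xml(xml_text):
    return set(re.findall(r'<bean[^>]*class="([\w.]+)"', xml_text))

def classes_in_java(java_sources):
    found = set()
    for text in java_sources:
        pkg = re.search(r'package\s+([\w.]+);', text)
        for cls in re.findall(r'\bclass\s+(\w+)', text):
            found.add((pkg.group(1) + "." if pkg else "") + cls)
    return found

def check(xml_text, java_sources):
    missing = beans_in_xml(xml_text) - classes_in_java(java_sources)
    return [f"bean class not found in sources: {m}" for m in sorted(missing)]

if __name__ == "__main__":
    xml = '<bean id="svc" class="com.app.UserService"/>'
    java = ['package com.app; public class UserRepo {}']
    print(check(xml, java))   # reports com.app.UserService as missing
```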
|
Wang, Xinyan |
![]() Junming Cao, Xuwen Xiang, Mingfei Cheng, Bihuan Chen, Xinyan Wang, You Lu, Chaofeng Sha, Xiaofei Xie, and Xin Peng (Fudan University, China; Singapore Management University, Singapore) ![]() |
|
Wang, Xu |
![]() Xu Wang, Mingming Zhang, Xiangxin Meng, Jian Zhang, Yang Liu, and Chunming Hu (Beihang University, China; Zhongguancun Laboratory, China; Ministry of Education, China; Nanyang Technological University, Singapore) Deep Neural Networks (DNNs) are prevalent across a wide range of applications. Despite their success, the complexity and opaque nature of DNNs pose significant challenges in debugging and repairing DNN models, limiting their reliability and broader adoption. In this paper, we propose MLM4DNN, an element-based automated DNN repair method. Unlike previous techniques that focus on post-training adjustments or rely heavily on predefined bug patterns, MLM4DNN repairs DNNs by leveraging a fine-tuned Masked Language Model (MLM) to predict correct fixes for nine predefined key elements in DNNs. We construct a large-scale dataset by masking nine key elements from the correct DNN source code and then force the MLM to restore the correct elements to learn the deep semantics that ensure the normal functionalities of DNNs. Afterwards, a light-weight static analysis tool is designed to filter out low-quality patches to enhance the repair efficiency. We introduce a patch validation method specifically for DNN repair tasks, which consists of three evaluation metrics from different aspects to model the effectiveness of generated patches. We construct a benchmark, BenchmarkAPR4DNN, including 51 buggy DNN models and an evaluation tool that outputs the three metrics. We evaluate MLM4DNN against six baselines on BenchmarkAPR4DNN, and results show that MLM4DNN outperforms all state-of-the-art baselines, including two dynamic-based and four zero-shot learning-based methods. After applying the fine-tuned MLM design to several prevalent Large Language Models (LLMs), we consistently observe improved performance in DNN repair tasks compared to the original LLMs, which demonstrates the effectiveness of the method proposed in this paper. ![]() |
|
Wang, Yanlin |
![]() Yanlin Wang, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Mingzhi Mao, Xilin Liu, Yuchi Ma, and Zibin Zheng (Sun Yat-sen University, China; Huawei Cloud Computing Technologies, China) Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation, while coding style differences between LLMs and human developers remain under-explored. In this paper, we empirically analyze the differences in coding style between the code generated by mainstream LLMs and the code written by human developers, and summarize a taxonomy of coding style inconsistencies. Specifically, we first summarize the types of coding style inconsistencies by manually analyzing a large number of generation results. We then compare the code generated by LLMs with the code written by human programmers in terms of readability, conciseness, and robustness. The results reveal that LLMs and developers exhibit differences in coding style. Additionally, we study the possible causes of these inconsistencies and provide some solutions to alleviate the problem. ![]() |
|
Wang, Yingying |
![]() Huimin Hu, Yingying Wang, Julia Rubin, and Michael Pradel (University of Stuttgart, Germany; University of British Columbia, Canada) Scalable static analyzers are popular tools for finding incorrect, inefficient, insecure, and hard-to-maintain code early during the development process. Because not all warnings reported by a static analyzer are immediately useful to developers, many static analyzers provide a way to suppress warnings, e.g., in the form of special comments added into the code. Such suppressions are an important mechanism at the interface between static analyzers and software developers, but little is currently known about them. This paper presents the first in-depth empirical study of suppressions of static analysis warnings, addressing questions about the prevalence of suppressions, their evolution over time, the relationship between suppressions and warnings, and the reasons for using suppressions. We answer these questions by studying projects written in three popular languages and suppressions for warnings by four popular static analyzers. Our findings show that (i) suppressions are relatively common, e.g., with a total of 7,357 suppressions in 46 Python projects, (ii) the number of suppressions in a project tends to continuously increase over time, (iii) surprisingly, 50.8% of all suppressions do not affect any warning and hence are practically useless, (iv) some suppressions, including useless ones, may unintentionally hide future warnings, and (v) common reasons for introducing suppressions include false positives, suboptimal configurations of the static analyzer, and misleading warning messages. These results have actionable implications, e.g., that developers should be made aware of useless suppressions and the potential risk of unintentional suppressing, that static analyzers should provide better warning messages, and that static analyzers should separately categorize warnings from third-party libraries. ![]() |
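For readers unfamiliar with the suppressions studied above, the sketch below shows what they look like in Python code and how one might count them. The markers shown (flake8's `# noqa`, pylint's `disable`, mypy's `# type: ignore`) are common examples; the paper's analyzer set and counting methodology may differ.

```python
import re

# Count common warning-suppression comments in a Python source string.

SUPPRESSION_PATTERNS = [
    r"#\s*noqa(?::\s*[A-Z]\d+)?",          # flake8, optionally rule-specific
    r"#\s*pylint:\s*disable=[\w,-]+",      # pylint
    r"#\s*type:\s*ignore(\[[\w,-]+\])?",   # mypy
]

def count_suppressions(source):
    return sum(len(re.findall(p, source)) for p in SUPPRESSION_PATTERNS)

if __name__ == "__main__":
    code = (
        "import os, sys  # noqa: F401\n"
        "x = eval(input())  # pylint: disable=eval-used\n"
        "y: int = 'a'  # type: ignore[assignment]\n"
    )
    print(count_suppressions(code))   # 3
```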
|
Wang, Yuqing |
![]() Yuqing Wang, Mika V. Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä (University of Helsinki, Finland; University of Oulu, Finland; University of Antwerp, Belgium) Microservice-based systems (MSS) may fail with various fault types, due to their complex and dynamic nature. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), human efforts from practitioners are still required for further root cause analysis to diagnose specific fault types and analyze failure reasons for detected abnormal traces, particularly when abnormal traces do not stem directly from specific services. In this paper, we propose a novel AIOps framework, TraFaultDia, to automatically classify abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for a MSS. TraFaultDia is trained on several abnormal trace classification tasks with a few labeled instances from a MSS using a meta-learning approach. After training, TraFaultDia can quickly adapt to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. TraFaultDia’s use cases are scalable depending on how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two representative MSS, TrainTicket and OnlineBoutique, with open datasets. In these datasets, each fault category is tied to the faulty system component(s) (service/pod) with a root cause. Our TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. Our results show that, within the MSS it is trained on, TraFaultDia achieves an average accuracy of 93.26% and 85.20% across 50 new, unseen abnormal trace classification tasks for TrainTicket and OnlineBoutique respectively, when provided with 10 labeled instances for each fault category per task in each system. In the cross-system context, when TraFaultDia is applied to a MSS different from the one it is trained on, TraFaultDia gets an average accuracy of 92.19% and 84.77% for the same set of 50 new, unseen abnormal trace classification tasks of the respective systems, also with 10 labeled instances provided for each fault category per task in each system. ![]() |
|
Wang, Zheng |
![]() Haibo Wang, Zezhong Xing, Chengnian Sun, Zheng Wang, and Shin Hwei Tan (Concordia University, Canada; Southern University of Science and Technology, China; University of Waterloo, Canada; University of Leeds, UK) By reducing the number of lines of code, program simplification reduces code complexity, improving software maintainability and code comprehension. While several existing techniques can be used for automatic program simplification, there is no consensus on the effectiveness of these approaches. We present the first study on how real-world developers simplify programs in open-source software projects. By analyzing 382 pull requests from 296 projects, we summarize the types of program transformations used, the motivations behind simplifications, and the set of program transformations that have not been covered by existing refactoring types. As a result of our study, we submitted eight bug reports to a widely used refactoring detection tool, RefactoringMiner, of which seven were fixed. Our study also identifies gaps in applying existing approaches for automating program simplification and outlines the criteria for designing automatic program simplification techniques. In light of these observations, we propose SimpT5, a tool that automatically produces simplified programs that are semantically equivalent to the original but contain fewer lines of code. SimpT5 is trained on our collected dataset of 92,485 simplified programs with two heuristics: (1) modified line localization that encodes lines changed in simplified programs, and (2) checkers that measure the quality of generated programs. Experimental results show that SimpT5 outperforms prior approaches in automating developer-induced program simplification. ![]() ![]() |
|
Wang, Zhijie |
![]() Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, and Lei Ma (University of Alberta, Canada; University of Tokyo, Japan) ![]() |
|
Wang, Zhiqi |
![]() Yue Li, Bohan Liu, Ting Zhang, Zhiqi Wang, David Lo, Lanxin Yang, Jun Lyu, and He Zhang (Nanjing University, China; Singapore Management University, Singapore) ![]() |
|
Wang, Zixu |
![]() Weiyuan Tong, Zixu Wang, Zhanyong Tang, Jianbin Fang, Yuqun Zhang, and Guixin Ye (NorthWest University, China; National University of Defense Technology, China; Southern University of Science and Technology, China) ![]() |
|
Wang, Zuobin |
![]() Zuobin Wang, Zhiyuan Wan, Yujing Chen, Yun Zhang, David Lo, Difan Xie, and Xiaohu Yang (Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang), China; Hangzhou City University, China; Singapore Management University, Singapore) In smart contract development, practitioners frequently reuse code to reduce development effort and avoid reinventing the wheel. This reused code, whether identical or similar to its original source, is referred to as a code clone. Unintentional code cloning can propagate flaws and vulnerabilities, potentially undermining the reliability and maintainability of software systems. Previous studies have identified a significant prevalence of code clones in Solidity smart contracts on the Ethereum blockchain. To mitigate the risks posed by code clones, clone detection has emerged as an active field of research and practice in software engineering. Recent studies have extended existing techniques or proposed novel techniques tailored to the unique syntactic and semantic features of Solidity. Nonetheless, the evaluations of existing techniques, whether conducted by their original authors or independent researchers, involve codebases in various programming languages and utilize different versions of the corresponding tools. The resulting inconsistency makes direct comparisons of the evaluation results impractical, and hinders the ability to derive meaningful conclusions across the evaluations. There remains a lack of clarity regarding the effectiveness of these techniques in detecting smart contract clones, and whether it is feasible to combine different techniques to achieve scalable yet accurate detection of code clones in smart contracts. To address this gap, we conduct a comprehensive empirical study that evaluates the effectiveness and scalability of five representative clone detection techniques on 33,073 verified Solidity smart contracts, along with a benchmark we curate, in which we manually label 72,010 pairs of Solidity smart contracts with clone tags. Moreover, we explore the potential of combining different techniques to achieve optimal performance of code clone detection for smart contracts, and propose SourceREClone, a framework designed for the refined integration of different techniques, which achieves a 36.9% improvement in F1 score compared to a straightforward combination of the state of the art. Based on our findings, we discuss implications, provide recommendations for practitioners, and outline directions for future research. ![]() |
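One way to picture the combination of techniques explored above is a two-stage pipeline: a cheap token-overlap filter keeps pairwise comparison scalable, and a finer similarity check confirms surviving candidates. The measures and thresholds below are illustrative choices, not SourceREClone's actual design.

```python
from difflib import SequenceMatcher

# Two-stage clone detection: coarse token filter, then fine confirmation.

def token_overlap(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(1, len(ta | tb))

def detect_clones(contracts, coarse=0.5, fine=0.8):
    """contracts: list of (name, source). Returns confirmed clone pairs."""
    pairs = []
    for i in range(len(contracts)):
        for j in range(i + 1, len(contracts)):
            (na, sa), (nb, sb) = contracts[i], contracts[j]
            if token_overlap(sa, sb) < coarse:
                continue                                   # stage 1: cheap filter
            if SequenceMatcher(None, sa, sb).ratio() >= fine:
                pairs.append((na, nb))                     # stage 2: confirmation
    return pairs

if __name__ == "__main__":
    c1 = ("A", "function transfer(address to, uint value) public { balances[to] += value; }")
    c2 = ("B", "function transfer(address dst, uint value) public { balances[dst] += value; }")
    c3 = ("C", "function name() public view returns (string memory) { return _name; }")
    print(detect_clones([c1, c2, c3]))   # -> [('A', 'B')]
```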
|
Weber, Max |
![]() Max Weber, Alina Mailach, Sven Apel, Janet Siegmund, Raimund Dachselt, and Norbert Siegmund (Leipzig University, Germany; ScaDS.AI Dresden-Leipzig, Germany; Saarland Informatics Campus, Germany; Saarland University, Germany; Chemnitz University of Technology, Germany; Dresden University of Technology, Germany) Debugging performance bugs in configurable software systems is a complex and time-consuming task that requires not only fixing a bug, but also understanding its root cause. While there is a vast body of literature on debugging strategies, there is no consensus on general debugging. This makes it difficult to provide concrete guidance for developers, especially for configuration-dependent performance bugs. The goal of our work is to alleviate this situation by providing a framework to describe debugging strategies in a more general, unifying way. We conducted a user study with 12 professional developers who debugged a performance bug in a real-world configurable system. To observe developers in an unobtrusive way, we provided an immersive virtual reality tool, SoftVR, giving them a large degree of freedom to choose the preferred debugging strategy. The results show that the existing documentation of strategies is too coarse-grained and intermixed to identify successful approaches. In a subsequent qualitative analysis, we devised a coding framework to reason about debugging approaches. With this framework, we identified five goal-oriented episodes that developers employ, which they also confirmed in subsequent interviews. Our work provides a unified description of debugging strategies, giving researchers a common foundation to study debugging, and giving practitioners and teachers guidance on successful debugging strategies. ![]() |
|
Wei, Amy |
![]() Matthew Davis, Amy Wei, Brad Myers, and Joshua Sunshine (Carnegie Mellon University, USA; University of Michigan, USA) Software testing is difficult, tedious, and may consume 28%–50% of software engineering labor. Automatic test generators aim to ease this burden but have important trade-offs. Fuzzers use an implicit oracle that can detect obviously invalid results, but the oracle problem has no general solution, and an implicit oracle cannot automatically evaluate correctness. Test suite generators like EvoSuite use the program under test as the oracle and therefore cannot evaluate correctness. Property-based testing tools evaluate correctness, but users have difficulty coming up with properties to test and understanding whether their properties are correct. Consequently, practitioners create many test suites manually and often use an example-based oracle to tediously specify correct input and output examples. To help bridge the gaps among various oracle and tool types, we present the Composite Oracle, which organizes various oracle types into a hierarchy and renders a single test result per example execution. To understand the Composite Oracle’s practical properties, we built TerzoN, a test suite generator that includes a particular instantiation of the Composite Oracle. TerzoN displays all the test results in an integrated view composed from the results of three types of oracles and finds some types of test assertion inconsistencies that might otherwise lead to misleading test results. We evaluated TerzoN in a randomized controlled trial with 14 professional software engineers with a popular industry tool, fast-check, as the control. Participants using TerzoN elicited 72% more bugs (p < 0.01), accurately described more than twice the number of bugs (p < 0.01) and tested 16% more quickly (p < 0.05) relative to fast-check. ![]() ![]() |
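A minimal sketch of the Composite Oracle idea, assuming three oracle layers (implicit, example-based, property-based) collapsed into a single verdict; the verdict wording, the oracle ordering, and the function under test are invented and do not mirror TerzoN's implementation.

```python
# Run several oracle types on one execution and report a single verdict.

def composite_verdict(func, example_input, expected_output, properties):
    try:
        actual = func(example_input)           # implicit oracle: no crash
    except Exception as exc:
        return f"fail (implicit oracle): raised {type(exc).__name__}"
    if actual != expected_output:              # example-based oracle
        return f"fail (example oracle): expected {expected_output!r}, got {actual!r}"
    for name, prop in properties:              # property-based oracles
        if not prop(example_input, actual):
            return f"fail (property oracle): {name}"
    return "pass"

if __name__ == "__main__":
    absolute = lambda x: -x if x < 0 else x
    props = [("result is non-negative", lambda x, y: y >= 0),
             ("result has same magnitude", lambda x, y: y == abs(x))]
    print(composite_verdict(absolute, -3, 3, props))   # pass
```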
|
Wei, Lili |
![]() Hengcheng Zhu, Valerio Terragni, Lili Wei, Shing-Chi Cheung, Jiarong Wu, and Yepang Liu (Hong Kong University of Science and Technology, China; University of Auckland, New Zealand; McGill University, Canada; Southern University of Science and Technology, China) Mock assertions provide developers with a powerful means to validate program behaviors that are unobservable to test assertions. Despite their significance, they are rarely considered by automated test generation techniques. Effective generation of mock assertions requires understanding how they are used in practice. Although previous studies highlighted the importance of mock assertions, none provide insight into their usage. To bridge this gap, we conducted the first empirical study on mock assertions, examining their adoption, the characteristics of the verified method invocations, and their effectiveness in fault detection. Our analysis of 4,652 test cases from 11 popular Java projects reveals that mock assertions are mostly applied to validating specific kinds of method calls, such as those interacting with external resources and those reflecting whether a certain code path was traversed in systems under test. Additionally, we find that mock assertions complement traditional test assertions by ensuring the desired side effects have been produced, validating control flow logic, and checking internal computation results. Our findings contribute to a better understanding of mock assertion usage and provide a foundation for future related research, such as automated test generation that supports mock assertions. ![]() ![]() |
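The study above targets Java tests, but the idea of a mock assertion is easy to show with Python's unittest.mock for illustration: the interaction check below observes a side effect (a notification being sent) that an ordinary value assertion cannot.

```python
from unittest import mock

# A mock assertion verifies an interaction with a collaborator, here a
# notifier standing in for an external resource; names are hypothetical.

def register_user(name, repository, notifier):
    user_id = repository.save(name)          # external resource interaction
    notifier.send(f"welcome {name}")         # side effect under test
    return user_id

def test_register_user_notifies_once():
    repository = mock.Mock()
    repository.save.return_value = 7
    notifier = mock.Mock()

    assert register_user("ada", repository, notifier) == 7    # test assertion
    notifier.send.assert_called_once_with("welcome ada")       # mock assertion

if __name__ == "__main__":
    test_register_user_notifies_once()
    print("ok")
```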
|
Wei, Shiyi |
![]() Amiao Gao, Zenong Zhang, Simin Wang, LiGuo Huang, Shiyi Wei, and Vincent Ng (Southern Methodist University, Dallas, USA; University of Texas at Dallas, USA) ![]() |
|
Wei, Zichao |
![]() Xiaohu Du, Ming Wen, Haoyu Wang, Zichao Wei, and Hai Jin (Huazhong University of Science and Technology, China) ![]() |
|
Welp, Tobias |
![]() Kathryn T. Stolee, Tobias Welp, Caitlin Sadowski, and Sebastian Elbaum (North Carolina State University, USA; Google, Germany; Unaffiliated, USA; University of Virginia, USA) Code search is an integral part of a developer’s workflow. In 2015, researchers published a paper reflecting on the code search practices at Google of 27 developers who used the internal Code Search tool. That paper had first-hand accounts for why those developers were using code search and highlighted how often and in what situations developers were searching for code. In the past decade, much has changed in the landscape of developer support. New languages have emerged, artificial intelligence (AI) for code generation has gained traction, auto-complete in the IDE has gotten better, Q&A forums have increased in popularity, and code repositories are larger than ever. It is worth considering whether those observations from almost a decade ago have stood the test of time. In this work, inspired by the prior survey about the Code Search tool, we run a series of three surveys with 1,945 total responses and report overall Code Search usage statistics for over 100,000 users. Unlike the prior work, in our surveys, we include explicit success criteria to understand when code search is meeting their needs, and when it is not. We dive further into two common sub-categories of code search effort: when its users are looking for examples and when they are using code search alongside code review. We find that Code Search users continue to use the tool frequently and the frequency has not changed despite the introduction of AI-enhanced development support. Users continue to turn to Code Search to find examples, but the frequency of example-seeking behavior has decreased. More often than before, users access the tool to learn about and explore code. This has implications for future Code Search support in software development. ![]() |
|
Wen, Ming |
![]() Zhengmin Yu, Yuan Zhang, Ming Wen, Yinan Nie, Wenhui Zhang, and Min Yang (Fudan University, China; Huazhong University of Science and Technology, China) ![]() ![]() Xiaohu Du, Ming Wen, Haoyu Wang, Zichao Wei, and Hai Jin (Huazhong University of Science and Technology, China) ![]() ![]() Xiao Chen, Hengcheng Zhu, Jialun Cao, Ming Wen, and Shing-Chi Cheung (Hong Kong University of Science and Technology, China; Huazhong University of Science and Technology, China) Debugging can be greatly facilitated if one can identify the evolution commit that introduced the bug leading to a detected failure (a.k.a. bug-inducing commit, BIC). Although one may, in theory, locate BICs by executing the detected failing test on various historical commit versions, it is impractical when the test cannot be executed on some of those versions. On the other hand, existing static techniques often assume the availability of additional information such as patches and bug reports, or the applicability of predefined heuristics like commit chronology. However, these approaches are ineffective when such assumptions do not hold, which is often the case in practice. To address these limitations, we propose SEMBIC to identify the BIC of a bug by statically tracking the semantic changes in the execution path prescribed by the failing test across successive historical commit versions. Our insight is that the greater the semantic changes a commit introduces concerning the failing execution path of a target bug, the more likely it is to be the BIC. To distill semantic changes relevant to the failure, we focus on three fine-grained semantic properties. We evaluate the performance of SEMBIC on a benchmark containing 199 real-world bugs from 12 open-source projects. We found that SEMBIC can identify BICs with high accuracy – it ranks the BIC as top 1 for 88 out of 199 bugs, and achieves an MRR of 0.520, outperforming the state-of-the-art technique by 29.4% and 13.6%, respectively. ![]() ![]() |
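The ranking idea behind SEMBIC can be pictured with a toy sketch: score each candidate commit by how many of its changed program elements lie on the failing execution path, then rank commits by that score. This is not SEMBIC's analysis (which statically tracks three fine-grained semantic properties); the changed-element sets and element identifiers below are hypothetical inputs assumed to come from an upstream diff analysis.

```python
# Illustrative sketch (not SEMBIC itself): rank candidate bug-inducing commits
# by how many changed program elements they introduce on the failing execution path.
# Assumes an upstream static analysis has already produced, per commit, the set of
# program elements (e.g., statements or conditions) that the commit changes.

def rank_bic_candidates(changed_elements_per_commit, failing_path_elements):
    """Return commits sorted by overlap with the failing path (largest first)."""
    scores = {
        commit: len(changed & failing_path_elements)
        for commit, changed in changed_elements_per_commit.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical data: element identifiers are placeholders.
    changes = {
        "c1": {"Foo.bar:12", "Foo.bar:13"},
        "c2": {"Util.parse:40"},
        "c3": {"Foo.bar:12", "Foo.baz:7", "Util.parse:40"},
    }
    failing_path = {"Foo.bar:12", "Foo.baz:7"}
    for commit, score in rank_bic_candidates(changes, failing_path):
        print(commit, score)   # c3 ranks first in this toy example
```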
|
Woodlief, Trey |
![]() Trey Woodlief, Felipe Toledo, Matthew Dwyer, and Sebastian Elbaum (University of Virginia, USA) ![]() |
|
Wu, Boyu |
![]() Mengzhuo Chen, Zhe Liu, Chunyang Chen, Junjie Wang, Boyu Wu, Jun Hu, and Qing Wang (Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TU Munich, Germany) In software development, similar apps often encounter similar bugs due to shared functionalities and implementation methods. However, current automated GUI testing methods mainly focus on generating test scripts to cover more pages by analyzing the internal structure of the app, without targeted exploration of paths that may trigger bugs, resulting in low efficiency in bug discovery. Considering that a large number of bug reports on open source platforms can provide external knowledge for testing, this paper proposes BugHunter, a novel bug-aware automated GUI testing approach that generates exploration paths guided by bug reports from similar apps, utilizing a combination of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Instead of focusing solely on coverage, BugHunter dynamically adapts the testing process to target bug paths, thereby increasing bug detection efficiency. BugHunter first builds a high-quality bug knowledge base from historical bug reports. Then it retrieves relevant reports from this large bug knowledge base using a two-stage retrieval process, and generates test paths based on similar apps’ bug reports. BugHunter also introduces a local and global path-planning mechanism to handle differences in functionality and UI design across apps, and the ambiguous behavior or missing steps in the online bug reports. We evaluate BugHunter on 121 bugs across 71 apps and compare its performance against 16 state-of-the-art baselines. BugHunter achieves 60% improvement in bug detection over the best baseline, with comparable or higher coverage against the baselines. Furthermore, BugHunter successfully detects 49 new crash bugs in real-world apps from Google Play, with 33 bugs fixed, 9 confirmed, and 7 pending feedback. ![]() |
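A two-stage retrieval step of the kind the abstract describes can be sketched generically: a cheap lexical filter narrows a large bug-report corpus, then a finer similarity re-ranks the survivors. This toy sketch is not BugHunter's retriever (which operates over a curated knowledge base and feeds LLM-based path generation downstream); the corpus and scoring functions are illustrative assumptions.

```python
# Illustrative two-stage retrieval sketch (not BugHunter's implementation):
# stage 1 narrows the corpus with a cheap token-overlap filter,
# stage 2 re-ranks the survivors with a finer cosine similarity.
from collections import Counter
from math import sqrt

def tokens(text):
    return [t for t in text.lower().split() if t.isalnum()]

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def retrieve(query, reports, coarse_k=50, final_k=5):
    q = tokens(query)
    coarse = sorted(reports, key=lambda r: jaccard(q, tokens(r)), reverse=True)[:coarse_k]
    return sorted(coarse, key=lambda r: cosine(q, tokens(r)), reverse=True)[:final_k]

if __name__ == "__main__":
    corpus = [
        "app crashes when rotating screen on settings page",
        "crash after tapping login button twice",
        "image viewer freezes when zooming quickly",
    ]
    print(retrieve("crash when rotating the settings screen", corpus, coarse_k=3, final_k=1))
```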
|
Wu, Cong |
![]() Ruichao Liang, Jing Chen, Ruochen Cao, Kun He, Ruiying Du, Shuhua Li, Zheng Lin, and Cong Wu (Wuhan University, China; University of Hong Kong, Hong Kong) Smart contracts, as Turing-complete programs managing billions of assets in decentralized finance, are prime targets for attackers. While fuzz testing seems effective for detecting vulnerabilities in these programs, we identify several significant challenges when targeting smart contracts: (i) the stateful nature of these contracts requires stateful exploration, but current fuzzers rely on transaction sequences to manipulate contract states, making the process inefficient; (ii) contract execution is influenced by the continuously changing blockchain environment, yet current fuzzers are limited to local deployments, failing to test contracts in real-world scenarios. These challenges hinder current fuzzers from uncovering hidden vulnerabilities, i.e., those concealed in deep contract states and specific blockchain environments. In this paper, we present SmartShot, a mutable snapshot-based fuzzer to hunt hidden vulnerabilities within smart contracts. We innovatively formulate contract states and blockchain environments as directly fuzzable elements and design mutable snapshots to quickly restore and mutate these elements. SmartShot features a symbolic taint analysis-based mutation strategy along with double validation to soundly guide the state mutation. SmartShot mutates blockchain environments using contract’s historical on-chain states, providing real-world execution contexts. We propose a snapshot checkpoint mechanism to integrate mutable snapshots into SmartShot’s fuzzing loops. These innovations enable SmartShot to effectively fuzz contract states, test contracts across varied and realistic blockchain environments, and support on-chain fuzzing. Experimental results show that SmartShot is effective to detect hidden vulnerabilities with the highest code coverage and lowest false positive rate. SmartShot is 4.8× to 20.2× faster than state-of-the-art tools, identifying 2,150 vulnerable contracts out of 42,738 real-world contracts which is 2.1× to 13.7× more than other tools. SmartShot has demonstrated its real-world impact by detecting vulnerabilities that are only discoverable on-chain and uncovering 24 0-day vulnerabilities in the latest 10,000 deployed contracts. ![]() |
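The snapshot idea can be illustrated with a toy stateful target: checkpoint an interesting state once, then fuzz directly from restored copies instead of replaying long transaction sequences. The sketch below only gestures at that loop and is not SmartShot's EVM-level mutable snapshots; the ToyContract and its hidden bug are invented for illustration.

```python
# Toy illustration of snapshot-restore fuzzing of a stateful target (the idea behind
# mutable snapshots, not SmartShot's EVM implementation).
import copy
import random

class ToyContract:
    def __init__(self):
        self.balance, self.unlocked = 100, False
    def step(self, action, amount):
        if action == "unlock" and amount == 7:
            self.unlocked = True
        elif action == "withdraw" and self.unlocked and amount > self.balance:
            raise RuntimeError("hidden bug: over-withdrawal allowed once unlocked")

def fuzz_from_snapshot(snapshot, trials=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        state = copy.deepcopy(snapshot)          # cheap stand-in for a real snapshot restore
        try:
            state.step(rng.choice(["unlock", "withdraw"]), rng.randint(0, 200))
        except RuntimeError as bug:
            return bug
    return None

if __name__ == "__main__":
    deep_state = ToyContract()
    deep_state.step("unlock", 7)                 # reach the interesting state once
    print(fuzz_from_snapshot(deep_state))        # usually triggers the bug quickly
```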
|
Wu, Di |
![]() Shoupeng Ren, Lipeng He, Tianyu Tu, Di Wu, Jian Liu, Kui Ren, and Chun Chen (Zhejiang University, China; University of Waterloo, Canada) ![]() |
|
Wu, Hao |
![]() Xing Su, Hanzhong Liang, Hao Wu, Ben Niu, Fengyuan Xu, and Sheng Zhong (Nanjing University, China; Institute of Information Engineering at Chinese Academy of Sciences, China) Understanding Ethereum smart contract bytecode is essential for ensuring cryptoeconomics security. However, existing decompilers primarily convert bytecode into pseudocode, which is not easily comprehensible for general users, potentially leading to misunderstanding of contract behavior and increased vulnerability to scams or exploits. In this paper, we propose DiSCo, the first LLM-based EVM decompilation pipeline, which aims to enable LLMs to understand the opaque bytecode and lift it into smart contract code. DiSCo introduces three core technologies. First, a logic-invariant intermediate representation is proposed to reproject the low-level bytecode into high-level abstracted units. The second technique involves semantic enhancement based on a novel type-aware graph model to infer stripped variables during compilation, enhancing the lifting effect. The third technology is a flexible method incorporating code specifications to construct LLM-comprehensible prompts for source code generation. Extensive experiments illustrate that our generated code achieves a high compilability rate of 75%, with a differential fuzzing pass rate averaging 50%. Manual validation results further indicate that the generated Solidity contracts significantly outperform baseline methods in tasks such as code comprehension and attack reproduction. ![]() |
|
Wu, Jiajun |
![]() Xiaoli Lian, Shuaisong Wang, Hanyu Zou, Fang Liu, Jiajun Wu, and Li Zhang (Beihang University, China) In the current software-driven era, ensuring privacy and security is critical. Despite this, the specification of security requirements for software is still largely a manual and labor-intensive process. Engineers are tasked with analyzing potential security threats based on functional requirements (FRs), a procedure prone to omissions and errors due to the expertise gap between cybersecurity experts and software engineers. To bridge this gap, we introduce F2SRD (Function-to-Security Requirements Derivation), an automated approach that proactively derives security requirements (SRs) from functional specifications under the guidance of relevant security verification requirements (VRs) drawn from the well recognized OWASP Application Security Verification Standard (ASVS). F2SRD operates in two main phases: Initially, we develop a VR retriever trained on a custom database of FR-VR pairs, enabling it to adeptly select applicable VRs from ASVS. This targeted retrieval informs the precise and actionable formulation of SRs. Subsequently, these VRs are used to construct structured prompts that direct GPT-4 in generating SRs. Our comparative analysis against two established models demonstrates F2SRD's enhanced performance in producing SRs that excel in inspiration, diversity, and specificity—essential attributes for effective security requirement generation. By leveraging security verification standards, we believe that the generated SRs are not only more focused but also resonate stronger with the needs of engineers. ![]() |
|
Wu, Jianyu |
![]() Farbod Daneshyan, Runzhi He, Jianyu Wu, and Minghui Zhou (Peking University, China) The release note is a crucial document outlining changes in new software versions. It plays a key role in helping stakeholders recognise important changes and understand the implications behind them. Despite this fact, many developers view the process of writing software release notes as a tedious and dreadful task. Consequently, numerous tools (e.g., DeepRelease and Conventional Changelog) have been developed by researchers and practitioners to automate the generation of software release notes. However, these tools fail to consider project domain and target audience for personalisation, limiting their relevance and conciseness. Additionally, they suffer from limited applicability, often necessitating significant workflow adjustments and adoption efforts, hindering practical use and stressing developers. Despite recent advancements in natural language processing and the proven capabilities of large language models (LLMs) in various code and text-related tasks, there are no existing studies investigating the integration and utilisation of LLMs in automated release note generation. Therefore, we propose SmartNote, a novel and widely applicable release note generation approach that produces high-quality, contextually personalised release notes by leveraging LLM capabilities to aggregate, describe, and summarise changes based on code, commit, and pull request details. It categorises and scores (for significance) commits to generate structured and concise release notes of prioritised changes. We conduct human and automatic evaluations that reveal SmartNote outperforms or achieves comparable performance to DeepRelease (state-of-the-art), Conventional Changelog (off-the-shelf), and the projects' original release note across four quality metrics: completeness, clarity, conciseness, and organisation. In both evaluations, SmartNote ranked first for completeness and organisation, while clarity ranked first in the human evaluation. Furthermore, our controlled study reveals the significance of contextual awareness, while our applicability analysis confirms SmartNote's effectiveness across diverse projects. ![]() ![]() |
|
Wu, Jiarong |
![]() Hengcheng Zhu, Valerio Terragni, Lili Wei, Shing-Chi Cheung, Jiarong Wu, and Yepang Liu (Hong Kong University of Science and Technology, China; University of Auckland, New Zealand; McGill University, Canada; Southern University of Science and Technology, China) Mock assertions provide developers with a powerful means to validate program behaviors that are unobservable to test assertions. Despite their significance, they are rarely considered by automated test generation techniques. Effective generation of mock assertions requires understanding how they are used in practice. Although previous studies highlighted the importance of mock assertions, none provide insight into their usages. To bridge this gap, we conducted the first empirical study on mock assertions, examining their adoption, the characteristics of the verified method invocations, and their effectiveness in fault detection. Our analysis of 4,652 test cases from 11 popular Java projects reveals that mock assertions are mostly applied to validating specific kinds of method calls, such as those interacting with external resources and those reflecting whether a certain code path was traversed in systems under test. Additionally, we find that mock assertions complement traditional test assertions by ensuring the desired side effects have been produced, validating control flow logic, and checking internal computation results. Our findings contribute to a better understanding of mock assertion usages and provide a foundation for future related research, such as automated test generation that supports mock assertions. ![]() ![]() |
|
Wu, Mengzhou |
![]() Dezhi Ran, Yuan Cao, Yuzhe Guo, Yuetong Li, Mengzhou Wu, Simin Chen, Wei Yang, and Tao Xie (Peking University, China; Beijing Jiaotong University, China; University of Chicago, USA; University of Texas at Dallas, USA) ![]() |
|
Wu, Susheng |
![]() Susheng Wu, Ruisi Wang, Yiheng Cao, Bihuan Chen, Zhuotong Zhou, Yiheng Huang, JunPeng Zhao, and Xin Peng (Fudan University, China) Branching repositories facilitates efficient software development but can also inadvertently propagate vulnerabilities. When an original branch is patched, other unfixed branches remain vulnerable unless the patch is successfully ported. However, due to inherent discrepancies between branches, many patches cannot be directly applied and require manual intervention, which is time-consuming and leads to delays in patch porting, increasing vulnerability risks. Existing automated patch porting approaches are prone to errors, as they often overlook essential semantic and syntactic context of vulnerability and fail to detect or refine faulty patches. We propose Mystique, a novel LLM-based approach to address these limitations. Mystique first slices the semantic-related statements linked to the vulnerability while ensuring syntactic correctness, allowing it to extract the signatures for both the original patched function and the target vulnerable function. Mystique then utilizes a fine-tuned LLM to generate a fixed function, which is further iteratively checked and refined to ensure successful porting. Our evaluation shows that Mystique achieved a success rate of 0.954 at function level and of 0.924 at CVE level, outperforming state-of-the-art approaches by at least 13.2% at function level and 12.3% at CVE level. Our evaluation also demonstrates Mystique’s superior generality across various projects, bugs, and programming languages. Mystique successfully ported patches for 34 real-world vulnerable branches. ![]() ![]() |
|
Wu, Weibin |
![]() Weibin Wu, Haoxuan Hu, Zhaoji Fan, Yitong Qiao, Yizhan Huang, Yichen Li, Zibin Zheng, and Michael Lyu (Sun Yat-sen University, China; Chinese University of Hong Kong, China) Deep learning (DL) has revolutionized various software engineering tasks. Particularly, the emergence of AI code generators has pushed the boundaries of automatic programming to synthesize entire programs based on user-defined specifications in natural language. However, it remains a mystery if these AI code generators rely on the copy-and-paste programming practice, resulting in code clone concerns. In this work, to comprehensively study the code cloning behavior of AI code generators, we conduct an empirical study on three state-of-the-art commercial AI code generators to investigate the existence of all types of clones, which remains underexplored. Our experimental results show that the total Type-1 and Type-2 clone rates of the state-of-the-art commercial AI code generators can reach up to 7.50%, indicating marked code clone issues. Furthermore, it is observed that AI code generators risk infringing copyrights and propagating buggy and vulnerable code resulting from cloning code and show a certain degree of stability in generating code clones. ![]() ![]() Weibin Wu, Yuhang Cao, Ning Yi, Rongyi Ou, and Zibin Zheng (Sun Yat-sen University, China) Question answering (QA) is a fundamental task of large language models (LLMs), which requires LLMs to automatically answer human-posed questions in natural language. However, LLMs are known to distort facts and make non-factual statements (i.e., hallucinations) when dealing with QA tasks, which may affect the deployment of LLMs in real-life situations. In this work, we propose DrHall, a framework for detecting and reducing the factual hallucinations of black-box LLMs with metamorphic testing (MT). We believe that hallucinated answers are unstable. Therefore, when LLMs hallucinate, they are more likely to produce different answers if we use metamorphic testing to make LLMs re-execute the same task with different execution paths, which motivates the design of DrHall. The effectiveness of DrHall is evaluated empirically on three datasets, including a self-built dataset of natural language questions: FactHalluQA, as well as two datasets of programming questions: Refactory and LeetCode. The evaluation results confirm that DrHall can consistently outperform the state-of-the-art baselines, obtaining an average F1 score of over 0.856 for hallucination detection. For hallucination correction, DrHall can also outperform the state-of-the-art baselines, with an average hallucination correction rate of over 53%. We hope that our work can enhance the reliability of LLMs and provide new insights for the research of LLM hallucination mitigation. ![]() |
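The metamorphic-testing intuition behind DrHall (hallucinated answers tend to be unstable across reworded prompts) can be sketched as a consistency vote. This is not the authors' implementation; `ask_llm` is a hypothetical stand-in for any black-box LLM call, and the rewordings and agreement threshold are illustrative assumptions.

```python
# Illustrative sketch of metamorphic consistency checking (not DrHall's pipeline):
# re-ask the same question through reworded prompts and treat disagreement among
# the answers as a hint of hallucination.
from collections import Counter

def metamorphic_variants(question):
    # Simple rewordings that should preserve the expected answer.
    return [
        question,
        f"Answer concisely: {question}",
        f"{question} Think step by step, then give only the final answer.",
    ]

def detect_hallucination(question, ask_llm, agreement_threshold=0.75):
    answers = [ask_llm(v).strip().lower() for v in metamorphic_variants(question)]
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return {"suspected_hallucination": agreement < agreement_threshold,
            "majority_answer": top_answer, "agreement": agreement}

if __name__ == "__main__":
    canned = iter(["Paris", "paris", "Lyon"])          # toy stand-in responses
    print(detect_hallucination("What is the capital of France?", lambda _: next(canned)))
```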
|
Wu, Yiwen |
![]() Yiwen Wu, Yang Zhang, Tao Wang, Bo Ding, and Huaimin Wang (National University of Defense Technology, China; State Key Laboratory of Complex and Critical Software Environment, China) Docker building is a critical component of containerization in modern software development, automating the process of packaging and converting sources into container images. It is not uncommon to find that Docker build faults (DBFs) occur frequently across Docker-based projects, inducing non-negligible costs in development activities. DBF resolution is a challenging problem, and previous studies have demonstrated that developers spend non-trivial time in resolving encountered build faults. However, the characteristics of DBFs are still under-investigated, hindering practical solutions to build management in the Docker community. In this paper, to bridge this gap, we present a comprehensive study of real-world DBFs in practice. We collect and analyze a DBF dataset of 255 issues and 219 posts from GitHub, Stack Overflow, and Docker Forum. We investigate and construct characteristic taxonomies for the DBFs, including 15 symptoms, 23 root causes, and 35 fix patterns. Moreover, we study the fault distributions of symptoms and root causes, in terms of the different build types, i.e., Dockerfile builds and Docker-compose builds. Based on the results, we provide actionable implications and develop a knowledge-based application, which can potentially facilitate research and assist developers in improving Docker build management. ![]() |
|
Wu, Zheyuan |
![]() Alexi Turcotte and Zheyuan Wu (CISPA, Germany; Saarland University, Germany) ![]() |
|
Xia, Chunqiu Steven |
![]() Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang (University of Illinois at Urbana-Champaign, USA) Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless – an agentless approach to automatically resolve software development issues. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents at the time of paper submission! Agentless also achieves more than 50% solve rate when using Claude 3.5 Sonnet on the new SWE-bench Verified benchmark. In fact, Agentless has already been adopted by OpenAI as the go-to approach to showcase the real-world coding performance of both GPT-4o and the new o1 models; more recently, Agentless has also been used by DeepSeek to evaluate their newest DeepSeek V3 and R1 models. Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patches or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-𝑆 by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the currently overlooked potential of a simplistic, cost-effective technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction. We have open-sourced Agentless at: https://github.com/OpenAutoCoder/Agentless ![]() |
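A localization-repair-validation skeleton in the spirit of the three-phase process might look like the sketch below. It is not the open-sourced Agentless code (available at the URL above); `query_llm` and the test command are hypothetical stand-ins, and real patch validation involves considerably more care.

```python
# Schematic localize -> repair -> validate pipeline, loosely in the spirit of a
# three-phase, agentless workflow. Not the actual Agentless implementation.
import subprocess

def localize(issue_text, files, query_llm, top_n=3):
    """Ask the model to pick the most suspicious files for the issue."""
    listing = "\n".join(files)
    answer = query_llm(f"Issue:\n{issue_text}\n\nFiles:\n{listing}\n"
                       f"List the {top_n} most relevant file paths, one per line.")
    return [line.strip() for line in answer.splitlines() if line.strip() in files][:top_n]

def propose_patches(issue_text, file_path, source, query_llm, n_samples=4):
    """Sample several candidate diffs for one suspicious file."""
    prompt = (f"Issue:\n{issue_text}\n\nFile {file_path}:\n{source}\n"
              "Return a unified diff that fixes the issue.")
    return [query_llm(prompt) for _ in range(n_samples)]

def validate(patch, repo_dir, test_cmd):
    """Apply a candidate patch and keep it only if the test command passes."""
    applied = subprocess.run(["git", "apply", "-"], input=patch, text=True, cwd=repo_dir)
    if applied.returncode != 0:
        return False
    ok = subprocess.run(test_cmd, shell=True, cwd=repo_dir).returncode == 0
    subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir)  # revert either way
    return ok
```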
|
Xia, Xin |
![]() Yi Gao, Xing Hu, Xiaohu Yang, and Xin Xia (Zhejiang University, China) Test smells arise from poor design practices and insufficient domain knowledge, which can lower the quality of test code and make it harder to maintain and update. Manually refactoring of test smells is time-consuming and error-prone, highlighting the necessity for automated approaches. Current rule-based refactoring methods often struggle in scenarios not covered by predefined rules and lack the flexibility needed to handle diverse cases effectively. In this paper, we propose a novel approach called UTRefactor, a context-enhanced, LLM-based framework for automatic test refactoring in Java projects. UTRefactor extracts relevant context from test code and leverages an external knowledge base that includes test smell definitions, descriptions, and DSL-based refactoring rules. By simulating the manual refactoring process through a chain-of-thought approach, UTRefactor guides the LLM to eliminate test smells in a step-by-step process, ensuring both accuracy and consistency throughout the refactoring. Additionally, we implement a checkpoint mechanism to facilitate comprehensive refactoring, particularly when multiple smells are present. We evaluate UTRefactor on 879 tests from six open-source Java projects, reducing the number of test smells from 2,375 to 265, achieving an 89% reduction. UTRefactor outperforms direct LLM-based refactoring methods by 61.82% in smell elimination and significantly surpasses the performance of a rule-based test smell refactoring tool. Our results demonstrate the effectiveness of UTRefactor in enhancing test code quality while minimizing manual involvement. ![]() ![]() Junwei Zhang, Xing Hu, Shan Gao, Xin Xia, David Lo, and Shanping Li (Zhejiang University, China; Huawei, China; Singapore Management University, Singapore) Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, to categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest on two widely-used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of datasets contain noise, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. 
The results show that filtering noise positively influences the test generation ability of the models. Fine-tuning the four LLMs with the filtered Methods2Test dataset, on average, improves its performance by 67% in branch coverage, using the Defects4J benchmark. For the Atlas dataset, the four LLMs improve branch coverage by 39%. Additionally, filtering noise improves bug detection performance, resulting in a 21.42% increase in bugs detected by the generated tests. ![]() |
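Rule-based filtering of (focal method, test) pairs can be pictured with a few toy checks, for example dropping tests with unbalanced braces, no assertion, or no call to the focal method. These are illustrative heuristics only, not CleanTest's syntax, relevance, and coverage filters.

```python
# Toy sketch of rule-based dataset filtering (not CleanTest's actual filters).
# It drops pairs whose test body looks malformed, has no assertion, or never
# references the focal method's name.
import re

def looks_syntactically_valid(test_src):
    return test_src.count("{") == test_src.count("}") and "void " in test_src

def has_assertion(test_src):
    return re.search(r"\bassert\w*\s*\(", test_src) is not None

def references_focal(test_src, focal_name):
    return re.search(rf"\b{re.escape(focal_name)}\s*\(", test_src) is not None

def clean(pairs):
    """pairs: iterable of (focal_method_name, test_source) tuples."""
    return [(name, src) for name, src in pairs
            if looks_syntactically_valid(src)
            and has_assertion(src)
            and references_focal(src, name)]

if __name__ == "__main__":
    noisy = [
        ("add", "@Test void testAdd() { assertEquals(3, calc.add(1, 2)); }"),
        ("add", "@Test void testNothing() { /* TODO */ }"),   # no assertion, no call
    ]
    print(len(clean(noisy)))   # 1
```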
|
Xiang, Debin |
![]() Siwei Tan, Liqiang Lu, Debin Xiang, Tianyao Chu, Congliang Lang, Jintao Chen, Xing Hu, and Jianwei Yin (Zhejiang University, China) Quantum programs provide exponential speedups compared to classical programs in certain areas, but they also inevitably encounter logical faults. Automatically repairing quantum programs is much more challenging than repairing classical programs due to the non-replicability of data, the vast search space of program inputs, and the new programming paradigm. Existing works based on semantic-based or learning-based program repair techniques are fundamentally limited in repairing efficiency and effectiveness. In this work, we propose HornBro, an efficient framework for automated quantum program repair. The key insight of HornBro lies in the homotopy-like method, which iteratively switches between the classical and quantum parts. This approach allows the repair tasks to be efficiently offloaded to the most suitable platforms, enabling a progressive convergence toward the correct program. We start by designing an implication assertion pragma to enable rigorous specifications of quantum program behavior, which helps to generate a quantum test suite automatically. This suite leverages the orthonormal bases of quantum programs to accommodate different encoding schemes. Given a fixed number of test cases, it allows the maximum input coverage of potential counter-example candidates. Then, we develop a Clifford approximation method with an SMT-based search to transform the patch localization program into a symbolic reasoning problem. Finally, we offload the computationally intensive repair of gate parameters to quantum hardware by leveraging the differentiability of quantum gates. Experiments suggest that HornBro increases the repair success rate by more than 62.5% compared to the existing repair techniques, supporting more types of quantum bugs. It also achieves 35.7× speedup in the repair and 99.9% gate reduction of the patch. ![]() ![]() |
|
Xiang, Xuwen |
![]() Junming Cao, Xuwen Xiang, Mingfei Cheng, Bihuan Chen, Xinyan Wang, You Lu, Chaofeng Sha, Xiaofei Xie, and Xin Peng (Fudan University, China; Singapore Management University, Singapore) ![]() |
|
Xiao, Ying |
![]() Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, and Yang Liu (Nanyang Technological University, Singapore; Peking University, China; King's College London, UK; Fudan University, China) Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods. ![]() |
|
Xiao, Yuan |
![]() Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Zhenyu Chen, Baowen Xu, and Yang Liu (Nanyang Technological University, Singapore; Nanjing University, China) Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. This poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could jeopardize life safety. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper, we innovatively propose a backdoor elimination technique for secure code understanding, called EliBadCode. EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of non-backdoor (adversarial) perturbations for subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. Backdoor triggers can be viewed as backdoor (adversarial) perturbations. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode substantially decreases the average Attack Success Rate (ASR) of the advanced backdoor attack from 99.76% to 2.64%, significantly outperforming the three baselines. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline). ![]() |
|
Xiaofei, Xie |
![]() Jiongchi Yu, Xie Xiaofei, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, and Frank Liau (Singapore Management University, Singapore; Tianjin University, China; Zhejiang University, China; Shanghai Jiao Tong University, China; University of Tokyo, Japan; University of Alberta, Canada; Southern Cross University, Australia; Government Technology Agency of Singapore, Singapore) ![]() ![]() Jiaolong Kong, Xie Xiaofei, and Shangqing Liu (Singapore Management University, Singapore; Nanyang Technological University, Singapore) ![]() |
|
Xie, Difan |
![]() Zuobin Wang, Zhiyuan Wan, Yujing Chen, Yun Zhang, David Lo, Difan Xie, and Xiaohu Yang (Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang), China; Hangzhou City University, China; Singapore Management University, Singapore) In smart contract development, practitioners frequently reuse code to reduce development effort and avoid reinventing the wheel. This reused code, whether identical or similar to its original source, is referred to as a code clone. Unintentional code cloning can propagate flaws and vulnerabilities, potentially undermining the reliability and maintainability of software systems. Previous studies have identified a significant prevalence of code clones in Solidity smart contracts on the Ethereum blockchain. To mitigate the risks posed by code clones, clone detection has emerged as an active field of research and practice in software engineering. Recent studies have extended existing techniques or proposed novel techniques tailored to the unique syntactic and semantic features of Solidity. Nonetheless, the evaluations of existing techniques, whether conducted by their original authors or independent researchers, involve codebases in various programming languages and utilize different versions of the corresponding tools. The resulting inconsistency makes direct comparisons of the evaluation results impractical, and hinders the ability to derive meaningful conclusions across the evaluations. There remains a lack of clarity regarding the effectiveness of these techniques in detecting smart contract clones, and whether it is feasible to combine different techniques to achieve scalable yet accurate detection of code clones in smart contracts. To address this gap, we conduct a comprehensive empirical study that evaluates the effectiveness and scalability of five representative clone detection techniques on 33,073 verified Solidity smart contracts, along with a benchmark we curate, in which we manually label 72,010 pairs of Solidity smart contracts with clone tags. Moreover, we explore the potential of combining different techniques to achieve optimal performance of code clone detection for smart contracts, and propose SourceREClone, a framework designed for the refined integration of different techniques, which achieves a 36.9% improvement in F1 score compared to a straightforward combination of the state of the art. Based on our findings, we discuss implications, provide recommendations for practitioners, and outline directions for future research. ![]() |
|
Xie, Maoyi |
![]() Ziqiao Kong, Cen Zhang, Maoyi Xie, Ming Hu, Yue Xue, Ye Liu, Haijun Wang, and Yang Liu (Nanyang Technological University, Singapore; Singapore Management University, Singapore; MetaTrust Labs, Singapore; Xi'an Jiaotong University, China) Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: 1. a lack of profit-centric techniques for expediting detection and, 2. insufficient automation in maximizing the profitability of discovered vulnerabilities, leaving the analysis to human experts. To address these gaps, we have developed VERITE, a profit-centric smart contract fuzzing framework that not only effectively detects those profitable vulnerabilities but also maximizes the exploited profits. VERITE has three key features: 1. DeFi action-based mutators for boosting the exploration of transactions with different fund flows; 2. potentially profitable candidates identification criteria, which checks whether the input has caused abnormal fund flow properties during testing; 3. a gradient descent-based profit maximization strategy for these identified candidates. VERITE is fully developed from scratch and evaluated on a dataset consisting of 61 exploited real-world DeFi projects with an average of over 1.1 million dollars loss. The results show that VERITE can automatically extract more than 18 million dollars in total and is significantly better than state-of-the-art fuzzer ItyFuzz in both detection (29/10) and exploitation (134 times more profits gained on average). Remarkably, in 12 targets, it gains more profits than real-world attacking exploits (1.01 to 11.45 times more). VERITE is also applied by auditors in contract auditing, where 6 (5 high severity) zero-day vulnerabilities are found with over $2,500 bounty rewards. ![]() ![]() ![]() |
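The profit-maximization step can be caricatured as local search over an exploit-candidate parameter against a profit oracle. The paper describes a gradient descent-based strategy over real transaction candidates; the sketch below is only a much-simplified coordinate search on an invented profit curve.

```python
# Much-simplified local search on a toy profit function, gesturing at the idea of
# tuning a candidate's parameter to maximize extracted profit (not VERITE's strategy).

def maximize_profit(profit_fn, x0, step=1.0, iters=100):
    """Greedy coordinate search: nudge the single parameter up or down while it helps."""
    x, best = x0, profit_fn(x0)
    for _ in range(iters):
        candidates = [(profit_fn(x + step), x + step), (profit_fn(x - step), x - step)]
        gain, nxt = max(candidates)
        if gain <= best:
            step /= 2.0            # shrink the step when no direction improves
            if step < 1e-6:
                break
        else:
            best, x = gain, nxt
    return x, best

if __name__ == "__main__":
    # Toy "profit" curve with an interior optimum at amount = 42.
    toy_profit = lambda amount: -(amount - 42.0) ** 2 + 1000.0
    print(maximize_profit(toy_profit, x0=1.0))   # converges to (42.0, 1000.0)
```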
|
Xie, Shuaiyu |
![]() Shuaiyu Xie, Jian Wang, Maodong Li, Peiran Chen, Jifeng Xuan, and Bing Li (Wuhan University, China; Zhongguancun Laboratory, China) Distributed tracing is a pivotal technique for software operators to understand and diagnose issues within microservice-based systems, offering a comprehensive view of user requests propagated through various services. However, the unprecedented volume of traces imposes expensive storage and analytical burdens on online systems. Conventional tracing approaches typically rely on random sampling with a fixed probability for each trace, which risks missing valuable traces. Several tail-based sampling methods have thus been proposed to sample traces based on their content. Nevertheless, these methods primarily evaluate traces on an individual basis, neglecting the collective attributes of the sample set in terms of comprehensiveness, balance, and consistency. To address these issues, we propose TracePicker, an optimization-based online sampler designed to enhance the quality of sampled data while mitigating storage burden. TracePicker employs a streaming anomaly detector to capture and retain anomalous traces that are crucial for troubleshooting. For normal traces, the sampling process is segmented into quota allocation and group sampling, both formulated as integer programming problems. By solving these problems using dynamic programming and evolution algorithms, TracePicker selects a high-quality subset of data, minimizing overall information loss. Experimental results demonstrate that TracePicker outperforms existing tail-based sampling methods in terms of both sampling quality and time consumption. ![]() |
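Quota allocation under a sampling budget, one of the two integer programs mentioned above, can be sketched as a small dynamic program: each trace group reports a utility for every possible quota, and the DP picks per-group quotas that fit the budget with maximal total utility. This is a generic formulation sketch, not TracePicker's solver; the utility table is a toy assumption.

```python
# Sketch of budgeted quota allocation via dynamic programming, loosely inspired by
# the formulation described above (not TracePicker's implementation).

def allocate_quotas(utilities, budget):
    """utilities[i][q] = value of giving group i a quota of q traces (q = 0..len-1)."""
    n = len(utilities)
    NEG = float("-inf")
    best = [[NEG] * (budget + 1) for _ in range(n + 1)]
    choice = [[0] * (budget + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for spent in range(budget + 1):
            for q in range(min(spent, len(utilities[i - 1]) - 1) + 1):
                if best[i - 1][spent - q] == NEG:
                    continue
                val = best[i - 1][spent - q] + utilities[i - 1][q]
                if val > best[i][spent]:
                    best[i][spent], choice[i][spent] = val, q
    spent = max(range(budget + 1), key=lambda s: best[n][s])
    quotas = []
    for i in range(n, 0, -1):
        q = choice[i][spent]
        quotas.append(q)
        spent -= q
    return list(reversed(quotas))

if __name__ == "__main__":
    # Toy utilities with diminishing returns; budget of 4 traces across 3 groups.
    utils = [[0, 5, 7, 8], [0, 4, 6, 7], [0, 2, 3, 3]]
    print(allocate_quotas(utils, budget=4))   # prints [2, 2, 0]; [2, 1, 1] ties at utility 13
```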
|
Xie, Tao |
![]() Dezhi Ran, Yuan Cao, Yuzhe Guo, Yuetong Li, Mengzhou Wu, Simin Chen, Wei Yang, and Tao Xie (Peking University, China; Beijing Jiaotong University, China; University of Chicago, USA; University of Texas at Dallas, USA) ![]() |
|
Xie, Xiaofei |
![]() Junming Cao, Xuwen Xiang, Mingfei Cheng, Bihuan Chen, Xinyan Wang, You Lu, Chaofeng Sha, Xiaofei Xie, and Xin Peng (Fudan University, China; Singapore Management University, Singapore) ![]() |
|
Xin, Haojie |
![]() Songyang Yan, Xiaodong Zhang, Kunkun Hao, Haojie Xin, Yonggang Luo, Jucheng Yang, Ming Fan, Chao Yang, Jun Sun, and Zijiang Yang (Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China) The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS's reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulators using various ADSs. We verify OSG's ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels. ![]() ![]() |
|
Xing, Zezhong |
![]() Haibo Wang, Zezhong Xing, Chengnian Sun, Zheng Wang, and Shin Hwei Tan (Concordia University, Canada; Southern University of Science and Technology, China; University of Waterloo, Canada; University of Leeds, UK) By reducing the number of lines of code, program simplification reduces code complexity, improving software maintainability and code comprehension. While several existing techniques can be used for automatic program simplification, there is no consensus on the effectiveness of these approaches. We present the first study on how real-world developers simplify programs in open-source software projects. By analyzing 382 pull requests from 296 projects, we summarize the types of program transformations used, the motivations behind simplifications, and the set of program transformations that have not been covered by existing refactoring types. As a result of our study, we submitted eight bug reports to a widely used refactoring detection tool, RefactoringMiner, where seven were fixed. Our study also identifies gaps in applying existing approaches for automating program simplification and outlines the criteria for designing automatic program simplification techniques. In light of these observations, we propose SimpT5, a tool to automatically produce simplified programs that are semantically equivalent programs with reduced lines of code. SimpT5 is trained on our collected dataset of 92,485 simplified programs with two heuristics: (1) modified line localization that encodes lines changed in simplified programs, and (2) checkers that measure the quality of generated programs. Experimental results show that SimpT5 outperforms prior approaches in automating developer-induced program simplification. ![]() ![]() |
|
Xing, Zhenchang |
![]() Yanqi Su, Zhenchang Xing, Chong Wang, Chunyang Chen, Sherry (Xiwei) Xu, Qinghua Lu, and Liming Zhu (Australian National University, Australia; CSIRO's Data61, Australia; Nanyang Technological University, Singapore; TU Munich, Germany; UNSW, Australia) Exploratory testing (ET) harnesses testers' knowledge, creativity, and experience to create varying tests that uncover unexpected bugs from the end-user's perspective. Although ET has proven effective in system-level testing of interactive systems, the need for manual execution has hindered large-scale adoption. In this work, we explore the feasibility, challenges, and road ahead of automated scenario-based ET (a.k.a. soap opera testing). We conduct a formative study, identifying key insights for effective manual soap opera testing and challenges in automating the process. We then develop a multi-agent system leveraging LLMs and a Scenario Knowledge Graph (SKG) to automate soap opera testing. The system consists of three multi-modal agents, Planner, Player, and Detector, that collaborate to execute tests and identify potential bugs. Experimental results demonstrate the potential of automated soap opera testing, but there remains a significant gap compared to manual execution, especially regarding under-explored scenario boundaries and incorrectly identified bugs. Based on these observations, we envision the road ahead for automated soap opera testing, focusing on three key aspects: the synergy of neural and symbolic approaches, human-AI co-learning, and the integration of soap opera testing with broader software engineering practices. These insights aim to guide and inspire future research. ![]() |
|
Xu, Baowen |
![]() Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Zhenyu Chen, Baowen Xu, and Yang Liu (Nanyang Technological University, Singapore; Nanjing University, China) Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. This poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could jeopardize life safety. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper, we innovatively propose a backdoor elimination technique for secure code understanding, called EliBadCode. EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of non-backdoor (adversarial) perturbations for subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. Backdoor triggers can be viewed as backdoor (adversarial) perturbations. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode substantially decreases the average Attack Success Rate (ASR) of the advanced backdoor attack from 99.76% to 2.64%, significantly outperforming the three baselines. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline). ![]() |
|
Xu, Fengyuan |
![]() Xing Su, Hanzhong Liang, Hao Wu, Ben Niu, Fengyuan Xu, and Sheng Zhong (Nanjing University, China; Institute of Information Engineering at Chinese Academy of Sciences, China) Understanding Ethereum smart contract bytecode is essential for ensuring cryptoeconomics security. However, existing decompilers primarily convert bytecode into pseudocode, which is not easily comprehensible for general users, potentially leading to misunderstanding of contract behavior and increased vulnerability to scams or exploits. In this paper, we propose DiSCo, the first LLM-based EVM decompilation pipeline, which aims to enable LLMs to understand the opaque bytecode and lift it into smart contract code. DiSCo introduces three core technologies. First, a logic-invariant intermediate representation is proposed to reproject the low-level bytecode into high-level abstracted units. The second technique involves semantic enhancement based on a novel type-aware graph model to infer stripped variables during compilation, enhancing the lifting effect. The third technology is a flexible method incorporating code specifications to construct LLM-comprehensible prompts for source code generation. Extensive experiments illustrate that our generated code achieves a high compilability rate of 75%, with a differential fuzzing pass rate averaging 50%. Manual validation results further indicate that the generated Solidity contracts significantly outperform baseline methods in tasks such as code comprehension and attack reproduction. ![]() |
|
Xu, Guoai |
![]() Haodong Li, Xiao Cheng, Guohan Zhang, Guosheng Xu, Guoai Xu, and Haoyu Wang (Beijing University of Posts and Telecommunications, China; UNSW, Australia; Harbin Institute of Technology, Shenzhen, China; Huazhong University of Science and Technology, China) Learning-based Android malware detection has earned significant recognition across industry and academia, yet its effectiveness hinges on the accuracy of labeled training data. Manual labeling, being prohibitively expensive, has prompted the use of automated methods, such as leveraging anti-virus engines like VirusTotal, which unfortunately introduces mislabeling, aka "label noise". The state-of-the-art label noise reduction approach, MalWhiteout, can effectively reduce random label noise but underperforms in mitigating real-world emergent malware (EM) label noise stemming from newly emerging Android malware variants overlooked by VirusTotal. To tackle this, we conceptualize EM label noise detection as an anomaly detection problem and introduce a novel tool, MalCleanse, that surpasses MalWhiteout's ability to address EM label noise. MalCleanse combines uncertainty estimation with unsupervised anomaly detection, identifying samples with high uncertainty as mislabeled, thereby enhancing its capability to remove EM label noise. Our experimental results demonstrate a significant reduction in EM label noise by approximately 25.25%, achieving an F1 Score of 80.32% for label noise detection at a noise ratio of 40.61%. Notably, MalCleanse outperforms MalWhiteout with an increase of 40.9% in overall F1 score for mitigating EM label noise. This paper pioneers the integration of deep neural network model uncertainty to refine label accuracy, thereby enhancing the reliability of malware detection systems. Our approach represents a significant step forward in addressing the challenges posed by emergent malware in automated labeling systems. ![]() |
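The combination of uncertainty estimation and outlier flagging can be sketched generically: run repeated predictions (for example, an ensemble or MC-dropout passes), compute predictive entropy, and flag high-entropy samples whose predicted class disagrees with the given label. This is not MalCleanse's pipeline; the quantile threshold, toy probabilities, and binary labels are illustrative assumptions.

```python
# Simplified sketch of uncertainty-based noisy-label flagging (not MalCleanse itself).
# Samples whose repeated predictions are both highly uncertain and in disagreement
# with their given label are flagged as suspect.
import numpy as np

def predictive_entropy(prob_stack):
    """prob_stack: (passes, samples, classes) probabilities from repeated predictions."""
    mean_probs = prob_stack.mean(axis=0)                       # (samples, classes)
    return -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=1)

def flag_suspect_labels(prob_stack, given_labels, entropy_quantile=0.9):
    mean_probs = prob_stack.mean(axis=0)
    entropy = predictive_entropy(prob_stack)
    disagrees = mean_probs.argmax(axis=1) != np.asarray(given_labels)
    return (entropy >= np.quantile(entropy, entropy_quantile)) & disagrees

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    confident = np.tile([[0.95, 0.05]], (5, 50, 1))            # 50 clean-looking samples
    uncertain = rng.dirichlet([1, 1], size=(5, 10))            # 10 wobbly samples
    probs = np.concatenate([confident, uncertain], axis=1)
    labels = [0] * 60                                          # all labeled benign
    print(flag_suspect_labels(probs, labels).sum(), "samples flagged for review")
```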
|
Xu, Guosheng |
![]() Haodong Li, Xiao Cheng, Guohan Zhang, Guosheng Xu, Guoai Xu, and Haoyu Wang (Beijing University of Posts and Telecommunications, China; UNSW, Australia; Harbin Institute of Technology, Shenzhen, China; Huazhong University of Science and Technology, China) Learning-based Android malware detection has earned significant recognition across industry and academia, yet its effectiveness hinges on the accuracy of labeled training data. Manual labeling, being prohibitively expensive, has prompted the use of automated methods, such as leveraging anti-virus engines like VirusTotal, which unfortunately introduces mislabeling, aka "label noise". The state-of-the-art label noise reduction approach, MalWhiteout, can effectively reduce random label noise but underperforms in mitigating real-world emergent malware (EM) label noise stemming from newly emerging Android malware variants overlooked by VirusTotal. To tackle this, we conceptualize EM label noise detection as an anomaly detection problem and introduce a novel tool, MalCleanse, that surpasses MalWhiteout's ability to address EM label noise. MalCleanse combines uncertainty estimation with unsupervised anomaly detection, identifying samples with high uncertainty as mislabeled, thereby enhancing its capability to remove EM label noise. Our experimental results demonstrate a significant reduction in EM label noise by approximately 25.25%, achieving an F1 Score of 80.32% for label noise detection at a noise ratio of 40.61%. Notably, MalCleanse outperforms MalWhiteout with an increase of 40.9% in overall F1 score for mitigating EM label noise. This paper pioneers the integration of deep neural network model uncertainty to refine label accuracy, thereby enhancing the reliability of malware detection systems. Our approach represents a significant step forward in addressing the challenges posed by emergent malware in automated labeling systems. ![]() |
|
Xu, Qiyuan |
![]() Xiaokun Luan, David Sanan, Zhe Hou, Qiyuan Xu, Chengwei Liu, Yufan Cai, Yang Liu, and Meng Sun (Peking University, China; Singapore Institute of Technology, Singapore; Griffith University, Australia; Nanyang Technological University, Singapore; China-Singapore International Joint Research Institute, China; National University of Singapore, Singapore) Proof assistants are software tools for formal modeling and verification of software, hardware, design, and mathematical proofs. Due to the growing complexity and scale of formal proofs, compatibility issues frequently arise when using different versions of proof assistants. These issues result in broken proofs, disrupting the maintenance of formalized theories and hindering the broader dissemination of results within the community. Although existing works have proposed techniques to address specific types of compatibility issues, the overall characteristics of these issues remain largely unexplored. To address this gap, we conduct the first extensive empirical study to characterize compatibility issues, using Isabelle as a case study. We develop a regression testing framework to automatically collect compatibility issues from the Archive of Formal Proofs, the largest repository of formal proofs in Isabelle. By analyzing 12,079 collected issues, we identify their types and symptoms and further investigate their root causes. We also extract updated proofs that address these issues to understand the applied resolution strategies. Our study provides an in-depth understanding of compatibility issues in proof assistants, offering insights that support the development of effective techniques to mitigate these issues. ![]() ![]() |
|
Xu, Sherry (Xiwei) |
![]() Yanqi Su, Zhenchang Xing, Chong Wang, Chunyang Chen, Sherry (Xiwei) Xu, Qinghua Lu, and Liming Zhu (Australian National University, Australia; CSIRO's Data61, Australia; Nanyang Technological University, Singapore; TU Munich, Germany; UNSW, Australia) Exploratory testing (ET) harnesses testers' knowledge, creativity, and experience to create varying tests that uncover unexpected bugs from the end-user's perspective. Although ET has proven effective in system-level testing of interactive systems, the need for manual execution has hindered large-scale adoption. In this work, we explore the feasibility, challenges, and road ahead of automated scenario-based ET (a.k.a. soap opera testing). We conduct a formative study, identifying key insights for effective manual soap opera testing and challenges in automating the process. We then develop a multi-agent system leveraging LLMs and a Scenario Knowledge Graph (SKG) to automate soap opera testing. The system consists of three multi-modal agents, Planner, Player, and Detector, that collaborate to execute tests and identify potential bugs. Experimental results demonstrate the potential of automated soap opera testing, but there remains a significant gap compared to manual execution, especially regarding under-explored scenario boundaries and incorrectly identified bugs. Based on these observations, we envision the road ahead for automated soap opera testing, focusing on three key aspects: the synergy of neural and symbolic approaches, human-AI co-learning, and the integration of soap opera testing with broader software engineering practices. These insights aim to guide and inspire future research. ![]() |
|
Xu, Shouhuai |
![]() Junyao Ye, Zhen Li, Xi Tang, Deqing Zou, Shouhuai Xu, Weizhong Qiang, and Hai Jin (Huazhong University of Science and Technology, China; University of Colorado at Colorado Springs, USA) ![]() |
|
Xu, Zhenyang |
![]() Weiqi Lu, Yongqiang Tian, Xiaohan Zhong, Haoyang Ma, Zhenyang Xu, Shing-Chi Cheung, and Chengnian Sun (Hong Kong University of Science and Technology, China; University of Waterloo, Canada) Data visualization (DataViz) libraries play a crucial role in presentation, data analysis, and application development, underscoring the importance of their accuracy in transforming data into visual representations. Incorrect visualizations can adversely impact user experience, distort information conveyance, and influence user perception and decision-making processes. Visual bugs in these libraries can be particularly insidious as they may not cause obvious errors like crashes, but instead mislead users of the underlying data graphically, resulting in wrong decision making. Consequently, a good understanding of the unique characteristics of bugs in DataViz libraries is essential for researchers and developers to detect and fix bugs in DataViz libraries. This study presents the first comprehensive analysis of bugs in DataViz libraries, examining 564 bugs collected from five widely-used libraries. Our study systematically analyzes their symptoms and root causes, and provides a detailed taxonomy. We found that incorrect/inaccurate plots are pervasive in DataViz libraries and incorrect graphic computation is the major root cause, which necessitates further automated testing methods for DataViz libraries. Moreover, we identified eight key steps to trigger such bugs and two test oracles specific to DataViz libraries, which may inspire future research in designing effective automated testing techniques. Furthermore, with the recent advancements in Vision Language Models (VLMs), we explored the feasibility of applying these models to detect incorrect/inaccurate plots. The results show that the effectiveness of VLMs in bug detection varies from 29% to 57%, depending on the prompts, and adding more information in prompts does not necessarily increase the effectiveness. Our findings offer valuable insights into the nature and patterns of bugs in DataViz libraries, providing a foundation for developers and researchers to improve library reliability, and ultimately benefit more accurate and reliable data visualizations across various domains. ![]() |
|
Xuan, Jifeng |
![]() Shuaiyu Xie, Jian Wang, Maodong Li, Peiran Chen, Jifeng Xuan, and Bing Li (Wuhan University, China; Zhongguancun Laboratory, China) Distributed tracing is a pivotal technique for software operators to understand and diagnose issues within microservice-based systems, offering a comprehensive view of user requests propagated through various services. However, the unprecedented volume of traces imposes expensive storage and analytical burdens on online systems. Conventional tracing approaches typically rely on random sampling with a fixed probability for each trace, which risks missing valuable traces. Several tail-based sampling methods have thus been proposed to sample traces based on their content. Nevertheless, these methods primarily evaluate traces on an individual basis, neglecting the collective attributes of the sample set in terms of comprehensiveness, balance, and consistency. To address these issues, we propose TracePicker, an optimization-based online sampler designed to enhance the quality of sampled data while mitigating storage burden. TracePicker employs a streaming anomaly detector to capture and retain anomalous traces that are crucial for troubleshooting. For normal traces, the sampling process is segmented into quota allocation and group sampling, both formulated as integer programming problems. By solving these problems using dynamic programming and evolution algorithms, TracePicker selects a high-quality subset of data, minimizing overall information loss. Experimental results demonstrate that TracePicker outperforms existing tail-based sampling methods in terms of both sampling quality and time consumption. ![]() |
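The quota-allocation step described in the TracePicker abstract above lends itself to a small worked illustration. The sketch below is only a loose reading of the idea: it splits a fixed sampling budget across trace groups with a dynamic program, and the group sizes, weights, and toy information-loss objective are invented for illustration rather than the paper's actual formulation.

```python
"""Toy dynamic-programming quota allocation, loosely inspired by the
TracePicker abstract above. The loss model and group data are invented
for illustration; the paper's actual objective and constraints differ."""

def allocate_quotas(group_sizes, weights, budget):
    """Split `budget` sampled traces across groups to minimize a toy
    information-loss objective: weight * (dropped fraction) per group."""
    n_groups = len(group_sizes)
    # dp[spent] = (total loss, per-group allocation so far) after processing some groups
    dp = {0: (0.0, [])}
    for g in range(n_groups):
        new_dp = {}
        for spent, (loss, alloc) in dp.items():
            for k in range(0, min(group_sizes[g], budget - spent) + 1):
                dropped_frac = 1.0 - k / group_sizes[g]
                cand_loss = loss + weights[g] * dropped_frac
                key = spent + k
                if key not in new_dp or cand_loss < new_dp[key][0]:
                    new_dp[key] = (cand_loss, alloc + [k])
        dp = new_dp
    # pick the lowest-loss allocation among all budget usages
    return min(dp.values(), key=lambda t: t[0])

if __name__ == "__main__":
    sizes = [120, 40, 15]        # traces observed per group (made up)
    weights = [1.0, 2.0, 3.0]    # rarer groups weighted higher (made up)
    loss, quotas = allocate_quotas(sizes, weights, budget=50)
    print("quotas per group:", quotas, "toy loss:", round(loss, 3))
```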
|
Xue, Jingling |
![]() Liqing Cao, Haofeng Li, Chenghang Shi, Jie Lu, Haining Meng, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) ![]() |
|
Xue, Yue |
![]() Ziqiao Kong, Cen Zhang, Maoyi Xie, Ming Hu, Yue Xue, Ye Liu, Haijun Wang, and Yang Liu (Nanyang Technological University, Singapore; Singapore Management University, Singapore; MetaTrust Labs, Singapore; Xi'an Jiaotong University, China) Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: 1. a lack of profit-centric techniques for expediting detection and, 2. insufficient automation in maximizing the profitability of discovered vulnerabilities, leaving the analysis to human experts. To address these gaps, we have developed VERITE, a profit-centric smart contract fuzzing framework that not only effectively detects those profitable vulnerabilities but also maximizes the exploited profits. VERITE has three key features: 1. DeFi action-based mutators for boosting the exploration of transactions with different fund flows; 2. potentially profitable candidates identification criteria, which checks whether the input has caused abnormal fund flow properties during testing; 3. a gradient descent-based profit maximization strategy for these identified candidates. VERITE is fully developed from scratch and evaluated on a dataset consisting of 61 exploited real-world DeFi projects with an average of over 1.1 million dollars loss. The results show that VERITE can automatically extract more than 18 million dollars in total and is significantly better than state-of-the-art fuzzer ItyFuzz in both detection (29/10) and exploitation (134 times more profits gained on average). Remarkably, in 12 targets, it gains more profits than real-world attacking exploits (1.01 to 11.45 times more). VERITE is also applied by auditors in contract auditing, where 6 (5 high severity) zero-day vulnerabilities are found with over $2,500 bounty rewards. ![]() ![]() ![]() |
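The VERITE abstract above mentions a gradient descent-based profit maximization strategy for candidate exploits. The sketch below only illustrates the general shape of such a step under heavy simplification: exploit parameters are treated as continuous values and hill-climbed against a black-box profit estimate with finite-difference gradients. The simulate_profit function is a hypothetical stand-in; the real system replays candidate transactions against the contract.

```python
"""Finite-difference hill climbing on a black-box profit function.
A simplified stand-in for the gradient-based profit maximization the
VERITE abstract above describes; `simulate_profit` is hypothetical."""

def simulate_profit(params):
    # Stand-in for replaying the candidate exploit with these parameters
    # and measuring the attacker's net fund gain. Here: a toy concave curve.
    x, y = params
    return -(x - 3.0) ** 2 - 0.5 * (y - 7.0) ** 2 + 10.0

def maximize_profit(params, steps=200, lr=0.1, eps=1e-3):
    params = list(params)
    for _ in range(steps):
        grad = []
        for i in range(len(params)):
            bumped = params[:]
            bumped[i] += eps
            grad.append((simulate_profit(bumped) - simulate_profit(params)) / eps)
        params = [p + lr * g for p, g in zip(params, grad)]
    return params, simulate_profit(params)

if __name__ == "__main__":
    best_params, best_profit = maximize_profit([0.0, 0.0])
    print("best parameters:", [round(p, 2) for p in best_params],
          "estimated profit:", round(best_profit, 2))
```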
|
Yadavally, Aashish |
![]() Hridya Dhulipala, Aashish Yadavally, Smit Soneshbhai Patel, and Tien N. Nguyen (University of Texas at Dallas, USA) While LLMs excel in understanding source code and descriptive texts for tasks like code generation, code completion, etc., they exhibit weaknesses in predicting dynamic program behavior, such as code coverage and runtime error detection, which typically require program execution. Aiming to advance the capability of LLMs in reasoning and predicting the program behavior at runtime, we present CRISPE (short for Coverage Rationalization and Intelligent Selection ProcedurE), a novel approach for code coverage prediction. CRISPE guides an LLM in simulating program execution via an execution plan based on two key factors: (1) program semantics of each statement type, and (2) the observation of the set of covered statements at the current “execution” step relative to all feasible code coverage options. We formulate code coverage prediction as a process of semantic-guided execution-based planning, where feasible coverage options are utilized to assess whether the LLM is heading in the correct reasoning. We enhance the traditional generative task with the retrieval-based framework on feasible options of code coverage. Our experimental results show that CRISPE achieves high accuracy in coverage prediction in terms of both exact-match and statement-match coverage metrics, improving over the baselines. We also show that with semantic-guiding and dynamic reasoning from CRISPE, the LLM generates more correct planning steps. To demonstrate CRISPE’s usefulness, we used it in the downstream task of statically detecting runtime error(s) in incomplete code snippets with the given inputs. ![]() ![]() Yi Li, Hridya Dhulipala, Aashish Yadavally, Xiaokai Rong, Shaohua Wang, and Tien N. Nguyen (University of Texas at Dallas, USA; Central University of Finance and Economics, China) Although Large Language Models (LLMs) are highly proficient in understanding source code and descriptive texts, they have limitations in reasoning on dynamic program behaviors, such as execution trace and code coverage prediction, and runtime error prediction, which usually require actual program execution. To advance the ability of LLMs in predicting dynamic behaviors, we leverage the strengths of both approaches, Program Analysis (PA) and LLM, in building PredEx, a predictive executor for Python. Our principle is a blended analysis between PA and LLM to use PA to guide the LLM in predicting execution traces. We break down the task of predictive execution into smaller sub-tasks and leverage the deterministic nature when an execution order can be deterministically decided. When it is not certain, we use predictive backward slicing per variable, i.e., slicing the prior trace to only the parts that affect each variable separately breaks up the valuation prediction into significantly simpler problems. Our empirical evaluation on real-world datasets shows that PredEx achieves 31.5–47.1% relatively higher accuracy in predicting full execution traces than the state-of-the-art models. It also produces 8.6–53.7% more correct execution trace prefixes than those baselines. In predicting next executed statements, its relative improvement over the baselines is 15.7–102.3%. Finally, we show PredEx’s usefulness in two tasks: static code coverage analysis and static prediction of run-time errors for (in)complete code. ![]() |
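The PredEx abstract above leans on predictive backward slicing per variable. The sketch below shows the core of a backward slice over an already-recorded assignment trace; the trace format and the example program are invented, and the paper's actual slicing operates on predicted traces with full program analysis.

```python
"""Toy backward slice over a recorded assignment trace, in the spirit of
the per-variable slicing described in the PredEx abstract above.
The trace format and example are invented for illustration."""

def backward_slice(trace, variable):
    """Return the trace entries that (transitively) affect `variable`.
    Each entry is (line_no, assigned_var, set_of_used_vars), in execution order."""
    relevant = {variable}
    sliced = []
    for line_no, target, used in reversed(trace):
        if target in relevant:
            sliced.append((line_no, target, used))
            relevant.discard(target)
            relevant |= set(used)
    return list(reversed(sliced))

if __name__ == "__main__":
    # Hypothetical trace of:  a = 1; b = 2; c = a + b; d = b * 2; e = c + 1
    trace = [
        (1, "a", set()),
        (2, "b", set()),
        (3, "c", {"a", "b"}),
        (4, "d", {"b"}),
        (5, "e", {"c"}),
    ]
    print(backward_slice(trace, "e"))   # lines 1, 2, 3, 5 -- line 4 is irrelevant
```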
|
Yan, Songyang |
![]() Songyang Yan, Xiaodong Zhang, Kunkun Hao, Haojie Xin, Yonggang Luo, Jucheng Yang, Ming Fan, Chao Yang, Jun Sun, and Zijiang Yang (Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China) The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS's reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulators using various ADSs. We verify OSG's ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels. ![]() ![]() |
|
Yang, Borui |
![]() Borui Yang, Md Afif Al Mamun, Jie M. Zhang, and Gias Uddin (King's College London, UK; University of Calgary, Canada; York University, Canada) Large Language Models (LLMs) are prone to hallucinations, e.g., factually incorrect information, in their responses. These hallucinations present challenges for LLM-based applications that demand high factual accuracy. Existing hallucination detection methods primarily depend on external resources, which can suffer from issues such as low availability, incomplete coverage, privacy concerns, high latency, low reliability, and poor scalability. There are also methods depending on output probabilities, which are often inaccessible for closed-source LLMs like GPT models. This paper presents MetaQA, a self-contained hallucination detection approach that leverages metamorphic relation and prompt mutation. Unlike existing methods, MetaQA operates without any external resources and is compatible with both open-source and closed-source LLMs. MetaQA is based on the hypothesis that if an LLM’s response is a hallucination, the designed metamorphic relations will be violated. We compare MetaQA with the state-of-the-art zero-resource hallucination detection method, SelfCheckGPT, across multiple datasets, and on two open-source and two closed-source LLMs. Our results reveal that MetaQA outperforms SelfCheckGPT in terms of precision, recall, and f1 score. For the four LLMs we study, MetaQA outperforms SelfCheckGPT with a superiority margin ranging from 0.041 - 0.113 (for precision), 0.143 - 0.430 (for recall), and 0.154 - 0.368 (for F1-score). For instance, with Mistral-7B, MetaQA achieves an average F1-score of 0.435, compared to SelfCheckGPT’s F1-score of 0.205, representing an improvement rate of 112.2%. MetaQA also demonstrates superiority across all different categories of questions. ![]() |
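The MetaQA abstract above rests on the hypothesis that hallucinated answers violate metamorphic relations under prompt mutation. The sketch below illustrates that idea with a placeholder ask_llm client and a crude answer-agreement check; the mutation operators and the comparison logic are simplifications rather than the paper's actual relations.

```python
"""Crude metamorphic consistency check in the spirit of the MetaQA
abstract above. `ask_llm` is a placeholder; the mutations and the
answer comparison here are simplifications, not the paper's method."""

from collections import Counter

def ask_llm(prompt):
    # Placeholder: wire in any chat-completion client here.
    raise NotImplementedError("plug in an LLM client")

def mutate(question):
    # Simple paraphrase-style prompt mutations (illustrative only).
    return [
        question,
        f"Answer concisely: {question}",
        f"{question} Reply with only the answer.",
        f"Think step by step, then answer: {question}",
    ]

def flag_hallucination(question, agreement_threshold=0.75):
    """Ask mutated prompts; if the answers disagree too much, flag the
    original response as a likely hallucination."""
    answers = [ask_llm(p).strip().lower() for p in mutate(question)]
    most_common, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return agreement < agreement_threshold, most_common, agreement
```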
|
Yang, Chao |
![]() Songyang Yan, Xiaodong Zhang, Kunkun Hao, Haojie Xin, Yonggang Luo, Jucheng Yang, Ming Fan, Chao Yang, Jun Sun, and Zijiang Yang (Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China) The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS's reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulators using various ADSs. We verify OSG's ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels. ![]() ![]() |
|
Yang, Chen |
![]() Junjie Chen, Xingyu Fan, Chen Yang, Shuang Liu, and Jun Sun (Tianjin University, China; Renmin University of China, China; Singapore Management University, Singapore) ![]() |
|
Yang, Haoran |
![]() Haoran Yang and Haipeng Cai (Washington State University, USA; SUNY Buffalo, USA) Multilingual systems are prevalent and broadly impactful, but also complex due to the intricate interactions between the heterogeneous programming languages the systems are developed in. This complexity is further aggravated by the diversity of cross-language interoperability across different language combinations, resulting in additional, often stealthy cross-language bugs. Yet despite the growing number of tools aimed to discover cross-language bugs, a systematic understanding of such bugs is still lacking. To fill this gap, we conduct the first comprehensive study of cross-language bugs, characterizing them in 5 aspects including their symptoms, locations, manifestation, root causes, and fixes, as well as their relationships. Through careful identification and detailed analysis of 400 cross-language bugs in real-world multilingual projects classified from 54,356 relevant code commits in their GitHub repositories, we revealed not only bug characteristics of those five aspects but also how they compare between two top language combinations in the multilingual world (Python-C and Java-C). In addition to findings of the study as well as its enabling tools and datasets, we also provide practical recommendations regarding the prevention, detection, and patching of cross-language bugs. ![]() ![]() |
|
Yang, Haowen |
![]() Haowen Yang, Zhengda Li, Zhiqing Zhong, Xiaoying Tang, and Pinjia He (Chinese University of Hong Kong, Shenzhen, China) With the increasing demand for handling large-scale and complex data, data science (DS) applications often suffer from long execution time and rapid RAM exhaustion, which leads to many serious issues like unbearable delays and crashes in financial transactions. As popular DS libraries are frequently used in these applications, their performance bugs (PBs) are a major contributing factor to these issues, making it crucial to address them for improving overall application performance. However, PBs in popular DS libraries remain largely unexplored. To address this gap, we conducted a study of 202 PBs collected from seven popular DS libraries. Our study examined the impact of PBs and proposed a taxonomy for common root causes. We found over half of the PBs arise from inefficient data processing operations, especially within data structure. We also explored the effort required to locate their root causes and fix these bugs, along with the challenges involved. Notably, around 20% of the PBs could be fixed using simple strategies (e.g. Conditions Optimizing), suggesting the potential for automated repair approaches. Our findings highlight the severity of PBs in core DS libraries and offer insights for developing high-performance libraries and detecting PBs. Furthermore, we derived test rules from our identified root causes, identifying eight PBs, of which four were confirmed, demonstrating the practical utility of our findings. ![]() |
|
Yang, Jucheng |
![]() Songyang Yan, Xiaodong Zhang, Kunkun Hao, Haojie Xin, Yonggang Luo, Jucheng Yang, Ming Fan, Chao Yang, Jun Sun, and Zijiang Yang (Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China) The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS's reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulators using various ADSs. We verify OSG's ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels. ![]() ![]() |
|
Yang, Lanxin |
![]() Yue Li, Bohan Liu, Ting Zhang, Zhiqi Wang, David Lo, Lanxin Yang, Jun Lyu, and He Zhang (Nanjing University, China; Singapore Management University, Singapore) ![]() |
|
Yang, Min |
![]() Zhengmin Yu, Yuan Zhang, Ming Wen, Yinan Nie, Wenhui Zhang, and Min Yang (Fudan University, China; Huazhong University of Science and Technology, China) ![]() |
|
Yang, Minghui |
![]() Chenxu Wang, Tianming Liu, Yanjie Zhao, Minghui Yang, and Haoyu Wang (Huazhong University of Science and Technology, China; Monash University, Australia; OPPO, China) With the rapid development of Large Language Models (LLMs), their integration into automated mobile GUI testing has emerged as a promising research direction. However, existing LLM-based testing approaches face significant challenges, including time inefficiency and high costs due to constant LLM querying. To address these issues, this paper introduces LLMDroid, a novel testing framework designed to enhance existing automated mobile GUI testing tools by leveraging LLMs more efficiently. The workflow of LLMDroid comprises two main stages: Autonomous Exploration and LLM Guidance. During Autonomous Exploration, LLMDroid utilizes existing testing tools while leveraging LLMs to summarize explored pages. When code coverage growth slows, it transitions to LLM Guidance to strategically direct testing towards unexplored functionalities. This approach minimizes LLM interactions while maximizing their impact on test coverage. We applied LLMDroid to three popular open-source Android testing tools and evaluated it on 14 top-listed apps from Google Play. Results demonstrate an average increase of 26.16% in code coverage and 29.31% in activity coverage. Furthermore, our evaluation under different LLMs reveals that LLMDroid outperforms existing step-wise approaches with significant cost efficiency, achieving optimal performance at $0.49 per hour using GPT-4o among tested models, with a cost-effective alternative achieving 94% of this performance at just $0.03 per hour. These findings highlight LLMDroid’s effectiveness in enhancing automated mobile app testing and its potential for widespread adoption. ![]() |
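The LLMDroid workflow above switches from autonomous exploration to LLM guidance when code-coverage growth slows. The sketch below shows one plausible shape of that trigger loop; every function, threshold, and probability in it is an invented stand-in rather than the tool's actual API.

```python
"""Schematic coverage-growth trigger loop, in the spirit of the LLMDroid
workflow above. All functions, probabilities, and thresholds are invented."""

import random

def run_autonomous_step():
    # Stand-in for one step of an existing GUI testing tool.
    return random.random() < 0.3   # pretend a random 30% of steps cover new code

def run_llm_guided_step():
    # Stand-in for asking the LLM to pick an unexplored functionality.
    return random.random() < 0.6   # pretend guided steps cover new code more often

def test_app(total_steps=200, window=20, min_growth=2):
    coverage_units = 0
    recent_gains = []
    for _ in range(total_steps):
        stalled = len(recent_gains) == window and sum(recent_gains) < min_growth
        gained = run_llm_guided_step() if stalled else run_autonomous_step()
        coverage_units += int(gained)
        recent_gains.append(int(gained))
        if len(recent_gains) > window:
            recent_gains.pop(0)
    return coverage_units

if __name__ == "__main__":
    random.seed(0)
    print("covered units (toy):", test_app())
```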
|
Yang, Wei |
![]() Dezhi Ran, Yuan Cao, Yuzhe Guo, Yuetong Li, Mengzhou Wu, Simin Chen, Wei Yang, and Tao Xie (Peking University, China; Beijing Jiaotong University, China; University of Chicago, USA; University of Texas at Dallas, USA) ![]() |
|
Yang, Wenhua |
![]() Shaoheng Cao, Renyi Chen, Wenhua Yang, Minxue Pan, and Xuandong Li (Nanjing University, China; Samsung Electronics, China; Nanjing University of Aeronautics and Astronautics, China) Model-based testing (MBT) has been an important methodology in software engineering, attracting extensive research attention for over four decades. However, despite its academic acclaim, studies examining the impact of MBT in industrial environments—particularly regarding its extended effects—remain limited and yield unclear results. This gap may contribute to the challenges in establishing a study environment for implementing and applying MBT in production settings to evaluate its impact over time. To bridge this gap, we collaborated with an industrial partner to undertake a comprehensive, longitudinal empirical study employing mixed methods. Over two months, we implemented our MBT tool within the corporation, assessing the immediate and extended effectiveness and efficiency of MBT compared to script-writing-based testing. Through a mix of quantitative and qualitative methods—spanning controlled experiments, questionnaire surveys, and interviews—our study uncovers several insightful findings. These include differences in effectiveness and maintainability between immediate and extended MBT application, the evolving perceptions and expectations of engineers regarding MBT, and more. Leveraging these insights, we propose actionable implications for both the academic and industrial communities, aimed at bolstering confidence in MBT adoption and investment for software testing purposes. ![]() |
|
Yang, Xiaohu |
![]() Keyu Liang, Zhongxin Liu, Chao Liu, Zhiyuan Wan, David Lo, and Xiaohu Yang (Zhejiang University, China; Chongqing University, China; Singapore Management University, Singapore) Code search is a crucial task in software engineering, aiming to retrieve code snippets that are semantically relevant to a natural language query. Recently, Pre-trained Language Models (PLMs) have shown remarkable success and are widely adopted for code search tasks. However, PLM-based methods often struggle in cross-domain scenarios. When applied to a new domain, they typically require extensive fine-tuning with substantial data. Even worse, the data scarcity problem in new domains often forces these methods to operate in a zero-shot setting, resulting in a significant decline in performance. RAPID, which generates synthetic data for model fine-tuning, is currently the only effective method for zero-shot cross-domain code search. Despite its effectiveness, RAPID demands substantial computational resources for fine-tuning and needs to maintain specialized models for each domain, underscoring the need for a zero-shot, fine-tuning-free approach for cross-domain code search. The key to tackling zero-shot cross-domain code search lies in bridging the gaps among domains. In this work, we propose to break the query-code matching process of code search into two simpler tasks: query-comment matching and code-code matching. We first conduct an empirical study to investigate the effectiveness of these two matching schemas in zero-shot cross-domain code search. Our findings highlight the strong complementarity among the three matching schemas, i.e., query-code, query-comment, and code-code matching. Based on the findings, we propose CodeBridge, a zero-shot, fine-tuning-free approach for cross-domain code search. Specifically, CodeBridge first employs zero-shot prompting to guide Large Language Models (LLMs) to generate a comment for each code snippet in the codebase and produce a code for each query. Subsequently, it encodes queries, code snippets, comments, and the generated code using PLMs and assesses similarities through three matching schemas: query-code, query-comment, and generated code-code. Lastly, CodeBridge leverages a sampling-based fusion approach that combines these three similarity scores to rank the final search outcomes. Experimental results show that our approach outperforms the state-of-the-art PLM-based code search approaches, i.e., CoCoSoDa and UniXcoder, by an average of 21.4% and 24.9% in MRR, respectively, across three datasets. Our approach also yields results that are better than or comparable to those of the zero-shot cross-domain code search approach RAPID, which requires costly fine-tuning. ![]() ![]() ![]() Yi Gao, Xing Hu, Xiaohu Yang, and Xin Xia (Zhejiang University, China) Test smells arise from poor design practices and insufficient domain knowledge, which can lower the quality of test code and make it harder to maintain and update. Manual refactoring of test smells is time-consuming and error-prone, highlighting the necessity for automated approaches. Current rule-based refactoring methods often struggle in scenarios not covered by predefined rules and lack the flexibility needed to handle diverse cases effectively. In this paper, we propose a novel approach called UTRefactor, a context-enhanced, LLM-based framework for automatic test refactoring in Java projects. 
UTRefactor extracts relevant context from test code and leverages an external knowledge base that includes test smell definitions, descriptions, and DSL-based refactoring rules. By simulating the manual refactoring process through a chain-of-thought approach, UTRefactor guides the LLM to eliminate test smells in a step-by-step process, ensuring both accuracy and consistency throughout the refactoring. Additionally, we implement a checkpoint mechanism to facilitate comprehensive refactoring, particularly when multiple smells are present. We evaluate UTRefactor on 879 tests from six open-source Java projects, reducing the number of test smells from 2,375 to 265, achieving an 89% reduction. UTRefactor outperforms direct LLM-based refactoring methods by 61.82% in smell elimination and significantly surpasses the performance of a rule-based test smell refactoring tool. Our results demonstrate the effectiveness of UTRefactor in enhancing test code quality while minimizing manual involvement. ![]() ![]() Zuobin Wang, Zhiyuan Wan, Yujing Chen, Yun Zhang, David Lo, Difan Xie, and Xiaohu Yang (Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang), China; Hangzhou City University, China; Singapore Management University, Singapore) In smart contract development, practitioners frequently reuse code to reduce development effort and avoid reinventing the wheel. This reused code, whether identical or similar to its original source, is referred to as a code clone. Unintentional code cloning can propagate flaws and vulnerabilities, potentially undermining the reliability and maintainability of software systems. Previous studies have identified a significant prevalence of code clones in Solidity smart contracts on the Ethereum blockchain. To mitigate the risks posed by code clones, clone detection has emerged as an active field of research and practice in software engineering. Recent studies have extended existing techniques or proposed novel techniques tailored to the unique syntactic and semantic features of Solidity. Nonetheless, the evaluations of existing techniques, whether conducted by their original authors or independent researchers, involve codebases in various programming languages and utilize different versions of the corresponding tools. The resulting inconsistency makes direct comparisons of the evaluation results impractical, and hinders the ability to derive meaningful conclusions across the evaluations. There remains a lack of clarity regarding the effectiveness of these techniques in detecting smart contract clones, and whether it is feasible to combine different techniques to achieve scalable yet accurate detection of code clones in smart contracts. To address this gap, we conduct a comprehensive empirical study that evaluates the effectiveness and scalability of five representative clone detection techniques on 33,073 verified Solidity smart contracts, along with a benchmark we curate, in which we manually label 72,010 pairs of Solidity smart contracts with clone tags. Moreover, we explore the potential of combining different techniques to achieve optimal performance of code clone detection for smart contracts, and propose SourceREClone, a framework designed for the refined integration of different techniques, which achieves a 36.9% improvement in F1 score compared to a straightforward combination of the state of the art. Based on our findings, we discuss implications, provide recommendations for practitioners, and outline directions for future research. ![]() |
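The CodeBridge abstract earlier in this entry fuses query-code, query-comment, and generated code-code similarities into one ranking. The sketch below illustrates that fusion step with bag-of-words cosine similarity standing in for the PLM embeddings and a fixed weighted sum standing in for the sampling-based fusion; the snippets, comments, and weights are all invented.

```python
"""Toy three-schema score fusion in the spirit of the CodeBridge abstract
above. Bag-of-words cosine stands in for PLM embeddings, and a fixed
weighted sum stands in for the sampling-based fusion."""

import math
from collections import Counter

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def rank(query, generated_code, codebase, weights=(0.4, 0.3, 0.3)):
    """codebase: list of (code_snippet, llm_generated_comment) pairs."""
    scored = []
    for code, comment in codebase:
        score = (weights[0] * cosine(query, code)               # query-code
                 + weights[1] * cosine(query, comment)          # query-comment
                 + weights[2] * cosine(generated_code, code))   # generated code-code
        scored.append((score, code))
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    codebase = [
        ("def add(a, b): return a + b", "adds two numbers"),
        ("def read_file(path): return open(path).read()", "reads a file into a string"),
    ]
    print(rank("sum two integers", "def add(x, y): return x + y", codebase)[0][1])
```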
|
Yang, Xu |
![]() Xu Yang, Shaowei Wang, Jiayuan Zhou, and Wenhan Zhu (University of Manitoba, Canada; Huawei, Canada) Deep Learning-based Vulnerability Detection (DLVD) techniques have garnered significant interest due to their ability to automatically learn vulnerability patterns from previously compromised code. Despite the notable accuracy demonstrated by pioneering tools, the broader application of DLVD methods in real-world scenarios is hindered by significant challenges. A primary issue is the “one-for-all” design, where a single model is trained to handle all types of vulnerabilities. This approach fails to capture the patterns of different vulnerability types, resulting in suboptimal performance, particularly for less common vulnerabilities that are often underrepresented in training datasets. To address these challenges, we propose MoEVD, which adopts the Mixture-of-Experts (MoE) framework for vulnerability detection. MoEVD decomposes vulnerability detection into two tasks, CWE type classification and CWE-specific vulnerability detection. By splitting the task, MoEVD allows specific experts in vulnerability detection to handle distinct types of vulnerabilities instead of handling all vulnerabilities within one model. Our results show that MoEVD achieves an F1-score of 0.44, significantly outperforming all studied state-of-the-art (SOTA) baselines by at least 12.8%. MoEVD excels across almost all CWE types, improving recall over the best SOTA baseline by 9% to 77.8%. Notably, MoEVD does not sacrifice performance on long-tailed CWE types; instead, its MoE design enhances performance (F1-score) on these by at least 7.3%, addressing long-tailed issues effectively. ![]() ![]() Xu Yang, Wenhan Zhu, Michael Pacheco, Jiayuan Zhou, Shaowei Wang, Xing Hu, and Kui Liu (University of Manitoba, Canada; Huawei, Canada; Zhejiang University, China; Huawei Software Engineering Application Technology Lab, China) Detecting vulnerability fix commits in open-source software is crucial for maintaining software security. To help OSS identify vulnerability fix commits, several automated approaches have been developed. However, existing approaches like VulFixMiner and CoLeFunDa focus solely on code changes, neglecting essential context from development artifacts. Tools like Vulcurator, which integrates issue reports, fail to leverage semantic associations between different development artifacts (e.g., pull requests and historical vulnerability fixes). Moreover, they miss vulnerability fixes in tangled commits and lack explanations, limiting practical use. Hence, to address those limitations, we propose LLM4VFD, a novel framework that leverages Large Language Models (LLMs) enhanced with Chain-of-Thought reasoning and In-Context Learning to improve the accuracy of vulnerability fix detection. LLM4VFD comprises three components: (1) Code Change Intention, which analyzes commit summaries, purposes, and implications using Chain-of-Thought reasoning; (2) Development Artifact, which incorporates context from related issue reports and pull requests; (3) Historical Vulnerability, which retrieves similar past vulnerability fixes to enrich context. More importantly, on top of the prediction, LLM4VFD also provides a detailed analysis and explanation to help security experts understand the rationale behind the decision. We evaluated LLM4VFD against state-of-the-art techniques, including Pre-trained Language Model-based approaches and vanilla LLMs, using a newly collected dataset, BigVulFixes. 
Experimental results demonstrate that LLM4VFD significantly outperforms the best-performing existing approach by 68.1%–145.4%. Furthermore, we conducted a user study with security experts, showing that the analysis generated by LLM4VFD improves the efficiency of vulnerability fix identification. ![]() ![]() Xu Yang, Zhenbang Chen, Wei Dong, and Ji Wang (National University of Defense Technology, China) Floating-point constraint solving is challenging due to the complex representation and non-linear computations. Search-based constraint solving provides an effective method for solving floating-point constraints. In this paper, we propose QSF to improve the efficiency of search-based solving for floating-point constraints. The key idea of QSF is to model the floating-point constraint solving problem as a multi-objective optimization problem. Specifically, QSF considers both the number of unsatisfied constraints and the sum of the violation degrees of unsatisfied constraints as the objectives for search-based optimization. In addition, we propose a new evolutionary algorithm in which the mutation operators are specially designed for floating-point numbers, aiming to solve the multi-objective problem more efficiently. We have implemented QSF and conducted extensive experiments on both the SMT-COMP benchmark and a benchmark from real-world floating-point programs. The results demonstrate that compared to SOTA floating-point solvers, QSF achieved an average speedup of 15.72X under a 60-second timeout and an impressive 87.48X under a 600-second timeout on the first benchmark. Similarly, on the second benchmark, QSF delivered an average speedup of 22.44X and 29.23X, respectively, under the two timeout configurations. Furthermore, QSF has also enhanced the performance of symbolic execution for floating-point programs. ![]() |
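The QSF abstract above casts floating-point solving as optimization over two objectives: the number of unsatisfied constraints and the sum of their violation degrees. The sketch below computes that fitness for a toy constraint set and drives it with a naive mutation loop; the constraint encoding, tolerance, and lexicographic comparison are simplifications of the paper's evolutionary algorithm.

```python
"""Toy two-objective fitness for floating-point constraint search, in the
spirit of the QSF abstract above. Constraints, violation degrees, and the
mutation loop are simplified stand-ins."""

import random

# Each constraint returns a violation degree: 0.0 if satisfied, >0 otherwise.
CONSTRAINTS = [
    lambda x, y: max(0.0, 1.0 - (x + y)),                 # x + y >= 1
    lambda x, y: max(0.0, x * x - 4.0),                   # x*x <= 4
    lambda x, y: max(0.0, abs(y - 2.0 * x) - 1e-6),       # y == 2*x (within tolerance)
]

def fitness(assignment):
    degrees = [c(*assignment) for c in CONSTRAINTS]
    unsat = sum(1 for d in degrees if d > 0.0)
    return (unsat, sum(degrees))          # compare fewer unsat first, then total violation

def search(iterations=20000, seed=0):
    rng = random.Random(seed)
    best = (rng.uniform(-10, 10), rng.uniform(-10, 10))
    best_fit = fitness(best)
    for _ in range(iterations):
        cand = tuple(v + rng.gauss(0.0, 0.5) for v in best)
        cand_fit = fitness(cand)
        if cand_fit < best_fit:           # tuple comparison is lexicographic
            best, best_fit = cand, cand_fit
    return best, best_fit

if __name__ == "__main__":
    print(search())
```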
|
Yang, Zijiang |
![]() Songyang Yan, Xiaodong Zhang, Kunkun Hao, Haojie Xin, Yonggang Luo, Jucheng Yang, Ming Fan, Chao Yang, Jun Sun, and Zijiang Yang (Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China) The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS's reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulators using various ADSs. We verify OSG's ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels. ![]() ![]() |
|
Yao, Peisen |
![]() Yuan Li, Peisen Yao, Kan Yu, Chengpeng Wang, Yaoyang Ye, Song Li, Meng Luo, Yepang Liu, and Kui Ren (Zhejiang University, China; Ant Group, China; Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) ![]() |
|
Ye, Guixin |
![]() Weiyuan Tong, Zixu Wang, Zhanyong Tang, Jianbin Fang, Yuqun Zhang, and Guixin Ye (NorthWest University, China; National University of Defense Technology, China; Southern University of Science and Technology, China) ![]() |
|
Ye, Junyao |
![]() Junyao Ye, Zhen Li, Xi Tang, Deqing Zou, Shouhuai Xu, Weizhong Qiang, and Hai Jin (Huazhong University of Science and Technology, China; University of Colorado at Colorado Springs, USA) ![]() |
|
Ye, Yang |
![]() Yujia Chen, Yang Ye, Zhongqi Li, Yuchi Ma, and Cuiyun Gao (Harbin Institute of Technology, Shenzhen, China; Huawei Cloud Computing Technologies, China) Large code models (LCMs) have remarkably advanced the field of code generation. Despite their impressive capabilities, they still face practical deployment issues, such as high inference costs, limited accessibility of proprietary LCMs, and adaptability issues of ultra-large LCMs. These issues highlight the critical need for more accessible, lightweight yet effective LCMs. Knowledge distillation (KD) offers a promising solution, which transfers the programming capabilities of larger, advanced LCMs (Teacher) to smaller, less powerful LCMs (Student). However, existing KD methods for code intelligence often lack consideration of fault domain knowledge and rely on static seed knowledge, leading to degraded programming capabilities of student models. In this paper, we propose a novel Self-Paced knOwledge DistillAtion framework, named SODA, aiming at developing lightweight yet effective student LCMs via adaptively transferring the programming capabilities from advanced teacher LCMs. SODA consists of three stages in one cycle: (1) Correct-and-Fault Knowledge Delivery stage aims at improving the student model’s capability to recognize errors while ensuring its basic programming skill during the knowledge transferring, which involves correctness-aware supervised learning and fault-aware contrastive learning methods. (2) Multi-View Feedback stage aims at measuring the quality of results generated by the student model from two views, including model-based and static tool-based measurement, for identifying the difficult questions; (3) Feedback-based Knowledge Update stage aims at updating the student model adaptively by generating new questions at different difficulty levels, in which the difficulty levels are categorized based on the feedback in the second stage. By performing the distillation process iteratively, the student model is continuously refined through learning more advanced programming skills from the teacher model. We compare SODA with four state-of-the-art KD approaches on three widely-used code generation benchmarks with different programming languages. Experimental results show that SODA improves the student model by 65.96% in terms of average Pass@1, outperforming the best baseline PERsD by 29.85%. Based on the proposed SODA framework, we develop SodaCoder, a series of lightweight yet effective LCMs with ∼7B parameters, which outperform 15 LCMs with less than or equal to 16B parameters. Notably, SodaCoder-DS-6.7B, built on DeepseekCoder-6.7B, even surpasses the prominent ChatGPT on average Pass@1 across seven programming languages (66.4 vs. 61.3). ![]() |
|
Ye, Yaoyang |
![]() Yuan Li, Peisen Yao, Kan Yu, Chengpeng Wang, Yaoyang Ye, Song Li, Meng Luo, Yepang Liu, and Kui Ren (Zhejiang University, China; Ant Group, China; Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) ![]() |
|
Yi, Ning |
![]() Weibin Wu, Yuhang Cao, Ning Yi, Rongyi Ou, and Zibin Zheng (Sun Yat-sen University, China) Question answering (QA) is a fundamental task of large language models (LLMs), which requires LLMs to automatically answer human-posed questions in natural language. However, LLMs are known to distort facts and make non-factual statements (i.e., hallucinations) when dealing with QA tasks, which may affect the deployment of LLMs in real-life situations. In this work, we propose DrHall, a framework for detecting and reducing the factual hallucinations of black-box LLMs with metamorphic testing (MT). We believe that hallucinated answers are unstable. Therefore, when LLMs hallucinate, they are more likely to produce different answers if we use metamorphic testing to make LLMs re-execute the same task with different execution paths, which motivates the design of DrHall. The effectiveness of DrHall is evaluated empirically on three datasets, including a self-built dataset of natural language questions: FactHalluQA, as well as two datasets of programming questions: Refactory and LeetCode. The evaluation results confirm that DrHall can consistently outperform the state-of-the-art baselines, obtaining an average F1 score of over 0.856 for hallucination detection. For hallucination correction, DrHall can also outperform the state-of-the-art baselines, with an average hallucination correction rate of over 53%. We hope that our work can enhance the reliability of LLMs and provide new insights for the research of LLM hallucination mitigation. ![]() |
|
Yin, Jianwei |
![]() Siwei Tan, Liqiang Lu, Debin Xiang, Tianyao Chu, Congliang Lang, Jintao Chen, Xing Hu, and Jianwei Yin (Zhejiang University, China) Quantum programs provide exponential speedups compared to classical programs in certain areas, but they also inevitably encounter logical faults. Automatically repairing quantum programs is much more challenging than repairing classical programs due to the non-replicability of data, the vast search space of program inputs, and the new programming paradigm. Existing works based on semantic-based or learning-based program repair techniques are fundamentally limited in repairing efficiency and effectiveness. In this work, we propose HornBro, an efficient framework for automated quantum program repair. The key insight of HornBro lies in the homotopy-like method, which iteratively switches between the classical and quantum parts. This approach allows the repair tasks to be efficiently offloaded to the most suitable platforms, enabling a progressive convergence toward the correct program. We start by designing an implication assertion pragma to enable rigorous specifications of quantum program behavior, which helps to generate a quantum test suite automatically. This suite leverages the orthonormal bases of quantum programs to accommodate different encoding schemes. Given a fixed number of test cases, it allows the maximum input coverage of potential counter-example candidates. Then, we develop a Clifford approximation method with an SMT-based search to transform the patch localization program into a symbolic reasoning problem. Finally, we offload the computationally intensive repair of gate parameters to quantum hardware by leveraging the differentiability of quantum gates. Experiments suggest that HornBro increases the repair success rate by more than 62.5% compared to the existing repair techniques, supporting more types of quantum bugs. It also achieves 35.7× speedup in the repair and 99.9% gate reduction of the patch. ![]() ![]() |
|
Yu, Jiongchi |
![]() Jiongchi Yu, Xie Xiaofei, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, and Frank Liau (Singapore Management University, Singapore; Tianjin University, China; Zhejiang University, China; Shanghai Jiao Tong University, China; University of Tokyo, Japan; University of Alberta, Canada; Southern Cross University, Australia; Government Technology Agency of Singapore, Singapore) ![]() |
|
Yu, Kan |
![]() Yuan Li, Peisen Yao, Kan Yu, Chengpeng Wang, Yaoyang Ye, Song Li, Meng Luo, Yepang Liu, and Kui Ren (Zhejiang University, China; Ant Group, China; Hong Kong University of Science and Technology, China; Southern University of Science and Technology, China) ![]() |
|
Yu, Tingting |
![]() Zhaoxu Zhang, Komei Ryu, Tingting Yu, and William G.J. Halfond (University of Southern California, USA; University of Connecticut, USA) Bug report reproduction is a crucial but time-consuming task to be carried out during mobile app maintenance. To accelerate this process, researchers have developed automated techniques for reproducing mobile app bug reports. However, due to the lack of an effective mechanism to recognize different buggy behaviors described in the report, existing work is limited to reproducing crash bug reports, or requires developers to manually analyze execution traces to determine if a bug was successfully reproduced. To address this limitation, we introduce a novel technique to automatically identify and extract the buggy behavior from the bug report and detect it during the automated reproduction process. To accommodate various buggy behaviors of mobile app bugs, we conducted an empirical study and created a standardized representation for expressing the bug behavior identified from our study. Given a report, our approach first transforms the documented buggy behavior into this standardized representation, then matches it against real-time device and UI information during the reproduction to recognize the bug. Our empirical evaluation demonstrated that our approach achieved over 90% precision and recall in generating the standardized representation of buggy behaviors. It correctly identified bugs in 83% of the bug reports and enhanced existing reproduction techniques, allowing them to reproduce four times more bug reports. ![]() |
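The abstract above introduces a standardized representation of buggy behavior that is matched against real-time device and UI information during reproduction. The sketch below shows a toy version of such a matcher; the dict-based spec format, field names, and example values are invented and far simpler than the paper's representation.

```python
"""Toy matcher between a standardized buggy-behavior spec and an observed
UI/device state, loosely following the bug-reproduction abstract above.
The spec format and example values are invented."""

def behavior_observed(spec, observed_state):
    """spec: dict of expected observations, e.g. crash flag, toast text, widget text.
    Returns True if every expected observation holds in the observed state."""
    checks = {
        "crash": lambda exp: observed_state.get("crashed", False) == exp,
        "toast_contains": lambda exp: exp in observed_state.get("toast", ""),
        "widget_text": lambda exp: any(exp == w.get("text")
                                       for w in observed_state.get("widgets", [])),
    }
    return all(checks[key](expected) for key, expected in spec.items())

if __name__ == "__main__":
    spec = {"toast_contains": "Invalid date", "crash": False}
    state = {"crashed": False, "toast": "Error: Invalid date entered", "widgets": []}
    print(behavior_observed(spec, state))   # True -> buggy behavior reproduced
```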
|
Yu, Zhengmin |
![]() Zhengmin Yu, Yuan Zhang, Ming Wen, Yinan Nie, Wenhui Zhang, and Min Yang (Fudan University, China; Huazhong University of Science and Technology, China) ![]() |
|
Zahid, Anwar Hossain |
![]() Shaila Sharmin, Anwar Hossain Zahid, Subhankar Bhattacharjee, Chiamaka Igwilo, Miryung Kim, and Wei Le (Iowa State University, USA; University of California at Los Angeles, USA) ![]() |
|
Zhang, Bowen |
![]() Jiongchi Yu, Xie Xiaofei, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, and Frank Liau (Singapore Management University, Singapore; Tianjin University, China; Zhejiang University, China; Shanghai Jiao Tong University, China; University of Tokyo, Japan; University of Alberta, Canada; Southern Cross University, Australia; Government Technology Agency of Singapore, Singapore) ![]() |
|
Zhang, Cen |
![]() Ziqiao Kong, Cen Zhang, Maoyi Xie, Ming Hu, Yue Xue, Ye Liu, Haijun Wang, and Yang Liu (Nanyang Technological University, Singapore; Singapore Management University, Singapore; MetaTrust Labs, Singapore; Xi'an Jiaotong University, China) Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: 1. a lack of profit-centric techniques for expediting detection and, 2. insufficient automation in maximizing the profitability of discovered vulnerabilities, leaving the analysis to human experts. To address these gaps, we have developed VERITE, a profit-centric smart contract fuzzing framework that not only effectively detects those profitable vulnerabilities but also maximizes the exploited profits. VERITE has three key features: 1. DeFi action-based mutators for boosting the exploration of transactions with different fund flows; 2. potentially profitable candidates identification criteria, which checks whether the input has caused abnormal fund flow properties during testing; 3. a gradient descent-based profit maximization strategy for these identified candidates. VERITE is fully developed from scratch and evaluated on a dataset consisting of 61 exploited real-world DeFi projects with an average of over 1.1 million dollars loss. The results show that VERITE can automatically extract more than 18 million dollars in total and is significantly better than state-of-the-art fuzzer ItyFuzz in both detection (29/10) and exploitation (134 times more profits gained on average). Remarkably, in 12 targets, it gains more profits than real-world attacking exploits (1.01 to 11.45 times more). VERITE is also applied by auditors in contract auditing, where 6 (5 high severity) zero-day vulnerabilities are found with over $2,500 bounty rewards. ![]() ![]() ![]() |
|
Zhang, Guohan |
![]() Haodong Li, Xiao Cheng, Guohan Zhang, Guosheng Xu, Guoai Xu, and Haoyu Wang (Beijing University of Posts and Telecommunications, China; UNSW, Australia; Harbin Institute of Technology, Shenzhen, China; Huazhong University of Science and Technology, China) Learning-based Android malware detection has earned significant recognition across industry and academia, yet its effectiveness hinges on the accuracy of labeled training data. Manual labeling, being prohibitively expensive, has prompted the use of automated methods, such as leveraging anti-virus engines like VirusTotal, which unfortunately introduces mislabeling, aka "label noise". The state-of-the-art label noise reduction approach, MalWhiteout, can effectively reduce random label noise but underperforms in mitigating real-world emergent malware (EM) label noise stemming from newly emerging Android malware variants overlooked by VirusTotal. To tackle this, we conceptualize EM label noise detection as an anomaly detection problem and introduce a novel tool, MalCleanse, that surpasses MalWhiteout's ability to address EM label noise. MalCleanse combines uncertainty estimation with unsupervised anomaly detection, identifying samples with high uncertainty as mislabeled, thereby enhancing its capability to remove EM label noise. Our experimental results demonstrate a significant reduction in EM label noise by approximately 25.25%, achieving an F1 Score of 80.32% for label noise detection at a noise ratio of 40.61%. Notably, MalCleanse outperforms MalWhiteout with an increase of 40.9% in overall F1 score for mitigating EM label noise. This paper pioneers the integration of deep neural network model uncertainty to refine label accuracy, thereby enhancing the reliability of malware detection systems. Our approach represents a significant step forward in addressing the challenges posed by emergent malware in automated labeling systems. ![]() |
|
Zhang, He |
![]() Yue Li, Bohan Liu, Ting Zhang, Zhiqi Wang, David Lo, Lanxin Yang, Jun Lyu, and He Zhang (Nanjing University, China; Singapore Management University, Singapore) ![]() |
|
Zhang, Jian |
![]() Xu Wang, Mingming Zhang, Xiangxin Meng, Jian Zhang, Yang Liu, and Chunming Hu (Beihang University, China; Zhongguancun Laboratory, China; Ministry of Education, China; Nanyang Technological University, Singapore) Deep Neural Networks (DNNs) are prevalent across a wide range of applications. Despite their success, the complexity and opaque nature of DNNs pose significant challenges in debugging and repairing DNN models, limiting their reliability and broader adoption. In this paper, we propose MLM4DNN, an element-based automated DNN repair method. Unlike previous techniques that focus on post-training adjustments or rely heavily on predefined bug patterns, MLM4DNN repairs DNNs by leveraging a fine-tuned Masked Language Model (MLM) to predict correct fixes for nine predefined key elements in DNNs. We construct a large-scale dataset by masking nine key elements from the correct DNN source code and then force the MLM to restore the correct elements to learn the deep semantics that ensure the normal functionalities of DNNs. Afterwards, a light-weight static analysis tool is designed to filter out low-quality patches to enhance the repair efficiency. We introduce a patch validation method specifically for DNN repair tasks, which consists of three evaluation metrics from different aspects to model the effectiveness of generated patches. We construct a benchmark, BenchmarkAPR4DNN, including 51 buggy DNN models and an evaluation tool that outputs the three metrics. We evaluate MLM4DNN against six baselines on BenchmarkAPR4DNN, and results show that MLM4DNN outperforms all state-of-the-art baselines, including two dynamic-based and four zero-shot learning-based methods. After applying the fine-tuned MLM design to several prevalent Large Language Models (LLMs), we consistently observe improved performance in DNN repair tasks compared to the original LLMs, which demonstrates the effectiveness of the method proposed in this paper. ![]() |
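The MLM4DNN abstract above repairs DNN programs by masking predefined key elements and letting a fine-tuned masked language model restore them. The sketch below shows only the masking step on a Keras-style snippet; the element patterns, mask token, and predict_mask stub are assumptions, and the paper's nine elements and fine-tuned model are not reproduced here.

```python
"""Sketch of the element-masking step described in the MLM4DNN abstract
above. The element patterns, mask token, and predictor are invented
placeholders, not the paper's actual nine elements or fine-tuned MLM."""

import re

MASK = "<mask>"

# Two illustrative "key element" patterns (activation function, learning rate).
ELEMENT_PATTERNS = {
    "activation": re.compile(r"(activation=)'\w+'"),
    "learning_rate": re.compile(r"(learning_rate=)[\d.eE+-]+"),
}

def mask_element(source, element):
    pattern = ELEMENT_PATTERNS[element]
    return pattern.sub(lambda m: m.group(1) + MASK, source, count=1)

def predict_mask(masked_source):
    # Placeholder for the fine-tuned masked language model.
    raise NotImplementedError("plug in an MLM fill-mask pipeline here")

if __name__ == "__main__":
    snippet = ("model.add(Dense(10, activation='relu'))\n"
               "opt = Adam(learning_rate=10.0)  # suspiciously large\n")
    print(mask_element(snippet, "learning_rate"))
```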
|
Zhang, Jie M. |
![]() Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, and Yang Liu (Nanyang Technological University, Singapore; Peking University, China; King's College London, UK; Fudan University, China) Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods. ![]() ![]() Borui Yang, Md Afif Al Mamun, Jie M. Zhang, and Gias Uddin (King's College London, UK; University of Calgary, Canada; York University, Canada) Large Language Models (LLMs) are prone to hallucinations, e.g., factually incorrect information, in their responses. These hallucinations present challenges for LLM-based applications that demand high factual accuracy. Existing hallucination detection methods primarily depend on external resources, which can suffer from issues such as low availability, incomplete coverage, privacy concerns, high latency, low reliability, and poor scalability. There are also methods depending on output probabilities, which are often inaccessible for closed-source LLMs like GPT models. This paper presents MetaQA, a self-contained hallucination detection approach that leverages metamorphic relation and prompt mutation. Unlike existing methods, MetaQA operates without any external resources and is compatible with both open-source and closed-source LLMs. MetaQA is based on the hypothesis that if an LLM’s response is a hallucination, the designed metamorphic relations will be violated. We compare MetaQA with the state-of-the-art zero-resource hallucination detection method, SelfCheckGPT, across multiple datasets, and on two open-source and two closed-source LLMs. Our results reveal that MetaQA outperforms SelfCheckGPT in terms of precision, recall, and f1 score. For the four LLMs we study, MetaQA outperforms SelfCheckGPT with a superiority margin ranging from 0.041 - 0.113 (for precision), 0.143 - 0.430 (for recall), and 0.154 - 0.368 (for F1-score). For instance, with Mistral-7B, MetaQA achieves an average F1-score of 0.435, compared to SelfCheckGPT’s F1-score of 0.205, representing an improvement rate of 112.2%. 
MetaQA also demonstrates superiority across all different categories of questions. ![]() |
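The MetaQA abstract above describes the approach only at a high level. Below is a minimal, illustrative sketch (not the authors' implementation) of the underlying idea: query the same model with metamorphically mutated but semantically equivalent prompts, and flag a response as a likely hallucination when the answers disagree. The `ask_llm` callable and the crude string-similarity agreement check are assumptions introduced for the example.

```python
# Illustrative sketch of metamorphic-relation-based hallucination detection.
# Assumption: `ask_llm` wraps some chat-completion API; it is NOT part of MetaQA.
from difflib import SequenceMatcher
from typing import Callable, List


def mutate_prompt(question: str) -> List[str]:
    """Produce semantically equivalent rephrasings (prompt mutations)."""
    return [
        question,
        f"Please answer concisely: {question}",
        f"Answer the following question truthfully. {question}",
    ]


def answers_agree(a: str, b: str, threshold: float = 0.6) -> bool:
    """Crude agreement check; a real system would compare answer semantics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def likely_hallucination(question: str, ask_llm: Callable[[str], str]) -> bool:
    """Flag the answer as a likely hallucination if mutated prompts disagree."""
    answers = [ask_llm(p) for p in mutate_prompt(question)]
    reference = answers[0]
    disagreements = sum(not answers_agree(reference, a) for a in answers[1:])
    # If most equivalent prompts yield a different answer, the metamorphic
    # relation "same question -> same answer" is violated.
    return disagreements > len(answers[1:]) // 2


if __name__ == "__main__":
    canned = {"Who wrote Hamlet?": ["Shakespeare", "William Shakespeare", "Shakespeare wrote it"]}

    def fake_llm(prompt: str) -> str:  # deterministic stub for the example
        for q, replies in canned.items():
            if q in prompt:
                return replies.pop(0) if replies else "Shakespeare"
        return "unknown"

    print(likely_hallucination("Who wrote Hamlet?", fake_llm))  # -> False
```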
|
Zhang, Junwei |
![]() Junwei Zhang, Xing Hu, Shan Gao, Xin Xia, David Lo, and Shanping Li (Zhejiang University, China; Huawei, China; Singapore Management University, Singapore) Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, to categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest on two widely-used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of the two datasets, respectively, contain noise, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models. Fine-tuning the four LLMs with the filtered Methods2Test dataset, on average, improves their performance by 67% in branch coverage, using the Defects4J benchmark. For the Atlas dataset, the four LLMs improve branch coverage by 39%. Additionally, filtering noise improves bug detection performance, resulting in a 21.42% increase in bugs detected by the generated tests. ![]() |
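The abstract above names a rule-based syntax filter as one of CleanTest's three filters. The sketch below illustrates what such a filter could look like for (focal method, test) pairs; the concrete rules and regexes are illustrative assumptions, not CleanTest's actual rule set.

```python
# Minimal sketch of a rule-based syntax filter for Java test samples, in the
# spirit of the syntax filter described above. This is NOT CleanTest itself;
# the rules below are illustrative examples only.
import re

ASSERT_RE = re.compile(r"\bassert\w*\s*\(")
TEST_ANNOTATION_RE = re.compile(r"@Test\b")


def balanced_braces(code: str) -> bool:
    """Reject snippets whose braces do not balance (truncated extraction)."""
    depth = 0
    for ch in code:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def is_noisy_test(test_code: str) -> bool:
    """Apply simple syntactic rules to flag a noisy test sample."""
    if not balanced_braces(test_code):
        return True                      # syntactically broken snippet
    if not TEST_ANNOTATION_RE.search(test_code):
        return True                      # not actually a JUnit test method
    if not ASSERT_RE.search(test_code):
        return True                      # test without any assertion (smell)
    return False


if __name__ == "__main__":
    good = '@Test public void testAdd() { assertEquals(3, add(1, 2)); }'
    bad = 'public void helper() { doSomething(); '
    print(is_noisy_test(good), is_noisy_test(bad))   # False True
```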
|
Zhang, Li |
![]() Xiaoli Lian, Shuaisong Wang, Hanyu Zou, Fang Liu, Jiajun Wu, and Li Zhang (Beihang University, China) In the current software-driven era, ensuring privacy and security is critical. Despite this, the specification of security requirements for software is still largely a manual and labor-intensive process. Engineers are tasked with analyzing potential security threats based on functional requirements (FRs), a procedure prone to omissions and errors due to the expertise gap between cybersecurity experts and software engineers. To bridge this gap, we introduce F2SRD (Function-to-Security Requirements Derivation), an automated approach that proactively derives security requirements (SRs) from functional specifications under the guidance of relevant security verification requirements (VRs) drawn from the well recognized OWASP Application Security Verification Standard (ASVS). F2SRD operates in two main phases: Initially, we develop a VR retriever trained on a custom database of FR-VR pairs, enabling it to adeptly select applicable VRs from ASVS. This targeted retrieval informs the precise and actionable formulation of SRs. Subsequently, these VRs are used to construct structured prompts that direct GPT-4 in generating SRs. Our comparative analysis against two established models demonstrates F2SRD's enhanced performance in producing SRs that excel in inspiration, diversity, and specificity—essential attributes for effective security requirement generation. By leveraging security verification standards, we believe that the generated SRs are not only more focused but also resonate stronger with the needs of engineers. ![]() ![]() Xin Tan, Xiao Long, Yinghao Zhu, Lin Shi, Xiaoli Lian, and Li Zhang (Beihang University, China) Onboarding newcomers is vital for the sustainability of open-source software (OSS) projects. To lower barriers and increase engagement, OSS projects have dedicated experts who provide guidance for newcomers. However, timely responses are often hindered by experts' busy schedules. The recent rapid advancements of AI in software engineering have brought opportunities to leverage AI as a substitute for expert mentoring. However, the potential role of AI as a comprehensive mentor throughout the entire onboarding process remains unexplored. To identify design strategies of this "AI mentor", we applied Design Fiction as a participatory method with 19 OSS newcomers. We investigated their current onboarding experience and elicited 32 design strategies for future AI mentor. Participants envisioned AI mentor being integrated into OSS platforms like GitHub, where it could offer assistance to newcomers, such as "recommending projects based on personalized requirements" and "assessing and categorizing project issues by difficulty". We also collected participants' perceptions of a prototype, named "OSSerCopilot", that implemented the envisioned strategies. They found the interface useful and user-friendly, showing a willingness to use it in the future, which suggests the design strategies are effective. Finally, in order to identify the gaps between our design strategies and current research, we conducted a comprehensive literature review, evaluating the extent of existing research support for this concept. We find that research is relatively scarce in certain areas where newcomers highly anticipate AI mentor assistance, such as "discovering an interested project". 
Our study has the potential to revolutionize the current newcomer-expert mentorship and provides valuable insights for researchers and tool designers aiming to develop and enhance AI mentor systems. ![]() |
|
Zhang, Lingming |
![]() Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang (University of Illinois at Urbana-Champaign, USA) Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless – an agentless approach to automatically resolve software development issues. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents at the time of paper submission! Agentless also achieves more than 50% solve rate when using Claude 3.5 Sonnet on the new SWE-bench Verified benchmark. In fact, Agentless has already been adopted by OpenAI as the go-to approach to showcase the real-world coding performance of both GPT-4o and the new o1 models; more recently, Agentless has also been used by DeepSeek to evaluate their newest DeepSeek V3 and R1 models. Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patches or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-𝑆 by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the currently overlooked potential of a simplistic, cost-effective technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction. We have open-sourced Agentless at: https://github.com/OpenAutoCoder/Agentless ![]() |
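To make the three-phase process described above concrete, here is a shape-only sketch of a localization, repair, and patch-validation pipeline. The helpers `query_llm` and `run_tests` are hypothetical placeholders, and this is not the actual Agentless implementation (which is open-sourced at the repository linked in the abstract).

```python
# Shape-only sketch of a localization -> repair -> patch-validation pipeline.
# `query_llm` and `run_tests` are hypothetical placeholders for an LLM client
# and a test runner; the real Agentless is at the URL given in the abstract.
import subprocess
from pathlib import Path
from typing import Callable, List, Optional


def localize(issue: str, repo: Path, query_llm: Callable[[str], str]) -> List[str]:
    """Ask the model to rank suspicious files given the issue and a file list."""
    files = [str(p.relative_to(repo)) for p in repo.rglob("*.py")]
    prompt = f"Issue:\n{issue}\n\nFiles:\n" + "\n".join(files) + "\n\nList the most suspicious files."
    return [line.strip() for line in query_llm(prompt).splitlines() if line.strip()][:3]


def repair(issue: str, repo: Path, files: List[str], query_llm: Callable[[str], str]) -> List[str]:
    """Generate several candidate diffs for the localized files."""
    context = "\n\n".join(f"# {f}\n{(repo / f).read_text()}" for f in files if (repo / f).exists())
    prompt = f"Issue:\n{issue}\n\nCode:\n{context}\n\nReturn a unified diff that fixes the issue."
    return [query_llm(prompt) for _ in range(3)]          # sample a few candidates


def validate(repo: Path, patches: List[str], run_tests: Callable[[Path], bool]) -> Optional[str]:
    """Apply each candidate patch in turn and keep the first one that passes tests."""
    for patch in patches:
        subprocess.run(["git", "-C", str(repo), "apply", "-"], input=patch.encode(), check=False)
        if run_tests(repo):
            return patch
        subprocess.run(["git", "-C", str(repo), "checkout", "--", "."], check=False)
    return None
```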
|
Zhang, Mingming |
![]() Xu Wang, Mingming Zhang, Xiangxin Meng, Jian Zhang, Yang Liu, and Chunming Hu (Beihang University, China; Zhongguancun Laboratory, China; Ministry of Education, China; Nanyang Technological University, Singapore) Deep Neural Networks (DNNs) are prevalent across a wide range of applications. Despite their success, the complexity and opaque nature of DNNs pose significant challenges in debugging and repairing DNN models, limiting their reliability and broader adoption. In this paper, we propose MLM4DNN, an element-based automated DNN repair method. Unlike previous techniques that focus on post-training adjustments or rely heavily on predefined bug patterns, MLM4DNN repairs DNNs by leveraging a fine-tuned Masked Language Model (MLM) to predict correct fixes for nine predefined key elements in DNNs. We construct a large-scale dataset by masking nine key elements from the correct DNN source code and then force the MLM to restore the correct elements to learn the deep semantics that ensure the normal functionalities of DNNs. Afterwards, a light-weight static analysis tool is designed to filter out low-quality patches to enhance the repair efficiency. We introduce a patch validation method specifically for DNN repair tasks, which consists of three evaluation metrics from different aspects to model the effectiveness of generated patches. We construct a benchmark, BenchmarkAPR4DNN, including 51 buggy DNN models and an evaluation tool that outputs the three metrics. We evaluate MLM4DNN against six baselines on BenchmarkAPR4DNN, and results show that MLM4DNN outperforms all state-of-the-art baselines, including two dynamic-based and four zero-shot learning-based methods. After applying the fine-tuned MLM design to several prevalent Large Language Models (LLMs), we consistently observe improved performance in DNN repair tasks compared to the original LLMs, which demonstrates the effectiveness of the method proposed in this paper. ![]() |
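The core training idea described above, masking key elements of DNN source code so a masked language model can learn to restore them, can be illustrated with a small sketch. The element patterns and the `<mask>` token below are assumptions for illustration and do not reproduce MLM4DNN's nine element types.

```python
# Illustrative sketch: mask a "key element" of DNN code (here, activation,
# loss, or learning rate values) so a masked language model can be asked to
# restore it. The regexes and the <mask> token are illustrative assumptions.
import re
from typing import List, Tuple

# Patterns for a few element kinds one might mask in Keras-like code.
ELEMENT_PATTERNS = {
    "activation": re.compile(r"activation\s*=\s*(['\"]\w+['\"])"),
    "loss": re.compile(r"loss\s*=\s*(['\"][\w_]+['\"])"),
    "learning_rate": re.compile(r"learning_rate\s*=\s*([0-9.eE+-]+)"),
}

MASK = "<mask>"


def mask_elements(source: str, kind: str) -> List[Tuple[str, str]]:
    """Return (masked_source, ground_truth) pairs for one element kind."""
    pattern = ELEMENT_PATTERNS[kind]
    samples = []
    for match in pattern.finditer(source):
        truth = match.group(1)
        masked = source[:match.start(1)] + MASK + source[match.end(1):]
        samples.append((masked, truth))
    return samples


if __name__ == "__main__":
    code = (
        "model.add(Dense(64, activation='relu'))\n"
        "model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))\n"
    )
    for masked, truth in mask_elements(code, "activation"):
        print(truth)        # 'relu'
        print(masked)       # ... activation=<mask> ...
```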
|
Zhang, Neng |
![]() Anji Li, Neng Zhang, Ying Zou, Zhixiang Chen, Jian Wang, and Zibin Zheng (Sun Yat-sen University, China; Central China Normal University, China; Queen's University, Canada; Wuhan University, China) Code snippets are widely used in technical forums to demonstrate solutions to programming problems. They can be leveraged by developers to accelerate problem-solving. However, code snippets often lack concrete types of the APIs used in them, which impedes their understanding and reuse. To enhance the description of a code snippet, a number of approaches have been proposed to infer the types of APIs. Although existing approaches can achieve good performance, their performance is limited by ignoring other information outside the input code snippet (e.g., the descriptions of similar code snippets) that could potentially improve the performance. In this paper, we propose a novel type inference approach, named CKTyper, by leveraging crowdsourcing knowledge in technical posts. The key idea is to generate a relevant context for a target code snippet from the posts containing similar code snippets and then employ the context to promote the type inference with large language models (e.g., ChatGPT). More specifically, we build a crowdsourcing knowledge base (CKB) by extracting code snippets from a large set of posts and index the CKB using Lucene. An API type dictionary is also built from a set of API libraries. Given a code snippet to be inferred, we first retrieve a list of similar code snippets from the indexed CKB. Then, we generate a crowdsourcing knowledge context (CKC) by extracting and summarizing useful content (e.g., API-related sentences) in the posts that contain the similar code snippets. The CKC is subsequently used to improve the type inference of ChatGPT on the input code snippet. The hallucination of ChatGPT is eliminated by employing the API type dictionary. Evaluation results on two open-source datasets demonstrate the effectiveness and efficiency of CKTyper. CKTyper achieves the optimal precision/recall of 97.80% and 95.54% on both datasets, respectively, significantly outperforming three state-of-the-art baselines and ChatGPT. ![]() |
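The retrieve-then-prompt idea described above can be illustrated with a toy sketch. The in-memory similarity "index" and the `ask_llm` placeholder are assumptions; the actual tool uses a Lucene index, an API type dictionary, and ChatGPT.

```python
# Toy sketch of retrieval-augmented API type inference in the spirit of the
# approach above: retrieve posts with similar code, summarize API-related
# context, and prompt an LLM. Posts are assumed to be dicts with "code" and
# "summary" keys; `ask_llm` is a hypothetical LLM client.
from collections import Counter
from math import sqrt
from typing import Callable, Dict, List


def tokenize(code: str) -> Counter:
    return Counter(tok for tok in code.replace("(", " ").replace(")", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na, nb = sqrt(sum(v * v for v in a.values())), sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(snippet: str, knowledge_base: List[Dict], k: int = 3) -> List[Dict]:
    """Rank crowdsourced posts by code similarity to the query snippet."""
    query = tokenize(snippet)
    ranked = sorted(knowledge_base, key=lambda post: cosine(query, tokenize(post["code"])), reverse=True)
    return ranked[:k]


def infer_types(snippet: str, knowledge_base: List[Dict], ask_llm: Callable[[str], str]) -> str:
    """Build a crowdsourcing-knowledge context and ask the LLM for API types."""
    context = "\n".join(f"- {post['summary']}" for post in retrieve(snippet, knowledge_base))
    prompt = (
        "Given the following context about similar code snippets:\n"
        f"{context}\n\n"
        "Infer the fully qualified type of each API used in this snippet:\n"
        f"{snippet}\n"
    )
    return ask_llm(prompt)
```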
|
Zhang, Qian |
![]() Jiyuan Wang, Yuxin Qiu, Ben Limpanukorn, Hong Jin Kang, Qian Zhang, and Miryung Kim (University of California at Los Angeles, USA; University of California at Riverside, USA) In recent years, the MLIR framework has had explosive growth due to the need for extensible deep learning compilers for hardware accelerators. Such examples include Triton, CIRCT, and ONNX-MLIR. MLIR compilers introduce significant complexities in localizing bugs or inefficiencies because of their layered optimization and transformation process with compilation passes. While existing delta debugging techniques can be used to identify a minimum subset of IR code that reproduces a given bug symptom, their naive application to MLIR is time-consuming because real-world MLIR compilers usually involve a large number of compilation passes. Compiler developers must identify a minimized set of relevant compilation passes to reduce the footprint of MLIR compiler code to be inspected for a bug fix. We propose DuoReduce, a dual-dimensional reduction approach for MLIR bug localization. DuoReduce leverages three key ideas in tandem to design an efficient MLIR delta debugger. First, DuoReduce reduces compiler passes that are irrelevant to the bug by identifying ordering dependencies among the different compilation passes. Second, DuoReduce uses MLIR-semantics-aware transformations to expedite IR code reduction. Finally, DuoReduce leverages cross-dependence between the IR code dimension and the compilation pass dimension by accounting for which IR code segments are related to which compilation passes to reduce unused passes. Experiments with three large-scale MLIR compiler projects find that DuoReduce outperforms syntax-aware reducers such as Perses and Vulcan in terms of IR code reduction by 31.6% and 21.5% respectively. If one uses these reducers by enumerating all possible compilation passes (on average 18 passes), it could take up to 145 hours. By identifying ordering dependencies among compilation passes, DuoReduce reduces this time to 9.5 minutes. By identifying which compilation passes are unused for compiling reduced IR code, DuoReduce reduces the number of passes by 14.6%. This translates to not needing to examine 281 lines of MLIR compiler code on average to fix the bugs. DuoReduce has the potential to significantly reduce debugging effort in MLIR compilers, which serves as the foundation for the current landscape of machine learning and hardware accelerators. ![]() |
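One dimension of the dual reduction described above, pruning compilation passes while preserving the bug symptom, can be sketched with a simple greedy one-at-a-time removal loop. The `reproduces_bug` oracle is a hypothetical placeholder that would invoke the MLIR pipeline, and the pass names in the demo are only for illustration.

```python
# Illustrative sketch of pruning compiler passes while preserving a bug
# symptom, i.e., one dimension of the dual reduction described above.
# `reproduces_bug(ir, passes)` is a hypothetical oracle.
from typing import Callable, List, Sequence


def reduce_passes(ir: str, passes: Sequence[str],
                  reproduces_bug: Callable[[str, Sequence[str]], bool]) -> List[str]:
    """Greedily drop passes that are not needed to reproduce the bug."""
    kept = list(passes)
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            candidate = kept[:i] + kept[i + 1:]
            if reproduces_bug(ir, candidate):
                kept = candidate          # the removed pass was irrelevant
                changed = True
                break
    return kept


if __name__ == "__main__":
    # Toy oracle: the bug only needs "canonicalize" followed by "lower-to-llvm".
    def oracle(_ir: str, passes: Sequence[str]) -> bool:
        p = list(passes)
        return "canonicalize" in p and "lower-to-llvm" in p and \
            p.index("canonicalize") < p.index("lower-to-llvm")

    all_passes = ["inline", "canonicalize", "cse", "loop-fusion", "lower-to-llvm"]
    print(reduce_passes("module {}", all_passes, oracle))
    # -> ['canonicalize', 'lower-to-llvm']
```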
|
Zhang, Quan |
![]() Lihua Guo, Yiwei Hou, Chijin Zhou, Quan Zhang, and Yu Jiang (Tsinghua University, China) In recent years, the increase in ransomware attacks has significantly impacted individuals and organizations. Many strategies have been proposed to detect ransomware’s file disruption operation. However, they rely on certain assumptions that gradually fail in the face of evolving ransomware attacks, which use more stealthy encryption methods or benign-imitation-based I/O orchestration. To mitigate this, we propose an approach to detect ransomware attacks through temporal correlation between encryption and I/O behaviors. Its key idea is that there is a strong temporal correlation inherent in ransomware’s encryption and I/O behaviors. To disrupt files, ransomware must first read the file data from the disk into memory, encrypt it, and then write the encrypted data back to the disk. This process creates a pronounced temporal correlation between the computation load of the encryption operations and the size of the files being encrypted. Based on this invariant, we implement a prototype called RansomRadar and evaluate its effectiveness against 411 latest ransomware samples from 89 families. Experimental results show that it achieves a detection rate of 100.00% with 2.68% false alarms. Its F1-score is 96.03, higher than that of existing detectors by 31.82 on average. ![]() |
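The temporal-correlation invariant described above lends itself to a small sketch: correlate per-interval encryption workload with per-interval file-write volume and alarm when the two track each other. The telemetry values in the demo are synthetic; collecting real crypto and I/O measurements is outside the scope of this illustration.

```python
# Minimal sketch of the temporal-correlation invariant described above:
# if per-interval encryption workload tracks per-interval file-write volume,
# raise a ransomware alarm. The monitoring inputs here are synthetic numbers.
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def ransomware_alarm(crypto_load: Sequence[float], bytes_written: Sequence[float],
                     threshold: float = 0.9) -> bool:
    """Alarm when encryption load and written volume are strongly correlated."""
    return pearson(crypto_load, bytes_written) >= threshold


if __name__ == "__main__":
    # Encryption load closely follows file sizes -> typical ransomware pattern.
    load = [1.0, 5.2, 9.8, 2.1, 7.4]
    written = [1.1, 5.0, 10.2, 2.0, 7.7]
    print(ransomware_alarm(load, written))   # True
    # Unrelated I/O (e.g., a backup tool without encryption) -> no alarm.
    print(ransomware_alarm([0.1, 0.1, 0.2, 0.1, 0.1], written))  # False
```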
|
Zhang, Quanjun |
![]() Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Zhenyu Chen, Baowen Xu, and Yang Liu (Nanyang Technological University, Singapore; Nanjing University, China) Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. This poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could jeopardize life safety. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper, we innovatively propose a backdoor elimination technique for secure code understanding, called EliBadCode. EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of non-backdoor (adversarial) perturbations for subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. Backdoor triggers can be viewed as backdoor (adversarial) perturbations. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode substantially decreases the average Attack Success Rate (ASR) of the advanced backdoor attack from 99.76% to 2.64%, significantly outperforming the three baselines. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline). ![]() |
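The first step described above, filtering the model vocabulary down to tokens that could plausibly serve as triggers under a language's identifier naming conventions, can be illustrated with a small sketch. The tiny vocabulary and the BPE space-marker handling are assumptions; this is not the full trigger inversion and unlearning pipeline.

```python
# Illustrative sketch of the vocabulary-filtering step described above:
# keep only tokens that could appear as identifiers under a language's naming
# conventions, to shrink the trigger search space before trigger inversion.
# The tiny vocabulary below is made up; a real model vocabulary is far larger.
import keyword
import re

IDENTIFIER_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def candidate_trigger_tokens(vocabulary, language="python"):
    """Filter a model vocabulary down to plausible identifier-like tokens."""
    reserved = set(keyword.kwlist) if language == "python" else set()
    candidates = []
    for token in vocabulary:
        stripped = token.lstrip("\u0120")          # BPE space marker, if any
        if not IDENTIFIER_RE.match(stripped):
            continue                               # punctuation, numbers, etc.
        if stripped in reserved:
            continue                               # keywords cannot be renamed
        candidates.append(token)
    return candidates


if __name__ == "__main__":
    vocab = ["def", "foo", "\u0120bar_1", "==", "42", "while", "payload"]
    print(candidate_trigger_tokens(vocab))
    # -> ['foo', '\u0120bar_1', 'payload']
```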
|
Zhang, Tianyi |
![]() Zhi Tu, Liangkun Niu, Wei Fan, and Tianyi Zhang (Purdue University, USA) ![]() ![]() Weihao Chen, Jia Lin Cheoh, Manthan Keim, Sabine Brunswicker, and Tianyi Zhang (Purdue University, USA) ![]() |
|
Zhang, Ting |
![]() Yue Li, Bohan Liu, Ting Zhang, Zhiqi Wang, David Lo, Lanxin Yang, Jun Lyu, and He Zhang (Nanjing University, China; Singapore Management University, Singapore) ![]() |
|
Zhang, Wenhui |
![]() Zhengmin Yu, Yuan Zhang, Ming Wen, Yinan Nie, Wenhui Zhang, and Min Yang (Fudan University, China; Huazhong University of Science and Technology, China) ![]() |
|
Zhang, Wensheng |
![]() Shuwei Song, Ting Chen, Ao Qiao, Xiapu Luo, Leqing Wang, Zheyuan He, Ting Wang, Xiaodong Lin, Peng He, Wensheng Zhang, and Xiaosong Zhang (University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada) Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them. ![]() |
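The abstract above centers on recognizing the container variables that hold token balances, because a transfer is, at its core, value moving between two entries of such a container while the total is preserved. The sketch below models only that language- and standard-independent invariant over Python dictionaries; it is not the bytecode-level analysis performed by CRYPTO-SCOUT.

```python
# Illustrative sketch of the standard-independent invariant behind a token
# transfer: value moves between two entries of a holder container while the
# total is preserved. Here the container is modeled as a plain Python dict.
from typing import Dict, Tuple

Snapshot = Dict[str, int]   # holder address -> balance


def looks_like_token_transfer(before: Snapshot, after: Snapshot) -> Tuple[bool, str, str]:
    """Return (is_transfer, sender, receiver) if exactly one holder lost the
    amount that exactly one other holder gained and the total is unchanged."""
    if sum(before.values()) != sum(after.values()):
        return False, "", ""
    holders = set(before) | set(after)
    deltas = {h: after.get(h, 0) - before.get(h, 0) for h in holders}
    losers = [h for h, d in deltas.items() if d < 0]
    gainers = [h for h, d in deltas.items() if d > 0]
    if len(losers) == 1 and len(gainers) == 1 and -deltas[losers[0]] == deltas[gainers[0]]:
        return True, losers[0], gainers[0]
    return False, "", ""


if __name__ == "__main__":
    before = {"0xalice": 100, "0xbob": 20}
    after = {"0xalice": 70, "0xbob": 50}
    print(looks_like_token_transfer(before, after))   # (True, '0xalice', '0xbob')
```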
|
Zhang, Xiangyu |
![]() Sayali Kate, Yifei Gao, Shiwei Feng, and Xiangyu Zhang (Purdue University, USA) The increasingly popular Robot Operating System (ROS) framework allows building robotic systems by integrating newly developed and/or reused modules, where the modules can use different versions of the framework (e.g., ROS1 or ROS2) and programming language (e.g., C++ or Python). The majority of such robotic systems' work happens in callbacks. The framework provides various elements for initializing callbacks and for setting up the execution of callbacks. It is the responsibility of developers to compose callbacks and their execution setup elements, and mistakes in this composition can lead to inconsistencies in callback execution setup due to developers' incomplete knowledge of the semantics of elements in various versions of the framework. Some of these inconsistencies do not throw errors at runtime, making their detection difficult for developers. We propose a static approach to detecting such inconsistencies by extracting a static view of the composition of the robotic system's callbacks and their execution setup, and then checking it against the composition conventions based on the elements' semantics. We evaluate our ROSCallBaX prototype on a dataset created from posts on developer forums and ROS projects that are publicly available. The evaluation results show that our approach can detect real inconsistencies. ![]() |
|
Zhang, Xiaodong |
![]() Songyang Yan, Xiaodong Zhang, Kunkun Hao, Haojie Xin, Yonggang Luo, Jucheng Yang, Ming Fan, Chao Yang, Jun Sun, and Zijiang Yang (Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China) The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS's reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulators using various ADSs. We verify OSG's ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels. ![]() ![]() |
|
Zhang, Xiaosong |
![]() Shuwei Song, Ting Chen, Ao Qiao, Xiapu Luo, Leqing Wang, Zheyuan He, Ting Wang, Xiaodong Lin, Peng He, Wensheng Zhang, and Xiaosong Zhang (University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada) Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them. ![]() |
|
Zhang, Yang |
![]() Yiwen Wu, Yang Zhang, Tao Wang, Bo Ding, and Huaimin Wang (National University of Defense Technology, China; State Key Laboratory of Complex and Critical Software Environment, China) Docker building is a critical component of containerization in modern software development, automating the process of packaging and converting sources into container images. It is not uncommon to find that Docker build faults (DBFs) occur frequently across Docker-based projects, inducing non-negligible costs in development activities. DBF resolution is a challenging problem, and previous studies have demonstrated that developers spend non-trivial time in resolving encountered build faults. However, the characteristics of DBFs are still under-investigated, hindering practical solutions to build management in the Docker community. In this paper, to bridge this gap, we present a comprehensive study of real-world DBFs in practice. We collect and analyze a DBF dataset of 255 issues and 219 posts from GitHub, Stack Overflow, and Docker Forum. We investigate and construct characteristic taxonomies for the DBFs, including 15 symptoms, 23 root causes, and 35 fix patterns. Moreover, we study the fault distributions of symptoms and root causes, in terms of the different build types, i.e., Dockerfile builds and Docker-compose builds. Based on the results, we provide actionable implications and develop a knowledge-based application, which can potentially facilitate research and assist developers in improving Docker build management. ![]() |
|
Zhang, Yinqian |
![]() Yinxi Liu, Wei Meng, and Yinqian Zhang (Rochester Institute of Technology, USA; Chinese University of Hong Kong, China; Southern University of Science and Technology, China) Ethereum smart contracts determine state transition results not only by the previous states, but also by a mutable global state consisting of storage variables. This has resulted in state-inconsistency bugs, which grant an attacker the ability to modify contract states either through recursive function calls to a contract (reentrancy), or by exploiting transaction order dependence (TOD). Current studies have determined that identifying data races on global storage variables can capture all state-inconsistency bugs. Nevertheless, eliminating false positives poses a significant challenge, given the extensive number of execution paths that could potentially cause a data race. For simplicity, existing research considers a data race to be vulnerable as long as the variable involved could have inconsistent values under different execution orders. However, such a data race could be benign when the inconsistent value does not affect any critical computation or decision-making process in the program. Besides, the data race could also be infeasible when there is no valid state in the contract that allows the execution of both orders. In this paper, we aim to appreciably reduce these false positives without introducing false negatives. We present DivertScan, a precise framework to detect exploitable state-inconsistency bugs in smart contracts. We first introduce the use of flow divergence to check where the involved variable may flow to. This allows DivertScan to precisely infer the potential effects of a data race and determine whether it can be exploited for inducing unexpected program behaviors. We also propose multiplex symbolic execution to examine different execution orders in one time of solving. This helps DivertScan to determine whether a common starting state could potentially exist. To address the scalability issue in symbolic execution, DivertScan utilizes an overapproximated pre-checking and a selective exploration strategy. As a result, it only needs to explore a limited state space. DivertScan significantly outperformed state-of-the-art tools by improving the precision rate by 20.72% to 74.93% while introducing no false negatives. It also identified five exploitable real-world vulnerabilities that other tools missed. The detected vulnerabilities could potentially lead to a loss of up to $68.2M, based on trading records and rate limits. ![]() |
|
Zhang, Yuan |
![]() Zhengmin Yu, Yuan Zhang, Ming Wen, Yinan Nie, Wenhui Zhang, and Min Yang (Fudan University, China; Huazhong University of Science and Technology, China) ![]() |
|
Zhang, Yuanliang |
![]() Xingpei Li, Yan Lei, Zhouyang Jia, Yuanliang Zhang, Haoran Liu, Liqian Chen, Wei Dong, and Shanshan Li (National University of Defense Technology, China; Chongqing University, China) As deep learning (DL) frameworks become widely used, converting models between frameworks is crucial for ecosystem flexibility. However, interestingly, existing model converters commonly focus on syntactic operator API mapping—transpiling operator names and parameters—which results in API compatibility issues (i.e., incompatible parameters, missing operators). These issues arise from semantic inconsistencies due to differences in operator implementation, causing conversion failure or performance degradation. In this paper, we present the first comprehensive study on operator semantic inconsistencies through API mapping analysis and framework source code inspection, revealing that 47% of sampled operators exhibit inconsistencies across DL frameworks, with source code limited to individual layers and no inter-layer interactions. This suggests that layer-wise source code alignment is feasible. Based on this, we propose ModelX, a source-level cross-framework conversion approach that extends operator API mapping by addressing semantic inconsistencies beyond the API level. Experiments on PyTorch-to-Paddle conversion show that ModelX successfully converts 624 out of 686 sampled operators and outperforms two state-of-the-art converters and three popular large language models. Moreover, ModelX achieves minimal metric gaps (avg. all under 3.4%) across 52 models from vision, text, and audio tasks, indicating strong robustness. ![]() ![]() Haoran Liu, Shanshan Li, Zhouyang Jia, Yuanliang Zhang, Linxiao Bai, Si Zheng, Xiaoguang Mao, and Xiangke Liao (National University of Defense Technology, China) ![]() |
|
Zhang, Yun |
![]() Zuobin Wang, Zhiyuan Wan, Yujing Chen, Yun Zhang, David Lo, Difan Xie, and Xiaohu Yang (Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang), China; Hangzhou City University, China; Singapore Management University, Singapore) In smart contract development, practitioners frequently reuse code to reduce development effort and avoid reinventing the wheel. This reused code, whether identical or similar to its original source, is referred to as a code clone. Unintentional code cloning can propagate flaws and vulnerabilities, potentially undermining the reliability and maintainability of software systems. Previous studies have identified a significant prevalence of code clones in Solidity smart contracts on the Ethereum blockchain. To mitigate the risks posed by code clones, clone detection has emerged as an active field of research and practice in software engineering. Recent studies have extended existing techniques or proposed novel techniques tailored to the unique syntactic and semantic features of Solidity. Nonetheless, the evaluations of existing techniques, whether conducted by their original authors or independent researchers, involve codebases in various programming languages and utilize different versions of the corresponding tools. The resulting inconsistency makes direct comparisons of the evaluation results impractical, and hinders the ability to derive meaningful conclusions across the evaluations. There remains a lack of clarity regarding the effectiveness of these techniques in detecting smart contract clones, and whether it is feasible to combine different techniques to achieve scalable yet accurate detection of code clones in smart contracts. To address this gap, we conduct a comprehensive empirical study that evaluates the effectiveness and scalability of five representative clone detection techniques on 33,073 verified Solidity smart contracts, along with a benchmark we curate, in which we manually label 72,010 pairs of Solidity smart contracts with clone tags. Moreover, we explore the potential of combining different techniques to achieve optimal performance of code clone detection for smart contracts, and propose SourceREClone, a framework designed for the refined integration of different techniques, which achieves a 36.9% improvement in F1 score compared to a straightforward combination of the state of the art. Based on our findings, we discuss implications, provide recommendations for practitioners, and outline directions for future research. ![]() |
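Since the study above evaluates Type-1 and Type-2 clone detectors for Solidity, a minimal sketch of the normalization behind those clone types may help: Type-1 ignores comments and whitespace, while Type-2 additionally abstracts identifiers and literals to placeholders. This toy is not one of the five evaluated tools, and the keyword list is an illustrative subset.

```python
# Illustrative sketch of Type-1/Type-2 clone checks via token normalization.
# Type-1 ignores comments/whitespace; Type-2 additionally abstracts
# identifiers and literals to placeholders. The keyword set is a small subset.
import re

COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.S)
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"function", "returns", "return", "uint256", "public", "pure", "if", "else"}


def tokens(code: str):
    return TOKEN_RE.findall(COMMENT_RE.sub(" ", code))


def type2_normalize(toks):
    out = []
    for t in toks:
        if t in KEYWORDS:
            out.append(t)
        elif re.fullmatch(r"[A-Za-z_]\w*", t):
            out.append("ID")        # abstract identifiers
        elif re.fullmatch(r"\d+", t):
            out.append("LIT")       # abstract literals
        else:
            out.append(t)
    return out


def is_type1_clone(a: str, b: str) -> bool:
    return tokens(a) == tokens(b)


def is_type2_clone(a: str, b: str) -> bool:
    return type2_normalize(tokens(a)) == type2_normalize(tokens(b))


if __name__ == "__main__":
    f1 = "function add(uint256 a, uint256 b) public pure returns (uint256) { return a + b; }"
    f2 = "function sum(uint256 x, uint256 y) public pure returns (uint256) { return x + y; } // renamed"
    print(is_type1_clone(f1, f2), is_type2_clone(f1, f2))   # False True
```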
|
Zhang, Yuqun |
![]() Weiyuan Tong, Zixu Wang, Zhanyong Tang, Jianbin Fang, Yuqun Zhang, and Guixin Ye (NorthWest University, China; National University of Defense Technology, China; Southern University of Science and Technology, China) ![]() |
|
Zhang, Yuxia |
![]() Meng Fan, Yuxia Zhang, Klaas-Jan Stol, and Hui Liu (Beijing Institute of Technology, China; University College Cork, Ireland; Lero, Ireland; SINTEF, Norway) Continued contributions of core developers in open source software (OSS) projects are key for sustaining and maintaining successful OSS projects. A major risk to the sustainability of OSS projects is developer turnover. Prior studies have explored developer turnover at the level of individual projects. A shortcoming of such studies is that they ignore the impact of developer turnover on downstream projects. Yet, an awareness of the turnover of core developers offers useful insights to the rest of an open source ecosystem. This study performs a large-scale empirical analysis of code developer turnover in the Rust package ecosystem. We find that the turnover of core developers is quite common in the whole Rust ecosystem with 36,991 packages. This is particularly worrying as a vast majority of Rust packages only have a single core developer. We found that core developer turnover can significantly decrease the quality and efficiency of software development and maintenance, even leading to deprecation. This is a major source of concern for those Rust packages that are widely used. We surveyed developers' perspectives on the turnover of core developers in upstream packages. We found that developers widely agreed that core developer turnover can affect project stability and sustainability. They also emphasized the importance of transparency and timely notifications regarding the health status of upstream dependencies. This study provides unique insights to help communities focus on building reliable software dependency networks. ![]() ![]() Mian Qin, Yuxia Zhang, Klaas-Jan Stol, and Hui Liu (Beijing Institute of Technology, China; University College Cork, Ireland; Lero, Ireland; SINTEF, Norway) Open Source Software (OSS) projects are no longer only developed by volunteers. Instead, many organizations, from early-stage startups to large global enterprises, actively participate in many well-known projects. The survival and success of OSS projects rely on long-term contributors, who have extensive experience and knowledge. While prior literature has explored volunteer turnover in OSS, there is a paucity of research on company turnover in OSS ecosystems. Given the intensive involvement of companies in OSS and the different nature of corporate contributors vis-a-vis volunteers, it is important to investigate company turnover in OSS projects. This study first explores the prevalence and characteristics of companies that discontinue contributing to OSS projects, and then develops models to predict companies’ turnover. Based on a study of the Linux kernel, we analyze the early-stage behavior of 1,322 companies that have contributed to the project. We find that approximately 12% of companies discontinue contributing each year; one-sixth of those used to be core contributing companies (those that ranked in the top 20% by commit volume). Furthermore, withdrawing companies tend to have a lower intensity and scope of contributions, make primarily perfective changes, collaborate less, and operate on a smaller scale. We propose a Temporal Convolutional Network (TCN) deep learning model based on these indicators to predict whether companies will discontinue. The evaluation results show that the model achieves an AUC metric of .76 and an accuracy of .71. 
We evaluated the model in two other OSS projects, Rust and OpenStack, and the performance remains stable. ![]() ![]() Waseem Akram, Yanjie Jiang, Yuxia Zhang, Haris Ali Khan, and Hui Liu (Beijing Institute of Technology, China; Peking University, China) Accurate method naming is crucial for code readability and maintainability. However, manually creating concise and meaningful names remains a significant challenge. To this end, in this paper, we propose an approach based on Large Language Model (LLMs) to suggest method names according to function descriptions. The key of the approach is ContextCraft, an automated algorithm for generating context-rich prompts for LLM that suggests the expected method names according to the prompts. For a given query (functional description), it retrieves a few best examples whose functional descriptions have the greatest similarity with the query. From the examples, it identifies tokens that are likely to appear in the final method name as well as their likely positions, picks up pivot words that are semantically related to tokens in the according method names, and specifies the evaluation results of the LLM on the selected examples. All such outputs (tokens with probabilities and position information, pivot words accompanied by associated name tokens and similarity scores, and evaluation results) together with the query and the selected examples are then filled in a predefined prompt template, resulting in a context-rich prompt. This context-rich prompt reduces the randomness of LLMs by focusing the LLM’s attention on relevant contexts, constraining the solution space, and anchoring results to meaningful semantic relationships. Consequently, the LLM leverages this prompt to generate the expected method name, producing a more accurate and relevant suggestion. We evaluated the proposed approach with 43k real-world Java and Python methods accompanied by functional descriptions. Our evaluation results suggested that it significantly outperforms the state-of-the-art approach RNN-att-Copy, improving the chance of exact match by 52% and decreasing the edit distance between generated and expected method names by 32%. Our evaluation results also suggested that the proposed approach worked well for various LLMs, including ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini-1.5, and Llama-3. ![]() ![]() Haris Ali Khan, Yanjie Jiang, Qasim Umer, Yuxia Zhang, Waseem Akram, and Hui Liu (Beijing Institute of Technology, China; Peking University, China; King Fahd University of Petroleum and Minerals, Saudi Arabia) It is often valuable to know whether a given piece of source code has or hasn’t been used to train a given deep learning model. On one side, it helps avoid data contamination problems that may exaggerate the performance of evaluated models. Conversely, it facilitates copyright protection by identifying private or protected code leveraged for model training without permission. To this end, automated approaches have been proposed for the detection, known as data contamination detection. Such approaches often heavily rely on the confidence of the involved models, assuming that the models should be more confident in handling contaminated data than cleaned data. However, such approaches do not consider the nature of the given data item, i.e., how difficult it is to predict the given item. Consequently, difficult-to-predict contaminated data and easy-to-predict cleaned data are often misclassified. 
As an initial attempt to solve this problem, this paper presents a naturalness-based approach, called Natural-DaCoDe, for code-completion models to distinguish contaminated source code from cleaned ones. Natural-DaCoDe leverages code naturalness to quantitatively measure the difficulty of a given source code for code-completion models. It then trains a classifier to distinguish contaminated source code according to both code naturalness and the performance of the code-completion models on the given source code. We evaluate Natural-DaCoDe with two pre-trained large language models (i.e., ChatGPT and Claude) and two code-completion models that we trained from scratch for detecting contaminated data. Our evaluation results suggest that Natural-DaCoDe substantially outperformed the state-of-the-art approaches in detecting contaminated data, improving the average accuracy by 61.78%. We also evaluate Natural-DaCoDe with the method name suggestion task, and it remains more accurate than the state-of-the-art approaches, improving the accuracy by 54.39%. Furthermore, Natural-DaCoDe was tested on a natural language text benchmark, significantly outperforming the state-of-the-art approaches by 22%. This may suggest that Natural-DaCoDe could be applied to various source-code-related tasks besides code completion. ![]() |
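The naturalness signal described above can be illustrated with a toy sketch: estimate how predictable a snippet is under a simple bigram language model (average cross-entropy), then combine that difficulty measure with the completion model's own performance. The threshold rule in the sketch is an illustrative stand-in for the trained classifier used by the approach.

```python
# Illustrative sketch of a code "naturalness" signal: average cross-entropy of
# a snippet under a simple bigram language model trained on a reference corpus.
# The paper combines such a difficulty measure with the code-completion model's
# own performance in a trained classifier; here we only show a toy decision rule.
import math
from collections import Counter, defaultdict
from typing import Iterable, List


def bigram_model(corpus: Iterable[List[str]]):
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for toks in corpus:
        unigrams.update(toks)
        for a, b in zip(toks, toks[1:]):
            bigrams[a][b] += 1
    vocab = max(len(unigrams), 1)

    def prob(prev: str, tok: str) -> float:   # add-one smoothing
        return (bigrams[prev][tok] + 1) / (sum(bigrams[prev].values()) + vocab)

    return prob


def cross_entropy(toks: List[str], prob) -> float:
    if len(toks) < 2:
        return 0.0
    return -sum(math.log2(prob(a, b)) for a, b in zip(toks, toks[1:])) / (len(toks) - 1)


def likely_contaminated(naturalness: float, completion_accuracy: float) -> bool:
    """Toy rule: unnaturally hard code that the model still completes well is
    suspicious. The real approach trains a classifier instead of thresholding."""
    return naturalness > 6.0 and completion_accuracy > 0.9


if __name__ == "__main__":
    corpus = [["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]] * 20
    prob = bigram_model(corpus)
    snippet = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
    print(round(cross_entropy(snippet, prob), 2))
```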
|
Zhang, Zenong |
![]() Amiao Gao, Zenong Zhang, Simin Wang, LiGuo Huang, Shiyi Wei, and Vincent Ng (Southern Methodist University, Dallas, USA; University of Texas at Dallas, USA) ![]() |
|
Zhang, Zhaoxu |
![]() Zhaoxu Zhang, Komei Ryu, Tingting Yu, and William G.J. Halfond (University of Southern California, USA; University of Connecticut, USA) Bug report reproduction is a crucial but time-consuming task to be carried out during mobile app maintenance. To accelerate this process, researchers have developed automated techniques for reproducing mobile app bug reports. However, due to the lack of an effective mechanism to recognize different buggy behaviors described in the report, existing work is limited to reproducing crash bug reports, or requires developers to manually analyze execution traces to determine if a bug was successfully reproduced. To address this limitation, we introduce a novel technique to automatically identify and extract the buggy behavior from the bug report and detect it during the automated reproduction process. To accommodate various buggy behaviors of mobile app bugs, we conducted an empirical study and created a standardized representation for expressing the bug behavior identified from our study. Given a report, our approach first transforms the documented buggy behavior into this standardized representation, then matches it against real-time device and UI information during the reproduction to recognize the bug. Our empirical evaluation demonstrated that our approach achieved over 90% precision and recall in generating the standardized representation of buggy behaviors. It correctly identified bugs in 83% of the bug reports and enhanced existing reproduction techniques, allowing them to reproduce four times more bug reports. ![]() |
|
Zhao, JunPeng |
![]() Susheng Wu, Ruisi Wang, Yiheng Cao, Bihuan Chen, Zhuotong Zhou, Yiheng Huang, JunPeng Zhao, and Xin Peng (Fudan University, China) Branching repositories facilitates efficient software development but can also inadvertently propagate vulnerabilities. When an original branch is patched, other unfixed branches remain vulnerable unless the patch is successfully ported. However, due to inherent discrepancies between branches, many patches cannot be directly applied and require manual intervention, which is time-consuming and leads to delays in patch porting, increasing vulnerability risks. Existing automated patch porting approaches are prone to errors, as they often overlook essential semantic and syntactic context of vulnerability and fail to detect or refine faulty patches. We propose Mystique, a novel LLM-based approach to address these limitations. Mystique first slices the semantic-related statements linked to the vulnerability while ensuring syntactic correctness, allowing it to extract the signatures for both the original patched function and the target vulnerable function. Mystique then utilizes a fine-tuned LLM to generate a fixed function, which is further iteratively checked and refined to ensure successful porting. Our evaluation shows that Mystique achieved a success rate of 0.954 at function level and of 0.924 at CVE level, outperforming state-of-the-art approaches by at least 13.2% at function level and 12.3% at CVE level. Our evaluation also demonstrates Mystique’s superior generality across various projects, bugs, and programming languages. Mystique successfully ported patches for 34 real-world vulnerable branches. ![]() ![]() |
|
Zhao, Kunsong |
![]() Kunsong Zhao, Zihao Li, Weimin Chen, Xiapu Luo, Ting Chen, Guozhu Meng, and Yajin Zhou (Hong Kong Polytechnic University, China; University of Electronic Science and Technology of China, China; Institute of Information Engineering at Chinese Academy of Sciences, China; Zhejiang University, China) WebAssembly has become the preferred smart contract format for various blockchain platforms due to its high portability and near-native execution speed. To effectively understand WebAssembly contracts, it is crucial to recover high-level type signatures because of the limited type information that WebAssembly provides. However, existing studies on type inference for smart contracts primarily center around Ethereum Virtual Machine bytecode, which is not applicable to WebAssembly owing to their differing targets and runtime semantics. This paper introduces WasmHint, a novel solution that leverages deep learning inference to automatically recover high-level parameter and return types from WebAssembly contracts. More specifically, WasmHint constructs a wCFG representation to clarify dependencies within WebAssembly code and simulates its execution to capture type-related operational information. By learning comprehensive code semantics, it infers parameter and return types, with a semantic corrector designed to enhance information coordination. We conduct experiments on a newly constructed dataset containing 77,208 WebAssembly contract functions. The results demonstrate that WasmHint achieves inference accuracies of 80.0% for parameter types and 95.8% for return types, with average improvements of 86.6% and 34.0% over the baseline methods, respectively. ![]() |
|
Zhao, Yanjie |
![]() Ting Zhou, Yanjie Zhao, Xinyi Hou, Xiaoyu Sun, Kai Chen, and Haoyu Wang (Huazhong University of Science and Technology, China; Australian National University, Australia) Declarative UI frameworks have gained widespread adoption in mobile app development, offering benefits such as improved code readability and easier maintenance. Despite these advantages, the process of translating UI designs into functional code remains challenging and time-consuming. Recent advancements in multimodal large language models (MLLMs) have shown promise in directly generating mobile app code from user interface (UI) designs. However, the direct application of MLLMs to this task is limited by challenges in accurately recognizing UI components and comprehensively capturing interaction logic. To address these challenges, we propose DeclarUI, an automated approach that synergizes computer vision (CV), MLLMs, and iterative compiler-driven optimization to generate and refine declarative UI code from designs. DeclarUI enhances visual fidelity, functional completeness, and code quality through precise component segmentation, Page Transition Graphs (PTGs) for modeling complex inter-page relationships, and iterative optimization. In our evaluation, DeclarUI outperforms baselines on React Native, a widely adopted declarative UI framework, achieving a 96.8% PTG coverage rate and a 98% compilation success rate. Notably, DeclarUI demonstrates significant improvements over state-of-the-art MLLMs, with a 123% increase in PTG coverage rate, up to 55% enhancement in visual similarity scores, and a 29% boost in compilation success rate. We further demonstrate DeclarUI’s generalizability through successful applications to Flutter and ArkUI frameworks. User studies with professional developers confirm that DeclarUI’s generated code meets industrial-grade standards in code availability, modification time, readability, and maintainability. By streamlining app development, improving efficiency, and fostering designer-developer collaboration, DeclarUI offers a practical solution to the persistent challenges in mobile UI development. ![]() ![]() Chenxu Wang, Tianming Liu, Yanjie Zhao, Minghui Yang, and Haoyu Wang (Huazhong University of Science and Technology, China; Monash University, Australia; OPPO, China) With the rapid development of Large Language Models (LLMs), their integration into automated mobile GUI testing has emerged as a promising research direction. However, existing LLM-based testing approaches face significant challenges, including time inefficiency and high costs due to constant LLM querying. To address these issues, this paper introduces LLMDroid, a novel testing framework designed to enhance existing automated mobile GUI testing tools by leveraging LLMs more efficiently. The workflow of LLMDroid comprises two main stages: Autonomous Exploration and LLM Guidance. During Autonomous Exploration, LLMDroid utilizes existing testing tools while leveraging LLMs to summarize explored pages. When code coverage growth slows, it transitions to LLM Guidance to strategically direct testing towards unexplored functionalities. This approach minimizes LLM interactions while maximizing their impact on test coverage. We applied LLMDroid to three popular open-source Android testing tools and evaluated it on 14 top-listed apps from Google Play. Results demonstrate an average increase of 26.16% in code coverage and 29.31% in activity coverage. 
Furthermore, our evaluation under different LLMs reveals that LLMDroid outperforms existing step-wise approaches with significant cost efficiency, achieving the best performance among the tested models at $0.49 per hour using GPT-4o, with a cost-effective alternative achieving 94% of this performance at just $0.03 per hour. These findings highlight LLMDroid’s effectiveness in enhancing automated mobile app testing and its potential for widespread adoption. ![]() |
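The two-stage workflow described for LLMDroid can be sketched as a simple control loop: keep the underlying automated explorer running, and query the LLM for guidance only when coverage growth stalls. The callables `explorer_step`, `current_coverage`, `ask_llm_for_target`, and `navigate_to` are hypothetical placeholders for the real tooling.

```python
# Sketch of the two-stage control loop described above: autonomous exploration
# with LLM guidance triggered only on coverage plateaus, so that LLM calls stay
# rare. All callables are hypothetical placeholders.
from typing import Callable


def test_loop(explorer_step: Callable[[], None],
              current_coverage: Callable[[], float],
              ask_llm_for_target: Callable[[], str],
              navigate_to: Callable[[str], None],
              budget_steps: int = 1000,
              plateau_window: int = 50,
              min_gain: float = 0.001) -> float:
    """Run cheap exploration; escalate to LLM guidance when coverage stalls."""
    last_switch_cov = current_coverage()
    steps_since_gain = 0
    for _ in range(budget_steps):
        explorer_step()                          # cheap, LLM-free exploration
        cov = current_coverage()
        if cov - last_switch_cov >= min_gain:
            last_switch_cov = cov
            steps_since_gain = 0
        else:
            steps_since_gain += 1
        if steps_since_gain >= plateau_window:   # coverage has plateaued
            target = ask_llm_for_target()        # one LLM call, not per step
            navigate_to(target)
            steps_since_gain = 0
    return current_coverage()
```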
|
Zhao, Ziming |
![]() Jiongchi Yu, Xie Xiaofei, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, and Frank Liau (Singapore Management University, Singapore; Tianjin University, China; Zhejiang University, China; Shanghai Jiao Tong University, China; University of Tokyo, Japan; University of Alberta, Canada; Southern Cross University, Australia; Government Technology Agency of Singapore, Singapore) ![]() |
|
Zheng, Han |
![]() Han Zheng, Flavio Toffalini, Marcel Böhme, and Mathias Payer (EPFL, Switzerland; Ruhr University Bochum, Germany; MPI-SP, Germany) Can a fuzzer cover more code with minimal corruption of the initial seed? Before a seed is fuzzed, the early greybox fuzzers first systematically enumerated slightly corrupted inputs by applying every mutation operator to every part of the seed, once per generated input. The hope of this so-called “deterministic” stage was that simple changes to the seed would be less likely to break the complex file format; the resulting inputs would find bugs in the program logic well beyond the program’s parser. However, when experiments showed that disabling the deterministic stage achieves more coverage, applying multiple mutation operators at the same time to a single input, most fuzzers disabled the deterministic stage by default. Instead of ignoring the deterministic stage, we analyze its potential and substantially improve deterministic exploration. Our deterministic stage is now the default in AFL++, reverting the earlier decision of dropping deterministic exploration. We start by investigating the overhead and the contribution of the deterministic stage to the discovery of coverage-increasing inputs. While the sheer number of generated inputs explains the overhead, we find that only a few critical seeds (20%), and only a few critical bytes in a seed (0.5%), are responsible for the vast majority of the coverage-increasing inputs (83% and 84%, respectively). Hence, we develop an efficient technique to identify these critical seeds/bytes so as to prune a large number of unnecessary inputs. The technique retains the benefits of the classic deterministic stage by only enumerating a tiny part of the total deterministic state space. We evaluate our implementation on two benchmarking frameworks, FuzzBench and Magma. Our evaluation shows that it outperforms state-of-the-art fuzzers with and without the (old) deterministic stage enabled, both in terms of coverage and bug finding. It also discovered 8 new CVEs on exhaustively fuzzed security-critical applications. Finally, the technique has been independently evaluated and integrated into AFL++ as the default option. ![]() |
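The notion of "critical bytes" above can be illustrated with a small sketch: flip each byte of a seed once and keep the offsets whose mutation changes the observed coverage. The `coverage_of` oracle is a placeholder for running an instrumented target; the technique integrated into AFL++ is far more efficient than this exhaustive loop.

```python
# Illustrative sketch of locating "critical bytes" in a seed: flip each byte
# once and keep the offsets whose mutation changes the observed coverage set.
# `coverage_of` is a hypothetical placeholder for the instrumented target.
from typing import Callable, FrozenSet, List

Coverage = FrozenSet[int]


def critical_bytes(seed: bytes, coverage_of: Callable[[bytes], Coverage]) -> List[int]:
    baseline = coverage_of(seed)
    critical = []
    for offset in range(len(seed)):
        mutated = bytearray(seed)
        mutated[offset] ^= 0xFF                 # simple one-byte corruption
        if coverage_of(bytes(mutated)) != baseline:
            critical.append(offset)             # this byte influences control flow
    return critical


if __name__ == "__main__":
    # Toy target: coverage depends only on the 4-byte magic header and byte 8.
    def toy_coverage(data: bytes) -> Coverage:
        edges = {0}
        if data[:4] == b"FUZZ":
            edges.add(1)
        if len(data) > 8 and data[8] % 2 == 0:
            edges.add(2)
        return frozenset(edges)

    seed = b"FUZZ" + b"\x00" * 8
    print(critical_bytes(seed, toy_coverage))   # -> [0, 1, 2, 3, 8]
```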
|
Zheng, Si |
![]() Haoran Liu, Shanshan Li, Zhouyang Jia, Yuanliang Zhang, Linxiao Bai, Si Zheng, Xiaoguang Mao, and Xiangke Liao (National University of Defense Technology, China) ![]() |
|
Zheng, Zibin |
![]() Weibin Wu, Haoxuan Hu, Zhaoji Fan, Yitong Qiao, Yizhan Huang, Yichen Li, Zibin Zheng, and Michael Lyu (Sun Yat-sen University, China; Chinese University of Hong Kong, China) Deep learning (DL) has revolutionized various software engineering tasks. Particularly, the emergence of AI code generators has pushed the boundaries of automatic programming to synthesize entire programs based on user-defined specifications in natural language. However, it remains a mystery if these AI code generators rely on the copy-and-paste programming practice, resulting in code clone concerns. In this work, to comprehensively study the code cloning behavior of AI code generators, we conduct an empirical study on three state-of-the-art commercial AI code generators to investigate the existence of all types of clones, which remains underexplored. Our experimental results show that the total Type-1 and Type-2 clone rates of the state-of-the-art commercial AI code generators can reach up to 7.50%, indicating marked code clone issues. Furthermore, it is observed that AI code generators risk infringing copyrights and propagating buggy and vulnerable code resulting from cloning code and show a certain degree of stability in generating code clones. ![]() ![]() Anji Li, Neng Zhang, Ying Zou, Zhixiang Chen, Jian Wang, and Zibin Zheng (Sun Yat-sen University, China; Central China Normal University, China; Queen's University, Canada; Wuhan University, China) Code snippets are widely used in technical forums to demonstrate solutions to programming problems. They can be leveraged by developers to accelerate problem-solving. However, code snippets often lack concrete types of the APIs used in them, which impedes their understanding and resue. To enhance the description of a code snippet, a number of approaches are proposed to infer the types of APIs. Although existing approaches can achieve good performance, their performance is limited by ignoring other information outside the input code snippet (e.g., the descriptions of similar code snippets) that could potentially improve the performance. In this paper, we propose a novel type inference approach, named CKTyper, by leveraging crowdsourcing knowledge in technical posts. The key idea is to generate a relevant context for a target code snippet from the posts containing similar code snippets and then employ the context to promote the type inference with large language models (e.g., ChatGPT). More specifically, we build a crowdsourcing knowledge base (CKB) by extracting code snippets from a large set of posts and index the CKB using Lucene. An API type dictionary is also built from a set of API libraries. Given a code snippet to be inferred, we first retrieve a list of similar code snippets from the indexed CKB. Then, we generate a crowdsourcing knowledge context (CKC) by extracting and summarizing useful content (e.g., API-related sentences) in the posts that contain the similar code snippets. The CKC is subsequently used to improve the type inference of ChatGPT on the input code snippet. The hallucination of ChatGPT is eliminated by employing the API type dictionary. Evaluation results on two open-source datasets demonstrate the effectiveness and efficiency of CKTyper. CKTyper achieves the optimal precision/recall of 97.80% and 95.54% on both datasets, respectively, significantly outperforming three state-of-the-art baselines and ChatGPT. 
![]() ![]() Yanlin Wang, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Mingzhi Mao, Xilin Liu, Yuchi Ma, and Zibin Zheng (Sun Yat-sen University, China; Huawei Cloud Computing Technologies, China) Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation, while coding style differences between LLMs and human developers remain under-explored. In this paper, we empirically analyze the differences in coding style between the code generated by mainstream LLMs and the code written by human developers, and summarize a coding style inconsistency taxonomy. Specifically, we first summarize the types of coding style inconsistencies by manually analyzing a large number of generation results. We then compare the code generated by LLMs with the code written by human programmers in terms of readability, conciseness, and robustness. The results reveal that LLMs and developers exhibit differences in coding style. Additionally, we study the possible causes of these inconsistencies and provide some solutions to alleviate the problem. ![]() ![]() Weibin Wu, Yuhang Cao, Ning Yi, Rongyi Ou, and Zibin Zheng (Sun Yat-sen University, China) Question answering (QA) is a fundamental task of large language models (LLMs), which requires LLMs to automatically answer human-posed questions in natural language. However, LLMs are known to distort facts and make non-factual statements (i.e., hallucinations) when dealing with QA tasks, which may affect the deployment of LLMs in real-life situations. In this work, we propose DrHall, a framework for detecting and reducing the factual hallucinations of black-box LLMs with metamorphic testing (MT). We believe that hallucinated answers are unstable. Therefore, when LLMs hallucinate, they are more likely to produce different answers if we use metamorphic testing to make LLMs re-execute the same task with different execution paths, which motivates the design of DrHall. The effectiveness of DrHall is evaluated empirically on three datasets, including a self-built dataset of natural language questions: FactHalluQA, as well as two datasets of programming questions: Refactory and LeetCode. The evaluation results confirm that DrHall can consistently outperform the state-of-the-art baselines, obtaining an average F1 score of over 0.856 for hallucination detection. For hallucination correction, DrHall can also outperform the state-of-the-art baselines, with an average hallucination correction rate of over 53%. We hope that our work can enhance the reliability of LLMs and provide new insights for the research of LLM hallucination mitigation. ![]() |
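The instability intuition behind DrHall can be pictured with a small consistency check. The sketch below is not the authors' pipeline; the metamorphic rephrasings, the `ask_llm` callable, and the agreement threshold are placeholder assumptions for illustration only.

```python
# Minimal sketch of consistency-based hallucination detection in the spirit of
# metamorphic testing: the same question is re-asked along different "execution
# paths" (here, simple rephrasings), and disagreement among the normalized
# answers is treated as a hallucination signal. The ask_llm callable is a
# hypothetical stand-in for any black-box LLM API.
from collections import Counter
from typing import Callable, List

def metamorphic_variants(question: str) -> List[str]:
    """Generate alternative phrasings that should preserve the answer."""
    return [
        question,
        f"Answer step by step, then state only the final answer: {question}",
        f"Rephrase the question in your own words first, then answer it: {question}",
    ]

def detect_hallucination(question: str, ask_llm: Callable[[str], str],
                         agreement_threshold: float = 0.67) -> bool:
    """Return True if the answers across variants are too unstable to trust."""
    answers = [ask_llm(v).strip().lower() for v in metamorphic_variants(question)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers) < agreement_threshold

if __name__ == "__main__":
    # Toy stand-in LLM that answers inconsistently, so the check fires.
    import random
    toy_llm = lambda prompt: random.choice(["Paris", "Lyon"])
    print(detect_hallucination("What is the capital of France?", toy_llm))
```

A real detector would rely on stronger metamorphic relations and add a correction step on top of the consistency vote.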
|
Zhong, Sheng |
![]() Xing Su, Hanzhong Liang, Hao Wu, Ben Niu, Fengyuan Xu, and Sheng Zhong (Nanjing University, China; Institute of Information Engineering at Chinese Academy of Sciences, China) Understanding the Ethereum smart contract bytecode is essential for ensuring cryptoeconomics security. However, existing decompilers primarily convert bytecode into pseudocode, which is not easily comprehensible for general users, potentially leading to misunderstanding of contract behavior and increased vulnerability to scams or exploits. In this paper, we propose DiSCo, the first LLMs-based EVM decompilation pipeline, which aims to enable LLMs to understand the opaque bytecode and lift it into smart contract code. DiSCo introduces three core technologies. First, a logic-invariant intermediate representation is proposed to reproject the low-level bytecode into high-level abstracted units. The second technique involves semantic enhancement based on a novel type-aware graph model to infer stripped variables during compilation, enhancing the lifting effect. The third technology is a flexible method incorporating code specifications to construct LLM-comprehensible prompts for source code generation. Extensive experiments illustrate that our generated code guarantees a high compilability rate of 75%, with the differential fuzzing pass rate averaging 50%. Manual validation results further indicate that the generated Solidity contracts significantly outperform baseline methods in tasks such as code comprehension and attack reproduction. ![]() |
|
Zhong, Xiaohan |
![]() Weiqi Lu, Yongqiang Tian, Xiaohan Zhong, Haoyang Ma, Zhenyang Xu, Shing-Chi Cheung, and Chengnian Sun (Hong Kong University of Science and Technology, China; University of Waterloo, Canada) Data visualization (DataViz) libraries play a crucial role in presentation, data analysis, and application development, underscoring the importance of their accuracy in transforming data into visual representations. Incorrect visualizations can adversely impact user experience, distort information conveyance, and influence user perception and decision-making processes. Visual bugs in these libraries can be particularly insidious as they may not cause obvious errors like crashes, but instead graphically mislead users about the underlying data, resulting in wrong decision making. Consequently, a good understanding of the unique characteristics of bugs in DataViz libraries is essential for researchers and developers to detect and fix bugs in DataViz libraries. This study presents the first comprehensive analysis of bugs in DataViz libraries, examining 564 bugs collected from five widely-used libraries. Our study systematically analyzes their symptoms and root causes, and provides a detailed taxonomy. We found that incorrect/inaccurate plots are pervasive in DataViz libraries and incorrect graphic computation is the major root cause, which necessitates further automated testing methods for DataViz libraries. Moreover, we identified eight key steps to trigger such bugs and two test oracles specific to DataViz libraries, which may inspire future research in designing effective automated testing techniques. Furthermore, with the recent advancements in Vision Language Models (VLMs), we explored the feasibility of applying these models to detect incorrect/inaccurate plots. The results show that the effectiveness of VLMs in bug detection varies from 29% to 57%, depending on the prompts, and adding more information in prompts does not necessarily increase the effectiveness. Our findings offer valuable insights into the nature and patterns of bugs in DataViz libraries, providing a foundation for developers and researchers to improve library reliability, and ultimately benefit more accurate and reliable data visualizations across various domains. ![]() |
|
Zhong, Zhiqing |
![]() Haowen Yang, Zhengda Li, Zhiqing Zhong, Xiaoying Tang, and Pinjia He (Chinese University of Hong Kong, Shenzhen, China) With the increasing demand for handling large-scale and complex data, data science (DS) applications often suffer from long execution time and rapid RAM exhaustion, which leads to many serious issues like unbearable delays and crashes in financial transactions. As popular DS libraries are frequently used in these applications, their performance bugs (PBs) are a major contributing factor to these issues, making it crucial to address them for improving overall application performance. However, PBs in popular DS libraries remain largely unexplored. To address this gap, we conducted a study of 202 PBs collected from seven popular DS libraries. Our study examined the impact of PBs and proposed a taxonomy for common root causes. We found over half of the PBs arise from inefficient data processing operations, especially within data structures. We also explored the effort required to locate their root causes and fix these bugs, along with the challenges involved. Notably, around 20% of the PBs could be fixed using simple strategies (e.g., Conditions Optimizing), suggesting the potential for automated repair approaches. Our findings highlight the severity of PBs in core DS libraries and offer insights for developing high-performance libraries and detecting PBs. Furthermore, we derived test rules from our identified root causes, identifying eight PBs, of which four were confirmed, demonstrating the practical utility of our findings. ![]() |
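To make the "simple strategies" concrete, the snippet below sketches one generic condition-optimizing fix. The workload, predicate, and numbers are invented for illustration and are not drawn from the studied libraries.

```python
# Illustrative sketch (not taken from the paper) of the kind of condition-
# optimizing fix the study reports: reorder a compound predicate so a cheap,
# highly selective check short-circuits before an expensive one is evaluated.
import re
import time

EXPENSIVE = re.compile(r"error code \d{4}")
PAYLOAD = "x" * 400
ROWS = [{"status": "inactive", "payload": PAYLOAD} for _ in range(200_000)]

def count_slow(rows):
    # The regex runs on every row, although the status check filters out almost everything.
    return sum(1 for r in rows
               if EXPENSIVE.search(r["payload"]) and r["status"] == "active")

def count_fast(rows):
    # Cheap, selective check first: the regex is skipped for inactive rows.
    return sum(1 for r in rows
               if r["status"] == "active" and EXPENSIVE.search(r["payload"]))

if __name__ == "__main__":
    for fn in (count_fast, count_slow):
        start = time.perf_counter()
        fn(ROWS)
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```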
|
Zhou, Chijin |
![]() Lihua Guo, Yiwei Hou, Chijin Zhou, Quan Zhang, and Yu Jiang (Tsinghua University, China) In recent years, the increase in ransomware attacks has significantly impacted individuals and organizations. Many strategies have been proposed to detect ransomware’s file disruption operation. However, they rely on certain assumptions that gradually fail in the face of evolving ransomware attacks, which use more stealthy encryption methods or benign-imitation-based I/O orchestration. To mitigate this, we propose an approach to detect ransomware attacks through temporal correlation between encryption and I/O behaviors. Its key idea is that there is a strong temporal correlation inherent in ransomware’s encryption and I/O behaviors. To disrupt files, ransomware must first read the file data from the disk into memory, encrypt it, and then write the encrypted data back to the disk. This process creates a pronounced temporal correlation between the computation load of the encryption operations and the size of the files being encrypted. Based on this invariant, we implement a prototype called RansomRadar and evaluate its effectiveness against 411 of the latest ransomware samples from 89 families. Experimental results show that it achieves a detection rate of 100.00% with 2.68% false alarms. Its F1-score is 96.03, higher than that of existing detectors by 31.82 on average. ![]() |
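The temporal-correlation invariant can be pictured in a few lines of code. The following sketch is only illustrative: the per-window monitoring data and the 0.9 threshold are assumptions, not RansomRadar's actual detector.

```python
# Minimal sketch of the temporal-correlation idea: per time window, an estimate
# of encryption computation load is compared against the volume of file I/O. A
# consistently strong correlation across windows is treated as ransomware-like.
from statistics import correlation  # Python 3.10+

def ransomware_like(encryption_load, bytes_written, threshold=0.9):
    """encryption_load / bytes_written: per-window samples from a monitor (assumed)."""
    if len(encryption_load) != len(bytes_written) or len(encryption_load) < 3:
        return False
    return correlation(encryption_load, bytes_written) >= threshold

if __name__ == "__main__":
    # Toy traces: load tracks file sizes closely (suspicious) vs. unrelated (benign).
    suspicious = ransomware_like([12, 85, 40, 90, 33], [1_200, 8_600, 4_100, 9_000, 3_300])
    benign = ransomware_like([12, 85, 40, 90, 33], [5_000, 100, 4_900, 60, 5_200])
    print(suspicious, benign)
```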
|
Zhou, Jiayuan |
![]() Xu Yang, Shaowei Wang, Jiayuan Zhou, and Wenhan Zhu (University of Manitoba, Canada; Huawei, Canada) Deep Learning-based Vulnerability Detection (DLVD) techniques have garnered significant interest due to their ability to automatically learn vulnerability patterns from previously compromised code. Despite the notable accuracy demonstrated by pioneering tools, the broader application of DLVD methods in real-world scenarios is hindered by significant challenges. A primary issue is the “one-for-all” design, where a single model is trained to handle all types of vulnerabilities. This approach fails to capture the patterns of different vulnerability types, resulting in suboptimal performance, particularly for less common vulnerabilities that are often underrepresented in training datasets. To address these challenges, we propose MoEVD, which adopts the Mixture-of-Experts (MoE) framework for vulnerability detection. MoEVD decomposes vulnerability detection into two tasks: CWE type classification and CWE-specific vulnerability detection. By splitting the task of vulnerability detection, MoEVD allows specific experts to handle distinct types of vulnerabilities instead of handling all vulnerabilities within one model. Our results show that MoEVD achieves an F1-score of 0.44, significantly outperforming all studied state-of-the-art (SOTA) baselines by at least 12.8%. MoEVD excels across almost all CWE types, improving recall over the best SOTA baseline by 9% to 77.8%. Notably, MoEVD does not sacrifice performance on long-tailed CWE types; instead, its MoE design enhances performance (F1-score) on these by at least 7.3%, addressing long-tailed issues effectively. ![]() ![]() Xu Yang, Wenhan Zhu, Michael Pacheco, Jiayuan Zhou, Shaowei Wang, Xing Hu, and Kui Liu (University of Manitoba, Canada; Huawei, Canada; Zhejiang University, China; Huawei Software Engineering Application Technology Lab, China) Detecting vulnerability fix commits in open-source software is crucial for maintaining software security. To help OSS identify vulnerability fix commits, several automated approaches have been developed. However, existing approaches like VulFixMiner and CoLeFunDa focus solely on code changes, neglecting essential context from development artifacts. Tools like Vulcurator, which integrates issue reports, fail to leverage semantic associations between different development artifacts (e.g., pull requests and historical vulnerability fixes). Moreover, they miss vulnerability fixes in tangled commits and lack explanations, limiting practical use. Hence, to address these limitations, we propose LLM4VFD, a novel framework that leverages Large Language Models (LLMs) enhanced with Chain-of-Thought reasoning and In-Context Learning to improve the accuracy of vulnerability fix detection. LLM4VFD comprises three components: (1) Code Change Intention, which analyzes commit summaries, purposes, and implications using Chain-of-Thought reasoning; (2) Development Artifact, which incorporates context from related issue reports and pull requests; (3) Historical Vulnerability, which retrieves similar past vulnerability fixes to enrich context. More importantly, on top of the prediction, LLM4VFD also provides a detailed analysis and explanation to help security experts understand the rationale behind the decision. We evaluated LLM4VFD against state-of-the-art techniques, including Pre-trained Language Model-based approaches and vanilla LLMs, using a newly collected dataset, BigVulFixes.
Experimental results demonstrate that LLM4VFD significantly outperforms the best-performing existing approach by 68.1%–145.4%. Furthermore, we conducted a user study with security experts, showing that the analysis generated by LLM4VFD improves the efficiency of vulnerability fix identification. ![]() |
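The routing idea behind MoEVD's Mixture-of-Experts design can be sketched as follows; the keyword-based router and experts are toy placeholders for the learned models described above, not the actual implementation.

```python
# Rough sketch of Mixture-of-Experts routing for vulnerability detection: a
# router first predicts the CWE family of a code snippet, then a CWE-specific
# expert decides whether the snippet is vulnerable. All components here are
# hypothetical stand-ins.
from typing import Callable, Dict

Expert = Callable[[str], float]  # returns probability that the code is vulnerable

def route_and_detect(code: str,
                     router: Callable[[str], str],
                     experts: Dict[str, Expert],
                     fallback: Expert,
                     threshold: float = 0.5) -> bool:
    cwe = router(code)                      # e.g. "CWE-787"
    expert = experts.get(cwe, fallback)     # unseen CWE types fall back gracefully
    return expert(code) >= threshold

if __name__ == "__main__":
    # Keyword-based toys standing in for the learned router and experts.
    toy_router = lambda code: "CWE-787" if "memcpy" in code else "CWE-89"
    toy_experts = {
        "CWE-787": lambda code: 0.9 if "memcpy" in code and "sizeof" not in code else 0.1,
        "CWE-89": lambda code: 0.9 if "+ user_input +" in code else 0.1,
    }
    snippet = 'strcpy(buf, src); memcpy(dst, src, n);'
    print(route_and_detect(snippet, toy_router, toy_experts, fallback=lambda c: 0.0))
```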
|
Zhou, Minghui |
![]() Farbod Daneshyan, Runzhi He, Jianyu Wu, and Minghui Zhou (Peking University, China) The release note is a crucial document outlining changes in new software versions. It plays a key role in helping stakeholders recognise important changes and understand the implications behind them. Despite this fact, many developers view the process of writing software release notes as a tedious and dreadful task. Consequently, numerous tools (e.g., DeepRelease and Conventional Changelog) have been developed by researchers and practitioners to automate the generation of software release notes. However, these tools fail to consider project domain and target audience for personalisation, limiting their relevance and conciseness. Additionally, they suffer from limited applicability, often necessitating significant workflow adjustments and adoption efforts, hindering practical use and stressing developers. Despite recent advancements in natural language processing and the proven capabilities of large language models (LLMs) in various code and text-related tasks, there are no existing studies investigating the integration and utilisation of LLMs in automated release note generation. Therefore, we propose SmartNote, a novel and widely applicable release note generation approach that produces high-quality, contextually personalised release notes by leveraging LLM capabilities to aggregate, describe, and summarise changes based on code, commit, and pull request details. It categorises and scores (for significance) commits to generate structured and concise release notes of prioritised changes. We conduct human and automatic evaluations that reveal SmartNote outperforms or achieves comparable performance to DeepRelease (state-of-the-art), Conventional Changelog (off-the-shelf), and the projects' original release note across four quality metrics: completeness, clarity, conciseness, and organisation. In both evaluations, SmartNote ranked first for completeness and organisation, while clarity ranked first in the human evaluation. Furthermore, our controlled study reveals the significance of contextual awareness, while our applicability analysis confirms SmartNote's effectiveness across diverse projects. ![]() ![]() |
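To illustrate the categorise-and-score step in general terms, the sketch below groups commits by a conventional-commit prefix and orders them by a toy significance score; the prefixes, scoring formula, and section names are assumptions of this sketch, not SmartNote's algorithm.

```python
# Simplified sketch of categorising commits and ordering them by a significance
# score before rendering a structured release note.
from collections import defaultdict

CATEGORY_PREFIXES = {"feat": "Features", "fix": "Bug Fixes", "perf": "Performance",
                     "docs": "Documentation"}

def categorize(message: str) -> str:
    prefix = message.split(":", 1)[0].strip().lower()
    return CATEGORY_PREFIXES.get(prefix, "Other Changes")

def significance(commit: dict) -> float:
    # Toy score: larger, widely scoped changes float to the top of each section.
    return commit["files_changed"] + 0.01 * commit["lines_changed"]

def render_release_note(commits: list[dict]) -> str:
    sections = defaultdict(list)
    for c in commits:
        sections[categorize(c["message"])].append(c)
    lines = []
    for title, items in sections.items():
        lines.append(f"## {title}")
        for c in sorted(items, key=significance, reverse=True):
            lines.append(f"- {c['message']}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_release_note([
        {"message": "feat: add OAuth login", "files_changed": 9, "lines_changed": 420},
        {"message": "fix: handle empty config file", "files_changed": 1, "lines_changed": 12},
    ]))
```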
|
Zhou, Ting |
![]() Ting Zhou, Yanjie Zhao, Xinyi Hou, Xiaoyu Sun, Kai Chen, and Haoyu Wang (Huazhong University of Science and Technology, China; Australian National University, Australia) Declarative UI frameworks have gained widespread adoption in mobile app development, offering benefits such as improved code readability and easier maintenance. Despite these advantages, the process of translating UI designs into functional code remains challenging and time-consuming. Recent advancements in multimodal large language models (MLLMs) have shown promise in directly generating mobile app code from user interface (UI) designs. However, the direct application of MLLMs to this task is limited by challenges in accurately recognizing UI components and comprehensively capturing interaction logic. To address these challenges, we propose DeclarUI, an automated approach that synergizes computer vision (CV), MLLMs, and iterative compiler-driven optimization to generate and refine declarative UI code from designs. DeclarUI enhances visual fidelity, functional completeness, and code quality through precise component segmentation, Page Transition Graphs (PTGs) for modeling complex inter-page relationships, and iterative optimization. In our evaluation, DeclarUI outperforms baselines on React Native, a widely adopted declarative UI framework, achieving a 96.8% PTG coverage rate and a 98% compilation success rate. Notably, DeclarUI demonstrates significant improvements over state-of-the-art MLLMs, with a 123% increase in PTG coverage rate, up to 55% enhancement in visual similarity scores, and a 29% boost in compilation success rate. We further demonstrate DeclarUI’s generalizability through successful applications to Flutter and ArkUI frameworks. User studies with professional developers confirm that DeclarUI’s generated code meets industrial-grade standards in code availability, modification time, readability, and maintainability. By streamlining app development, improving efficiency, and fostering designer-developer collaboration, DeclarUI offers a practical solution to the persistent challenges in mobile UI development. ![]() |
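The PTG coverage rate can be read as a simple edge-coverage ratio. The sketch below assumes the transition edges have already been extracted from the design and from the generated code, which is the part the approach automates; the page names are hypothetical.

```python
# Hypothetical sketch of a Page Transition Graph (PTG) coverage rate: the PTG
# from the design is a set of (source page, target page) edges, and coverage is
# the share of those edges that the generated app actually wires up.
def ptg_coverage(design_edges: set[tuple[str, str]],
                 implemented_edges: set[tuple[str, str]]) -> float:
    if not design_edges:
        return 1.0
    return len(design_edges & implemented_edges) / len(design_edges)

if __name__ == "__main__":
    design = {("Login", "Home"), ("Home", "Settings"), ("Home", "Profile"), ("Profile", "Edit")}
    implemented = {("Login", "Home"), ("Home", "Settings"), ("Home", "Profile")}
    print(f"PTG coverage: {ptg_coverage(design, implemented):.0%}")  # 75%
```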
|
Zhou, Yajin |
![]() Kunsong Zhao, Zihao Li, Weimin Chen, Xiapu Luo, Ting Chen, Guozhu Meng, and Yajin Zhou (Hong Kong Polytechnic University, China; University of Electronic Science and Technology of China, China; Institute of Information Engineering at Chinese Academy of Sciences, China; Zhejiang University, China) WebAssembly has become the preferred smart contract format for various blockchain platforms due to its high portability and near-native execution speed. To effectively understand WebAssembly contracts, it is crucial to recover high-level type signatures because of the limited type information that WebAssembly provides. However, existing studies on type inference for smart contracts primarily center around Ethereum Virtual Machine bytecode, which is not applicable to WebAssembly owing to their differing targets and runtime semantics. This paper introduces WasmHint, a novel solution that leverages deep learning inference to automatically recover high-level parameter and return types from WebAssembly contracts. More specifically, WasmHint constructs a wCFG representation to clarify dependencies within WebAssembly code and simulates its execution to capture type-related operational information. By learning comprehensive code semantics, it infers parameter and return types, with a semantic corrector designed to enhance information coordination. We conduct experiments on a newly constructed dataset containing 77,208 WebAssembly contract functions. The results demonstrate that WasmHint achieves inference accuracies of 80.0% for parameter types and 95.8% for return types, with average improvements of 86.6% and 34.0% over the baseline methods, respectively. ![]() |
|
Zhou, Zhehua |
![]() Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, and Lei Ma (University of Alberta, Canada; University of Tokyo, Japan) ![]() |
|
Zhou, Zhuotong |
![]() Susheng Wu, Ruisi Wang, Yiheng Cao, Bihuan Chen, Zhuotong Zhou, Yiheng Huang, JunPeng Zhao, and Xin Peng (Fudan University, China) Branching repositories facilitates efficient software development but can also inadvertently propagate vulnerabilities. When an original branch is patched, other unfixed branches remain vulnerable unless the patch is successfully ported. However, due to inherent discrepancies between branches, many patches cannot be directly applied and require manual intervention, which is time-consuming and leads to delays in patch porting, increasing vulnerability risks. Existing automated patch porting approaches are prone to errors, as they often overlook essential semantic and syntactic context of the vulnerability and fail to detect or refine faulty patches. We propose Mystique, a novel LLM-based approach to address these limitations. Mystique first slices the semantically related statements linked to the vulnerability while ensuring syntactic correctness, allowing it to extract the signatures for both the original patched function and the target vulnerable function. Mystique then utilizes a fine-tuned LLM to generate a fixed function, which is further iteratively checked and refined to ensure successful porting. Our evaluation shows that Mystique achieved a success rate of 0.954 at the function level and 0.924 at the CVE level, outperforming state-of-the-art approaches by at least 13.2% at the function level and 12.3% at the CVE level. Our evaluation also demonstrates Mystique’s superior generality across various projects, bugs, and programming languages. Mystique successfully ported patches for 34 real-world vulnerable branches. ![]() ![]() |
|
Zhu, Hengcheng |
![]() Hengcheng Zhu, Valerio Terragni, Lili Wei, Shing-Chi Cheung, Jiarong Wu, and Yepang Liu (Hong Kong University of Science and Technology, China; University of Auckland, New Zealand; McGill University, Canada; Southern University of Science and Technology, China) Mock assertions provide developers with a powerful means to validate program behaviors that are unobservable to test assertions. Despite their significance, they are rarely considered by automated test generation techniques. Effective generation of mock assertions requires understanding how they are used in practice. Although previous studies highlighted the importance of mock assertions, none provide insight into their usages. To bridge this gap, we conducted the first empirical study on mock assertions, examining their adoption, the characteristics of the verified method invocations, and their effectiveness in fault detection. Our analysis of 4,652 test cases from 11 popular Java projects reveals that mock assertions are mostly applied to validating specific kinds of method calls, such as those interacting with external resources and those reflecting whether a certain code path was traversed in systems under test. Additionally, we find that mock assertions complement traditional test assertions by ensuring the desired side effects have been produced, validating control flow logic, and checking internal computation results. Our findings contribute to a better understanding of mock assertion usages and provide a foundation for future related research such as automated test generation that supports mock assertions. ![]() ![]() ![]() Xiao Chen, Hengcheng Zhu, Jialun Cao, Ming Wen, and Shing-Chi Cheung (Hong Kong University of Science and Technology, China; Huazhong University of Science and Technology, China) Debugging can be much facilitated if one can identify the evolution commit that introduced the bug leading to a detected failure (a.k.a. bug-inducing commit, BIC). Although one may, in theory, locate BICs by executing the detected failing test on various historical commit versions, it is impractical when the test cannot be executed on some of those versions. On the other hand, existing static techniques often assume the availability of additional information such as patches and bug reports, or the applicability of predefined heuristics like commit chronology. However, these approaches are ineffective when such assumptions do not hold, which is often the case in practice. To address these limitations, we propose SEMBIC to identify the BIC of a bug by statically tracking the semantic changes in the execution path prescribed by the failing test across successive historical commit versions. Our insight is that the greater the semantic changes a commit introduces concerning the failing execution path of a target bug, the more likely it is to be the BIC. To distill semantic changes relevant to the failure, we focus on three fine-grained semantic properties. We evaluate the performance of SEMBIC on a benchmark containing 199 real-world bugs from 12 open-source projects. We found that SEMBIC can identify BICs with high accuracy – it ranks the BIC as top 1 for 88 out of 199 bugs, and achieves an MRR of 0.520, outperforming the state-of-the-art technique by 29.4% and 13.6%, respectively. ![]() ![]() |
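The study targets Java/Mockito tests; purely to illustrate what a mock assertion is, the sketch below expresses the same idea with Python's unittest.mock, verifying an interaction with an external resource that a return-value assertion could not observe. The gateway and function names are invented for this example.

```python
# Illustration of a mock assertion: the test validates an interaction with an
# external resource (an email gateway) rather than a return value.
import unittest
from unittest.mock import Mock

def notify_user(gateway, user_email: str, order_id: int) -> None:
    """Side-effecting behaviour under test: sends a confirmation email."""
    gateway.send(to=user_email, subject=f"Order {order_id} confirmed")

class NotifyUserTest(unittest.TestCase):
    def test_sends_exactly_one_confirmation(self):
        gateway = Mock()
        notify_user(gateway, "alice@example.com", 42)
        # Mock assertion: the otherwise unobservable side effect happened,
        # exactly once, with the expected arguments.
        gateway.send.assert_called_once_with(to="alice@example.com",
                                             subject="Order 42 confirmed")

if __name__ == "__main__":
    unittest.main()
```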
|
Zhu, Kangchen |
![]() Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Weiguo Chen, Zixuan Dong, Mingyue Leng, and Xiaoguang Mao (National University of Defense Technology, China) The current landscape of binary code summarization predominantly focuses on generating a single summary, which limits its utility and understanding for reverse engineers. Existing approaches often fail to meet the diverse needs of users, such as providing detailed insights into usage patterns, implementation nuances, and design rationale, as observed in the field of source code summarization. This highlights the need for multi-intent binary code summarization to enhance the effectiveness of reverse engineering processes. To address this gap, we propose MiSum, a novel method that leverages multi-modality heterogeneous code graph alignment and learning to integrate both assembly code and pseudo-code. MiSum introduces a unified multi-modality heterogeneous code graph (MM-HCG) that aligns assembly code graphs with pseudo-code graphs, capturing both low-level execution details and high-level structural information. We further propose multi-modality heterogeneous graph learning with heterogeneous mutual attention and message passing, which highlights important code blocks and discovers inter-dependencies across different code forms. Additionally, an intent-aware summary generator with an intent-aware attention mechanism is introduced to produce customized summaries tailored to multiple intents. Extensive experiments, including evaluations across various architectures and optimization levels, demonstrate that MiSum outperforms state-of-the-art baselines in BLEU, METEOR, and ROUGE-L metrics. Human evaluations validate its capability to effectively support reverse engineers in understanding diverse binary code intents, marking a significant advancement in binary code analysis. ![]() ![]() ![]() |
|
Zhu, Liming |
![]() Yanqi Su, Zhenchang Xing, Chong Wang, Chunyang Chen, Sherry (Xiwei) Xu, Qinghua Lu, and Liming Zhu (Australian National University, Australia; CSIRO's Data61, Australia; Nanyang Technological University, Singapore; TU Munich, Germany; UNSW, Australia) Exploratory testing (ET) harnesses a tester's knowledge, creativity, and experience to create varying tests that uncover unexpected bugs from the end-user's perspective. Although ET has proven effective in system-level testing of interactive systems, the need for manual execution has hindered large-scale adoption. In this work, we explore the feasibility, challenges, and road ahead of automated scenario-based ET (a.k.a. soap opera testing). We conduct a formative study, identifying key insights for effective manual soap opera testing and challenges in automating the process. We then develop a multi-agent system leveraging LLMs and a Scenario Knowledge Graph (SKG) to automate soap opera testing. The system consists of three multi-modal agents, Planner, Player, and Detector, that collaborate to execute tests and identify potential bugs. Experimental results demonstrate the potential of automated soap opera testing, but there remains a significant gap compared to manual execution, especially regarding under-explored scenario boundaries and incorrectly identified bugs. Based on these observations, we envision the road ahead for automated soap opera testing, focusing on three key aspects: the synergy of neural and symbolic approaches, human-AI co-learning, and the integration of soap opera testing with broader software engineering practices. These insights aim to guide and inspire future research. ![]() |
|
Zhu, Mingxuan |
![]() Mingxuan Zhu, Zeyu Sun, and Dan Hao (Peking University, China; Institute of Software at Chinese Academy of Sciences, China) Compilers are crucial software tools that usually convert programs in high-level languages into machine code. A compiler provides hundreds of optimizations to improve the performance of the compiled code, which are controlled by enabled or disabled optimization flags. However, the vast number of combinations of these flags makes it extremely challenging to select the desired settings for compiler optimization flags (i.e., an optimization sequence) for a given target program. In the literature, many auto-tuning techniques have been proposed to select a desired optimization sequence via different strategies across the entire optimization space. However, due to the huge optimization space, these techniques commonly suffer from the widely recognized efficiency problem. To address this problem, in this paper, we propose a preference-driven selection approach PDCAT, which reduces the search space of optimization sequences through three components. In particular, PDCAT first identifies combined optimizations based on compiler documentation to exclude optimization sequences violating the combined constraints, and then categorizes the optimizations into a common optimization set (whose optimization flags are fixed) and an exploration set containing the remaining optimizations. Finally, within the search process, PDCAT assigns distinct enable probabilities to the explored optimization flags and then selects a desired optimization sequence. The former two components reduce the search space by removing invalid optimization sequences and fixing some optimization flags, whereas the latter performs a biased search in the search space. To evaluate the performance of the proposed approach PDCAT, we conducted an extensive experimental study on the latest version of the GCC compiler with two widely used benchmarks, cBench and PolyBench. The results show that PDCAT significantly outperforms the four compared techniques, including the state-of-the-art technique SRTuner. Moreover, each component of PDCAT not only contributes to its performance, but also improves the acceleration performance of the compared techniques. ![]() |
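The search-space reduction can be pictured as constrained, probability-biased sampling of flags. In the sketch below, the common set, per-flag probabilities, and the conflict rule are invented examples rather than PDCAT's tuned values.

```python
# Loose sketch of preference-driven sampling of an optimization sequence: flags
# in a "common" set are fixed on, the remaining flags are sampled with
# individual enable probabilities, and any sequence violating a documented
# combination constraint is rejected.
import random

COMMON = ["-O2", "-fomit-frame-pointer"]                       # always enabled (assumed)
EXPLORE = {"-funroll-loops": 0.7, "-ftree-vectorize": 0.6,     # flag -> enable probability
           "-fno-inline": 0.2, "-fipa-pta": 0.4}
# Example constraint: assume these two flags should not be enabled together.
CONFLICTS = [{"-funroll-loops", "-fno-inline"}]

def violates_constraints(flags: set[str]) -> bool:
    return any(conflict <= flags for conflict in CONFLICTS)

def sample_sequence(rng: random.Random) -> list[str]:
    while True:
        chosen = {f for f, p in EXPLORE.items() if rng.random() < p}
        if not violates_constraints(chosen):
            return COMMON + sorted(chosen)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(" ".join(sample_sequence(rng)))
```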
|
Zhu, Wenhan |
![]() Xu Yang, Shaowei Wang, Jiayuan Zhou, and Wenhan Zhu (University of Manitoba, Canada; Huawei, Canada) Deep Learning-based Vulnerability Detection (DLVD) techniques have garnered significant interest due to their ability to automatically learn vulnerability patterns from previously compromised code. Despite the notable accuracy demonstrated by pioneering tools, the broader application of DLVD methods in real-world scenarios is hindered by significant challenges. A primary issue is the “one-for-all” design, where a single model is trained to handle all types of vulnerabilities. This approach fails to capture the patterns of different vulnerability types, resulting in suboptimal performance, particularly for less common vulnerabilities that are often underrepresented in training datasets. To address these challenges, we propose MoEVD, which adopts the Mixture-of-Experts (MoE) framework for vulnerability detection. MoEVD decomposes vulnerability detection into two tasks: CWE type classification and CWE-specific vulnerability detection. By splitting the task of vulnerability detection, MoEVD allows specific experts to handle distinct types of vulnerabilities instead of handling all vulnerabilities within one model. Our results show that MoEVD achieves an F1-score of 0.44, significantly outperforming all studied state-of-the-art (SOTA) baselines by at least 12.8%. MoEVD excels across almost all CWE types, improving recall over the best SOTA baseline by 9% to 77.8%. Notably, MoEVD does not sacrifice performance on long-tailed CWE types; instead, its MoE design enhances performance (F1-score) on these by at least 7.3%, addressing long-tailed issues effectively. ![]() ![]() Xu Yang, Wenhan Zhu, Michael Pacheco, Jiayuan Zhou, Shaowei Wang, Xing Hu, and Kui Liu (University of Manitoba, Canada; Huawei, Canada; Zhejiang University, China; Huawei Software Engineering Application Technology Lab, China) Detecting vulnerability fix commits in open-source software is crucial for maintaining software security. To help OSS identify vulnerability fix commits, several automated approaches have been developed. However, existing approaches like VulFixMiner and CoLeFunDa focus solely on code changes, neglecting essential context from development artifacts. Tools like Vulcurator, which integrates issue reports, fail to leverage semantic associations between different development artifacts (e.g., pull requests and historical vulnerability fixes). Moreover, they miss vulnerability fixes in tangled commits and lack explanations, limiting practical use. Hence, to address these limitations, we propose LLM4VFD, a novel framework that leverages Large Language Models (LLMs) enhanced with Chain-of-Thought reasoning and In-Context Learning to improve the accuracy of vulnerability fix detection. LLM4VFD comprises three components: (1) Code Change Intention, which analyzes commit summaries, purposes, and implications using Chain-of-Thought reasoning; (2) Development Artifact, which incorporates context from related issue reports and pull requests; (3) Historical Vulnerability, which retrieves similar past vulnerability fixes to enrich context. More importantly, on top of the prediction, LLM4VFD also provides a detailed analysis and explanation to help security experts understand the rationale behind the decision. We evaluated LLM4VFD against state-of-the-art techniques, including Pre-trained Language Model-based approaches and vanilla LLMs, using a newly collected dataset, BigVulFixes.
Experimental results demonstrate that LLM4VFD significantly outperforms the best-performing existing approach by 68.1%–145.4%. Furthermore, we conducted a user study with security experts, showing that the analysis generated by LLM4VFD improves the efficiency of vulnerability fix identification. ![]() |
|
Zhu, Yinghao |
![]() Xin Tan, Xiao Long, Yinghao Zhu, Lin Shi, Xiaoli Lian, and Li Zhang (Beihang University, China) Onboarding newcomers is vital for the sustainability of open-source software (OSS) projects. To lower barriers and increase engagement, OSS projects have dedicated experts who provide guidance for newcomers. However, timely responses are often hindered by experts' busy schedules. The recent rapid advancements of AI in software engineering have brought opportunities to leverage AI as a substitute for expert mentoring. However, the potential role of AI as a comprehensive mentor throughout the entire onboarding process remains unexplored. To identify design strategies of this "AI mentor", we applied Design Fiction as a participatory method with 19 OSS newcomers. We investigated their current onboarding experience and elicited 32 design strategies for a future AI mentor. Participants envisioned the AI mentor being integrated into OSS platforms like GitHub, where it could offer assistance to newcomers, such as "recommending projects based on personalized requirements" and "assessing and categorizing project issues by difficulty". We also collected participants' perceptions of a prototype, named "OSSerCopilot", that implemented the envisioned strategies. They found the interface useful and user-friendly, showing a willingness to use it in the future, which suggests the design strategies are effective. Finally, in order to identify the gaps between our design strategies and current research, we conducted a comprehensive literature review, evaluating the extent of existing research support for this concept. We find that research is relatively scarce in certain areas where newcomers highly anticipate AI mentor assistance, such as "discovering an interested project". Our study has the potential to revolutionize the current newcomer-expert mentorship and provides valuable insights for researchers and tool designers aiming to develop and enhance AI mentor systems. ![]() |
|
Zou, Deqing |
![]() Junyao Ye, Zhen Li, Xi Tang, Deqing Zou, Shouhuai Xu, Weizhong Qiang, and Hai Jin (Huazhong University of Science and Technology, China; University of Colorado at Colorado Springs, USA) ![]() |
|
Zou, Hanyu |
![]() Xiaoli Lian, Shuaisong Wang, Hanyu Zou, Fang Liu, Jiajun Wu, and Li Zhang (Beihang University, China) In the current software-driven era, ensuring privacy and security is critical. Despite this, the specification of security requirements for software is still largely a manual and labor-intensive process. Engineers are tasked with analyzing potential security threats based on functional requirements (FRs), a procedure prone to omissions and errors due to the expertise gap between cybersecurity experts and software engineers. To bridge this gap, we introduce F2SRD (Function-to-Security Requirements Derivation), an automated approach that proactively derives security requirements (SRs) from functional specifications under the guidance of relevant security verification requirements (VRs) drawn from the well-recognized OWASP Application Security Verification Standard (ASVS). F2SRD operates in two main phases: Initially, we develop a VR retriever trained on a custom database of FR-VR pairs, enabling it to adeptly select applicable VRs from ASVS. This targeted retrieval informs the precise and actionable formulation of SRs. Subsequently, these VRs are used to construct structured prompts that direct GPT-4 in generating SRs. Our comparative analysis against two established models demonstrates F2SRD's enhanced performance in producing SRs that excel in inspiration, diversity, and specificity—essential attributes for effective security requirement generation. By leveraging security verification standards, we believe that the generated SRs are not only more focused but also resonate more strongly with the needs of engineers. ![]() |
|
Zou, Ying |
![]() Anji Li, Neng Zhang, Ying Zou, Zhixiang Chen, Jian Wang, and Zibin Zheng (Sun Yat-sen University, China; Central China Normal University, China; Queen's University, Canada; Wuhan University, China) Code snippets are widely used in technical forums to demonstrate solutions to programming problems. They can be leveraged by developers to accelerate problem-solving. However, code snippets often lack concrete types of the APIs used in them, which impedes their understanding and reuse. To enhance the description of a code snippet, a number of approaches have been proposed to infer the types of APIs. Although existing approaches can achieve good performance, their performance is limited by ignoring other information outside the input code snippet (e.g., the descriptions of similar code snippets) that could potentially improve the performance. In this paper, we propose a novel type inference approach, named CKTyper, by leveraging crowdsourcing knowledge in technical posts. The key idea is to generate a relevant context for a target code snippet from the posts containing similar code snippets and then employ the context to promote the type inference with large language models (e.g., ChatGPT). More specifically, we build a crowdsourcing knowledge base (CKB) by extracting code snippets from a large set of posts and index the CKB using Lucene. An API type dictionary is also built from a set of API libraries. Given a code snippet to be inferred, we first retrieve a list of similar code snippets from the indexed CKB. Then, we generate a crowdsourcing knowledge context (CKC) by extracting and summarizing useful content (e.g., API-related sentences) in the posts that contain the similar code snippets. The CKC is subsequently used to improve the type inference of ChatGPT on the input code snippet. The hallucination of ChatGPT is eliminated by employing the API type dictionary. Evaluation results on two open-source datasets demonstrate the effectiveness and efficiency of CKTyper. CKTyper achieves the optimal precision/recall of 97.80% and 95.54% on both datasets, respectively, significantly outperforming three state-of-the-art baselines and ChatGPT. ![]() ![]() Shayan Noei, Heng Li, and Ying Zou (Queen's University, Canada; Polytechnique Montréal, Canada) Refactoring is a technical approach to increase the internal quality of software without altering its external functionalities. Developers often invest significant effort in refactoring. With the increased adoption of continuous integration and deployment (CI/CD), refactoring activities may vary within and across different releases and be influenced by various release goals. For example, developers may consistently allocate refactoring activities throughout a release, or prioritize new features early on in a release and only pick up refactoring late in a release. Different approaches to allocating refactoring tasks may have different implications for code quality. However, there is a lack of existing research on how practitioners allocate their refactoring activities within a release and their impact on code quality. Therefore, we first empirically study the frequent release-wise refactoring patterns in 207 open-source Java projects and their characteristics. Then, we analyze how these patterns and their transitions affect code quality. We identify four major release-wise refactoring patterns: early active, late active, steady active, and steady inactive.
We find that adopting the late active pattern—characterized by gradually increasing refactoring activities as the release approaches—leads to the best code quality. We observe that as projects mature, refactoring becomes more active, reflected in the increasing use of the steady active release-wise refactoring pattern and the decreasing utilization of the steady inactive release-wise refactoring pattern. While the steady active pattern shows improvement in quality-related code metrics (e.g., cohesion), it can also lead to more architectural problems. Additionally, we observe that developers tend to adhere to a single refactoring pattern rather than switching between different patterns. The late active pattern, in particular, can be a safe release-wise refactoring pattern that is used repeatedly. Our results can help practitioners understand existing release-wise refactoring patterns and their effects on code quality, enabling them to utilize the most effective pattern to enhance release quality. ![]() |
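For intuition, the four release-wise patterns can be approximated from a release's per-interval refactoring counts with a simple heuristic; the thresholds below are assumptions of this sketch, not the paper's operationalisation.

```python
# Heuristic sketch of labelling a release's refactoring activity as one of the
# four patterns, given per-interval refactoring counts for that release.
def classify_release(counts: list[int],
                     inactive_total: int = 5,
                     skew_ratio: float = 2.0) -> str:
    total = sum(counts)
    if total <= inactive_total:
        return "steady inactive"
    half = len(counts) // 2
    early, late = sum(counts[:half]), sum(counts[half:])
    if early >= skew_ratio * max(late, 1):
        return "early active"
    if late >= skew_ratio * max(early, 1):
        return "late active"
    return "steady active"

if __name__ == "__main__":
    print(classify_release([0, 1, 0, 1, 0, 0]))    # steady inactive
    print(classify_release([1, 2, 3, 8, 15, 22]))  # late active
    print(classify_release([6, 7, 5, 6, 8, 7]))    # steady active
```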
693 authors