FSE 2024 CoLos
32nd ACM International Conference on the Foundations of Software Engineering (FSE 2024)

1st ACM International Conference on AI-Powered Software (AIware 2024), July 15–16, 2024, Porto de Galinhas, Brazil

AIware 2024 – Proceedings


1st ACM International Conference on AI-Powered Software (AIware 2024)

Frontmatter

Title Page


Welcome from the Chairs
Welcome to the 1st ACM International Conference on AI-Powered Software (AIware), held on 15 and 16 July 2024 in Porto de Galinhas, Brazil, co-located with the ACM International Conference on the Foundations of Software Engineering (FSE 2024). AIware aims to be an annual conference that brings the software engineering community together in anticipation of the changes driven by Foundation Models (FMs) and examines them from the perspective of AI-powered software and its evolution. AIware 2024 prioritizes fostering discussion of the latest developments in the interdisciplinary field of AIware rather than solely the presentation of papers. The emphasis is on engaging participants from diverse backgrounds in conversations that identify emerging research challenges and establish a new research agenda for the community in the Foundation Model era. For paper presentations and discussions, the two-day conference comprises five sessions themed around AIware Vision, SE for AIware, Human-AI Conversation, Security & Safety, and AIware for Software Lifecycle Activities. The conference program also includes two keynotes and five industry talks. The final session is dedicated to presenting the accepted papers of the AIware challenge track.

AIware 2024 Organization
Committee Listings

Main Track

Identifying the Factors That Influence Trust in AI Code Completion
Adam Brown, Sarah D'Angelo, Ambar Murillo, Ciera Jaspan, and Collin Green
(Google, USA; Google, New Zealand; Google, Germany)
AI-powered software development tooling is changing the way that developers interact with tools and write code. However, the ability for AI to truly transform software development may depend on developers' levels of trust in these tools, which has consequences for tool adoption and repeated usage. In this work, we take a mixed-methods approach to measure the factors that influence developers' trust in AI-powered code completion. We found that characteristics about the AI suggestion itself (e.g., the quality of the suggestion), the developer interacting with the suggestion (e.g., their expertise in a language), and the context of the development work (e.g., was the suggestion in a test file) all influenced acceptance rates of AI-powered code suggestions. Based on these findings we propose a number of recommendations for the design of AI-powered development tools to improve trust.

Publisher's Version
Unveiling the Potential of a Conversational Agent in Developer Support: Insights from Mozilla’s PDF.js Project
João Correia, Morgan C. Nicholson, Daniel Coutinho, Caio Barbosa, Marco Castelluccio, Marco Gerosa, Alessandro Garcia, and Igor Steinmacher
(PUC-Rio, Brazil; University of São Paulo, Brazil; Mozilla, United Kingdom; Northern Arizona University, USA)
Large language models and other foundation models (FMs) boost productivity by automating code generation, supporting bug fixes, and generating documentation. We propose that FMs can further support Open Source Software (OSS) projects by assisting developers and guiding the community. Currently, core developers and maintainers answer queries about processes, architecture, and source code, but their time is limited, often leading to delays. To address this, we introduce DevMentorAI, a tool that enhances developer-project interactions by leveraging source code and technical documentation. DevMentorAI uses the Retrieval Augmented Generation (RAG) architecture to identify and retrieve relevant content for queries. We evaluated DevMentorAI with a case study on the PDF.js project, using real questions from a development chat room and comparing the answers provided by DevMentorAI to those from humans. A Mozilla expert rated the answers, finding DevMentorAI's responses more satisfactory in 8 of 14 cases, equally satisfactory in 3 of 14, and less satisfactory in 3 of 14. These results demonstrate the potential of using foundation models and the RAG approach to support developers and reduce the burden on core developers.
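To make the RAG architecture mentioned above concrete, the sketch below retrieves the documentation chunks most relevant to a developer question and folds them into an LLM prompt. This is a minimal illustration, not the DevMentorAI implementation: the chunks, the answer_query() helper, and the llm() stub are all assumptions.

```python
# Minimal retrieval-augmented generation (RAG) sketch; not the DevMentorAI implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical project knowledge base: documentation and source snippets.
chunks = [
    "PDF.js renders pages through the PDFPageProxy.render() API.",
    "Build the project with `gulp generic` before running the examples.",
    "Worker threads parse the PDF stream off the main thread.",
]

vectorizer = TfidfVectorizer()
chunk_matrix = vectorizer.fit_transform(chunks)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (TF-IDF stands in for a neural embedder)."""
    scores = cosine_similarity(vectorizer.transform([query]), chunk_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top]

def llm(prompt: str) -> str:
    """Placeholder for a foundation-model call; returns a canned answer in this sketch."""
    return "Based on the retrieved context: ..."

def answer_query(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer the contributor's question using only this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)

print(answer_query("How do I build the project locally?"))
```

In practice the TF-IDF retriever would be replaced by a neural embedding index and the stub by a real foundation-model call.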

Publisher's Version
Function+Data Flow: A Framework to Specify Machine Learning Pipelines for Digital Twinning
Eduardo de Conto, Blaise Genest, and Arvind Easwaran
(Nanyang Technological University, Singapore; CNRS@CREATE, Singapore; IPAL, Singapore)
The development of digital twins (DTs) for physical systems increasingly leverages artificial intelligence (AI), particularly for combining data from different sources or for creating computationally efficient, reduced-dimension models. Indeed, even in very different application domains, twinning employs common techniques such as model order reduction and modelization with hybrid data (that is, data sourced from both physics-based models and sensors). Despite this apparent generality, current development practices are ad-hoc, making the design of AI pipelines for digital twinning complex and time-consuming. Here we propose Function+Data Flow (FDF), a domain-specific language (DSL) to describe AI pipelines within DTs. FDF aims to facilitate the design and validation of digital twins. Specifically, FDF treats functions as first-class citizens, enabling effective manipulation of models learned with AI. We illustrate the benefits of FDF on two concrete use cases from different domains: predicting the plastic strain of a structure and modeling the electromagnetic behavior of a bearing.

Publisher's Version
A Transformer-Based Approach for Smart Invocation of Automatic Code Completion
Aral de Moor, Arie van Deursen, and Maliheh Izadi
(Delft University of Technology, Netherlands)
Transformer-based language models are highly effective for code completion, with much research dedicated to enhancing the content of these completions. Despite their effectiveness, these models come with high operational costs and can be intrusive, especially when they suggest too often and interrupt developers who are concentrating on their work. Current research largely overlooks how these models interact with developers in practice and neglects to address when a developer should receive completion suggestions. To tackle this issue, we developed a machine learning model that can accurately predict when to invoke a code completion tool given the code context and available telemetry data. To do so, we collect a dataset of 200k developer interactions with our cross-IDE code completion plugin and train several invocation filtering models. Our results indicate that our small-scale transformer model significantly outperforms the baseline while maintaining low enough latency. We further explore the search space for integrating additional telemetry data into a pre-trained transformer directly and obtain promising results. To further demonstrate our approach’s practical potential, we deployed the model in an online environment with 34 developers and provided real-world insights based on 74k actual invocations.
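As a rough illustration of the invocation-filtering idea, the sketch below trains a classifier on a handful of made-up telemetry features and uses it to gate the (expensive) completion model. The paper's model is a small transformer trained on 200k real interactions; the logistic regression, features, and data here are purely illustrative assumptions.

```python
# Illustrative invocation-filter sketch: decide whether to trigger code completion at all.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per editing event: [ms since last keystroke, cursor at end of line,
# inside a comment, developer's recent acceptance rate] -- all values invented.
X = np.array([
    [900, 1, 0, 0.40],
    [120, 0, 0, 0.40],
    [1500, 1, 1, 0.10],
    [700, 1, 0, 0.65],
    [80, 0, 1, 0.05],
    [1100, 1, 0, 0.55],
])
y = np.array([1, 0, 0, 1, 0, 1])  # 1 = completion was shown and accepted, 0 = ignored/rejected

clf = LogisticRegression(max_iter=1000).fit(X, y)

def should_invoke(features, threshold=0.5) -> bool:
    """Trigger the (expensive) completion model only when acceptance looks likely."""
    return clf.predict_proba([features])[0, 1] >= threshold

print(should_invoke([1000, 1, 0, 0.5]))
```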

Publisher's Version
From Human-to-Human to Human-to-Bot Conversations in Software Engineering
Ranim Khojah, Francisco Gomes de Oliveira Neto, and Philipp Leitner
(Chalmers - University of Gothenburg, Sweden)
Software developers use natural language to interact not only with other humans, but increasingly also with chatbots. These interactions have different properties and flow differently based on what goal the developer wants to achieve and who they interact with. In this paper, we aim to understand the dynamics of conversations that occur during modern software development after the integration of AI and chatbots, enabling a deeper recognition of the advantages and disadvantages of including chatbot interactions in addition to human conversations in collaborative work. We compile existing conversation attributes with humans and NLU-based chatbots and adapt them to the context of software development. Then, we extend the comparison to include LLM-powered chatbots based on an observational study. We present similarities and differences between human-to-human and human-to-bot conversations, also distinguishing between NLU- and LLM-based chatbots. Furthermore, we discuss how understanding the differences among conversation styles guides developers in shaping their expectations of a conversation and consequently supports communication within a software team. We conclude that the conversation styles we observe with LLM-based chatbots cannot replace conversations with humans, owing to certain attributes regarding social aspects, despite their ability to support productivity and decrease developers' mental load.

Publisher's Version
Unveiling Assumptions: Exploring the Decisions of AI Chatbots and Human Testers
Francisco Gomes de Oliveira Neto
(Chalmers - University of Gothenburg, Sweden)
The integration of Large Language Models (LLMs) and chatbots introduces new challenges and opportunities for decision-making in software testing. Decision-making relies on a variety of information, including code, requirements specifications, and other software artifacts that are often unclear or exist solely in the developer’s mind. To fill in the gaps left by unclear information, we often rely on assumptions, intuition, or previous experiences to make decisions. This paper explores the potential of LLM-based chatbots, such as Bard, Copilot, and ChatGPT, to support software testers in test decisions such as prioritizing test cases effectively. We investigate whether LLM-based chatbots and human testers share similar “assumptions” or intuition in prohibitive testing scenarios where exhaustive execution of test cases is often impractical. Preliminary results from a survey of 127 testers indicate a preference for diverse test scenarios, with a significant majority (96%) favoring dissimilar test sets. Interestingly, two out of four chatbots mirrored this preference, aligning with human intuition, while the others opted for similar test scenarios, chosen by only 3.9% of testers. Our initial insights suggest a promising avenue for enhancing the collaborative dynamics between testers and chatbots.
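As a small illustration of the diversity preference reported above, the sketch below greedily selects a maximally dissimilar subset of test cases using Jaccard distance over test descriptions. The tests, the distance choice, and the greedy heuristic are illustrative assumptions, not the study's protocol.

```python
# Sketch of diversity-based test selection, mirroring the preference for dissimilar test sets.
def jaccard_distance(a: set, b: set) -> float:
    return 1.0 - len(a & b) / len(a | b)

tests = {  # hypothetical test cases described by keyword sets
    "t1": {"login", "valid", "password"},
    "t2": {"login", "invalid", "password"},
    "t3": {"checkout", "empty", "cart"},
    "t4": {"search", "unicode", "query"},
}

def pick_diverse(tests: dict, budget: int) -> list[str]:
    """Greedily pick tests that maximize the minimum distance to those already selected."""
    selected = [next(iter(tests))]
    while len(selected) < budget:
        best = max(
            (t for t in tests if t not in selected),
            key=lambda t: min(jaccard_distance(tests[t], tests[s]) for s in selected),
        )
        selected.append(best)
    return selected

print(pick_diverse(tests, budget=2))  # e.g., ['t1', 't3'] rather than the similar t1/t2 pair
```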

Publisher's Version Published Artifact Artifacts Available
Green AI in Action: Strategic Model Selection for Ensembles in Production
Nienke Nijkamp, June Sallou, Niels van der Heijden, and Luís Cruz
(Delft University of Technology, Netherlands; Deloitte Risk Advisory, Netherlands; University of Amsterdam, Netherlands)
Integrating Artificial Intelligence (AI) into software systems has significantly enhanced their capabilities while escalating energy demands. Ensemble learning, which combines predictions from multiple models to form a single prediction, intensifies this problem due to cumulative energy consumption. This paper presents a novel approach to model selection that addresses the challenge of balancing the accuracy of AI models with their energy consumption in a live AI ensemble system. We explore how reducing the number of models or improving the efficiency of model usage within an ensemble during inference can reduce energy demands without substantially sacrificing accuracy. This study introduces and evaluates two model selection strategies, Static and Dynamic, for optimizing the performance of ensemble learning systems while minimizing energy usage. Our results demonstrate that the Static strategy improves the F1 score beyond the baseline, reducing average energy usage from 100% for the full ensemble to 62%. The Dynamic strategy further enhances F1 scores, using on average 76% of the full ensemble's energy. Moreover, we propose an approach that balances accuracy with resource consumption, significantly reducing energy usage without substantially impacting accuracy. This method decreased the average energy usage of the Static strategy from approximately 62% to 14%, and for the Dynamic strategy, from around 76% to 57%. Our field study of Green AI, using an operational AI system developed by a large professional services provider, shows the practical applicability of adopting energy-conscious model selection strategies in live production environments.
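The sketch below illustrates one way a Dynamic-style strategy can cut inference energy: consult cheaper ensemble members first and stop as soon as a prediction is confident enough. The member models, energy costs, confidence rule, and threshold are illustrative assumptions, not the paper's system.

```python
# Minimal sketch of confidence-based early exit across ensemble members.
from dataclasses import dataclass

@dataclass
class Member:
    name: str
    energy_cost: float  # relative energy per inference (made up)

    def predict(self, x) -> tuple[str, float]:
        # Placeholder inference returning (label, confidence); larger models are "more confident".
        return ("positive", 0.70 + 0.12 * self.energy_cost)

ensemble = [Member("small", 1.0), Member("medium", 2.5), Member("large", 6.0)]

def dynamic_predict(x, threshold: float = 0.9):
    spent = 0.0
    for member in sorted(ensemble, key=lambda m: m.energy_cost):
        label, confidence = member.predict(x)
        spent += member.energy_cost
        if confidence >= threshold:
            break  # early exit: skip the remaining, more expensive models
    return label, spent

label, energy = dynamic_predict("some input")
print(label, f"energy used: {energy} of {sum(m.energy_cost for m in ensemble)}")
```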

Publisher's Version
Measuring Impacts of Poisoning on Model Parameters and Embeddings for Large Language Models of Code
Aftab Hussain, Md Rafiqul Islam Rabin, and Mohammad Amin Alipour
(University of Houston, USA)
Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen, particularly regarding hidden backdoors, also known as trojans. Backdoor attacks involve the insertion of triggers into training data, allowing attackers to manipulate the behavior of the model maliciously. In this paper, we focus on analyzing the model parameters to detect potential backdoor signals in code models. Specifically, we examine the attention weights and biases, and the context embeddings, of clean and poisoned CodeBERT and CodeT5 models. Our results suggest noticeable patterns in the context embeddings of poisoned samples for both poisoned models; however, attention weights and biases do not show any significant differences. This work contributes to ongoing efforts in white-box detection of backdoor signals in LLMs of code through the analysis of parameters and embeddings.
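To picture the kind of white-box signal examined here, the sketch below compares the centroids of context embeddings for clean versus trigger-containing samples. The embeddings are random placeholders standing in for CodeBERT/CodeT5 hidden states, and the centroid-shift measure is an illustrative assumption rather than the paper's analysis.

```python
# Sketch: does a poisoned model embed triggered samples differently from clean ones?
import numpy as np

rng = np.random.default_rng(0)
dim = 768  # typical hidden size; embeddings here are synthetic placeholders

clean_embeddings = rng.normal(0.0, 1.0, size=(100, dim))      # clean samples
poisoned_embeddings = rng.normal(0.5, 1.0, size=(100, dim))   # samples containing the trigger

def centroid_shift(a: np.ndarray, b: np.ndarray) -> float:
    """Distance between the mean embeddings of two sample groups."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

print(f"centroid shift: {centroid_shift(clean_embeddings, poisoned_embeddings):.2f}")
```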

Publisher's Version
A Comparative Analysis of Large Language Models for Code Documentation Generation
Shubhang Shekhar Dvivedi, Vyshnav Vijay, Sai Leela Rahul Pujari, Shoumik Lodh, and Dhruv Kumar
(IIIT Delhi, India)
This paper presents a comprehensive comparative analysis of Large Language Models (LLMs) for the generation of code documentation. Code documentation is an essential part of the software writing process. The paper evaluates models such as GPT-3.5, GPT-4, Bard, Llama2, and StarChat on parameters such as Accuracy, Completeness, Relevance, Understandability, Readability, and Time Taken for different levels of code documentation. Our evaluation employs a checklist-based system to minimize subjectivity, providing a more objective assessment. We find that, barring StarChat, all LLMs consistently outperform the original documentation. Notably, the closed-source models GPT-3.5, GPT-4, and Bard exhibit superior performance across various parameters compared to the open-source/source-available LLMs Llama2 and StarChat. Considering the time taken for generation, GPT-4 demonstrated the longest duration by a significant margin, followed by Llama2 and Bard, with GPT-3.5 and StarChat having comparable generation times. Additionally, file-level documentation had considerably worse performance across all parameters (except time taken) compared to inline and function-level documentation.

Publisher's Version
An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping
Boming Xia, Qinghua Lu, Liming Zhu, and Zhenchang Xing
(CSIRO’s Data61, Australia; UNSW, Australia; Australian National University, Australia)
The advent of advanced AI underscores the urgent need for comprehensive safety evaluations, necessitating collaboration across communities (i.e., AI, software engineering, and governance). However, divergent practices and terminologies across these communities, combined with the complexity of AI systems—of which models are only a part—and environmental affordances (e.g., access to tools), obstruct effective communication and comprehensive evaluation. This paper proposes a framework for AI system evaluation comprising three components: 1) harmonised terminology to facilitate communication across communities involved in AI safety evaluation; 2) a taxonomy identifying essential elements for AI system evaluation; 3) a mapping between AI lifecycle, stakeholders, and requisite evaluations for accountable AI supply chain. This framework catalyses a deeper discourse on AI system evaluation beyond model-centric approaches.

Publisher's Version
Towards AI for Software Systems
Nafise Eskandani and Guido Salvaneschi
(ABB Corporate Research Center, Germany; University of St. Gallen, Switzerland)
Generative Artificial Intelligence (GenAI) is being adopted for a number of Software Engineering activities, mostly centered around coding, such as code generation, code comprehension, code reviews, test generation, and bug fixing. Other phases of the Software Engineering process have been less explored. In this paper, we argue that more investigation is needed into the support that GenAI can provide to the design and operation of software systems, i.e., a number of crucial activities beyond coding that are necessary to successfully deliver and maintain software services. These include reasoning about architectural choices and dealing with third-party platforms. We discuss crucial aspects of AI for software systems, taking Function as a Service (FaaS) as a use case. We present several challenges, including cold start delays, stateless functions, debugging complexities, and vendor lock-in, and explore the potential of GenAI tools to mitigate FaaS challenges. Finally, we outline future research into the application of GenAI tools for the development and deployment of software systems.

Publisher's Version
AI-Assisted Assessment of Coding Practices in Modern Code Review
Manushree Vijayvergiya, Małgorzata Salawa, Ivan Budiselić, Dan Zheng, Pascal Lamblin, Marko Ivanković, Juanjo Carin, Mateusz Lewko, Jovan Andonov, Goran Petrović, Daniel Tarlow, Petros Maniatis, and René Just
(Google, Switzerland; Google DeepMind, USA; Google DeepMind, Canada; Google, USA; University of Washington, USA)
Modern code review is a process in which an incremental code contribution made by a code author is reviewed by one or more peers before it is committed to the version control system. An important element of modern code review is verifying that code contributions adhere to best practices. While some of these best practices can be automatically verified, verifying others is commonly left to human reviewers. This paper reports on the development, deployment, and evaluation of AutoCommenter, a system backed by a large language model that automatically learns and enforces coding best practices. We implemented AutoCommenter for four programming languages (C++, Java, Python, and Go) and evaluated its performance and adoption in a large industrial setting. Our evaluation shows that an end-to-end system for learning and enforcing coding best practices is feasible and has a positive impact on the developer workflow. Additionally, this paper reports on the challenges associated with deploying such a system to tens of thousands of developers and the corresponding lessons learned.

Publisher's Version
Leveraging Machine Learning for Optimal Object-Relational Database Mapping in Software Systems
Sasan Azizian, Elham Rastegari, and Hamid Bagheri
(University of Nebraska-Lincoln, USA; Creighton University, USA)
Modern software systems, developed using object-oriented programming languages (OOPL), often rely on relational databases (RDB) for persistent storage, leading to the object-relational impedance mismatch problem (IMP). Although Object-Relational Mapping (ORM) tools like Hibernate and Django provide a layer of indirection, designing efficient application-specific data mappings remains challenging and error-prone. The selection of mapping strategies significantly influences data storage and retrieval performance, necessitating a thorough understanding of paradigms and systematic tradeoff exploration. The state-of-the-art systematic design tradeoff space exploration faces scalability issues, especially in large systems. This paper introduces a novel methodology, dubbed Leant, for learning-based analysis of tradeoffs, leveraging machine learning to derive domain knowledge autonomously, thus aiding the effective mapping of object models to relational schemas. Our preliminary results indicate a reduction in time and cost overheads associated with developing (Pareto-) optimal object-relational database schemas, showcasing Leant's potential in addressing the challenges of object-relational impedance mismatch and advancing object-relational mapping optimization and database design.

Publisher's Version
A Case Study of LLM for Automated Vulnerability Repair: Assessing Impact of Reasoning and Patch Validation Feedback
Ummay Kulsum, Haotian Zhu, Bowen Xu, and Marcelo d'Amorim
(North Carolina State University, USA; Singapore Management University, Singapore)
Recent work in automated program repair (APR) proposes the use of reasoning and patch validation feedback to reduce the semantic gap between the LLMs and the code under analysis. The idea has been shown to perform well for general APR, but its effectiveness in other particular contexts remains underexplored. In this work, we assess the impact of reasoning and patch validation feedback to LLMs in the context of vulnerability repair, an important and challenging task in security. To support the evaluation, we present VRpilot, an LLM-based vulnerability repair technique based on reasoning and patch validation feedback. VRpilot (1) uses a chain-of-thought prompt to reason about a vulnerability prior to generating patch candidates and (2) iteratively refines prompts according to the output of external tools (e.g., compiler, code sanitizers, test suite, etc.) on previously generated patches. To evaluate performance, we compare VRpilot against the state-of-the-art vulnerability repair techniques for C and Java using public datasets from the literature. Our results show that VRpilot generates, on average, 14% and 7.6% more correct patches than the baseline techniques on C and Java, respectively. We show, through an ablation study, that reasoning and patch validation feedback are critical. We report several lessons from this study and potential directions for advancing LLM-empowered vulnerability repair.
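The sketch below captures the reason-then-validate loop described above: a chain-of-thought prompt about the vulnerability, a candidate patch, and iterative re-prompting with tool feedback. The prompts and the llm()/validate() stubs are illustrative assumptions, not VRpilot's code.

```python
# High-level sketch of a reasoning + patch-validation-feedback repair loop.
def llm(prompt: str) -> str:
    return "candidate patch ..."  # placeholder for a real LLM call

def validate(patch: str) -> list[str]:
    """Placeholder for compiler / sanitizer / test-suite feedback; empty list means the patch passes."""
    return []

def repair(vulnerable_code: str, max_iterations: int = 3) -> str | None:
    # Step 1: chain-of-thought reasoning about the vulnerability before patching.
    reasoning = llm(f"Explain step by step why this code is vulnerable:\n{vulnerable_code}")
    prompt = f"{reasoning}\nNow produce a minimal patch for:\n{vulnerable_code}"
    for _ in range(max_iterations):
        patch = llm(prompt)
        errors = validate(patch)
        if not errors:
            return patch  # accepted: no tool reported a problem
        # Step 2: fold tool output back into the prompt and try again.
        feedback = "\n".join(errors)
        prompt = f"The previous patch failed with:\n{feedback}\nRevise the patch:\n{patch}"
    return None

print(repair("strcpy(buf, user_input);"))
```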

Publisher's Version
SolMover: Smart Contract Code Translation Based on Concepts
Rabimba Karanjai, Lei Xu, and Weidong Shi
(University of Houston, USA; Kent State University, USA)
Large language models (LLMs) have showcased remarkable skills, rivaling or even exceeding human intelligence in certain areas. Their proficiency in translation is notable, as they may replicate the nuanced, preparatory steps of human translators for high-quality outcomes. Although there has been some notable work exploring the use of LLMs for code-to-code translation, none has targeted smart contracts, especially when the target language is unseen by the LLMs. In this work, we introduce our novel framework SolMover, which consists of two different LLMs working in tandem to understand coding concepts and then use them to translate code into an unseen language. We explore the human-like learning capability of LLMs with a detailed evaluation of the methodology for translating existing smart contracts written in Solidity to a low-resource language called Move. Specifically, we enable one LLM to understand the coding rules of the new language and generate a plan for a second LLM to follow; the second LLM lacks planning capability but can generate code. Experiments show that SolMover brings a significant improvement over gpt-3.5-turbo-1106 and outperforms both Palm2 and Mixtral-8x7B-Instruct. Our further analysis shows that employing our bug mitigation technique, even without the framework, still improves code quality for all models.
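A minimal sketch of the planner/coder split described above follows: one model digests the target language's rules and emits a plan, and a second model follows the plan to produce code. The rules text, prompts, and the planner()/coder() stubs are illustrative assumptions, not SolMover's implementation.

```python
# Two-model pipeline sketch: a planning LLM guides a coding LLM toward an unseen target language.
MOVE_RULES = "Move modules declare resources with `struct ... has key`; no dynamic dispatch; ..."

def planner(prompt: str) -> str:
    return "1. Map the Solidity contract's storage to a Move resource.\n2. Port each function."

def coder(prompt: str) -> str:
    return "module example::counter { /* generated Move code */ }"

def translate(solidity_source: str) -> str:
    # LLM #1 digests the target language's rules and produces a step-by-step plan.
    plan = planner(f"Given these Move coding rules:\n{MOVE_RULES}\n"
                   f"Plan how to translate this Solidity contract:\n{solidity_source}")
    # LLM #2 follows the plan to emit code in the unseen target language.
    return coder(f"Follow this plan exactly and emit Move code:\n{plan}\n\nSource:\n{solidity_source}")

print(translate("contract Counter { uint256 count; function inc() public { count += 1; } }"))
```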

Publisher's Version
Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs
Sylvain Kouemo Ngassom, Arghavan Moradi Dakhel, Florian Tambon, and Foutse Khomh
(Polytechnique Montréal, Canada)
LLM-based assistants, such as GitHub Copilot and ChatGPT, have the potential to generate code that fulfills a programming task described in a natural language description, referred to as a prompt. The widespread accessibility of these assistants enables users with diverse backgrounds to generate code and integrate it into software projects. However, studies show that code generated by LLMs is prone to bugs and may miss various corner cases in task specifications. Presenting such buggy code to users can affect the reliability of, and their trust in, LLM-based assistants. Moreover, significant effort is required from the user to detect and repair any bug present in the code, especially if no test cases are available. In this study, we propose a self-refinement method aimed at improving the reliability of code generated by LLMs by minimizing the number of bugs before execution, without human intervention, and in the absence of test cases. Our approach is based on targeted Verification Questions (VQs) to identify potential bugs within the initial code. These VQs target various nodes within the Abstract Syntax Tree (AST) of the initial code that have the potential to trigger specific types of bug patterns commonly found in LLM-generated code. Finally, our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code. Our evaluation, based on programming tasks in the CoderEval dataset, demonstrates that our proposed method outperforms state-of-the-art methods by decreasing the number of targeted errors in the code by 21% to 62% and improving the number of executable code instances to 13%.
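As a toy version of the idea above, the sketch below walks the AST of a generated snippet and emits targeted verification questions for node types that often correspond to bug patterns. The node-to-question mapping and the example snippet are made-up illustrations, not the paper's catalogue of VQs.

```python
# Sketch: derive targeted verification questions (VQs) from AST nodes of generated code.
import ast

def verification_questions(code: str) -> list[str]:
    tree = ast.parse(code)
    questions = []
    for node in ast.walk(tree):
        src = None
        if isinstance(node, ast.Subscript):
            template = "Is the index in '{src}' guaranteed to be within bounds?"
            src = ast.unparse(node)
        elif isinstance(node, ast.Call):
            template = "Does the call '{src}' handle a None or empty return value?"
            src = ast.unparse(node)
        elif isinstance(node, ast.BinOp) and isinstance(node.op, ast.Div):
            template = "Can the divisor in '{src}' ever be zero?"
            src = ast.unparse(node)
        if src is not None:
            questions.append(template.format(src=src))
    return questions

generated = "def mean(xs):\n    return sum(xs) / len(xs)\n"
for q in verification_questions(generated):
    print(q)  # these questions would be fed back to the LLM together with the initial code
```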

Publisher's Version
The Role of Generative AI in Software Development Productivity: A Pilot Case Study
Mariana Coutinho, Lorena Marques, Anderson Santos, Marcio Dahia, Cesar França, and Ronnie de Souza Santos
(CESAR School, Brazil; University of Calgary, Canada)
With software development increasingly reliant on innovative technologies, there is a growing interest in exploring the potential of generative AI tools to streamline processes and enhance productivity. In this scenario, this paper investigates the integration of generative AI tools within software development, focusing on understanding their uses, benefits, and challenges to software professionals, in particular, looking at aspects of productivity. Through a pilot case study involving software practitioners working in different roles, we gathered valuable experiences on the integration of generative AI tools into their daily work routines. Our findings reveal a generally positive perception of these tools in individual productivity while also highlighting the need to address identified limitations. Overall, our research sets the stage for further exploration into the evolving landscape of software development practices with the integration of generative AI tools.

Publisher's Version
The Art of Programming: Challenges in Generating Code for Creative Applications
Michael Cook
(King’s College London, United Kingdom)
Programming has been a key tool for artists and other creatives for decades, and the creative use of programming presents unique challenges, opportunities and perspectives for researchers considering how AI can be used to support coding more generally. In this paper we aim to motivate researchers to look deeper into some of these areas, by highlighting some interesting uses of programming in creative practices, suggesting new research questions posed by these spaces, and briefly raising important issues that work in this area may face.

Publisher's Version
Automatic Programming vs. Artificial Intelligence
James Noble
(Creative Research & Programming, New Zealand; Australian National University, Australia)
Ever since we began programming in the 1950s, there have been two diametrically opposed tendencies within computer science and software engineering: on the left side of the Glorious Throne of Alan Turing, the tendency to perfect the Art of Computer Programming, and on the right side, the tendency to end it. These tendencies can be seen from the Manchester Mark I's "autocode" removing the need for programmers shortly after WW2; COBOL being a language that could be "read by the management"; to contemporary "no-code" development environments; and the idea that large language models herald "The End of Programming". This vision paper looks at what AI will not change about software systems, and the people who must use them, and necessarily must build them. Rather than neglecting 50 years of history, theory, and practice, and assuming programming can, will, and should be ended by AI, we speculate on how AI has, already does, and will continue to perfect one of the peak activities of being human: programming.

Publisher's Version
Neuro-Symbolic Approach to Certified Scientific Software Synthesis
Hamid Bagheri, Mehdi Mirakhorli, Mohamad Fazelnia, Ibrahim Mujhid, and Md Rashedul Hasan
(University of Nebraska-Lincoln, USA; University of Hawaii at Manoa, USA)
Scientific software development demands robust solutions to meet the complexities of modern scientific systems. In response, we propose a paradigm-shifting Neuro-Symbolic Approach to Certified Scientific Software Synthesis. This innovative framework integrates large language models (LLMs) with formal methods, facilitating automated synthesis of complex scientific software while ensuring verifiability and correctness. Through a combination of technologies including a Scientific and Satisfiability-Aided Large Language Model (SaSLLM), a Scientific Domain Specific Language (DSL), and Generalized Planning for Abstract Reasoning, our approach transforms scientific concepts into certified software solutions. By leveraging advanced reasoning techniques, our framework streamlines the development process, allowing scientists to focus on design and exploration. This approach represents a significant step towards automated, certified-by-design scientific software synthesis, revolutionizing the landscape of scientific research and discovery.

Publisher's Version
Effectiveness of ChatGPT for Static Analysis: How Far Are We?
Mohammad Mahdi Mohajer, Reem Aleithan, Nima Shiri Harzevili, Moshi Wei, Alvine Boaye Belle, Hung Viet Pham, and Song Wang
(York University, Canada)
This paper presents a novel study exploring the capabilities of ChatGPT, a state-of-the-art LLM, in static analysis tasks such as static bug detection and false-positive warning removal. In our evaluation, we focus on two typical and critical bug types targeted by static bug detection, namely Null Dereference and Resource Leak, as our subjects. We employ Infer, a well-established static analyzer, to aid the gathering of these two bug types from 10 open-source projects. Consequently, our experiment dataset contains 222 instances of Null Dereference bugs and 46 instances of Resource Leak bugs. Our study demonstrates that ChatGPT can achieve remarkable performance in the mentioned static analysis tasks, including bug detection and false-positive warning removal. In static bug detection, ChatGPT achieves accuracy and precision values of up to 68.37% and 63.76% for detecting Null Dereference bugs and 76.95% and 82.73% for detecting Resource Leak bugs, improving on the precision of the current leading bug detector, Infer, by 12.86% and 43.13%, respectively. For removing false-positive warnings, ChatGPT can reach a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs, surpassing existing state-of-the-art false-positive warning removal tools.
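The sketch below shows the general shape of using an LLM as a false-positive filter over a static analyzer's warnings: package a warning and its code context into a prompt and parse a verdict. The warning record, prompt wording, and llm() stub are illustrative assumptions rather than the study's exact setup or Infer's output format.

```python
# Sketch: LLM-based triage of a static-analysis warning (true vs. false positive).
def llm(prompt: str) -> str:
    return "FALSE POSITIVE: the pointer is checked on line 2."  # placeholder response

warning = {  # hypothetical warning record
    "type": "NULL_DEREFERENCE",
    "file": "Cache.java",
    "line": 42,
    "snippet": "Entry e = lookup(key);\nif (e == null) return null;\nreturn e.value;",
}

prompt = (
    f"A static analyzer reports a {warning['type']} at {warning['file']}:{warning['line']}.\n"
    f"Code:\n{warning['snippet']}\n"
    "Answer 'TRUE POSITIVE' or 'FALSE POSITIVE' and justify briefly."
)

verdict = llm(prompt)
keep_warning = verdict.strip().upper().startswith("TRUE POSITIVE")
print(keep_warning, "-", verdict)
```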

Publisher's Version Published Artifact Artifacts Available
RUBICON: Rubric-Based Evaluation of Domain-Specific Human AI Conversations
Param Biyani, Yasharth Bajpai, Arjun Radhakrishna, Gustavo Soares, and Sumit Gulwani
(Microsoft, India; Microsoft, USA)
Evaluating conversational assistants, such as GitHub Copilot Chat, poses a significant challenge for tool builders in the domain of Software Engineering. These assistants rely on language models and chat-based user experiences, rendering their evaluation with respect to the quality of the Human-AI conversations complicated. Existing general-purpose metrics for measuring conversational quality found in literature are inadequate for appraising domain-specific dialogues due to their lack of contextual sensitivity. In this paper, we present RUBICON, a technique for evaluating domain-specific Human-AI conversations. RUBICON leverages large language models to generate candidate rubrics for assessing conversation quality and employs a selection process to choose the subset of rubrics based on their performance in scoring conversations. In our experiments, RUBICON effectively learns to differentiate conversation quality, achieving higher accuracy and yield rates than existing baselines.
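To make the rubric-selection step above more tangible, the sketch below scores a few candidate rubrics on conversations already labeled good or bad and keeps the rubrics that best separate the two groups. The rubrics, hard-coded scores, and separation measure are illustrative assumptions; in RUBICON the scores would come from an LLM judging each conversation against each rubric.

```python
# Sketch of rubric selection: keep candidate rubrics whose scores discriminate conversation quality.
# rubric -> (scores on conversations labeled good, scores on conversations labeled bad)
candidate_rubrics = {
    "assistant asks a clarifying question when the request is ambiguous": ([0.9, 0.8, 0.85], [0.3, 0.4, 0.2]),
    "assistant apologizes politely": ([0.7, 0.6, 0.8], [0.7, 0.65, 0.75]),
    "proposed code compiles in the user's context": ([0.95, 0.9, 0.85], [0.2, 0.1, 0.3]),
}

def separation(good: list[float], bad: list[float]) -> float:
    """Crude separation measure: gap between mean scores on good vs. bad conversations."""
    return sum(good) / len(good) - sum(bad) / len(bad)

selected = sorted(candidate_rubrics, key=lambda r: separation(*candidate_rubrics[r]), reverse=True)[:2]
print(selected)  # the 'apologizes politely' rubric barely discriminates and is dropped
```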

Publisher's Version Info

Challenge Track

Investigating the Potential of Using Large Language Models for Scheduling
Deddy Jobson and Yilin Li
(Mercari, Japan)
The inaugural ACM International Conference on AI-powered Software introduced the AIware Challenge, prompting researchers to explore AI-driven tools for optimizing conference programs through constrained optimization. We investigate the use of Large Language Models (LLMs) for program scheduling, focusing on zero-shot learning and integer programming to measure paper similarity. Our study reveals that LLMs, even under zero-shot settings, create reasonably good first drafts of conference schedules. When clustering papers, using only titles as LLM inputs produces results closer to human categorization than using titles and abstracts with TF-IDF. The code has been made publicly available at https://github.com/deddyjobson/llms-for-scheduling.
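The sketch below contrasts the two clustering routes mentioned above: a TF-IDF plus k-means baseline over paper titles, and a zero-shot prompt asking an LLM to group the same titles into sessions. The titles and the llm() stub are illustrative assumptions, not the authors' data or prompts.

```python
# Sketch: lexical clustering baseline vs. zero-shot LLM grouping of paper titles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

titles = [  # hypothetical paper titles
    "Trust in AI Code Completion",
    "Smart Invocation of Code Completion",
    "Green AI Model Selection for Ensembles",
    "Energy-Aware Ensemble Inference",
]

# Baseline: TF-IDF features clustered with k-means.
X = TfidfVectorizer().fit_transform(titles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(titles, labels)))

# Zero-shot alternative: hand the raw titles to an LLM and ask for thematic groups.
def llm(prompt: str) -> str:
    return '{"Code Completion": [0, 1], "Green AI": [2, 3]}'  # placeholder response

prompt = "Group these paper titles into thematic sessions, returning JSON of session -> title indices:\n" \
         + "\n".join(f"{i}: {t}" for i, t in enumerate(titles))
print(llm(prompt))
```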

Publisher's Version
Automated Scheduling for Thematic Coherence in Conferences
Mahzabeen Emu, Tasnim Ahmed, and Salimur Choudhury
(Queen’s University, Canada)
This study presents a novel Artificial Intelligence (AI)-based approach to automating conference session scheduling that integrates Natural Language Processing (NLP) techniques. By analyzing contextual information from conference activities, the framework uses a Constraint Satisfaction Problem (CSP) model to optimize the schedule. Using data from the MSR conference, and employing GPT-3.5 and SciBERT, the framework generates weighted contextual similarities and applies hard constraints to ensure scheduling feasibility and soft constraints to improve thematic coherence. Evaluations demonstrate its ability to surpass manual scheduling.

Publisher's Version
Conference Program Scheduling using Genetic Algorithms
Rucha Deshpande, Aishwarya Devi Akila Pandian, and Vigneshwaran Dharmalingam
(Purdue University, USA)
Program creation is the process of allocating presentation slots for each paper accepted to a conference with parallel sessions. This process, generally done manually, involves intricate decision-making to align with multiple constraints and optimize goals: the total time for all presentations in a session cannot exceed the length of the session, the total number of sessions has to equal the number of sessions provided by the Program Committee chairs, and, if there are parallel tracks, no two papers with common authors can be scheduled at the same time. We propose a Conference Program Scheduler using Genetic Algorithms to automate the conference schedule creation process while addressing these constraints and maximizing session theme coherence. Our evaluation focuses on the algorithm's ability to meet the outlined constraints and its success in creating thematically coherent and time-efficient sessions.
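A compact genetic-algorithm sketch of this idea follows: a chromosome assigns each paper to a session, the fitness function penalizes session overflow (a hard constraint) and rewards thematic coherence (a soft goal), and the population evolves by selection and mutation. The papers, durations, weights, and GA parameters are illustrative assumptions, and the parallel-track author constraint is omitted for brevity.

```python
# Compact genetic-algorithm sketch for assigning papers to sessions.
import random

random.seed(0)
papers = [  # (duration in minutes, theme) -- made-up data
    (15, "testing"), (15, "testing"), (20, "llm"),
    (20, "llm"), (15, "green"), (20, "green"),
]
SESSIONS, CAPACITY = 3, 40  # three sessions, 40 minutes each

def fitness(assignment):
    score = 0.0
    for s in range(SESSIONS):
        idx = [i for i, a in enumerate(assignment) if a == s]
        if sum(papers[i][0] for i in idx) > CAPACITY:
            score -= 100  # hard constraint: session overflow
        themes = [papers[i][1] for i in idx]
        score += sum(themes.count(t) - 1 for t in set(themes))  # soft goal: thematic coherence
    return score

def mutate(assignment):
    child = assignment[:]
    child[random.randrange(len(child))] = random.randrange(SESSIONS)
    return child

population = [[random.randrange(SESSIONS) for _ in papers] for _ in range(30)]
for _ in range(200):  # evolve: keep the fittest assignments, mutate them
    population.sort(key=fitness, reverse=True)
    population = population[:10] + [mutate(random.choice(population[:10])) for _ in range(20)]

best = max(population, key=fitness)
print(best, fitness(best))
```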

Publisher's Version

Industry Statements and Demo Track

AI Assistant in JetBrains IDE: Insights and Challenges (Invited Talk)
Andrey Sokolov
(JetBrains, Netherlands)
Frontier large language models in 2024 are capable of performing numerous tasks, including software engineering ones, and are widely used to create impressive demos. However, productizing these models remains notoriously challenging. JetBrains AI Assistant is one such product: a developer tool that provides IDE users with over 10 different features backed by the latest LLMs. It supports various programming languages and ships in a number of different IDE products (IntelliJ, Rider, PyCharm, GoLand, PhpStorm, etc.). In this session we will share, from the perspective of an industrial developer-tools vendor, the story of deploying LLM-based features to production over the past year of developing AI Assistant: technical successes, challenges, and capabilities we believe LLMs pre-trained on code still lack in the SE context.

Publisher's Version
AI-Assisted User Intent Formalization for Programs: Problem and Applications (Invited Talk)
Shuvendu K. Lahiri
(Microsoft Research, USA)
The correctness of a program is only as good as the specification against which it is verified. However, writing down formal specifications is non-trivial for mainstream developers. This limits the use of rigorous testing and verification techniques to improve the quality of code, including code generated by AI. In this talk, I will describe ongoing work on deriving formal specifications from informal sources of intent such as natural language docstrings, API documentation, and RFCs. We describe different ways to formalize the intent as declarative specifications, ranging from tests and postconditions (in mainstream languages such as Python as well as verification-aware languages such as Dafny) to data-format specifications in the 3D language. We formalize metrics to evaluate the quality of various LLM-based techniques, inspired by mutation testing but with novel encodings that require mutant generation or invoke formal verifiers. We describe our current progress on creating benchmarks for these tasks. We will focus on two applications of specification generation: first, we will demonstrate the use of postcondition generation for Java to discover several historical bugs in the Defects4J benchmark. Second, we demonstrate the use of agents for translating informal intent in RFCs into formal data-format specifications in the 3D language, from which verified parsers can be constructed. We will outline the outstanding challenges in this area and hope to engage in active discussions with the participants of the PL/SE community.

Publisher's Version
AI-Based Digital Twins: A Tale of Innovation in Norwegian Public Sectors (Invited Talk)
Shaukat Ali
(Simula, Norway)
Digital twins, often called “virtual replicas” of their underlying systems, enable advanced system analyses during their design, development, and operation. This talk will cover how we utilized digital twins that we built with artificial intelligence techniques in two real-world applications in two Norwegian public sectors for quality assurance of the healthcare services they provide to residents. The first case concerns the Oslo City healthcare department, which, together with several industrial vendors, provides various IoT-based healthcare services to its residents, such as patients' home care with an industrial IoT platform connecting diverse medical devices, information systems, hospitals, pharmacies, etc. The second case concerns the Cancer Registry of Norway (CRN), which collects and processes cancer-related data and produces relevant data and statistics for various stakeholders (e.g., patients, government, and researchers) with a complex socio-technical software system connecting diverse external systems, e.g., medical laboratories, hospitals, and general practitioners’ software systems. For both cases, we built digital twins with AI-based techniques (e.g., neural networks) to build replicas of these systems to support automated testing at scale with thousands of diverse devices, such as in the case of Oslo City, supporting data validation at CRN and running advanced simulations to test the system. In addition, since these systems undergo continuous evolution, we also built digital twin evolution approaches with advanced AI techniques (e.g., transfer learning and meta-learning) for the cost-effective evolution of digital twins. The talk will present various challenges that we faced when developing such digital twins (e.g., related to personal data). Next, we will present the technical details of the digital twins and their evolution approaches, followed by the key results. Finally, we will present the ongoing work, further research, technical challenges, and issues related to the deployment of digital twins. Due to data privacy concerns and confidentiality agreements, the digital twins and approaches to building digital twins have not been publicly released.

Publisher's Version
Agents for Data Science: From Raw Data to AI-Generated Notebooks using LLMs and Code Execution (Invited Talk)
Jiahao Cai
(Google, USA)
Data science tasks involve a complex interplay of datasets, code, and code outputs for answering questions, deriving insights, or building models from data. Tasks and chosen methods may require specialized data-domain or scientific-domain knowledge. Queries range from high-level (low-code) to highly technical (high-code). Code execution results, such as plots and tables, are artifacts used by data scientists to interpret and reason about the current and future states of a solution towards completing the task. This presents unique challenges in designing, deploying, and evaluating LLM-based agents for automating data science workflows. In this talk we will introduce an end-to-end, autonomous Data Science Agent (DSA) built around Gemini and available as an experiment at labs.google/code. DSA leverages agentic flows, planning, and orchestration to tackle open-ended data science explorations. It uses LLMs for planning, task decomposition, code generation, reasoning, and error correction through code execution. DSA is designed to streamline the entire data science process, enabling users to query data in natural language and go from a dataset and prompt to a fully AI-generated, populated notebook. We’ll discuss design choices (prompting, SFT, orchestration), iterative development cycles, evaluation, lessons learned, and future challenges. Where applicable, we will showcase real-world case studies demonstrating how DSA can assist with bootstrapping the analysis of data from complex scientific domains.

Publisher's Version
AI in Software Engineering at Google: Progress and the Path Ahead (Invited Talk)
Satish Chandra
(Google, USA)
Over a period of just about 5 years, the use of AI-based tools for software engineering has gone from being a very promising research investigation to indispensable features in modern developer environments. This talk will present AI-powered improvements and continuing transformation of Google’s internal software development. This viewpoint comes from extensive experience with developing and deploying AI-based tools to surfaces where Google engineers spend the majority of their time, including inner loop activities such as code authoring, review and search, as well as outer loop ones such as bug management and planning. Improvements in these surfaces are monitored carefully for productivity and developer satisfaction. We will describe the challenges in how to align our internal efforts with the very fast moving field of LLMs. We need to constantly make judgment calls on technical feasibility, the possibility of iterative improvement and the measurability of impact as we decide what ideas to pursue for production level adaptation and adoption. The talk will go into several examples of this that we have gone through in recent past, and what we have learned in the process. We will conclude the talk with changes that we expect to land in the next five years and some thoughts on how the community can collaborate better by focusing on good benchmarks.

Publisher's Version
