FSE 2024 CoLos
32nd ACM International Conference on the Foundations of Software Engineering (FSE 2024)

1st ACM International Conference on AI-Powered Software (AIware 2024), July 15–16, 2024, Porto de Galinhas, Brazil

AIware 2024 – Preliminary Table of Contents


Frontmatter

Title Page


Message from the Chairs


Committees


Main Track

Identifying the Factors That Influence Trust in AI Code Completion
Adam Brown, Sarah D'Angelo, Ambar Murillo, Ciera Jaspan, and Collin Green
(Google, USA; Google, New Zealand; Google, Germany)
AI-powered software development tooling is changing the way developers interact with tools and write code. However, the ability of AI to truly transform software development may depend on developers' levels of trust in these tools, which has consequences for tool adoption and repeated usage. In this work, we take a mixed-methods approach to measuring the factors that influence developers' trust in AI-powered code completion. We found that characteristics of the AI suggestion itself (e.g., the quality of the suggestion), of the developer interacting with the suggestion (e.g., their expertise in a language), and of the context of the development work (e.g., whether the suggestion appeared in a test file) all influenced acceptance rates of AI-powered code suggestions. Based on these findings, we propose a number of recommendations for the design of AI-powered development tools to improve trust.

Article Search
Unveiling the Potential of a Conversational Agent in Developer Support: Insights from Mozilla’s PDF.js Project
João Correia, Morgan C. Nicholson, Daniel Coutinho, Caio Barbosa, Marco Castelluccio, Marco Gerosa, Alessandro Garcia, and Igor Steinmacher
(PUC-Rio, Brazil; University of São Paulo, Brazil; Mozilla, United Kingdom; Northern Arizona University, USA)
Large language models and other foundation models (FMs) boost productivity by automating code generation, supporting bug fixes, and generating documentation. We propose that FMs can further support Open Source Software (OSS) projects by assisting developers and guiding the community. Currently, core developers and maintainers answer queries about processes, architecture, and source code, but their time is limited, often leading to delays. To address this, we introduce DevMentorAI, a tool that enhances developer-project interactions by leveraging source code and technical documentation. DevMentorAI uses the Retrieval Augmented Generation (RAG) architecture to identify and retrieve content relevant to a query. We evaluated DevMentorAI in a case study on the PDF.js project, using real questions from a development chat room and comparing DevMentorAI's answers to those from humans. A Mozilla expert rated the answers, finding DevMentorAI's responses more satisfactory in 8 of 14 cases, equally satisfactory in 3, and less satisfactory in 3. These results demonstrate the potential of foundation models and the RAG approach to support developers and reduce the burden on core developers.
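To make the retrieval step concrete: a RAG pipeline of this kind, in outline, retrieves the documentation passages most relevant to a question and grounds an LLM prompt in them. The sketch below is illustrative only, not DevMentorAI's implementation; the toy corpus and the ask_llm placeholder are invented, with TF-IDF standing in for the actual retriever.

    # Minimal RAG sketch: retrieve the most relevant documentation
    # passages for a question, then ground an LLM prompt in them.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "PDF.js renders pages through a canvas-based viewer.",
        "Bugs are triaged in Bugzilla before patches are reviewed.",
        "Run the local development server with gulp.",
    ]

    def ask_llm(prompt: str) -> str:
        return "[model answer]"  # placeholder for a real chat-model call

    def answer(query: str, k: int = 2) -> str:
        vec = TfidfVectorizer().fit(docs + [query])
        scores = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]
        context = "\n".join(docs[i] for i in scores.argsort()[::-1][:k])
        return ask_llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

    print(answer("How do I run the PDF.js dev server?"))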

Article Search
Function+Data Flow: A Framework to Specify Machine Learning Pipelines for Digital Twinning
Eduardo de Conto, Blaise Genest, and Arvind Easwaran
(Nanyang Technological University, Singapore; CNRS@CREATE, France; IPAL - CNRS - CNRS@CREATE, France)
The development of digital twins (DTs) for physical systems increasingly leverages artificial intelligence (AI), particularly for combining data from different sources or for creating computationally efficient, reduced-dimension models. Indeed, even in very different application domains, twinning employs common techniques such as model order reduction and modeling with hybrid data (that is, data sourced from both physics-based models and sensors). Despite this apparent generality, current development practices are ad hoc, making the design of AI pipelines for digital twinning complex and time-consuming. Here we propose Function+Data Flow (FDF), a domain-specific language (DSL) to describe AI pipelines within DTs. FDF aims to facilitate the design and validation of digital twins. Specifically, FDF treats functions as first-class citizens, enabling effective manipulation of models learned with AI. We illustrate the benefits of FDF on two concrete use cases from different domains: predicting the plastic strain of a structure and modeling the electromagnetic behavior of a bearing.

Article Search
A Transformer-Based Approach for Smart Invocation of Automatic Code Completion
Aral de Moor, Arie van Deursen, and Maliheh Izadi
(Delft University of Technology, Netherlands)
Transformer-based language models are highly effective for code completion, with much research dedicated to enhancing the content of these completions. Despite their effectiveness, these models come with high operational costs and can be intrusive, especially when they surface suggestions too often and interrupt developers who are concentrating on their work. Current research largely overlooks how these models interact with developers in practice and neglects to address when a developer should receive completion suggestions. To tackle this issue, we developed a machine learning model that can accurately predict when to invoke a code completion tool given the code context and available telemetry data. To do so, we collected a dataset of 200k developer interactions with our cross-IDE code completion plugin and trained several invocation filtering models. Our results indicate that our small-scale transformer model significantly outperforms the baseline while maintaining sufficiently low latency. We further explore the search space for integrating additional telemetry data directly into a pre-trained transformer and obtain promising results. To further demonstrate our approach's practical potential, we deployed the model in an online environment with 34 developers and provide real-world insights based on 74k actual invocations.
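The paper trains a transformer for this decision; purely to make the invocation-filtering idea concrete, here is a far simpler rule-based stand-in over the same kinds of context and telemetry signals. The features, weights, and threshold are all invented.

    # Toy invocation filter: decide from a few context/telemetry
    # signals whether to trigger a completion. All weights invented.
    def should_invoke(prefix: str, ms_since_last_keystroke: int,
                      accepted_last_suggestion: bool) -> bool:
        paused = ms_since_last_keystroke > 500            # developer seems to pause
        at_boundary = prefix.rstrip().endswith(("(", ",", "=", ":", "."))
        score = 0.4 * paused + 0.4 * at_boundary + 0.2 * accepted_last_suggestion
        return score >= 0.5

    print(should_invoke("result = compute(", 800, False))  # True: pause + boundary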

Article Search
From Human-to-Human to Human-to-Bot Conversations in Software Engineering
Ranim Khojah, Francisco Gomes de Oliveira Neto, and Philipp Leitner
(Chalmers - University of Gothenburg, Sweden)
Software developers use natural language to interact not only with other humans, but increasingly also with chatbots. These interactions have different properties and flow differently depending on what goal the developer wants to achieve and whom they interact with. In this paper, we aim to understand the dynamics of the conversations that occur during modern software development after the integration of AI and chatbots, enabling a deeper recognition of the advantages and disadvantages of including chatbot interactions, in addition to human conversations, in collaborative work. We compile existing attributes of conversations with humans and with NLU-based chatbots and adapt them to the context of software development. Then, we extend the comparison to include LLM-powered chatbots based on an observational study. We present similarities and differences between human-to-human and human-to-bot conversations, also distinguishing between NLU- and LLM-based chatbots. Furthermore, we discuss how understanding the differences among conversation styles helps developers shape their expectations of a conversation and consequently supports communication within a software team. We conclude that the conversation styles we observe with LLM-based chatbots cannot replace conversations with humans, owing to certain social attributes, despite their ability to support productivity and decrease developers' mental load.

Article Search
Unveiling Assumptions: Exploring the Decisions of AI Chatbots and Human Testers
Francisco Gomes de Oliveira Neto
(Chalmers - University of Gothenburg, Sweden)
The integration of Large Language Models (LLMs) and chatbots introduces new challenges and opportunities for decision-making in software testing. Decision-making relies on a variety of information, including code, requirements specifications, and other software artifacts that are often unclear or exist solely in the developer's mind. To fill the gaps left by unclear information, we often rely on assumptions, intuition, or previous experience to make decisions. This paper explores the potential of LLM-based chatbots, such as Bard, Copilot, and ChatGPT, to support software testers in test decisions such as prioritizing test cases effectively. We investigate whether LLM-based chatbots and human testers share similar "assumptions" or intuition in prohibitive testing scenarios, where exhaustive execution of test cases is often impractical. Preliminary results from a survey of 127 testers indicate a preference for diverse test scenarios, with a significant majority (96%) favoring dissimilar test sets. Interestingly, two of four chatbots mirrored this preference, aligning with human intuition, while the others opted for similar test scenarios, chosen by only 3.9% of testers. Our initial insights suggest a promising avenue for enhancing the collaborative dynamics between testers and chatbots.

Article Search Artifacts Available
Green AI in Action: Strategic Model Selection for Ensembles in Production
Nienke Nijkamp, June Sallou, Niels van der Heijden, and Luís Cruz
(Delft University of Technology, Netherlands; University of Amsterdam, Netherlands)
Integrating Artificial Intelligence (AI) into software systems has significantly enhanced their capabilities while escalating energy demands. Ensemble learning, which combines predictions from multiple models to form a single prediction, intensifies this problem due to cumulative energy consumption. This paper presents a novel approach to model selection that addresses the challenge of balancing the accuracy of AI models with their energy consumption in a live AI ensemble system. We explore how reducing the number of models, or improving the efficiency of model usage within an ensemble during inference, can reduce energy demands without substantially sacrificing accuracy. This study introduces and evaluates two model selection strategies, Static and Dynamic, for optimizing the performance of ensemble learning systems while minimizing energy usage. Our results demonstrate that the Static strategy improves the F1 score beyond the baseline, reducing average energy usage from 100% (the full ensemble) to 62%. The Dynamic strategy further enhances F1 scores while using on average 76% of the full ensemble's energy. Moreover, we propose an approach that balances accuracy with resource consumption, significantly reducing energy usage without substantially impacting accuracy. This method decreased the average energy usage of the Static strategy from approximately 62% to 14%, and that of the Dynamic strategy from around 76% to 57%. Our field study of Green AI, using an operational AI system developed by a large professional services provider, shows the practical applicability of adopting energy-conscious model selection strategies in live production environments.
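Conceptually, a static strategy of this kind selects, offline, the sub-ensemble with the best accuracy that fits an energy budget. The brute-force toy below is only a sketch of that idea, not the paper's method; the F1 scores, energy figures, and the crude sub-ensemble scoring rule are invented.

    # Toy static model selection: best-scoring member subset under
    # an energy budget. Numbers and the scoring rule are invented.
    from itertools import combinations

    models = {"a": (0.78, 40.0), "b": (0.74, 25.0), "c": (0.71, 10.0)}  # F1, energy

    def subset_f1(names):
        return max(models[n][0] for n in names)   # stand-in for real evaluation

    def select(budget: float):
        affordable = [
            (subset_f1(c), c)
            for r in range(1, len(models) + 1)
            for c in combinations(models, r)
            if sum(models[n][1] for n in c) <= budget
        ]
        return max(affordable, default=None)

    print(select(budget=35.0))   # (0.74, ('b', 'c')) with these toy numbers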

Article Search
Measuring Impacts of Poisoning on Model Parameters and Embeddings for Large Language Models of Code
Aftab Hussain, Md Rafiqul Islam Rabin, and Amin Alipour
(University of Houston, USA)
Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen, particularly regarding hidden backdoors, also known as trojans. Backdoor attacks involve the insertion of triggers into training data, allowing attackers to maliciously manipulate the model's behavior. In this paper, we focus on analyzing model parameters to detect potential backdoor signals in code models. Specifically, we examine the attention weights and biases and the context embeddings of clean and poisoned CodeBERT and CodeT5 models. Our results suggest noticeable patterns in the context embeddings of poisoned samples for both poisoned models; however, the attention weights and biases show no significant differences. This work contributes to ongoing efforts in white-box detection of backdoor signals in LLMs of code through the analysis of parameters and embeddings.

Article Search
A Comparative Analysis of Large Language Models for Code Documentation Generation
Shubhang Shekhar Dvivedi, Vyshnav Vijay, Sai Leela Rahul Pujari, Shoumik Lodh, and Dhruv Kumar
(IIIT Delhi, India)
This paper presents a comprehensive comparative analysis of Large Language Models (LLMs) for the generation of code documentation. Code documentation is an essential part of the software writing process. The paper evaluates models such as GPT-3.5, GPT-4, Bard, Llama 2, and StarChat on parameters including Accuracy, Completeness, Relevance, Understandability, Readability, and Time Taken, for different levels of code documentation. Our evaluation employs a checklist-based system to minimize subjectivity, providing a more objective assessment. We find that, barring StarChat, all LLMs consistently outperform the original documentation. Notably, the closed-source models GPT-3.5, GPT-4, and Bard exhibit superior performance across various parameters compared to the open-source/source-available LLMs, namely Llama 2 and StarChat. Considering the time taken for generation, GPT-4 demonstrated the longest duration by a significant margin, followed by Llama 2 and Bard, with GPT-3.5 and StarChat having comparable generation times. Additionally, file-level documentation performed considerably worse across all parameters (except time taken) than inline and function-level documentation.

Preprint
An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping
Boming Xia, Qinghua Lu, Liming Zhu, and Zhenchang Xing
(CSIRO’s Data61, Australia; UNSW, Australia; Australian National University, Australia)
The advent of advanced AI underscores the urgent need for comprehensive safety evaluations, necessitating collaboration across communities (i.e., AI, software engineering, and governance). However, divergent practices and terminologies across these communities, combined with the complexity of AI systems (of which models are only a part) and environmental affordances (e.g., access to tools), obstruct effective communication and comprehensive evaluation. This paper proposes a framework for AI system evaluation comprising three components: 1) harmonised terminology to facilitate communication across communities involved in AI safety evaluation; 2) a taxonomy identifying essential elements for AI system evaluation; and 3) a mapping between the AI lifecycle, stakeholders, and requisite evaluations for an accountable AI supply chain. This framework catalyses a deeper discourse on AI system evaluation beyond model-centric approaches.

Article Search
Towards AI for Software Systems
Nafise Eskandani and Guido Salvaneschi
(ABB Corporate Research Center, Germany; University of St. Gallen, Switzerland)
Generative Artificial Intelligence (GenAI) is being adopted for a number of Software Engineering activities, mostly centering on coding, such as code generation, code comprehension, code review, test generation, and bug fixing. Other phases of the Software Engineering process have been less explored. In this paper, we argue that more investigation is needed into the support GenAI can provide for the design and operation of software systems, i.e., a number of crucial activities beyond coding that are necessary to successfully deliver and maintain software services. These include reasoning about architectural choices and dealing with third-party platforms. We discuss crucial aspects of AI for software systems, taking Function as a Service (FaaS) as a use case. We present several challenges, including cold-start delays, stateless functions, debugging complexities, and vendor lock-in, and explore the potential of GenAI tools to mitigate them. Finally, we outline future research into the application of GenAI tools for the development and deployment of software systems.

Article Search
AI-Assisted Assessment of Coding Practices in Modern Code Review
Manushree Vijayvergiya, Małgorzata Salawa, Ivan Budiselic, Dan Zheng, Pascal Lamblin, Marko Ivankovic, Juanjo Carin, Mateusz Lewko, Jovan Andonov, Goran Petrovic, Daniel Tarlow, Petros Maniatis, and René Just
(Google, Switzerland; Google DeepMind, USA; Google DeepMind, Canada; Google, USA; University of Washington, USA)
Modern code review is a process in which an incremental code contribution made by a code author is reviewed by one or more peers before it is committed to the version control system. An important element of modern code review is verifying that code contributions adhere to best practices. While some of these best practices can be automatically verified, verifying others is commonly left to human reviewers. This paper reports on the development, deployment, and evaluation of AutoCommenter, a system backed by a large language model that automatically learns and enforces coding best practices. We implemented AutoCommenter for four programming languages (C++, Java, Python, and Go) and evaluated its performance and adoption in a large industrial setting. Our evaluation shows that an end-to-end system for learning and enforcing coding best practices is feasible and has a positive impact on the developer workflow. Additionally, this paper reports on the challenges associated with deploying such a system to tens of thousands of developers and the corresponding lessons learned.

Article Search
Leveraging Machine Learning for Optimal Object-Relational Database Mapping in Software Systems
Sasan Azizian, Elham Rastegari, and Hamid Bagheri
(University of Nebraska-Lincoln, USA; Creighton University, USA)
Modern software systems, developed using object-oriented programming languages (OOPL), often rely on relational databases (RDB) for persistent storage, leading to the object-relational impedance mismatch problem (IMP). Although Object-Relational Mapping (ORM) tools like Hibernate and Django provide a layer of indirection, designing efficient application-specific data mappings remains challenging and error-prone. The selection of mapping strategies significantly influences data storage and retrieval performance, necessitating a thorough understanding of the paradigms and systematic tradeoff exploration. State-of-the-art systematic exploration of the design tradeoff space faces scalability issues, especially in large systems. This paper introduces a novel methodology, dubbed Leant, for learning-based analysis of tradeoffs, leveraging machine learning to derive domain knowledge autonomously and thus aid the effective mapping of object models to relational schemas. Our preliminary results indicate a reduction in the time and cost overheads associated with developing (Pareto-)optimal object-relational database schemas, showcasing Leant's potential in addressing the challenges of object-relational impedance mismatch and advancing object-relational mapping optimization and database design.

Article Search
A Case Study of LLM for Automated Vulnerability Repair: Assessing Impact of Reasoning and Patch Validation Feedback
Ummay Kulsum, Haotian Zhu, Bowen Xu, and Marcelo d'Amorim
(North Carolina State University, USA; Singapore Management University, Singapore)
Recent work in automated program repair (APR) proposes the use of reasoning and patch validation feedback to reduce the semantic gap between LLMs and the code under analysis. The idea has been shown to perform well for general APR, but its effectiveness in more specific contexts remains underexplored. In this work, we assess the impact of reasoning and patch validation feedback on LLMs in the context of vulnerability repair, an important and challenging task in security. To support the evaluation, we present VRpilot, an LLM-based vulnerability repair technique based on reasoning and patch validation feedback. VRpilot (1) uses a chain-of-thought prompt to reason about a vulnerability prior to generating patch candidates and (2) iteratively refines prompts according to the output of external tools (e.g., compiler, code sanitizers, test suite) on previously generated patches. To evaluate performance, we compare VRpilot against state-of-the-art vulnerability repair techniques for C and Java using public datasets from the literature. Our results show that VRpilot generates, on average, 14% and 7.6% more correct patches than the baseline techniques on C and Java, respectively. We show, through an ablation study, that reasoning and patch validation feedback are critical. We report several lessons from this study and potential directions for advancing LLM-empowered vulnerability repair.
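The iterate-with-tool-feedback loop described here follows a common pattern, sketched schematically below; ask_llm and validate are placeholders, not VRpilot's actual components.

    # Schematic reason-then-validate repair loop: generate a patch with a
    # chain-of-thought prompt, then refine it using external-tool feedback.
    def ask_llm(prompt: str) -> str:
        return "candidate patch"        # placeholder for a real model call

    def validate(patch: str) -> tuple[bool, str]:
        return True, ""                 # placeholder: compiler, sanitizers, tests

    def repair(vulnerable_code: str, max_rounds: int = 5) -> str | None:
        prompt = ("Reason step by step about the vulnerability, "
                  f"then propose a fixed version:\n{vulnerable_code}")
        for _ in range(max_rounds):
            patch = ask_llm(prompt)
            ok, feedback = validate(patch)
            if ok:
                return patch
            prompt = (f"The previous patch failed validation with:\n{feedback}\n"
                      f"Revise this patch:\n{patch}")
        return None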

Article Search
SolMover: Smart Contract Code Translation Based on Concepts
Rabimba Karanjai, Lei Xu, and Weidong Shi
(University of Houston, USA; Kent State University, USA)
Large language models (LLMs) have showcased remarkable skills, rivaling or even exceeding human intelligence in certain areas. Their proficiency in translation is notable, as they may replicate the nuanced, preparatory steps of human translators to achieve high-quality outcomes. Although there has been some notable work on using LLMs for code-to-code translation, none addresses smart contracts, especially when the target language is unseen by the LLM. In this work, we introduce SolMover, a novel framework in which two different LLMs work in tandem to understand coding concepts and then use that understanding to translate code into an unseen language. We explore the human-like learning capability of LLMs through a detailed evaluation of the methodology on translating existing smart contracts written in Solidity into a low-resource language called Move. Specifically, we enable one LLM to distill the coding rules of the new language into a plan for the second LLM, which lacks planning capability but can code, to follow. Experiments show that SolMover brings a significant improvement over gpt-3.5-turbo-1106 and outperforms both Palm2 and Mixtral-8x7B-Instruct. Our further analysis shows that employing our bug mitigation technique, even without the framework, still improves code quality for all models.

Article Search
Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs
Sylvain Kouemo Ngassom, Arghavan Moradi Dakhel, Florian Tambon, and Foutse Khomh
(Polytechnique Montréal, Canada)
LLM-based assistants, such as GitHub Copilot and ChatGPT, can generate code that fulfills a programming task described in a natural language description, referred to as a prompt. The widespread accessibility of these assistants enables users with diverse backgrounds to generate code and integrate it into software projects. However, studies show that code generated by LLMs is prone to bugs and may miss various corner cases in task specifications. Presenting such buggy code to users can undermine their trust in LLM-based assistants. Moreover, significant effort is required from the user to detect and repair any bug present in the code, especially if no test cases are available. In this study, we propose a self-refinement method aimed at improving the reliability of code generated by LLMs by minimizing the number of bugs before execution, without human intervention, and in the absence of test cases. Our approach is based on targeted Verification Questions (VQs) that identify potential bugs within the initial code. These VQs target nodes within the Abstract Syntax Tree (AST) of the initial code that have the potential to trigger specific types of bug patterns commonly found in LLM-generated code. Finally, our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code. Our evaluation, based on programming tasks in the CoderEval dataset, demonstrates that our proposed method outperforms state-of-the-art methods, decreasing the number of targeted errors in the code by 21% to 62% and increasing the number of executable code instances by 13%.
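To make the notion of VQs anchored to AST nodes concrete, here is a toy Python version. The node-to-question templates are invented; the paper's bug patterns and targeted node types may differ.

    # Toy targeted-VQ generator: walk the AST and attach a verification
    # question to node types prone to common bug patterns. Templates invented.
    import ast

    TEMPLATES = {
        ast.Subscript: "Could this index or key access fail?",
        ast.Attribute: "Could this attribute be accessed on None?",
        ast.BinOp: "Could this arithmetic divide by zero or overflow?",
    }

    def verification_questions(code: str) -> list[str]:
        questions = []
        for node in ast.walk(ast.parse(code)):
            question = TEMPLATES.get(type(node))
            if question:
                questions.append(f"Line {node.lineno}: {question}")
        return questions

    print(verification_questions("def f(xs, i):\n    return xs[i] / len(xs)"))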

Article Search
The Role of Generative AI in Software Development Productivity: A Pilot Case Study
Mariana Coutinho, Lorena Marques, Anderson Santos, Marcio Dahia, Cesar França, and Ronnie de Souza Santos
(CESAR School, Brazil; University of Calgary, Canada)
With software development increasingly reliant on innovative technologies, there is growing interest in exploring the potential of generative AI tools to streamline processes and enhance productivity. In this scenario, this paper investigates the integration of generative AI tools into software development, focusing on understanding their uses, benefits, and challenges for software professionals, looking in particular at aspects of productivity. Through a pilot case study involving software practitioners working in different roles, we gathered valuable experiences of integrating generative AI tools into their daily work routines. Our findings reveal a generally positive perception of these tools' effect on individual productivity, while also highlighting the need to address identified limitations. Overall, our research sets the stage for further exploration into the evolving landscape of software development practices with the integration of generative AI tools.

Article Search
The Art of Programming: Challenges in Generating Code for Creative Applications
Michael Cook
(King’s College London, United Kingdom)
Programming has been a key tool for artists and other creatives for decades, and the creative use of programming presents unique challenges, opportunities, and perspectives for researchers considering how AI can be used to support coding more generally. In this paper we aim to motivate researchers to look deeper into some of these areas by highlighting some interesting uses of programming in creative practices, suggesting new research questions posed by these spaces, and briefly raising important issues that work in this area may face.

Article Search
Automatic Programming vs. Artificial Intelligence
James Noble
(Creative Research & Programming, New Zealand; Australian National University, Australia)
Ever since we began programming in the 1950s, there have been two diametrically opposed tendencies within computer science and software engineering: on the left side of the Glorious Throne of Alan Turing, the tendency to perfect the Art of Computer Programming, and on the right side, the tendency to end it. These tendencies can be seen from the Manchester Mark I's "autocode" removing the need for programmers shortly after WW2; COBOL being a language that could be "read by the management"; to contemporary "no-code" development environments; and the idea that large language models herald "The End of Programming". This vision paper looks at what AI will not change about software systems, and the people who must use them, and necessarily must build them. Rather than neglecting 50 years of history, theory, and practice, and assuming programming can, will, and should be ended by AI, we speculate on how AI has, already does, and will continue to perfect one of the peak activities of being human: programming.

Article Search
Neuro-Symbolic Approach to Certified Scientific Software Synthesis
Hamid Bagheri, Mehdi Mirakhorli, Mohamad Fazelnia, Ibrahim Mujhid, and Md Rashedul Hasan
(University of Nebraska-Lincoln, USA; University of Hawaii at Manoa, USA)
Scientific software development demands robust solutions to meet the complexities of modern scientific systems. In response, we propose a paradigm-shifting Neuro-Symbolic Approach to Certified Scientific Software Synthesis. This innovative framework integrates large language models (LLMs) with formal methods, facilitating automated synthesis of complex scientific software while ensuring verifiability and correctness. Through a combination of technologies including a Scientific and Satisfiability-Aided Large Language Model (SaSLLM), a Scientific Domain Specific Language (DSL), and Generalized Planning for Abstract Reasoning, our approach transforms scientific concepts into certified software solutions. By leveraging advanced reasoning techniques, our framework streamlines the development process, allowing scientists to focus on design and exploration. This approach represents a significant step towards automated, certified-by-design scientific software synthesis, revolutionizing the landscape of scientific research and discovery.

Article Search
Effectiveness of ChatGPT for Static Analysis: How Far Are We?
Mohammad Mahdi Mohajer, Reem Aleithan, Nima Shiri Harzevili, Moshi Wei, Alvine Boaye Belle, Hung Viet Pham, and Song Wang
(York University, Canada)
This paper presents a novel study exploring the capabilities of ChatGPT, a state-of-the-art LLM, in static analysis tasks such as static bug detection and false-positive warning removal. In our evaluation, we focused on two typical and critical bug types targeted by static bug detection, i.e., Null Dereference and Resource Leak, as our subjects. We employed Infer, a well-established static analyzer, to aid in gathering these two bug types from 10 open-source projects. Consequently, our experiment dataset contains 222 instances of Null Dereference bugs and 46 instances of Resource Leak bugs. Our study demonstrates that ChatGPT can achieve remarkable performance in the mentioned static analysis tasks, including bug detection and false-positive warning removal. In static bug detection, ChatGPT achieves accuracy and precision values of up to 68.37% and 63.76% for detecting Null Dereference bugs and 76.95% and 82.73% for detecting Resource Leak bugs, improving on the precision of the current leading bug detector, Infer, by 12.86% and 43.13%, respectively. For removing false-positive warnings, ChatGPT can reach a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs, surpassing existing state-of-the-art false-positive warning removal tools.

Article Search Artifacts Available
RUBICON: Rubric-Based Evaluation of Domain-Specific Human AI Conversations
Param Biyani, Yasharth Bajpai, Arjun Radhakrishna, Gustavo Soares, and Sumit Gulwani
(Microsoft, USA)
Evaluating conversational assistants, such as GitHub Copilot Chat, poses a significant challenge for tool builders in the domain of Software Engineering. These assistants rely on language models and chat-based user experiences, rendering their evaluation with respect to the quality of Human-AI conversations complicated. Existing general-purpose metrics for measuring conversational quality found in the literature are inadequate for appraising domain-specific dialogues due to their lack of contextual sensitivity. In this paper, we present RUBICON, a technique for evaluating domain-specific Human-AI conversations. RUBICON leverages large language models to generate candidate rubrics for assessing conversation quality and employs a selection process to choose a subset of rubrics based on their performance in scoring conversations. In our experiments, RUBICON effectively learns to differentiate conversation quality, achieving higher accuracy and yield rates than existing baselines.

Article Search Info

Challenge Track

Investigating the Potential of Using Large Language Models for Scheduling
Deddy Jobson and Li Yilin
(Mercari, Japan)
The inaugural ACM International Conference on AI-powered Software introduced the AIware Challenge, prompting researchers to explore AI-driven tools for optimizing conference programs through constrained optimization. We investigate the use of Large Language Models (LLMs) for program scheduling, focusing on zero-shot learning and integer programming to measure paper similarity. Our study reveals that LLMs, even in zero-shot settings, create reasonably good first drafts of conference schedules. When clustering papers, using only titles as LLM inputs produces results closer to human categorization than using titles and abstracts with TF-IDF. The code can be found here.
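The TF-IDF baseline mentioned here is straightforward to reproduce in outline; the sketch below uses invented titles and shows only the clustering baseline, not the LLM or integer-programming components.

    # Sketch of a TF-IDF + k-means baseline for grouping paper titles
    # into sessions. Titles are invented examples.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    titles = [
        "LLM-based program repair", "Neural bug fixing",
        "Energy-aware model serving", "Green AI in production",
    ]
    X = TfidfVectorizer().fit_transform(titles)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(dict(zip(titles, labels)))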

Article Search
Automated Scheduling for Thematic Coherence in Conferences
Mahzabeen Emu, Tasnim Ahmed, and Salimur Choudhury
(Queen’s University, Canada)
This study presents a novel Artificial Intelligence (AI)-based approach to automating conference session scheduling that integrates Natural Language Processing (NLP) techniques. By analyzing contextual information from conference activities, the framework uses a Constraint Satisfaction Problem (CSP) model to optimize the schedule. Using data from the MSR conference, and employing GPT-3.5 and SciBERT, the framework generates weighted contextual similarities and applies hard constraints to ensure scheduling feasibility and soft constraints to improve thematic coherence. Evaluations demonstrate its ability to surpass manual scheduling.

Article Search
Conference Program Scheduling using Genetic Algorithms
Rucha Deshpande, Aishwarya Devi Akila Pandian, and Vigneshwaran Dharmalingam
(Purdue University, USA)
Program creation is the process of allocating presentation slots for each paper accepted to a conference with parallel sessions. This process, generally done manually, involves intricate decision-making to satisfy multiple constraints and optimize several goals: the total time for all presentations in a session cannot exceed the length of the session, the total number of sessions must equal the number provided by the Program Committee chairs, and, if there are parallel tracks, no two papers with common authors can be scheduled at the same time. We propose a Conference Program Scheduler using Genetic Algorithms to automate the conference schedule creation process while addressing these constraints and maximizing session theme coherence. Our evaluation focuses on the algorithm's ability to meet the outlined constraints and its success in creating thematically coherent and time-efficient sessions.
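For a flavor of the GA formulation, the sketch below shows a toy fitness function that penalizes session overflow and rewards theme coherence, plus a swap mutation. The data shapes, weights, and coherence measure are invented, not the authors'.

    # Toy GA ingredients for conference scheduling: a fitness function
    # that penalizes session overflow and rewards theme coherence, plus
    # a swap mutation. All weights and data are invented.
    import random

    def fitness(schedule, papers, session_len=90):
        score = 0.0
        for session in schedule:                      # a session is a list of paper ids
            total = sum(papers[p]["minutes"] for p in session)
            if total > session_len:
                score -= 100 * (total - session_len)  # hard constraint as heavy penalty
            themes = [papers[p]["theme"] for p in session]
            score += max(themes.count(t) for t in set(themes))  # soft: coherence
        return score

    def mutate(schedule):
        a, b = random.sample(range(len(schedule)), 2) # swap a paper between two sessions
        i = random.randrange(len(schedule[a]))
        j = random.randrange(len(schedule[b]))
        schedule[a][i], schedule[b][j] = schedule[b][j], schedule[a][i]
        return schedule

    papers = {1: {"minutes": 30, "theme": "LLM"}, 2: {"minutes": 30, "theme": "LLM"},
              3: {"minutes": 45, "theme": "testing"}, 4: {"minutes": 45, "theme": "testing"}}
    print(fitness([[1, 2], [3, 4]], papers))          # 4.0: feasible and coherent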

Article Search
