SANER 2020 – Proceedings

Message from the General Chairs and Program Co-Chairs
On behalf of the entire conference committee, it is our great pleasure to welcome you to London, Ontario, Canada for SANER 2020, the 27th IEEE International Conference on Software Analysis, Evolution and Reengineering. SANER is the premier research conference on the theory and practice of recovering information from existing software and systems. It explores innovative methods of extracting the many kinds of information that can be recovered from software, software engineering documents, and systems artifacts, and examines innovative ways of using this information in system renovation and program understanding.

Main Research

Referee: A Pattern-Guided Approach for Auto Design in Compiler-Based Analyzers
Fang Lv, Hao Li, Lei Wang, Ying Liu, Huimin Cui, Jingling Xue, and Xiaobing Feng
(Institute of Computing Technology at Chinese Academy of Sciences, China; UNSW, Australia)
Coding rules are becoming more critical for security-oriented software, which prefers compilers as its base platform because it demands not only mature grammatical analysis but also compilation and optimization techniques. However, engineering such a compiler-based analyzer, which involves exploring proper launch points before integrating hundreds of rules one by one in the compiler frontend, is a completely manual decision-making process that incurs heavy redundant effort. To improve this, we introduce a novel pattern-guided approach, named Referee, to facilitate this process. Referee improves the manual approach significantly by making three advances: (1) our pattern-guided approach can significantly reduce the amount of redundant manual effort required, (2) a twin-graph aided broadcasting process is developed to enable rule patterns to be characterized with partially developed rules, and (3) a reliable recommendation mechanism is used to pinpoint the launch point for a new rule based on the accumulated experience from handling earlier rules. We have implemented Referee in GCC 8.2 with 163 rules from SPACE-C and MISRA-C standards. Referee achieves an accuracy of 89.9% on recommendation of launch points for new rules to our GCC-based analyzer automatically when trained with 70% of all the rules. Decreasing the training data size to 60% and 50% still yields an accuracy of 87.7% and 81.5%, respectively. Therefore, Referee can significantly reduce the amount of manual effort that would otherwise be required, with a careful selection of seeding rule patterns, providing an interesting and fruitful avenue for further research.

Web APIs in Android through the Lens of Security
Pascal Gadient, Mohammad Ghafari, Marc-Andrea Tarnutzer, and Oscar Nierstrasz
(University of Bern, Switzerland)
Web communication has become an indispensable characteristic of mobile apps. However, it is not clear what data the apps transmit, to whom, and what consequences such transmissions have. We analyzed the web communications found in mobile apps from the perspective of security. We first manually studied 160 Android apps to identify the commonly used communication libraries, and to understand how they are used in these apps. We then developed a tool to statically identify web API URLs used in the apps, and restore the JSON data schemas including the type and value of each parameter. We extracted 9,714 distinct web API URLs that were used in 3,376 apps. We found that developers often use the java.net package for network communication; however, third-party libraries like OkHttp are also used in many apps. We discovered that insecure HTTP connections are seven times more prevalent in closed-source than in open-source apps, and that embedded SQL and JavaScript code is used in web communication in more than 500 different apps. This finding is devastating; it leaves billions of users and API service providers vulnerable to attack.

SMARTSHIELD: Automatic Smart Contract Protection Made Easy
Yuyao Zhang, Siqi Ma, Juanru Li, Kailai Li, Surya Nepal, and Dawu Gu
(Shanghai Jiao Tong University, China; Data61 at CSIRO, Australia)
The immutable nature of blockchain means that traditional security response mechanisms (e.g., code patching) must change to remedy insecure smart contracts. The only proper way to protect a smart contract is to fix potential risks in its code before it is deployed to the blockchain. However, existing tools for smart contract security analysis focus on the detection of bugs but seldom consider how to fix the code. Meanwhile, it is often time-consuming and error-prone for a developer to understand and fix flawed code manually. In this paper we propose SMARTSHIELD, a bytecode rectification system, to fix three typical security-related bugs (i.e., state changes after external calls, missing checks for out-of-bound arithmetic operations, and missing checks for failing external calls) in smart contracts automatically and help developers release secure contracts. Moreover, SMARTSHIELD guarantees that the rectified contract is not only immune to certain attacks but also gas-friendly (i.e., incurs only a slight increase in gas cost). To evaluate the effectiveness and efficiency of SMARTSHIELD, we applied it to 28,261 real-world buggy contracts on the Ethereum blockchain (as of January 2nd, 2019). Experimental results demonstrated that among 95,502 insecure cases in those contracts, 87,346 (91.5%) were automatically fixed by SMARTSHIELD. A follow-up test with both program analysis and real-world exploits further confirmed that the rectified contracts were secure against common attacks. Moreover, the rectification only introduced a 0.2% gas increment for each contract on average.

Automatically Extracting Subroutine Summary Descriptions from Unstructured Comments
Zachary Eberhart, Alexander LeClair, and Collin McMillan
(University of Notre Dame, USA)
Summary descriptions of subroutines are short (usually one-sentence) natural language explanations of a subroutine's behavior and purpose in a program. These summaries are ubiquitous in documentation, and many tools such as JavaDocs and Doxygen generate documentation built around them. And yet, extracting summaries from unstructured source code repositories remains a difficult research problem -- it is very difficult to generate clean structured documentation unless the summaries are annotated by programmers. This becomes a problem in large repositories of legacy code, since it is cost prohibitive to retroactively annotate summaries in dozens or hundreds of old programs. Likewise, it is a problem for creators of automatic documentation generation algorithms, since these algorithms usually must learn from large annotated datasets, which do not exist for many programming languages. In this paper, we present a semi-automated approach via crowdsourcing and a fully-automated approach for annotating summaries from unstructured code comments. We present experiments validating the approaches, and provide recommendations and cost estimates for automatically annotating large repositories.

Resource Race Attacks on Android
Yan Cai, Yutian Tang, Haicheng Li, Le Yu, Hao Zhou, Xiapu Luo, Liang He, and Purui Su
(Institute of Software at Chinese Academy of Sciences, China; Hong Kong Polytechnic University, China; University of Chinese Academy of Sciences, China; Peng Cheng Laboratory, China)
Smartphones are frequently involved in accessing private user data. Although many studies have been done to prevent malicious apps from leaking private user data, only a few recent works examine how to remove the sensitive information from the data collected by smartphone hardware resources (e.g., camera). Unfortunately, none of them investigates whether a malicious app can obtain such sensitive information when (or right before/after) a legitimate app collects such data (e.g., taking photos). To fill this gap, in this paper, we model such attacks as the Resource Race Attack (RRAttack) based on races between two apps during their requests to exclusive resources to access sensitive information. RRAttacks fall into three categories according to when a race on requesting resources occurs: Pre-Use, In-Use, and Post-Use attacks. We further conduct the first systematic study on the feasibility of launching RRAttacks on two heavily used exclusive Android resources: camera and touchscreen. In detail, we perform Proof-of-Concept (PoC) attacks to reveal that (a) the camera is highly vulnerable to both In-Use and Post-Use attacks, and (b) the touchscreen is vulnerable to Pre-Use attacks. In particular, we demonstrate successful RRAttacks on them to steal private information, to cause financial loss, and to steal user passwords from Android 6 to the latest Android Q. Moreover, our analyses of 1,000 apps indicate that most of them are vulnerable to one to three RRAttacks. Finally, we propose a set of defense strategies against RRAttacks for user apps, system apps, and the Android system itself.

We Are Family: Analyzing Communication in GitHub Software Repositories and Their Forks
Scott Brisson, Ehsan Noei, and Kelly Lyons
(University of Toronto, Canada)
GitHub facilitates software development practices that encourage collaboration and communication. Part of GitHub’s model includes forking, which enables users to make changes on a copy of the base repository. The process of forking opens avenues of communication between the users from the base repository and the users from the forked repositories. Since forking on GitHub is a common mechanism for initiating repositories, we are interested in how communication between a repository and its forks (forming a software family) relates to stars. In this paper, we study communications within 385 software families comprised of 13,431 software repositories. We find that the fork depth, the number of users who have contributed to multiple repositories in the same family, the number of followers from outside the family, familial pull requests, and reported issues share a statistically significant relationship with repository stars. Due to the importance of issues and pull requests, we identify and compare common topics in issues and pull requests from inside the repository (via branching) and within the family (via forking). Our results offer insights into the importance of communication within a software family, and how this leads to higher individual repository star counts.

Exploring Type Inference Techniques of Dynamically Typed Languages
C. M. Khaled Saifullah, Muhammad Asaduzzaman, and Chanchal K. Roy
(University of Saskatchewan, Canada; Queen's University, Canada)
Developers often prefer dynamically typed programming languages, such as JavaScript, because such languages do not require explicit type declarations. However, such a feature hinders software engineering tasks, such as code completion, type related bug fixes and so on. Deep learning-based techniques are proposed in the literature to infer the types of code elements in JavaScript snippets. These techniques are computationally expensive. While several type inference techniques have been developed to detect types in code snippets written in statically typed languages, it is not clear how effective those techniques are for inferring types in dynamically typed languages, such as JavaScript. In this paper, we investigate the type inference techniques of JavaScript to understand the above two issues further. While doing that we propose a new technique that considers the locally specific code tokens as the context to infer the types of code elements. The evaluation result shows that the proposed technique is 20-47% more accurate than the statically typed language-based techniques and 5-14 times faster than the deep learning techniques without sacrificing accuracy. Our analysis of sensitivity, overlapping of predicted types and the number of training examples justify the importance of our technique.

How Do Python Framework APIs Evolve? An Exploratory Study
Zhaoxu Zhang, Hengcheng Zhu, Ming Wen, Yida Tao, Yepang Liu, and Yingfei Xiong
(Southern University of Science and Technology, China; Huazhong University of Science and Technology, China; Shenzhen University, China; Peking University, China)
Python is a popular dynamic programming language. In recent years, many frameworks implemented in Python have been widely used for data science and web development. Similar to frameworks in other languages, the APIs provided by Python frameworks often evolve, which inevitably induces compatibility issues in client applications. While existing work has studied the evolution of frameworks in static programming languages such as Java, little is known about how Python framework APIs evolve and the characteristics of the compatibility issues induced by such evolution. To bridge this gap, we take a first look at the evolution of Python framework APIs and the resulting compatibility issues in client applications. We analyzed 288 releases of six popular Python frameworks from three different domains and 5,538 open-source projects built on these frameworks. We investigated the evolution patterns of Python framework APIs and found that they largely differ from those of Java framework APIs. We also investigated the compatibility issues in client applications and identified common strategies that developers adopt to fix these issues. Based on the empirical findings, we designed and implemented a tool, PyCompat, to automatically detect compatibility issues caused by misusing evolved framework APIs in Python applications. Experiments on 10 real-world projects show that our tool can effectively detect compatibility issues of concern to developers.

Associating Code Clones with Association Rules for Change Impact Analysis
Manishankar Mondal, Banani Roy, Chanchal K. Roy, and Kevin A. Schneider
(University of Saskatchewan, Canada)
When a programmer makes changes to a target program entity (files, classes, methods), it is important to identify which other entities might also get impacted. These entities constitute the impact set for the target entity. Association rules have been widely used for discovering the impact sets. However, such rules only depend on the previous co-change history of the program entities ignoring the fact that similar entities might often need to be updated together consistently even if they did not co-change before. Considering this fact, we investigate whether cloning relationships among program entities can be associated with association rules to help us better identify the impact sets.
In our research, we particularly investigate whether the impact set detection capability of a clone detector can be utilized to enhance the capability of the state-of-the-art association rule mining technique, Tarmaq, in discovering impact sets. We use the well-known clone detector NiCad in our investigation and consider both regular and micro-clones. Our evolutionary analysis of thousands of commit operations of eight diverse subject systems reveals that consideration of code clones can enhance the impact set detection accuracy of Tarmaq with significantly higher precision and recall. Micro-clones of 3 LOC and 4 LOC and regular code clones of 5 LOC to 20 LOC contribute the most towards enhancing the detection accuracy.

LibDX: A Cross-Platform and Accurate System to Detect Third-Party Libraries in Binary Code
Wei Tang, Ping Luo, Jialiang Fu, and Dan Zhang
(Tsinghua University, China)
With the development of the open-source movement, third-party library reuse is commonly practiced in programming. Application developers can reuse the code to save time and development costs. However, there are some hidden risks in misusing third-party libraries, such as license violations and security vulnerabilities. The identification of libraries written in C or C++ is impeded by the compilation process, which hides most features of the code. The same open-source package can be compiled into different binary code by different compilation processes. Therefore, this paper proposes LibDX, a platform-independent and fully-automated system, to detect reused libraries in binary files. With a well-designed feature extractor, LibDX can overcome compilation diversity between binary files. LibDX introduces the novel concept of a logic feature block, which is applied to deal with the feature duplication challenge in a large-scale feature database. We built the largest test data set covering multiple platforms and evaluated LibDX with 9.5K packages including 25.8K C/C++ binary files. Our results show that LibDX achieves a precision of 92% and a recall of 97%, and outperforms state-of-the-art tools. We have validated the performance of the system with closed-source commercial applications and found some license violation cases.

EthPloit: From Fuzzing to Efficient Exploit Generation against Smart Contracts
Qingzhao Zhang, Yizhuo Wang, Juanru Li, and Siqi Ma
(Shanghai Jiao Tong University, China; Data61 at CSIRO, Australia)
Smart contracts, programs running on blockchain systems, underpin diverse decentralized applications (DApps). Unfortunately, well-known smart contract platforms, Ethereum for example, face serious security problems. Exploits against contracts may cause enormous financial losses, which emphasizes the importance of smart contract testing. However, current exploit generation tools have difficulty solving hard constraints in execution paths and cannot simulate blockchain behaviors very well. These problems cause a loss of coverage and accuracy in exploit generation.
To overcome these problems, we design and implement EthPloit, a smart contract exploit generator based on fuzzing. EthPloit adopts static taint analysis to generate exploit-targeted transaction sequences, a dynamic seed strategy to pass hard constraints, and an instrumented Ethereum Virtual Machine to simulate blockchain behaviors. We evaluated EthPloit on 45,308 smart contracts and discovered 554 exploitable contracts. EthPloit automatically generated 644 exploits without any false positives, and 306 of them cannot be generated by previous exploit generation tools.

Sequence Directed Hybrid Fuzzing
Hongliang Liang, Lin Jiang, Lu Ai, and Jinyi Wei
(Beijing University of Posts and Telecommunications, China)
Existing directed grey-box fuzzers are effective compared with coverage-based fuzzers. However, they fail to achieve a balance between effectiveness and efficiency, and it is difficult to cover complex paths due to random mutation. To mitigate the issue, we propose a novel approach, sequence directed hybrid fuzzing (SDHF), which leverages a sequence-directed strategy and concolic execution technique to enhance the effectiveness of fuzzing. Given a set of target statement sequences of a program, SDHF aims to generate inputs that can reach the statements in each sequence in order and trigger potential bugs in the program. We implement the proposed approach in a tool called Berry and evaluate its capability on crash reproduction, true positive verification, and vulnerability detection. Experimental results demonstrate that Berry outperforms four state-of-the-art fuzzers, including directed fuzzers BugRedux, AFLGo and Lolly, and undirected hybrid fuzzer QSYM. Moreover, Berry found 7 new vulnerabilities in real-world programs such as UPX and GNU Libextractor, and 3 new CVEs were assigned.

LESSQL: Dealing with Database Schema Changes in Continuous Deployment
Ariel Afonso, Altigran da Silva, Tayana Conte, Paulo Martins, João Cavalcanti, and Alessandro Garcia
(Federal University of Amazonas, Brazil; PUC-Rio, Brazil)
The adoption of Continuous Deployment (CD) aims at allowing software systems to quickly evolve to accommodate new features. However, structural changes to the database schema are frequent and may cause service downtime. Handling them requires the proper maintenance of both schema and source code, including rewrites of all outdated queries that use the same database. Previous solutions try to mitigate the burdensome task of manually rewriting outdated queries. Unfortunately, a software team must still interact with some tools to properly fix the affected queries. Moreover, the team still has to locate and modify all the impacted code, which are often error-prone tasks. Thus, a project may not experience CD benefits when changes impact various code regions. In this paper we present an alternative approach, called LESSQL, whose goal is to improve queries’ stability in the presence of structural schema changes over time. LESSQL supports queries that are less dependent on the database schema since they do not include the FROM clause. An underlying framework intercepts each LESSQL query and generates a corresponding SQL query for the current schema. It also locates the query attributes in the current schema and generates proper expressions to join the required tables. LESSQL supports unsupervised, supervised and hybrid configurations to process mappings of attributes to a newer schema version. We conducted experiments in the context of a popular open-source project, which experienced many diverse structural schema changes. Experimental outcomes indicate that our approach is effective in significantly reducing the modifications required for applying schema changes, allowing teams to better reap the benefits of CD. While supervised and hybrid configurations achieved a success rate higher than 95% with a minor query generation overhead, the unsupervised configuration was also successful for certain types of structural schema changes.
These results show that LESSQL effectively favours CD and keeps queries running after database schema changes without service interruption.

Cross-Dataset Design Discussion Mining
Alvi Mahadi, Karan Tongay, and Neil A. Ernst
(University of Victoria, Canada)
Being able to identify software discussions that are primarily about design, which we call design mining, can improve documentation and maintenance of software systems. Existing design mining approaches have good classification performance using natural language processing (NLP) techniques, but the conclusion stability of these approaches is generally poor. A classifier trained on a given dataset of software projects has so far not worked well on different artifacts or different datasets. In this study, we replicate and synthesize these earlier results in a meta-analysis. We then apply recent work in transfer learning for NLP to the problem of design mining. However, for our datasets, these deep transfer learning classifiers perform no better than less complex classifiers. We conclude by discussing some reasons behind the transfer learning approach to design mining.

C-3PR: A Bot for Fixing Static Analysis Violations via Pull Requests
Antônio Carvalho, Welder Luz, Diego Marcílio, Rodrigo Bonifácio, Gustavo Pinto, and Edna Dias Canedo
(University of Brasília, Brazil; USI Lugano, Switzerland; Federal University of Pará, Brazil)
Static analysis tools are frequently used to detect common programming mistakes or bad practices. Yet, the existing literature reports that these tools are still underused in the industry, which is partly due to (1) the frequently high number of false positives generated, (2) the lack of automated repairing solutions, and (3) the possible mismatches between tools and workflows of development teams. In this study we explored the question: "How could a bot-based approach allow seamless integration of static analysis tools into developers’ workflows?" To this end we introduce C-3PR, an event-based bot infrastructure that automatically proposes fixes to static analysis violations through pull requests (PRs). We have been using C-3PR in an industrial setting for a period of eight months. To evaluate C-3PR's usefulness, we monitored its operation in response to 2,179 commits to the code base of the tracked projects. The bot autonomously executed 201,346 analyses, yielding 610 pull requests. Among them, 346 (57%) were merged into the projects’ code bases. We observed that, on average, these PRs are evaluated faster than general-purpose PRs (2.58 and 5.78 business days, respectively). Accepted transformations take even less time (1.56 days). Among the reasons for rejection, bugs in C-3PR and in the tools it uses are the most common ones. PRs that require the resolution of a merge conflict are almost always rejected as well. We also conducted a focus group to assess how C-3PR affected the development workflow. We observed that developers perceived C-3PR as efficient, reliable, and useful. For instance, the participants mentioned that, given the chance, they would keep using C-3PR. Our findings bring new evidence that a bot-based infrastructure could mitigate some challenges that hinder the wide adoption of static analysis tools.

Automated Bug Detection and Replay for COTS Linux Kernel Modules with Concolic Execution
Bo Chen, Zhenkun Yang, Li Lei, Kai Cong, and Fei Xie
(Intel, USA; Portland State University, USA)
The Linux kernel is pervasive in the cloud, on mobile platforms, and on supercomputers. To support these diverse computing environments, the Linux kernel provides extensibility and modularity through Loadable Kernel Modules (LKMs), while featuring a monolithic architecture for execution efficiency. This architecture design brings a major challenge to the security of the Linux kernel. Because LKMs run in the same memory space as the base kernel on Ring 0, a single flaw in an LKM may compromise the entire system, e.g., by allowing an attacker to gain root access. However, validation and debugging of LKMs are inherently challenging, because of their special interfaces buried deeply in the kernel and the non-determinism from interrupts. Also, LKMs are shipped by various vendors and the public may not have access to their source code, making the validation even harder.
In this paper, we propose a framework for efficient bug detection and replay of commercial off-the-shelf (COTS) Linux kernel modules based on concolic execution. Our framework automatically generates compact sets of test cases for COTS LKMs, proactively checks for common kernel bugs, and allows reported bugs to be reproduced repeatedly with actionable test cases. We evaluate our approach on over 20 LKMs covering major modules from the network and sound subsystems of the Linux kernel. The results show that our approach can effectively detect various kernel bugs, and reports 5 new vulnerabilities including an unknown flaw that allows non-privileged users to trigger a kernel panic. By leveraging the replay capability of our framework, we patched all the reported bugs in the Linux kernel upstream, including 3 patches that were selected for the stable release of the Linux kernel and back-ported to numerous production kernel versions. We also compare our prototype with kAFL, the state-of-the-art kernel fuzzer, and demonstrate the effectiveness of concolic execution over fuzzing on the kernel level.

Ultra-Large-Scale Repository Analysis via Graph Compression
Paolo Boldi, Antoine Pietri, Sebastiano Vigna, and Stefano Zacchiroli
(University of Milan, Italy; Inria, France; University Paris Diderot, France)
We consider the problem of mining the development history, as captured by modern version control systems, of ultra-large-scale software archives (e.g., tens of millions of software repositories).
We show that graph compression techniques can be applied to the problem, dramatically reducing the hardware resources needed to mine a similarly sized corpus. As a concrete use case we compress the full Software Heritage archive, consisting of 5 billion unique source code files and 1 billion unique commits, harvested from more than 80 million software projects, encompassing a full mirror of GitHub.
The resulting compressed graph fits in less than 100 GB of RAM, corresponding to a hardware cost of less than 300 U.S. dollars. We show that the compressed in-memory representation of the full corpus can be accessed with excellent performance, with edge lookup times close to memory random access. As a sample exploitation experiment we show that the compressed graph can be used to conduct clone detection at this scale, benefiting from main memory access speed.

Studying Developer Reading Behavior on Stack Overflow during API Summarization Tasks
Jonathan A. Saddler, Cole S. Peterson, Sanjana Sama, Shruthi Nagaraj, Olga Baysal, Latifa Guerrouj, and Bonita Sharif
(University of Nebraska-Lincoln, USA; Youngstown State University, USA; Carleton University, Canada; ETS, Canada)
Stack Overflow is commonly used by software developers to help solve problems they face while working on software tasks such as fixing bugs or building new features. Recent research has explored how the content of Stack Overflow posts affects attraction and how the reputation of users attracts more visitors. However, there is very little evidence on the effect that visual attractors and content quantity have on directing gaze toward parts of a post, and which parts hold the attention of a user longer. Moreover, little is known about how these attractors help developers (students and professionals) answer comprehension questions. This paper presents an eye tracking study on thirty developers constrained to reading only Stack Overflow posts while summarizing four open source methods or classes. Results indicate that on average paragraphs and code snippets were fixated upon most often and longest. When ranking pages by the number of appearances of code blocks and paragraphs, we found that while the presence of more code blocks did not affect the number of fixations, the presence of increasing numbers of plain text paragraphs significantly drove down the fixations on comments. SO posts that were looked at only by students had longer fixation times on code elements within the first ten fixations. We found that 16 developer summaries contained 5 or more meaningful terms from SO posts they viewed. We discuss how our observations of reading behavior could benefit how users structure their posts.

On the Adoption of Kotlin on Android Development: A Triangulation Study
Victor Oliveira, Leopoldo Teixeira, and Felipe Ebert
(Federal University of Pernambuco, Brazil; Tempest Security Intelligence, Brazil)
In 2017, Google announced Kotlin as one of the officially supported languages for Android development. Among the reasons for choosing Kotlin, Google mentioned it is "concise, expressive, and designed to be type and null-safe". Another important reason is that Kotlin is a language fully interoperable with Java and runs on the JVM. Despite Kotlin's rapid rise in the industry, very little has been done in academia to understand how developers are dealing with the adoption of Kotlin. The goal of this study is to understand how developers are dealing with the recent adoption of Kotlin as an official language for Android development, their perception about the advantages and disadvantages related to its usage, and the most common problems faced by them. This research was conducted using the concurrent triangulation strategy, which is a mixed-method approach. We performed a thorough analysis of 9,405 questions related to Kotlin development for the Android platform on Stack Overflow. Concurrently, we also conducted basic qualitative research, interviewing seven Android developers who use Kotlin to confirm and cross-validate our findings. Our study reveals that developers do seem to find the language easy to understand and to adopt. This perception begins to change when the functional paradigm becomes more evident. According to the developers, readability and legibility are compromised if developers overuse the functional flexibility that the language provides. The developers also consider that Kotlin increases the quality of the produced code mainly due to its null-safety guarantees, but it can also become a challenge when interoperating with Java, despite the interoperability being considered an advantage.
While adopting Kotlin requires some care from developers, its benefits seem to bring many advantages to the platform according to the developers, especially in the aspect of adopting a more modern language while maintaining the consolidated Java-based development environment.

Energy Refactorings for Android in the Large and in the Wild
Marco Couto, João Saraiva, and João Paulo Fernandes
(University of Minho, Portugal; University of Coimbra, Portugal)
Improving the energy efficiency of mobile applications is a timely goal, as it can help increase the usage time of devices, which are most often powered by batteries. Recent studies have provided empirical evidence that refactoring energy-greedy code patterns can in fact reduce the energy consumed by an application. These studies, however, tested the impact of refactoring patterns individually, often locally (e.g., by measuring method-level gains) and using a small set of applications.
We studied the application-level impact of refactorings, comparing individual refactorings among themselves and against the combinations in which they appear. We use scenarios that simulate realistic application usage on a large-scale repository of Android applications. To fully automate the detection and refactoring procedure, as well as the execution of test cases, we developed a publicly available tool called Chimera.
Our findings include statistical evidence that i) individual refactorings produce consistent gains, but with different impacts, ii) combining as many refactorings as possible most often, but not always, increases energy savings when compared to individual refactorings, and iii) a few combinations are harmful to energy savings, as they can actually produce more losses than gains.
We prepared a set of guidelines for developers to follow, aiding them in deciding how to refactor and consistently reduce energy consumption.

Essential Sentences for Navigating Stack Overflow Answers
Sarah Nadi and Christoph Treude
(University of Alberta, Canada; University of Adelaide, Australia)
Stack Overflow (SO) has become an essential resource for software development. Despite its success and prevalence, navigating SO remains a challenge. Ideally, SO users could benefit from highlighted navigational cues that help them decide if an answer is relevant to their task and context. Such navigational cues could be in the form of essential sentences that help the searcher decide whether they want to read the answer or skip over it. In this paper, we compare four potential approaches for identifying essential sentences. We adopt two existing approaches and develop two new approaches based on the idea that contextual information in a sentence (e.g., "if using windows") could help identify essential sentences. We compare the four techniques using a survey of 43 participants. Our participants indicate that it is not always easy to figure out what the best solution for their specific problem is, given the options, and that they would indeed like to easily spot contextual information that may narrow down the search. Our quantitative comparison of the techniques shows that there is no single technique sufficient for identifying essential sentences that can serve as navigational cues, while our qualitative analysis shows that participants valued explanations and specific conditions, and did not value filler sentences or speculations. Our work sheds light on the importance of navigational cues, and our findings can be used to guide future research to find the best combination of techniques to identify such cues.

HistoRank: History-Based Ranking of Co-change Candidates
Manishankar Mondal, Banani Roy, Chanchal K. Roy, and Kevin A. Schneider
(University of Saskatchewan, Canada)
Evolutionary coupling is a well investigated phenomenon during software evolution and maintenance. If two or more program entities co-change (i.e., change together) frequently during evolution, it is expected that the entities are coupled. This type of coupling is called evolutionary coupling or change coupling in the literature. Evolutionary coupling is realized using association rules and two measures: support and confidence. Association rules have been extensively used for predicting co-change candidates for a target program entity (i.e., an entity that a programmer attempts to change). However, association rules often predict a large number of co-change candidates with many false positives. Thus, it is important to rank the predicted co-change candidates so that the true positives get higher priorities.
The predicted co-change candidates have always been ranked using the support and confidence measures of the association rules. In our research, we investigate five different ranking mechanisms on thousands of commits of ten diverse subject systems. On the basis of our findings, we propose a history-based ranking approach, HistoRank (History-based Ranking), that analyzes the previous ranking history to dynamically select the most appropriate one from those five ranking mechanisms for ranking co-change candidates of a target program entity. According to our experimental results, HistoRank outperforms each individual ranking mechanism with a significantly better MAP (mean average precision). We investigate different variants of HistoRank and find that the variant that emphasizes the ranking in the most recent occurrence of co-change in the history performs the best.
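
As an illustration of the baseline that the abstract starts from, the following minimal sketch ranks co-change candidates by the confidence and support of association rules. It is not HistoRank itself. The commit data, the function names `mine_rules` and `rank_candidates`, and the choice of support as a raw co-change count are assumptions made for this example.

```python
from collections import Counter
from itertools import permutations


def mine_rules(commits):
    """Return {(a, b): (support, confidence)} for every rule a -> b.

    Assumption: support is the number of commits in which a and b changed
    together; confidence is that count divided by the number of commits
    that changed a.
    """
    entity_count = Counter()
    pair_count = Counter()
    for changed in commits:
        entities = set(changed)
        entity_count.update(entities)
        # Ordered pairs, so both a -> b and b -> a are counted.
        pair_count.update(permutations(sorted(entities), 2))
    return {
        (a, b): (n, n / entity_count[a])
        for (a, b), n in pair_count.items()
    }


def rank_candidates(rules, target):
    """Rank co-change candidates of target by confidence, then support."""
    candidates = [(b, s, c) for (a, b), (s, c) in rules.items() if a == target]
    return sorted(candidates, key=lambda t: (-t[2], -t[1], t[0]))


# Hypothetical history: each inner list is the set of entities in one commit.
commits = [
    ["A", "B"],
    ["A", "B", "C"],
    ["A", "C"],
    ["B", "C"],
    ["A", "B"],
]
rules = mine_rules(commits)
ranking = rank_candidates(rules, "A")
print(ranking)
```

A history-based selector in the spirit of the abstract would keep several such ranking functions and choose among them per target entity; that selection step is not shown here.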

D-Goldilocks: Automatic Redistribution of Remote Functionalities for Performance and Efficiency
Kijin An and Eli Tilevich
(Virginia Tech, USA)
Distributed applications enhance their execution by using remote resources. However, distributed execution incurs communication, synchronization, fault-handling, and security overheads. If these overheads are not offset by a larger execution enhancement, distribution becomes counterproductive. For maximum benefits, the distribution's granularity cannot be too fine or too crude; it must be just right. In this paper, we present a novel approach to re-architecting distributed applications whose distribution granularity has turned out to be ill-conceived. To adjust the distribution of such applications, our approach automatically reshapes their remote invocations to reduce aggregate latency and resource consumption. To that end, our approach insources a remote functionality for local execution, splits it into separate functions to profile their performance, and determines the optimal redistribution based on a cost function. Redistribution strategies combine separate functions into single remotely invocable units. To automate all the required program transformations, our approach introduces a series of domain-specific automatic refactorings. We have concretely realized our approach as an analysis and automatic program transformation infrastructure for the important domain of full-stack JavaScript applications, and evaluated its value, utility, and performance on a series of real-world cross-platform mobile apps. Our evaluation results indicate that our approach can become a useful tool for software developers charged with the challenges of re-architecting distributed applications.

Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree
Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin
(Peking University, China; Monash University, Australia)
Code clones are pairs of semantically similar code fragments that may be syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches for detecting code clones have been proposed previously, but most of them focus on detecting syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection to automatically learn latent semantic features from data. In particular, to leverage grammar information, several approaches used abstract syntax trees (AST) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still cannot fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper, we build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. Then we apply two different types of graph neural networks (GNN) to FA-AST to measure the similarity of code pairs. To the best of our knowledge, we are the first to apply graph neural networks to the domain of code clone detection.
We apply our FA-AST and graph neural networks to two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.

SAGA: Efficient and Large-Scale Detection of Near-Miss Clones with GPU Acceleration
Guanhua Li, Yijian Wu, Chanchal K. Roy, Jun Sun, Xin Peng, Nanjie Zhan, Bin Hu, and Jingyi Ma
(Fudan University, China; Shanghai Key Laboratory of Data Science, China; University of Saskatchewan, Canada; Singapore Management University, Singapore)
Clone detection on large code repositories is necessary for many big code analysis tasks. The goal is to provide rich information on identical and similar code across projects. Detecting near-miss code clones on big code is challenging since it requires intensive computing and memory resources as the scale of the source code increases. In this work, we propose SAGA, an efficient suffix-array based code clone detection tool designed with sophisticated GPU optimization. SAGA not only detects Type-1 and Type-2 clones but also does so for cross-project large repositories and for the most computationally expensive Type-3 clones. Meanwhile, it also works at segment granularity, which is even more challenging. It detects code clones in 100 million lines of code within 11 minutes (with recall and precision comparable to state-of-the-art approaches), which is more than 10 times faster than state-of-the-art tools. It is the only tool that efficiently detects Type-3 near-miss clones at segment granularity in large code repositories (e.g., within 11 hours on 1 billion lines of code). We conduct a preliminary case study on 85,202 GitHub Java projects with 1 billion lines of code and exhibit the distribution of clones across projects. We find about 1.23 million Type-3 clone groups, containing 28 million lines of code at arbitrary segment granularity, which are only detectable with SAGA. We believe SAGA is useful in many software engineering applications such as code provenance analysis, code completion, change impact analysis, and many more.

CORE: Automating Review Recommendation for Code Changes
Jing Kai Siow, Cuiyun Gao, Lingling Fan, Sen Chen, and Yang Liu
(Nanyang Technological University, Singapore)
Code review is a common process that is used by developers, in which a reviewer provides useful comments or points out defects in the submitted source code changes via pull request. Code review has been widely used for both industry and open-source projects due to its capacity in early defect identification, project maintenance, and code improvement. With rapid updates on project developments, code review becomes a non-trivial and labor-intensive task for reviewers. Thus, an automated code review engine can be beneficial and useful for project development in practice. Although there exist prior studies on automating the code review process by adopting static analysis tools or deep learning techniques, they often require external sources such as partial or full source code for accurate review suggestion. In this paper, we aim at automating the code review process only based on code changes and the corresponding reviews but with better performance.
The key to accurate code review suggestion is to learn good representations for both code changes and reviews. To achieve this with limited sources, we design a multi-level embedding (i.e., word embedding and character embedding) approach to represent the semantics provided by code changes and reviews. The embeddings are then trained through a proposed attentional deep learning model; the approach as a whole is named CORE. We evaluate the effectiveness of CORE on code changes and reviews collected from 19 popular Java projects hosted on GitHub. Experimental results show that our model CORE can achieve significantly better performance than the state-of-the-art model (DeepMem), with an increase of 131.03% in terms of Recall@10 and 150.69% in terms of Mean Reciprocal Rank. Qualitative general word analysis among project developers also demonstrates the performance of CORE in automating code review.

Distinguishing Similar Design Pattern Instances through Temporal Behavior Analysis
Renhao Xiong, David Lo, and Bixin Li
(Southeast University, China; Singapore Management University, Singapore)
Design patterns (DPs) encapsulate valuable design knowledge of object-oriented systems. Detecting DP instances helps to reveal the underlying rationale, and thus facilitates the maintenance of legacy code. Because of the internal similarity of DPs, implementation variants, and missing roles, approaches based on static analysis are unable to reliably identify structurally similar instances. Existing approaches further employ dynamic techniques to test the runtime behaviors of candidate instances. Automatically verifying the runtime behaviors of DP instances is a challenging task in multiple aspects. This paper presents an approach to improve the verification process of existing approaches. To exercise the runtime behaviors of DP instances in cases where test cases of legacy systems are unavailable, we propose a markup language, TSML (Test Script Markup Language), to direct the generation of test cases by putting a DP instance into use. The execution of test cases is monitored based on a trace method that enables us to specify runtime events of interest using regular expressions. To characterize runtime behaviors, we introduce a modeling and specification method employing Allen’s interval-based temporal relations, which supports variant behaviors in a flexible way without hard-coded algorithms. A prototype tool has been implemented and evaluated on six open source systems to verify 466 instances reported by five existing approaches with respect to five DPs. The results show that the dynamic analysis increases the F1-score by 53.6% in distinguishing similar DP instances.
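
To make the reference to Allen's interval-based temporal relations concrete, here is a minimal sketch that classifies the relation between two time intervals. It is not the paper's specification method. The function name `allen_relation`, the tuple encoding of intervals, and the Observer-style event names are assumptions made for this example.

```python
def allen_relation(a, b):
    """Return the name of Allen's relation that holds between a and b.

    Each interval is a (start, end) tuple with start < end (assumed).
    Exactly one of the thirteen relations applies to any pair.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    if a_start < b_start:
        return "overlaps"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"
    return "overlapped-by"


# Hypothetical trace of an Observer-like instance: method name -> (enter, exit).
trace = {"attach": (0, 1), "notify": (2, 6), "update": (3, 5)}
print(allen_relation(trace["update"], trace["notify"]))
print(allen_relation(trace["attach"], trace["notify"]))
```

Under this reading, an expected behavior such as "every update call happens while notify is running" becomes a check that the relation between the two intervals is `during`.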

Relationship between the Effectiveness of Spectrum-Based Fault Localization and Bug-Fix Types in JavaScript Programs
Béla Vancsics, Attila Szatmári, and Árpád Beszédes
(University of Szeged, Hungary)
Spectrum-Based Fault Localization (SBFL) is a well-understood statistical approach to software fault localization, and there have been numerous studies performed that tackle its effectiveness. However, mostly Java and C/C++ programs have been addressed to date. We performed an empirical study on SBFL for JavaScript programs using a recent bug benchmark, BugsJS. In particular, we examined (1) how well some of the most popular SBFL algorithms, Tarantula, Ochiai and DStar, can predict the faulty source code elements in these JavaScript programs, (2) whether there is a significant difference between the effectiveness of the different SBFL algorithms, and (3) whether there is any relationship between the bug-fix types and the performance of SBFL methods. For the latter, we performed a manual classification of each benchmark bug according to an existing classification scheme. Results show that the performance of the SBFL algorithms is similar but there are some notable differences among them as well, and that certain bug-fix types can be significantly differentiated from the others (in both positive and negative direction) based on the fault localization effectiveness of the investigated algorithms.
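
For readers unfamiliar with the three algorithms named above, the following sketch computes their suspiciousness scores from a program spectrum, using the formulas as they are commonly stated in the SBFL literature. The parameter names (`ef`, `ep`, `nf`, `np_`), the zero-denominator conventions, and the sample counts are assumptions made for this example; the study's own implementation may differ.

```python
import math

# Spectrum counts for one code element (names are assumptions):
#   ef / ep  = failing / passing tests that execute the element
#   nf / np_ = failing / passing tests that do not execute it


def tarantula(ef, ep, nf, np_):
    fail_ratio = ef / (ef + nf) if ef + nf else 0.0
    pass_ratio = ep / (ep + np_) if ep + np_ else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total else 0.0


def ochiai(ef, ep, nf, np_):
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0


def dstar(ef, ep, nf, np_, star=2):
    # star=2 is the commonly used exponent; np_ is unused by this formula.
    denom = ep + nf
    if denom == 0:
        return math.inf if ef else 0.0
    return ef ** star / denom


# Hypothetical element: run by both failing tests and by one of four passing tests.
counts = dict(ef=2, ep=1, nf=0, np_=3)
for score in (tarantula, ochiai, dstar):
    print(score.__name__, score(**counts))
```

Elements are then ranked by descending score, and effectiveness is typically judged by how high the actual faulty element appears in that ranking.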

Incremental Map-Reduce on Repository History
Johannes Härtel and Ralf Lämmel
(University of Koblenz-Landau, Germany)
Work on Mining Software Repositories typically involves processing abstractions of resources on individual revisions. A corresponding processing of abstractions of resource changes often depends on working with all revisions of the repository history to guarantee a high resolution of the measured changes. Abstractions of resources and abstractions of resource changes are often so closely related that they can be used interchangeably in the processing. In practice, approaches working with abstractions processed over high revision counts face a scalability challenge. In this work, we address this challenge by incrementalizing the processing of repository resources and the corresponding abstractions. Our work is inspired by incrementalization theory including insights on Abelian groups, group homomorphisms and indexing. We provide a map-reduce interface that enables calls to foreign functionality and convenient operations for processing abstractions, such as mapping, filtering, group-wise aggregation and joining. Apache Spark is used for distribution. We compare the scalability of our approach with available MSR approaches, i.e., with LISA, which reduces redundancy, and with DJ-Rex, which migrates an analysis to a distributed map-reduce framework.
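
The role of Abelian groups in incrementalization can be illustrated with a small sketch: when an aggregate lives in a group (here, integer counts under addition), the contribution of removed or outdated resources can be subtracted and that of new ones added, without reprocessing the whole revision. This is only an illustration of the idea, not the paper's interface. The functions `measure` and `apply_delta`, the lines-per-extension metric, and the file data are assumptions made for this example.

```python
from collections import Counter


def measure(files):
    """Full aggregation: total line count per file extension."""
    total = Counter()
    for path, lines in files.items():
        total[path.rsplit(".", 1)[-1]] += lines
    return total


def apply_delta(previous, removed, added):
    """Incremental update of a previous aggregate.

    Because counts form an Abelian group under addition, the old
    contributions can be cancelled by subtraction before the new ones
    are added.
    """
    result = Counter(previous)
    result.subtract(measure(removed))
    result.update(measure(added))
    return +result  # unary plus drops entries whose count fell to zero


# Hypothetical revisions: a.py was edited, b.py deleted, d.js added.
rev1 = {"a.py": 10, "b.py": 5, "c.js": 7}
rev2 = {"a.py": 12, "c.js": 7, "d.js": 3}
removed = {"a.py": 10, "b.py": 5}  # old versions of changed or deleted files
added = {"a.py": 12, "d.js": 3}    # new versions of changed or added files

incremental = apply_delta(measure(rev1), removed, added)
print(incremental == measure(rev2))
```

The incremental path touches only the changed files, so its cost depends on the size of the change rather than on the size of the revision.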

How EvoStreets Are Observed in Three-Dimensional and Virtual Reality Environments
Marcel Steinbeck, Rainer Koschke, and Marc O. Rüdel
(University of Bremen, Germany)
When analyzing software systems, a large amount of data accumulates. In order to assist developers in the preparation, evaluation, and understanding of findings, different visualization techniques have been developed. Due to recent progress in immersive virtual reality, existing visualization tools were ported to this environment. However, three-dimensional and virtual reality environments have different advantages and disadvantages, and by transferring concepts, such as layout algorithms and user interaction mechanisms, more or less one-to-one, the characteristics of these environments are neglected. In order to develop techniques adapting to the circumstance of a particular environment, more research in this field is necessary.
In previously conducted case studies, we compared EvoStreets deployed in three different environments: 2D, 2.5D, and virtual reality. We found evidence that movement patterns (path length, average speed, and occupied volume) differ significantly between the 2.5D and virtual reality environments for some of the tasks that had to be solved by 34 participants in a controlled experiment. In this paper, we analyze the results of this experiment in more detail, to study whether not only movement is affected by these environments, but also the way EvoStreets are observed. Although we could not find enough evidence that the number of viewpoints and their duration differ significantly, we found indications that in virtual reality viewpoints are located closer to the EvoStreets and that the distance between viewpoints is shorter. Based on our previous results and the findings of this paper, we present visualization and user interaction concepts specific to the kind of environment.

Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries
Shuhan Yan, Hang Yu, Yuting Chen, Beijun Shen, and Lingxiao Jiang
(Shanghai Jiao Tong University, China; Singapore Management University, Singapore)
Code search methods, especially those that allow programmers to raise queries in a natural language, play an important role in software development. They help to improve programmers’ productivity by returning sample code snippets from the Internet and/or source-code repositories for their natural-language queries. Meanwhile, there are many code search methods in the literature that support natural-language queries. Difficulties exist in recognizing the strengths and weaknesses of each method and choosing the right one for different usage scenarios, because (1) the implementations of those methods and the datasets for evaluating them are usually not publicly available, and (2) some methods leverage different training datasets or auxiliary data sources and thus their effectiveness cannot be fairly measured and may be negatively affected in practical uses. To build a common ground for measuring code search methods, this paper builds CosBench, a dataset that consists of 1000 projects, 52 code-independent natural-language queries with ground truths, and a set of scripts for calculating four metrics on code search results. We have evaluated four IR (Information Retrieval)-based and two DL (Deep Learning)-based code search methods on CosBench. The empirical evaluation results clearly show the usefulness of the CosBench dataset and various strengths of each code search method. We found that DL-based methods are more suitable for queries on reusing code, and IR-based ones for queries on resolving bugs and learning API uses.

Automatically Learning Patterns for Self-Admitted Technical Debt Removal
Fiorella Zampetti, Alexander Serebrenik, and Massimiliano Di Penta
(University of Sannio, Italy; Eindhoven University of Technology, Netherlands)
Technical Debt (TD) expresses the need for improvements in a software system, e.g., to its source code or architecture. In certain circumstances, developers “self-admit” technical debt (SATD) in their source code comments. Previous studies investigate when SATD is admitted, and what changes developers perform to remove it. Building on these studies, we present a first step towards the automated recommendation of SATD removal strategies. By leveraging a curated dataset of SATD removal patterns, we build a multi-level classifier capable of recommending six SATD removal strategies, e.g., changing API calls, conditionals, method signatures, exception handling, return statements, or telling that a more complex change is needed. SARDELE (SAtd Removal using DEep LEarning) combines a convolutional neural network trained on embeddings extracted from the SATD comments with a recurrent neural network trained on embeddings extracted from the SATD-affected source code. Our evaluation reveals that SARDELE is able to predict the type of change to be applied with an average precision of ~55%, recall of ~57%, and AUC of 0.73, reaching up to 73% precision, 63% recall, and 0.74 AUC for certain categories such as changes to method calls. Overall, results suggest that SATD removal follows recurrent patterns and indicate the feasibility of supporting developers in this task with automated recommenders.

Refactoring Graphs: Assessing Refactoring over Time
Aline Brito, Andre Hora, and Marco Tulio Valente
(Federal University of Minas Gerais, Brazil)
Refactoring is an essential activity during software evolution. Frequently, practitioners rely on such transformations to improve source code maintainability and quality. As a consequence, this process may produce new source code entities or change the structure of existing ones. Sometimes, the transformations are atomic, i.e., performed in a single commit. In other cases, they generate sequences of modifications performed over time. To study and reason about refactorings over time, in this paper, we propose a novel concept called refactoring graphs and provide an algorithm to build such graphs. Then, we investigate the history of 10 popular open-source Java-based projects. After eliminating trivial graphs, we characterize a large sample of 1,150 refactoring graphs, providing quantitative data on their size, commits, age, refactoring composition, and developers. We conclude by discussing applications and implications of refactoring graphs, for example, to improve code comprehension, detect refactoring patterns, and support software evolution studies.

On Relating Technical, Social Factors, and the Introduction of Bugs
Filipe Falcão, Caio Barbosa, Baldoino Fonseca, Alessandro Garcia, Márcio Ribeiro, and Rohit Gheyi
(Federal University of Alagoas, Brazil; PUC-Rio, Brazil; Federal University of Campina Grande, Brazil)
As collaborative coding environments make it easier to contribute to software projects, the number of developers involved in these projects keeps increasing. This increase makes it more difficult for code reviewers to deal with buggy contributions. Collaborative environments like GitHub provide a rich source of data on developers' contributions. Such data can be used to extract information about developers regarding technical (e.g., their experience) and social (e.g., their interactions) factors. Recent studies analyzed the influence of these factors on different activities of software development. However, there is still room for improvement on the relation between these factors and the introduction of bugs. We present a broader study, covering 8 projects from different domains and 6,537 bug reports, that relates five technical factors and three social factors to the introduction of bugs. The results indicate that technical and social factors can discriminate between buggy and clean commits. However, technical factors are more decisive than social ones. In particular, developers' habit of not following technical contribution norms and the developer's commit bugginess are associated with an increase in commit bugginess. On the other hand, a project's establishment, the ownership level of developers' commits, and social influence are related to a lower chance of introducing bugs.

Characterizing Architectural Drifts of Adaptive Systems
Daniel San Martín, Bento Siqueira, Valter Camargo, and Fabiano Ferrari
(Federal University of São Carlos, Brazil)
An adaptive system (AS) evaluates its own behavior and changes it when the evaluation indicates that the system is not accomplishing what it is intended to do, or when better functionality or performance is possible. MAPE-K is a reference model that prescribes the adaptation mechanism of ASs by means of high-level abstractions such as Monitors, Analyzers, Planners and Executors and the relationships among them. Since the abstractions and the relationships provided by MAPE-K are generic, other reference models were proposed that provide lower-level abstractions to better support software engineers. However, after analyzing seven representative ASs, we found that the abstractions prescribed by the existing reference models are not properly implemented, thus leading to architectural drifts. Therefore, in this paper we characterize three of these drifts by describing them with a template and showing practical examples. The three architectural drifts of ASs are Scattered Reference Inputs, Mixed Executors and Effectors, and Obscure Alternatives. We expect that by identifying and characterizing these drifts, we can help software architects improve their design and, as a consequence, increase the reliability of this type of system.

Using Productive Collaboration Bursts to Analyze Open Source Collaboration Effectiveness
Samridhi Choudhary, Christopher Bogart, Carolyn Rose, and James Herbsleb
(Amazon, USA; Carnegie Mellon University, USA)
Developers of open-source software projects tend to collaborate in bursts of activity over a few days at a time, rather than at an even pace. A project might find its productivity suffering if bursts of activity occur when a key person with the right role or right expertise is not available to participate. Open-source projects could benefit from monitoring the way they orchestrate attention among key developers, finding ways to make themselves available to one another when needed. In commercial software development, Sociotechnical Congruence (STC) has been used as a measure to assess whether coordination among developers is sufficient for a given task. However, STC has not previously been successfully applied to open-source projects, in which some industrial assumptions do not apply: management-chosen targets, mandated steady work hours, and top-down task allocation of inputs and targets. In this work we propose an operationalization of STC for open-source software development. We use temporal bursts of activity as a unit of analysis more suited to the natural rhythms of open-source work, as well as open source analogues of other component measures needed for calculating STC. As an illustration, we demonstrate that open-source development on PyPI projects in GitHub is indeed bursty, that activities in the bursts have topical coherence, and we apply our operationalization of STC. We argue that a measure of socio-technical congruence adapted to open source could provide projects with a better way of tracking how effectively they are collaborating when they come together to collaborate.

Slice-Based Cognitive Complexity Metrics for Defect Prediction
Basma S. Alqadi and Jonathan I. Maletic
(Imam Muhammad ibn Saud Islamic University, Saudi Arabia; Kent State University, USA)
Researchers have identified several quality metrics to predict defects, relying on different information; however, these approaches lack metrics to estimate the effort required to understand system artifacts. In this paper, novel metrics to compute the cognitive complexity based on program slicing are introduced. These metrics help identify code that is more likely to have defects because it is challenging to comprehend. The metrics include such measures as the total number of slices in a file, the size, the average number of identifiers, and the average spatial distance of a slice. A scalable lightweight slicing tool is used to compute the necessary slicing data. A thorough empirical investigation into how cognitive complexity correlates with and predicts defects in the version histories of 10 datasets of 7 open source systems is performed. The results show that the increase of cognitive complexity significantly increases the number of defects in 94% of the cases. In a comparison study to metrics that have been shown to correlate with understandability and with defects, the addition of cognitive complexity metrics shows better prediction by up to 14% in F1, 16% in AUC, and 35% in R².

The Silent Helper: The Impact of Continuous Integration on Code Reviews
Nathan Cassee, Bogdan Vasilescu, and Alexander Serebrenik
(Eindhoven University of Technology, Netherlands; Carnegie Mellon University, USA)
The adoption of Continuous Integration (CI) has been shown to have multiple benefits for software engineering practices related to build, test and dependency management. However, the impact of CI on the social aspects of software development has been overlooked so far. Specifically, we focus on studying the impact of CI on a paradigmatic socio-technical activity within the software engineering domain, namely code reviews. Indeed, one might expect that the introduction of CI allows reviewers to focus on more challenging aspects of software quality that could not be assessed using CI. To assess the validity of this expectation, we conduct an exploratory study of code reviews in 685 GitHub projects that have adopted Travis-CI, the most popular CI service on GitHub. We observe that with the introduction of CI, pull requests are being discussed less. On average, CI saves up to one review comment per pull request. This decrease in the amount of discussion, however, cannot be explained by the decrease in the number of updates of the pull requests. This means that in the presence of CI developers perform the same amount of work while communicating less, giving rise to the idea of CI as a silent helper.

Heap Memory Snapshot Assisted Program Analysis for Android Permission Specification
Lannan Luo
(University of South Carolina, USA)
Given a permission-based framework, its permission specification, which is a mapping between API methods of the framework and the permissions they require, is important for software developers and analysts. In the case of Android Framework, which contains millions of lines of code, static analysis is promising for analyzing such a large codebase to derive its permission specification. One of the common building blocks for static analysis is the generation of a global call graph. However, as is common for object-oriented languages, the target of a virtual function call depends on the runtime type of the receiving object, which is undecidable statically. Existing work applies traditional analysis approaches, such as class-hierarchy analysis and points-to analysis, to build an over-approximated call graph of the framework, causing much imprecision in downstream analysis. We propose heap memory snapshot assisted program analysis, which leverages the dynamic information stored in the heap of Android Framework execution to assist in generating a more precise call graph; then, further analysis is performed on the call graph to extract the permission specification. We have developed a prototype and evaluated it on different versions of Android Framework. The evaluation shows that our method significantly improves on prior work, producing more precise results.

A Code-Description Representation Learning Model Based on Attention
Qing Huang, An Qiu, Maosheng Zhong, and Yuan Wang
(Jiangxi Normal University, China)
Code search retrieves source code given a query. Using deep learning, existing work embeds source code and its description into a shared vector space; however, this space is so general that each code token is associated with each description term. In this paper, we propose a code-description representation learning model (CDRL) based on attention. This model refines the general shared space into a specific one. In such a space, only semantically related code tokens and description terms are associated. The experimental results show that this model can retrieve relevant source code effectively and outperforms state-of-the-art methods (e.g., CODEnn and QECK) by 4-8% in terms of precision when the first query result is inspected.

Suggesting Comment Completions for Python using Neural Language Models
Adelina Ciurumelea, Sebastian Proksch, and Harald C. Gall
(University of Zurich, Switzerland)
Source-code comments are an important communication medium between developers to better understand and maintain software. Current research focuses on auto-generating comments by summarizing the code. However, good comments contain additional details, like important design decisions or required trade-offs, and only developers can decide on the proper comment content. Automated summarization techniques cannot include information that does not exist in the code; therefore, fully automated approaches, while helpful, will be of limited use. In our work, we propose to empower developers through a semi-automated system instead. We investigate the feasibility of using neural language models trained on a large corpus of Python documentation strings to generate completion suggestions and obtain promising results. By focusing on confident predictions, we can obtain a top-3 accuracy of over 70%, although this comes at the cost of lower suggestion frequency. Our models can be improved by leveraging context information like the signature and the full body of the method. Additionally, we are able to return good accuracy completions even for new projects, suggesting the generalizability of our approach.

Leveraging Contextual Information from Function Call Chains to Improve Fault Localization
Árpád Beszédes, Ferenc Horváth, Massimiliano Di Penta, and Tibor Gyimóthy
(University of Szeged, Hungary; University of Sannio, Italy)
In Spectrum Based Fault Localization, program elements such as statements or functions are ranked according to a suspiciousness score which can guide the programmer in finding the fault more efficiently. However, such a ranking does not include any additional information about the suspicious code elements. In this work, we propose to complement function-level spectrum based fault localization with function call chains (i.e., snapshots of the call stack occurring during execution), on which the fault localization is first performed, and then narrowed down to functions. Our experiments using defects from Defects4J show that (i) 69% of the defective functions can be found in call chains with highest scores, (ii) in 4 out of 6 cases the proposed approach can improve the Ochiai ranking by 1 to 9 positions on average, with a relative improvement of 19-48%, and (iii) the improvement is substantial (66-98%) when Ochiai produces bad rankings for the faulty functions.
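
To make the idea of scoring call chains concrete, the following is a minimal sketch that applies the standard Ochiai formula to call chains rather than to individual functions. It is not the authors' implementation: the function names `ochiai` and `rank_call_chains`, the input layout (a mapping from test names to the set of call chains observed in each test), and the tie-breaking rule are all assumptions made for illustration.

```python
import math


def ochiai(failed_cover, passed_cover, total_failed):
    """Standard Ochiai score: ef / sqrt(total_failed * (ef + ep))."""
    denom = math.sqrt(total_failed * (failed_cover + passed_cover))
    return failed_cover / denom if denom else 0.0


def rank_call_chains(coverage, outcomes):
    """Rank call chains by Ochiai suspiciousness (hypothetical helper).

    coverage: test name -> set of call chains (tuples of function names)
              observed while running that test
    outcomes: test name -> True if the test passed, False if it failed
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    chains = set().union(*coverage.values())
    scores = {}
    for chain in chains:
        # ef / ep: number of failing / passing tests in which the chain occurs
        ef = sum(1 for t, seen in coverage.items() if chain in seen and not outcomes[t])
        ep = sum(1 for t, seen in coverage.items() if chain in seen and outcomes[t])
        scores[chain] = ochiai(ef, ep, total_failed)
    # Highest score first; ties broken by chain contents for determinism
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


coverage = {
    "t1": {("main", "parse")},
    "t2": {("main", "parse"), ("main", "eval")},
    "t3": {("main", "eval")},
}
outcomes = {"t1": True, "t2": False, "t3": False}
ranked = rank_call_chains(coverage, outcomes)
```

In this toy input the chain `("main", "eval")` occurs in both failing tests and no passing test, so it ranks above `("main", "parse")`. Narrowing the top-ranked chains down to individual functions, as the abstract describes, is left out of the sketch.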

Deep Learning Based Identification of Suspicious Return Statements
Guangjie Li, Hui Liu, Jiahao Jin, and Qasim Umer
(Beijing Institute of Technology, China)
Identifiers in source code are composed of terms in natural languages. Such terms, as well as phrases composed of such terms, convey rich semantics that could be exploited for program analysis and comprehension. To this end, in this paper we propose a deep learning based approach, called MLDetector, to identify suspicious return statements by leveraging semantics conveyed by the natural language phrases that are used as identifiers in the source code. We specially design a deep neural network to tell whether a given return statement matches its corresponding method signature. The rationale is that both method signature and return value should explicitly specify the output of the method, and thus a significant mismatch between method signature and return value may suggest a suspicious return statement. To address the challenge of lacking negative training data, i.e., incorrect return statements, we generate negative training data automatically by transforming real-world correct return statements. To feed code into the neural network, we convert it into vectors with Word2Vec, an unsupervised neural network based learning algorithm. We evaluate the proposed approach in two parts. In the first part, we evaluate it on 500 open-source applications by automatically generating labeled training data. Results suggest that the precision of the proposed approach varies from 83% to 90%. In the second part, we conduct a case study on 100 real-world applications. Evaluation results suggest that 42 out of 65 real-world incorrect return statements are detected (with precision of 59%).

Clone Detection in Test Code: An Empirical Evaluation
Brent van Bladel and Serge Demeyer
(University of Antwerp, Belgium)
Duplicated test code (a.k.a. test code clones) has a negative impact on test comprehension and maintenance. Moreover, the typical structure of unit test code induces structural similarity, increasing the amount of duplication. Yet, most research on software clones and clone detection tools is focused on production code, often ignoring test code. In this paper we fill this gap by comparing four different clone detection tools (NiCad, CPD, iClones, TCORE) against the test code of three open-source projects. Our analysis confirms the prevalence of test code clones, as we observed between 23% and 29% test code duplication. We also show that most of the tools suffer from false negatives (NiCad = 83%, CPD = 84%, iClones = 21%, TCORE = 65%), which leaves ample room for improvement. These results indicate that further research on test clone detection is warranted.

Are SonarQube Rules Inducing Bugs?
Valentina Lenarduzzi, Francesco Lomio, Heikki Huttunen, and Davide Taibi
(Lappeenranta-Lahti University of Technology, Finland; Tampere University, Finland)
The popularity of tools for analyzing Technical Debt, and particularly the popularity of SonarQube, is increasing rapidly. SonarQube proposes a set of coding rules, which represent something wrong in the code that will soon be reflected in a fault or will increase maintenance effort. However, our local companies were not confident in the usefulness of the rules proposed by SonarQube and contracted us to investigate the fault-proneness of these rules. In this work we aim to understand which SonarQube rules are actually fault-prone and which machine learning models can be adopted to accurately identify fault-prone rules. We designed and conducted an empirical study on 21 well-known mature open-source projects. We applied the SZZ algorithm to label the fault-inducing commits. We analyzed the fault-proneness by comparing the classification power of seven machine learning models. Among the 202 rules defined for Java by SonarQube, only 25 can be considered to have relatively low fault-proneness. Moreover, violations considered as "bugs" by SonarQube were generally not fault-prone and, consequently, the fault-prediction power of the model proposed by SonarQube is extremely low. The rules applied by SonarQube for calculating technical debt should be thoroughly investigated and their harmfulness needs to be further confirmed. Therefore, companies should carefully consider which rules they really need to apply, especially if their goal is to reduce fault-proneness.

ERA

Enhancing Source Code Refactoring Detection with Explanations from Commit Messages
Rrezarta Krasniqi and Jane Cleland-Huang
(University of Notre Dame, USA)
We investigate the extent to which code commit summaries provide rationales and descriptions of code refactorings. We present a refactoring description detection tool, CMMiner, that detects code commit messages containing refactoring information and differentiates between twelve different refactoring types. We further explore whether refactoring information mined from commit messages using CMMiner can be combined with refactoring descriptions mined from source code using the well-known RMiner tool. For six refactoring types covered by both CMMiner and RMiner, we observed 21.96% to 38.59

Unleashing the Potentials of Immersive Augmented Reality for Software Engineering
Leonel Merino, Mircea Lungu, and Christoph Seidl
(University of Stuttgart, Germany; IT University of Copenhagen, Denmark)
In immersive augmented reality (IAR), users can wear a head-mounted display to see computer-generated images superimposed on their view of the world. IAR was shown to be beneficial across several domains, e.g., automotive, medicine, gaming and engineering, with positive impacts on, e.g., collaboration and communication. We think that IAR bears great potential for software engineering but, as of yet, this research area has been neglected. In this vision paper, we elicit potentials and obstacles for the use of IAR in software engineering. We identify possible areas that can be supported with IAR technology by relating commonly discussed IAR improvements to typical software engineering tasks. We further demonstrate how innovative use of IAR technology may fundamentally improve typical activities of a software engineer through a comprehensive series of usage scenarios outlining practical application. Finally, we reflect on current limitations of IAR technology based on our scenarios and sketch research activities necessary to make our vision a reality. We consider this paper to be relevant to academia and industry alike in guiding the steps to innovative research and applications for IAR in software engineering.

Reflection on Building Hybrid Access Control by Configuring RBAC and MAC Features
Dae-Kyoo Kim, Hua Ming, and Lunjin Lu
(Oakland University, USA)
This paper reflects on the paper titled "Building Hybrid Access Control by Configuring RBAC and MAC Features", which was published in Information and System Technology, 2014. The publication presents an approach for building a hybrid access control model of Role-Based Access Control and Mandatory Access Control by defining and configuring them in terms of features. The publication has been cited three times, which shows limited impact. We review the citing papers as to how the publication is cited and discuss possible reasons for the limited impact. We also discuss its position in the current state of the art since the publication. Then, we describe an ongoing effort for a new approach to address the weaknesses of the publication with expected impact.

Is Developer Sentiment Related to Software Bugs: An Exploratory Study on GitHub Commits
Syed Fatiul Huq, Ali Zafar Sadiq, and Kazi Sakib
(University of Dhaka, Bangladesh)
The outcome of software products primarily depends on the developers, including their emotion or sentiment in a software development environment. Developer emotions have been observed to be correlated with several patterns, for instance, task resolution time and developer turnover, by conducting sentiment analysis on software collaborative artifacts like Commits. This study aims to quantify the impact of those patterns by finding a relation between developer sentiment and software bugs. To do so, Fix-Inducing Changes (changes that introduce bugs to the system) are detected, along with changes that precede or fix those bugs. The sentiment of these changes is determined from their Commit messages using Senti4SD. It is statistically observed that Commits that introduce, precede or fix bugs are significantly more negative than regular Commits, with a higher proportion of emotional (non-neutral) messages. It is also found that a distinction between buggy and correct fixes exists based on the message's neutrality.

The Python/C API: Evolution, Usage Statistics, and Bug Patterns
Mingzhe Hu and Yu Zhang
(University of Science and Technology of China, China)
Python has become one of the most popular programming languages in the era of data science and machine learning, especially for its diverse libraries and extension modules. A Python front-end with a C/C++ native implementation achieves both productivity and performance, almost becoming the standard structure for many mainstream software systems. However, feature discrepancies between the two languages can pose many security hazards in the interface layer using the Python/C API. In this paper, we applied static analysis to reveal the evolution and usage statistics of the Python/C API, and provided a summary and classification of its 10 bug patterns with empirical bug instances from Pillow, a widely used Python imaging library. Our toolchain can be easily extended to access different types of syntactic bug-finding checkers. Moreover, our systematic taxonomy for classifying bugs can guide the construction of more highly automated and high-precision bug-finding tools.

Revisiting the Challenges and Opportunities in Software Plagiarism Detection
Xi Xu, Ming Fan, Ang Jia, Yin Wang, Zheng Yan, Qinghua Zheng, and Ting Liu
(Xi'an Jiaotong University, China; Xidian University, China; Aalto University, Finland)
Software plagiarism seriously impedes the healthy development of open source software. To fight against code obfuscation and inherent non-determinism of thread scheduling applied against software plagiarism detection, we proposed a new dynamic birthmark called DYnamic Key Instruction Sequence (DYKIS) and a framework called Thread-oblivious dynamic Birthmark (TOB) for the purpose of reviving the existing birthmarks and a thread-aware dynamic birthmark called Thread-related System call Birthmark (TreSB). Though many approaches have been proposed for software plagiarism detection, they still fall short of satisfying the following highly desired requirements: the applicability to handle binaries, the capability to detect partial plagiarism, the resiliency to code obfuscation, the interpretability of detection results, and the scalability to process large-scale software. In this position paper, we discuss and outline the research opportunities and challenges in the field of software plagiarism detection in order to stimulate brilliant innovations and direct our future research efforts.

Req2Lib: A Semantic Neural Model for Software Library Recommendation
Zhensu Sun, Yan Liu, Ziming Cheng, Chen Yang, and Pengyu Che
(Tongji University, China)
Third-party libraries are crucial to the development of software projects. To get suitable libraries, developers need to search through millions of libraries by filtering, evaluating, and comparing. The vast number of libraries places a barrier for programmers to locate appropriate ones. To help developers, researchers have proposed automated approaches to recommend libraries based on library usage patterns. However, these prior studies cannot sufficiently match user requirements and suffer from the cold-start problem. In this work, we would like to make recommendations based on requirement descriptions to avoid these problems. To this end, we propose a novel neural approach called Req2Lib which recommends libraries given descriptions of the project requirements. We use a Sequence-to-Sequence model to learn the library linked-usage information and semantic information of requirement descriptions in natural language. In addition, we apply a domain-specific pre-trained word2vec model for word embedding, which is trained over a textual corpus from Stack Overflow posts. In the experiment, we train and evaluate the model with data from 5,625 Java projects. Our preliminary evaluation demonstrates that Req2Lib can recommend libraries accurately.

Dependency Solving Is Still Hard, but We Are Getting Better at It
Pietro Abate, Roberto Di Cosmo, Georgios Gousios, and Stefano Zacchiroli
(Nomadic Labs, France; Inria, France; University Paris Diderot, France; Delft University of Technology, Netherlands)
Dependency solving is a hard (NP-complete) problem in all non-trivial component models due to either mutually incompatible versions of the same packages or explicitly declared package conflicts. As such, software upgrade planning needs to rely on highly specialized dependency solvers, lest it fall into pitfalls such as incompleteness: a combination of package versions that satisfies the dependency constraints does exist, but the package manager is unable to find it.
In this paper we look back at proposals from dependency solving research dating back a few years. Specifically, we review the idea of treating dependency solving as a separate concern in package manager implementations, relying on generic dependency solvers based on tried and tested techniques such as SAT solving, PBO, MILP, etc.
By conducting a census of dependency solving capabilities in state-of-the-art package managers we conclude that some proposals are starting to take off (e.g., SAT-based dependency solving) while, with few exceptions, others have not (e.g., outsourcing dependency solving to reusable components). We reflect on why that has been the case and look at novel challenges for dependency solving that have emerged since.
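
The notion of completeness mentioned in this abstract can be illustrated with a deliberately naive sketch: an exhaustive search that, unlike a greedy "pick the newest version" strategy, is guaranteed to find a valid installation whenever one exists. This is not any package manager's algorithm and not a SAT encoding; the `solve` function, the data layout, and the package names are hypothetical, and the exponential enumeration is only meant to show what the constraints look like.

```python
from itertools import product


def solve(versions, depends, conflicts, wanted):
    """Exhaustively search for a consistent installation (illustrative only).

    versions:  package -> list of available versions
    depends:   (package, version) -> list of (dependency, set of allowed versions)
    conflicts: iterable of ((pkg_a, ver_a), (pkg_b, ver_b)) that cannot coexist
    wanted:    name of the package that must be installed
    """
    names = sorted(versions)
    # Each package is either absent (None) or installed at exactly one version.
    for choice in product(*[[None] + versions[n] for n in names]):
        installed = {n: v for n, v in zip(names, choice) if v is not None}
        if wanted not in installed:
            continue
        deps_ok = all(
            installed.get(dep) in allowed
            for pkg, ver in installed.items()
            for dep, allowed in depends.get((pkg, ver), [])
        )
        conflict_free = not any(
            installed.get(a[0]) == a[1] and installed.get(b[0]) == b[1]
            for a, b in conflicts
        )
        if deps_ok and conflict_free:
            return installed
    return None  # no combination of versions satisfies the constraints


versions = {"app": ["1"], "lib": ["1", "2"], "ssl": ["1", "2"]}
depends = {
    ("app", "1"): [("lib", {"1", "2"}), ("ssl", {"2"})],
    ("lib", "2"): [("ssl", {"1"})],
}
solution = solve(versions, depends, [], "app")
no_solution = solve(versions, depends, [(("lib", "1"), ("ssl", "2"))], "app")
```

In this toy universe a greedy solver that always picks the newest `lib` would select version 2, which requires `ssl` 1 and clashes with the requirement of `app` on `ssl` 2; the exhaustive search instead settles on `lib` 1. Adding an explicit conflict between `lib` 1 and `ssl` 2 makes the problem unsatisfiable, and the search returns `None`.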

A Reflection on “An Exploratory Study on Exception Handling Bugs in Java Programs”
Felipe Ebert, Fernando Castor, and Alexander Serebrenik
(Federal University of Pernambuco, Brazil; Eindhoven University of Technology, Netherlands)
Exception handling is a feature provided by most mainstream programming languages, and typically involves constructs to throw and handle error signals. On the one hand, early work has argued extensively about the benefits of exception handling, such as promoting modularity by defining how exception handlers can be implemented and maintained independently of the normal behavior of the system and easing bug localization. On the other hand, some studies argue that exception handling can make programming languages unnecessarily complex and promote the introduction of subtle bugs in programs. In 2015 we published a paper describing a study investigating the prevalence and nature of exception handling bugs in two large, widely adopted Java systems. This study also confronted its findings about real exception handling bugs with the perceptions of developers about those bugs, also accounting for bugs not related to exception handling. The goal of this reflection paper is to investigate the state of the art in exception handling research, with a particular emphasis on exception handling bugs, and how our paper has influenced other studies in the area. We found that our paper was cited by 33 articles, and all themes for future work we raised in our paper have been tackled by other studies in the short span of five years.

A Preliminary Study on Open-Source Memory Vulnerability Detectors
Yu Nong and Haipeng Cai
(Washington State University, USA)
We present preliminary results of a study on memory vulnerability detectors based on (static and/or dynamic) program analysis. Against a public suite of 520 C/C++ programs as benchmarks which cover 14 different vulnerability categories, we measured the performance of five state-of-the-art detectors in terms of effectiveness and efficiency. Our study revealed that with respect to the particular set of benchmarks we chose: (1) the effectiveness of the studied detectors varied widely: 66.7% to 100% precision, 0% to 100% recall, and 0% to 100% F1 per category, indicating most of the techniques worked extremely well on certain kinds of vulnerabilities yet quite poorly on others, (2) these detectors were generally quite efficient: despite a few outliers, the average (per benchmark) time costs were around one second, and (3) apart from the pair formed by the most and least accurate detectors, pairs of detectors did not show statistically significant and large differences in accuracy in our pair-wise statistical testing. We also share insights into the failures and successes of these detectors obtained from our case studies.

A Reflection on the Predictive Accuracy of Dynamic Impact Analysis
Haipeng Cai
(Washington State University, USA)
Impact analysis, as a critical step in software evolution, assists developers in deciding whether and where to apply code changes in evolving software. Dynamic approaches to this analysis particularly focus on the effects of potential code changes to a program with respect to its concrete executions. Given the existence of a number of prior approaches to dynamic impact analysis as opposed to a lack of systematic understanding of their performance, the first comprehensive study of the predictive accuracy of dynamic impact analysis was conducted, comparing the performance of representative techniques in this area against various kinds of realized code changes. This paper reflects on the progress in dynamic impact analysis, concerning the impact of that earlier study on later research. We also situate dynamic impact analysis within the current research and practice on impact analysis in general, and envision relevant future research vectors in this area.

JavaScript API Deprecation in the Wild: A First Assessment
Romulo Nascimento, Aline Brito, Andre Hora, and Eduardo Figueiredo
(Federal University of Minas Gerais, Brazil)
Building an application using third-party libraries is a common practice in software development. Like any other software system, code libraries and their APIs evolve over time. In order to help version migration and ensure backward compatibility, a recommended practice during development is to deprecate APIs. Although studies have been conducted to investigate deprecation in some programming languages, such as Java and C#, there are no detailed studies on API deprecation in the JavaScript ecosystem. This paper provides an initial assessment of API deprecation in JavaScript by analyzing 50 popular software projects. Initial results suggest that the use of deprecation mechanisms in JavaScript packages is low. However, we find five different ways that developers use to deprecate APIs in the studied projects. Among these solutions, deprecation utilities (i.e., any sort of function specially written to aid deprecation) and code comments are the most common practices in JavaScript. Finally, we find that the rate of helpful messages is high: 67% of the deprecations have replacement messages to support developers when migrating APIs.

A Semantic-Based Framework for Analyzing App Users’ Feedback
Aman Yadav, Rishab Sharma, and Fatemeh H. Fard
(NIT, India; University of British Columbia, Canada)
The competitive market of mobile apps requires app developers to consider the users’ feedback frequently. This feedback, when it comes from different sources, e.g., App Stores and Twitter, provides a broader picture of the state of the app, as the users discuss different topics on each platform. Automated tools are developed to filter the informative comments for app developers. However, to integrate the feedback from different platforms, one should evaluate the similarities and/or differences of the text from each one. The different meanings of words in various contexts make this evaluation a challenging task for automated processes. For example, "Request night theme" and "Add dark mode" are two comments that request the same feature. This similarity cannot be identified automatically if the semantics of the words are not embedded in the analysis. In this paper, we propose a new framework to analyze the users’ feedback by embedding their semantics. As a case study, we investigate whether our approach can identify the similar/different comments from Google Play Store and Twitter, in the two well-studied classes of bug reports and feature requests from the literature. The initial results, validated by expert evaluation and statistical analysis, show that this framework can automatically measure the semantic differences among users’ comments in both groups. The framework can be used to build intelligent tools to integrate the users’ feedback from other platforms, as well as to provide ways to analyze the reviews in more detail automatically.

MobiLogLeak: A Preliminary Study on Data Leakage Caused by Poor Logging Practices
Rui Zhou, Mohammad Hamdaqa, Haipeng Cai, and Abdelwahab Hamou-Lhadj
(Concordia University, Canada; Reykjavik University, Iceland; Washington State University, USA)
Logging is an essential software practice that is used by developers to debug, diagnose and audit software systems. Despite the advantages of logging, poor logging practices can potentially leak sensitive data. The problem of data leakage is more severe in applications that run on mobile devices, since these devices carry sensitive identification information ranging from physical device identifiers (e.g., IMEI, MAC address) to communications network identifiers (e.g., SIM, IP, Bluetooth ID), and application-specific identifiers related to the location and the users’ accounts. This preliminary study explores the impact of logging practices on data leakage of such sensitive information. In particular, we want to investigate whether log-related statements inserted into an application's code could lead to data leakage. While studying logging practices in mobile applications is an active research area, to our knowledge, this is the first study that explores the interplay between logging and security in the context of mobile applications for Android. We propose MobiLogLeak, an approach that identifies log statements in deployed apps that leak sensitive data. MobiLogLeak relies on taint flow analysis. Among 5,000 Android apps that we studied, we found that 200 apps leak sensitive data through logging.

Identifying Vulnerable IoT Applications using Deep Learning
Hajra Naeem and Manar H. Alalfi
(Ryerson University, Canada)
This paper presents an approach for identification of vulnerable IoT applications using deep learning algorithms. The approach focuses on a category of vulnerabilities that leads to sensitive information leakage which can be identified using taint analysis. First, we analyze the source code of IoT apps in order to recover tokens along with their frequencies and tainted flows. Second, we develop Token2Vec, which transforms the source code tokens into vectors. We have also developed Flow2Vec, which transforms the identified tainted flows into vectors. Third, we use the recovered vectors to train a deep learning algorithm to build a model for the identification of tainted apps. We have evaluated the approach on two datasets, and the experiments show that the proposed approach of combining tainted flow features with the base benchmark, which uses token frequencies only, has improved the accuracy of the prediction models from 77.78% to 92.59% for Corpus1 and from 61.11% to 87.03% for Corpus2.

A Mutation Framework for Evaluating Security Analysis Tools in IoT Applications
Sajeda Parveen and Manar H. Alalfi
(Ryerson University, Canada)
In this paper, we present an automated framework to evaluate taint flow analysis tools in the domain of IoT (Internet of Things) apps. First, we propose a set of mutational operators tailored to evaluate flow-sensitive analysis tools. Then we develop mutators to automatically generate mutants for this type of sensitivity analysis. We demonstrate the framework on flow-sensitivity mutational operators to evaluate two taint flow analyzers, SaINT and Taint-Things. To the best of our knowledge, our framework is the first to address the need for evaluating taint flow analysis tools specifically developed for IoT SmartThings apps.

RENE

Pull Requests or Commits? Which Method Should We Use to Study Contributors’ Behavior?
Marcus Bertoncello, Gustavo Pinto, Igor Scaliante Wiese, and Igor Steinmacher
(State University of Maringá, Brazil; Federal University of Paraná, Brazil; Federal University of Technology Paraná, Brazil; Northern Arizona University, USA)
Social coding environments have been consistently growing since the popularization of the contribution model known as pull-based. This model has facilitated how developers make their contributions; developers can easily place a few pull requests without further commitment. Developers without strong ties to a project, the so-called casual contributors, often make a single contribution before disappearing. Interestingly, some studies about the topic use the number of commits made to identify the casual contributors, while others use the number of merged pull requests. Does the method used influence the results? In this paper, we replicate a study about casual contributors that relied on commits to identify and analyze these contributors. To achieve this goal, we analyzed the same set of GitHub-hosted software repositories used in the original paper. By using pull requests, we found an average of 66% casual contributors (in comparison to 48.98% when using commits), who were responsible for 12.5% of the contributions accepted (1.73% when using commits). We used a sample of 442 developers to investigate the accuracy of the method. We found that 11.3% of the contributors identified using the pull requests were misclassified (26.2% using commits). We also found evidence that using pull requests is more precise for determining the number of contributions, given that GitHub projects mostly follow the pull-based process. Our results indicate that the method used for mining contributors' data has the potential to influence the results. With this replication, it may be possible to improve previous results and reduce future efforts for new researchers when conducting studies that rely on the number of contributions.

Automated Deprecated-API Usage Update for Android Apps: How Far Are We?
Ferdian Thung, Stefanus A. Haryono, Lucas Serrano, Gilles Muller, Julia Lawall, David Lo, and Lingxiao Jiang
(Singapore Management University, Singapore; Sorbonne University, France; LIP6, France; Inria, France)
As the Android API evolves, some API methods may be deprecated, to be eventually removed. App developers face the challenge of keeping their apps up-to-date, to ensure that the apps work in both older and newer Android versions. Currently, AppEvolve is the state-of-the-art approach to automate such updates, and it has been shown to be quite effective. Still, the number of experiments reported is moderate, involving only API usage updates in 41 usage locations. In this work, we replicate the evaluation of AppEvolve and assess whether its effectiveness is generalizable. Given the set of APIs on which AppEvolve has been evaluated, we test AppEvolve on other mobile apps that use the same APIs. Our experiments show that AppEvolve fails to generate applicable updates for 81% of our dataset, even though the relevant knowledge for correct API updates is available in the examples. We first categorize the limitations of AppEvolve that lead to these failures. We then propose a mitigation strategy that solves 86% of these failures by a simple refactoring of the app code to better resemble the code in the examples. The refactoring usually involves assigning the target API method invocation and the arguments of the target API method into variables. Indeed, we have also seen such transformations in the dataset distributed with the AppEvolve replication package, as compared to the original source code from which this dataset is derived. Based on these findings, we propose some promising future directions.

Industry

Experience Report: How Effective Is Automated Program Repair for Industrial Software?
Kunihiro Noda, Yusuke Nemoto, Keisuke Hotta, Hideo Tanida, and Shinji Kikuchi
(Fujitsu Labs, Japan)
Recent advances in automated program repair (APR) have widely caught the attention of industrial developers as a way of reducing debugging costs. While hundreds of studies have evaluated the effectiveness of APR on open-source software, industrial case studies on APR have been rarely reported; it is still unclear whether APR can work well for industrial software.
This paper reports our experience applying a state-of-the-art APR technique, Elixir, to large industrial software consisting of 150+ Java projects and 13 years of development histories. It provides lessons learned and recommendations regarding obstacles to the industrial use of current APR: low recall (7.7%), lack of bug-exposing tests (90%), low success rate (10%), among others. We also report the preliminary results of our ongoing improvement of Elixir. With some simple enhancements, the success rate of repair has been greatly improved by up to 40%.

Reducing Code Complexity through Code Refactoring and Model-Based Rejuvenation
Arjan J. Mooij, Jeroen Ketema, Steven Klusener, and Mathijs Schuts
(ESI/TNO, Netherlands; Philips, Netherlands)
Over time, software tends to grow more complex, hampering understandability and further development. To reduce accidental complexity, model-based rejuvenation techniques have been proposed. These techniques combine reverse engineering (extracting models) with forward engineering (generating code). Unfortunately, model extraction can be error-prone, and validation can often only be performed at a late stage by testing the generated code. We intend to mitigate the aforementioned challenges, making model-based rejuvenation more controlled.
We describe an exploratory case study that aims to rejuvenate an industrial embedded software component implementing a nested state machine. We combine two techniques. First, we develop and apply a series of small, automated, case-specific code refactorings that ensure the code (a) uses well-known programming idioms, and (b) easily maps onto the type of model we intend to extract. Second, we perform model-based rejuvenation focusing on the high-level structure of the code.
The above combination of techniques gives ample opportunity for early validation, in the form of code reviews and testing, as each refactoring is performed directly on the existing code. Moreover, aligning the code with the type of model we intend to extract significantly simplifies the extraction, making the process less error-prone. Hence, we consider code refactoring to be a useful stepping stone towards model-based rejuvenation.

Leveraging Machine Learning for Software Redocumentation
Verena Geist, Michael Moser, Josef Pichler, Stefanie Beyer, and Martin Pinzger
(Software Competence Center Hagenberg, Austria; University of Applied Sciences Upper Austria, Austria; University of Klagenfurt, Austria)
Source code comments contain key information about the underlying software system. Many redocumentation approaches, however, cannot exploit this valuable source of information. This is mainly because not all comments have the same goals and target audience, so they can only be used selectively for redocumentation. Performing the required classification manually, e.g. in the form of heuristic rules, is usually time-consuming and error-prone, and it depends strongly on the programming languages and guidelines of concrete software systems. By leveraging machine learning, it should be possible to classify comments and thus transfer valuable information from the source code into documentation with less effort but the same quality. We applied different machine learning techniques to a COBOL legacy system and compared the results with industry-strength heuristic classification. As a result, we found that machine learning outperforms the heuristics, producing fewer errors with less effort.

Automated Code Transformations: Dealing with the Aftermath
Stefan Strobl, Christina Zoffi, Christoph Haselmann, Mario Bernhart, and Thomas Grechenig
(Vienna University of Technology, Austria)
Dealing with legacy systems has been a challenge for the industry for decades. The pressure to efficiently modernise legacy assets to meet new business needs and minimise associated risks is increasing. Automated code transformation, which is associated with serious (long-known) risks, is a high priority in the industrial environment due to the cost structure, the effort required and the supposed time savings. However, little has been published about the long-term effects of successful migrations. This paper looks at three different cases of automated code transformation at different stages of their lifecycle, highlights the lessons learned and derives a number of recommendations that should be useful for planning and executing future transformations.

Tool Demonstrations

CryptoExplorer: An Interactive Web Platform Supporting Secure Use of Cryptography APIs
Mohammadreza Hazhirpasand, Mohammad Ghafari, and Oscar Nierstrasz
(University of Bern, Switzerland)
Research has shown that cryptographic APIs are hard to use. Consequently, developers resort to using code examples available in online information sources that are often not secure.
We have developed a web platform, named CryptoExplorer, stocked with numerous real-world secure and insecure examples that developers can explore to learn how to use cryptographic APIs properly. This platform currently provides 3,263 secure uses and 5,897 insecure uses of the Java Cryptography Architecture mined from 2,324 Java projects on GitHub.
A preliminary study shows that CryptoExplorer provides developers with secure crypto API use examples instantly, that developers can save time compared to searching on the internet for such examples, and that they learn to avoid using certain algorithms in APIs by studying misused API examples.
We have a pipeline to regularly mine more projects, and, on request, we offer our dataset to researchers.

AUSearch: Accurate API Usage Search in GitHub Repositories with Type Resolution
Muhammad Hilmi Asyrofi, Ferdian Thung, David Lo, and Lingxiao Jiang
(Singapore Management University, Singapore)
Nowadays, developers use APIs to implement their applications. To learn how to use an API, developers may search for code examples that use the API in repositories such as GitHub. Although code search engines have been developed to help developers perform such searches, these engines typically accept only a query containing a description of the task to be implemented or the names of the APIs that the developer wants to use. They do not let the developer specify particular search constraints, such as the class and parameter types that the relevant API should take. These engines are not designed for cases in which the specific API and its types are known and the developer wants code examples that show how that API is used; as a result, their search results are often inaccurate. In this work, we propose a tool, named AUSearch, to fill this gap. Given an API query that allows type constraints, AUSearch finds code examples in GitHub that contain usages of the specific APIs in the query. AUSearch performs type resolution to ensure that the API usages found in the returned files are indeed invocations of the APIs specified in the query, and it highlights the relevant lines of code in the files for easier reference. We show that AUSearch is much more accurate than GitHub Code Search. A video demonstrating our tool is available from https://youtu.be/DKiGal5bSkU.

Clone Notifier: Developing and Improving the System to Notify Changes of Code Clones
Shogo Tokui, Norihiro Yoshida, Eunjong Choi, and Katsuro Inoue
(Osaka University, Japan; Nagoya University, Japan; Kyoto Institute of Technology, Japan)
A code clone is a code fragment that is identical or similar to another fragment in the source code. Code clones have been identified as one of the main problems in software maintenance. When developers fix a defect, they need to find the code clones corresponding to the defective code fragments. In this paper, we present Clone Notifier, a system that alerts software developers to the creation of and changes to code clones. First, Clone Notifier identifies creations and changes of code clones. Subsequently, it groups them into four categories (new, deleted, changed, stable) and assigns labels (e.g., consistent, inconsistent) to them. Finally, it notifies developers of the creations and changes of code clones along with the corresponding categories and labels. Clone Notifier and its video are available at: https://github.com/s-tokui/CloneNotifier.
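The four categories can be illustrated with a minimal sketch that compares the clone sets of two revisions. It assumes that each clone set carries an identifier that is stable across revisions and that a set is represented by the fingerprints of its fragments; both are simplifications made for illustration, and the function `categorize` is hypothetical rather than part of Clone Notifier.

```python
def categorize(old, new):
    """Assign one of four categories to every clone set seen in two revisions.

    `old` and `new` map a clone-set identifier to a frozenset of fragment
    fingerprints. Stable identifiers are an assumption; a real tool has to
    track clone sets across revisions itself.
    """
    result = {}
    for clone_id in old.keys() | new.keys():
        if clone_id not in old:
            result[clone_id] = "new"
        elif clone_id not in new:
            result[clone_id] = "deleted"
        elif old[clone_id] != new[clone_id]:
            result[clone_id] = "changed"
        else:
            result[clone_id] = "stable"
    return result

# Hypothetical fingerprints for two consecutive revisions.
previous = {
    "c1": frozenset({"a.c:10-20#9f1", "b.c:5-15#9f1"}),
    "c2": frozenset({"a.c:40-50#77a", "c.c:1-11#77a"}),
    "c3": frozenset({"d.c:3-9#e02", "e.c:3-9#e02"}),
}
current = {
    "c1": frozenset({"a.c:10-20#9f1", "b.c:5-15#9f1"}),
    "c2": frozenset({"a.c:40-52#1bc", "c.c:1-11#77a"}),
    "c4": frozenset({"f.c:8-18#5d4", "g.c:8-18#5d4"}),
}

categories = categorize(previous, current)
```

In this toy data, `c2` ends up as "changed" because only one of its fragments was edited, which is the kind of situation an "inconsistent" label would flag.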

Mining Version Control Systems and Issue Trackers with LibVCS4j
Marcel Steinbeck
(University of Bremen, Germany)
Mining version control systems, such as Git, Mercurial, and Subversion, is a good way to analyze different aspects of software evolution. Several previous studies, for example, analyzed how cloning in software evolves by tracking findings through the change history of source code files. By linking in issues (reported and discussed in issue tracking systems), results can be further refined, for instance by identifying commits that removed long-lived clones and fixed known issues. However, due to the absence of suitable tools, most of these studies implemented their own data extraction approaches. In this paper we present LibVCS4j, a Java programming library which can be integrated into existing analysis tools (for example, tools that detect code smells in source code files) to i) iterate through the history of software repositories, ii) run the desired analysis on a selected set of revisions, and iii) fetch additional information (commit messages, file differences, and so on) while processing a repository. The library unites different version control systems and issue tracking systems under a common interface and, thus, is well suited to decouple data analysis from data extraction.

SpojitR: Intelligently Link Development Artifacts
Michael Rath, Mihaela Todorova Tomova, and Patrick Mäder
(DLR, Germany; TU Ilmenau, Germany)
Traceability has been acknowledged as an important part of the software development process and is considered relevant when performing tasks such as change impact and coverage analysis. With the growing popularity of issue tracking systems and version control systems, developers began including the unique identifiers of issues in commit messages. The goal of this message tagging is to trace related artifacts and eventually establish project-wide traceability. However, the trace creation process is still performed manually and is not free of errors, e.g., developers may forget to tag their commit with an issue id. The prototype spojitR is designed to assist developers in tagging commit messages and thus (semi-)automatically creating trace links between commits and the issue they are working on. When no tag is present in a commit message, spojitR offers the developer a short recommendation list of potential issue ids with which to tag the commit message. We evaluated our tool using an open-source project hosted by the Apache Software Foundation. The source code, a demonstration, and a video about spojitR are available online: https://github.com/SECSY-Group/spojitr.
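The tagging workflow can be sketched as follows: detect whether a commit message already carries an issue id, and if not, propose a short list of candidates. The sketch assumes Jira-style keys such as `PROJ-123` and ranks candidates by plain word overlap; that ranking is a deliberately naive stand-in, not the recommendation method spojitR uses, and all names and data below are hypothetical.

```python
import re

# Assumed tag format: Jira-style keys such as "PROJ-123".
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")
WORD = re.compile(r"[a-z]+")

def find_issue_ids(message):
    """Return all issue ids already present in a commit message."""
    return ISSUE_KEY.findall(message)

def recommend(message, open_issues, top_n=3):
    """Rank open issues by word overlap with the commit message.

    `open_issues` maps an issue id to its summary text. Issues that share
    no word with the message are left out of the list.
    """
    words = set(WORD.findall(message.lower()))
    scored = []
    for issue_id, summary in open_issues.items():
        overlap = len(words & set(WORD.findall(summary.lower())))
        scored.append((overlap, issue_id))
    # Highest overlap first; ties are broken by issue id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [issue_id for overlap, issue_id in scored[:top_n] if overlap > 0]

issues = {
    "PROJ-1": "Parser throws null pointer",
    "PROJ-2": "Update docs",
}
untagged = "fix null pointer in parser"
suggestions = recommend(untagged, issues) if not find_issue_ids(untagged) else []
```

A commit hook could call `find_issue_ids` first and only show `suggestions` when the result is empty, which mirrors the behaviour described in the abstract.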

ChangeBeadsThreader: An Interactive Environment for Tailoring Automatically Untangled Changes
Satoshi Yamashita, Shinpei Hayashi, and Motoshi Saeki
(Tokyo Institute of Technology, Japan)
To improve the usability of a revision history, change untangling, which reconstructs the history to ensure that changes in each commit belong to one intentional task, is important. Although there are several untangling approaches based on the clustering of fine-grained editing operations of source code, they often produce unsuitable results for a developer, and manual tailoring of the results is necessary. In this paper, we propose ChangeBeadsThreader (CBT), an interactive environment for splitting and merging change clusters to support the manual tailoring of untangled changes. CBT provides two features: 1) a two-dimensional space where fine-grained change history is visualized to help users find the clusters to be merged and 2) an augmented diff view that enables users to confirm the consistency of the changes in a specific cluster for finding those to be split. These features allow users to easily tailor automatically untangled changes.

Late Breaking Ideas

Reinforcement Learning Guided Symbolic Execution
Jie Wu, Chengyu Zhang, and Geguang Pu
(East China Normal University, China)
Symbolic execution is an indispensable technique for software testing and program analysis. Path explosion is one of its key challenges. To mitigate this challenge, this paper leverages the Q-learning algorithm to guide symbolic execution. Our guided symbolic execution technique focuses on generating a test input that triggers a particular statement in the program. In our approach, we first use static analysis to obtain the dominators of that statement, i.e., the statements that must be visited before reaching it. We then start symbolic execution with the branch choice controlled by the Q-learning policy. Only when symbolic execution encounters a dominator does it return a positive reward to Q-learning; otherwise, it returns a negative reward. We update the Q-table accordingly. Our initial evaluation results indicate that, on average, more than 90% of exploration paths and instructions are eliminated when reaching the target statement compared with the default search strategy in KLEE, which shows the promise of this work.
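The reward scheme can be illustrated with the standard tabular Q-learning update. The sketch below assumes that states are program locations, that actions are the two outcomes of a branch, and that the rewards are +1 for reaching a dominator and -1 otherwise; the learning rate, discount factor, reward magnitudes, and all names are illustrative assumptions, not values from the paper.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)

def reward(state, dominators):
    """Positive reward only when the reached state is a dominator."""
    return 1.0 if state in dominators else -1.0

def q_update(q, state, action, next_state, dominators, actions=(0, 1)):
    """Apply one tabular Q-learning update and return the new Q-value."""
    r = reward(next_state, dominators)
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += ALPHA * (r + GAMMA * best_next - q[(state, action)])
    return q[(state, action)]

def greedy_action(q, state, actions=(0, 1)):
    """Pick the branch outcome with the highest learned Q-value."""
    return max(actions, key=lambda a: q[(state, a)])

# Hypothetical program locations: taking branch 1 at "entry" reaches the
# dominator "check", while branch 0 leads to the irrelevant location "log".
q_table = defaultdict(float)
dominators = {"check"}
towards = q_update(q_table, "entry", 1, "check", dominators)
away = q_update(q_table, "entry", 0, "log", dominators)
choice = greedy_action(q_table, "entry")
```

After these two updates the greedy policy prefers the branch that leads to the dominator, which is how the Q-table steers exploration towards the target statement.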

Live Replay of Screen Videos: Automatically Executing Real Applications as Shown in Recordings
Rudolf Ramler, Marko Gattringer, and Josef Pichler
(Software Competence Center Hagenberg, Austria; University of Applied Sciences Upper Austria, Austria)
Screencasts and videos with screen recordings are becoming an increasingly popular source of information for users to understand and learn about software applications. However, searching for answers to specific questions in screen videos is notoriously difficult due to the effort for locating specific events of interest and reproducing the application's state up to this event. To increase the efficiency when working with screen videos, we propose a solution for replaying recorded sequences shown in videos directly on live applications. In this paper, we describe the analysis of screen videos to automatically identify and extract user interactions and the construction of visual scripts, which are used to run the application in sync with replaying the video. Currently, a first prototype has been developed to demonstrate the technical feasibility of the approach. The paper provides an overview of the implemented solution concept and discusses technical challenges, open issues, as well as future application scenarios.

Documentation of Machine Learning Software
Yalda Hashemi, Maleknaz Nayebi, and Giuliano Antoniol
(Polytechnique Montréal, Canada)
Machine learning software documentation differs from most of the documentation studied in software engineering research. Often, the users of this documentation are not software experts. The increasing interest in using data science, and in particular machine learning, in different fields has attracted scientists and engineers with various levels of knowledge about programming and software engineering. Our ultimate goal is the automated generation and adaptation of machine learning software documents for users with different levels of expertise. We are interested in understanding the nature and triggers of the problems and the impact of the users’ levels of expertise on the process of documentation evolution. We will investigate Stack Overflow Q&As and classify the documentation-related Q&As within the machine learning domain to understand the types and triggers of the problems as well as the potential change requests to the documentation. We intend to use the results to build on state-of-the-art techniques for automatic documentation generation and to extend the adoption, summarization, and explanation of software functionalities.

Building an Inclusive Distributed Ledger System
Cynthia Dookie
In 2008, Bitcoin disrupted the transactional ecosystem with its value propositions. However, sustaining autonomous, deregulated systems in markets filled with laws and regulations has had its challenges. There are also open questions of scalability, resilience, availability, speed, and finality of transactions. Sound technical components have emerged, but we need to take this one step further and integrate the units to provide an accepted packaged solution. Using a game application as an entry point, we propose a simple, inclusive asset transfer system which will stimulate adoption and create traction in the distributed ledger universe.
