IWSC 2018 – Proceedings

Large Scale Clone Detection, Analysis, and Benchmarking: An Evolutionary Perspective (Keynote)
Chanchal K. Roy
(University of Saskatchewan, Canada)
Copying a code fragment and then reusing it by pasting and adapting (e.g., adding/modifying/deleting statements) is a common practice in software development, which results in a significant amount of duplicated code in software systems. Developers consider cloning as one of the principled reengineering approaches and often intentionally practice cloning for a variety of reasons such as faster development, avoiding risk by reusing stable old code, or time pressure. On the other hand, duplicated code poses a number of threats to the maintenance of software systems. Clones are the #1 “bad smell” in Fowler’s refactoring list, and several recent studies, including studies with industrial systems, show that although in many cases clones are not really harmful, and could even be useful in some cases, they can also be detrimental to software maintenance. For example, reusing a fragment containing unknown bugs may result in bug propagation, and any change in requirements involving a cloned fragment may lead to changes to all the fragments similar to it, multiplying the work to be done. Furthermore, inconsistent changes to the cloned fragments during any updating process may lead to severe unexpected behaviour. Software clones are thus considered to be one of the major contributors to the high software maintenance cost, which could be up to 80% of total software development cost. The era of Big Data has introduced new applications for clone detection. For example, clone detection has been used to find similar mobile applications, to intelligently tag code snippets, to identify code examples, and so on from large inter-project repositories. The dual role of clones in software development and maintenance, along with these many emerging new applications of clone detection, has led to a great many clone detection tools and analysis frameworks.
In this keynote talk, I will review the cloning literature to date. In particular, I will talk about our recent work on large scale clone detection, the challenges in evaluating such clone detectors, and how we have overcome them, at least in part, with our BigCloneBench and Mutation framework. I will then talk about the recent advances in clone analysis and management along with a vision for a comprehensive clone management system.

Clone Analysis

Are There Functionally Similar Code Clones in Practice?
Verena Käfer, Stefan Wagner, and Rainer Koschke
(University of Stuttgart, Germany; University of Bremen, Germany)
Having similar code fragments, also called clones, in software systems can lead to unnecessary comprehension, review and change efforts. Syntactically similar clones can often be encountered in practice. The same is not clear for clones that are only functionally similar (FSCs). We conducted an exploratory survey among developers to investigate whether they encounter functionally similar clones in practice and whether their inclination to remove them differs from their inclination to remove syntactically similar clones. Of the 34 developers answering the survey, 31 have experienced FSCs in their professional work, and 24 have experienced problems caused by FSCs. We found no difference in the inclination and reasoning for removing FSCs and syntactically similar clones. FSCs exist in practice and should be investigated to bring clone detectors to the same quality as for syntactically similar clones, because being able to detect them allows developers to manage and potentially remove them.

Structural Clones: An Evolution Perspective
Jaweria Kanwal, Hamid Abdul Basit, and Onaiza Maqbool
(Quaid-i-Azam University, Pakistan; Lahore University of Management Sciences, Pakistan)
Structural clones are recurring patterns of simple code clones in software that represent a bigger picture of similarity in software (e.g., software design). Elevating the analysis of cloning to the structural clone level helps in better clone management in terms of clone understanding, maintenance and evolution. In this paper, we propose a systematic approach to study structural clone evolution in software versions. We use our approach to analyze the evolutionary behavior of structural clones and also compare it with the evolution of simple clones. We performed experiments on different versions of three Java systems. Our analysis of structural clone evolution reveals interesting evolutionary characteristics of clones. For example, one finding is that simple clones are changed more frequently than structural clones, whereas the average lifetime of structural clones is shorter than that of simple clones. The study of clone evolution is helpful for identifying maintenance implications of clones and for devising better clone management systems.

Generated Code in Studies on Clone Rates
Rainer Koschke and Moritz Weinig
(University of Bremen, Germany)
Various earlier studies have measured clone rates for diverse projects. One of the reasons for exceptionally high clone rates for individual source files was found to be auto-generated code. Automatically generated code is generally not maintained and, hence, should be excluded from clone-rate measurements. This kind of code might even introduce a bias to clone rates of projects when there is a large amount of generated code and clone rates for generated files generally deviate from the average clone rate for handwritten code. While some generated files stood out with above-average clone rates in earlier studies, we do not know whether this is generally the case and how much code is actually generated automatically.
This paper investigates the amount of generated files in projects, whether clone rates for generated files really differ from handwritten code, and, overall, whether generated code in fact introduces a bias to clone rates. We heuristically detect generated files in a very large open-source project corpus of programs written in C, C++, C#, or Java and report the number of projects with generated code. For these projects, we compare clone rates of generated and handwritten files.
Our results show higher clone rates for generated files. Moreover, when we aggregate clone rates from files to projects, the clone rates of projects with at least one generated file are also slightly higher than in projects for which no generated files were detected. Our results suggest that researchers should indeed take special care to exclude generated code in studies on clone rates.

Cloning Applications: Code Generation and Software Quality Metrics

On the Characteristics of Buggy Code Clones: A Code Quality Perspective
Md. Rakibul Islam and Minhaz F. Zibran
(University of New Orleans, USA)
Code clones are an extensively studied code smell. Not all the clones in a software system are equally harmful. Earlier work studied various traits of clones, including their stability and their relationships with program faults, in comparison with non-cloned code. This paper presents a comparative study on the characteristics of buggy and non-buggy clones from a code quality perspective.
In the light of 29 code quality metrics, we study buggy and non-buggy clones in 2,077 revisions of three software systems written in Java. The findings from this work add to the characterization of buggy clones. Such a characterization will be useful in cost-effective clone management and clone-aware software development.

Towards Automated Generation of Java Methods: A Way of Automated Reuse-Based Programming
Kento Shimonaka, Yoshiki Higo, Junnosuke Matsumoto, Keigo Naitou, and Shinji Kusumoto
(Osaka University, Japan)
Automatic programming has been researched for a long time. A variety of methodologies have been proposed. However, they have limited applicability, or they can generate only a few lines of code. In this research, the authors are trying to generate source code of Java methods based on their specifications. In this paper, we propose a reuse-based code generation technique that takes a method signature and test cases as input. First, our technique searches for existing Java methods whose signatures are the same as the one input by a user. Then, our technique reworks each of them by using test cases input by the user. Methods passing all the test cases are given to the user. At this moment, the authors have implemented a naive prototype and conducted experiments with four open source software systems. In total, our technique succeeded in generating 18 Java methods. In this paper, we also introduce some actual examples of generated Java methods and some ideas to enhance our technique.
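The search-then-filter idea in this abstract can be illustrated with a minimal sketch. It is not the authors' prototype: it is written in Python rather than Java, it approximates "same signature" by parameter count only, and it omits the step that reworks candidates. The names `matches_signature` and `select_candidates` and the toy candidate functions are hypothetical.

```python
import inspect


def matches_signature(func, param_count):
    """Crude stand-in for a signature match: compare parameter counts only (assumption)."""
    return len(inspect.signature(func).parameters) == param_count


def select_candidates(candidates, param_count, test_cases):
    """Return the candidates that match the signature and pass every test case.

    test_cases is a list of (args_tuple, expected_result) pairs.
    """
    passing = []
    for func in candidates:
        if not matches_signature(func, param_count):
            continue
        try:
            if all(func(*args) == expected for args, expected in test_cases):
                passing.append(func)
        except Exception:
            # A candidate that crashes on a test case is simply discarded.
            continue
    return passing


# Hypothetical pool of existing methods to reuse.
def add(a, b):
    return a + b


def sub(a, b):
    return a - b


def neg(a):
    return -a


def div(a, b):
    return a // b


tests = [((0, 0), 0), ((2, 3), 5)]
result = select_candidates([add, sub, neg, div], 2, tests)
print([f.__name__ for f in result])
```

Here only `add` survives: `neg` has the wrong parameter count, `sub` fails the second test, and `div` raises an exception on the first.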

Correlation Analysis between Code Clone Metrics and Project Data on the Same Specification Projects
Yoshiki Higo, Shinsuke Matsumoto, Shinji Kusumoto, Takashi Fujinami, and Takashi Hoshino
(Osaka University, Japan; NTT, Japan)
The presence of code clones is pointed out as a factor that makes software maintenance more difficult. On the other hand, some research studies reported that only a small part of code clones requires simultaneous changes and that their negative influence on software maintenance is limited. Besides, some other studies reported that code clones often have positive effects on software development. Currently, the authors are exploring the effect of clones on software development and maintenance. In this paper, the authors report their exploratory results on the relationship between clone metrics and project data such as the number of test cases and the number of found bugs. The targets of this exploration are nine web-based software systems. Interestingly, all of them were developed based on the same specification. In other words, they are functionally the same software systems. By targeting such projects, we can explore how implementation differences affect software development. Our results indicate that unit, integration, and system testing become more difficult when many clones exist in a project.

Clone Detection Techniques and Clone Visualization

A Picture Is Worth a Thousand Words: Code Clone Detection Based on Image Similarity
Chaiyong Ragkhitwetsagul, Jens Krinke, and Bruno Marnette
(University College London, UK; Prodo, UK)
This paper introduces a new code clone detection technique based on image similarity. The technique captures the visual perception of code as seen by humans in an IDE by applying syntax highlighting and image conversion to raw source code text. We compared two similarity measures, Jaccard and earth mover's distance (EMD), for our image-based code clone detection technique. Jaccard similarity offered better detection performance than EMD. The F1 score of our technique on detecting Java clones with pervasive code modifications is comparable to five well-known code clone detectors: CCFinderX, Deckard, iClones, NiCad, and Simian. A Gaussian blur filter is chosen as a normalisation technique for type-2 and type-3 clones. We found that blurring code images before similarity computation resulted in higher precision and recall. The detection performance after including the blur filter increased by 1 to 6 percent. The manual investigation of clone pairs in three software systems revealed that our technique, while it missed some of the true clones, could also detect additional true clone pairs missed by NiCad.
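As a rough illustration of the Jaccard measure mentioned in this abstract, the following sketch computes |A ∩ B| / |A ∪ B| over the lit pixels of two tiny bitmaps. How the paper actually represents code images is not stated here, so the 0/1 bitmap encoding, the helper names, and the treatment of two empty images are all assumptions.

```python
def to_pixel_set(bitmap):
    """Turn a 2D list of 0/1 values into the set of (row, col) positions that are lit."""
    return {(r, c) for r, row in enumerate(bitmap) for c, v in enumerate(row) if v}


def jaccard_similarity(pixels_a, pixels_b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    a, b = set(pixels_a), set(pixels_b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty images count as identical
    return len(a & b) / len(union)


# Two hypothetical 2x3 "code images" that differ in one lit pixel.
img_a = [[1, 1, 0],
         [0, 1, 0]]
img_b = [[1, 1, 0],
         [0, 0, 1]]

sim = jaccard_similarity(to_pixel_set(img_a), to_pixel_set(img_b))
print(sim)  # 2 shared pixels out of 4 distinct lit pixels -> 0.5
```

A set-based measure like this is sensitive to small pixel shifts, which is one plausible reason a blur step before comparison could help, though the sketch does not model blurring.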

Detecting Functionally Similar Code within the Same Project
Ryo Tajima, Masataka Nagura, and Shingo Takada
(Keio University, Japan; Nihon University, Japan)
Multiple developers often take part in a software development project. Although these developers collaborate on development within the same project, each developer writes code on their own. This may lead to duplicate or similar code appearing in different parts of the software. Such code should be removed to improve maintainability. This paper proposes an approach to automatically detect such code, which we shall call functionally similar code. The unit of detection is the method, and we focus on input/output and the method structure using a program dependence graph. We show the results of applying our approach to open source software.

Towards Slice-Based Semantic Clone Detection
Hakam W. Alomari and Matthew Stephan
(Miami University, USA)
This paper presents our proposed approach for detecting code clones based on similar slices of different versions of large software systems. We begin by presenting our initial thoughts on realizing software slice clone detection. We describe our initial results obtained by means of scripts to identify clones at different levels of granularity. The clones between versions are represented as pairs of cloned slices. Our results include a case study of over 191 versions of the Linux kernel, spanning over 10 years. In the near future, we plan on experimenting with established clone detectors to realize a complete and robust analysis approach.

Code Difference Visualization by a Call Tree
Toshihiro Kamiya
(Shimane University, Japan)
Understanding modifications to a software product is essential in software maintenance. To help programmers understand modifications, especially how code changes in a refactoring, this paper presents a semi-automated dynamic analysis to compare two revisions of a product. The approach detects “similar but different” sub-tree pairs between call trees from execution traces of the two revisions and then draws up a call graph of the pairs. In addition, pruning techniques or heuristics are used to make the graph smaller and easier to understand.

2018 IEEE 12th International Workshop on Software Clones (IWSC)
