SANER 2020 Workshops
Workshops of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering (SANER)

2020 IEEE 14th International Workshop on Software Clones (IWSC), February 18, 2020, London, ON, Canada

IWSC 2020 – Proceedings


2020 IEEE 14th International Workshop on Software Clones (IWSC)

Frontmatter

Title Page


Message from the Chairs
Hello, and welcome to the 2020 International Workshop on Software Clones (IWSC), the 14th such workshop since the very first at ICSM 2002 in Montreal. This year, we are co-locating with the 27th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2020) in beautiful London, Ontario, Canada.

Clone Detection and Application

Twin-Finder: Integrated Reasoning Engine for Pointer-Related Code Clone Detection
Hongfa Xue, Yongsheng Mei, Kailash Gogineni, Guru Venkataramani, and Tian Lan
(George Washington University, USA)
Detecting code clones is crucial in various software engineering tasks. In particular, code clone detection can have significant uses in the context of analyzing and fixing bugs in large-scale applications. However, prior works, such as machine learning-based clone detection, may produce a considerable number of false positives. In this paper, we propose Twin-Finder, a novel, closed-loop approach for pointer-related code clone detection that integrates machine learning and symbolic execution techniques to achieve precision. Twin-Finder introduces a clone verification mechanism to formally verify whether two clone samples are indeed clones, and a feedback loop that automatically generates formal rules to tune the machine learning algorithm and further reduce false positives. Our experimental results show that Twin-Finder can swiftly identify up to 9X more code clones compared to a tree-based clone detector, Deckard, and remove an average of 91.69% of false positives.
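The following minimal Python sketch illustrates the general detect-verify-feedback idea described in the abstract; the detect() and verify() helpers are hypothetical stand-ins (the paper uses a machine-learning detector and symbolic execution, neither of which is shown here).

def closed_loop_detection(fragments, detect, verify, max_rounds=3):
    # detect(fragments, exclude) -> set of candidate clone pairs (hypothetical ML-based detector)
    # verify(a, b) -> True if the pair is formally confirmed (stand-in for symbolic execution)
    rejected = set()
    candidates = set()
    for _ in range(max_rounds):
        candidates = detect(fragments, exclude=rejected)
        false_positives = {(a, b) for a, b in candidates if not verify(a, b)}
        if not false_positives:
            break                      # every remaining pair passed verification
        rejected |= false_positives    # feed rejected pairs back to the detector
    return candidates - rejected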

Improving Syntactical Clone Detection Methods through the Use of an Intermediate Representation
Pedro M. Caldeira, Kazunori Sakamoto, Hironori Washizaki, Yoshiaki Fukazawa, and Takahisa Shimada
(ETH Zurich, Switzerland; Waseda University, Japan; GAIO TECHNOLOGY, Japan)
Detection of type-3 and type-4 clones remains a difficult task. Current methods are complex, on both a conceptual and a computational level. Similarly, their usage requires substantial implementation effort. Instead of creating yet another method, it might be more productive to combine the simplicity of syntactic approaches with the abstractions granted by intermediate representations (IR). To this end, we devised a C-like IR based on LLVM and ran NiCad on it (LLNiCad). To establish whether the clone detection capabilities of syntactic approaches can be improved through an IR, we compared NiCad and LLNiCad on three open source projects taken from Krutz's benchmark and a subset of Google Code Jam solutions. In our results, the F1-score of LLNiCad consistently outperforms NiCad's. Indeed, for all clone types in Krutz's benchmark, LLNiCad has an F1-score that is 37% higher than NiCad's, with both better precision and better recall. For type-4 clones in our GCJ benchmark, the F1-score of LLNiCad also outperforms CCCD (a semantic clone detector) by 44%. These findings suggest that IRs are beneficial for improving clone detection and that they have a larger impact on type-3 and type-4 clones.
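As a reminder of how such comparisons are typically scored, the following illustrative Python snippet (not from the paper) computes precision, recall and F1-score of a detector's reported clone pairs against a reference set; the file and function labels are invented.

def precision_recall_f1(detected, reference):
    tp = len(detected & reference)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

reference = {frozenset({"a.c:foo", "b.c:bar"}), frozenset({"a.c:foo", "c.c:baz"})}
tool_a = {frozenset({"a.c:foo", "b.c:bar"})}                                     # finds one of two pairs
tool_b = {frozenset({"a.c:foo", "b.c:bar"}), frozenset({"a.c:foo", "c.c:baz"})}  # finds both
print(precision_recall_f1(tool_a, reference))   # (1.0, 0.5, 0.666...)
print(precision_recall_f1(tool_b, reference))   # (1.0, 1.0, 1.0)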

Evaluating Performance of Clone Detection Tools in Detecting Cloned Cochange Candidates
Md Nadim, Manishankar Mondal, and Chanchal K. Roy
(University of Saskatchewan, Canada)
Code reuse by copying and pasting from one place to another in a codebase is a very common scenario in software development and is also one of the most typical reasons for introducing code clones. Many tools are available to detect such cloned fragments, and many studies have investigated efficient clone detection. Several studies have also evaluated those tools with respect to their clone detection effectiveness. Unfortunately, we find no study that compares different clone detection tools from the perspective of detecting cloned co-change candidates during software evolution. Detecting cloned co-change candidates is essential for clone tracking. In this study, we explore this dimension of code clone research. We used six promising clone detection tools to identify cloned and non-cloned co-change candidates from six C and Java-based subject systems and evaluated the performance of those clone detection tools in detecting the cloned co-change fragments. Our findings show that a good clone detector may not perform well in detecting cloned co-change candidates. The number of unique lines covered by a clone detector and the number of detected clone fragments play an important role in its performance. The findings of this study can enrich a new dimension of code clone research.
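The following hypothetical Python sketch shows one way this evaluation setting can be framed (not the paper's exact procedure): co-change candidates are fragment pairs that change together in a commit, and a detector is credited when it reports such a pair as a clone pair.

def cochange_pairs(commits):
    # commits: list of sets of changed fragment identifiers, one set per commit
    pairs = set()
    for changed in commits:
        changed = sorted(changed)
        for i in range(len(changed)):
            for j in range(i + 1, len(changed)):
                pairs.add(frozenset({changed[i], changed[j]}))
    return pairs

def detection_rate(detected_clone_pairs, cochange):
    covered = sum(1 for pair in cochange if pair in detected_clone_pairs)
    return covered / len(cochange) if cochange else 0.0

commits = [{"A.java:foo", "B.java:bar"}, {"A.java:foo", "C.java:baz"}]
detected = {frozenset({"A.java:foo", "B.java:bar"})}
print(detection_rate(detected, cochange_pairs(commits)))   # 0.5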

Blanker: A Refactor-Oriented Cloned Source Code Normalizer
Davide Pizzolotto and Katsuro Inoue
(Osaka University, Japan)
Refactoring is widely practiced by developers and has become a key factor in increasing the maintainability of software. However, code clones pose a threat to any refactoring process because a developer must edit identical portions of code more than once. Despite the numerous studies on this topic, most results focus on discovering type-3 and type-4 clones, which require higher effort to refactor and remove.
In this paper we present our tool, Blanker, which searches for and unifies equivalent statements available in the language before feeding the source to an existing code clone detector limited to type-2 clones. This acts as a normalization step and produces refactorable results without the errors introduced by potentially unrelated added statements (as in type-3 clones), which would be unsuitable for refactoring purposes, and with added flexibility compared to checking for identical code portions (as in type-2 clones).
We used NiCad to detect clones before and after our normalization step and found up to 10% more type-2 clones after normalization, all of which are refactoring candidates.
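A toy Python sketch of the normalization idea (not Blanker itself): semantically equivalent statement forms are rewritten to a single canonical form before a type-2 detector such as NiCad is run on the result. The rewrite rules below are invented for illustration.

import re

NORMALIZATION_RULES = [
    (re.compile(r"(\w+)\s*\+=\s*1\s*;"), r"\1 = \1 + 1;"),   # i += 1;  ->  i = i + 1;
    (re.compile(r"(\w+)\+\+\s*;"),       r"\1 = \1 + 1;"),   # i++;     ->  i = i + 1;
]

def normalize(source: str) -> str:
    # apply every rule over the whole source text
    for pattern, replacement in NORMALIZATION_RULES:
        source = pattern.sub(replacement, source)
    return source

print(normalize("count++;"))      # count = count + 1;
print(normalize("count += 1;"))   # count = count + 1;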

CPPCD: A Token-Based Approach to Detecting Potential Clones
Yu-Liang Hung and Shingo Takada
(Keio University, Japan)
Most state-of-the-art clone detection approaches are aimed at finding clones accurately and/or efficiently. Yet, whether a code fragment is a clone often varies according to different people’s perspectives and different clone detection tools. In this paper, we present CPPCD (CP-based Potential Clone Detection), a novel token-based approach to detecting potential clones. It generates CP (clone probability) values and CP distribution graphs for developers to decide if a method is a clone. We have evaluated our approach on large-scale software projects written in Java. Our experiments suggest that the majority of clones have CP values greater than or equal to 0.75 and that CPPCD is an accurate (with respect to Type-1, Type-2, and Type-3 clones), efficient, and scalable approach to detecting potential clones.
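The following hypothetical Python sketch illustrates the general idea of scoring methods on a 0-1 scale instead of issuing a binary clone verdict; it uses a simple token-overlap score as a stand-in, which is not how CPPCD computes its CP values.

from collections import Counter

def token_similarity(tokens_a, tokens_b):
    a, b = Counter(tokens_a), Counter(tokens_b)
    overlap = sum((a & b).values())
    total = max(sum(a.values()), sum(b.values()))
    return overlap / total if total else 0.0

def clone_probabilities(methods):
    # methods: dict mapping method name -> list of tokens
    cp = {}
    for name, tokens in methods.items():
        others = [token_similarity(tokens, t) for n, t in methods.items() if n != name]
        cp[name] = max(others, default=0.0)
    return cp

methods = {
    "copy":  ["for", "i", "in", "range", "n", "dst", "i", "=", "src", "i"],
    "copy2": ["for", "i", "in", "range", "n", "out", "i", "=", "inp", "i"],
    "sum":   ["total", "=", "0", "for", "x", "in", "xs", "total", "+=", "x"],
}
print(clone_probabilities(methods))   # copy/copy2 score high, sum scores lower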

Clone Analysis

An Empirical Study on Accidental Cross-Project Code Clones
Mitchel Pyl, Brent van Bladel, and Serge Demeyer
(University of Antwerp, Belgium)
Software clones are considered a code smell in software development. While most clones occur due to developers' copy-paste behaviour, some arise accidentally as a symptom of coding idioms. If such accidental clones occur across projects, they may indicate a lack of abstraction in the underlying programming language or libraries. In this research, we study accidental cross-project clones from the perspective of missing abstraction. We discuss six cases of frequent cross-project clones: three of them are symptoms of missing language features (which have been resolved with the release of Java 7 and Java 12), and two are symptoms of missing library features (which have not yet been addressed).

Clone Detection on Large Scala Codebases
Wahidur Rahman, Yisen Xu, Fan Pu, Jifeng Xuan, Xiangyang Jia, Michail Basios, Leslie Kanthan, Lingbo Li, Fan Wu, and Baowen Xu
(Imperial College London, UK; Wuhan University, China; Turing Intelligence Technology, UK; Nanjing University, China)
Code clones are identical or similar code segments. The wide existence of code clones can increase the cost of maintenance and jeopardise the quality of software. The research community has developed many techniques to detect code clones; however, there is little evidence of how these techniques perform in industrial use cases. In this paper, we aim to uncover the differences when such techniques are applied in industrial use cases. We conducted large-scale experimental research on the performance of two state-of-the-art code clone detection techniques, SourcererCC and AutoenCODE, on both open source projects and an industrial project written in the Scala language. Our results reveal that both algorithms perform differently on the industrial project, with the largest drop in precision being 30.7% and the largest increase in recall being 32.4%. By having its developers manually label samples of the industrial project, we discovered that there are substantially fewer Type-3 clones in that project than in the open source projects.

Comparison and Visualization of Code Clone Detection Results
Kazuki Matsushima and Katsuro Inoue
(Osaka University, Japan)
Many techniques for code clone detection have been proposed and implemented as clone detectors in the past. These studies show that the results of code clone detection change drastically across different tools and/or their detection parameters. It is therefore important to apply different clone detectors or different parameters and to identify the differing or common parts of the obtained detection results. In this paper, we propose a method for comparing and visualizing detection results based on the correspondence of clone pairs. It enables developers to compare detection results from different tools and/or detection parameters. Using this method, we show comparison results for an open source system using two code clone detectors, CCFinderX and NiCad.
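A minimal Python sketch of the comparison idea, assuming each detector's output is reduced to a set of clone pairs and pairs are matched exactly (the paper's correspondence of clone pairs is more tolerant than exact matching); the tool names and line ranges are illustrative.

def compare(results_a, results_b):
    common = results_a & results_b
    union = results_a | results_b
    return {
        "common": common,                       # pairs reported by both tools
        "only_a": results_a - results_b,        # pairs unique to the first tool
        "only_b": results_b - results_a,        # pairs unique to the second tool
        "agreement": len(common) / len(union) if union else 1.0,
    }

ccfinderx = {frozenset({"A.java:10-40", "B.java:5-35"}),
             frozenset({"A.java:50-70", "C.java:12-32"})}
nicad     = {frozenset({"A.java:10-40", "B.java:5-35"})}
print(compare(ccfinderx, nicad)["agreement"])   # 0.5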

Clone Swarm: A Cloud Based Code-Clone Analysis Tool
Venkat Bandi, Chanchal K. Roy, and Carl Gutwin
(University of Saskatchewan, Canada)
A code clone is defined as a pair of similar code fragments within a software system. While code clones are not always harmful, they can have a detrimental effect on the overall quality of a software system due to the propagation of bugs and other maintenance implications. Because of this, software developers need to analyse the code clones that exist in a software system. However, despite the availability of several clone detection systems, the adoption of such tools outside of the clone community remains low. A possible reason for this is the difficulty and complexity involved in setting up and using these tools. In this paper, we present Clone Swarm, a code clone analytics tool that identifies clones in a project and presents the information in an easily accessible manner. Clone Swarm is publicly available and can mine any open source Git repository. Clone Swarm internally uses NiCad, a popular clone detection tool, in the cloud and lets users interactively explore code clones using a web-based interface at multiple granularity levels (function and block level). Clone results are visualized in multiple overviews, all the way from a high-level plot down to an individual line-by-line comparison view of cloned fragments. Also, to facilitate future research in the area of clone detection and analysis, users can directly download the clone detection results for their projects. Clone Swarm is available online at clone-swarm.usask.ca. The source code for Clone Swarm is freely available under the MIT license on GitHub.


Semantic Clone Detection

SemanticCloneBench: A Semantic Code Clone Benchmark using Crowd-Source Knowledge
Farouq Al-omari, Chanchal K. Roy, and Tonghao Chen
(University of Saskatchewan, Canada)
Not only newly proposed code clone detection techniques, but also existing techniques and tools, need to be evaluated and compared. This evaluation process can be done by assessing the reported clones manually or by using benchmarks. The main limitations of available benchmarks include: they are restricted to one programming language; they have a limited number of clone pairs that are confined to the selected system(s); they require manual validation; they do not support all types of code clones. To overcome these limitations, we propose a methodology to generate a wide range of semantic clone benchmarks for different programming languages with minimal human validation. Our technique is based on the knowledge provided by developers who participate in the crowd-sourced information website Stack Overflow. We applied automatic filtering, selection and validation to the source code in Stack Overflow answers. Finally, we built a semantic code clone benchmark of 4000 clone pairs for the languages Java, C, C# and Python.
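A hypothetical Python sketch of the selection step, assuming answers to a single Stack Overflow question are available locally (e.g., from a data dump): well-received answers containing code are paired as semantic clone candidates, since they implement the same functionality in different ways. The field names and score threshold are invented for illustration.

def clone_candidates(answers, min_score=3):
    # keep answers that are well received and actually contain code
    kept = [a for a in answers if a["score"] >= min_score and a.get("code")]
    # every pair of surviving answers to the same question is a candidate semantic clone pair
    return [(kept[i]["code"], kept[j]["code"])
            for i in range(len(kept)) for j in range(i + 1, len(kept))]

answers = [
    {"score": 12, "code": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n): a, b = b, a + b\n    return a"},
    {"score": 7,  "code": "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)"},
    {"score": 1,  "code": None},
]
print(len(clone_candidates(answers)))   # 1 candidate pair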

Towards Semantic Clone Detection via Probabilistic Software Modeling
Hannes Thaller, Lukas Linsbauer, and Alexander Egyed
(JKU Linz, Austria)
Semantic clones are program components with similar behavior, but different textual representation. Semantic similarity is hard to detect, and semantic clone detection is still an open issue. We present semantic clone detection via Probabilistic Software Modeling (PSM) as a robust method for detecting semantically equivalent methods. PSM inspects the structure and runtime behavior of a program and synthesizes a network of Probabilistic Models (PMs). Each PM in the network represents a method in the program and is capable of generating and evaluating runtime events. We leverage these capabilities to accurately find semantic clones. Results show that the approach can detect semantic clones in the complete absence of syntactic similarity with high precision and low error rates.
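The following toy Python snippet is not PSM, but it illustrates the underlying intuition of comparing methods by runtime behavior rather than by text: two methods are treated as behaviorally similar if they agree on randomly sampled inputs.

import random

def behaviorally_similar(f, g, samples=1000, tolerance=0.99):
    # sample integer inputs and measure how often the two methods agree
    inputs = [random.randint(-1000, 1000) for _ in range(samples)]
    agree = sum(1 for x in inputs if f(x) == g(x))
    return agree / samples >= tolerance

def double_a(x): return x * 2    # textually different from double_b,
def double_b(x): return x + x    # but behaviorally identical
def square(x): return x * x

print(behaviorally_similar(double_a, double_b))   # True
print(behaviorally_similar(double_a, square))     # False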
