IWSC 2017 – Proceedings

Detecting and Analyzing Code Clones in HDL
Kyohei Uemura, Akira Mori, Kenji Fujiwara, Eunjong Choi, and Hajimu Iida
(NAIST, Japan; National Institute of Advanced Industrial Science and Technology, Japan; National Institute of Technology, Japan)
In this paper, we study code clones in hardware description languages (HDLs) in comparison with general programming languages. For this purpose, we have developed a method for detecting code clones in Verilog HDL. A key idea of the proposed method is to convert the Verilog HDL code into the pseudo C++ code, which is then processed by an existing code clone detector for C++. We conducted an experiment on 10 open source hardware products described in Verilog HDL, where we succeeded in detecting nearly 1,800 clone sets with approximately 80% precision. We compared code clones in Verilog HDL with those in JavaC based on the metrics to identify the differences among languages. We identified patterns on how code clones are created in Verilog HDL, which include cases for increasing stability and capability of parallel processing of the circuit.

Using Compilation/Decompilation to Enhance Clone Detection
Chaiyong Ragkhitwetsagul

and Jens Krinke
(University College London, UK)
We study effects of compilation and decompilation to code clone detection in Java. Compilation/decompilation canonicalise syntactic changes made to source code and can be used as source code normalisation. We used NiCad to detect clones before and after decompilation in three open source software systems, JUnit, JFreeChart, and Tomcat. We filtered and compared the clones in the original and decompiled clone set and found that 1,201 clone pairs (78.7%) are common between the two sets while 326 pairs (21.3%) are only in one of the sets. A manual investigation identified 325 out of the 326 pairs as true clones. The 252 original-only clone pairs contain a single false positive while the 74 decompiled-only clone pairs are all true positives. Many clones in the original source code that are detected only after decompilation are type-3 clones that are difficult to detect due to added or deleted statements, keywords, package names; flipped if-else statements; or changed loops. We suggest to use decompilation as normalisation to compliment clone detection. By combining clones found before and after decompilation, one can achieve higher recall without losing precision.

Info

Rearranging the Order of Program Statements for Code Clone Detection
Yusuke Sabi, Yoshiki Higo

, and Shinji Kusumoto
(Osaka University, Japan)
A code clone is a code fragment identical or similar to another code fragment in source code. Some of code clones are considered as a factor of bug replications and make it more difficult to maintain software. Various code clone detection tools have been proposed so far. However, in most algorithms adopted by existing clone detection tools, if program statements are reordered, they are not detected as code clones. In this research, we examined how clone detection results change by rearranging the order of program statements. We performed preprocessing to rearranging the order of program statements using program dependency graph (PDG). We compared clone detection results with and without preprocessing. As a result, by rearranging the order of program statements, the number of detected code clones is almost the same in most projects. We classified newly detected or disappeared clones manually. From our experimental results, we show that there is no newly detected clone whose statements are reordered and that there are four disappeared clones whose statements are reordered. We think three out of the four clones occurred by copy-and-paste operations. Therefore, we conclude that rearranging the order of program statements is not effective to detect reordered code clones.

Web-Service for Finding Cloned Files using b-Bit Minwise Hashing
Kaoru Ito, Takashi Ishio, and Katsuro Inoue
(Osaka University, Japan)
Source code reuse is a common practice in software development. Since industrial developers may accidentally reuse source files developed by open source software, clone detection tools are used to detect open source files in their closed source project. To execute a clone detection, developers need a database of existing open source software. While a web-service providing clone detection using a centralized database is likely useful, industrial developers are not allowed to submit their source code to a public server on the Internet. To solve the problem, we employ b-bit minwise hashing technique that enables to estimate similarity of documents using only hash values of the documents. Using the method, we implemented a file-clone detection web service; it takes as input a hash value of a source file and returns a list of similar source files in existing open source software. Our hash comparison method is efficient, although an estimated similarity may have a margin of error.

CodeEase: Harnessing Method Clone Structures for Reuse
Shamsa Abid, Salman Javed, Momna Naseem, Suleman Shahid, Hamid Abdul Basit, and Yoshiki Higo

(Lahore University of Management Sciences, Pakistan; Osaka University, Japan)
Searching for code examples on the internet is commonly and frequently performed by software developers but wastes a lot of their time and reduces their productivity. To aid developers with this problem, a system is needed that can allow them to get appropriate code recommendations for reuse within the IDE. In this paper, we present our prototype tool CodeEase, developed as an Eclipse plugin, which generates method recommendations against the code of the developer. The recommendations are based on clone detection and an analysis of Method Clone Structures (MCS) – a type of structural clones- from a large repository of code.

Clone Analysis

Software Clones in Scratch Projects: On the Presence of Copy-and-Paste in Computational Thinking Learning
Gregorio Robles, Jesús Moreno-León, Efthimia Aivaloglou, and Felienne Hermans
(Universidad Rey Juan Carlos, Spain; Programamos.es, Spain; Delft University of Technology, Netherlands)
Computer programming is being introduced in schools worldwide as part of a movement that promotes Computational Thinking (CT) skills among young learners. In general, learners use visual, block-based programming languages to acquire these skills, with Scratch being one of the most popular ones. Similar to professional developers, learners also copy and paste their code, resulting in duplication. In this paper we present the findings of correlating the assessment of the CT skills of learners with the presence of software clones in over 230,000 projects obtained from the Scratch platform. Specifically, we investigate i) if software cloning is an extended practice in Scratch projects, ii) if the presence of code cloning is independent of the programming mastery of learners, iii) if code cloning can be found more frequently in Scratch projects that require specific skills (as parallelism or logical thinking), and iv) if learners who have the skills to avoid software cloning really do so. The results show that i) software cloning can be commonly found in Scratch projects, that ii) it becomes more frequent as learners work on projects that require advanced skills, that iii) no CT dimension is to be found more related to the absence of software clones than others, and iv) that learners -even if they potentially know how to avoid cloning- still copy and paste frequently. The insights from this paper could be used by educators and learners to determine when it is pedagogically more effective to address software cloning, by educational programming platform developers to adapt their systems, and by learning assessment tools to provide better evaluations.

Info

Does Cloned Code Increase Maintenance Effort?
Manishankar Mondal, Chanchal K. Roy, and Kevin A. Schneider
(University of Saskatchewan, Canada)
In-spite of a number of in-depth investigations regarding the impact of clones in the maintenance phase there is no concrete answer to the long lived research question, ``Does the presence of code clones increase maintenance effort?". Existing studies have measured different change related metrics for cloned and non-cloned regions, however, no study calculates the maintenance effort spent for these code regions. In this paper, we perform an in-depth empirical study in order to compare the maintenance efforts required for cloned and non-cloned code. For the purpose of our study we implement a prototype tool which is capable of estimating the effort spent by a developer for changing a particular method. It can also predict effort that might need to be spent for making some changes to a particular method. Our estimation and prediction involve automatic extraction and analysis of the entire evolution history of a candidate software system. We applied our tool on hundreds of revisions of six open source subject systems written in three different programming languages for calculating the efforts spent for cloned and non-cloned code. According to our experimental results: (i) cloned code requires more effort in the maintenance phase than non-cloned code, and (ii) Type 2 and Type 3 clones require more effort compared to the efforts required by Type 1 clones. According to our findings, we should prioritize Type 2 and Type 3 clones when making clone management decisions.

Refactoring Patterns Study in Code Clones during Software Evolution
Jaweria Kanwal, Katsuro Inoue, and Onaiza Maqbool
(Quaid-i-Azam University, Pakistan; Osaka University, Japan)
To investigate how code clones are handled by developers when they perform refactorings during software releases, we performed a longitudinal study on different versions of five Java systems. Our results show that a small proportion of code clones are refactored during the releases and code clones of same clone class are refactored consistently.

Evolution of Code Clone Ratios throughout Development History of Open-Source C and C++ Programs
Anfernee Goon, Yuhao Wu, Makoto Matsushita, and Katsuro Inoue
(University of California at San Diego, USA; Osaka University, Japan)
A code clone is a fragment of code which is duplicated throughout the source code of a project. Code clones have been shown to make a project less maintainable because all code clones will share potential bugs and problems. Unlike other code clone research, this study analyzes the code clone ratios over the entire development lifetime of three open-source projects written in C/C++ to understand code clone growth in software over development and potential developer habits which could affect this growth. The study utilizes CCFinderX and Git to detect clone metrics across development history. The results from each project show very low, stable ratios across development history, with the code clone ratios only fluctuating greatly during the beginning of development mostly and very little refactoring occurring. This study goes further into the potential cause of low ratios and different fluctuations at different periods of development.

A Technique to Detect Multi-grained Code Clones
Yusuke Yuki, Yoshiki Higo

, and Shinji Kusumoto
(Osaka University, Japan)
It is said that the presence of code clones makes software maintenance more difficult. For such a reason, it is important to understand how code clones are distributed in source code. A variety of code clone detection tools has been developed before now. Recently, some researchers have detected code clones from a large set of source code to find library candidates or overlooked bugs.
In general, the smaller the granularity of the detection target is, the longer the detection time. On the other hand, the larger the granularity of the detection target is, the fewer detectable code clones are.
In this paper, we propose a technique that detects in order from coarse code clones to fine-grained ones. In the coarse-to-fine-grained-detections, code fragments detected as code clones at a certain granularity are excluded from detection targets of more fine-grained detections. Our proposed technique can detect code clones faster than fine-grained detection techniques. Besides, it can detect more code clones than coarse detection techniques.

Graph-Based Clone Detection

Enhancing Program Dependency Graph Based Clone Detection using Approximate Subgraph Matching
C. M. Kamalpriya and Paramvir Singh
(Bombardier Transportation, India; NIT Jalandhar, India)
Software code clone detection techniques and tools play a major role in improving the software quality as well as saving maintenance cost and effort. Program Dependency Graph (PDG) based clone detection techniques have a key advantage over other techniques as they are capable of detecting non-contiguous code clones in addition to contiguous clones. We propose further enhancement to current state of the art PDG-based detection to identify all possible (exact and approximate) clone relations from the obtained clone pair (PDG-based) results using Approximate Subgraph Matching (ASM). We obtain clone results of our proposed technique on three subject software systems, and validate the results on eclipse-ant from Bellon’s benchmark. We also present a new ASM-based distance measure to represent the similarity between software code clones.

Rethinking Dependence Clones
Tim A. D. Henderson and Andy Podgurski
(Case Western Reserve University, USA)
Semantic code clones are regions of duplicated code that may appear dissimilar but compute similar functions. Since in general it is algorithmically undecidable whether two or more programs compute the same function, locating all semantic code clones is infeasible. One way to dodge the undecidability issue and find potential semantic clones, using only static information, is to search for recurring subgraphs of a program dependence graph (PDG). PDGs represent control and data dependence relationships between statements or operations in a program. PDG-based clone detection techniques, unlike syntactically-based techniques, do not distinguish between code fragments that differ only because of dependence-preserving statement reorderings, which also preserve semantics. Consequently, they detect clones that are difficult to find by other means. Despite this very desirable property, work on PDG-based clone detection has largely stalled, apparently because of concerns about the scalability of the approach. We argue, however, that the time has come to reconsider PDG-based clone detection, as a part of a holistic strategy for clone management. We present evidence that its scalability problems are not as severe as previously thought. This suggests the possibility of developing integrated clone management systems that fuse information from multiple clone detection methods, including PDG-based ones.

Info

IWSC 2017 – Proceedings

2017 IEEE 11th International Workshop on Software Clones (IWSC)

Preface

Clone Detection and Applications

Clone Analysis

Graph-Based Clone Detection