IWSC 2012 – Proceedings

Foreword
Software clones are identical or similar pieces of code or design. Clones are known to be closely related to various issues in software engineering, such as software quality, complexity, architecture, refactoring, evolution, licensing, and plagiarism. Various characteristics of software systems can be uncovered through clone analysis, and system restructuring can be performed by merging clones.
The purpose of this workshop is to continue to solidify and give shape to this research/application area and community. More specifically, the goals are to bring together academic and industrial researchers and practitioners from around the world to evaluate the current state of research and applications, discuss common problems, discover new opportunities for collaboration, exchange ideas, and envision new areas of research and applications.
In this, the 6th international clone workshop, we will discuss issues in software clone detection, analysis and management, as well as applications to software engineering contexts that can benefit from knowledge of clones. As befits a maturing workshop, we have this year added independent program committee chairs to oversee the technical program, and we are pleased that Chanchal Roy and Jens Krinke have agreed to serve in that capacity.

Technical Papers

An Accurate Estimation of the Levenshtein Distance Using Metric Trees and Manhattan Distance
Thierry Lavoie and Ettore Merlo
(École Polytechnique de Montréal, Canada)
This paper presents an original clone detection technique which is an accurate approximation of the Levenshtein distance. It uses groups of tokens extracted from source code called windowed-tokens. From these, frequency vectors are then constructed and compared with the Manhattan distance in a metric tree. The goal of this new technique is to provide a very high precision clone detection technique while keeping a high recall. Precision and recall are measured with respect to the Levenshtein distance. The test bench is a large-scale open-source software system. The collected results proved the technique to be fast, simple, and accurate. Finally, this article presents further research opportunities.
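The comparison step described in the abstract above can be illustrated with a minimal sketch. It assumes a window is simply a slice of a token list and a frequency vector is a count of token types; the function names and the token spellings are hypothetical, and the metric tree used for fast lookup in the paper is left out.

```python
from collections import Counter


def window_vector(tokens, start, size):
    """Frequency vector (a Counter) of the token types in one window."""
    return Counter(tokens[start:start + size])


def manhattan(u, v):
    """L1 distance between two sparse frequency vectors."""
    # A Counter returns 0 for a missing key, so absent token types count as 0.
    return sum(abs(u[k] - v[k]) for k in u.keys() | v.keys())


# Two hypothetical token windows that differ by one substituted operator.
a = ["id", "=", "id", "+", "num", ";"]
b = ["id", "=", "id", "*", "num", ";"]
va = window_vector(a, 0, 6)
vb = window_vector(b, 0, 6)
print(manhattan(va, vb))  # 2
```

A single token substitution changes at most two counts by one each, so here a Levenshtein distance of 1 shows up as a Manhattan distance of 2. Frequency vectors ignore token order, which is why the result is only an estimate of the edit distance.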

A Novel Approach Based on Formal Methods for Clone Detection
Antonio Cuomo, Antonella Santone, and Umberto Villano
(University of Sannio, Italy)
This paper presents an approach based on formal methods for detecting code clones. The methodology followed performs the analysis on Java bytecode, which is transformed into CCS (Calculus of Communicating Systems) processes which are successively checked for equivalence. A prototype tool targeted at the detection of Type 2 clones is presented. The experiments conducted on programs of different size assess the validity of the proposed approach, pointing out possible improvements for future research.

Claims and Beliefs about Code Clones: Do We Agree as a Community? A Survey
Debarshi Chatterji, Jeffrey C. Carver, and Nicholas A. Kraft
(University of Alabama, USA)
Research on code clones and their impact on software development has been increasing in recent years. There are a number of potentially competing claims among members of the community. There is currently not enough empirical evidence to provide concrete information about these claims. This paper presents the results of a survey of members of the code clone community. The goal of the survey was to determine the level of agreement of community members regarding some key topics. While the results showed a good bit of agreement, there was no universal consensus on all topics. Survey respondents were not in complete agreement about the definitions of Type III and Type IV clones. The survey respondents were more uncertain about how developers behave when working with clones. From the survey it is clear that there are areas where more empirical research is needed to better understand how to effectively work with clones.

Clone Detection Using Rolling Hashing, Suffix Trees and Dagification: A Case Study
Mikkel Jønsson Thomsen and Fritz Henglein
(University of Copenhagen, Denmark)
Microsoft Dynamics NAV is a widely used enterprise resource planning system for small and medium-sized enterprises that, by design, encourages rapid customization by copy-paste programming. We report the results of analyzing clone detection for NAV using two previously published methods and one new algorithmic method: character-based sliding window sampling using Rabin-Karp hashing (MOSS), line-based sequence matching using suffix trees (CodeDup), and abstract-syntax-tree based graph sharing analysis (XMLClone). The latter is piggybacked on XMLStore, which stores XML trees as directed acyclic graphs (dags) where all isomorphic subtrees are identified and coalesced into single nodes, which can be done in linear time using multiset discrimination. This dagification discovers all well-formed Type-1 and, with suitable input normalization, Type-2 clones. We find that the subsequent dag analysis to discover Type-3 clones performs well on NAV source code, both in terms of computational complexity and precision. This suggests that efficient dagification and independently configurable dag interpretation may be valuable ingredients for modular clone detection.
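The idea of coalescing isomorphic subtrees into shared dag nodes can be sketched as follows. This is not the XMLStore implementation: it swaps the multiset discrimination mentioned in the abstract for a plain dictionary lookup (hash-consing), and the tuple-based tree encoding and the name `dagify` are assumptions made for illustration.

```python
def dagify(tree, table):
    """Return a canonical node id for `tree`; equal subtrees share one id.

    A tree is a tuple (label, child, child, ...). `table` maps
    (label, child_id, ...) keys to node ids and is filled in as we go.
    """
    label, *children = tree
    key = (label, *(dagify(child, table) for child in children))
    if key not in table:
        table[key] = len(table)  # first time this shape is seen
    return table[key]


# A hypothetical syntax tree with 7 nodes containing one repeated statement.
table = {}
tree = ("block",
        ("assign", ("var",), ("num",)),
        ("assign", ("var",), ("num",)))
root = dagify(tree, table)
print(len(table))  # 4
```

The two `assign` subtrees receive the same node id, so any dag node reached from more than one parent position marks an exact (Type-1) repetition; normalizing labels before calling `dagify` would extend this to renamed copies.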

Dispersion of Changes in Cloned and Non-cloned Code
Manishankar Mondal, Chanchal K. Roy, and Kevin A. Schneider
(University of Saskatchewan, Canada)
Currently, the impacts of clones on software maintenance activities are being investigated by different researchers in different ways. Comparative stability analysis of cloned and non-cloned regions of a subject system is a well-known way of measuring these impacts, where the hypothesis is that the more stable a region is, the less harmful it is for maintenance. Each of the existing stability measurement methods fails to address one important characteristic, dispersion, of the changes happening in the cloned and non-cloned regions of software systems. Change dispersion of a particular region quantifies the extent to which the changes are scattered over that region. The intuition is that more dispersed changes require more effort to be spent in the maintenance phase.
Measurement of dispersion requires the extraction of method genealogies. In this paper, we have measured the dispersions of changes in cloned and non-cloned regions of several subject systems using a concurrent and robust framework for method genealogy extraction. We implemented the framework on the Actor Architecture platform, which facilitates coarse-grained parallelism with asynchronous message passing capabilities. Our experimental results with 12 open-source subject systems written in three different programming languages (Java, C and C#) using two clone detection tools suggest that the changes in cloned regions are more dispersed than the changes in non-cloned regions. Also, Type-3 clones exhibit more dispersion than Type-1 and Type-2 clones. The subject systems written in Java and C show higher dispersions as well as increased maintenance efforts compared to the subject systems written in C#.

Java Bytecode Clone Detection via Relaxation on Code Fingerprint and Semantic Web Reasoning
Iman Keivanloo, Chanchal K. Roy, and Juergen Rilling
(Concordia University, Canada; University of Saskatchewan, Canada)
While finding clones in source code has drawn considerable attention, there has been very little work on finding similar fragments in binary code and intermediate languages, such as Java bytecode. Some recent studies showed that it is possible to find distinct sets of clone pairs in the bytecode representation of source code, which are not always detectable at the source code level. In this paper, we present a bytecode clone detection approach, called SeByte, which exploits the benefits of compilers (the bytecode representation) for detecting a specific type of semantic clones in Java bytecode. SeByte is a hybrid metric-based approach that takes advantage of both Semantic Web technologies and set theory. We use a two-step analysis process: (1) pattern matching via Semantic Web querying and reasoning, and (2) content matching, using the Jaccard coefficient for set similarity measurement. Semantic Web-based pattern matching helps us to find method blocks which share similar patterns even in case of extreme dissimilarity (e.g., numerous repetitions or large gaps). Although it leads to high recall, it yields a high false positive rate. We thus use content matching (via Jaccard) to reduce the false positive rate by focusing on content semantic resemblance. Our evaluation of four Java systems and five other tools shows that SeByte can detect a large number of semantic clones that are either not detected or not supported by source code based clone detectors.
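The content matching step relies on the Jaccard coefficient, which can be sketched in a few lines. The abstract does not say what the compared sets contain, so the example below assumes, purely for illustration, that each method block is summarized by the set of methods it calls; the function name and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections viewed as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


# Hypothetical call sets extracted from two bytecode method blocks.
m1 = {"StringBuilder.append", "PrintStream.println", "String.length"}
m2 = {"StringBuilder.append", "PrintStream.println", "Math.max"}
print(jaccard(m1, m2))  # 0.5
```

A candidate pair reported by the pattern matching step would then be kept only if its coefficient exceeds some threshold, which is how a second, stricter measure can trim false positives.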

Mining Object-Oriented Design Models for Detecting Identical Design Structures
Umut Tekin, Ural Erdemir, and Feza Buzluca
(BILGEM, Turkey; Istanbul Technical University, Turkey)
Object-oriented design is the most popular design methodology of the last twenty-five years. Several design patterns and principles are defined to improve the design quality of object-oriented software systems. In addition, designers can use unique design motifs which are particular to the specific application domain. Another common habit is cloning and modifying some parts of the software while creating new modules. Therefore, object-oriented programs can include many identical design structures. This work proposes a sub-graph mining based approach to detect identical design structures in object-oriented systems. By identifying and analyzing these structures, we can obtain useful information about the design, such as commonly-used design patterns, most frequent design defects, domain-specific patterns, and design clones, which may help developers to improve their knowledge about the software architecture. Furthermore, problematic parts of frequent identical design structures are appropriate refactoring opportunities because they affect multiple areas of the architecture. Experiments with several open-source projects show that we can successfully find many identical design structures in each project. We observe that most of the identical structures are usually implementations of common design patterns; however, we also detect various anti-patterns, domain-specific patterns, and design-level clones.

Safe Clone-Based Refactoring through Stereotype Identification and Iso-Generation
Nic Volanschi
(Metaware Technologies, France)
Most advanced existing tools for clone-based refactoring propose a limited number of pre-defined clone-removal transformations that can be applied automatically, typically under user control. This fixed set of refactorings usually guarantees that semantics is preserved, but is inherently limited to generally-applicable transformations (extract method, pull-up method, etc.). This tool design rules out many potential domain-specific or application-specific clone removals. Such cases are ordinarily recognized by humans as stereotypes derived from a higher-level concept and manually replaced with an appropriate abstraction. Thus, in current tools, generality is sacrificed for the safety of the transformation. This paper proposes an alternative approach, in which the spectrum of refactoring techniques is open, including manual interventions, while keeping strong safety guarantees based on the notion of iso-generation. Our method can operate on multiple languages and has been prototyped on a subset of a real-world legacy asset containing C and COBOL programs, with promising results.

Industrial Experience Papers

A Method for Proactive Moderation of Code Clones in IDEs
Radhika D. Venkatasubramanyam, Himanshu Kumar Singh, and K. Ravikanth
(Siemens, India)
Duplicating code and modifying it is a useful convenience when editing within an IDE. This sequence of operations, termed copy-paste-modify, has the downside of proliferating “nearly identical” code segments or code clones and could lead to rapid degeneration of code. Although techniques for proactive identification of clones and differences between them have been studied, no clear method to control clone formation, based on “acceptability criteria,” is known. In this paper, we present a technique to moderate the genesis of clones through copy-paste-modify operations. Our approach is guided by associating constraints formulated from predefined guidelines, and checking for their satisfaction at the time of copy and upon modification. By encoding “acceptability criteria” as constraints, our approach provides the means necessary for controlled creation of clones.

Industrial Application of Clone Change Management System
Yuki Yamanaka, Eunjong Choi, Norihiro Yoshida, Katsuro Inoue, and Tateki Sano
(Osaka University, Japan; Nara Institute of Science and Technology, Japan; NEC, Japan)
Clone change management is one of the crucial issues in open source software (OSS) development as well as in industrial software development (e.g., development of social infrastructure, financial systems, and medical equipment). When an industrial developer fixes a defect, he/she has to find the code clones corresponding to the code fragment that contains it. So far, several studies have been performed on the analysis of clone evolution in OSS. However, to our knowledge, few studies have been reported on the application of a clone change management system to an industrial development process. In this paper, we propose a clone change management system based on the categorization of clone evolution, and then present a case study of its industrial application. In the case study, we confirmed that the proposed system suggested two unintentionally developed clones in half a month.

Short Papers

A Common Conceptual Model for Clone Detection Results
Cory J. Kapser, Jan Harder, and Ira Baxter
(Techtonic Arts, Canada; University of Bremen, Germany; Semantic Designs, USA)
As the field of code clone research grows, the continuing problem of interoperability between code clone detection and analysis tools grows with it. As a step toward solving this problem, this paper presents a draft proposal for a generic model of code clone detection results. Using an online wiki, we hope to generate discussion and solidify a shared understanding of the core concepts of the problem domain, enabling us to ultimately develop a generic data exchange format.

Conte*t Clones or Re-thinking Clone on a Call Graph
Toshihiro Kamiya
(Future University Hakodate, Japan)
To improve clone-detection methods and to enable new analysis methods with clone detection, e.g., to detect a wider range of bad smells and anti-patterns, this paper introduces the concept of a context clone, in a form comparable to the (traditional) content clone. A context clone is based on the similarity of the contexts in which a code fragment is used, instead of the similarity of the code fragments themselves. This paper includes an explanation of context clones, research questions about the context clone, expected use of the context clone in a mixed way with the content clone, and an actual example of a context clone with a prototype tool.

Filtering Clones for Individual User Based on Machine Learning Analysis
Jiachen Yang, Keisuke Hotta, Yoshiki Higo, Hiroshi Igaki, and Shinji Kusumoto
(Osaka University, Japan)
Results from code clone detectors may contain many useless code clones, and the judgment of whether a code clone is useful varies from user to user depending on their purposes. We are planning a system that learns the judgment of each individual user by applying machine learning algorithms to code clones. In this paper, we describe why individual judgment should be respected and how this can be done.

Near-Miss Model Clone Detection for Simulink Models
Manar H. Alalfi, James R. Cordy, Thomas R. Dean, Matthew Stephan, and Andrew Stevenson
(Queen's University, Canada)
This paper describes our plan to adapt mature code-based clone detection techniques to the efficient identification of near-miss clones in models. Our goal is to leverage successful source text-based clone detection techniques by transforming graph-based models to normalized text form in order to capture semantically meaningful near-miss results that can help in further model analysis tasks. In this position paper we present a first example, adapting the NiCad code clone detector to identifying near-miss Simulink model clones at the “system” granularity. In current work we are extending this technique to the Simulink (entire) “model” and (more refined) “block” granularities as well.

Semantic Clone Detection Using Method IOE-Behavior
Rochelle Elva and Gary T. Leavens
(University of Central Florida, USA)
This paper presents an algorithm for the detection of semantic clones in Java methods. Semantic clones are defined as functionally-identical code fragments. Our detection process operates on the premise that if two code fragments are semantic clones, then their input-output behavior would be identical. We adopt a holistic approach to the definition of input-output behavior by including not only the parameters and return values of methods, but also their effects, as reflected in the pre- and post-states of the heap. We refer to this as a method’s IOE-behavior (input, output and effects).

Shuffling and Randomization for Scalable Source Code Clone Detection
Iman Keivanloo, Chanchal K. Roy, Juergen Rilling, and Philippe Charland
(Concordia University, Canada; University of Saskatchewan, Canada; DRDC at Valcartier, Canada)
In this research, we present a novel approach that allows existing state-of-the-art clone detection tools to scale to very large datasets. A key benefit of our approach is that the improved tool scalability is achieved using standard hardware and without modifying the original implementations of the subject tools. We use a hybrid approach comprising shuffling, repetition, and random subset generation of the subject dataset. As part of the experimental evaluation, we applied our shuffling and randomization approach to two state-of-the-art clone detection tools. Our experience shows that it is possible to scale the classical tools to a very large dataset using standard hardware, and without significantly affecting the overall recall while exploiting all the strengths of the original tools including the precision.

Towards Qualitative Comparison of Simulink Model Clone Detection Approaches
Matthew Stephan, Manar H. Alalfi, Andrew Stevenson, and James R. Cordy
(Queen's University, Canada)
In this position paper we briefly review the Simulink model clone detection approaches in literature, including a new one currently being developed, and outline our plan for an experimental comparison. We are using public and private Simulink models to compare approaches based on clone relevance, performance, types of clones detected, user interaction, adaptability, and the ability to identify recurring patterns using a combination of manual inspection and model visualization.

Using Edge Bundle Views for Clone Visualization
Benedikt Hauptmann, Veronika Bauer, and Maximilian Junker
(TU Munich, Germany)
Clone detection results are often voluminous and difficult to present. Most clone presentations focus on the quantitative clone results but do not relate them to the structure of the analyzed system. This makes it difficult to interpret the results and draw conclusions. We suggest using edge bundle views to interrelate the system’s structure with the clone detection results. Using this technique, it is easier to interpret the clone results and direct further analysis effort.

We Have All of the Clones, Now What? Toward Integrating Clone Analysis into Software Quality Assessment
Wei Wang and Michael W. Godfrey
(University of Waterloo, Canada)
Cloning might seem to be an unconventional way of designing and developing software, yet it is very widely practised in industrial development. The cloning research community has made substantial progress on modeling, detecting, and analyzing software clones. Although there is continuing discussion on the real role of clones in software quality, our community may agree on the need for advancing clone management techniques. Current clone management techniques concentrate on providing editing tools that allow developers to easily inspect clone instances, track their evolution, and check change consistency. In this position paper, we argue that better clone management can be achieved by responding to the fundamental needs of industry practitioners. Possible research directions include a software problem-oriented taxonomy of clones and a better structured clone detection report. We believe this line of research should inspire new techniques and reach a much wider range of professionals from both the research and industry communities.

Tool Demonstrations

Ctcompare: Code Clone Detection Using Hashed Token Sequences
Warren Toomey
(Bond University, Australia)
There is much research on the use of tokenized source code to find code clones both within and between trees of source code. Some approaches have used suffix trees [1], [3]; others have used variations of longest common substring algorithms [4], [5]. This paper outlines an algorithm, embodied in a new tool called ctcompare, that takes a different tokenization approach. Each code base to be compared is first lexically analysed to produce a sequence of tokens. These are then broken into overlapping tuples of N consecutive tokens. The tuples are then hashed and the hash values of token tuples are used to identify type-1 and type-2 clone pairs. Hashed token sequences combined with a database have already been used in earlier ctcompare versions and elsewhere [2], but with a significant performance penalty due to database insertions. The benefits of this approach over the existing research include the simultaneous comparison of multiple large code bases and fast absolute performance.
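The tuple hashing step described above can be sketched roughly as follows. This is not the ctcompare implementation: the function names are hypothetical, Python's built-in `hash` stands in for whatever hash function the tool uses, and identifiers are assumed to have been normalized to a single `ID` token beforehand so that renamed (type-2) copies still match.

```python
from collections import defaultdict


def tuple_hashes(tokens, n):
    """Map the hash of each run of n consecutive tokens to its start offsets."""
    index = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        index[hash(tuple(tokens[i:i + n]))].append(i)
    return index


def shared_offsets(tokens_a, tokens_b, n):
    """Return (offset_in_a, offset_in_b) pairs whose n-token tuples match."""
    index_a = tuple_hashes(tokens_a, n)
    pairs = []
    for i in range(len(tokens_b) - n + 1):
        window = tuple(tokens_b[i:i + n])
        for j in index_a.get(hash(window), []):
            if tuple(tokens_a[j:j + n]) == window:  # guard against collisions
                pairs.append((j, i))
    return sorted(pairs)


# Two hypothetical token streams with identifiers already normalized to ID.
a = ["int", "ID", "=", "ID", "+", "NUM", ";"]
b = ["ID", "=", "ID", "+", "NUM", ";", "return"]
print(shared_offsets(a, b, 4))  # [(1, 0), (2, 1), (3, 2)]
```

Runs of matching pairs with consecutive offsets, such as the three above, can then be merged into one longer clone pair.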

Experience of Finding Inconsistently-Changed Bugs in Code Clones of Mobile Software
Katsuro Inoue, Yoshiki Higo, Norihiro Yoshida, Eunjong Choi, Shinji Kusumoto, Kyonghwan Kim, Wonjin Park, and Eunha Lee
(Osaka University, Japan; Samsung Electronics, South Korea)
When we reuse a code fragment, some of the identifiers in the fragment might be systematically changed to others. Failing to make these changes consistently can introduce a potential bug in the copied fragment. We have developed a tool, CloneInspector, to detect such inconsistent changes in code clones, and applied it to two mobile software systems. Using this tool, we were able to effectively find latent bugs in those systems.

Visualizing Code Clone Outbreak: An Industrial Case Study
Kentaro Yoshimura and Ryota Mibe
(Hitachi, Japan)
This paper describes an industrial experience with code clone visualization. Cloning source code fragments is a common practice in the software development process. However, uncontrolled proliferation of code clones causes a serious problem in terms of software maintenance. In this paper, we briefly share our experience with code clone visualization, especially for stakeholders who are not software experts. We describe our prototype tool for code clone visualization and the feedback we have received from analyzing an enterprise business system.

6th International Workshop on Software Clones (IWSC)
