IWSC 2011 – Proceedings

Foreword
Software clones are identical or similar pieces of code or design. Clones are known to be closely related to various issues in software engineering, such as software quality, complexity, architecture, refactoring, evolution, licensing, and plagiarism. Various characteristics of software systems can be uncovered through clone analysis, and system restructuring can be performed by merging clones.
The purpose of this workshop is to continue to solidify and give shape to this research/application area and community. More specifically, the goals are to bring together academic and industrial researchers and practitioners from around the world to evaluate the current state of research and applications, discuss common problems, discover new opportunities for collaboration, exchange ideas, and envision new areas of research and applications.
In this, the 5th international clone workshop, we will discuss issues in software clone detection, analysis and management, as well as applications to software engineering contexts that can benefit from knowledge of clones.

Full Papers

Viewing Simple Clones from Structural Clones' Perspective
Hamid Abdul Basit, Usman Ali, and Stanislaw Jarzabek
(Lahore University of Management Sciences, Pakistan; National University of Singapore, Singapore)
In previous work, we described a technique for detecting design-level similar program structures that we called structural clones. Structural clones are recurring configurations of simple clones (i.e., similar code fragments). In this paper, we show how structural clone analysis extends the benefits of analysis based on simple clones only. First, we present experimental results showing that in many cases simple clones participated in structural clones. In such cases, structural clones, being larger than simple clones but smaller in number, allow analysts to see the “forest from the trees”, as far as the similarity situation is concerned. We provide arguments and examples to show how the knowledge of structural clones – their location and exact similarities and differences – helps in program understanding, design recovery, maintenance, and refactoring.

Extracting Code Clones for Refactoring Using Combinations of Clone Metrics
Eunjong Choi, Norihiro Yoshida, Takashi Ishio, Katsuro Inoue, and Tateki Sano
(Osaka University, Japan; Nara Institute of Science and Technology, Japan; NEC Corporation, Japan)
Code clone detection tools may report a large number of code clones, while software developers are interested in only a subset of code clones that are relevant to software development tasks such as refactoring. Our research group has supported many software developers with the code clone detection tool CCFinder and its GUI front-end Gemini. Gemini shows clone sets (i.e., sets of code clones identical or similar to each other) with several clone metrics including their length and the number of code clones; however, it is not clear how to use those metrics to extract interesting code clones for developers. In this paper, we propose a method combining clone metrics to extract code clones for refactoring activity. We have conducted an empirical study on a web application developed by a Japanese software company. The result indicates that combinations of simple clone metrics are more effective for extracting refactoring candidates from detected code clones than individual clone metrics.

Index-Based Model Clone Detection
Benjamin Hummel, Elmar Juergens, and Daniela Steidl
(TU München, Germany)
Existing algorithms for model clone detection operate in batch mode. Consequently, if a small part of a large model changes during maintenance, the entire detection needs to be recomputed to produce updated cloning information. Since this can take several hours, the lack of incremental detection algorithms hinders clone management, which requires up-to-date cloning information. In this paper we present an index-based algorithm for model clone detection that is incremental and distributable. We present a case study that demonstrates its capabilities, outline its current limitations and present directions for future work.

Is Cloned Code Older than Non-Cloned Code?
Jens Krinke
(University College London, UK)
It is still a debated question whether cloned code causes increased maintenance effort. If cloned code is more stable than non-cloned code, i.e., it is changed less often, it will require less maintenance effort. The more stable cloned code is, the longer it will not have been changed, so the stability can be estimated through the code's age. This paper presents a study on the average age of cloned code. For three large open source systems, the age of every line of source code is computed as the date of the last change in that line. In addition, every line is categorized by whether it belongs to cloned code as detected by a clone detector. The study shows that on average, cloned code is older than non-cloned code. Moreover, if a file has cloned code, the average age of the cloned code of the file is lower than the average age of the non-cloned code in the same file. The results support the previous findings that cloned code is more stable than non-cloned code.
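
As a rough illustration of the measurement described in this abstract, the following Python sketch averages line ages separately for cloned and non-cloned lines. It is not the study's tooling: the record layout (last-change date, cloned flag), the function name average_age_days, and the sample dates are all assumptions made for the example.

```python
from datetime import date


def average_age_days(lines, today):
    """Mean age in days of lines, split by clone status.

    `lines` is an iterable of (last_changed, is_cloned) records; this
    layout is a hypothetical stand-in for version-history and clone
    detector output.
    """
    ages = {True: [], False: []}
    for last_changed, is_cloned in lines:
        ages[is_cloned].append((today - last_changed).days)
    return {
        "cloned": sum(ages[True]) / len(ages[True]) if ages[True] else None,
        "non_cloned": sum(ages[False]) / len(ages[False]) if ages[False] else None,
    }


# Invented sample data, for illustration only.
lines = [
    (date(2009, 1, 1), True),
    (date(2009, 1, 11), True),
    (date(2010, 1, 1), False),
]
result = average_age_days(lines, today=date(2011, 1, 1))
```

A larger average for the cloned lines would correspond to cloned code being older, in the sense used above.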

Automated Type-3 Clone Oracle Using Levenshtein Metric
Thierry Lavoie and Ettore Merlo
(Ecole Polytechnique de Montréal, Canada)
Evaluating the quality and performance of clone detection techniques requires a system along with its clone oracle, that is, a reference database of all accepted clones in the investigated system. Many challenges, including finding an adequate clone definition and scaling to industrial-size systems, must be overcome to create good oracles. This paper presents an original method to construct clone oracles based on the Levenshtein metric. Although other oracles exist, this is the largest known oracle for type-3 clones that was created by an automated process on massive data sets. The method behind the creation of the oracle, as well as the characteristics of the actual oracles, are presented. A discussion of the results in relation to other ways of building oracles is also provided, along with future research possibilities.
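
For readers unfamiliar with the Levenshtein metric, the sketch below shows the standard dynamic-programming edit distance applied to token sequences. It is only an illustration: the normalization by the longer sequence, the 0.3 threshold, and the function names are assumptions, not the oracle construction used in the paper.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (unit-cost insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # delete x
                               current[j - 1] + 1,       # insert y
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Distance scaled to [0, 1] by the longer sequence (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def is_type3_candidate(tokens_a, tokens_b, threshold=0.3):
    """Treat two fragments as near-miss candidates below a hypothetical threshold."""
    return normalized_distance(tokens_a, tokens_b) <= threshold
```

Comparing token lists rather than raw characters keeps the distance insensitive to whitespace; any real oracle would also need a stated tokenization and threshold.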

Analyzing Web Service Similarity Using Contextual Clones
Douglas Martin and James R. Cordy
(Queen's University, Canada)
There are several tools and techniques developed over the past decade for detecting duplicated code in software. However, there exists a class of languages for which clone detection is ill-suited. We discovered one of these languages when we attempted to use clone detection to find similar web service operations in service descriptions written in the Web Service Description Language (WSDL). WSDL is structured in such a way that identifying units for comparison becomes a challenge. WSDL service descriptions contain specifications of one or more operations that are divided into pieces and intermingled throughout the description. In this paper, we describe a method of reorganizing them in order to leverage clone detection technology to identify similar services. We introduce the idea of contextual clones – clones that can only be found by augmenting code fragments with related information referenced by the fragment to give it context. We demonstrate this idea for WSDL and propose other languages and situations for which contextual clones may be of interest.

Scalable Clone Detection Using Description Logic
Philipp Schugerl
(Concordia University, Canada)
The semantic web is slowly transforming the web as we know it into a machine-understandable pool of information that can be consumed and reasoned about by various clients. Source code is no exception to this trend, and various communities have proposed standards to share code as linked data. With the availability of large amounts of open source code published in publicly accessible repositories and the introduction of massively horizontally scaling frameworks and cloud computing infrastructure, a new era of software mining across information silos is reshaping the software engineering landscape. The so far unreachable goal of analyzing code at a global level, and therefore detecting global software clones, has become manageable. Description logic and semantic web reasoners have so far only played a minor role in this transformation and are mainly used to model source code data. In this paper, we introduce a clone detection algorithm that uses a semantic web reasoner and is based on the Hadoop map-reduce framework that can scale horizontally to a large amount of data. We also define a novel and compact clone model that only considers control blocks and used data types while still yielding clone detection results similar to those of more complex representations. In order to validate our approach, we have compared our algorithm to some of the leading clone detection tools (CCFinder, JCD and Simian) and show differences in performance and detection precision.

Representing Clones in a Localized Manner
Robert Tairas, Ferosh Jacob, and Jeff Gray
(AtlanMod, France; University of Alabama, USA)
Code clones (i.e., duplicate sections of code) can be scattered throughout the source files of a program. Manually evaluating a group of such clones requires observing each clone in its original location (i.e., opening each file and finding the source location of each clone), which can be a time-consuming process. As an alternative, this paper introduces a new technique to localize the representation of code clones to provide a summary of the properties of two or more clones in one location. In our approach, the results of a clone detection tool are analyzed in an automated manner to determine the properties (i.e., similarities and differences) of the clones. These properties are visualized directly within the source editor. The localized representation is realized as part of the features of an Eclipse plug-in called CeDAR.

Short Papers

On the Need for Human-based Empirical Validation of Techniques and Tools for Code Clone Analysis
Jeffrey C. Carver, Debarshi Chatterji, and Nicholas A. Kraft
(University of Alabama, USA)
Code clone analysis techniques and tools are popular topics among the software engineering research community. Many studies draw conclusions solely based on an analytical analysis. These claims focus primarily on tool performance in terms of portability, scalability, robustness, precision, and recall. However, these types of analytical studies cannot adequately evaluate the behavior of the developers while using the tools. Human-based empirical studies are complementary to studies based on analytical data because they provide direct insight into developer behavior. In this paper we argue for the need for more human-based empirical studies in the area of code clone analysis techniques and tools.

Code Clone Detection Experience at Microsoft
Yingnong Dang, Song Ge, Ray Huang, and Dongmei Zhang
(Microsoft Research Asia, China)
Cloning source code is a common practice in the software development process. In general, the number of code clones increases in proportion to the growth of the code base. It is challenging to proactively keep clones consistent and remove unnecessary clones during the entire software development process of large-scale commercial software. In this position paper, we briefly share some typical usage scenarios of code clone detection that we collected from Microsoft engineers. We also discuss our experience on building XIAO, a code clone detection tool, and the feedback we have received from Microsoft engineers on using XIAO in real development settings.

Determining the Provenance of Software Artifacts
Michael W. Godfrey, Daniel M. German, Julius Davies, and Abram Hindle
(University of Waterloo, Canada; University of Victoria, Canada)
Software clone detection has made substantial progress in the last 15 years, and software clone analysis is starting to provide real insight into how and why code clones are born, evolve, and sometimes die. In this position paper, we make the case that there is a more general problem lurking in the background: software artifact provenance analysis. We argue that determining the origin of software artifacts is an increasingly important problem with many dimensions. We call for simple and lightweight techniques that can be used to help narrow the search space, so that more expensive techniques (including manual examination) can be used effectively on a smaller candidate set. We predict the problem of software provenance will lead towards new avenues of research for the software clones community.

Research in Cloning Beyond Code: A First Roadmap
Elmar Juergens
(TU München, Germany)
Most research in software cloning has a strong focus on source code. However, cloning occurs in other software artifacts, as well. In this paper, we summarize existing work on cloning in other software artifacts and provide a list of research questions for future work.

How Code Skips Over Revisions
Toshihiro Kamiya
(Future University Hakodate, Japan)
This paper explores the need for 'history-aware' searches, by experimentally showing a development process that includes code fragments which disappear at a revision and appear again at a later revision. Some of these code re-appearances are not a result of a revert command of a version control system, but a result of a developer who copied a code fragment from old source files.

Visualizing the Evolution of Code Clones
Ripon K. Saha, Chanchal K. Roy, and Kevin A. Schneider
(University of Saskatchewan, Canada)
The knowledge of code clone evolution throughout the history of a software system is essential in comprehending and managing its clones properly and cost-effectively. However, investigating and observing facts in a huge set of text-based data provided by a clone genealogy extractor could be challenging without the support of a visualization tool. In this position paper, we present an idea of visualizing code clone evolution by exploiting the advantages of existing clone visualization techniques that would be both scalable and useful.

Clone Detection through Process Algebras and Java Bytecode
Antonella Santone
(University of Sannio, Italy)
In this paper we present a formal method-based approach to detecting source code clones by analysing and comparing the Java Bytecode that is produced when the source code is compiled. A preliminary investigation has also been conducted to assess the validity of the proposed approach.

Towards Flexible Code Clone Detection, Management, and Refactoring in IDE
Minhaz F. Zibran and Chanchal K. Roy
(University of Saskatchewan, Canada)
In this paper, we propose an IDE-based clone management system to flexibly detect, manage, and refactor both exact and near-miss code clones. Using a k-difference hybrid suffix tree algorithm we can efficiently detect both exact and near-miss clones. We have implemented the algorithm as a plugin to the Eclipse IDE, and have been extending this for real-time code clone management with semi-automated refactoring support during the actual development process.

Tool Papers

VisCad: Flexible Code Clone Analysis Support For NiCad
Muhammad Asaduzzaman, Chanchal K. Roy, and Kevin A. Schneider
(University of Saskatchewan, Canada)
Clone detector results can be better understood with tools that support visualization and facilitate in-depth analysis. In this tool demo paper we present VisCad, a comprehensive code clone analysis and visualization tool that provides such support for the near-miss hybrid clone detection tool, NiCad. Through carefully selected metrics and visualization techniques VisCad can guide users to explore the cloning of a system from different perspectives.

Live Scatterplots
James R. Cordy
(Queen’s University, Canada)
Scatterplots have been used to help understand clone relationships in large scale systems since the earliest large system studies more than a decade ago. They often expose interesting patterns of cloning between subsystems and point to opportunities for further analysis. However, the remaining question when such patterns are seen is always, "but what is that?" Live scatterplots are aimed at providing an immediate, intuitive answer that can help the analyst to quickly identify and access subsystems and clones involved in a pattern simply by directly pointing at it in the scatterplot. Live scatterplots exploit the table, title and hyperlink tags of standard HTML to provide this ability in any standard browser, without the need for custom frameworks.
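
To make the HTML mechanism mentioned in this abstract concrete, here is a minimal Python sketch that emits a table whose marked cells carry a title tooltip and a hyperlink. The function name scatterplot_html, the cell marker, and the link scheme are hypothetical; this is not the tool's actual output format.

```python
from html import escape


def scatterplot_html(names, clone_pairs, link_for):
    """Build an HTML table with one row and one column per subsystem.

    `clone_pairs` holds index pairs (i, j) that share clones, and
    `link_for(row, col)` returns the URL to open for a cell. Both are
    assumed inputs for this illustration.
    """
    rows = []
    for i, row_name in enumerate(names):
        cells = []
        for j, col_name in enumerate(names):
            if (i, j) in clone_pairs or (j, i) in clone_pairs:
                tip = escape(f"{row_name} / {col_name}", quote=True)
                href = escape(link_for(row_name, col_name), quote=True)
                # The title attribute gives the hover text; the link opens details.
                cells.append(f'<td><a href="{href}" title="{tip}">*</a></td>')
            else:
                cells.append("<td></td>")
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


page = scatterplot_html(
    ["a.c", "b.c"],
    {(0, 1)},
    lambda row, col: f"clones/{row}_{col}.html",  # hypothetical link scheme
)
```

Because the result is plain HTML, hovering over a marked cell shows the pair of subsystems and clicking it follows the link in any standard browser.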

Efficiently Handling Clone Data: RCF and cyclone
Jan Harder and Nils Göde
(University of Bremen, Germany)
Exchange and inspection of clone information is an essential building block of successful clone management. Until now, a variety of different formats and tools for handling clone data has evolved. This paper presents two of our tools, both of which work closely together. We introduce the Rich Clone Format, which we have developed to standardize and ease the exchange of clone data. Furthermore, we describe our tool cyclone, which fosters the multi-perspective analysis of clone information.

CloneDiff - Semantic Differencing of Clones
Yinxing Xue, Zhenchang Xing, and Stanislaw Jarzabek
(National University of Singapore, Singapore)
Clone detection provides a scalable and efficient way to detect similar code, while program differencing is a powerful and effective way to analyze similar code. CloneDiff, a Program Dependence Graph (PDG) differencing tool, complements clone detection with program differencing for the purpose of characterizing clones. It captures semantic information of clones from PDGs, and uses graph matching techniques to compute a precise characterization of clones in terms of a category of semantic differences.

Fifth International Workshop on Software Clones (IWSC 2011)
