ICSE 2013 Workshops
2013 35th International Conference on Software Engineering (ICSE)

2013 7th International Workshop on Software Clones (IWSC), May 19, 2013, San Francisco, CA, USA

IWSC 2013 – Proceedings


7th International Workshop on Software Clones (IWSC)

Title Page


Foreword
Software clones are identical or similar pieces of code, models, or designs. In this, the 7th International Workshop on Software Clones (IWSC), we will discuss issues in software clone detection, analysis, and management, as well as applications to software engineering contexts that can benefit from knowledge of clones. These are important emerging topics in software engineering research and practice. Special emphasis will be given this time to clone management in practice, with a focus on use cases and experiences. We will also discuss broader topics on software clones, such as clone detection methods; clone classification, management, and evolution; the role of clones in software system architecture, quality, and evolution; clones in plagiarism, licensing, and copyright; and other topics related to similarity in software systems. The workshop format provides ample time for in-depth discussion.

Light-Weight Ontology Alignment using Best-Match Clone Detection
Paul L. Geesaman, James R. Cordy, and Amal Zouaq
(Queen's University, Canada; Royal Military College, Canada)
Ontologies are a key component of the Semantic Web, providing a common basis for representing and exchanging domain meaning in web documents and resources. Ontology alignment is the problem of relating the elements of two formal ontologies for a semantic domain, in order to identify common concepts and relationships represented using different terminology or language, and thus to allow meaningful communication and exchange of documents and resources represented using different ontologies for the same domain. Many algorithms have been proposed for ontology alignment, each with its own strengths and weaknesses. The problem is in many ways similar to near-miss clone detection: while much of the description of concepts in two ontologies may be similar, there can be differences in structure or vocabulary that make similarity detection challenging. Building on our previous work extending clone detection to modelling languages such as WSDL using contextualization, we apply near-miss clone detection to the problem of ontology alignment, and use the new notion of best-match clone detection to achieve results similar to those of many existing ontology alignment algorithms on standard benchmarks.
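The core of best-match detection can be illustrated in a few lines: rather than keeping every candidate pair above a similarity threshold, each element is paired only with its single highest-scoring counterpart. The Python sketch below is an illustration under assumptions, not the authors' algorithm: it uses a simple token-overlap similarity over concept labels and a mutual-best-match rule, whereas the paper operates on contextualized near-miss clones of ontology structure.

```python
# Hypothetical sketch of "best-match" alignment between the concept labels
# of two ontologies, using token-overlap (Jaccard) similarity.

def tokens(label):
    return set(label.lower().replace("_", " ").split())

def similarity(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def best_match_alignment(onto_a, onto_b, threshold=0.5):
    """Keep a pair (a, b) only if each is the other's best match."""
    best_for_a = {a: max(onto_b, key=lambda b: similarity(a, b)) for a in onto_a}
    best_for_b = {b: max(onto_a, key=lambda a: similarity(a, b)) for b in onto_b}
    return [(a, b) for a, b in best_for_a.items()
            if best_for_b[b] == a and similarity(a, b) >= threshold]

if __name__ == "__main__":
    a = ["conference paper", "program committee member"]
    b = ["paper", "committee member", "keynote talk"]
    print(best_match_alignment(a, b, threshold=0.3))
```

The mutual-best-match rule is what distinguishes this from plain threshold matching: "conference paper" aligns with "paper" only because neither has a better candidate, which suppresses spurious many-to-one matches.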

A Mutation Analysis Based Benchmarking Framework for Clone Detectors
Jeffrey Svajlenko, Chanchal K. Roy, and James R. Cordy
(University of Saskatchewan, Canada; Queen's University, Canada)
In recent years, a large number of clone detectors have been proposed in the literature. However, most tool papers lack a solid performance evaluation of the subject tools, due both to the lack of a reliable, publicly available benchmark and to the manual effort required to hand-check a large number of candidate clones. In this tool demonstration paper, we show how a mutation analysis based benchmarking framework can be used by developers and researchers to evaluate clone detection tools at a fine granularity with minimal effort.
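The underlying idea of mutation-based benchmarking is simple: seed a corpus with artificially mutated copies of known fragments, then measure how often the subject detector reports the known pair. The following Python sketch illustrates that loop under assumptions; the two mutation operators and the `detect(corpus)` interface are invented stand-ins, not the framework's actual API.

```python
# Minimal sketch of mutation-analysis benchmarking, assuming a detector
# exposed as a function `detect(corpus) -> set of fragment-id pairs`.
import random

def rename_identifier(code):
    # Type-2-style mutation: systematically rename one identifier.
    return code.replace("total", "sum_x")

def delete_line(code):
    # Type-3-style mutation: drop one random non-empty line.
    lines = [l for l in code.splitlines() if l.strip()]
    del lines[random.randrange(len(lines))]
    return "\n".join(lines)

def benchmark(detect, original, operators, trials=100):
    """Recall per mutation operator: how often the detector reports
    the known (original, mutant) pair."""
    recall = {}
    for op in operators:
        hits = 0
        for _ in range(trials):
            corpus = {"orig": original, "mut": op(original)}
            if ("orig", "mut") in detect(corpus):
                hits += 1
        recall[op.__name__] = hits / trials
    return recall

if __name__ == "__main__":
    def naive_detect(corpus):
        # Toy detector: pairs sharing >= 70% of their exact lines.
        ids, pairs = list(corpus), set()
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                a = set(corpus[ids[i]].splitlines())
                b = set(corpus[ids[j]].splitlines())
                if len(a & b) / max(len(a), len(b)) >= 0.7:
                    pairs.add((ids[i], ids[j]))
        return pairs

    code = "total = 0\nfor x in xs:\n    total += x\nreturn total"
    print(benchmark(naive_detect, code, [rename_identifier, delete_line]))
```

Run against the toy line-based detector, the rename operator yields zero recall while line deletion is still caught, which is exactly the kind of per-operator breakdown such a benchmark reports.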

The Limits of Clone Model Standardization
Jan Harder
(University of Bremen, Germany)
Many tools for clone detection exist, each with its own model of clones and its own data format. This makes it difficult to share results, compare detectors, and replicate existing studies. Although there have been attempts to provide unified clone models in the past, no widely accepted unified clone model and data format has emerged. This paper discusses the challenges that may explain why. We suggest that existing specialized models should be kept and supplemented with a single unified model that serves for exchange only.
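As a rough illustration of what an exchange-only unified model might look like, the snippet below serializes one clone pair to JSON. All field names are invented for illustration; the paper argues for such a format but does not prescribe this one.

```python
# Hypothetical exchange record for a single clone pair; tool-specific
# models would be converted to and from this shape for interchange only.
import json

clone_pair = {
    "tool": "detector-x",          # which detector produced the result
    "clone_type": 2,               # Type-1/2/3 classification, if known
    "fragments": [
        {"file": "src/A.java", "start_line": 10, "end_line": 42},
        {"file": "src/B.java", "start_line": 105, "end_line": 137},
    ],
}
print(json.dumps(clone_pair, indent=2))
```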

Refactoring Clones: A New Perspective
Nikolaos Tsantalis and Giri Panamoottil Krishnan
(Concordia University, Canada)
In this position paper, we argue that there is still great potential for advances in research on software clone refactoring, and we discuss possible research objectives and directions through illustrative examples.

Cloning: The Need to Understand Developer Intent
Debarshi Chatterji, Jeffrey C. Carver, and Nicholas A. Kraft
(University of Alabama, USA)
Many researchers have studied the positive and negative effects of code clones on software quality. However, little is known about the intent and rationale of the developers who clone code. Studies have shown that reusing code is a common practice for developers while programming, but there are many possible motivations for and approaches to code reuse. Although we have some ideas about the intentions of developers when cloning code, comprehensive research is needed to gather conclusive evidence about these intentions and categorize clones based on them. In this paper we argue that to provide developers with better clone management tools, we need to interview developers to better understand their intentions when managing cloned code.

Scaling Classical Clone Detection Tools for Ultra-Large Datasets: An Exploratory Study
Jeffrey Svajlenko, Iman Keivanloo, and Chanchal K. Roy
(University of Saskatchewan, Canada; Concordia University, Canada)
Detecting clones in large datasets is an interesting research topic for a number of reasons. However, building scalable clone detection tools is challenging, and it is often impossible to use existing state-of-the-art tools on such large datasets. In this research, we investigated the use of our Shuffling Framework for scaling classical clone detection tools to ultra-large datasets. The framework achieves scalability on standard hardware by partitioning the dataset and shuffling the partitions over a number of detection rounds. The approach does not require modification of the subject tools, which allows their individual strengths and precision to be retained at an acceptable loss of recall. In our study, we explored the performance and applicability of the framework for six clone detection tools. The clones found during our experiment were used to comment on the cloning habits of the global Java open-source development community.
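The partition-and-shuffle idea can be sketched compactly: split the corpus into partitions small enough for a classical detector, then schedule detection rounds so that every pair of partitions shares at least one round. The Python sketch below uses a naive all-pairs schedule for clarity; the actual framework runs fewer, larger subsets per round, and `run_clone_detector` is a hypothetical tool invocation.

```python
# Sketch of scaling an unmodified detector by partitioning and shuffling.
from itertools import combinations

def partition(files, num_parts):
    """Split the file list into num_parts roughly equal partitions."""
    return [files[i::num_parts] for i in range(num_parts)]

def detection_rounds(partitions):
    """Yield one input subset per run of the (unmodified) subject tool,
    so that every pair of partitions is co-detected at least once."""
    for p, q in combinations(range(len(partitions)), 2):
        yield partitions[p] + partitions[q]

# Usage: for subset in detection_rounds(partition(all_files, 16)):
#            run_clone_detector(subset)   # hypothetical tool invocation
```

Recall is lost only for clone pairs whose fragments never share a round, which is why the number and composition of rounds is the framework's key tuning knob.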

How to Extract Differences from Similar Programs? A Cohesion Metric Approach
Akira Goto, Norihiro Yoshida, Masakazu Ioka, Eunjong Choi, and Katsuro Inoue
(Osaka University, Japan; NAIST, Japan)
Merging similar programs is a promising way to improve the maintainability of source code. Before merging programs, any syntactic difference has to be extracted as a new method. However, it is difficult for a developer to identify and extract differences appropriately, because the developer has to consider not only syntactic and semantic correctness but also the modularity of the programs after merging. In this paper, we propose a slice-based cohesion metric approach for suggesting the extraction of differences from similar Java methods. The approach identifies syntactic differences between two methods and then suggests sets of cohesive regions that include those differences. A case study shows that the proposed approach can suggest refactorings that not only merge two methods but also increase cohesion.
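A toy illustration of the first step, locating the syntactic differences between two similar methods, is shown below. Python's difflib stands in for the paper's program differencing, and the slice-based cohesion scoring of candidate regions is omitted.

```python
# Locate the differing regions between two similar method bodies; these
# are the spans a merge refactoring would need to extract.
import difflib

def diff_regions(method_a, method_b):
    """Return the non-equal regions between two method bodies."""
    a, b = method_a.splitlines(), method_b.splitlines()
    sm = difflib.SequenceMatcher(None, a, b)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]

m1 = "a = load()\nvalidate(a)\nsave(a)"
m2 = "a = load()\nnormalize(a)\nsave(a)"
print(diff_regions(m1, m2))
# -> [('replace', ['validate(a)'], ['normalize(a)'])]
```

The paper's contribution lies in what happens next: ranking candidate regions around each difference by slice-based cohesion, so the suggested extraction is a well-modularized method rather than an arbitrary span.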

Prioritizing Code Clone Detection Results for Clone Management
Radhika D. Venkatasubramanyam, Shrinath Gupta, and Himanshu Kumar Singh
(Siemens, India)
Clone detection with tools is common practice in the software industry. Closely associated with clone detection is code clone management: making informed decisions about the large sets of clones reported by detection tools, a task that becomes more challenging as code bases grow. To enable and ease code clone management, we discuss various criteria that help prioritize clone detection results. We consider the impact of clones with respect to maintenance overhead, code quality, and refactoring cost. The prioritization criteria are based on the need for industrial code to adhere to software quality standards. This paper provides a systematic approach for analyzing and prioritizing clones to determine the order in which they should be fixed. The methodology is currently being used in several projects at the Siemens Corporate Technology Development Center Asia Australia (CT DC AA); a case study of one such project is presented in this paper.
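As a hypothetical illustration of criteria-based prioritization, the sketch below ranks clone classes by a weighted score over the kinds of factors the paper discusses (size, number of instances, refactoring cost). The weights and features are illustrative assumptions, not the authors' actual criteria.

```python
# Toy priority score: larger, more widespread clones rank higher;
# expensive refactorings rank lower. Weights are illustrative only.
def priority(clone, w_size=0.4, w_instances=0.4, w_cost=0.2):
    return (w_size * clone["lines"]
            + w_instances * clone["instances"]
            - w_cost * clone["refactoring_cost"])

clones = [
    {"id": 1, "lines": 120, "instances": 5, "refactoring_cost": 30},
    {"id": 2, "lines": 15,  "instances": 2, "refactoring_cost": 5},
]
for c in sorted(clones, key=priority, reverse=True):
    print(c["id"], round(priority(c), 1))
```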

No Clones, No Trouble?
Saman Bazrafshan
(University of Bremen, Germany)
In recent years, research has examined many aspects of software clones, yet only a vague idea of their costs and benefits exists. Without substantial research on the economic consequences of software clones, clone management will remain a questionable or even risky activity. Knowing how much effort is spent over the lifetime of a program to maintain source code that was introduced by a deliberate clone removal might provide useful information in this respect. In this paper, I present a framework for tracking code fragments that were introduced by refactorings performed to remove existing code duplication. Tracking such code over time makes it possible to investigate different aspects, for instance change frequency, that provide valuable insight into ongoing maintenance costs.

CPDP: A Robust Technique for Plagiarism Detection in Source Code
Basavaraju Muddu, Allahbaksh Asadullah, and Vasudev Bhat
(Infosys, India)
The advent of the Internet and the growth of open-source software repositories have made source code readily accessible to software developers. Although reusing source code has its advantages, care must be taken to ensure that proprietary software does not infringe any licenses. In this context, plagiarism detection plays an important role. In this paper, we propose a robust technique for detecting plagiarism in source code. Our approach uses a language-aware token representation, which is resilient to code transformations, together with an improved querying and matching technique. We evaluated our approach by comparing it with other plagiarism detection tools: Copy Paste Detector (CPD), Sherlock, CCFinder, and Plaggie.
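A language-aware token representation resists simple transformations by collapsing identifiers and literals into token classes, so renaming variables or changing constants leaves the fingerprint unchanged. The sketch below is only an analogy, using Python's own tokenize module as a stand-in for the paper's lexer; CPDP's actual representation and matching are not shown here.

```python
# Normalize a token stream so Type-2 edits (renames, changed literals)
# produce the same fingerprint.
import io
import keyword
import tokenize

def normalized_tokens(source):
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            # Keep keywords, collapse all other identifiers.
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        elif tok.type == tokenize.OP:
            out.append(tok.string)
    return out

# Renamed variables and changed constants yield the same stream:
assert normalized_tokens("x = y + 1\n") == normalized_tokens("a = b + 2\n")
```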

A Parallel and Efficient Approach to Large Scale Clone Detection
Hitesh Sajnani and Cristina Lopes
(UC Irvine, USA)
Over the past few years, researchers have implemented various algorithms to improve the scalability of clone detection. Most of these algorithms focus on scaling vertically on a single machine and require complex intermediate data structures (e.g., a suffix tree). However, several new use cases of clone detection have emerged that are beyond the computational capacity of a single machine. Moreover, for some of these use cases it may be expensive to invest upfront in building these data structures. In this paper, we propose a technique to scale clone detection horizontally across multiple machines using the popular MapReduce framework. The technique does not require building any complex intermediate data structures. Moreover, to increase efficiency, the technique uses a filtering heuristic to prune the number of code block comparisons. The filtering heuristic is independent of our approach and can be used in conjunction with other approaches to increase their efficiency. In our experiments, we found that: (i) the computation time to detect clones decreases by almost half every time we double the number of nodes; and (ii) the scaleup is linear, with a decline of no more than 70% compared to the ideal case, on a cluster of 2-32 nodes for 150-2800 projects.
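Conceptually, the MapReduce formulation keys each code block so that potentially similar blocks meet at the same reducer, which then verifies candidates pairwise. The in-memory Python sketch below simulates the two phases under assumptions: keying by shared tokens and pruning pairs whose size ratio already rules out the similarity threshold are stand-ins for the paper's actual partitioning and filtering heuristic.

```python
# Simulated map/reduce clone detection over tokenized code blocks.
from collections import defaultdict

THRESHOLD = 0.8  # minimum token-overlap similarity

def map_phase(blocks):
    """Emit (key, block) pairs; blocks sharing a key meet in one reducer."""
    for block_id, toks in blocks.items():
        for key in set(toks):
            yield key, (block_id, toks)

def similarity(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa), len(sb))

def reduce_phase(grouped):
    clones = set()
    for members in grouped.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                (ida, ta), (idb, tb) = members[i], members[j]
                # Filtering heuristic: sizes alone can rule out a match.
                if min(len(ta), len(tb)) / max(len(ta), len(tb)) < THRESHOLD:
                    continue
                if similarity(ta, tb) >= THRESHOLD:
                    clones.add(tuple(sorted((ida, idb))))
    return clones

blocks = {"f1": ["for", "i", "sum"], "f2": ["for", "i", "sum"], "f3": ["print"]}
grouped = defaultdict(list)
for k, v in map_phase(blocks):
    grouped[k].append(v)
print(reduce_phase(grouped))  # -> {('f1', 'f2')}
```

The size-ratio check is the kind of cheap filter that generalizes: any detector that verifies candidates pairwise can apply it before the expensive comparison.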

Towards a Curated Collection of Code Clones
Ewan Tempero
(University of Auckland, New Zealand)
In order to do research on code clones, it is necessary to have information about code clones. For example, if the research is to improve clone detection, this information would be used to validate the detectors or to provide a benchmark for comparing different detectors; if the research is on techniques for managing clones, the information would be used as input to such techniques. Typically, researchers have to develop clone information themselves, even if doing so is not the main focus of their research. If such information could be made available, they would be able to use their time more efficiently. If such information were usefully organised and its quality clearly identified, that is, if the information were curated, then the quality of the research would be improved as well. In this paper, I describe the beginnings of a curated source of information about a collection of code clones from the Qualitas Corpus. I describe how this information is currently organised, discuss how it might be used, and propose directions it might take in the future. The collection currently includes 1.3M method-level clone pairs from 109 different open-source Java systems, covering approximately 5.6M lines of code.

Assessing Cross-Project Clones for Reuse Optimization
Veronika Bauer and Benedikt Hauptmann
(TU Munich, Germany)
Organizational structures (e.g., separate accounting, heterogeneous infrastructure, or different development processes) can restrict systematic reuse among projects within companies. As a consequence, code is often copied between projects, which increases maintenance costs and can cause failures due to inconsistent bug fixing. Assessing cross-project clones helps to uncover organizational obstacles to code reuse and to leverage other ways of systematic reuse. Furthermore, knowing how strongly clones are entangled with the surrounding code helps to decide whether and how to extract them into commonly used libraries. We propose to combine cross-project clone detection and dependency analysis to determine (1) what is cloned between projects, (2) how far the cloned code is entangled with the surrounding system, and (3) which clones are candidates for extraction into common libraries.

On the Robustness of Clone Detection to Code Obfuscation
Sandro Schulze and Daniel Meyer
(TU Braunschweig, Germany; University of Magdeburg, Germany)
Code clones are a common reuse mechanism in software development. While there is an ongoing discussion about the harmfulness and advantages of code cloning, this discussion centers mainly on aspects of software quality. However, recent research has shown that code cloning may have legal implications as well, such as license violations. From this point of view, a developer may prefer to hide his cloning activities, and to this end could obfuscate the cloned code to deceive clone detectors. However, it is unknown how robust clone detection techniques are against code obfuscation. In this paper, we present a framework for semi-automated code obfuscation. Additionally, we present a case study evaluating the robustness of selected clone detectors against such obfuscations.
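To make the setting concrete, the snippet below shows one obfuscation pass of the kind such a framework might apply: systematically renaming identifiers, which preserves behavior but defeats purely text-based matching. It is a toy Python transform, not the paper's framework.

```python
# Behavior-preserving Type-2 obfuscation: consistently rename every
# non-keyword identifier, so text- and token-identical matching breaks.
import io
import keyword
import tokenize

def rename_identifiers(source):
    mapping, out = {}, []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            mapping.setdefault(tok.string, f"v{len(mapping)}")
            out.append((tok.type, mapping[tok.string]))
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)

print(rename_identifiers("def area(width, height):\n    return width * height\n"))
```

A robustness study then reduces to a matrix: apply each obfuscation to known clones and record which detectors still report them.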

Large Scale Multi-language Clone Analysis in a Telecommunication Industrial Setting
Ettore Merlo, Thierry Lavoie, Pascal Potvin, and Pierre Busnel
(Polytechnique Montréal, Canada; Ericsson, Canada)
This paper presents results from an experience of transferring the CLAN clone detection technology into a telecommunication industrial setting. Eleven proprietary systems were analyzed, totaling about 94 MLOC of C/C++ and Java source code. The characteristics of the analyzed systems are described, together with the Web portal that serves as the interface to the clone analysis environment. Reported results include figures and diagrams on clone frequencies, types, and similarity distributions. Processing times, including parsing, clone clustering, and dynamic programming visualization, are presented. Lessons learned and future research directions are also discussed from an industrial point of view, for real-life practical applications of clone detection.

Feature-Based Detection of Bugs in Clones
Daniela Steidl and Nils Göde
(CQSE, Germany)
A significant amount of source code in software systems consists of comments, i.e., parts of the code which are ignored by the compiler. Comments in code represent a main source of system documentation and are hence key to source code understanding with respect to development and maintenance. Although many software developers consider comments crucial for program understanding, existing approaches for software quality analysis ignore system commenting or make only quantitative claims. Hence, current quality analyses do not take a significant part of the software into account. In this work, we present a first detailed approach for quality analysis and assessment of code comments. The approach provides a model for comment quality which is based on different comment categories. To categorize comments, we use machine learning on Java and C/C++ programs. The model comprises different quality aspects: by providing metrics tailored to specific categories, we show how quality aspects of the model can be assessed. The validity of the metrics is evaluated with a survey among 16 experienced software developers, and a case study demonstrates the relevance of the metrics in practice.

How Much Really Changes? A Case Study of Firefox Version Evolution using a Clone Detector
Thierry Lavoie and Ettore Merlo
(Polytechnique Montréal, Canada)
This paper examines the applicability of clone detectors to understanding system evolution. Specifically, it is a case study of Firefox, whose development moved from a slow release cycle to a fast release cycle two years ago. Since the transition, three times as many versions of the software have been deployed. To understand whether the changes between the newer versions are as significant as the changes between the older versions, we measured the similarity between consecutive versions. We analyzed 82 MLOC of C/C++ code to compute the overall change distribution between all existing major versions of Firefox. The results indicate a significant decrease in the overall difference between many versions in the fast release cycle. We discuss the results and highlight how differently the versions evolved in their respective release cycles. We also relate our results to other results assessing potential changes in the quality of Firefox. We conclude by raising questions about the impact of a fast release cycle.
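The core measurement can be approximated simply: compute the fraction of normalized source lines that two versions share. The Python sketch below uses line hashing as a stand-in for the clone detector actually used in the study.

```python
# Approximate version-to-version similarity via shared normalized lines.
import hashlib

def line_hashes(files):
    """Hash each whitespace-normalized, non-empty line across all files."""
    hashes = set()
    for text in files:
        for line in text.splitlines():
            norm = " ".join(line.split())
            if norm:
                hashes.add(hashlib.sha1(norm.encode()).hexdigest())
    return hashes

def version_similarity(old_files, new_files):
    old, new = line_hashes(old_files), line_hashes(new_files)
    return len(old & new) / len(old | new) if old | new else 1.0

v1 = ["int main() {\n  return 0;\n}"]
v2 = ["int main() {\n  return 1;\n}"]
print(round(version_similarity(v1, v2), 2))  # -> 0.5
```

A clone detector refines this in both directions: it tolerates renamed and lightly edited code that exact hashing misses, and it localizes where between releases the change concentrates.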
