SANER 2018
2018 IEEE 25th International Conference on Software Analysis, Evolution, and Reengineering (SANER), March 20-23, 2018, Campobasso, Italy

SANER 2018 – Proceedings


Frontmatter

Title Page


Message from the Chairs


SANER 2018 Organization


SANER 2018 Sponsors and Supporters



Keynotes

A Decade of Software Quality Analysis in Practice: Surprises, Anecdotes, and Lessons Learned (Keynote)
Elmar Juergens
(CQSE, Germany)
I implemented and ran my first clone detection on industrial software roughly a decade ago. Fueled by both the amounts of problematic code it uncovered, and the (at least partially) positive feedback from developers, our research group subsequently focused on quality analyses to improve engineering practice. Since then, our research prototypes have grown into a commercial tool employed by professional software developers around the world every day. It implements both static and dynamic analyses for over 25 programming languages and runs in development, test, and production environments of hundreds of companies. We bootstrapped our spin-off, CQSE GmbH, into a company of 30 employees (half of whom hold a PhD in Software Engineering). All of us exclusively work on, or employ as part of our audit services, software quality analyses built upon this community’s research. In this keynote, I want to share our key insights: experiences, surprises, and anecdotes. I will cover hard lessons learned on how to have an impact in real-world projects, surprising results of seemingly trivial approaches, the role of software visualizations in marketing, and our key learnings in transferring research from academia to practice.

Towards a New Digital Business Operating System: Speed, Data, Ecosystems, and Empowerment (Keynote)
Jan Bosch
(Chalmers University of Technology, Sweden)
We are living in the most exciting time in the history of mankind. The last century has seen unprecedented improvements in the quality of the human condition, and technology is at the heart of this progress. Now we are experiencing an even bigger leap as we move towards a new level of digitalization and automation. Ranging from self-driving cars to factories without workers to societal infrastructure, every sensor and actuator is becoming connected, and new applications that enable new opportunities are appearing daily. The fuel of this emerging connected, software-driven reality is software, and the key challenge is to continuously deliver value to customers. The future of software engineering in this context is centered around a new, emerging digital business operating system consisting of four dimensions: Speed, Data, Ecosystems, and Empowerment. The focus on speed is concerned with the constantly increasing rate of deploying new software in the field. This continuous integration and deployment is no longer only the purview of internet companies but is also increasingly deployed in embedded systems. Second, data is concerned with the vast amounts of information collected from systems deployed in the field and the behavior of the users of these systems. Software businesses need to significantly improve their ability to exploit the value present in that data. Third, ecosystems are concerned with the transition in many companies from doing everything in-house to strategic use of innovation partners and commodity-providing partners. Finally, we need new ways of organizing work in this new, digital age. The keynote discusses these four main developments but focuses on continuous software engineering. It also provides numerous examples from the Nordic and international industry and predicts the next steps that industry and academia need to engage in to remain competitive.

Compilers Are Sprinters – IDEs Are Marathoners (Keynote)
Peter Gromov
(JetBrains, Germany)
Compilers and IDEs both analyze source code, yet compared to IDEs, compilers are easy. Compilers process source files a module at a time; IDEs have to load entire projects. Compilers exit after each run; IDEs run constantly, requiring responsible memory management and low CPU utilization. Compilers operate in batch; IDEs must constantly, incrementally re-analyze code after each change in the editor. Compilers stop when there is an error; IDEs are expected to be even more helpful when there are errors. Compilers create intermediate representations, soon throwing away source code; IDEs must always map back to source, respecting whitespaces and resolving references to the line and column number. The talk discusses these and other challenges, and how IDEs based on IntelliJ platform attack them.


Retrospective Papers
Fri, Mar 23, 16:30 - 17:30, Aula Magna

Ten Years of JDeodorant: Lessons Learned from the Hunt for Smells
Nikolaos Tsantalis, Theodoros Chaikalis, and Alexander Chatzigeorgiou
(Concordia University, Canada; University of Macedonia, Greece)
Deodorants are different from perfumes, because they are applied directly on the body and, by killing bacteria, they reduce odours and offer a refreshing fragrance. That was our goal when we first thought about "bad smells" in code: to develop techniques for effectively identifying and removing (i.e., deodorizing) code smells from object-oriented software. JDeodorant encompasses a number of techniques for suggesting and automatically applying refactoring opportunities on Java source code, in a way that requires limited effort on behalf of the developer. In contrast to other approaches that rely on generic strategies that can be adapted to various smells, JDeodorant adopts ad-hoc strategies for each smell considering the particular characteristics of the underlying design or code problem. In this retrospective paper, we discuss the impact of JDeodorant over the last ten years and a number of tools and techniques that have been developed for a similar purpose which either compare their results with JDeodorant or have built on top of JDeodorant. Finally, we discuss the empirical findings from a number of studies that employed JDeodorant to extract their datasets.

Design Patterns Impact on Software Quality: Where Are the Theories?
Foutse Khomh and Yann-Gaël Guéhéneuc
(Polytechnique Montréal, Canada; Concordia University, Canada)
Software engineers are creators of habits. During software development, they follow again and again the same patterns when architecting, designing, and implementing programs. Alexander introduced such patterns in architecture in 1974 and, 20 years later, they made their way into software development thanks to the work of Gamma et al. Software design patterns were promoted to make the design of programs more “flexible, modular, reusable, and understandable”. However, ten years later, these patterns, their roles, and their impact on software quality were not fully understood. We then set out to study the impact of design patterns on different quality attributes and published a paper entitled “Do Design Patterns Impact Software Quality Positively?” in the proceedings of the 12th European Conference on Software Maintenance and Reengineering (CSMR) in 2008. Ten years later, this paper received the Most Influential Paper award at the 25th International Conference on Software Analysis, Evolution, and Reengineering (SANER) in 2018. In this retrospective paper for the award, we report and reflect on our and others’ studies on the impact of design patterns, discussing some key findings reported about design patterns. We also take a step back from these studies and re-examine the role that design patterns should play in software development. Finally, we outline some avenues for future research work on design patterns, e.g., the identification of the patterns really used by developers, the theories explaining the impact of patterns, or their use to raise the abstraction level of programming languages.

Benchmarks for Software Clone Detection: A Ten-Year Retrospective
Chanchal K. Roy and James R. Cordy
(University of Saskatchewan, Canada; Queen's University, Canada)
There have been a great many methods and tools proposed for software clone detection. While some work has been done on assessing and comparing the performance of these tools, very little empirical evaluation has been done. In particular, accuracy measures such as precision and recall have only been roughly estimated, due both to problems in creating a validated clone benchmark against which tools can be compared, and to the manual effort required to hand-check large numbers of candidate clones. In order to cope with this issue, over the last 10 years we have been working towards building cloning benchmarks for objectively evaluating clone detection tools. Beginning with our WCRE 2008 paper, where we conducted a modestly large empirical study with the NiCad clone detection tool, over the past ten years we have extended and grown our work to include several languages, much larger datasets, and model clones in languages such as Simulink. From a modest set of 15 C and Java systems comprising a total of 7 million lines in 2008, our work has progressed to a benchmark called BigCloneBench with eight million manually validated clone pairs in a large inter-project source dataset of more than 25,000 projects and 365 million lines of code. In this paper, we present a history and overview of software clone detection benchmarks, and review the steps we and others have taken to reach this stage. We outline a future for clone detection benchmarks and hope to encourage researchers both to use existing benchmarks and to contribute to building the benchmarks of the future.


Technical Research Papers

Program Analysis
Wed, Mar 21, 10:30 - 11:30, Aula Magna

Context Is King: The Developer Perspective on the Usage of Static Analysis Tools
Carmine Vassallo, Sebastiano Panichella, Fabio Palomba, Sebastian Proksch, Andy Zaidman, and Harald C. Gall
(University of Zurich, Switzerland; Delft University of Technology, Netherlands)
Automatic static analysis tools (ASATs) are tools that support automatic code quality evaluation of software systems with the aim of (i) avoiding and/or removing bugs and (ii) spotting design issues. Hindering their widespread acceptance are their (i) high false positive rates and (ii) low comprehensibility of the generated warnings. Researchers and ASAT vendors have proposed solutions to prioritize such warnings with the aim of guiding developers toward the most severe ones. However, none of the proposed solutions considers the development context in which an ASAT is being used to further improve the selection of relevant warnings. To shed light on the impact of such contexts on warning configuration, usage, and adopted prioritization strategies, we surveyed 42 developers (69% in industry and 31% in open source projects) and interviewed 11 industrial experts who integrate ASATs in their workflow. While we can confirm previous findings on the reluctance of developers to configure ASATs, our study highlights that (i) 71% of developers do pay attention to different warning categories depending on the development context, and (ii) 63% of our respondents rely on specific factors (e.g., team policies and composition) when prioritizing warnings to fix during their programming. Our results clearly indicate ways to better assist developers by improving existing warning selection and prioritization strategies.

Micro-clones in Evolving Software
Manishankar Mondal, Chanchal K. Roy, and Kevin A. Schneider
(University of Saskatchewan, Canada)
Detection, tracking, and refactoring of code clones (i.e., identical or nearly similar code fragments in the code-base of a software system) have been extensively investigated by a great many studies. Code clones have often been considered bad smells. While clone refactoring is important for removing code clones from the code-base, clone tracking is important for consistently updating code clones that are not suitable for refactoring. In this research we investigate the importance of micro-clones (i.e., code clones of fewer than five lines of code) in consistent updating of the code-base. While existing clone detectors and trackers have ignored micro-clones, our investigation of thousands of commits from six subject systems implies that around 80% of all consistent updates during system evolution occur in micro-clones. The percentage of consistent updates occurring in micro-clones is significantly higher than that in regular clones according to our statistical significance tests. Also, the consistent updates occurring in micro-clones can be up to 23% of all updates during the whole period of evolution. According to our manual analysis, around 83% of the consistent updates in micro-clones are non-trivial. As micro-clones require consistent updates just like regular clones, tracking or refactoring micro-clones can help us considerably minimize the effort for consistently updating such clones. Thus, micro-clones should also be taken into proper consideration when making clone management decisions.
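To make the notion concrete, a minimal sketch of finding exact micro-clones (identical windows of two to four lines, after whitespace normalization) might look as follows. This is an illustration of the concept only, not the detection technique used in the paper; all names and thresholds are hypothetical.

```python
from collections import defaultdict

def normalize(line):
    # Collapse whitespace so formatting differences do not hide duplicates.
    return " ".join(line.split())

def find_micro_clones(source, min_size=2, max_size=4):
    """Group identical windows of min_size..max_size consecutive lines."""
    lines = [normalize(l) for l in source.splitlines()]
    groups = defaultdict(list)
    for size in range(min_size, max_size + 1):
        for i in range(len(lines) - size + 1):
            window = tuple(lines[i:i + size])
            if all(window):  # skip windows containing blank lines
                groups[window].append((i + 1, i + size))  # 1-based ranges
    # A micro-clone class is a window that occurs more than once.
    return {w: locs for w, locs in groups.items() if len(locs) > 1}

code = """log.open()
log.write(msg)
log.close()
compute()
log.open()
log.write(msg)
log.close()"""
clones = find_micro_clones(code)
```

A real detector would additionally normalize identifiers and literals to catch near-miss clones; exact matching is the simplest possible starting point.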

Software Logging
Wed, Mar 21, 11:45 - 12:45, Aula Magna

SMARTLOG: Place Error Log Statement by Deep Understanding of Log Intention
Zhouyang Jia, Shanshan Li, Xiaodong Liu, Xiangke Liao, and Yunhuai Liu
(National University of Defense Technology, China; Peking University, China)
Failure-diagnosis logs can dramatically reduce system recovery time when software systems fail. Log automation tools can assist developers in writing high-quality log code. Traditional designs of log automation tools define log placement rules by extracting syntax features or summarizing code patterns. These approaches are, however, limited, since real log placements go far beyond such rules and instead follow the intention of the software code. To overcome these limitations, we design and implement SmartLog, an intention-aware log automation tool. To describe the intention of log statements, we propose the Intention Description Model (IDM). SmartLog then explores the intention of existing logs and mines log rules from equivalent intentions. We conduct experiments based on 6 real-world open-source projects. Experimental results show that SmartLog improves the accuracy of log placement by 43% and 16% compared with two state-of-the-art works. For 86 real-world patches aimed at adding logs, 57% of them can be covered by SmartLog, while the overhead of all additional logs is less than 1%.


Testing
Wed, Mar 21, 13:45 - 14:45, Aula Magna

Exploring the Integration of User Feedback in Automated Testing of Android Applications
Giovanni Grano, Adelina Ciurumelea, Sebastiano Panichella, Fabio Palomba, and Harald C. Gall
(University of Zurich, Switzerland)
The intense competition characterizing mobile application marketplaces forces developers to create and maintain high-quality mobile apps in order to ensure their commercial success and acquire new users. This has motivated the research community to propose solutions that automate the testing process of mobile apps. However, the main problem of current testing tools is that they generate redundant and random inputs that are insufficient to properly simulate human behavior, thus leaving feature and crash bugs undetected until they are encountered by users. To cope with this problem, we conjecture that information available in user reviews---which previous work showed to be effective for maintenance and evolution problems---can be successfully exploited to identify the main issues users experience while using mobile applications, e.g., GUI problems and crashes. In this paper we provide initial insights into this direction, investigating (i) what type of user feedback can actually be exploited for testing purposes, (ii) how complementary user feedback and automated testing tools are when detecting crash bugs or errors, and (iii) whether an automated system able to monitor crash-related information reported in user feedback is sufficiently accurate. Results of our study, involving 11,296 reviews of 8 mobile applications, show that user feedback can be exploited to provide contextual details about errors or exceptions detected by automated testing tools. Moreover, user feedback also helps detect bugs that would remain uncovered when relying on testing tools only. Finally, the accuracy of the proposed automated monitoring system demonstrates the feasibility of our vision, i.e., integrating user feedback into the testing process.

Structured Random Differential Testing of Instruction Decoders
Nathan Jay and Barton P. Miller
(University of Wisconsin-Madison, USA)
Decoding binary executable files is a critical facility for software analysis, including debugging, performance monitoring, malware detection, cyber forensics, and sandboxing, among other techniques. As a foundational capability, binary decoding must be consistently correct for the techniques that rely on it to be viable. Unfortunately, modern instruction sets are huge and their encodings are complex; as a result, modern binary decoders are buggy. In this paper, we present a testing methodology that automatically infers structural information for an instruction set and uses the inferred structure to efficiently generate structured-random test cases independent of the instruction set being tested. Our testing methodology includes automatic output verification using differential analysis and reassembly to generate error reports. This testing methodology requires little instruction-set-specific knowledge, allowing rapid testing of decoders for new architectures and extensions to existing ones. We have implemented our testing procedure in a tool named Fleece and used it to test multiple binary decoders (Intel XED, libopcodes, LLVM, Dyninst, and Capstone) on multiple architectures (x86, ARM, and PowerPC). Our testing efficiently covered thousands of instruction format variations for each instruction set and uncovered decoding bugs in every decoder we tested.

Clustering Support for Inadequate Test Suite Reduction
Carmen Coviello, Simone Romano, Giuseppe Scanniello, Alessandro Marchetto, Giuliano Antoniol, and Anna Corazza
(University of Basilicata, Italy; Polytechnique Montréal, Canada; Federico II University of Naples, Italy)
Regression testing is an important activity that can be expensive (e.g., for large test suites). Test suite reduction approaches speed up regression testing by removing redundant test cases. These approaches can be classified as adequate or inadequate. Adequate approaches reduce test suites so that they completely preserve the test requirements (e.g., code coverage) of the original test suites. Inadequate approaches produce reduced test suites that only partially preserve the test requirements. An inadequate approach is appealing when it leads to a greater reduction in test suite size at the expense of a small loss in fault-detection capability. We investigate a clustering-based approach for inadequate test suite reduction and compare it with well-known adequate approaches. Our investigation is founded on a public dataset and allows an exploration of trade-offs in test suite reduction. Our results, together with the guidelines defined in this research, support a more informed decision on how to balance size, coverage, and fault-detection loss of reduced test suites when using clustering.
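As an illustration of the general idea behind clustering-based inadequate reduction (not the paper's actual algorithm), one might group test cases by the similarity of their coverage sets and keep one representative per group. The greedy scheme, the Jaccard distance, and the threshold below are assumptions made for this sketch only.

```python
def jaccard_distance(a, b):
    # a, b: sets of covered code elements (e.g., statement ids).
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def reduce_suite(coverage, threshold=0.5):
    """Greedy clustering: a test is dropped if its coverage is within
    `threshold` Jaccard distance of an already-kept representative."""
    representatives = []  # list of (test_name, coverage_set)
    for name, cov in coverage.items():
        for _, rep_cov in representatives:
            if jaccard_distance(cov, rep_cov) <= threshold:
                break  # close enough to a kept test: drop it
        else:
            representatives.append((name, cov))
    return [name for name, _ in representatives]

coverage = {
    "t1": {1, 2, 3},
    "t2": {1, 2, 3, 4},  # near-duplicate of t1, gets dropped
    "t3": {10, 11},      # covers distinct behaviour, gets kept
}
reduced = reduce_suite(coverage)
```

The reduction is "inadequate" exactly because dropping t2 loses coverage of element 4; the trade-off studied in the paper is how much fault-detection capability such losses cost.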

Program Repair
Wed, Mar 21, 15:00 - 16:00, Aula Magna

Automatically Repairing Dependency-Related Build Breakage
Christian Macho, Shane McIntosh, and Martin Pinzger
(University of Klagenfurt, Austria; McGill University, Canada)
Build systems are widely used in today’s software projects to automate integration and build processes. Similar to source code, build specifications need to be maintained to avoid outdated specifications, and build breakage as a consequence. Recent work indicates that neglected build maintenance is one of the most frequently occurring reasons why open source and proprietary builds break. In this paper, we propose BuildMedic, an approach to automatically repair Maven builds that break due to dependency-related issues. Based on a manual investigation of 37 broken Maven builds in 23 open source Java projects, we derive three repair strategies to automatically repair the build, namely Version Update, Delete Dependency, and Add Repository. We evaluate the three strategies on 84 additional broken builds from the 23 studied projects in order to demonstrate the applicability of our approach. The evaluation shows that BuildMedic can automatically repair 45 of these broken builds (54%). Furthermore, in 36% of the successfully repaired build breakages, BuildMedic outputs at least one repair candidate that is considered a correct repair. Moreover, 76% of them could be repaired with only a single dependency correction.

Mining StackOverflow for Program Repair
Xuliang Liu and Hao Zhong
(Shanghai Jiao Tong University, China)
In recent years, automatic program repair has been a hot research topic in the software engineering community, and many approaches have been proposed. Although these approaches produce promising results, some researchers criticize that existing approaches are still limited in their repair capability, due to their limited repair templates. Indeed, it is quite difficult to design effective repair templates. An award-winning paper analyzes thousands of manual bug fixes, but summarizes only ten repair templates. Although more bugs are thus repaired, recent studies show such repair templates are still insufficient.
We notice that programmers often refer to Stack Overflow when they repair bugs. With years of accumulation, Stack Overflow has millions of posts that are potentially useful for repairing many bugs. This observation motivates our work towards mining repair templates from Stack Overflow. In this paper, we propose a novel approach, called SOFIX, that extracts code samples from Stack Overflow and mines repair patterns from the extracted code samples. Based on our mined repair patterns, we derived 13 repair templates. We implemented these repair templates in SOFIX and conducted evaluations on the widely used benchmark, Defects4J. Our results show that SOFIX repaired 23 bugs, which is more than existing approaches. After comparing repaired bugs and templates, we find that SOFIX repaired more bugs because it has more repair templates. In addition, our results also reveal the urgent need for better fault localization techniques.

Dissection of a Bug Dataset: Anatomy of 395 Patches from Defects4J
Victor Sobreira, Thomas Durieux, Fernanda Madeiral, Martin Monperrus, and Marcelo de Almeida Maia
(Federal University of Uberlândia, Brazil; Inria, France; University of Lille, France; KTH, Sweden)
Well-designed and publicly available datasets of bugs are an invaluable asset to advance research fields such as fault localization and program repair, as they allow direct and fair comparison between competing techniques as well as the replication of experiments. These datasets need to be deeply understood by researchers: the answers to questions like “which bugs can my technique handle?” and “for which bugs is my technique effective?” depend on the comprehension of properties related to bugs and their patches. However, such properties are usually not included in the datasets, and there is still no widely adopted methodology for characterizing bugs and patches. In this work, we deeply study 395 patches of the Defects4J dataset. Quantitative properties (patch size and spreading) were automatically extracted, whereas qualitative ones (repair actions and patterns) were manually extracted using a thematic-analysis-based approach. We found that 1) the median size of Defects4J patches is four lines, and almost 30% of the patches contain only addition of lines; 2) 92% of the patches change only one file, and 38% have no spreading at all; 3) the top-3 most applied repair actions are addition of method calls, conditionals, and assignments, occurring in 77% of the patches; and 4) nine repair patterns were found for 95% of the patches, where the most prevalent, appearing in 43% of the patches, is on conditional blocks. These results are useful for researchers to perform advanced analysis on their techniques’ results based on Defects4J. Moreover, our set of properties can be used to characterize and compare different bug datasets.
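The quantitative properties mentioned above (patch size and file spreading) can in principle be computed from unified diffs. The following is a simplified sketch of that idea, not the extraction tooling used in the study; the counting rules are assumptions for illustration.

```python
def patch_properties(diff_text):
    """Count added/removed lines and touched files in a unified diff.
    A rough sketch of 'size' and 'spreading' style measures."""
    added = removed = 0
    files = set()
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            files.add(line[4:].strip())  # one entry per patched file
        elif line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return {"added": added, "removed": removed, "files": len(files)}

diff = """--- a/Foo.java
+++ b/Foo.java
@@ -10,2 +10,3 @@
-if (x > 0)
+if (x >= 0)
+log(x);
"""
props = patch_properties(diff)
```

A patch like the one above (one file, two added lines, one removed) would count toward the single-file, small-patch categories that dominate Defects4J according to the abstract.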


Mobile Development
Wed, Mar 21, 16:30 - 17:30, Aula Magna

Detecting Third-Party Libraries in Android Applications with High Precision and Recall
Yuan Zhang, Jiarun Dai, Xiaohan Zhang, Sirong Huang, Zhemin Yang, Min Yang, and Hao Chen
(Fudan University, China; Shanghai Institute of Intelligent Electronics and Systems, China; Shanghai Institute for Advanced Communication and Data Science, China; University of California at Davis, USA)
Third-party libraries are widely used in Android applications to ease development and enhance functionalities. However, the incorporated libraries also bring new security & privacy issues to the host application, and blur the accounting of application code and library code. In this situation, a precise and reliable library detector is highly desirable. In practice, library code may be customized by developers during integration, and dead library code may be eliminated by code obfuscators during the application build process. However, existing research on library detection has not gracefully handled these problems, and thus faces severe limitations in practice.
In this paper, we propose LibPecker, an obfuscation-resilient, highly precise and reliable library detector for Android applications. LibPecker adopts signature matching to give a similarity score between a given library and an application. By fully utilizing the internal class dependencies inside a library, LibPecker generates a strict signature for each class. To tolerate library code customization and elimination as much as possible, LibPecker introduces an adaptive class similarity threshold and a weighted class similarity score when calculating library similarity. To quantitatively evaluate the precision and recall of LibPecker, we perform the first such experiment (to the best of our knowledge) with a large number of libraries and applications. Results show that LibPecker significantly outperforms the state-of-the-art tool in both recall and precision (91% and 98.1%, respectively).

Software Quality
Thu, Mar 22, 10:30 - 11:30, Aula Magna

How Do Developers Fix Issues and Pay Back Technical Debt in the Apache Ecosystem?
Georgios Digkas, Mircea Lungu, Paris Avgeriou, Alexander Chatzigeorgiou, and Apostolos Ampatzoglou
(University of Groningen, Netherlands; University of Macedonia, Greece)
During software evolution, technical debt (TD) follows a constant ebb and flow, being incurred and paid back, sometimes in the same day and sometimes ten years later. Several studies in the literature have investigated how technical debt in source code accumulates over time and the consequences of this accumulation for software maintenance. However, to the best of our knowledge there are no large-scale studies that focus on the types of issues that are fixed and the amount of TD that is paid back during software evolution. In this paper we present the results of a case study in which we analyzed the evolution of fifty-seven Java open-source software projects by the Apache Software Foundation at the temporal granularity of weekly snapshots. In particular, we focus on the amount of technical debt that is paid back and the types of issues that are fixed. The findings reveal that a small subset of all issue types is responsible for the largest percentage of TD repayment; thus, by targeting particular violations the development team can achieve higher benefits.

How Good Is Your Puppet? An Empirically Defined and Validated Quality Model for Puppet
Eduard van der Bent, Jurriaan Hage, Joost Visser, and Georgios Gousios
(Utrecht University, Netherlands; Software Improvement Group, Netherlands; Delft University of Technology, Netherlands)
Puppet is a declarative language for configuration management that has rapidly gained popularity in recent years. Numerous organizations now rely on Puppet code for deploying their software systems onto cloud infrastructures. In this paper we provide a definition of code quality for Puppet code and an automated technique for measuring and rating Puppet code quality. To this end, we first explore the notion of code quality as it applies to Puppet code by performing a survey among Puppet developers. Second, we develop a measurement model for the maintainability aspect of Puppet code quality. To arrive at this measurement model, we derive appropriate quality metrics from our survey results and from existing software quality models. We implemented the Puppet code quality model in a software analysis tool. We validate our definition of Puppet code quality and the measurement model by a structured interview with Puppet experts and by comparing the tool results with quality judgments of those experts. The validation shows that the measurement model and tool provide quality judgments of Puppet code that closely match the judgments of experts. Also, the experts deem the model appropriate and usable in practice. The Software Improvement Group (SIG) has started using the model in its consultancy practice.


Behavior and Runtime Analysis
Thu, Mar 22, 10:30 - 11:30, Room 2

Maintaining Behaviour Driven Development Specifications: Challenges and Opportunities
Leonard Peter Binamungu, Suzanne M. Embury, and Nikolaos Konstantinou
(University of Manchester, UK)
In Behaviour-Driven Development (BDD) the behaviour of a software system is specified as a set of example interactions with the system using a "Given-When-Then" structure. These examples are expressed in high level domain-specific terms, and are executable. They thus act both as a specification of requirements and as tests that can verify whether the current system implementation provides the desired behaviour or not. This approach has many advantages but also presents some problems. When the number of examples grows, BDD specifications can become costly to maintain and extend. Some teams find that parts of the system are effectively frozen due to the challenges of finding and modifying the examples associated with them. We surveyed 75 BDD practitioners from 26 countries to understand the extent of BDD use, its benefits and challenges, and specifically the challenges of maintaining BDD specifications in practice. We found that BDD is in active use amongst respondents, and that the use of domain specific terms, improving communication among stakeholders, the executable nature of BDD specifications, and facilitating comprehension of code intentions are the main benefits of BDD. The results also showed that BDD specifications suffer the same maintenance challenges found in automated test suites more generally. We map the survey results to the literature, and propose 10 research opportunities in this area.
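For readers unfamiliar with the format, a Given-When-Then example can be rendered as an ordinary test. The account scenario below is hypothetical and purely illustrative of the structure the paper discusses; in practice such examples are written in high-level domain terms and bound to step implementations.

```python
class Account:
    """Hypothetical domain object, used only to illustrate the structure."""
    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount

# Given an account with a balance of 100
account = Account(balance=100)
# When the user withdraws 30
account.withdraw(30)
# Then the remaining balance is 70
assert account.balance == 70
```

The maintenance challenge the paper surveys arises when thousands of such examples exist: changing the withdrawal behaviour means finding and updating every example that exercises it.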

Recursion Aware Modeling and Discovery for Hierarchical Software Event Log Analysis
Maikel Leemans, Wil M. P. van der Aalst, and Mark G. J. van den Brand
(Eindhoven University of Technology, Netherlands)
This paper presents 1) a novel hierarchy and recursion extension to the process tree model; and 2) the first, recursion aware process model discovery technique that leverages hierarchical information in event logs, typically available for software systems. This technique allows us to analyze the operational processes of software systems under real-life conditions at multiple levels of granularity. The work can be positioned in-between reverse engineering and process mining. An implementation of the proposed approach is available as a ProM plugin. Experimental results based on real-life (software) event logs demonstrate the feasibility and usefulness of the approach and show the huge potential to speed up discovery by exploiting the available hierarchy.

Design Analysis
Thu, Mar 22, 11:45 - 12:45, Aula Magna

Automatically Exploiting Implicit Design Knowledge When Solving the Class Responsibility Assignment Problem
Yongrui Xu, Peng Liang, and Muhammad Ali Babar
(Wuhan University, China; University of Adelaide, Australia)
Assigning responsibilities to classes is vital not only during the initial software analysis/design phases of object-oriented analysis and design (OOAD), but also during maintenance and evolution, when new responsibilities have to be assigned to classes or existing responsibilities have to be changed. Class Responsibility Assignment (CRA) is one of the most complex tasks in OOAD, as it relies heavily on designers' judgment and on implicit design knowledge (DK) about design problems. Since CRA depends so strongly on the successful use of implicit DK, (semi-)automated approaches that help designers assign responsibilities to classes should make implicit DK explicit and exploit it effectively. In this paper, we propose a learning-based approach to the CRA problem. A learning mechanism is introduced into a Genetic Algorithm (GA) to extract the implicit DK about which responsibilities have a high probability of being assigned to the same class; the extracted DK is then employed automatically to improve the design quality of the generated solutions. The proposed approach has been evaluated through an experimental study with three cases. Comparing the solutions obtained by the proposed approach with those of existing approaches shows that it significantly improves the design quality of the generated solutions to the CRA problem, and that its solutions are more likely to be accepted by developers in practice.

Defect Prediction
Thu, Mar 22, 11:45 - 12:45, Room 2

Cross-Version Defect Prediction via Hybrid Active Learning with Kernel Principal Component Analysis
Zhou Xu, Jin Liu, Xiapu Luo, and Tao Zhang
(Wuhan University, China; Hong Kong Polytechnic University, China; Harbin Engineering University, China)
As defects in software modules may cause product failures and financial loss, it is critical to use defect prediction methods to identify potentially defective modules for thorough inspection, especially in the early stages of the software development lifecycle. For an upcoming version of a software project, it is practical to employ the historical labeled defect data of prior versions within the same project to conduct defect prediction on the current version, i.e., Cross-Version Defect Prediction (CVDP). However, software development is a dynamic evolution process that may cause the data distribution (such as defect characteristics) to vary across versions. Furthermore, the raw features usually do not reveal well the intrinsic structure information behind the data. Effective CVDP is therefore challenging. In this paper, we propose a two-phase CVDP framework that combines Hybrid Active Learning and Kernel PCA (HALKP) to address these two issues. In the first phase, HALKP uses a hybrid active learning method to select some informative and representative unlabeled modules from the current version and query their labels, then merges them into the labeled modules of the prior version to form an enhanced training set. In the second phase, HALKP employs a non-linear mapping method, kernel PCA, to extract representative features by embedding the original data of the two versions into a high-dimensional space. We evaluate the HALKP framework on 31 versions of 10 projects with three prevalent performance indicators. The experimental results indicate that HALKP achieves encouraging results, with an average F-measure, g-mean and Balance of 0.480, 0.592 and 0.580, respectively, and significantly outperforms nearly all baseline methods.

Using a Probabilistic Model to Predict Bug Fixes
Mauricio Soto and Claire Le Goues
(Carnegie Mellon University, USA)
Automatic Software Repair (APR) has significant potential to reduce software maintenance costs by reducing the human effort required to localize and fix bugs. State-of-the-art generate-and-validate APR techniques select between and instantiate various mutation operators to construct candidate patches, informed largely by heuristic probability distributions. This may reduce effectiveness in terms of both efficiency and output quality. In practice, human developers have many options for how to edit code to fix bugs, some of which are far more common than others (e.g., deleting a line of code is more common than adding a new class). We mined the most recent 100 bug-fixing commits from each of the 500 most popular Java projects on GitHub (the largest such dataset to date) to create a probabilistic model describing edit distributions. We categorize, compare and evaluate the different mutation operators used in state-of-the-art approaches. We find that a probabilistic model-based APR approach patches the majority of the studied bugs more quickly, and that the resulting patches are of higher quality than those produced by previous approaches. Finally, we mine association rules for multi-edit source code changes, an understudied but important problem. We validate the association rules by analyzing how much of our corpus can be built from them. Our evaluation indicates that 84.6% of the multi-edit patches in the corpus can be built from the association rules while maintaining 90% confidence.
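The core idea of a probabilistic edit model, as described in this abstract, can be sketched as follows (a toy illustration, not the authors' implementation; the operator names are invented):

```python
import random
from collections import Counter

def build_edit_model(commit_edits):
    """Estimate P(operator) from a corpus of mined bug-fixing edits.

    commit_edits: a list of operator names, one per observed edit
    (e.g. "delete-stmt", "replace-call"). Returns {op: probability}.
    """
    counts = Counter(commit_edits)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}

def sample_operator(model, rng):
    """Draw a mutation operator according to the learned distribution,
    so that edits common among human developers are tried first."""
    ops, weights = zip(*sorted(model.items()))
    return rng.choices(ops, weights=weights, k=1)[0]

# Toy corpus: deletion is far more common than adding a class.
corpus = ["delete-stmt"] * 60 + ["replace-call"] * 30 + ["add-class"] * 10
model = build_edit_model(corpus)
print(model["delete-stmt"])  # 0.6
op = sample_operator(model, random.Random(0))
```

A generate-and-validate repair loop would then call `sample_operator` instead of drawing operators from a fixed heuristic distribution.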

Connecting Software Metrics across Versions to Predict Defects
Yibin Liu, Yanhui Li, Jianbo Guo, Yuming Zhou, and Baowen Xu
(Nanjing University, China; Tsinghua University, China)
Accurate software defect prediction can help software practitioners allocate test resources to defect-prone modules effectively and efficiently. In recent decades, much effort has been devoted to building accurate defect prediction models, including developing high-quality defect predictors and modeling techniques. However, currently widely used defect predictors, such as code metrics and process metrics, do not describe well how software modules change over the project's evolution, which we believe is important for defect prediction. To address this problem, we propose to use the Historical Version Sequence of Metrics (HVSM) across consecutive software versions as a defect predictor. Furthermore, we leverage a Recurrent Neural Network (RNN), a popular modeling technique, that takes the HVSM as input to build defect prediction models. The experimental results show that, in most cases, the proposed HVSM-based RNN model has significantly better effort-aware ranking effectiveness than the commonly used baseline models.
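Assembling a version sequence of metrics per module, as the abstract describes, can be sketched like this (an illustrative reading of the idea, with invented metrics and padding policy, not the paper's actual HVSM definition):

```python
def build_hvsm(metric_history, module, window):
    """Assemble a Historical Version Sequence of Metrics for one module:
    its metric vectors over the last `window` releases, oldest first,
    ready to feed a sequence model such as an RNN.

    metric_history: list (one entry per release, oldest first) of
    {module_name: [metric values]} dicts. Releases where the module is
    missing are padded with zero vectors so all sequences align.
    """
    width = len(next(iter(metric_history[-1].values())))
    pad = [0.0] * width
    seq = [release.get(module, pad) for release in metric_history[-window:]]
    while len(seq) < window:
        seq.insert(0, pad)
    return seq

history = [
    {"core.py": [120.0, 3.0]},                          # v1: LOC, churn
    {"core.py": [150.0, 5.0], "util.py": [40.0, 1.0]},  # v2
    {"core.py": [155.0, 2.0], "util.py": [65.0, 4.0]},  # v3
]
print(build_hvsm(history, "util.py", 3))
# [[0.0, 0.0], [40.0, 1.0], [65.0, 4.0]]
```

Each such sequence would become one input sample for the recurrent model, in contrast to single-version predictors that see only the last row.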

APIs
Thu, Mar 22, 13:45 - 14:45, Aula Magna

Classifying Stack Overflow Posts on API Issues
Md Ahasanuzzaman, Muhammad Asaduzzaman, Chanchal K. Roy, and Kevin A. Schneider
(Queen's University, Canada; University of Saskatchewan, Canada)
The design and maintenance of APIs are complex tasks due to the constantly changing requirements of their users. Despite the efforts of their designers, APIs may suffer from a number of issues (such as incomplete or erroneous documentation, poor performance, and backward incompatibility). To maintain a healthy client base, API designers must learn about these issues in order to fix them. Question-answering sites such as Stack Overflow (SO) have become a popular place for discussing API issues. Posts about API issues are invaluable to API designers, not only because they can help them learn more about the problem but also because they can reveal the requirements of API users. However, the unstructured nature of posts and the abundance of non-issue posts make detecting SO posts concerning API issues a challenging task. In this paper, we first develop a supervised learning approach using a Conditional Random Field (CRF), a statistical modeling method, to identify API issue-related sentences. We use this information, together with different features of posts and the experience of users, to build a technique, called CAPS, that can classify SO posts concerning API issues. Evaluation of CAPS using carefully curated SO posts on three popular API types reveals that the technique outperforms all three baseline approaches considered in this study. We also conduct studies to test the generalizability of CAPS's results and to understand the effects of different sources of information on it.

Why and How Java Developers Break APIs
Aline Brito, Laerte Xavier, Andre Hora, and Marco Tulio Valente
(Federal University of Minas Gerais, Brazil; Federal University of Mato Grosso do Sul, Brazil)
Modern software development depends on APIs to reuse code and increase productivity. Like most software systems, these libraries and frameworks also evolve, which may break existing clients. However, the main reasons for introducing breaking changes in APIs are unclear. In this paper, we therefore report the results of an almost four-month-long field study with the developers of 400 popular Java libraries and frameworks. We set up an infrastructure to observe all changes in these libraries and to detect breaking changes shortly after their introduction in the code. After identifying breaking changes, we asked the developers to explain the reasons behind their decision to change the APIs. During the study, we identified 59 breaking changes, confirmed by the developers of 19 projects. By analyzing the developers' answers, we find that breaking changes are mostly motivated by the need to implement new features, by the desire to make APIs simpler and with fewer elements, and by the goal of improving maintainability. We conclude by providing suggestions to language designers, tool builders, software engineering researchers and API developers.

Mining Accurate Message Formats for Service APIs
Md Arafat Hossain, Steve Versteeg, Jun Han, Muhammad Ashad Kabir, Jiaojiao Jiang, and Jean-Guy Schneider
(Swinburne University of Technology, Australia; CA Technologies, Australia)
APIs play a significant role in the sharing, utilization and integration of information and service assets for enterprises, delivering significant business value. However, the documentation of service APIs is often incomplete, ambiguous, or even nonexistent, hindering API-based application development. In this paper, we introduce an approach that automatically mines, from interaction traces and without assuming any prior knowledge, the fine-grained message formats required to define the APIs of services and applications. Our approach comprises three major steps with corresponding techniques: (1) classifying the interaction messages of a service into clusters corresponding to message types, (2) identifying the keywords of the messages in each cluster, and (3) extracting the format of each message type. We have applied our approach to network traces collected from four real services using the following application protocols: REST, SOAP, LDAP and SIP. The results show that our approach achieves much greater accuracy in extracting message formats for service APIs than current state-of-the-art approaches.
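The last of the three steps, extracting a format from a cluster of same-type messages, can be sketched with a simple position-wise generalization (a toy heuristic for illustration, not the paper's technique; the sample messages are invented):

```python
def extract_format(cluster):
    """Derive a message-format template for one cluster of same-type
    messages: token positions that are identical across the whole
    cluster are treated as keywords and kept literally; positions that
    vary become the wildcard '<*>' (a payload field)."""
    token_rows = [msg.split() for msg in cluster]
    length = min(len(row) for row in token_rows)
    template = []
    for i in range(length):
        column = {row[i] for row in token_rows}
        template.append(column.pop() if len(column) == 1 else "<*>")
    return " ".join(template)

# Toy SIP-like request lines taken from an imagined interaction trace.
cluster = [
    "INVITE sip:alice@example.com SIP/2.0",
    "INVITE sip:bob@example.com SIP/2.0",
]
print(extract_format(cluster))  # INVITE <*> SIP/2.0
```

Real protocol messages need cluster-specific keyword detection rather than pure positional alignment, which is where the first two steps of the approach come in.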

Exploring Code Bases
Thu, Mar 22, 15:00 - 16:00, Aula Magna

Mining Framework Usage Graphs from App Corpora
Sergio Mover, Sriram Sankaranarayanan, Rhys Braginton Pettee Olsen, and Bor-Yuh Evan Chang
(University of Colorado at Boulder, USA)
We investigate the problem of mining graph-based usage patterns for large, object-oriented frameworks like Android, revisiting previous approaches based on graph-based object usage models (groums). Groums are a promising way to represent usage patterns for object-oriented libraries because they simultaneously describe control flow and data dependencies between methods of multiple interacting object types. However, this expressivity comes at a cost: mining groums requires solving a subgraph isomorphism problem that is well known to be expensive. This cost limits the applicability of groum mining to large API frameworks. In this paper, we employ groum mining to learn usage patterns for object-oriented frameworks from program corpora. The central challenge is to scale groum mining so that it is sensitive to usages horizontally across programs from arbitrarily many developers (as opposed to simply usages vertically within the program of a single developer). To address this challenge, we develop a novel groum mining algorithm that scales to large corpora of programs. We first use frequent itemset mining to restrict the search for groums to smaller subsets of methods in the given corpus. Then, we pose the subgraph isomorphism problem as a SAT instance and apply efficient pre-processing algorithms to rule out fruitless comparisons ahead of time. Finally, we identify containment relationships between clusters of groums to characterize popular usage patterns in the corpus (and to classify less popular patterns as possible anomalies). We find that our approach scales to a corpus of over five hundred open-source Android applications, effectively mining obligatory and best-practice usage patterns.
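The frequent-itemset pre-filtering step mentioned in the abstract can be sketched as follows (a minimal single-pass pair miner for illustration; the Android method names are invented, and the real pipeline mines graphs, not just co-occurrence):

```python
from itertools import combinations
from collections import Counter

def frequent_method_sets(programs, min_support, size=2):
    """Find sets of API methods that co-occur in at least `min_support`
    programs, so that an expensive graph-pattern search (e.g. groum
    mining) can be restricted to these candidate method sets.

    programs: list of sets of method names, one set per program.
    """
    counts = Counter()
    for methods in programs:
        for combo in combinations(sorted(methods), size):
            counts[combo] += 1
    return {combo for combo, n in counts.items() if n >= min_support}

apps = [
    {"Cursor.moveToNext", "Cursor.close", "Log.d"},
    {"Cursor.moveToNext", "Cursor.close"},
    {"Cursor.moveToNext", "Log.d"},
]
print(frequent_method_sets(apps, min_support=2))
```

Only method sets surviving this filter would be handed to the SAT-based subgraph-isomorphism stage, which is where the real cost lies.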

A Generalized Model for Visualizing Library Popularity, Adoption, and Diffusion within a Software Ecosystem
Raula Gaikovina Kula, Coen De Roover, Daniel M. German, Takashi Ishio, and Katsuro Inoue
(NAIST, Japan; Vrije Universiteit Brussel, Belgium; University of Victoria, Canada; Osaka University, Japan)
The popularity of super-repositories such as Maven Central and CRAN is a testament to software reuse in open-source and commercial projects alike. However, several studies have highlighted the risks incurred when application developers keep dependencies on outdated library versions. Intelligent mining of super-repositories could reveal hidden trends within the corresponding software ecosystem and thereby provide valuable insights for such dependency-related decisions. In this paper, we propose the Software Universe Graph (SUG) model as a structured abstraction of the evolution of software systems and their library dependencies over time. To demonstrate the SUG's usefulness, we conduct an empirical study using 6,374 Maven artifacts and over 6,509 CRAN packages mined from their real-world ecosystems. Visualizations of the SUG model such as 'library coexistence pairings' and 'dependents diffusion' uncover popularity, adoption and diffusion patterns within each software ecosystem. Results show that the Maven ecosystem takes a more conservative approach to dependency updating than the CRAN ecosystem.

Supporting Exploratory Code Search with Differencing and Visualization
Wenjian Liu, Xin Peng, Zhenchang Xing, Junyi Li, Bing Xie, and Wenyun Zhao
(Fudan University, China; Shanghai Institute of Intelligent Electronics and Systems, China; Australian National University, Australia; Peking University, China)
Searching and reusing online code has become common practice in software development. However, two important characteristics of online code have not been carefully considered by current tool support. First, many pieces of online code are largely similar but subtly different. Second, several pieces of code may form complex relations through their differences. These two characteristics make it difficult to rank online code properly against a search query and reduce the efficiency of examining search results. In this paper, we present an exploratory online code search approach that explicitly takes into account these two characteristics of online code. Given a list of methods returned for a search query, our approach uses clone detection and code differencing techniques to analyze both the commonalities and the differences among the methods in the search results. It then produces an exploration graph that visualizes the method differences and the relationships of methods through their differences. The exploration graph allows developers to explore search results in a structured view of the different method groups present in the results, and turns implicit code differences into visual cues that help developers navigate the results. We implement our approach in a web-based tool called CodeNuance. We conduct experiments to evaluate the effectiveness of CodeNuance for examining search results, compared with ranked-list and code-clustering-based examination. We also compare the performance and user behavior of our tool against other exploratory code search tools.

Language Models
Thu, Mar 22, 16:30 - 17:15, Aula Magna

Syntax and Sensibility: Using Language Models to Detect and Correct Syntax Errors
Eddie Antonio Santos, Joshua Charles Campbell, Dhvani Patel, Abram Hindle, and José Nelson Amaral
(University of Alberta, Canada)
Syntax errors are made by novice and experienced programmers alike; however, novice programmers lack the years of experience that would help them quickly resolve these frustrating errors. Standard LR parsers are of little help, typically reporting syntax errors and their precise locations poorly. We propose a methodology that locates where syntax errors occur and suggests possible changes to the token stream that can fix the identified error. This methodology finds syntax errors by using language models trained on correct source code to find tokens that seem out of place. Fixes are synthesized by consulting the language models to determine which tokens are more likely at the estimated error location. We compare n-gram and LSTM (long short-term memory) language models for this task, each trained on a large corpus of Java code collected from GitHub. Unlike prior work, our methodology does not assume that the problematic source code comes from the same domain as the training data. We evaluated our tools against a repository of real student mistakes; they are able to find a syntactically valid fix within their top two suggestions, often producing the exact fix that the student used to resolve the error. The results show that this tool and methodology can locate and suggest corrections for syntax errors. Our methodology is of practical use to all programmers, but will be especially useful to novices frustrated by incomprehensible syntax errors.
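Using a language model to spot an out-of-place token, as this abstract describes, can be reduced to a tiny bigram sketch (an illustration of the principle only; the paper uses n-gram and LSTM models over a large Java corpus, and the token streams here are invented):

```python
from collections import Counter, defaultdict

def train_bigrams(token_streams):
    """Count token bigrams over syntactically correct token streams:
    a minimal 'language model' with n = 2."""
    follows = defaultdict(Counter)
    for tokens in token_streams:
        for prev, cur in zip(tokens, tokens[1:]):
            follows[prev][cur] += 1
    return follows

def most_surprising(tokens, follows):
    """Return the index of the token the model finds least likely given
    its predecessor: the estimated syntax-error location."""
    def prob(prev, cur):
        total = sum(follows[prev].values())
        return follows[prev][cur] / total if total else 0.0
    return min(range(1, len(tokens)),
               key=lambda i: prob(tokens[i - 1], tokens[i]))

good = [["if", "(", "x", ")", "{", "}"],
        ["if", "(", "y", ")", "{", "}"]]
model = train_bigrams(good)
broken = ["if", "(", "x", "{", "}"]    # missing ')'
print(most_surprising(broken, model))  # 3 -> the '{' right after 'x'
```

Synthesizing a fix then amounts to asking the same model which token would be most likely at the flagged position (here, ')').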

A Deep Neural Network Language Model with Contexts for Source Code
Anh Tuan Nguyen, Trong Duc Nguyen, Hung Dang Phan, and Tien N. Nguyen
(Iowa State University, USA; University of Texas at Dallas, USA)
Statistical language models (LMs) have been applied in several software engineering applications. However, they have trouble dealing with ambiguities in the names of program and API elements (classes and method calls). In this paper, inspired by the success of Deep Neural Networks (DNNs) in natural language processing, we present DNN4C, a DNN language model that complements the local context of lexical code elements with both syntactic and type contexts. We designed a context-incorporating method that uses syntactic and type annotations of source code to learn to distinguish lexical tokens appearing in different syntactic and type contexts. Our empirical evaluation on code completion for real-world projects shows that DNN4C improves top-1 accuracy by 11.6%, 16.3%, 27.1%, and 44.7% relative to the state-of-the-art language models for source code used with the same features: RNN LM, DNN LM, SLAMC, and n-gram LM, respectively. In another application, we show that DNN4C improves accuracy over an n-gram LM when migrating source code from Java to C# with a machine translation model.

Binary Analysis
Thu, Mar 22, 16:30 - 17:15, Room 2

Efficient Features for Function Matching between Binary Executables
Chariton Karamitas and Athanasios Kehagias
(CENSUS, Greece; University of Thessaloniki, Greece)
Binary diffing is the process of reverse engineering two programs, when source code is not available, in order to study their syntactic and semantic differences. For large programs, binary diffing can be performed by function matching which, in turn, is reduced to a graph isomorphism problem between the compared programs' CFGs (Control Flow Graphs) and/or CGs (Call Graphs). In this paper we provide a set of carefully chosen features, extracted from a binary's CG and CFG, which can be used by BinDiff algorithm variants to, first, build a set of initial exact matches with minimal false positives (by scanning for unique perfect matches) and, second, propagate approximate matching information using, for example, a nearest-neighbor scheme. Furthermore, we investigate the benefits of applying Markov lumping techniques to function CFGs (to our knowledge, this technique has not been previously studied). The proposed function features are evaluated in a series of experiments on various versions of the Linux kernel (Intel64), the OpenSSH server (Intel64) and Firefox's xul.dll (IA-32). Our prototype system is also compared to Diaphora, the current state-of-the-art binary diffing software.
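The two-round matching strategy this abstract outlines, exact unique matches first, then nearest-neighbor propagation, can be sketched over feature vectors as follows (a toy illustration; the function names and the three-number feature vectors, e.g. basic blocks, CFG edges, call-graph out-degree, are invented, not the paper's feature set):

```python
def match_functions(feats_a, feats_b):
    """Two-round binary function matching sketch: (1) pair functions
    whose feature vector is unique within, and identical across, both
    binaries; (2) pair the remainder greedily by nearest feature vector
    (Manhattan distance).

    feats_a / feats_b: {function_name: tuple_of_numbers}.
    """
    def unique(feats):
        seen = {}
        for name, vec in feats.items():
            seen.setdefault(vec, []).append(name)
        return {vec: names[0] for vec, names in seen.items() if len(names) == 1}

    ua, ub = unique(feats_a), unique(feats_b)
    matches = {ua[v]: ub[v] for v in ua if v in ub}   # round 1: exact

    rest_a = [n for n in feats_a if n not in matches]
    rest_b = [n for n in feats_b if n not in matches.values()]
    for a in rest_a:                                  # round 2: nearest
        if not rest_b:
            break
        b = min(rest_b, key=lambda n: sum(
            abs(x - y) for x, y in zip(feats_a[a], feats_b[n])))
        matches[a] = b
        rest_b.remove(b)
    return matches

v1 = {"f": (4, 5, 2), "g": (9, 12, 1), "h": (3, 3, 0)}
v2 = {"f2": (4, 5, 2), "g2": (9, 13, 1), "h2": (3, 3, 0)}
print(match_functions(v1, v2))  # {'f': 'f2', 'h': 'h2', 'g': 'g2'}
```

The quality of such a scheme rests entirely on how discriminating the chosen CG/CFG features are, which is the paper's actual contribution.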

Using Recurrent Neural Networks for Decompilation
Deborah S. Katz, Jason Ruchti, and Eric Schulte
(Carnegie Mellon University, USA; GrammaTech, USA)
Decompilation, recovering source code from binary, is useful in many situations where it is necessary to analyze or understand software for which source code is not available. Source code is much easier for humans to read than binary code, and there are many tools available to analyze source code. Existing decompilation techniques often generate source code that is difficult for humans to understand because the generated code often does not use the coding idioms that programmers use. Differences from human-written code also reduce the effectiveness of analysis tools on the decompiled source code.
To address the problem of differences between decompiled code and human-written code, we present a novel technique for decompiling binary code snippets using a model based on Recurrent Neural Networks. The model learns properties and patterns that occur in source code and uses them to produce decompilation output. We train and evaluate our technique on snippets of binary machine code compiled from C source code. The general approach we outline in this paper is not language-specific and requires little or no domain knowledge of a language and its properties or how a compiler operates, making the approach easily extensible to new languages and constructs. Furthermore, the technique can be extended and applied in situations to which traditional decompilers are not targeted, such as for decompilation of isolated binary snippets; fast, on-demand decompilation; domain-specific learned decompilation; optimizing for readability of decompilation; and recovering control flow constructs, comments, and variable or function names. We show that the translations produced by this technique are often accurate or close and can provide a useful picture of the snippet's behavior.

Developers' Collaboration
Fri, Mar 23, 10:30 - 11:30, Aula Magna

How Do Developers Discuss Rationale?
Rana Alkadhi, Manuel Nonnenmacher, Emitza Guzman, and Bernd Bruegge
(TU Munich, Germany; University of Zurich, Switzerland)
Developers make various decisions during software development. The rationale behind these decisions is of great importance during the evolution of long-living software systems. However, current practices for documenting rationale often fall short, and rationale remains hidden in the heads of developers or embedded in development artifacts. Capturing rationale is even more challenging in OSS projects, in which developers are geographically distributed and rely mostly on written communication channels to support and coordinate their activities. In this paper, we present an empirical study of how OSS developers discuss rationale in IRC channels and explore the possibility of automatically extracting rationale elements by analyzing the IRC messages of development teams. To achieve this, we manually analyzed 7,500 messages from three large OSS projects and identified all fine-grained elements of rationale. We evaluated various machine learning algorithms for automatically detecting and classifying rationale in IRC messages. Our results show that 1) rationale is discussed, on average, in 25% of IRC messages, 2) code committers contributed, on average, 54% of the discussed rationale, and 3) machine learning algorithms can detect rationale with 0.76 precision and 0.79 recall, and classify messages into finer-grained rationale elements with an average of 0.45 precision and 0.43 recall.

Automated Quality Assessment for Crowdsourced Test Reports of Mobile Applications
Xin Chen, He Jiang, Xiaochen Li, Tieke He, and Zhenyu Chen
(Dalian University of Technology, China; Nanjing University, China)
In crowdsourced mobile application testing, crowd workers help developers perform testing and submit test reports for unexpected behaviors. These test reports usually provide critical information for developers to understand and reproduce the bugs. However, due to the varying performance of workers and the inconvenience of editing on mobile devices, the quality of test reports can vary sharply. Developers often have to spend a significant portion of their available resources handling low-quality test reports, which heavily decreases their efficiency. In this paper, to help developers predict whether a test report should be selected for inspection within limited resources, we propose a new framework, named TERQAF, to automatically model the quality of test reports. TERQAF defines a series of quantifiable indicators to measure desirable properties of test reports and aggregates the numerical values of all indicators, using step transformation functions, to determine the quality of each report. Experiments conducted on five crowdsourced test report datasets of mobile applications show that TERQAF can correctly predict the quality of test reports with an accuracy of up to 88.06%, outperforming the baselines by up to 23.06%. The experimental results also demonstrate that all four categories of measurable indicators have positive impacts on TERQAF's evaluation of test report quality.
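The aggregation scheme described here, quantifiable indicators mapped through step transformation functions and then combined, can be sketched as follows (the indicator names, thresholds and weights are invented for illustration, not TERQAF's actual definitions):

```python
def step(value, thresholds):
    """Step transformation function: map a raw indicator value onto a
    discrete 0..len(thresholds) score (higher is better)."""
    return sum(value >= t for t in thresholds)

def report_quality(report, weights):
    """Aggregate per-indicator step scores into a single quality score
    for a crowdsourced test report."""
    scores = {
        "text_length": step(report["text_length"], (20, 80)),
        "screenshots": step(report["screenshots"], (1,)),
        "steps_to_reproduce": step(report["steps_to_reproduce"], (1, 3)),
    }
    return sum(weights[k] * v for k, v in scores.items())

report = {"text_length": 95, "screenshots": 1, "steps_to_reproduce": 4}
weights = {"text_length": 1.0, "screenshots": 2.0, "steps_to_reproduce": 1.5}
print(report_quality(report, weights))  # 2*1.0 + 1*2.0 + 2*1.5 = 7.0
```

Reports scoring below a chosen cutoff would be deprioritized, letting developers spend their inspection budget on the informative ones.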

Refactoring
Fri, Mar 23, 11:45 - 12:45, Aula Magna

The Impact of Refactoring Changes on the SZZ Algorithm: An Empirical Study
Edmilson Campos Neto, Daniel Alencar da Costa, and Uirá Kulesza
(Federal University of Rio Grande do Norte, Brazil; Instituto Federal do Rio Grande do Norte, Brazil; Queen's University, Canada)
SZZ is a widely used algorithm in the software engineering community for identifying changes that are likely to introduce bugs (i.e., bug-introducing changes). Despite its wide adoption, SZZ still has room for improvement. For example, current SZZ implementations may still flag refactoring changes as bug-introducing. Refactorings should be disregarded as bug-introducing because they do not change the system's behaviour. In this paper, we empirically investigate how refactorings impact both the input (bug-fix changes) and the output (bug-introducing changes) of the SZZ algorithm. We analyse 31,518 issues of ten Apache projects with 20,298 bug-introducing changes, using an existing tool that automatically detects refactorings in code changes. We observe that 6.5% of the lines flagged as bug-introducing changes by SZZ are in fact refactoring changes. Regarding bug-fix changes, we observe that 19.9% of the lines removed during a fix are related to refactorings and, therefore, their respective inducing changes are false positives. We then incorporate the refactoring-detection tool into our Refactoring-Aware SZZ Implementation (RA-SZZ). Our results reveal that RA-SZZ flags 20.8% fewer lines as bug-introducing compared to state-of-the-art SZZ implementations. Finally, we perform a manual analysis to identify change patterns that are not captured by the refactoring identification tool used in our study. Our results reveal that 47.95% of the analyzed bug-introducing changes contain additional change patterns that RA-SZZ should not flag as bug-introducing.
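The refactoring-aware filtering at the heart of this idea can be sketched in a few lines (a simplification for illustration: real SZZ works on version-control blame data and a real refactoring detector, and the line numbers and commit ids below are invented):

```python
def ra_szz_candidates(deleted_lines, refactoring_lines, blame):
    """Refactoring-aware variant of SZZ's core step: only lines deleted
    by the bug-fix commit that are NOT part of a detected refactoring
    are traced back (via blame) to candidate bug-introducing commits.

    deleted_lines: line numbers removed by the bug-fix commit.
    refactoring_lines: the subset flagged by a refactoring detector.
    blame: {line_number: id of the commit that last touched the line}.
    """
    return {blame[ln] for ln in deleted_lines
            if ln not in refactoring_lines and ln in blame}

deleted = [10, 11, 42]
refactorings = {10, 11}              # e.g. part of an extracted method
blame = {10: "c1", 11: "c1", 42: "c9"}
print(ra_szz_candidates(deleted, refactorings, blame))  # {'c9'}
```

Without the filter, commit `c1` would be flagged as bug-introducing even though it only refactored code, which is exactly the false-positive class the study quantifies.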

An Extensible Approach for Taming the Challenges of JavaScript Dead Code Elimination
Niels Groot Obbink, Ivano Malavolta, Gian Luca Scoccia, and Patricia Lago
(VU University Amsterdam, Netherlands; Gran Sasso Science Institute, Italy)
JavaScript is becoming the de facto programming language of the Web. Large-scale web applications (web apps) written in JavaScript are commonplace nowadays, with big technology players (e.g., Google, Facebook) using it in their core flagship products. Today, it is common practice to reuse existing JavaScript code, usually in the form of third-party libraries and frameworks. While this practice helps to speed up development, it comes with the risk of bringing in dead code, i.e., JavaScript code that is never executed but still downloaded from the network and parsed in the browser. This overhead can negatively impact the overall performance and energy consumption of the web app. In this paper we present Lacuna, an approach for JavaScript dead code elimination in which existing JavaScript analysis techniques are applied in combination. The proposed approach supports both static and dynamic analyses, is extensible, and is independent of the specifics of the JavaScript analysis techniques used. Lacuna can be applied to any JavaScript code base without imposing any constraints on developers, e.g., on their coding style or on the use of specific JavaScript features (e.g., modules). Lacuna has been evaluated on a suite of 29 publicly available web apps, comprising 15,946 JavaScript functions and built with different JavaScript frameworks (e.g., Angular, Vue.js, jQuery). Despite being a prototype, Lacuna obtained promising results in terms of analysis execution time and precision.
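The essence of dead-code detection over a call graph, regardless of whether the graph was built statically or dynamically, is a reachability sweep; a minimal sketch (the function names in the example graph are invented):

```python
def dead_functions(call_graph, entry_points):
    """Mark-and-sweep over a function call graph: any function not
    reachable from the entry points is considered dead code.

    call_graph: {function: set of functions it may call}.
    """
    reachable, stack = set(), list(entry_points)
    while stack:
        fn = stack.pop()
        if fn not in reachable:
            reachable.add(fn)
            stack.extend(call_graph.get(fn, ()))
    return set(call_graph) - reachable

graph = {
    "main": {"render", "track"},
    "render": {"escape"},
    "escape": set(),
    "track": set(),
    "legacyWidget": {"escape"},   # shipped by a library, never called
}
print(dead_functions(graph, ["main"]))  # {'legacyWidget'}
```

The hard part in JavaScript is building a trustworthy call graph in the first place (dynamic dispatch, `eval`, callbacks), which is why an approach combining multiple analysis techniques is attractive.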

Automated Refactoring of Client-Side JavaScript Code to ES6 Modules
Aikaterini Paltoglou, Vassilis E. Zafeiris, E. A. Giakoumakis, and N. A. Diamantidis
(Athens University of Economics and Business, Greece)
JavaScript (JS) is a dynamic, weakly-typed and object-based programming language whose reach has expanded, in recent years, from the desktop web browser to a wide range of runtime platforms in embedded, mobile and server hosts. Moreover, the scope of functionality implemented in JS has grown from DOM manipulation in dynamic HTML pages to full-scale applications in various domains, stressing the need for code reusability and maintainability. In this direction, the ECMAScript 6 (ES6) revision of the language standardized the syntax for class and module definitions, streamlining the encapsulation of data and functionality at various levels of granularity.
This work focuses on refactoring client-side web applications to eliminate code smells related to global variables and functions declared in JS files linked to a web page. These declarations "pollute" the global namespace at runtime and often lead to name conflicts with undesired effects. We propose a method for encapsulating global declarations through automated refactoring to ES6 modules. Our approach transforms each linked JS script of a web application into an ES6 module with appropriate import and export declarations that are inferred through static analysis. A prototype implementation of the proposed method, based on the WALA libraries, has been evaluated on a set of open-source projects. The evaluation results support the applicability and runtime efficiency of the proposed method.

Recommender Systems
Fri, Mar 23, 13:45 - 14:45, Aula Magna

Improving Developers Awareness of the Exception Handling Policy
Taiza Montenegro, Hugo Melo, Roberta Coelho, and Eiji Barbosa
(Federal University of Rio Grande do Norte, Brazil)
The exception handling policy of a system comprises the set of design rules that specify its exception handling behavior (how exceptions should be handled and thrown in the system). Such a policy is usually undocumented and implicitly defined by the system architect. Developers are usually unaware of these rules and may think that merely sprinkling the code with catch blocks adequately deals with the exceptional conditions of a system. As a consequence, exception handling code, originally designed to make a program more reliable, may itself become a source of faults (e.g., uncaught exceptions are one of the main causes of crashes in current Java applications). To mitigate this problem, we propose Exception Policy Expert (EPE), a tool embedded in the Eclipse IDE that warns developers about policy violations related to the code being edited. A case study performed in a real development context shows that the tool can indeed make the exception handling policy explicit to developers during development.

Detecting Faulty Empty Cells in Spreadsheets
Liang Xu, Shuo Wang, Wensheng Dou, Bo Yang, Chushu Gao, Jun Wei, and Tao Huang
(University of Chinese Academy of Sciences, China; Institute of Software at Chinese Academy of Sciences, China; North China University of Technology, China)
Spreadsheets play an important role in various business tasks, such as financial reports and data analysis. In spreadsheets, empty cells are widely used for different purposes, e.g., separating different tables, or representing a default value of “0”. However, a user may delete a formula unintentionally and leave a cell empty. Such ad-hoc modification may introduce a faulty empty cell that should have a formula. We observe that the context of an empty cell can help determine whether the empty cell is faulty. For example, is the empty cell next to a cell array in which all cells share the same semantics? Does the empty cell have headers similar to other non-empty cells’? In this paper, we propose EmptyCheck, to detect faulty empty cells in spreadsheets. By analyzing the context of an empty cell, EmptyCheck validates whether the cell belongs to a cell array. If it does, the empty cell is faulty, since it should contain a formula but does not. We evaluate EmptyCheck on 100 randomly sampled EUSES spreadsheets. The experimental results show that EmptyCheck can detect faulty empty cells with high precision (75.00%) and recall (87.04%). Existing techniques can detect only 4.26% of the true faulty empty cells that EmptyCheck detects.
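The cell-array intuition can be sketched as follows (an illustrative approximation only, not EmptyCheck's actual analysis, which also exploits header similarity and richer context):

```typescript
// Illustrative approximation: an empty cell is suspicious when every other
// cell in its column range carries a formula, i.e. the empty cell
// interrupts an otherwise homogeneous "cell array" of formulas.
type Cell = { formula?: string; value?: number };

function faultyEmptyCells(column: Cell[]): number[] {
  const isEmpty = (c: Cell) => c.formula === undefined && c.value === undefined;
  const nonEmpty = column.filter(c => !isEmpty(c));
  // Only treat the range as a cell array if all non-empty cells hold formulas.
  if (nonEmpty.length === 0 || !nonEmpty.every(c => c.formula !== undefined)) {
    return [];
  }
  // Empty cells inside a formula cell array likely lost their formula.
  return column
    .map((c, i) => (isEmpty(c) ? i : -1))
    .filter(i => i >= 0);
}
```

For example, a column `[{formula: "=A1*2"}, {}, {formula: "=A3*2"}]` flags the middle cell, while a column of plain values flags nothing.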

Software Security
Fri, Mar 23, 15:00 - 16:00, Aula Magna

Detection of Protection-Impacting Changes during Software Evolution
Marc-André Laverdière and Ettore Merlo
(Tata Consultancy Services, Canada; Polytechnique Montréal, Canada)
Role-Based Access Control (RBAC) is often used in web applications to restrict operations and protect security sensitive information and resources. Web applications regularly undergo maintenance and evolution, and their security may be affected by source code changes between releases. To prevent security regressions and vulnerabilities, developers have to take re-validation actions before deploying new releases. This may become a significant undertaking, especially when quick and repeated releases are sought. We define protection-impacting changes as those changed statements during evolution that alter the privilege protection of some code. We propose an automated method that identifies protection-impacting changes within all changed statements between two versions. The proposed approach compares statically computed security protection models and repository information corresponding to different releases of a system to identify protection-impacting changes. We report the occurrence of protection-impacting changes over 210 release pairs of WordPress, a PHP content management web application. First, we show that only 41% of the release pairs present protection-impacting changes. Second, for these affected release pairs, protection-impacting changes can be identified and represent a median of 47.00 lines of code, that is, 27.41% of the total changed lines of code. Over all investigated releases of WordPress, protection-impacting changes amounted to 10.89% of changed lines of code. Conversely, an average of about 89% of changed source code has no impact on RBAC security and thus needs neither re-validation nor investigation. The proposed method reduces the amount of candidate causes of protection changes that developers need to investigate. This information could help developers re-validate application security, identify causes of negative security changes, and perform repairs in a more effective way.

Mining Sandboxes: Are We There Yet?
Lingfeng Bao, Tien-Duy B. Le, and David Lo
(Singapore Management University, Singapore)
The popularity of the Android platform on mobile devices has attracted much attention from many developers and researchers, as well as malware writers. Recently, Jamrozik et al. proposed a technique to secure Android applications, referred to as mining sandboxes. They used an automated test case generation technique to explore the behavior of the app under test and then extracted the set of sensitive APIs that were called. Based on the extracted sensitive APIs, they built a sandbox that blocks access to APIs not used during testing. However, they only evaluated the proposed technique with benign apps and did not investigate whether it is effective in detecting the malicious behavior of malware that infects benign apps. Furthermore, they only investigated one test case generation tool (i.e., Droidmate) to build the sandbox, while many others have been proposed in the literature.
In this work, we complement Jamrozik et al.’s work in two ways: (1) we evaluate the effectiveness of mining sandboxes on detecting malicious behaviors; (2) we investigate the effectiveness of multiple automated test case generation tools to mine sandboxes. To investigate the effectiveness of mining sandboxes in detecting malicious behaviors, we make use of pairs of a malware sample and the benign app it infects. We build a sandbox based on the sensitive APIs called by the benign app and check whether it can identify malicious behaviors in the corresponding malware. To generate inputs to apps, we investigate five popular test case generation tools: Monkey, Droidmate, Droidbot, GUIRipper, and PUMA. We conduct two experiments to evaluate the effectiveness and efficiency of these test case generation tools on detecting malicious behavior. In the first experiment, we select 10 apps and allow the test case generation tools to run for one hour; in the second experiment, we select 102 pairs of apps and allow the test case generation tools to run for one minute. Our experiments highlight that 75.5% to 77.2% of the malware in our dataset can be uncovered by mining sandboxes, showing its power to protect Android apps. We also find that Droidbot performs best in generating test cases for mining sandboxes, and that its effectiveness can be further boosted when coupled with other test case generation tools.
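The core sandboxing idea is simple to sketch (API names below are hypothetical; the real sandboxes guard Android's sensitive APIs):

```typescript
// Minimal sketch of a mined sandbox: the sensitive APIs observed while
// testing the benign app form an allow-list; at runtime, a call to a
// sensitive API outside that list is blocked, which is how infected
// variants of the app surface.
class MinedSandbox {
  constructor(private readonly observed: Set<string>) {}

  // Returns true when the call is permitted by the mined allow-list.
  allows(api: string): boolean {
    return this.observed.has(api);
  }
}
```

A sandbox mined from a benign app that only reads location would, for instance, block an injected call such as `SmsManager.sendTextMessage`.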

DeepWeak: Reasoning Common Software Weaknesses via Knowledge Graph Embedding
Zhuobing Han, Xiaohong Li, Hongtao Liu, Zhenchang Xing, and Zhiyong Feng
(Tianjin University, China; Australian National University, Australia)
Common software weaknesses, such as improper input validation and integer overflow, can harm system security directly or indirectly, causing adverse effects such as denial of service or execution of unauthorized code. The Common Weakness Enumeration (CWE) maintains a standard list and classification of common software weaknesses. Although CWE contains rich information about software weaknesses, including textual descriptions, common consequences, and relations between software weaknesses, the current data representation, i.e., hyperlinked documents, does not support advanced reasoning tasks on software weaknesses, such as prediction of missing relations and common consequences of CWEs. Such reasoning tasks become critical to managing and analyzing large numbers of common software weaknesses and their relations. In this paper, we propose to represent common software weaknesses and their relations as a knowledge graph, and develop a translation-based, description-embodied knowledge representation learning method to embed both software weaknesses and their relations in the knowledge graph into a semantic vector space. The vector representations (i.e., embeddings) of software weaknesses and their relations can be exploited for knowledge acquisition and inference. We conduct extensive experiments to evaluate the performance of software weakness and relation embeddings in three reasoning tasks: CWE link prediction, CWE triple classification, and common consequence prediction. Our knowledge graph embedding approach outperforms other description- and/or structure-based representation learning methods.
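Translation-based embedding can be sketched with the classic TransE scoring function on which such methods build (the description-embodied part, which additionally encodes CWE text, is omitted here):

```typescript
// TransE-style plausibility score for a (head, relation, tail) triple:
// a triple is plausible when translating the head embedding by the
// relation embedding lands close to the tail, i.e. ||h + r - t|| is small.
function transEScore(h: number[], r: number[], t: number[]): number {
  let sum = 0;
  for (let i = 0; i < h.length; i++) {
    const d = h[i] + r[i] - t[i];
    sum += d * d; // accumulate squared translation error
  }
  return Math.sqrt(sum); // L2 norm; lower score = more plausible triple
}
```

Link prediction then amounts to ranking candidate tails t by this score for a given head and relation, e.g. ranking CWE entries as likely "ChildOf" targets.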


Journal-First Abstracts

Towards Just-in-Time Suggestions for Log Changes (Journal-First Abstract)
Heng Li, Weiyi Shang, Ying Zou, and Ahmed E. Hassan
(Queen's University, Canada; Concordia University, Canada)
This is an extended abstract of a paper published in the Empirical Software Engineering journal. The original paper is communicated by Arie van Deursen. The paper empirically studied why developers make log changes and proposed an automated approach to provide developers with log change suggestions as soon as they commit a code change. Through a case study on four open source projects, we found that the reasons for log changes can be grouped along four categories: block change, log improvement, dependence-driven change, and logging issue. We also found that our automated approach can effectively suggest whether a log change is needed for a code change with a balanced accuracy of 0.76 to 0.82.

Which Log Level Should Developers Choose for a New Logging Statement? (Journal-First Abstract)
Heng Li, Weiyi Shang, and Ahmed E. Hassan
(Queen's University, Canada; Concordia University, Canada)
This is an extended abstract of a paper published in the Empirical Software Engineering journal. The original paper is communicated by Mark Grechanik. The paper empirically studied how developers assign log levels to their logging statements and proposed an automated approach to help developers determine the most appropriate log level when they add a new logging statement. We analyzed the development history of four open source projects (Hadoop, Directory Server, Hama, and Qpid). We found that our automated approach can accurately suggest the levels of logging statements with an AUC of 0.75 to 0.81. We also found that the characteristics of the containing block of a newly-added logging statement, the existing logging statements in the containing source code file, and the content of the newly-added logging statement play important roles in determining the appropriate log level for that logging statement.

A Study of the Relation of Mobile Device Attributes with the User-Perceived Quality of Android Apps (Journal-First Abstract)
Ehsan Noei, Mark D. Syer, Ying Zou, Ahmed E. Hassan, and Iman Keivanloo
(Queen's University, Canada)
The number of mobile apps and the number of mobile devices have increased considerably in the past few years. To succeed in the competitive market of mobile apps, such as Google Play Store, developers should improve the user-perceived quality of their apps. In this paper, we investigate the relationship between mobile device attributes and the user-perceived quality of Android apps. We observe that the user-perceived quality of apps varies across devices. Device attributes, such as the CPU and the screen resolution, share a significant relationship with the user-perceived quality. However, having a better characteristic of an attribute, such as a higher display resolution, does not necessarily share a positive relationship with the user-perceived quality. App developers should not only consider the app attributes but also consider the device attributes of the available devices to deliver high-quality apps. The original paper is published in the Empirical Software Engineering journal communicated by Lin Tan.

How Developers Micro-Optimize Android Apps (Journal-First Abstract)
Mario Linares-Vásquez, Christopher Vendome, Michele Tufano, and Denys Poshyvanyk
(Universidad de los Andes, Colombia; College of William and Mary, USA)
Optimizing mobile apps early on in the development cycle is supposed to be a key strategy for obtaining higher user rankings, more downloads, and higher retention. However, little research has been done with respect to identifying and understanding actual optimization practices performed by developers. In this paper, we present the results of three empirical studies aimed at investigating practices of Android developers towards improving apps performance, by means of micro-optimizations.

The Relationship between Evolutionary Coupling and Defects in Large Industrial Software (Journal-First Abstract)
Serkan Kirbas, Bora Caglayan, Tracy Hall, Steve Counsell, David Bowes, Alper Sen, and Ayse Bener
(Bloomberg, UK; Boğaziçi University, Turkey; Brunel University London, UK; Ryerson University, Canada; University of Hertfordshire, UK)
In this study, we investigate the effect of evolutionary coupling (EC) on the defect-proneness of large industrial software systems and explain why the effects vary.

A Comparison Framework for Runtime Monitoring Approaches (Journal-First Abstract)
Rick Rabiser, Sam Guinea, Michael Vierhauser, Luciano Baresi, and Paul Grünbacher
(JKU Linz, Austria; Politecnico di Milano, Italy; University of Notre Dame, USA)
This extended abstract summarizes our paper entitled "A Comparison Framework for Runtime Monitoring Approaches" published in the Journal of Systems and Software, vol. 125, 2017 (https://doi.org/10.1016/j.jss.2016.12.034). This paper provides the following contributions: (i) a framework that supports analyzing and comparing runtime monitoring approaches using different dimensions and elements; (ii) an application of the framework to analyze and compare 32 existing monitoring approaches; and (iii) a discussion of perspectives and potential future applications of our framework, e.g., to support the selection of an approach for a particular monitoring problem or application context.

Modularity and Architecture of PLC-Based Software for Automated Production Systems: An Analysis in Industrial Companies (Journal-First Abstract)
Birgit Vogel-Heuser, Juliane Fischer, Stefan Feldmann, Sebastian Ulewicz, and Susanne Rösch
(TU Munich, Germany)
Adaptive and flexible production systems require modular, reusable software as a prerequisite for their long-term life cycle of up to 50 years. We introduce a benchmark process to measure software maturity for industrial control software of automated production systems.

A Mapping Study on Design-Time Quality Attributes and Metrics (Journal-First Abstract)
Elvira Maria Arvanitou, Apostolos Ampatzoglou, Alexander Chatzigeorgiou, Matthias Galster, and Paris Avgeriou
(University of Groningen, Netherlands; University of Macedonia, Greece; University of Canterbury, New Zealand)
Developing a plan for monitoring software quality is a non-trivial task, in the sense that it requires: (a) the selection of relevant quality attributes, based on application domain and development phase, and (b) the selection of appropriate metrics to quantify quality attributes. The metrics selection process is further complicated due to the availability of various metrics for each quality attribute, and the constraints that impact metric selection (e.g., development phase, metric validity, and available tools). In this paper, we shed light on the state-of-research of design-time quality attributes by conducting a mapping study. We have identified 154 papers that have been included as primary studies. The study led to the following outcomes: (a) low-level quality attributes (e.g., cohesion, coupling, etc.) are more frequently studied than high-level ones (e.g., maintainability, reusability, etc.), (b) maintainability is the most frequently examined high-level quality attribute, regardless of the application domain or the development phase, (c) assessment of quality attributes is usually performed by a single metric, rather than a combination of multiple metrics, and (d) metrics are mostly validated in an empirical setting. These outcomes are interpreted and discussed based on related work, offering useful implications to both researchers and practitioners.

Review Participation in Modern Code Review: An Empirical Study of the Android, Qt, and OpenStack Projects (Journal-First Abstract)
Patanamon Thongtanunam, Shane McIntosh, Ahmed E. Hassan, and Hajimu Iida
(University of Adelaide, Australia; McGill University, Canada; Queen's University, Canada; NAIST, Japan)
Software code review is a well-established software quality practice. Recently, Modern Code Review (MCR) has been widely adopted in both open source and proprietary projects. Our prior work shows that review participation plays an important role in MCR practices, since the amount of review participation shares a relationship with software quality. However, little is known about which factors influence review participation in the MCR process. Hence, in this study, we set out to investigate the characteristics of patches that: (1) do not attract reviewers, (2) are not discussed, and (3) receive slow initial feedback. Through a case study of 196,712 reviews spread across the Android, Qt, and OpenStack open source projects, we find that the amount of review participation in the past is a significant indicator of patches that will suffer from poor review participation. Moreover, we find that the description length of a patch shares a relationship with the likelihood of receiving poor reviewer participation or discussion, while the purpose of introducing new features can increase the likelihood of receiving slow initial feedback. Our findings suggest that the patches with these characteristics should be given more attention in order to increase review participation, which will likely lead to a more responsive review process. This paper is an extended abstract of a paper published in the Empirical Software Engineering Journal. The full article can be found at: http://dx.doi.org/10.1007/s10664-016-9452-6

Spreadsheet Guardian: An Approach to Protecting Semantic Correctness throughout the Evolution of Spreadsheets (Journal-First Abstract)
Daniel Kulesz, Verena Käfer, and Stefan Wagner
(University of Stuttgart, Germany)
We developed an approach that protects users from using faulty spreadsheets in collaborative settings. Results from an empirical evaluation with 71 spreadsheet users indicate that the approach is both helpful and easy to learn and apply.


ERA Track
Thu, Mar 22, 13:45 - 14:45, Room 2

Extracting Features from Requirements: Achieving Accuracy and Automation with Neural Networks
Yang Li, Sandro Schulze, and Gunter Saake
(Otto von Guericke University Magdeburg, Germany)
Analyzing and extracting features and variability from different artifacts is an indispensable activity to support the systematic integration of single software systems and Software Product Lines (SPLs). Beyond manually extracting variability, a variety of approaches, such as feature location in source code and feature extraction in requirements, have been proposed for automating the identification of features and their variation points. While requirements contain more complete variability information and provide traceability links to other artifacts, current techniques exhibit a lack of accuracy as well as a limited degree of automation. In this paper, we propose an unsupervised learning structure to overcome the aforementioned limitations. In particular, our technique consists of two steps: First, we apply Laplacian Eigenmaps, an unsupervised dimensionality reduction technique, to embed text requirements into compact binary codes. Second, requirements are transformed into a matrix representation by looking up a pre-trained word embedding. Then, the matrix is fed into a CNN to learn linguistic characteristics of the requirements. Furthermore, we train the CNN by matching its output with the pre-trained binary codes. Initial results show that accuracy is still limited, but that our approach allows us to automate the entire process.

OctoBubbles: A Multi-view Interactive Environment for Concurrent Visualization and Synchronization of UML Models and Code
Rodi Jolak, Khanh-Duy Le, Kaan Burak Sener, and Michel R. V. Chaudron
(Chalmers University of Technology, Sweden; Gothenburg University, Sweden; National Research University, Russia)
The process of software understanding often requires developers to consult both high- and low-level software artifacts (i.e., models and code). The creation and persistence of such artifacts often take place in different environments, and seldom in a single one. In both cases, software models and code fragments are viewable only separately, cluttering the workspace with many open interfaces and tabs. In such a situation, developers might lose the big picture and spend unnecessary effort on navigating to and locating the artifact of interest. To assist program comprehension and tackle the problem of software navigation, we present OctoBubbles, a multi-view interactive environment for concurrent visualization and synchronization of software models and code. A preliminary evaluation of OctoBubbles with 15 professional developers shows a high level of interest and points to potential benefits. Furthermore, we present a future plan to quantitatively investigate the effectiveness of the environment.

A Comparison of Software Engineering Domain Specific Sentiment Analysis Tools
Md. Rakibul Islam and Minhaz F. Zibran
(University of New Orleans, USA)
Sentiment Analysis (SA) of software engineering (SE) text has drawn immense interest recently. The poor performance of general-purpose SA tools when operated on SE text has led to the recent emergence of domain-specific SA tools especially designed for SE text. However, these domain-specific tools were tested on single datasets and their performances were compared mainly against general-purpose tools. Thus, two things remain unclear: (i) how well these tools really work on other datasets, and (ii) which tool to choose in which context. To address these concerns, we operate three recent domain-specific SA tools on three separate datasets. Using standard accuracy metrics, we compute and compare their accuracies in the detection of sentiments in SE text.

Generating Descriptions for Screenshots to Assist Crowdsourced Testing
Di Liu, Xiaofang Zhang, Yang Feng, and James A. Jones
(Soochow University, China; University of California at Irvine, USA)
Crowdsourced software testing has been shown to be capable of detecting many bugs and simulating real usage scenarios. As such, it is popular in mobile-application testing. However, in mobile testing, test reports often consist of only some screenshots and short text descriptions. Inspecting and understanding the overwhelming number of mobile crowdsourced test reports becomes a time-consuming but inevitable task. The paucity and potential inaccuracy of textual information, together with the well-defined screenshots of activity views within mobile applications, motivate us to propose a novel technique to assist developers in understanding crowdsourced test reports by automatically describing the screenshots. To reach this goal, in this paper, we propose a fully automatic technique to generate descriptive words for the well-defined screenshots. We employ the test reports written by professional testers to build up language models. We use a computer-vision technique, namely Spatial Pyramid Matching (SPM), to measure similarities and extract features from the screenshot images. The experimental results, based on more than 1000 test reports from 4 industrial crowdsourced projects, show that our proposed technique is promising for helping developers better understand mobile crowdsourced test reports.

Reconciling the Past and the Present: An Empirical Study on the Application of Source Code Transformations to Automatically Rejuvenate Java Programs
Reno Dantas, Antônio Carvalho Júnior, Diego Marcílio, Luísa Fantin, Uriel Silva, Walter Lucas, and Rodrigo Bonifácio
(University of Brasília, Brazil)
Software systems change frequently over time, either due to new business requirements or technology pressures. Programming languages evolve in a similar constant fashion, though when a language release introduces new programming constructs, older constructs and idioms might become obsolete. The coexistence between newer and older constructs leads to several problems, such as increased maintenance efforts and higher learning curve for developers. In this paper we present a Rascal Java transformation library that evolves legacy systems to use more recent programming language constructs (such as multi-catch and lambda expressions). In order to understand how relevant automatic software rejuvenation is, we submitted 2462 transformations to 40 open source projects via the GitHub pull request mechanism. Initial results show that simple transformations, for instance the introduction of the diamond operator, are more likely to be accepted than transformations that change the code substantially, such as refactoring enhanced for loops to the newer functional style.


Tool Demos

Mining
Wed, Mar 21, 10:30 - 11:30, Room 2

The Statechart Workbench: Enabling Scalable Software Event Log Analysis using Process Mining
Maikel Leemans, Wil M. P. van der Aalst, and Mark G. J. van den Brand
(Eindhoven University of Technology, Netherlands)
To understand and maintain the behavior of a (legacy) software system, one can observe and study the system's behavior by analyzing event data. For model-driven reverse engineering and analysis of system behavior, operation, and usage based on software event data, we need a combination of advanced algorithms and techniques. In this paper, we present the Statechart Workbench: a novel software behavior exploration tool. Our tool provides a rich and mature integration of advanced (academic) techniques for the analysis of behavior, performance (timings), frequency (usage), conformance, and reliability in the context of various formal models. The accompanying Eclipse plugin allows the user to interactively link all the results from the Statechart Workbench back to the source code of the system and enables users to get started right away with their own software. The work can be positioned in-between reverse engineering and process mining. Implementations, documentation, and a screencast (https://youtu.be/xR4XfU3E5mk) of the proposed approach are available, and a user study demonstrates the novelty and usefulness of the tool.

APIDiff: Detecting API Breaking Changes
Aline Brito, Laerte Xavier, Andre Hora, and Marco Tulio Valente
(Federal University of Minas Gerais, Brazil; Federal University of Mato Grosso do Sul, Brazil)
Libraries are commonly used to increase productivity. Like most software systems, they evolve over time and changes are required. However, this process may involve breaking compatibility with previous versions, leading clients to fail. In this context, it is important that library creators and clients frequently assess API stability in order to better support their maintenance practices. In this paper, we introduce APIDiff, a tool to identify API breaking and non-breaking changes between two versions of a Java library. The tool detects changes on three API elements: types, methods, and fields. We also report usage scenarios of APIDiff with four real-world Java libraries.

LICCA: A Tool for Cross-Language Clone Detection
Tijana Vislavski, Gordana Rakić, Nicolás Cardozo, and Zoran Budimac
(University of Novi Sad, Serbia; Universidad de los Andes, Colombia)
Code clones have largely been proven harmful for the development and maintenance of software systems, leading to code deterioration and an increase in bugs as the system evolves. Modern software systems are composed of several components, incorporating multiple technologies in their development. In such systems, it is common to replicate (parts of) functionality across the different components, potentially in a different programming language. The effect of these duplicates is more acute, as their identification becomes more challenging. This paper presents LICCA, a tool for the identification of duplicate code fragments across multiple languages. LICCA is integrated with the SSQSA platform and relies on its high-level representation of code, from which it is possible to extract syntactic and semantic characteristics of code fragments, enabling full cross-language clone detection. LICCA is at the technology-development stage. We demonstrate its potential by adopting a set of cloning scenarios, extended and rewritten in five characteristic languages: Java, C, JavaScript, Modula-2, and Scheme.

GoldRusher: A Miner for Rapid Identification of Hidden Code
Aleieldin Salem
(TU Munich, Germany)
GoldRusher is a dynamic analysis tool primarily meant to aid reverse engineers with analyzing malware. Based on the fact that hidden code segments rarely execute, the tool is able to rapidly highlight functions and basic blocks that are potentially hidden, and identify the trigger conditions that control their executions.

Software Evolution
Wed, Mar 21, 15:00 - 16:00, Room 2

BECLoMA: Augmenting Stack Traces with User Review Information
Lucas Pelloni, Giovanni Grano, Adelina Ciurumelea, Sebastiano Panichella, Fabio Palomba, and Harald C. Gall
(University of Zurich, Switzerland)
Mobile devices such as smartphones, tablets, and wearables are changing the way we do things, radically modifying our approach to technology. To sustain the high competition characterizing the mobile market, developers need to deliver high-quality applications in a short release cycle. To reveal and fix bugs as soon as possible, researchers and practitioners have proposed tools to automate the testing process. However, such tools generate a high number of redundant inputs, lacking contextual information and generating reports that are difficult to analyze. In this context, the content of user reviews represents an unmatched source for developers seeking defects in their applications. However, no prior work has explored the adoption of information available in user reviews for testing purposes. In this demo we present BECLoMA, a tool to enable the integration of user feedback in the testing process of mobile apps. BECLoMA links information from testing tools and user reviews, presenting to developers an augmented testing report combining stack traces with user review information referring to the same crash. We show that BECLoMA facilitates not only the diagnosis and fixing of app bugs, but also presents additional benefits: it eases the usage of testing tools and automates the analysis of user reviews from the Google Play Store.

Bring Your Own Coding Style
Naoto Ogura, Shinsuke Matsumoto, Hideaki Hata, and Shinji Kusumoto
(Osaka University, Japan; NAIST, Japan)
Coding style is a representation of source code that does not affect the behavior of program execution. The choice of coding style is purely a matter of developer preference. Inconsistency of coding style not only decreases readability but can also cause frustration during programming. In this paper, we propose a novel tool, called StyleCoordinator, to solve both of the following problems, which would appear to contradict each other: ensuring a consistent coding style for all source code managed in a repository, and ensuring the ability of developers to use their own coding styles in a local environment. In order to validate its execution performance, we apply the proposed tool to an actual software repository.

FINALIsT²: Feature Identification, Localization, and Tracing Tool
Andreas Burger and Sten Grüner
(ABB, Germany)
Feature identification and localization is a complicated and error-prone task. Nowadays it is mainly done manually by lead software developers or domain experts. Sometimes these experts are no longer available or cannot support the feature identification and localization process. We therefore propose a tool that supports this process with an iterative, semi-automatic workflow for identifying, localizing, and documenting features. Our tool calculates a feature cluster based on a defined entry point that is found by using information retrieval techniques. This feature cluster is then iteratively refined by the user. This iterative, feedback-driven workflow enables developers who are not deeply involved in the development of the software to identify and extract features properly. We evaluated our tool on an industrial smart control system for electric motors, with first promising results.

ChangeMacroRecorder: Recording Fine-Grained Textual Changes of Source Code
Katsuhisa Maruyama, Shinpei Hayashi, and Takayuki Omori
(Ritsumeikan University, Japan; Tokyo Institute of Technology, Japan)
Recording code changes has come to be well recognized as an effective means for understanding the evolution of existing programs and making their future changes efficient. Although fine-grained textual changes of source code are worth leveraging in various situations, there is no satisfactory tool that records such changes. This paper proposes yet another tool, called ChangeMacroRecorder, which automatically records all textual changes of source code while a programmer writes and modifies it in Eclipse's Java editor. Its capability has been improved with respect to both the accuracy of its recording and the convenience of its use. Tool developers can easily and cheaply create new applications that utilize recorded changes by embedding our proposed recording tool into them.

RETICULA: Real-Time Code Quality Assessment
Luigi Frunzio, Bin Lin, Michele Lanza, and Gabriele Bavota
(University of Lugano, Switzerland)
Code metrics can be used to assess the internal quality of software systems, and in particular their adherence to good design principles. While providing hints about code quality, metrics are difficult to interpret. Indeed, they take a code component as input and assess a quality attribute (e.g., code readability) by providing a number as output. However, it might be unclear to developers whether that value should be considered good or bad for the specific code at hand.
We present RETICULA (REal TIme Code qUaLity Assessment), a plugin for the IntelliJ IDE to assist developers in perceiving code quality during software development. RETICULA compares the quality metrics for a project (or a single class) under development in the IDE with those of similar open source systems (classes) previously analyzed. With the visualized results, developers can gain insights about the quality of their code.
A video illustrating the features of RETICULA can be found at: https://reticulaplugin.github.io/.
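The comparison idea behind this kind of tool can be illustrated with a small sketch (hypothetical metric values; not RETICULA's actual computation): a metric measured on the class under development is positioned within the distribution of the same metric over previously analyzed reference systems, turning a raw number into a percentile.

```python
from bisect import bisect_left

def percentile_rank(value: float, reference: list) -> float:
    # Fraction of reference measurements strictly below `value`.
    # What counts as "good" depends on the metric's orientation.
    ref = sorted(reference)
    return bisect_left(ref, value) / len(ref)

# Hypothetical readability scores of previously analyzed open source classes.
reference_scores = [0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
rank = percentile_rank(0.65, reference_scores)
assert rank == 4 / 7  # the class scores above 4 of the 7 reference classes
```

A raw score of 0.65 says little on its own; "better than roughly 57% of comparable classes" is something a developer can act on.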


Industry Track

Reengineering
Wed, Mar 21, 11:45 - 12:45, Room 2

Reengineering an Industrial HMI: Approach, Objectives, and Challenges
Bernhard Dorninger, Michael Moser, and Albin Kern
(Software Competence Center Hagenberg, Austria; ENGEL AUSTRIA, Austria)
Human Machine Interfaces (HMI) play a pivotal role in operating industrial machines. Depending on the extent of a manufacturer's domain and the range of its machines, as well as the possible options and variants, the ensuing HMI component repository may become substantially large, resulting in significant maintenance requirements and subsequent cost. A combination of cost pressure and other factors, such as significant changes of requirements, may then call for a substantial reengineering. A viable alternative to manually reengineering the whole HMI framework might be the use of (semi-)automated reengineering techniques for suitable parts. We describe such a model-based reengineering procedure, relying on static analysis of the existing source code, for suitable aspects of a large HMI framework. We sketch our overall approach, including its objectives, and highlight some important challenges of transforming HMI component information extracted from source code into a representation developed for the completely redesigned HMI infrastructure, in the light of an existing product assembly and configuration process at a large machinery manufacturer.

Model-Based Software Restructuring: Lessons from Cleaning Up COM Interfaces in Industrial Legacy Code
Dennis Dams, Arjan Mooij, Pepijn Kramer, Andrei Rădulescu, and Jaromír Vaňhara
(ESI, Netherlands; TNO, Netherlands; Thermo Fisher Scientific, Netherlands)
The high-tech industry is faced with ever-growing amounts of software to be maintained and extended. To keep the associated costs under control, there is a demand for more human overview and for large-scale code restructurings. Language technology such as parsing can assist in this, but classical restructuring tools are typically not flexible enough to accommodate the needs of specific cases.
In our research we investigate ways to make software restructuring tools customizable by software developers at Thermo Fisher Scientific as well as at other high-tech companies. We report on an industry-as-lab project, in which we have collaborated on cleaning up the compilation of COM interfaces of a large industrial software component.
As a generic result, we have identified a method that we call model-based software restructuring. The approach is to extract high-level models from the code and use them to specify and visualize the restructuring, which is then translated into low-level code transformations. To implement this approach, we integrate generic technology to develop custom solutions. We aim for semi-automation and incrementally automate recurring restructuring patterns.
The COM clean-up affected 72 type libraries and 1310 client projects with (one or more) dependencies on these type libraries. We have addressed these one type library at a time, and delivered all changes without blocking regular software development. Software developers in neighboring projects immediately noticed the very low defect rate of our restructuring. Moreover, as a spin-off, we have observed that the developed tools also start to contribute to regular software development.

Grammatical Inference from Data Exchange Files: An Experiment on Engineering Software
Markus Exler, Michael Moser, Josef Pichler, Günter Fleck, and Bernhard Dorninger
(Software Competence Center Hagenberg, Austria; Siemens, Austria)
Complex engineering problems are typically solved by running a batch of software programs. Data exchange between these programs is frequently based on semi-structured text files. These files are edited with text editors providing basic input support, but without proper input validation prior to program execution. Consequently, even minor lexical or syntactic errors cause software programs to stop without delivering a result. To tackle these problems, more specific editor support, aware of the language concepts of data exchange files, needs to be provided. In this paper, we investigate whether, and at what quality, a language grammar can be inferred from a set of existing text files, in order to provide a basis for the desired editing support. For this experiment, we chose a Minimal Adequate Teacher (MAT) method together with specific preprocessing of the existing text files. Thereby, we were able to construct complete grammar rules for most of the language constructs found in a corpus of semi-structured text files. The inferred grammar, however, requires refactoring towards a suitable and maintainable basis for the desired editor support.

Development and Testing
Fri, Mar 23, 11:45 - 12:45, Room 2

Fuzz Testing in Practice: Obstacles and Solutions
Jie Liang, Mingzhe Wang, Yuanliang Chen, Yu Jiang, and Renwei Zhang
(Tsinghua University, China; Huawei, China)
Fuzz testing has helped security researchers and organizations discover a large number of vulnerabilities. Although it is efficient and widely used in industry, hardly any empirical studies or experience reports exist on the customization of fuzzers to real industrial projects. In this paper, collaborating with engineers from Huawei, we present the practice of adapting fuzz testing to a proprietary message middleware named libmsg, which is responsible for the message transfer of the entire distributed system department. We present the main obstacles encountered in applying an efficient fuzzer to libmsg, including system configuration inconsistency, system build complexity, and the absence of fuzzing drivers, and we provide solutions for these typical obstacles. For example, for the most difficult and expensive obstacle, writing fuzzing drivers, we present a low-cost approach that converts existing sample code snippets into fuzzing drivers. After overcoming those obstacles, we can effectively identify software bugs and report 9 previously unknown vulnerabilities, including flaws that lead to denial of service or system crashes.
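The general shape of a fuzzing driver can be sketched as follows. This is a generic, hypothetical illustration (the paper targets libmsg with an industrial fuzzer; `parse_message` and the mutation loop here are stand-ins): a sample code snippet calling a library entry point becomes a driver that repeatedly feeds mutated inputs to that entry point.

```python
import random

def parse_message(data: bytes) -> str:
    # Hypothetical stand-in for a library entry point taken from sample code.
    if len(data) < 2:
        raise ValueError("message too short")
    return data.decode("ascii", errors="replace")

def fuzz_driver(entry, seed: bytes, iterations: int = 1000) -> int:
    """Mutate a seed input and count how often the entry point rejects it.
    In a real fuzzer the interesting signal is crashes and sanitizer reports,
    and mutation is guided by coverage feedback rather than blind chance."""
    rng = random.Random(0)  # deterministic for reproducibility
    failures = 0
    for _ in range(iterations):
        mutated = bytearray(seed)
        if mutated:
            mutated[rng.randrange(len(mutated))] = rng.randrange(256)
        if rng.random() < 0.1:  # occasionally shrink the input
            mutated = mutated[: rng.randrange(len(mutated) + 1)]
        try:
            entry(bytes(mutated))
        except ValueError:
            failures += 1
    return failures

failures = fuzz_driver(parse_message, b"hello world")
assert failures > 0  # shrunk mutants trigger the length check
```

The key point of the conversion approach is that the sample snippet already demonstrates how to set up and call the entry point correctly; the fuzzer only has to replace its fixed input with mutated data.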

Diggit: Automated Code Review via Software Repository Mining
Robert Chatley and Lawrence Jones
(Imperial College London, UK; GoCardless, UK)
We present Diggit, a tool to automatically generate code review comments, offering design guidance on prospective changes, based on insights gained from mining historical changes in source code repositories. We describe how the tool was built and tuned for use in practice as we integrated Diggit into the working processes of an industrial development team. We focus on the developer experience, the constraints that had to be met in adapting academic research to produce a tool that was useful to developers, and the effectiveness of the results in practice.


RENE Track

Examining Past Results
Wed, Mar 21, 13:45 - 14:45, Room 2

Duplicate Question Detection in Stack Overflow: A Reproducibility Study
Rodrigo F. G. Silva, Klérisson Paixão, and Marcelo de Almeida Maia
(Federal University of Uberlândia, Brazil)
Stack Overflow has become a fundamental element of the developer toolset. Its growing influence has been accompanied by an effort from the Stack Overflow community to maintain the quality of its content. One of the problems that jeopardizes that quality is the continuous growth of duplicated questions. To solve this problem, prior works focused on automatically detecting duplicated questions. Two important solutions are DupPredictor and Dupe. Despite reporting significant results, neither work makes its implementation publicly available, hindering subsequent studies that rely on them. We executed an empirical study as a reproduction of DupPredictor and Dupe. Our results, which were not robust when attempted with different sets of tools and data sets, show that the barriers to reproducing these approaches are high. Furthermore, when applied to more recent data, both of our reproductions show a performance decay in terms of recall-rate over time, as the number of questions increases. Our findings suggest that subsequent work on the detection of duplicated questions in Question and Answer communities requires more investigation to confirm its findings.
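At its core, this line of work ranks existing questions by textual similarity to a new question. As a minimal sketch (cosine similarity over bag-of-words vectors; the actual DupPredictor and Dupe approaches combine richer signals such as titles, topic models, and tags):

```python
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    # Bag-of-words term frequencies over lowercased word tokens.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(query: str, candidates: list) -> list:
    # Return (similarity, candidate) pairs, most similar first.
    q = vectorize(query)
    return sorted(((cosine(q, vectorize(c)), c) for c in candidates), reverse=True)

new_q = "How do I convert a string to an int in Java?"
ranked = rank_candidates(new_q, [
    "Converting a string into an integer in Java",
    "How to center a div with CSS?",
])
assert "Java" in ranked[0][1]  # the Java question ranks above the CSS one
```

The top-ranked candidates would then be shown as potential duplicates; evaluation typically measures the recall-rate, i.e., how often the true duplicate appears among the top-k suggestions.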

How Do Scientists Develop Scientific Software? An External Replication
Gustavo Pinto, Igor Wiese, and Luiz Felipe Dias
(Federal University of Pará, Brazil; Federal University of Technology Paraná, Brazil; University of São Paulo, Brazil)
Although the goal of scientists is to do science, not to develop software, many scientists have extended their roles to include software development among their skills. However, since scientists have different backgrounds, it remains unclear how they perceive software engineering practices or how they acquire software engineering knowledge. In this paper we conducted an external replication of an influential ten-year-old study of how scientists develop and use scientific software. In particular, we employed the same method (an online questionnaire) on a different population (R developers). When analyzing the more than 1,574 responses received, enriched with data gathered from the respondents' GitHub repositories, we correlated our findings with the original study. We found that the results were consistent in many ways, including: (1) scientists who develop software work mostly alone, (2) they decide themselves what they want to work on next, and (3) most of what they learned came from self-study rather than formal education. However, we also uncovered new facts, such as that some of the ''pain points'' of software development are not related to technical activities (e.g., interruptions, lack of collaborators, and lack of a reward system play a role). Our replication can help researchers, practitioners, and educators better focus their efforts on topics that are important to the scientific community that develops software.

Re-evaluating Method-Level Bug Prediction
Luca Pascarella, Fabio Palomba, and Alberto Bacchelli
(Delft University of Technology, Netherlands; University of Zurich, Switzerland)
Bug prediction aims to support developers in identifying the code artifacts most likely to be defective. Researchers have proposed prediction models that identify bug-prone methods and provided promising evidence that it is possible to operate at this level of granularity. In particular, models based on a mixture of product and process metrics, used as independent variables, led to the best results. In this study, we first replicate previous research on method-level bug prediction on different systems/timespans. Afterwards, we reflect on the evaluation strategy and propose a more realistic one. Key results of our study show that, when evaluated with the same strategy, the performance of the method-level bug prediction model is similar to what was previously reported, also for different systems/timespans. However, when evaluated with a more realistic strategy, all the models show a dramatic drop in performance, exhibiting results close to those of a random classifier. Our replication and negative results indicate that method-level bug prediction is still an open challenge.
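The difference between evaluation strategies can be sketched generically (hypothetical data; the paper's exact protocol may differ): random cross-validation mixes past and future observations, whereas a release-ordered split trains only on data that would have been available before the release under test.

```python
# Sketch: random cross-validation can leak future information into training;
# a release-ordered split trains only on releases that precede the test release.
def release_ordered_splits(releases):
    """Yield (train, test) pairs where train contains only earlier releases."""
    for i in range(1, len(releases)):
        train = [m for r in releases[:i] for m in r]
        yield train, releases[i]

# Hypothetical per-release method samples: (metric_value, is_buggy).
releases = [
    [(0.9, True), (0.1, False)],
    [(0.8, True), (0.2, False)],
    [(0.7, True), (0.3, False)],
]
splits = list(release_ordered_splits(releases))
assert len(splits) == 2
assert splits[0][0] == releases[0]  # first split trains on release 1 only
assert splits[1][1] == releases[2]  # last split tests on the newest release
```

Under such a time-aware protocol, a model can only use information that existed at prediction time, which is what makes the evaluation realistic and, as the paper reports, much harder.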

Code Smells
Fri, Mar 23, 10:30 - 11:30, Room 2

Keep It Simple: Is Deep Learning Good for Linguistic Smell Detection?
Sarah Fakhoury, Venera Arnaoudova, Cedric Noiseux, Foutse Khomh, and Giuliano Antoniol
(Washington State University, USA; Polytechnique Montréal, Canada)
Deep neural networks are a popular technique that has been applied successfully to domains such as image processing, sentiment analysis, speech recognition, and computational linguistics. Deep neural networks are machine learning algorithms that, in general, require a labeled set of positive and negative examples that are used to tune hyper-parameters and adjust model coefficients to learn a prediction function. Recently, deep neural networks have also been successfully applied to certain software engineering problem domains (e.g., bug prediction); however, in other domains (e.g., recovering links between entries in a discussion forum) they are outperformed by traditional machine learning approaches. In this paper, we report our experience in building an automatic Linguistic Antipattern Detector (LAPD) using deep neural networks. We manually build and validate an oracle of around 1,700 instances and create binary classification models using traditional machine learning approaches and Convolutional Neural Networks. Our experience is that, considering the size of the oracle, the available hardware and software, as well as the theory needed to interpret results, deep neural networks are outperformed by traditional machine learning algorithms in terms of all the evaluation metrics we used, as well as resources (time and memory). Therefore, although deep learning is reported to produce results comparable and even superior to those of human experts for certain complex tasks, it does not seem to be a good fit for simple classification tasks like smell detection. Researchers and practitioners should be careful when selecting machine learning models for the problem at hand.

Detecting Code Smells using Machine Learning Techniques: Are We There Yet?
Dario Di Nucci, Fabio Palomba, Damian A. Tamburri, Alexander Serebrenik, and Andrea De Lucia
(University of Salerno, Italy; Vrije Universiteit Brussel, Belgium; University of Zurich, Switzerland; Eindhoven University of Technology, Netherlands)
Code smells are symptoms of poor design and implementation choices that weigh heavily on the quality of the produced source code. Over the last decades, several code smell detection tools have been proposed. However, the literature shows that the results of these tools can be subjective and are intrinsically tied to the nature and approach of the detection. In a recent work, the use of Machine Learning (ML) techniques for code smell detection has been proposed, possibly solving the issue of tool subjectivity by giving a learner the ability to discern between smelly and non-smelly source code elements. While this work opened a new perspective for code smell detection, it only considered the case where each dataset used to train and test the machine learners contains instances affected by a single type of smell. In this work we replicate the study with a different dataset configuration containing instances of more than one type of smell. The results show that with this configuration the machine learning techniques exhibit critical limitations in the state of the art, which deserve further research.
