22nd ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE 2014),
November 16–21, 2014,
Hong Kong, China
Message from the Chairs
On behalf of the entire organizing team of FSE 2014, it is our great pleasure to welcome you to Hong Kong for the 22nd ACM SIGSOFT International Symposium on the Foundations of Software Engineering. The conference brings together researchers, practitioners, and educators to present and discuss the most recent innovations, trends, experiences, and challenges in software engineering. This year’s program continues the rich and exciting tradition of previous FSE meetings, but the location of the conference is particularly noteworthy: it is the first time that this major international software engineering conference is being held outside North America. Hong Kong is renowned as a lively city with a beautiful harbor and landscape, and is famous for its finance, shopping, and gourmet cuisine, blending Eastern and Western cultures.
Omlet: A Revolution against Big-Brother Social Networks (Invited Talk)
Monica S. Lam
(Stanford University, USA)
With the widespread adoption of proprietary social networks like Facebook and mobile chat platforms like WeChat, we may be heading to a future where all our communication is monetized and our online transactions are mediated by monopolistic big-data companies. This talk describes a new anti-data-monetization movement led by Omlet, an open messaging service and distributed computing platform that spun out of 4 years of research at Stanford University. With Omlet, (1) users can own their data and have it hosted on cloud services of their choice, and (2) distributed "p2p webapps" enable phones and other Internet of Things devices to interact with each other without having their communication monetized. Introduced in March 2014, Omlet is already seeing traction, as it is being distributed on millions of Android phones by Asus and other yet-to-be-announced device makers. This paradigm shift to decentralized computation not only safeguards users' data privacy but also fosters open competition and innovation, and provides an efficient and scalable foundation for handling the billions of phones and devices. Software engineering researchers can help make this a reality by making distributed mobile app development on such a platform accessible.
From Software Engineering to Software Systems (Invited Talk)
Alexander L. Wolf
(Imperial College London, UK)
I began my career in software engineering research and now find myself working more in software systems research. Is there a difference? In this talk I reflect on this question by recalling the stream of ideas, students, and colleagues that have shaped my path. I present an overview of the current projects in which I am involved to understand at a technical level where the two research communities, software engineering and software systems, connect and diverge.
Ten Years with Evidence-Based Software Engineering. What Is It? Has It Had Any Impact? What’s Next? (Invited Talk)
(Simula Research Laboratory, Norway)
An evidence-based software engineer is one who is able to: 1) formulate a question, related to a decision or judgment, so that it can be answered by the use of evidence; 2) collect, critically evaluate, and summarise relevant evidence from research, practice, and local studies; and 3) apply the evidence, integrated with knowledge about the local context, to guide decisions and judgments. The keynote reflects on what it means in practice to be evidence-based in software engineering contexts, where the number of different contexts is high and the research-based evidence sparse, and why there is a need for more evidence-based practices. We summarise our experience from ten years of evidence-based software engineering in the context of university courses, training of software engineers, and systematic literature reviews of software engineering research. While there are challenges in training people in evidence-based practice, our experience suggests that it is feasible and that the training can make an important difference in the quality of software engineering judgments and decisions. Based on our experience, we suggest changes in how evidence-based software engineering should be presented and taught, and how we should ease the transfer of research results into evidence-based practice.
Experiences Developing Tools for Developers (Invited Talk)
Software engineers are horrible customers for software tools. If they don't like your tools, they will just write their own. If your tool wastes a few minutes of a developer's day, good luck getting them to ever try your tool again. And if, after years of effort, you manage to develop tools they actually like, you are really in trouble. This is when they start building systems on top of your tools. No API? No problem! They will hack and scrape as needed to get their job done. In this talk I'll go through a number of examples of successes, non-successes, and over-successes from the past 8 years of evolving the developer infrastructure at Google. I'll highlight the challenges we faced, describe our attempts to address them, and share the lessons we learned.
Are You Getting Traction? Tales from the Tech Transfer Trenches (Invited Talk)
(Samsung Electronics, USA)
So you have developed a new software productivity tool, written an FSE or an ICSE paper about it, and are justifiably proud of your work. If you work for a company, your (curmudgeonly) manager now wants to see its “impact” on the business. This is the part where you have to convince someone else to use your shiny new tool in their day-to-day work, or ship it as a product. But you soon realize that getting traction with developers or product managers is significantly harder than the research itself. Sound familiar?
In the past several years, I have been involved in taking a variety of software productivity tools to various constituencies within a company: internal users, product teams, and service delivery teams. In this talk, I will share my experiences in interacting with these constituencies; sometimes successful experiences, but at other times not so successful ones. I will focus broadly on tools in two areas: bug finding and test automation. I will make some observations on when tech transfer works and when it stumbles.
Helping and Understanding Developers
Developers’ Code Context Models for Change Tasks
Thomas Fritz, David C. Shepherd, Katja Kevic, Will Snipes, and Christoph Bräunlich
(University of Zurich, Switzerland; ABB Research, USA)
To complete a change task, software developers spend a substantial amount of time navigating code to understand the relevant parts. During this investigation phase, they implicitly build context models of the elements and relations that are relevant to the task. Through an exploratory study with twelve developers completing change tasks in three open source systems, we identified important characteristics of these context models and how they are created. In a second empirical analysis, we further examined our findings on data collected from eighty developers working on a variety of change tasks on open and closed source projects. Our studies uncovered, amongst other results, that code context models are highly connected, structurally and lexically, that developers start tasks using a combination of search and navigation and that code navigation varies substantially across developers. Based on these findings we identify and discuss design requirements to better support developers in the initial creation of code context models. We believe this work represents a substantial step in better understanding developers' code navigation and providing better tool support that will reduce time and effort needed for change tasks.
Software Developers’ Perceptions of Productivity
André N. Meyer, Thomas Fritz, Gail C. Murphy, and Thomas Zimmermann
(University of Zurich, Switzerland; University of British Columbia, Canada; Microsoft Research, USA)
The better the software development community becomes at creating software, the more software the world seems to demand. Although there is a large body of research about measuring and investigating productivity from an organizational point of view, there is a paucity of research about how software developers, those at the front-line of software construction, think about, assess and try to improve their productivity. To investigate software developers' perceptions of software development productivity, we conducted two studies: a survey with 379 professional software developers to help elicit themes and an observational study with 11 professional software developers to investigate emergent themes in more detail. In both studies, we found that developers perceive their days as productive when they complete many or big tasks without significant interruptions or context switches. Yet, the observational data we collected shows our participants performed significant task and activity switching while still feeling productive. We analyze such apparent contradictions in our findings and use the analysis to propose ways to better support software developers in reflecting on and improving their productivity through the development of new tools and the sharing of best practices.
Enablers, Inhibitors, and Perceptions of Testing in Novice Software Teams
Raphael Pham, Stephan Kiesling, Olga Liskin, Leif Singer, and Kurt Schneider
(Leibniz Universität Hannover, Germany; University of Victoria, Canada)
There are many different approaches to testing software, with different benefits for software quality and the development process. Yet, it is not well understood what developers struggle with when getting started with testing - and why some do not test at all or not as much as would be good for their project. This missing understanding keeps us from improving processes and tools to help novices adopt proper testing practices. We conducted a qualitative study with 97 computer science students. Through interviews, we explored their experiences and attitudes regarding testing in a collaborative software project. We found enabling and inhibiting factors for testing activities, the different testing strategies they used, and novices’ perceptions of and attitudes toward testing. Students push test automation to the end of the project, thus robbing themselves of the advantages of having a test suite during implementation. Students were not convinced of the return on investment in automated tests and opted for laborious manual tests - which they often regretted in the end. Understanding such challenges and opportunities novices face when confronted with adopting testing can help us improve testing processes, company policies, and tools. Our findings provide recommendations that can enable organizations to facilitate the adoption of testing practices by their members.
Feedback Generation for Performance Problems in Introductory Programming Assignments
Sumit Gulwani, Ivan Radiček, and Florian Zuleger
(Microsoft Research, USA; Vienna University of Technology, Austria)
Providing feedback on programming assignments manually is a tedious, error-prone, and time-consuming task. In this paper, we motivate and address the problem of generating feedback on performance aspects in introductory programming assignments. We studied a large number of functionally correct student solutions to introductory programming assignments and observed: (1) There are different algorithmic strategies, with varying levels of efficiency, for solving a given problem. These different strategies merit different feedback. (2) The same algorithmic strategy can be implemented in countless different ways, which are not relevant for reporting feedback on the student program. We propose a lightweight programming language extension that allows a teacher to define an algorithmic strategy by specifying certain key values that should occur during the execution of an implementation. We describe a dynamic analysis based approach to test whether a student's program matches a teacher's specification. Our experimental results illustrate the effectiveness of both our specification language and our dynamic analysis. On one of our benchmarks, consisting of 2316 functionally correct implementations of 3 programming problems, we identified 16 strategies that we were able to describe using our specification language (in 95 minutes, after inspecting 66, or around 3%, of the implementations). Our dynamic analysis correctly matched each implementation with its corresponding specification, thereby automatically producing the intended feedback.
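To illustrate the flavour of the approach, here is a minimal sketch in Python: the prefix-sum problem, the function names, and the subsequence check are illustrative assumptions, not the paper's teacher-facing language extension or its dynamic analysis.

```python
def teacher_key_values(xs):
    """Key values for a hypothetical 'running total' strategy on prefix sums."""
    total = 0
    for x in xs:
        total += x
        yield total                      # the intermediate totals characterize the strategy

def student_prefix_sums(xs, observe):
    """A functionally correct student solution, instrumented via observe()."""
    out, acc = [], 0
    for x in xs:
        acc = acc + x
        observe(acc)                     # the dynamic analysis records every computed value
        out.append(acc)
    return out

def matches_strategy(student, spec, test_input):
    observed = []
    student(test_input, observed.append)
    expected = list(spec(test_input))
    it = iter(observed)                  # key values must appear, in order, in the trace
    return all(any(v == o for o in it) for v in expected)

print(matches_strategy(student_prefix_sums, teacher_key_values, [3, 1, 4, 1, 5]))   # True
```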
Debugging and Refactoring
Test Case Purification for Improving Fault Localization
Jifeng Xuan and Martin Monperrus
(INRIA, France; University of Lille, France)
Finding and fixing bugs are time-consuming activities in software development. Spectrum-based fault localization aims to identify the faulty position in source code based on the execution trace of test cases. Failing test cases and their assertions form test oracles for the failing behavior of the system under analysis. In this paper, we propose the novel concept of spectrum-driven test case purification for improving fault localization. The goal of test case purification is to separate existing test cases into small fractions (called purified test cases) and to enhance the test oracles to further localize faults. Combined with an original fault localization technique (e.g., Tarantula), test case purification results in a better ranking of the program statements. Our experiments on 1800 faults in six open-source Java programs show that test case purification can effectively improve existing fault localization techniques.
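The following toy sketch illustrates only the splitting step, on a test represented as a list of source lines; the representation, the one-assertion-per-test rule, and the Account example are assumptions for illustration and omit the paper's further slicing of purified tests.

```python
def purify(test_body):
    """Split a test (list of source lines) into purified tests with one assertion each."""
    assertion_idx = [i for i, line in enumerate(test_body)
                     if line.lstrip().startswith("assert")]
    purified = []
    for keep in assertion_idx:
        # keep every non-assertion statement plus exactly one of the assertions
        purified.append([line for i, line in enumerate(test_body)
                         if i not in assertion_idx or i == keep])
    return purified

original_test = [
    "acc = Account(100)",
    "acc.withdraw(30)",
    "assert acc.balance == 70",
    "acc.withdraw(80)",
    "assert acc.balance == 70   # overdraft must be rejected",
]
for n, body in enumerate(purify(original_test)):
    print(f"--- purified test {n} ---")
    print("\n".join(body))
```

Each purified test then contributes its own, finer-grained spectrum to the base fault localization technique.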
Automatically Generated Patches as Debugging Aids: A Human Study
Yida Tao, Jindae Kim, Sunghun Kim, and Chang Xu
(Hong Kong University of Science and Technology, China; Nanjing University, China)
Recent research has made significant progress in automatic patch generation, an approach to repairing programs with little or no manual intervention. However, direct deployment of auto-generated patches remains difficult, for reasons such as patch quality variations and developers' intrinsic resistance.
In this study, we take one step back and investigate a more feasible application scenario of automatic patch generation, that is, using generated patches as debugging aids. We recruited 95 participants for a controlled experiment, in which they performed debugging tasks with the aid of either buggy locations (i.e., the control group), or generated patches of varied qualities. We observe that: a) high-quality patches significantly improve debugging correctness; b) such improvements are more obvious for difficult bugs; c) when using low-quality patches, participants' debugging correctness drops to an even lower point than that of the control group; d) debugging time is significantly affected not by debugging aids, but by participant type and the specific bug to fix. These results highlight that the benefits of using generated patches as debugging aids are contingent upon the quality of the patches. Our qualitative analysis of participants' feedback further sheds light on how generated patches can be improved and better utilized as debugging aids.
A Foundation for Refactoring C with Macros
Jeffrey L. Overbey, Farnaz Behrang, and Munawar Hafiz
(Auburn University, USA)
This paper establishes the concept of "preprocessor dependences" as a foundation for building automated refactoring tools that transform source code containing lexical macros and conditional compilation directives, such as those provided by the C preprocessor. We define a preprocessor dependence graph (PPDG) that models the relationships among macro definitions, macro invocations, and conditional compilation directives in a file: the relationships that must be maintained for the semantics of the C preprocessor to be preserved. For many refactorings, a tool can construct a PPDG from the code before and after it is transformed, then perform a linear-time comparison of the two graphs to determine whether the refactoring will operate correctly in the presence of macros and conditional compilation directives. The proposed technique was implemented in OpenRefactory/C and tested by applying refactorings to GNU Coreutils version 8.21. Empirical results indicate that the technique is effective; it successfully handled refactoring scenarios in which Eclipse CDT, Visual Assist X, and XRefactory all refactored code incorrectly.
Vector Abstraction and Concretization for Scalable Detection of Refactorings
Narcisa Andreea Milea, Lingxiao Jiang, and Siau-Cheng Khoo
(National University of Singapore, Singapore; Singapore Management University, Singapore)
Automated techniques have been proposed to either identify refactoring opportunities (i.e., code fragments that can be but have not yet been restructured in a program), or reconstruct historical refactorings (i.e., code restructuring operations that have happened between different versions of a program). In this paper, we propose a new technique that can detect both refactoring opportunities and historical refactorings in large code bases. The key to our technique is the design of vector abstraction and concretization operations that can encode code changes induced by certain refactorings as characteristic vectors. Thus, the problem of identifying refactorings can be reduced to the problem of identifying matching vectors, which can be solved efficiently. We have implemented our technique for Java. The prototype is applied to 200 bundle projects from the Eclipse ecosystem containing 4.5 million lines of code, and reports in total more than 32K instances of 17 types of refactoring opportunities, taking 25 minutes on average for each type. The prototype is also applied to 14 versions of 3 smaller programs (JMeter, Ant, XML-Security), and detects (1) more than 2.8K refactoring opportunities within individual versions with a precision of about 87%, and (2) more than 190 historical refactorings across consecutive versions of the programs with a precision of about 92%.
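As a rough illustration of the vector idea (not the authors' tool, which targets Java and specific refactoring types), one can encode fragments as counts of syntax-tree node types and compare the difference vector of a before/after pair against the characteristic vector of a known refactoring:

```python
import ast
from collections import Counter

def vectorize(code):
    """Characteristic vector: counts of AST node types in a code fragment."""
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(code)))

def delta(before, after):
    """Change vector induced by an edit: after-counts minus before-counts."""
    b, a = vectorize(before), vectorize(after)
    return {k: a.get(k, 0) - b.get(k, 0)
            for k in set(a) | set(b) if a.get(k, 0) != b.get(k, 0)}

before = "def area(r):\n    return 3.14159 * r * r\n"
after  = "PI = 3.14159\ndef area(r):\n    return PI * r * r\n"

# a detector would match this change vector against the characteristic vectors
# of known refactoring types (here, something like 'extract constant')
print(delta(before, after))
```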
FlowTwist: Efficient Context-Sensitive Inside-Out Taint Analysis for Large Codebases
Johannes Lerch, Ben Hermann, Eric Bodden, and Mira Mezini
(TU Darmstadt, Germany; Fraunhofer SIT, Germany)
Over the past years, widely used platforms such as the Java Class Library have been under constant attack through vulnerabilities that involve a combination of two taint-analysis problems: an integrity problem allowing attackers to trigger sensitive operations within the platform, and a confidentiality problem allowing the attacker to retrieve sensitive information or pointers from the results of those operations. While existing static taint analyses are good at solving either of those problems, we show that they scale prohibitively badly when being applied to situations that require the exploitation of both an integrity and confidentiality problem in combination. The main problem is the huge attack surface of libraries such as the Java Class Library, which exposes thousands of methods potentially controllable by an attacker.
In this work we thus present FlowTwist, a novel taint-analysis approach that works inside-out, i.e., tracks data flows from potentially vulnerable calls to the outer level of the API which the attacker might control. This inside-out analysis requires a careful, context-sensitive coordination of both a backward and a forward taint analysis. In this work, we present the design of the analysis approach, based on the IFDS algorithm, and explain several extensions to IFDS that enable not only this coordination but also helpful reporting of error situations to security analysts.
Experiments with the Java Class Library show that, while a simple forward taint-analysis approach does not scale even with much machine power, FlowTwist's algorithm is able to fully analyze the library within 10 minutes.
ORBS: Language-Independent Program Slicing
David Binkley, Nicolas Gold, Mark Harman, Syed Islam, Jens Krinke, and Shin Yoo
(Loyola University Maryland, USA; University College London, UK)
Current slicing techniques cannot handle systems written in multiple programming languages. Observation-Based Slicing (ORBS) is a language-independent slicing technique capable of slicing multi-language systems, including systems which contain (third party) binary components. A potential slice obtained through repeated statement deletion is validated by observing the behaviour of the program: if the slice and original program behave the same under the slicing criterion, the deletion is accepted. The resulting slice is similar to a dynamic slice. We evaluate five variants of ORBS on ten programs of different sizes and languages showing that it is less expensive than similar existing techniques. We also evaluate it on bash and four other systems to demonstrate feasible large-scale operation in which a parallelised ORBS needs up to 82% less time when using four threads. The results show that an ORBS slicer is simple to construct, effective at slicing, and able to handle systems written in multiple languages without specialist analysis tools.
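The core delete-and-observe loop can be sketched in a few lines; the single-line deletion window, the Python subject, and the run() harness below are simplifications of ORBS, which deletes windows of several lines and works across languages and binary components.

```python
import subprocess, tempfile, os

def run(lines):
    """Execute the candidate program and return its observable output (or None on failure)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write("\n".join(lines))
        path = f.name
    try:
        proc = subprocess.run(["python3", path], capture_output=True, text=True, timeout=5)
        return proc.stdout if proc.returncode == 0 else None
    except subprocess.TimeoutExpired:
        return None
    finally:
        os.unlink(path)

def orbs_slice(lines):
    expected = run(lines)                 # behaviour observed at the slicing criterion
    changed = True
    while changed:                        # iterate deletions to a fixed point
        changed = False
        for i in range(len(lines)):
            candidate = lines[:i] + lines[i + 1:]
            if run(candidate) == expected:   # deletion preserved the observed behaviour
                lines, changed = candidate, True
                break
    return lines

program = [
    "x = 2",
    "y = 10          # irrelevant to the criterion",
    "z = x * x",
    "print(z)        # slicing criterion: the value printed here",
]
print("\n".join(orbs_slice(program)))
```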
JSAI: A Static Analysis Platform for JavaScript
Vineeth Kashyap, Kyle Dewey, Ethan A. Kuefner, John Wagner, Kevin Gibbons, John Sarracino, Ben Wiedermann, and Ben Hardekopf
(University of California at Santa Barbara, USA; Harvey Mudd College, USA)
A Path-Sensitively Sliced Control Flow Graph
Joxan Jaffar and Vijayaraghavan Murali
(National University of Singapore, Singapore)
We present a new graph representation of programs with specified target variables. These programs are intended to be processed by third-party applications querying target variables such as testers and verifiers. The representation embodies two concepts. First, it is path-sensitive in the sense that multiple nodes representing one program point may exist so that infeasible paths can be excluded. Second, and more importantly, it is sliced with respect to the target variables. This key step is founded on a novel idea introduced in this paper, called "Tree Slicing", and on the fact that slicing is more effective when there is path sensitivity. Compared to the traditional Control Flow Graph (CFG), the new graph may be bigger (due to path-sensitivity) or smaller (due to slicing). We show that it is not much bigger in practice, if at all. The main result, however, concerns its quality: third-party testers and verifiers perform substantially better on the new graph compared to the original CFG.
Mining Software Repositories
Let's Talk About It: Evaluating Contributions through Discussion in GitHub
Jason Tsay, Laura Dabbish, and James Herbsleb
(Carnegie Mellon University, USA)
Open source software projects often rely on code contributions from a wide variety of developers to extend the capabilities of their software. Project members evaluate these contributions and often engage in extended discussions to decide whether to integrate changes. These discussions have important implications for project management regarding new contributors and evolution of project requirements and direction. We present a study of how developers in open work environments evaluate and discuss pull requests, a primary method of contribution in GitHub, analyzing a sample of extended discussions around pull requests and interviews with GitHub developers. We found that developers raised issues around contributions over both the appropriateness of the problem that the submitter attempted to solve and the correctness of the implemented solution. Both core project members and third-party stakeholders discussed and sometimes implemented alternative solutions to address these issues. Different stakeholders also influenced the outcome of the evaluation by eliciting support from different communities such as dependent projects or even companies. We also found that evaluation outcomes may be more complex than simply acceptance or rejection. In some cases, although a submitter's contribution was rejected, the core team fulfilled the submitter's technical goals by implementing an alternative solution. We found that the level of a submitter's prior interaction on a project changed how politely developers discussed the contribution and the nature of proposed alternative solutions.
A Large Scale Study of Programming Languages and Code Quality in Github
Baishakhi Ray, Daryl Posnett, Vladimir Filkov, and Premkumar Devanbu
(University of California at Davis, USA)
What is the effect of programming languages on software quality? This question has been a topic of much debate for a very long time. In this study, we gather a very large data set from GitHub (729 projects, 80 million SLOC, 29,000 authors, 1.5 million commits, in 17 languages) in an attempt to shed some empirical light on this question. This reasonably large sample size allows us to use a mixed-methods approach, combining multiple regression modeling with visualization and text analytics, to study the effect of language features such as static vs. dynamic typing and strong vs. weak typing on software quality. By triangulating findings from different methods, and controlling for confounding effects such as team size, project size, and project history, we report that language design does have a significant, but modest, effect on software quality. Most notably, it does appear that strong typing is modestly better than weak typing, and among functional languages, static typing is also somewhat better than dynamic typing. We also find that functional languages are somewhat better than procedural languages. It is worth noting that these modest effects arising from language design are overwhelmingly dominated by process factors such as project size, team size, and commit size. However, we hasten to caution the reader that even these modest effects might quite possibly be due to other, intangible process factors, e.g., the preference of certain personality types for functional, static, and strongly typed languages.
Mining Preconditions of APIs in Large-Scale Code Corpus
Hoan Anh Nguyen, Robert Dyer, Tien N. Nguyen, and Hridesh Rajan
(Iowa State University, USA)
Modern software relies on existing application programming interfaces (APIs) from libraries. Formal specifications for the APIs enable many software engineering tasks as well as help developers correctly use them. In this work, we mine large-scale repositories of existing open-source software to derive potential preconditions for API methods. Our key idea is that APIs’ preconditions appear frequently in an ultra-large code corpus with a large number of API usages, while project-specific conditions occur less frequently. First, we find all client methods invoking APIs. We then compute a control dependence relation from each call site and mine the potential conditions used to reach those call sites. We use these guard conditions as a starting point to automatically infer the preconditions for each API. We analyzed almost 120 million lines of code from SourceForge and Apache projects to infer preconditions for the standard Java Development Kit (JDK) library. The results show that our technique can achieve high accuracy, with recall of 75–80% and precision of 82–84%. We also found 5 preconditions missing from human-written specifications. They were all confirmed by a specification expert. In a user study, participants found 82% of the mined preconditions to be a good starting point for writing specifications. Using our mining results, we also built a benchmark of more than 4,000 precondition-related bugs.
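A highly simplified sketch of the frequency-based step follows; the call-site corpus, the API names, and the 50% threshold are fabricated for illustration, and the real technique derives guards from control dependences and normalizes and merges conditions before ranking them.

```python
from collections import Counter, defaultdict

# (api, guard conditions on the path to the call site), one entry per call site
call_sites = [
    ("String.charAt", {"index >= 0", "index < s.length()"}),
    ("String.charAt", {"index < s.length()", "index >= 0", "log.isDebugEnabled()"}),
    ("String.charAt", {"index >= 0", "index < s.length()"}),
    ("String.charAt", set()),                     # an unguarded (possibly buggy) use
    ("List.get",      {"i >= 0", "i < list.size()"}),
    ("List.get",      {"i < list.size()"}),
]

def mine_preconditions(sites, threshold=0.5):
    per_api = defaultdict(list)
    for api, guards in sites:
        per_api[api].append(guards)
    result = {}
    for api, guard_sets in per_api.items():
        counts = Counter(g for guards in guard_sets for g in guards)
        n = len(guard_sets)
        # keep conditions that guard a sufficiently high fraction of the call sites
        result[api] = sorted(g for g, c in counts.items() if c / n >= threshold)
    return result

for api, pres in mine_preconditions(call_sites).items():
    print(api, "=>", pres)
```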
Automatic Mining of Specifications from Invocation Traces and Method Invariants
Ivo Krka, Yuriy Brun, and Nenad Medvidovic
(Google, Switzerland; University of Massachusetts, USA; University of Southern California, USA)
Software library documentation often describes individual methods' APIs, but not the intended protocols and method interactions. This can lead to library misuse, and restrict runtime detection of protocol violations and automated verification of software that uses the library. Specification mining, if accurate, can help mitigate these issues, which has led to significant research into new model-inference techniques that produce FSM-based models from program invariants and execution traces. However, there is currently a lack of empirical studies that, in a principled way, measure the impact of the inference strategies on model quality. To this end, we identify four such strategies and systematically study the quality of the models they produce for nine off-the-shelf libraries. We find that (1) using invariants to infer an initial model significantly improves model quality, increasing precision by 4% and recall by 41%, on average; (2) effective invariant filtering is crucial for quality and scalability of strategies that use invariants; and (3) using traces in combination with invariants greatly improves robustness to input noise. We present our empirical evaluation, implement new and extend existing model-inference techniques, and make public our implementations, ground-truth models, and experimental data. Our work can lead to higher-quality model inference, and directly improve the techniques and tools that rely on model inference.
Formal Methods and Verification
Counterexample Guided Abstraction Refinement of Product-Line Behavioural Models
Maxime Cordy, Patrick Heymans, Axel Legay, Pierre-Yves Schobbens, Bruno Dawagne, and Martin Leucker
(University of Namur, Belgium; INRIA, France; University of Lübeck, Germany)
The model-checking problem for Software Product Lines (SPLs) is harder than for single systems: variability constitutes a new source of complexity that exacerbates the state-explosion problem. Abstraction techniques have successfully alleviated state explosion in single-system models. However, they need to be adapted to SPLs, to take into account the set of variants that produce a counterexample. In this paper, we apply CEGAR (Counterexample-Guided Abstraction Refinement) and we design new forms of abstraction specifically for SPLs. We carry out experiments to evaluate the efficiency of our new abstractions. The results show that our abstractions, combined with an appropriate refinement strategy, hold the potential to achieve large reductions in verification time, although they sometimes perform worse. We discuss in which cases a given abstraction should be used.
Powering the Static Driver Verifier using Corral
Akash Lal and Shaz Qadeer
(Microsoft Research, India; Microsoft Research, USA)
The application of software-verification technology towards building realistic bug-finding tools requires working through several precision-scalability tradeoffs. For instance, a critical aspect while dealing with C programs is to formally define the treatment of pointers and the heap. A machine-level modeling is often intractable, whereas one that leverages high-level information (such as types) can be inaccurate. Another tradeoff is modeling integer arithmetic. Ideally, all arithmetic should be performed over bitvector representations, whereas the current practice in most tools is to use mathematical integers for scalability. A third tradeoff, in the context of bounded program exploration, is to choose a bound that ensures high coverage without overwhelming the analysis. This paper works through these three tradeoffs when we applied Corral, an SMT-based verifier, inside Microsoft's Static Driver Verifier (SDV). Our decisions were guided by experimentation on a large set of drivers; the total verification time was well over a month. We justify that each of our decisions was crucial in getting value out of Corral and led to Corral being accepted as the engine that powers SDV in the Windows 8.1 release, replacing the SLAM engine that had been used inside SDV for the past decade.
Verifying CTL-Live Properties of Infinite State Models using an SMT Solver
Amirhossein Vakili and Nancy A. Day
(University of Waterloo, Canada)
The ability to create and analyze abstract models is an important step in conquering software complexity. In this paper, we show that it is practical to verify dynamic properties of infinite state models expressed in a subset of CTL directly using an SMT solver without iteration, abstraction, or human intervention. We call this subset CTL-Live and it consists of the operators of CTL expressible using the least fixed point operator of the mu-calculus, which are commonly considered liveness properties (e.g., AF, AU). We show that using this method the verification of an infinite state model can sometimes complete more quickly than verifying a finite version of the model. We also examine modelling techniques to represent abstract models in first-order logic that facilitate this form of model checking.
Efficient Runtime-Enforcement Techniques for Policy Weaving
Richard Joiner, Thomas Reps, Somesh Jha, Mohan Dhawan, and Vinod Ganapathy
(University of Wisconsin-Madison, USA; GrammaTech, USA; IBM Research, India; Rutgers University, USA)
Policy weaving is a program-transformation technique that rewrites a program so that it is guaranteed to be safe with respect to a stateful security policy. It utilizes (i) static analysis to identify points in the program at which policy violations might occur, and (ii) runtime checks inserted at such points to monitor policy state and prevent violations from occurring. The promise of policy weaving stems from the possibility of blending the best aspects of static and dynamic analysis components. Therefore, a successful instantiation of policy weaving requires a careful balance and coordination between the two.
Techniques for Improving Regression Testing in Continuous Integration Development Environments
Sebastian Elbaum, Gregg Rothermel, and John Penix
(University of Nebraska-Lincoln, USA; Google, USA)
In continuous integration development environments, software engineers frequently integrate new or changed code with the mainline codebase. This can reduce the amount of code rework that is needed as systems evolve and speed up development time. While continuous integration processes traditionally require that extensive testing be performed following the actual submission of code to the codebase, it is also important to ensure that enough testing is performed prior to code submission to avoid breaking builds and delaying the fast feedback that makes continuous integration desirable. In this work, we present algorithms that make continuous integration processes more cost-effective. In an initial pre-submit phase of testing, developers specify modules to be tested, and we use regression test selection techniques to select a subset of the test suites for those modules that render that phase more cost-effective. In a subsequent post-submit phase of testing, where dependent modules as well as changed modules are tested, we use test case prioritization techniques to ensure that failures are reported more quickly. In both cases, the techniques we utilize are novel, involving algorithms that are relatively inexpensive and do not rely on code coverage information, two requirements for conducting testing cost-effectively in this context. To evaluate our approach, we conducted an empirical study on a large data set from Google that we make publicly available. The results of our study show that our selection and prioritization techniques can each lead to cost-effectiveness improvements in the continuous integration process.
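As a sketch of coverage-free pre-submit selection, the snippet below picks the suites of the declared modules and of their transitive dependents; the dependency graph and suite names are invented, and the paper's algorithms additionally exploit test execution and failure histories rather than dependencies alone.

```python
from collections import deque

# module -> modules that directly depend on it
reverse_deps = {
    "core":    ["net", "storage"],
    "net":     ["rpc"],
    "storage": ["rpc"],
    "rpc":     [],
}
suites = {m: [f"{m}_tests"] for m in reverse_deps}

def affected_modules(changed):
    seen, queue = set(changed), deque(changed)
    while queue:                               # walk the reverse dependency graph
        m = queue.popleft()
        for dep in reverse_deps.get(m, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def select_presubmit(changed):
    return sorted(s for m in affected_modules(changed) for s in suites[m])

print(select_presubmit(["storage"]))   # ['rpc_tests', 'storage_tests']
```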
Balancing Trade-Offs in Test-Suite Reduction
August Shi, Alex Gyori, Milos Gligoric, Andrey Zaytsev, and Darko Marinov
(University of Illinois at Urbana-Champaign, USA)
Regression testing is an important activity but can get expensive for large test suites. Test-suite reduction speeds up regression testing by identifying and removing redundant tests based on a given set of requirements. Traditional research on test-suite reduction is rather diverse but most commonly shares three properties: (1) requirements are defined by a coverage criterion such as statement coverage; (2) the reduced test suite has to satisfy all the same requirements as the original test suite; and (3) the quality of the reduced test suites is measured on the software version on which the reduction is performed. These properties make it hard for test engineers to decide how to use reduced test suites. We address all three properties of traditional test-suite reduction: (1) we evaluate test-suite reduction with requirements defined by killed mutants; (2) we evaluate inadequate reduction that does not require reduced test suites to satisfy all the requirements; and (3) we propose evolution-aware metrics that evaluate the quality of the reduced test suites across multiple software versions. Our evaluations allow a more thorough exploration of trade-offs in test-suite reduction, and our evolution-aware metrics show how the quality of reduced test suites can change after the version where the reduction is performed. We compare the trade-offs among various reductions on 18 projects with a total of 261,235 tests over 3,590 commits and a cumulative history spanning 35 years of development. Our results help test engineers make a more informed decision about balancing size, coverage, and fault-detection loss of reduced test suites.
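A toy version of reduction with killed mutants as requirements is shown below; the kill matrix is fabricated, and setting the adequacy parameter below 1.0 yields the kind of inadequate reduction the paper evaluates.

```python
kills = {                          # test -> set of mutants it kills
    "t1": {"m1", "m2", "m3"},
    "t2": {"m2", "m4"},
    "t3": {"m4", "m5"},
    "t4": {"m5"},
}

def reduce_suite(kills, adequacy=1.0):
    all_mutants = set().union(*kills.values())
    target = adequacy * len(all_mutants)
    covered, selected = set(), []
    while len(covered) < target:
        # greedily pick the test that kills the most not-yet-covered mutants
        best = max(kills, key=lambda t: len(kills[t] - covered))
        gain = kills[best] - covered
        if not gain:
            break
        selected.append(best)
        covered |= gain
    return selected, sorted(covered)

print(reduce_suite(kills, adequacy=1.0))   # (['t1', 't3'], ['m1', 'm2', 'm3', 'm4', 'm5'])
print(reduce_suite(kills, adequacy=0.6))   # (['t1'], ['m1', 'm2', 'm3'])
```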
Identifying the Characteristics of Vulnerable Code Changes: An Empirical Study
Amiangshu Bosu, Jeffrey C. Carver, Munawar Hafiz, Patrick Hilley, and Derek Janni
(University of Alabama, USA; Auburn University, USA; Providence College, USA; Lewis & Clark College, USA)
To focus the efforts of security experts, the goals of this empirical study are to analyze which security vulnerabilities can be discovered by code review, identify characteristics of vulnerable code changes, and identify characteristics of developers likely to introduce vulnerabilities. Using a three-stage manual and automated process, we analyzed 267,046 code review requests from 10 open source projects and identified 413 Vulnerable Code Changes (VCC). Some key results include: (1) code review can identify common types of vulnerabilities; (2) while more experienced contributors authored the majority of the VCCs, the less experienced contributors' changes were 1.8 to 24 times more likely to be vulnerable; (3) the likelihood of a vulnerability increases with the number of lines changed; and (4) modified files are more likely to contain vulnerabilities than new files. Knowing which code changes are more prone to contain vulnerabilities may allow a security expert to concentrate on a smaller subset of submitted code changes. Moreover, we recommend that projects should: (a) create or adapt secure coding guidelines, (b) create a dedicated security review team, (c) ensure detailed comments during review to help knowledge dissemination, and (d) encourage developers to make small, incremental changes rather than large changes.
Improving Recommender Systems
On the Localness of Software
Zhaopeng Tu, Zhendong Su, and Premkumar Devanbu
(University of California at Davis, USA)
The n-gram language model, which has its roots in statistical natural language processing, has been shown to successfully capture the repetitive and predictable regularities (“naturalness”) of source code, and help with tasks such as code suggestion, porting, and designing assistive coding devices. However, we show in this paper that this natural-language-based model fails to exploit a special property of source code: localness. We find that human-written programs are localized: they have useful local regularities that can be captured and exploited. We introduce a novel cache language model that consists of both an n-gram and an added “cache” component to exploit localness. We show empirically that the additional cache component greatly improves the n-gram approach by capturing the localness of software, as measured by both cross-entropy and suggestion accuracy. Our model’s suggestion accuracy is actually comparable to a state-of-the-art, semantically augmented language model; but it is simpler and easier to implement. Our cache language model requires nothing beyond lexicalization, and thus is applicable to all programming languages.
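The interpolation at the heart of the cache idea can be sketched as follows; the bigram model, tokenization, smoothing, and mixture weight are deliberately naive stand-ins for the paper's n-gram and cache components.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    counts, ctx = defaultdict(Counter), Counter()
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
        ctx[a] += 1
    return lambda a, b: (counts[a][b] + 1) / (ctx[a] + 1000)   # crudely smoothed bigram model

def cache_prob(local_tokens, b, window=200):
    recent = Counter(local_tokens[-window:])                   # the "cache": recently seen tokens
    return recent[b] / max(1, sum(recent.values()))

def mixed_prob(ngram, local_tokens, a, b, lam=0.7):
    # interpolate the global n-gram estimate with the local cache estimate
    return lam * ngram(a, b) + (1 - lam) * cache_prob(local_tokens, b)

corpus = "for i in range ( n ) : total = total + i".split()
ngram = train_bigram(corpus * 50)

local = "for j in range ( m ) : count = count + j".split()
print(mixed_prob(ngram, local, "count", "+"))   # boosted by the local cache component
print(mixed_prob(ngram, local, "total", "+"))   # relies mostly on the global model
```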
Learning Natural Coding Conventions
Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton
(University of Edinburgh, UK; University College London, UK; Microsoft Research, USA)
Every programmer has a characteristic style, ranging from preferences about identifier naming to preferences about object relationships and design patterns. Coding conventions define a consistent syntactic style, fostering readability and hence maintainability. When collaborating, programmers strive to obey a project’s coding conventions. However, one third of reviews of changes contain feedback about coding conventions, indicating that programmers do not always follow them and that project members care deeply about adherence. Unfortunately, programmers are often unaware of coding conventions because inferring them requires a global view, one that aggregates the many local decisions programmers make and identifies emergent consensus on style. We present NATURALIZE, a framework that learns the style of a codebase, and suggests revisions to improve stylistic consistency. NATURALIZE builds on recent work in applying statistical natural language processing to source code. We apply NATURALIZE to suggest natural identifier names and formatting conventions. We present four tools focused on ensuring natural code during development and release management, including code review. NATURALIZE achieves 94% accuracy in its top suggestions for identifier names. We used NATURALIZE to generate 18 patches for 5 open source projects: 14 were accepted.
How Should We Measure Functional Sameness from Program Source Code? An Exploratory Study on Java Methods
Yoshiki Higo and Shinji Kusumoto
(Osaka University, Japan)
Program source code is one of the main targets of software engineering research. A wide variety of research has been conducted on source code, and many studies have leveraged structural, vocabulary, and method signature similarities to measure the functional sameness of source code. In this research, we conducted an empirical study to ascertain how we should use these three similarities to measure functional sameness. We used two large datasets and measured the three similarities between all the method pairs in the datasets, each of which included approximately 15 million Java method pairs. The relationships between the three similarities were analyzed to determine how we should use each to detect functionally similar code. The results of our study revealed the following. (1) Method names are not always useful for detecting functionally similar code. Only when a small number of methods share a given name are those methods likely to include functionally similar code. (2) Existing file-level, method-level, and block-level clone detection techniques often miss functionally similar code generated by copy-and-paste operations between different projects. (3) When we use structural similarity to detect functionally similar code, we obtain many false positives; however, most of them can be avoided by using vocabulary similarity in addition to structural similarity. (4) Using vocabulary similarity to detect functionally similar code is not suitable for method pairs in the same file, because such method pairs use many of the same program elements, such as private methods or private fields.
The Plastic Surgery Hypothesis
Earl T. Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro
(University College London, UK; University of Massachusetts, USA; University of California at Davis, USA)
Recent work on genetic-programming-based approaches to automatic program patching has relied on the insight that the content of new code can often be assembled out of fragments of code that already exist in the code base. This insight has been dubbed the plastic surgery hypothesis; successful, well-known automatic repair tools such as GenProg rest on this hypothesis, but it has never been validated. We formalize and validate the plastic surgery hypothesis and empirically measure the extent to which raw material for changes actually already exists in projects. In this paper, we mount a large-scale study of several large Java projects, and examine a history of 15,723 commits to determine the extent to which these commits are graftable, i.e., can be reconstituted from existing code, and find an encouraging degree of graftability, surprisingly independent of commit size and type of commit. For example, we find that changes are 43% graftable from the exact version of the software being changed. With a view to investigating the difficulty of finding these grafts, we study the abundance of such grafts in three possible sources: the immediately previous version, prior history, and other projects. We also examine the contiguity or chunking of these grafts, and the degree to which grafts can be found in the same file. Our results are quite promising and suggest an optimistic future for automatic program patching methods that search for raw material in already extant code in the project being patched.
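One simple way to approximate graftability, sketched below, is the fraction of a change's added lines that already occur verbatim in the current snapshot; the paper works with snippets rather than raw lines and also searches prior history and other projects, so this is only an illustration of the measurement idea.

```python
def graftability(added_lines, codebase_files):
    """Fraction of non-blank added lines that already occur somewhere in the snapshot."""
    pool = {line.strip() for text in codebase_files
            for line in text.splitlines() if line.strip()}
    added = [l.strip() for l in added_lines if l.strip()]
    if not added:
        return 0.0
    grafted = sum(1 for l in added if l in pool)
    return grafted / len(added)

snapshot = [
    "def close(self):\n    if self.fp is not None:\n        self.fp.close()\n",
    "def reset(self):\n    self.buf = []\n",
]
change = [
    "    if self.fp is not None:",     # can be grafted from the existing code
    "        self.fp.flush()",         # genuinely new
]
print(f"{graftability(change, snapshot):.0%} of the added lines are graftable")   # 50%
```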
Concurrency and Parallelism
Grail: Context-Aware Fixing of Concurrency Bugs
Peng Liu, Omer Tripp, and Charles Zhang
(Wuhan University, China; IBM Research, USA; Hong Kong University of Science and Technology, China)
Writing efficient synchronization for multithreaded programs is notoriously hard. The resulting code often contains subtle concurrency bugs. Even worse, many bug fixes introduce new bugs. A classic example, seen widely in practice, is deadlocks resulting from fixing of an atomicity violation. These complexities have motivated the development of automated fixing techniques. Current techniques generate fixes that are typically conservative, giving up on available parallelism. Moreover, some of the techniques cannot guarantee the correctness of a fix, and may introduce deadlocks similarly to manual fixes, whereas techniques that ensure correctness do so at the expense of even greater performance loss. We present Grail, a novel fixing algorithm that departs from previous techniques by simultaneously providing both correctness and optimality guarantees. Grail synthesizes bug-free yet optimal lock-based synchronization. To achieve this, Grail builds an analysis model of the buggy code that is both contextual, distinguishing different aliasing contexts to ensure efficiency, and global, accounting for the entire synchronization behavior of the involved threads to ensure correctness. Evaluation of Grail on 12 bugs from popular codebases confirms its practical advantages, especially compared with existing techniques: Grail patches are, in general, at least 40% more efficient than the patches produced by other techniques, and incur only 2% overhead.
AI: A Lightweight System for Tolerating Concurrency Bugs
Mingxing Zhang, Yongwei Wu, Shan Lu, Shanxiang Qi, Jinglei Ren, and Weimin Zheng
(Tsinghua University, China; University of Wisconsin-Madison, USA; University of Illinois at Urbana-Champaign, USA)
Concurrency bugs are notoriously difficult to eradicate during software testing because of their non-deterministic nature. Moreover, fixing concurrency bugs is time-consuming and error-prone. Thus, tolerating concurrency bugs during production runs is an attractive complementary approach to bug detection and testing. Unfortunately, existing bug-tolerating tools are usually either 1) constrained in the types of bugs they can handle or 2) reliant on a roll-back mechanism, which hitherto could not be achieved efficiently without hardware support. This paper presents a novel program invariant, called Anticipating Invariant (AI), which can help anticipate bugs before any irreversible changes are made. Benefiting from this ability to anticipate bugs beforehand, our software-only system is able to forestall failures with a simple thread-stalling technique, which does not rely on execution roll-back and hence has good performance. Experiments with 35 real-world concurrency bugs demonstrate that AI is capable of detecting and tolerating most types of concurrency bugs, including both atomicity and order violations. Two new bugs have been detected and confirmed by the corresponding developers. Performance evaluation with 6 representative parallel programs shows that AI incurs negligible overhead (<1%) for many nontrivial desktop and server applications.
Retrofitting Concurrency for Android Applications through Refactoring
Yu Lin, Cosmin Radoi, and Danny Dig
(University of Illinois at Urbana-Champaign, USA; Oregon State University, USA)
Running compute-intensive or blocking I/O operations in the UI event thread of smartphone apps can severely degrade responsiveness. Despite the fact that Android supports writing concurrent code via AsyncTask, we know little about how developers use AsyncTask to improve responsiveness. To understand how AsyncTask is used/underused/misused in practice, we first conduct a formative study using a corpus of the top 104 most popular open-source Android apps comprising 1.34M SLOC. Our study shows that even though half of the apps use AsyncTask, there are hundreds of places where they missed opportunities to encapsulate long-running operations in AsyncTask. Second, 46% of the usages are manually refactored. However, the refactored code contains concurrency bugs (such as data races) and performance bugs (concurrent code still executes sequentially). Inspired by these findings, we designed, developed, and evaluated Asynchronizer, an automated refactoring tool that enables developers to extract long-running operations into AsyncTask. Asynchronizer uses a points-to static analysis to determine the safety of the transformation. Our empirical evaluation shows that Asynchronizer is (i) highly applicable, (ii) accurate, (iii) safer than manual refactoring, (iv) it saves development effort, and (v) its results have been accepted by the open-source developers. This shows that Asynchronizer is useful.
Sherlock: Scalable Deadlock Detection for Concurrent Programs
Mahdi Eslamimehr and Jens Palsberg
(University of California at Los Angeles, USA)
We present a new technique to find real deadlocks in concurrent programs that use locks. For 4.5 million lines of Java, our technique found almost twice as many real deadlocks as four previous techniques combined. Among those, 33 deadlocks happened after more than one million computation steps, including 27 new deadlocks. We first use a known technique to find 1275 deadlock candidates and then we determine that 146 of them are real deadlocks. Our technique combines previous work on concolic execution with a new constraint-based approach that iteratively drives an execution towards a deadlock candidate.
Self Adaptation and Repair / Program Analysis Applications
Search-Based Synthesis of Equivalent Method Sequences
Alberto Goffi, Alessandra Gorla, Andrea Mattavelli, Mauro Pezzè, and Paolo Tonella
(University of Lugano, Switzerland; Saarland University, Germany; Fondazione Bruno Kessler, Italy)
Software components are usually redundant, since their interfaces offer different operations that are equivalent in their functional behavior. Several reliability techniques exploit this redundancy to either detect or tolerate faults in software. Metamorphic testing, for instance, executes pairs of sequences of operations that are expected to produce equivalent results, and identifies faults in case of mismatching outcomes. Some popular fault tolerance and self-healing techniques execute redundant operations in an attempt to avoid failures at runtime. The common assumption of these techniques, though, is that such redundancy is known a priori. This means that the set of operations that are supposed to be equivalent in a given component should be available in the specifications. Unfortunately, inferring this information manually can be expensive and error prone. This paper proposes a search-based technique to synthesize sequences of method invocations that are equivalent to a target method within a finite set of execution scenarios. The experimental results obtained on 47 methods from 7 classes show that the proposed approach correctly identifies equivalent method sequences in the majority of the cases where redundancy was known to exist, with very few false positives.
Beyond the Rainbow: Self-Adaptive Failure Avoidance in Configurable Systems
Jacob Swanson, Myra B. Cohen, Matthew B. Dwyer, Brady J. Garvin, and Justin Firestone
(University of Nebraska-Lincoln, USA)
Self-adaptive software systems monitor their state and then adapt when certain conditions are met, guided by a global utility function. In prior work we developed algorithms and conducted a post-hoc analysis demonstrating the possibility of adapting to software failures by judiciously changing configurations. In this paper we present the REFRACT framework that realizes this idea in practice by building on the self-adaptive Rainbow architecture. REFRACT extends Rainbow with new components and algorithms targeting failure avoidance. We use REFRACT in a case study running four independently executing Firefox clients with 36 passing test cases and 7 seeded faults. The study shows that workarounds for all but one of the seeded faults are found, and the one that is not found never fails; it is guarded from failing by a related workaround. Moreover, REFRACT finds workarounds for eight configuration-related unseeded failures from tests that were expected to pass (and did under the default configuration). Finally, the data show that when a failure and its workaround are found, configuration guards prevent the failure from appearing again. In a simulation lasting 24 hours we see over 150 guard activations and no failures with workarounds remaining beyond 16 hours.
Semantics-Based Obfuscation-Resilient Binary Code Similarity Comparison with Applications to Software Plagiarism Detection
Lannan Luo, Jiang Ming, Dinghao Wu, Peng Liu, and Sencun Zhu
(Pennsylvania State University, USA)
Existing code similarity comparison methods, whether source or binary code based, are mostly not resilient to obfuscations. In the case of software plagiarism, emerging obfuscation techniques have made automated detection increasingly difficult. In this paper, we propose a binary-oriented, obfuscation-resilient method based on a new concept, longest common subsequence of semantically equivalent basic blocks, which combines rigorous program semantics with longest common subsequence based fuzzy matching. We model the semantics of a basic block by a set of symbolic formulas representing the input-output relations of the block. This way, the semantics equivalence (and similarity) of two blocks can be checked by a theorem prover. We then model the semantics similarity of two paths using the longest common subsequence with basic blocks as elements. This novel combination has resulted in strong resiliency to code obfuscation. We have developed a prototype and our experimental results show that our method is effective and practical when applied to real-world software.
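The path-comparison step can be illustrated as a longest common subsequence computation with a semantic notion of block equality; in the sketch below equivalence is approximated by sampling block outputs, whereas the paper checks symbolic input-output formulas of basic blocks with a theorem prover.

```python
def equivalent(block_a, block_b, samples=range(-3, 4)):
    """Approximate semantic equivalence by comparing outputs on sample inputs."""
    return all(block_a(x) == block_b(x) for x in samples)

def lcs_length(path_a, path_b, eq):
    """Standard LCS dynamic program, with eq() as the element-matching relation."""
    m, n = len(path_a), len(path_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if eq(path_a[i], path_b[j])
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

# two "paths", each a sequence of basic blocks modelled as functions of one input
original    = [lambda x: x + x, lambda x: x * x, lambda x: x - 1]
plagiarized = [lambda x: 2 * x, lambda x: x ** 2]   # obfuscated but equivalent blocks

score = lcs_length(original, plagiarized, equivalent) / max(len(original), len(plagiarized))
print(f"similarity = {score:.2f}")   # 2 of 3 blocks match -> 0.67
```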
Focus-Shifting Patterns of OSS Developers and Their Congruence with Call Graphs
Qi Xuan, Aaron Okano, Premkumar Devanbu, and Vladimir Filkov
(University of California at Davis, USA; Zhejiang University of Technology, China)
Developers in complex, self-organized open-source projects often work on many different files, and over time switch focus between them. Shifting focus can have an impact on software quality and productivity, and is thus an important topic of investigation. In this paper, we study the focus-shifting patterns (FSPs) of developers by comparing longitudinal commit activity traces from a dozen open source software (OSS) projects with the file dependencies in the projects' call graphs. Using information-theoretic measures of network structure, we find that fairly complex focus-shifting patterns emerge, and FSPs in the same project are more similar to each other. We show that developers tend to shift focus along with, rather than away from, software dependency links described by the call graphs. This tendency becomes weaker as either the interval between successive commits, or the organizational distance between committed files (i.e., directory distance), gets larger. Interestingly, this tendency appears stronger with more productive developers. We hope our study will initiate interest in further understanding of FSPs, which can ultimately help to (1) improve current recommender systems to predict the next focus of developers, and (2) provide insight into better call graph design, so as to facilitate developers' work.
How We Get There: A Context-Guided Search Strategy in Concolic Testing
Hyunmin Seo and Sunghun Kim
(Hong Kong University of Science and Technology, China)
One of the biggest challenges in concolic testing, an automatic test generation technique, is its huge search space. Concolic testing generates next inputs by selecting branches from previous execution paths. However, a large number of candidate branches makes a simple exhaustive search infeasible, which often leads to poor test coverage. Several search strategies have been proposed to explore high-priority branches only. Each strategy applies different criteria to the branch selection process but most do not consider context, how we got to the branch, in the selection process.
In this paper, we introduce a context-guided search (CGS) strategy. CGS looks at preceding branches in execution paths and selects a branch in a new context for the next input. We evaluate CGS with two publicly available concolic testing tools, CREST and CarFast, on six C subjects and six Java subjects. The experimental results show that CGS achieves the highest coverage of all twelve subjects and reaches a target coverage with a much smaller number of iterations on most subjects than other strategies.
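The core selection idea can be sketched roughly as follows; this is an illustrative approximation assuming branches are identified by integer ids and a fixed context length k, not the strategy as implemented in CREST or CarFast:

    import java.util.*;

    /** Illustrative sketch of context-guided branch selection:
     *  prefer branches whose k-length context (preceding branches) is new. */
    public class ContextGuidedSelection {
        private final int k;
        private final Set<List<Integer>> seenContexts = new HashSet<>();

        public ContextGuidedSelection(int k) { this.k = k; }

        /** Given one execution path (a sequence of branch ids), return the index
         *  of a branch to negate next, or -1 if every context was seen before. */
        public int selectBranch(List<Integer> path) {
            int selected = -1;
            for (int i = 0; i < path.size(); i++) {
                int from = Math.max(0, i - k + 1);
                List<Integer> context = new ArrayList<>(path.subList(from, i + 1));
                boolean isNew = seenContexts.add(context);
                if (isNew && selected == -1) {
                    selected = i;           // first branch reached in a new context
                }
            }
            return selected;
        }

        public static void main(String[] args) {
            ContextGuidedSelection cgs = new ContextGuidedSelection(3);
            System.out.println(cgs.selectBranch(List.of(1, 2, 3, 4)));  // 0: all contexts are new
            System.out.println(cgs.selectBranch(List.of(1, 2, 3, 5)));  // 3: only the last context is new
        }
    }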
Solving Complex Path Conditions through Heuristic Search on Induced Polytopes
Peter Dinges and Gul Agha
(University of Illinois at Urbana-Champaign, USA)
Test input generators using symbolic and concolic execution must solve path conditions to systematically explore a program and generate high coverage tests. However, path conditions may contain complicated arithmetic constraints that are infeasible to solve: a solver may be unavailable, solving may be computationally intractable, or the constraints may be undecidable. Existing test generators either simplify such constraints with concrete values to make them decidable, or rely on strong but incomplete constraint solvers. Unfortunately, simplification yields coarse approximations whose solutions rarely satisfy the original constraint. Moreover, constraint solvers cannot handle calls to native library methods. We show how a simple combination of linear constraint solving and heuristic search can overcome these limitations. We call this technique Concolic Walk. On a corpus of 11 programs, an instance of our Concolic Walk algorithm using tabu search generates tests with two- to three-times higher coverage than simplification-based tools while being up to five-times as efficient. Furthermore, our algorithm improves the coverage of two state-of-the-art test generators by 21% and 32%. Other concolic and symbolic testing tools could integrate our algorithm to solve complex path conditions without having to sacrifice any of their own capabilities, leading to higher overall coverage.
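A rough sketch of the walk, under simplifying assumptions (integer variables, a hand-written violation function, and a plain tabu list), is shown below; the real algorithm is more refined:

    import java.util.*;
    import java.util.function.Predicate;
    import java.util.function.ToDoubleFunction;

    /** Illustrative sketch of the Concolic Walk idea: starting from a point that
     *  satisfies the linear conjuncts of a path condition (the induced polytope),
     *  use a tabu-style local search over integer neighbours inside the polytope
     *  to drive the violation of the remaining non-linear conjuncts to zero.
     *  The fitness function and step rule are simplified stand-ins. */
    public class ConcolicWalkSketch {

        /** Returns a point with zero violation, or null after maxSteps. */
        static int[] walk(int[] start, Predicate<int[]> insidePolytope,
                          ToDoubleFunction<int[]> violation, int maxSteps) {
            Set<List<Integer>> tabu = new HashSet<>();
            int[] current = start.clone();
            for (int step = 0; step < maxSteps; step++) {
                if (violation.applyAsDouble(current) == 0.0) return current;
                tabu.add(toList(current));
                int[] best = null;
                double bestScore = Double.MAX_VALUE;
                for (int i = 0; i < current.length; i++) {
                    for (int delta : new int[] { -1, 1 }) {
                        int[] neighbour = current.clone();
                        neighbour[i] += delta;
                        if (!insidePolytope.test(neighbour)) continue;   // stay on the polytope
                        if (tabu.contains(toList(neighbour))) continue;  // tabu: never revisit
                        double score = violation.applyAsDouble(neighbour);
                        if (score < bestScore) { bestScore = score; best = neighbour; }
                    }
                }
                if (best == null) return null;   // stuck: every neighbour tabu or outside
                current = best;
            }
            return null;
        }

        private static List<Integer> toList(int[] p) {
            List<Integer> l = new ArrayList<>();
            for (int v : p) l.add(v);
            return l;
        }

        public static void main(String[] args) {
            // Path condition: linear part x <= 0 && y <= 0 (the polytope),
            // non-linear part x * y == 24 (handled by the walk).
            Predicate<int[]> polytope = p -> p[0] <= 0 && p[1] <= 0;
            ToDoubleFunction<int[]> violation = p -> Math.abs(p[0] * p[1] - 24);
            System.out.println(Arrays.toString(walk(new int[] { 0, 0 }, polytope, violation, 1000)));
        }
    }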
Statistical Symbolic Execution with Informed Sampling
, Corina S. Păsăreanu, Willem Visser, and Jaco Geldenhuys
(University of Stuttgart, Germany; Carnegie Mellon University, USA; NASA Ames Research Center, USA; Stellenbosch University, South Africa)
Symbolic execution techniques have been proposed recently for the probabilistic analysis of programs. These techniques seek to quantify the likelihood of reaching program events of interest, e.g., assert violations. They have many promising applications but have scalability issues due to high computational demand. To address this challenge, we propose a statistical symbolic execution technique that performs Monte Carlo sampling of the symbolic program paths and uses the obtained information for Bayesian estimation and hypothesis testing with respect to the probability of reaching the target events. To speed up the convergence of the statistical analysis, we propose Informed Sampling, an iterative symbolic execution that first explores the paths that have high statistical significance, prunes them from the state space, and guides the execution towards less likely paths. The technique combines Bayesian estimation with a partial exact analysis for the pruned paths, leading to provably improved convergence of the statistical analysis. We have implemented statistical symbolic execution with informed sampling in the Symbolic PathFinder tool. We show experimentally that informed sampling obtains more precise results and converges faster than a purely statistical analysis, and may also be more efficient than an exact symbolic analysis. When the latter does not terminate, symbolic execution with informed sampling can give meaningful results under the same time and memory limits.
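The combination of exact and estimated probability mass can be illustrated with a small sketch; the numbers and the Beta(1,1) prior below are assumptions for illustration, not values or the estimator from the paper:

    /** Illustrative sketch of combining Bayesian estimation with a partial exact
     *  analysis: the probability mass of already-pruned (exactly analysed) paths
     *  is accounted for exactly, and the remainder is estimated from Monte Carlo
     *  samples using a Beta posterior. All numbers are made up. */
    public class InformedSamplingSketch {

        /** Posterior mean of a Bernoulli parameter under a Beta(1,1) prior. */
        static double posteriorMean(int hits, int samples) {
            return (hits + 1.0) / (samples + 2.0);
        }

        public static void main(String[] args) {
            double prunedMass       = 0.60;  // probability mass of paths explored exactly
            double prunedTargetMass = 0.05;  // mass of pruned paths that reach the target event

            int samples = 10_000;            // Monte Carlo samples over the remaining paths
            int hits    = 180;               // samples that reached the target event

            double estimate = prunedTargetMass
                            + (1.0 - prunedMass) * posteriorMean(hits, samples);
            System.out.printf("estimated probability of target event: %.4f%n", estimate);
        }
    }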
Guodong Li, Esben Andreasen, and Indradeep Ghosh
(Fujitsu Labs, USA; Aarhus University, Denmark)
Selection and Presentation Practices for Code Example Summarization
Annie T. T. Ying and Martin P. Robillard
(McGill University, Canada)
Code examples are an important source for answering questions about software libraries and applications. Many usage contexts for code examples require them to be distilled to their essence: e.g., when serving as cues to longer documents, or for reminding developers of a previously known idiom. We conducted a study to discover how code can be summarized and why. As part of the study, we collected 156 pairs of code examples and their summaries from 16 participants, along with over 26 hours of think-aloud verbalizations detailing the decisions of the participants during their summarization activities. Based on a qualitative analysis of this data we elicited a list of practices followed by the participants to summarize code examples and propose empirically-supported hypotheses justifying the use of specific practices. One main finding was that none of the participants exclusively extracted code verbatim for the summaries, motivating abstractive summarization. The results provide a grounded basis for the development of code example summarization and presentation techniques.
Mining Idioms from Source Code
and Charles Sutton
(University of Edinburgh, UK)
We present the first method for automatically mining code idioms from a corpus of previously written, idiomatic software projects. We take the view that a code idiom is a syntactic fragment that recurs across projects and has a single semantic purpose. Idioms may have metavariables, such as the body of a for loop. Modern IDEs commonly provide facilities for manually defining idioms and inserting them on demand, but this does not help programmers to write idiomatic code in languages or using libraries with which they are unfamiliar. We present Haggis, a system for mining code idioms that builds on recent advanced techniques from statistical natural language processing, namely, nonparametric Bayesian probabilistic tree substitution grammars. We apply Haggis to several of the most popular open source projects from GitHub. We present a wide range of evidence that the resulting idioms are semantically meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site. Manual examination of the most common idioms indicates that they describe important program concepts, including object creation, exception handling, and resource management.
Automatic Generation of Release Notes
Laura Moreno, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, Andrian Marcus, and Gerardo Canfora
(University of Texas at Dallas, USA; University of Sannio, Italy; University of Molise, Italy)
This paper introduces ARENA (Automatic RElease Notes generAtor), an approach for the automatic generation of release notes. ARENA extracts changes from the source code, summarizes them, and integrates them with information from versioning systems and issue trackers. It was designed based on the manual analysis of 1,000 existing release notes. To evaluate the quality of the ARENA release notes, we performed three empirical studies involving a total of 53 participants (45 professional developers and 8 students). The results indicate that the ARENA release notes are very good approximations of those produced by developers and often include important information that is missing in the manually produced release notes.
Discovering Refactoring Opportunities in Cascading Style Sheets
, Nikolaos Tsantalis, and Ali Mesbah
(Concordia University, Canada; University of British Columbia, Canada)
Cascading Style Sheets (CSS) is a language used for describing the look and formatting of HTML documents. CSS has been widely adopted in web and mobile development practice, since it enables a clean separation of content from presentation. The language exhibits complex features, such as inheritance, cascading and specificity, which make CSS code hard to maintain. Therefore, it is important to find ways to improve the maintainability of CSS code. In this paper, we propose an automated approach to remove duplication in CSS code. More specifically, we have developed a technique that detects three types of CSS declaration duplication and recommends refactoring opportunities to eliminate those duplications. Our approach uses preconditions that ensure the application of a refactoring will preserve the original document styling. We evaluate our technique on 38 real-world web systems and 91 CSS files, in total. Our findings show that duplication in CSS code is widely prevalent. Additionally, there is a significant number of presentation-preserving refactoring opportunities that can reduce the size of the CSS files and increase the maintainability of the code.
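For illustration, a minimal sketch of detecting one duplication type, declarations repeated verbatim across selectors, is given below; a real refactoring tool must additionally check the preconditions (cascading order, specificity) mentioned above, and the rules here are made up:

    import java.util.*;

    /** Illustrative sketch: find declarations repeated verbatim across selectors,
     *  one of the simplest forms of CSS duplication. */
    public class CssDuplicationSketch {

        public static Map<String, Set<String>> duplicatedDeclarations(
                Map<String, List<String>> rules) {          // selector -> declarations
            Map<String, Set<String>> declToSelectors = new LinkedHashMap<>();
            rules.forEach((selector, decls) ->
                decls.forEach(d ->
                    declToSelectors.computeIfAbsent(d, k -> new LinkedHashSet<>()).add(selector)));
            declToSelectors.values().removeIf(selectors -> selectors.size() < 2);
            return declToSelectors;                          // declaration -> selectors sharing it
        }

        public static void main(String[] args) {
            Map<String, List<String>> rules = new LinkedHashMap<>();
            rules.put(".header", List.of("color: #333", "margin: 0", "font-weight: bold"));
            rules.put(".footer", List.of("color: #333", "margin: 0"));
            rules.put(".aside",  List.of("color: #666"));
            // Prints: {color: #333=[.header, .footer], margin: 0=[.header, .footer]}
            System.out.println(duplicatedDeclarations(rules));
        }
    }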
SAFEWAPI: Web API Misuse Detector for Web Applications
SungGyeong Bae, Hyunghun Cho, Inho Lim, and Sukyoung Ryu
(KAIST, South Korea; Samsung Electronics, South Korea)
Building Call Graphs for Embedded Client-Side Code in Dynamic Web Applications
Hung Viet Nguyen, Christian Kästner, and Tien N. Nguyen
(Iowa State University, USA; Carnegie Mellon University, USA)
When developing and maintaining a software system, programmers often rely on IDEs to provide editor services such as syntax highlighting, auto-completion, and "jump to declaration". In dynamic web applications, such tool support is currently limited to either the server-side code or to hand-written or generated client-side code. Our goal is to build a call graph for providing editor services on client-side code while it is still embedded as string literals within server-side code. First, we symbolically execute the server-side code to identify all possible client-side code variations. Subsequently, we parse the generated client-side code with all its variations into a VarDOM that compactly represents all DOM variations for further analysis. Based on the VarDOM, we build conditional call graphs for embedded HTML, CSS, and JS. Our empirical evaluation on real-world web applications shows that our analysis achieves 100% precision in identifying call-graph edges. 62% of the edges cross PHP strings, and 17% of them cross files - in both situations, navigation without tool support is tedious and error-prone.
Architecture and Design
Sketches and Diagrams in Practice
Sebastian Baltes and Stephan Diehl
(University of Trier, Germany)
Sketches and diagrams play an important role in the daily work of software developers. In this paper, we investigate the use of sketches and diagrams in software engineering practice. To this end, we used both quantitative and qualitative methods. We present the results of an exploratory study in three companies and an online survey with 394 participants. Our participants included software developers, software architects, project managers, consultants, as well as researchers. They worked in different countries and on projects from a wide range of application areas. Most questions in the survey were related to the last sketch or diagram that the participants had created. Contrary to our expectations and previous work, the majority of sketches and diagrams contained at least some UML elements. However, most of them were informal. The most common purposes for creating sketches and diagrams were designing, explaining, and understanding, but analyzing requirements was also named often. More than half of the sketches and diagrams were created on analog media like paper or whiteboards and have been revised after creation. Most of them were used for more than a week and were archived. We found that the majority of participants related their sketches to methods, classes, or packages, but not to source code artifacts with a lower level of abstraction.
Architecture Challenges for Internal Software Ecosystems: A Large-Scale Industry Case Study
Klaus-Benedikt Schultis, Christoph Elsner, and Daniel Lohmann
(Siemens, Germany; University of Erlangen-Nuremberg, Germany)
The idea of software ecosystems encourages organizations to open software projects for external businesses, governing the cross-organizational development by architectural and other measures. Even within a single organization, this paradigm can be of high value for large-scale decentralized software projects that involve various internal, yet self-contained organizational units. However, this intra-organizational decentralization causes architecture challenges that must be understood to reason about suitable architectural measures.
We present an in-depth case study on collaboration and architecture challenges in two of these large-scale software projects at Siemens. We performed a total of 46 hours of semi-structured interviews with 17 leading software architects from all involved organizational units. Our major findings are: (1) three collaboration models on a continuum that ranges from high to low coupling, (2) a classification of architecture challenges, together with (3) a qualitative and quantitative exposure of the identified recurring issues along each collaboration model. Our study results provide valuable insights for both industry and academia: Practitioners that find themselves in one of the collaboration models can use empirical evidence on challenges to make informed decisions about counteractive measures. Researchers can focus their attention on challenges faced by practitioners to make software engineering more effective.
Variable-Specific Resolutions for Feature Interactions
Cecylia Bocovich and Joanne M. Atlee
(University of Waterloo, Canada)
Systems assembled from independently developed features suffer from feature interactions, in which features affect one another's behaviour in surprising ways. The feature-interaction problem states that the number of potential interactions is exponential in the number of features in a system. Resolution strategies offer general strategies that resolve entire classes of interactions, thereby reducing the work of the developer who is charged with the task of resolving interactions. In this paper, we focus on resolving interactions due to conflict. We present an approach, language, and implementation based on resolution modules in which the developer can specify an appropriate resolution for each variable under conflict. We performed a case study involving 24 automotive features, and found that the number of resolutions to be specified was much smaller than the number of possible feature interactions (6 resolutions for 24 features), that what constitutes an appropriate resolution strategy is different for different variables, and that the subset of situation calculus we used was sufficient to construct nontrivial resolution strategies for six distinct output variables.
An Empirical Study on Program Comprehension with Reactive Programming
Guido Salvaneschi, Sven Amann, Sebastian Proksch, and Mira Mezini
(TU Darmstadt, Germany; Lancaster University, UK)
Starting from the first investigations with strictly functional languages, reactive programming has been proposed as THE programming paradigm for reactive applications. The advantages of designs based on this style over designs based on the Observer design pattern have been studied for a long time. Over the years, researchers have enriched reactive languages with more powerful abstractions, embedded these abstractions into mainstream languages – including object-oriented languages – and applied reactive programming to several domains, like GUIs, animations, Web applications, robotics, and sensor networks. However, an important assumption behind this line of research – that, besides other advantages, reactive programming makes a wide class of otherwise cumbersome applications more comprehensible – has never been evaluated. In this paper, we present the design and the results of the first empirical study that evaluates the effect of reactive programming on comprehensibility compared to the traditional object-oriented style with the Observer design pattern. Results confirm the conjecture that comprehensibility is enhanced by reactive programming. In the experiment, the reactive programming group significantly outperforms the other group.
Apposcopy: Semantics-Based Detection of Android Malware through Static Analysis
Yu Feng, Saswat Anand, Isil Dillig, and Alex Aiken
(University of Texas at Austin, USA; Stanford University, USA)
We present Apposcopy, a new semantics-based approach for identifying a prevalent class of Android malware that steals private user information. Apposcopy incorporates (i) a high-level language for specifying signatures that describe semantic characteristics of malware families and (ii) a static analysis for deciding if a given application matches a malware signature. The signature matching algorithm of Apposcopy uses a combination of static taint analysis and a new form of program representation called Inter-Component Call Graph to efficiently detect Android applications that have certain control- and data-flow properties. We have evaluated Apposcopy on a corpus of real-world Android applications and show that it can effectively and reliably pinpoint malicious applications that belong to certain malware families.
Detecting Energy Bugs and Hotspots in Mobile Apps
Abhijeet Banerjee, Lee Kee Chong, Sudipta Chattopadhyay, and Abhik Roychoudhury
(National University of Singapore, Singapore; Linköping University, Sweden)
Over recent years, the popularity of smartphones has increased dramatically. This has led to a widespread availability of smartphone applications. Since smartphones operate on a limited amount of battery power, it is important to develop tools and techniques that aid in energy-efficient application development. Energy inefficiencies in smartphone applications can broadly be categorized into energy hotspots and energy bugs. An energy hotspot can be described as a scenario where executing an application causes the smartphone to consume an abnormally high amount of battery power, even though the utilization of its hardware resources is low. In contrast, an energy bug can be described as a scenario where a malfunctioning application prevents the smartphone from becoming idle, even after it has completed execution and there is no user activity. In this paper, we present an automated test generation framework that detects energy hotspots/bugs in Android applications. Our framework systematically generates test inputs that are likely to capture energy hotspots/bugs. Each test input captures a sequence of user interactions (e.g., touches or taps on the smartphone screen) that leads to an energy hotspot/bug in the application. Evaluation with 30 freely available Android applications from Google Play/F-Droid shows the efficacy of our framework in finding hotspots/bugs. Manual validation of the experimental results shows that our framework reports a reasonably low number of false positives. Finally, we show the usage of the generated results by improving the energy efficiency of some Android applications.
EvoDroid: Segmented Evolutionary Testing of Android Apps
Riyadh Mahmood, Nariman Mirzaei, and Sam Malek
(George Mason University, USA)
Proliferation of Android devices and apps has created a demand for applicable automated software testing techniques. Prior research has primarily focused on either unit or GUI testing of Android apps, but not their end-to-end system testing in a systematic manner. We present EvoDroid, an evolutionary approach for system testing of Android apps. EvoDroid overcomes a key shortcoming of using evolutionary techniques for system testing, i.e., the inability to pass on genetic makeup of good individuals in the search. To that end, EvoDroid combines two novel techniques: (1) an Android-specific program analysis technique that identifies the segments of the code amenable to be searched independently, and (2) an evolutionary algorithm that given information of such segments performs a step-wise search for test cases reaching deep into the code. Our experiments have corroborated EvoDroid’s ability to achieve significantly higher code coverage than existing Android testing tools.
Prioritizing the Devices to Test Your App on: A Case Study of Android Game Apps
Hammad Khalid, Meiyappan Nagappan, Emad Shihab, and Ahmed E. Hassan
(Queen's University, Canada; Rochester Institute of Technology, USA; Concordia University, Canada)
Star ratings that are given by the users of mobile apps directly impact the revenue of their developers. At the same time, for popular platforms like Android, these apps must run on hundreds of devices, increasing the chance for device-specific problems. Device-specific problems could impact the rating assigned to an app, given the varying capabilities of devices (e.g., hardware and software). To fix device-specific problems, developers must test their apps on a large number of Android devices, which is costly and inefficient. Therefore, to help developers pick which devices to test their apps on, we propose using the devices that are mentioned in user reviews. We mine the user reviews of 99 free game apps and find that apps receive user reviews from a large number of devices: between 38 and 132 unique devices. However, most of the reviews (80%) originate from a small subset of devices (on average, 33%). Furthermore, we find that developers of new game apps with no reviews can use the review data of similar game apps to select the devices that they should focus on first. Finally, among the set of devices that generate the most reviews for an app, we find that some devices tend to generate worse ratings than others. Our findings indicate that by focusing on the devices with the most reviews (in particular the ones with negative ratings), developers can effectively prioritize their limited Quality Assurance (QA) efforts, since these devices have the greatest impact on ratings.
Testing and Oracles
Improving Oracle Quality by Detecting Brittle Assertions and Unused Inputs in Tests
Chen Huo and James Clause
(University of Delaware, USA)
Writing oracles is challenging. As a result, developers often create oracles that check too little, resulting in tests that are unable to detect failures, or check too much, resulting in tests that are brittle and difficult to maintain. In this paper we present a new technique for automatically analyzing test oracles. The technique is based on dynamic tainting and detects both brittle assertions—assertions that depend on values that are derived from uncontrolled inputs—and unused inputs—inputs provided by the test that are not checked by an assertion. We also present OraclePolish, an implementation of the technique that can analyze tests that are written in Java and use the JUnit testing framework. Using OraclePolish, we conducted an empirical evaluation of more than 4000 real test cases. The results of the evaluation show that OraclePolish is effective; it detected 164 tests that contain brittle assertions and 1618 tests that have unused inputs. In addition, the results also demonstrate that the costs associated with using the technique are reasonable.
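A hypothetical JUnit test illustrating both problems (the Invoice class and its values are made up for illustration; the actual classification in OraclePolish is based on dynamic tainting, not on reading the source):

    import static org.junit.Assert.assertEquals;

    import java.time.LocalDate;
    import org.junit.Test;

    /** Hypothetical test showing an unused input and a brittle assertion. */
    public class InvoiceTest {

        @Test
        public void formatsInvoiceHeader() {
            Invoice invoice = new Invoice("ACME", 42);
            invoice.setDiscount(0.10);        // unused input: set by the test, never checked below

            // Brittle assertion: the checked value is derived from the current date,
            // an input the test does not control.
            String expected = "ACME #42 (" + LocalDate.now().getYear() + ")";
            assertEquals(expected, invoice.header());
        }

        /** Minimal stand-in so the example compiles. */
        static class Invoice {
            private final String customer;
            private final int number;
            private double discount;
            Invoice(String customer, int number) { this.customer = customer; this.number = number; }
            void setDiscount(double d) { this.discount = d; }
            String header() { return customer + " #" + number + " (" + LocalDate.now().getYear() + ")"; }
        }
    }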
On the Efficiency of Automated Testing
Marcel Böhme and Soumya Paul
(Saarland University, Germany; National University of Singapore, Singapore)
The aim of automated program testing is to gain confidence about a program's correctness by sampling its input space. The sampling process can be either systematic or random. For every systematic testing technique the sampling is informed by the analysis of some program artefacts, like the specification, the source code (e.g., to achieve coverage), or even faulty versions of the program (e.g., mutation testing). This analysis incurs some cost. In contrast, random testing (R) is unsystematic and does not incur any analysis cost. In this paper, we investigate the theoretical efficiency of systematic versus random testing. First, we mathematically model the most effective systematic testing technique S_0 in which every sampled test input strictly increases the "degree of confidence" and is subject to the analysis cost c. Note that the efficiency of S_0 depends on c. Specifically, if we increase c, we also increase the time it takes S_0 to establish the same degree of confidence. So, there exists a maximum analysis cost beyond which R is generally more efficient than S_0. Given that we require the confidence that the program works correctly for x% of its input, we prove an upper bound on c of S_0, beyond which R is more efficient on average. We also show that this bound depends asymptotically only on x. For instance, let R take 10ms to sample one test input; to establish that the program works correctly for 90% of its input, S_0 must take less than 41ms to sample one test input. Otherwise, R is expected to establish the 90% degree of confidence earlier. We prove similar bounds on the cost if the software tester is interested in revealing as many errors as possible in a given time span.
An Empirical Analysis of Flaky Tests
Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov
(University of Illinois at Urbana-Champaign, USA)
Regression testing is a crucial part of software development. It checks that software changes do not break existing functionality. An important assumption of regression testing is that test outcomes are deterministic: an unmodified test is expected to either always pass or always fail for the same code under test. Unfortunately, in practice, some tests, often called flaky tests, have non-deterministic outcomes. Such tests undermine regression testing, as they make it difficult to rely on test results. We present the first extensive study of flaky tests. We study in detail a total of 201 commits that likely fix flaky tests in 51 open-source projects. We classify the most common root causes of flaky tests, identify approaches that could manifest flaky behavior, and describe common strategies that developers use to fix flaky tests. We believe that our insights and implications can help guide future research on the important topic of (avoiding) flaky tests.
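A hypothetical example of a flaky test, here caused by an asynchronous wait whose outcome depends on thread scheduling (one commonly reported source of flakiness); the class and timings are invented for illustration:

    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    /** Hypothetical flaky test: the outcome depends on thread timing rather
     *  than on the code under test, so it passes or fails non-deterministically. */
    public class DownloadTest {

        @Test
        public void downloadCompletes() throws InterruptedException {
            StringBuilder result = new StringBuilder();
            Thread worker = new Thread(() -> {
                try { Thread.sleep(50); } catch (InterruptedException ignored) { }
                result.append("done");                 // simulated asynchronous work
            });
            worker.start();

            Thread.sleep(40);                          // "async wait": too short on a slow machine
            assertTrue(result.toString().contains("done"));  // flaky: depends on scheduling
            worker.join();
        }
    }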
Are Mutants a Valid Substitute for Real Faults in Software Testing?
René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser
(University of Washington, USA; University of Waterloo, Canada; University of Sheffield, UK)
A good test suite is one that detects real faults. Because the set of faults in a program is usually unknowable, this definition is not useful to practitioners who are creating test suites, nor to researchers who are creating and evaluating tools that generate test suites. In place of real faults, testing research often uses mutants, which are artificial faults -- each one a simple syntactic variation -- that are systematically seeded throughout the program under test. Mutation analysis is appealing because large numbers of mutants can be automatically generated and used to compensate for low quantities or the absence of known real faults.
Unfortunately, there is little experimental evidence to support the use of mutants as a replacement for real faults. This paper investigates whether mutants are indeed a valid substitute for real faults, i.e., whether a test suite’s ability to detect mutants is correlated with its ability to detect real faults that developers have fixed. Unlike prior studies, these investigations also explicitly consider the conflating effects of code coverage on the mutant detection rate.
Our experiments used 357 real faults in 5 open-source applications that comprise a total of 321,000 lines of code. Furthermore, our experiments used both developer-written and automatically-generated test suites. The results show a statistically significant correlation between mutant detection and real fault detection, independently of code coverage. The results also give concrete suggestions on how to improve mutation analysis and reveal some inherent limitations.
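For illustration, a typical mutant is a single syntactic change such as a relational-operator replacement; the class below is a made-up example, not taken from the study's subjects:

    /** Illustrative example of a mutant: one syntactic change ("<" becomes "<=").
     *  A test suite "kills" the mutant if some test passes on the original
     *  version but fails on the mutated one. */
    public class Account {

        private int balance = 100;

        // Original version.
        boolean canWithdraw(int amount) {
            return amount > 0 && amount < balance;
        }

        // Mutant: relational operator replaced; boundary behaviour changes.
        boolean canWithdrawMutant(int amount) {
            return amount > 0 && amount <= balance;
        }

        public static void main(String[] args) {
            Account a = new Account();
            // A test asserting canWithdraw(100) == false kills this mutant,
            // because the mutant returns true for the boundary value.
            System.out.println(a.canWithdraw(100) + " vs " + a.canWithdrawMutant(100));
        }
    }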
Evolution and Maintenance
No Issue Left Behind: Reducing Information Overload in Issue Tracking
Olga Baysal, Reid Holmes, and Michael W. Godfrey
(Université de Montréal, Canada; University of Waterloo, Canada)
Modern software development tools such as issue trackers are often complex and multi-purpose tools that provide access to an immense amount of raw information. Unfortunately, developers sometimes feel frustrated when they cannot easily obtain the particular information they need for a given task; furthermore, the constant influx of new data — the vast majority of which is irrelevant to their task at hand — may result in issues being "dropped on the floor".
In this paper, we present a developer-centric approach to issue tracking that aims to reduce information overload and improve developers' situational awareness. Our approach is motivated by a grounded theory study of developer comments, which suggests that customized views of a project's repositories that are tailored to developer-specific tasks can help developers better track their progress and understand the surrounding technical context. From the qualitative study, we uncovered a model of the kinds of information elements that are essential for developers in completing their daily tasks, and from this model we built a tool organized around customized issue-tracking dashboards. Further quantitative and qualitative evaluation demonstrated that this dashboard-like approach to issue tracking can reduce the volume of irrelevant emails by over 99% and also improve support for specific issue-tracking tasks.
Panning Requirement Nuggets in Stream of Software Maintenance Tickets
Senthil Mani, Karthik Sankaranarayanan, Vibha Singhal Sinha, and Premkumar Devanbu
(IBM Research, India; University of California at Davis, USA)
There is an increasing trend to outsource the maintenance of large applications and application portfolios of a business to third parties specialising in application maintenance, who are incented to deliver the best possible maintenance at the lowest cost. To do so, they need to identify repeat problem areas, which cause more maintenance grief, and seek a unified remedy to avoid the costs spent on fixing these individually. These repeat areas, in a sense, represent major, evolving areas of need, or requirements, for the customer. The information about a repeating problem is typically embedded in the unstructured text of multiple tickets, waiting to be found and addressed. Currently, repeat problems are found by manual analysis; effective solutions depend on the collective experience of the team solving them. In this paper, we propose an approach to automatically analyze problem tickets to discover groups of problems being reported in them and provide meaningful, descriptive labels to help interpret these groups. Our approach incorporates a cleansing phase to handle the high level of noise observed in problem tickets and a method to incorporate multiple text clustering techniques and merge their results in a meaningful manner. We provide detailed experiments to quantitatively and qualitatively evaluate our approach.
Learning to Rank Relevant Files for Bug Reports using Domain Knowledge
Xin Ye, Razvan Bunescu, and Chang Liu
(Ohio University, USA)
When a new bug report is received, developers usually need to reproduce the bug and perform code reviews to find the cause, a process that can be tedious and time consuming. A tool for ranking all the source files of a project with respect to how likely they are to contain the cause of the bug would enable developers to narrow down their search and potentially could lead to a substantial increase in productivity. This paper introduces an adaptive ranking approach that leverages domain knowledge through functional decompositions of source code files into methods, API descriptions of library components used in the code, the bug-fixing history, and the code change history. Given a bug report, the ranking score of each source file is computed as a weighted combination of an array of features encoding domain knowledge, where the weights are trained automatically on previously solved bug reports using a learning-to-rank technique. We evaluated our system on six large scale open source Java projects, using the before-fix version of the project for every bug report. The experimental results show that the newly introduced learning-to-rank approach significantly outperforms two recent state-of-the-art methods in recommending relevant files for bug reports. In particular, our method makes correct recommendations within the top 10 ranked source files for over 70% of the bug reports in the Eclipse Platform and Tomcat projects.
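The ranking step can be sketched as a weighted sum of per-file feature scores; the feature values and weights below are invented for illustration, whereas in the paper the weights are learned from previously fixed bug reports with a learning-to-rank technique:

    import java.util.*;

    /** Illustrative sketch of the ranking step: each candidate source file gets
     *  a score that is a weighted sum of features (e.g., textual similarity to
     *  the bug report, recency of changes, past bug-fix count). */
    public class BugFileRanker {

        static double score(double[] weights, double[] features) {
            double s = 0;
            for (int i = 0; i < weights.length; i++) s += weights[i] * features[i];
            return s;
        }

        public static void main(String[] args) {
            double[] w = { 0.6, 0.2, 0.2 };                 // assumed, not learned here
            Map<String, double[]> candidates = Map.of(
                "ui/Editor.java",   new double[] { 0.80, 0.10, 0.30 },
                "core/Parser.java", new double[] { 0.40, 0.90, 0.70 },
                "util/Log.java",    new double[] { 0.10, 0.05, 0.20 });

            candidates.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                    (Map.Entry<String, double[]> e) -> -score(w, e.getValue())))
                .forEach(e -> System.out.printf("%.3f  %s%n", score(w, e.getValue()), e.getKey()));
        }
    }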
Querying Sequential Software Engineering Data
, Haidong Zhang, Jian-Guang Lou, Hongyu Zhang, Qiang Wang, Dongmei Zhang, and Siau-Cheng Khoo
(University of California at Davis, USA; Microsoft Research, China; National University of Singapore, Singapore)
We propose a pattern-based approach to effectively and efficiently analyzing sequential software engineering (SE) data. Different from other types of SE data, sequential SE data preserves unique temporal properties, which cannot be easily analyzed without much programming effort. In order to facilitate the analysis of sequential SE data, we design a sequential pattern query language (SPQL), which specifies the temporal properties based on regular expressions, and is enhanced with variables and statements to store and manipulate matching states. We also propose a query engine to effectively process the SPQL queries. We have applied our approach to analyze two types of SE data, namely bug report history and source code change history. We experiment with 181,213 Eclipse bug reports and 323,989 code revisions of Android. SPQL enables us to explore interesting temporal properties underneath these sequential data with a few lines of query code and low matching overhead. The analysis results can help better understand a software process and identify process violations.
Tsmart-GalsBlock: A Toolkit for Modeling, Validation, and Synthesis of Multi-clocked Embedded Systems
, Hehua Zhang, Huafeng Zhang, Xinyan Zhao, Han Liu, Chengnian Sun, Xiaoyu Song, Ming Gu, and Jiaguang Sun
(Tsinghua University, China; University of California at Davis, USA)
The key challenges of the model-driven approach to designing multi-clocked embedded systems are three-fold: (1) how to model local synchronous components and asynchronous communication between components in a single framework, (2) how to ensure the correctness of the model, and (3) how to maintain the consistency between the model and the implementation of the system. In this paper, we present Tsmart, a self-contained toolkit to address these three challenges. Tsmart seamlessly integrates (1) a graphical editor to facilitate the modeling of the complex behaviors and structures in an embedded system, (2) a simulator for interactive graphical simulation to understand and debug the system model, (3) a verification engine to verify the correctness of the system design, and (4) a synthesis engine to automatically generate efficient executable VHDL code from the model. The toolkit has been successfully applied to designing the main control system of a train communication controller, and the system has already been deployed and in operation. The evaluation of Tsmart on this real industrial application demonstrates the effectiveness and the potential of the toolkit.
A Tool Suite for the Model-Driven Software Engineering of Cyber-Physical Systems
Stefan Dziwok, Christopher Gerking, Steffen Becker, Sebastian Thiele, Christian Heinzemann, and Uwe Pohlmann
(University of Paderborn, Germany; Fraunhofer IPT, Germany)
Cyber-physical systems, e.g., autonomous cars or trains, interact with their physical environment. As a consequence, they commonly have to coordinate with other systems via complex message communication while realizing safety-critical and real-time tasks. As a result, those systems should be correct by construction. Software architects can achieve this by using the MechatronicUML process and language. This paper presents the MechatronicUML Tool Suite that offers unique features to support the MechatronicUML modeling and analyses tasks.
XMLMate: Evolutionary XML Test Generation
Nikolas Havrikov, Matthias Höschele, Juan Pablo Galeotti, and Andreas Zeller
(Saarland University, Germany)
Generating system inputs satisfying complex constraints is still a challenge for modern test generators. We present XMLMATE, a search-based test generator specially aimed at XML-based systems. XMLMATE leverages program structure, existing XML schemas, and XML inputs to generate, mutate, recombine, and evolve valid XML inputs. Over a set of seven XML-based systems, XMLMATE detected 31 new unique failures in production code, all triggered by system inputs and thus true alarms.
CHOReOSynt: Enforcing Choreography Realizability in the Future Internet
, Davide Di Ruscio, Amleto Di Salle, and Alexander Perucci
(University of L'Aquila, Italy)
Choreographies are an emergent Service Engineering (SE) approach to compose together and coordinate services in a distributed way. A choreography formalizes the way business participants coordinate their interactions. The focus is not on orchestrations of the work performed within them, but rather on the exchange of messages between these participants. The problems usually addressed when considering a choreography-based specification of the system to be realized are realizability check, and conformance check. In this paper we describe the CHOReOSynt tool, which has been conceived to deal with an additional problem, namely, automated choreography enforcement. That is, when the goal is to actually realize a service choreography by reusing third-party services, their uncontrolled (or wrongly coordinated) composite behavior may show undesired interactions that preclude the choreography realization. CHOReOSynt solves this problem by automatically synthesizing additional software entities that, when interposed among the services, allow for preventing undesired interactions. Screencast: http://choreos.disim.univaq.it/downloads/
RaPiD: A Toolkit for Reliability Analysis of Non-deterministic Systems
Lin Gui, Jun Sun, Yang Liu, Truong Khanh Nguyen, and Jin Song Dong
(National University of Singapore, Singapore; Singapore University of Technology and Design, Singapore; Nanyang Technological University, Singapore)
Non-determinism in concurrent or distributed software systems (i.e., various possible execution orders among different distributed components) presents new challenges to the existing reliability analysis methods based on Markov chains. In this work, we present RaPiD, a toolkit for the reliability analysis of non-deterministic systems. Taking Markov decision processes as the reliability model, RaPiD can help in the analysis of three fundamental and rewarding aspects regarding software reliability. First, to have reliability assurance on a system, RaPiD can synthesize the overall system reliability given the reliability values of system components. Second, given a requirement on the overall system reliability, RaPiD can distribute the reliability requirement to each component. Lastly, RaPiD can identify the component that affects the system reliability most significantly. RaPiD has been applied to analyze several real-world systems including a financial stock trading system, a proton therapy control system, and an ambient assisted living room system.
Aalta: An LTL Satisfiability Checker over Infinite/Finite Traces
Jianwen Li, Yinbo Yao, Geguang Pu, Lijun Zhang, and Jifeng He
(East China Normal University, China; Institute of Software at Chinese Academy of Sciences, China)
Linear Temporal Logic (LTL) is widely used nowadays in verification and AI. Checking the satisfiability of LTL formulas is a fundamental step in removing possible errors in LTL assertions. We present in this paper Aalta, a new LTL satisfiability checker, which supports satisfiability checking for LTL over both infinite and finite traces. Aalta leverages the power of modern SAT solvers. We have conducted a comprehensive comparison between Aalta and other LTL satisfiability checkers, and the experimental results show that Aalta is very competitive. The tool is available at www.lab205.org/aalta.
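As a small illustration of the decision problem such a checker solves (standard textbook formulas, not examples from the paper): the first formula is satisfiable, e.g., by any infinite trace in which g holds in every state, while the second is unsatisfiable because p cannot both hold globally and eventually fail to hold.

    % Satisfiable over infinite traces: "every request is eventually granted".
    \varphi_1 \;=\; \mathbf{G}\,(r \rightarrow \mathbf{F}\, g)

    % Unsatisfiable: p must hold globally and also eventually fail to hold.
    \varphi_2 \;=\; \mathbf{G}\, p \;\wedge\; \mathbf{F}\,\neg p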
Omen+: A Precise Dynamic Deadlock Detector for Multithreaded Java Libraries
and Murali Krishna Ramanathan
(Indian Institute of Science, India)
Designing thread-safe libraries without concurrency defects can be a challenging task. Detecting deadlocks while invoking methods in these libraries concurrently is hard due to the number of possible method invocation combinations, the object assignments to the parameters, and the associated thread interleavings. In this paper, we describe the design and implementation of OMEN+, which takes a multithreaded library as input and detects true deadlocks in a scalable manner. We achieve this by automatically synthesizing relevant multithreaded tests and analyzing the associated execution traces using a precise deadlock detector. We validate the usefulness of OMEN+ by applying it to many multithreaded Java libraries and detect a number of deadlocks even in documented thread-safe libraries. The tool is available for free download at http://www.csa.iisc.ernet.in/~sss/tool/omenplus.html.
Archie: A Tool for Detecting, Monitoring, and Preserving Architecturally Significant Code
Mehdi Mirakhorli, Ahmed Fakhry, Artem Grechko, Matteusz Wieloch, and Jane Cleland-Huang
(Rochester Institute of Technology, USA; DePaul University, USA)
The quality of a software architecture is largely dependent upon the underlying architectural decisions at the framework, tactic, and pattern levels. Decisions to adopt certain solutions determine the extent to which desired qualities such as security, availability, and performance are achieved in the delivered system. In this tool demo, we present our Eclipse plug-in named Archie as a solution for maintaining architectural qualities in the design and code despite long-term maintenance and evolution activities. Archie detects architectural tactics such as heartbeat, resource pooling, and role-based access control (RBAC) in the source code of a project; constructs traceability links between the tactics, design models, rationales and source code; and then uses these to monitor the environment for architecturally significant changes and to keep developers informed of underlying design decisions and their associated rationales.
Linking Sketches and Diagrams to Source Code Artifacts
Sebastian Baltes, Peter Schmitz, and Stephan Diehl
(University of Trier, Germany)
Recent studies have shown that sketches and diagrams play an important role in the daily work of software developers. If these visual artifacts are archived, they are often detached from the source code they document, because there is no adequate tool support to assist developers in capturing, archiving, and retrieving sketches related to certain source code artifacts. This paper presents SketchLink, a tool that aims at increasing the value of sketches and diagrams created during software development by supporting developers in these tasks. Our prototype implementation provides a web application that employs the camera of smartphones and tablets to capture analog sketches, but can also be used on desktop computers to upload, for instance, computer-generated diagrams. We also implemented a plugin for a Java IDE that embeds the links in Javadoc comments and visualizes them in situ in the source code editor as graphical icons.
BumbleBee: A Refactoring Environment for Spreadsheet Formulas
Felienne Hermans and Danny Dig
(Delft University of Technology, Netherlands; Oregon State University, USA)
Spreadsheets are widely used in industry. It is estimated that end-user programmers outnumber regular programmers by a factor of 5. However, spreadsheets are error-prone: several reports exist of companies that have lost big sums of money due to spreadsheet errors. In previous work, spreadsheet smells have proven to be the cause of some of these errors. To that end, we have developed a tool that can apply refactorings to spreadsheet formulas, implementing our previous work on spreadsheet refactoring, which showed that spreadsheet formula smells are very common, that refactorings for them are widely applicable, and that refactoring them with a tool is both quicker and less error-prone. Our new tool BumbleBee is able to execute refactorings originating from both these papers by means of an extensible syntax, and can furthermore apply refactorings to entire groups of formulas, thus improving upon the existing tool RefBook. Finally, BumbleBee can also execute transformations other than refactorings.
RefDistiller: A Refactoring Aware Code Review Tool for Inspecting Manual Refactoring Edits
Everton L. G. Alves, Myoungkyu Song, and Miryung Kim
(University of Texas at Austin, USA; Federal University of Campina Grande, Brazil; University of California at Los Angeles, USA)
Manual refactoring edits are error prone, as refactoring requires developers to coordinate related transformations and understand the complex inter-relationship between affected types, methods, and variables. We present RefDistiller, a refactoring-aware code review tool that can help developers detect potential behavioral changes in manual refactoring edits. It first detects the types and locations of refactoring edits by comparing two program versions. Based on the reconstructed refactoring information, it then detects potential anomalies in refactoring edits using two techniques: (1) a template-based checker for detecting missing edits and (2) a refactoring separator for detecting extra edits that may change a program's behavior. By helping developers be aware of deviations from pure refactoring edits, RefDistiller can help developers have high confidence about the correctness of manual refactoring edits. RefDistiller is available as an Eclipse plug-in at https://sites.google.com/site/refdistiller/ and its demonstration video is available at http://youtu.be/0Iseoc5HRpU.
Critics: An Interactive Code Review Tool for Searching and Inspecting Systematic Changes
Tianyi Zhang, Myoungkyu Song, and Miryung Kim
(University of California at Los Angeles, USA; University of Texas at Austin, USA)
During peer code reviews, developers often examine program differences. When using existing program differencing tools, it is difficult for developers to inspect systematic changes—similar, related changes that are scattered across multiple files. Developers cannot easily answer questions such as "what other code locations changed similar to this change?" and "are there any other locations that are similar to this code but are not updated?" In this paper, we demonstrate Critics, an Eclipse plug-in that assists developers in inspecting systematic changes. It (1) allows developers to customize a context-aware change template, (2) searches for systematic changes using the template, and (3) detects missing or inconsistent edits. Developers can interactively refine the customized change template to see corresponding search results. Critics has potential to improve developer productivity in inspecting large, scattered edits during code reviews. The tool's demonstration video is available at https://www.youtube.com/watch?v=F2D7t_Z5rhk
ConceptCloud: A Tagcloud Browser for Software Archives
Gillian J. Greene and Bernd Fischer
(Stellenbosch University, South Africa)
ConceptCloud is an interactive browser for SVN and Git repositories. Its main novelty is the combination of an intuitive tag cloud interface with an underlying concept lattice that provides a formal structure for navigation. This combination allows users to explore repositories serendipitously, without predefined search goals and along different navigation paths. ConceptCloud can derive different lattice types for a repository and supports concurrent navigation in multiple linked tag clouds that can each be individually customized, which allows multi-faceted repository explorations.
Titan: A Toolset That Connects Software Architecture with Quality Analysis
Lu Xiao, Yuanfang Cai, and Rick Kazman
(Drexel University, USA; University of Hawaii, USA)
In this tool demo, we will illustrate our tool, Titan, which supports a new architecture model: design rule spaces (DRSpaces). We will show how Titan can capture both architecture and evolutionary structure and help to bridge the gap between architecture and defect prediction. We will demo how to use our toolset to capture hundreds of buggy files into just a few architecturally related groups, and to reveal architecture issues that contribute to the error-proneness and change-proneness of these groups. Our tool has been used to analyze dozens of large-scale industrial projects, and has demonstrated its ability to provide valuable direction on which parts of the architecture are problematic, and on why, when, and how to refactor. The video demo of Titan can be found at https://art.cs.drexel.edu/~lx52/titan.mp4
BugLocalizer: Integrated Tool Support for Bug Localization
Ferdian Thung, Tien-Duy B. Le, Pavneet Singh Kochhar, and David Lo
(Singapore Management University, Singapore)
To manage bugs that appear in software, developers often make use of a bug tracking system such as Bugzilla. Users can report bugs that they encounter in such a system. Whenever a user submits a new bug report, developers need to read its summary and description and manually locate the buggy files based on this information. This manual process is often time consuming and tedious. Thus, a number of past studies have proposed bug localization techniques to automatically recover potentially buggy files from bug reports. Unfortunately, none of these techniques is integrated into a bug tracking system, which hinders their adoption by practitioners. To help disseminate research in bug localization to practitioners, we develop a tool named BugLocalizer, which is implemented as a Bugzilla extension and builds upon a recently proposed bug localization technique. Our tool extracts texts from the summary and description fields of a bug report and from source code files. It then computes similarities of the bug report with source code files to find the buggy files. Developers can use our tool online from a Bugzilla web interface by providing a link to a git source code repository and specifying the version of the repository to be analyzed. We have released our tool publicly on GitHub, which is available at: https://github.com/smagsmu/buglocalizer. We have also provided a demo video, which can be accessed at: http://youtu.be/iWHaLNCUjBY.
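The similarity computation can be sketched as a simple vector-space comparison between the bug report text and each file's text; this is only the basic idea, with made-up strings, and the bug localization technique the tool actually builds on is more elaborate:

    import java.util.*;

    /** Illustrative sketch of text-similarity-based bug localization: represent
     *  the bug report and each source file as bags of words and rank files by
     *  cosine similarity. */
    public class SimilaritySketch {

        static Map<String, Integer> termFrequencies(String text) {
            Map<String, Integer> tf = new HashMap<>();
            for (String t : text.toLowerCase().split("\\W+"))
                if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
            return tf;
        }

        static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<String, Integer> e : a.entrySet())
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            for (int v : a.values()) normA += (double) v * v;
            for (int v : b.values()) normB += (double) v * v;
            return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
        }

        public static void main(String[] args) {
            Map<String, Integer> report = termFrequencies("crash when saving file to disk");
            Map<String, Integer> fileA  = termFrequencies("class FileSaver { void save(File disk) }");
            Map<String, Integer> fileB  = termFrequencies("class MenuRenderer { void paint() }");
            System.out.printf("FileSaver.java:    %.3f%n", cosine(report, fileA));
            System.out.printf("MenuRenderer.java: %.3f%n", cosine(report, fileB));
        }
    }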
Technical Presentations 1
Diagnose Crashing Faults on Production Software
(Hong Kong University of Science and Technology, China)
Software crashes are severe manifestations of software faults. In particular, crashes in production software usually result in bad user experiences. Therefore, crashing faults usually need to be fixed with high priority. Diagnosing crashing faults on production software is non-trivial, due to the characteristics of the production environment. In general, two major challenges must be addressed. First, crash reports in production software are usually numerous, since production software is used by a large number of end users in various environments and configurations. In particular, a single fault may manifest as different crash reports, which makes prioritizing, debugging, and understanding faults difficult. Second, deployed software is required to run with minimal overhead and cannot afford a heavyweight instrumentation approach to collect program execution information. Furthermore, end users require that the logged information should not reveal sensitive production data. This thesis contributes to developing crashing-fault diagnosis tools that can be used in a production environment.
Technical Presentations 2
Integrating Approaches for Feature Implementation
(University of Luxembourg, Luxembourg; htw saar, Germany)
Compositional and annotative approaches are two competing yet complementary candidates for implementing feature-oriented software product lines. While the former provides real modularity, the latter excels concerning expressiveness. To combine the respective advantages of compositional and annotative approaches, we aim at unifying their underlying representations by leveraging the snippet system instead of directories and files. In addition, to exploit this unification, we propose different editable views.
Numerical Program Analysis and Testing
(University College London, UK)
Numerical software is playing an increasingly critical role in modern society, but composing correct numerical programs is difficult. This paper describes a doctoral research program that aims to alleviate this issue. It tackles real world problems and is guided by features learned from empirically studying these programs. By assisting developers in the production of numerical software, it improves the quality and productivity of software development. The research depends on numerical analysis and lies in the intersection of software engineering and program analysis.
Traceability and Model Checking to Support Safety Requirement Verification
(Nanjing University of Aeronautics and Astronautics, China)
Ensuring the safety of safety-critical software requires strict verification of the conformance between safety requirements and programs. Formal verification techniques, such as model checking and theorem proving, can be used to partially realize this objective. DO-178C, a standard for airborne systems, allows formal verification techniques to replace certain forms of testing. My research is concerned with applying model checking to verify the conformance between safety requirements and programs. First, a formal language for specifying software safety requirements which are relevant to event sequences is introduced. Second, traceability information models between the formalized safety requirements and programs are built. Third, the checking of a program against a safety requirement is decomposed into smaller model checking problems by utilizing the traceability information model between them.
Dealing with Uncertainty in Verification of Nondeterministic Systems
Yamilet R. Serrano Llerena
(National University of Singapore, Singapore)
Uncertainty complicates the formal verification of nondeterministic systems. Unpredictable changes and alterations in their environments can lead to invalid verification results and decrease the degree of confidence in these systems. However, the current literature provides little guidance on addressing uncertainty in formal verification. To address this problem, the goal of this research is to provide a method based on perturbation analysis for probabilistic model checking of nondeterministic systems that are modelled as Markov decision processes, and to apply the expected contributions to ubiquitous systems, given the inherent presence of environmental uncertainty and their resource limitations.
Technical Presentations 3
Static Analysis Driven Performance and Energy Testing
(National University of Singapore, Singapore)
Software testing is the process of evaluating the properties of a piece of software. These properties can be divided into two categories: functional properties and non-functional properties. Properties that influence the input-output relationship of the software can be categorized as functional properties. On the other hand, properties that do not directly influence the input-output relationship of the software can be categorized as non-functional properties. In the context of real-time system software, testing functional as well as non-functional properties is equally important. Over the years, a considerable amount of research effort has been dedicated to developing tools and techniques that systematically test various functional properties of a software system. However, the same cannot be said about testing non-functional properties. Systematic testing of non-functional properties is often much more challenging than testing functional properties. This is because non-functional properties depend not only on the inputs to the program but also on the underlying hardware. Additionally, unlike functional properties, non-functional properties are seldom annotated in the software itself. Such challenges provide the objectives for this work. The primary objective of this work is to explore and address the major challenges in testing the non-functional properties of a software system.
Autonomous Compliance Monitoring of Non-functional Properties
(National University of Singapore, Singapore)
While there is a good understanding of functional requirements and the need to test them during development, non-functional requirements are more elusive. Defining non-functional requirements can become a major undertaking consuming significant resources. Even after they are defined, chances are they do not reflect the real-world usage of a deployed system; differences can arise from workload, hardware, or utilised third-party libraries. To tackle these challenges we propose a fully automatic compliance-monitoring solution for non-functional properties. The proposed system mines stable behavioural patterns of the system and automatically extracts assertions that can be used to detect deviations from expected non-functional behaviour. We focus especially on non-functional properties that require runtime observation, e.g., execution time, performance, and throughput. The full automation of the process enables deployment in the field, giving rise to a distributed non-functional behaviour extraction system.
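A minimal sketch of the idea, under the assumption that a stable behavioural pattern can be summarised statistically (here, mean plus k standard deviations of observed execution times); the function names, threshold, and sample data are illustrative, not the actual implementation.

```python
# Mine an execution-time assertion from runtime observations, then use it to
# flag deviations of expected non-functional behaviour (illustrative sketch).
import statistics
import time

observed_times = []          # runtime observations collected in the field

def record(duration_s):
    observed_times.append(duration_s)

def mine_assertion(k=3.0):
    """Derive an upper bound on execution time from the observed baseline."""
    mean = statistics.mean(observed_times)
    std = statistics.stdev(observed_times)
    upper = mean + k * std
    return lambda duration_s: duration_s <= upper   # the extracted assertion

def monitored(fn, assertion):
    """Wrap a function so violations of the mined assertion are reported."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        if not assertion(elapsed):
            print(f"non-functional deviation: {fn.__name__} took {elapsed:.4f}s")
        return result
    return wrapper

for t in (0.10, 0.12, 0.11, 0.10, 0.13):   # baseline observations (illustrative)
    record(t)
within_budget = mine_assertion()
print(within_budget(0.12), within_budget(0.45))   # True, False (a deviation)
```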
Detecting, Isolating, and Enforcing Dependencies among and within Test Cases
(Columbia University, USA)
Testing stateful applications is challenging, as it can be difficult to identify hidden dependencies on program state. These dependencies may manifest between several test cases, or simply within a single test case. When it is left to developers to document, understand, and respond to these dependencies, a mistake can result in unexpected and invalid test results. Although current testing infrastructure does not leverage state dependency information, we argue that it could, and that doing so would improve testing. Our results thus far show that by recovering dependencies between test cases and modifying the popular testing framework JUnit to use this information, we can optimize the testing process, reducing the time needed to run tests by 62% on average. Our ongoing work applies similar analyses to improve state-of-the-art test suite prioritization and test case generation techniques.
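The actual work modifies JUnit; the following language-agnostic sketch only illustrates how recovered dependencies (the test names and the dependency map are hypothetical) can be treated as a partial order, so that hidden state dependencies are respected while independent tests remain free to be reordered or parallelised.

```python
# Dependency-aware test ordering via topological sort (illustrative sketch).
from graphlib import TopologicalSorter   # Python 3.9+

# Hypothetical recovered dependencies: the value is the set of tests that must
# run before the key because the key reads state they establish.
dependencies = {
    "test_checkout": {"test_add_to_cart"},
    "test_add_to_cart": {"test_login"},
    "test_login": set(),
    "test_browse": set(),                # independent: free to reorder or parallelise
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)   # e.g. ['test_login', 'test_browse', 'test_add_to_cart', 'test_checkout']
```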
Technical Presentations 4
Minimizing Software Conflicts through Proactive Detection of Conflicts and Task Scheduling
Bakhtiar Khan Kasi
(University of Nebraska-Lincoln, USA)
Software conflicts arising from conflicting changes are a regular occurrence and delay projects. Workspace awareness tools have been proposed to facilitate task coordination among developers, enabling them to identify potential conflicts early, while conflicts are still easy to resolve. However, these tools have limitations: they identify conflicts only after the conflicts have occurred and therefore cannot save the time and effort developers spend resolving them. The goal of this Ph.D. research is to: (1) characterize the distribution and frequency of conflicts and the factors within a project that affect them, (2) design and implement a conflict minimization technique that proactively identifies potential conflicts by analyzing developers' tasks and avoids them by scheduling tasks in a conflict-minimal manner, and (3) evaluate the proposed approach using historical data from OSS projects and through user evaluations. Thus far, we have implemented our approach and evaluated it with historical data from four OSS projects and with simulated data.
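One simple way to picture conflict-minimal scheduling (the batching heuristic, task names, and file sets below are assumptions for illustration, not the dissertation's algorithm) is to serialize tasks that are predicted to edit overlapping files and let disjoint tasks proceed in parallel.

```python
# Greedy batching of tasks so that no two tasks in a batch touch the same files.
tasks = {
    "fix-auth-bug":   {"auth.py", "session.py"},
    "add-logging":    {"logger.py"},
    "refactor-auth":  {"auth.py"},
    "update-docs":    {"README.md"},
}

batches = []   # each batch contains mutually conflict-free tasks
for task, files in tasks.items():
    placed = False
    for batch in batches:
        if all(files.isdisjoint(tasks[other]) for other in batch):
            batch.append(task)
            placed = True
            break
    if not placed:
        batches.append([task])

for i, batch in enumerate(batches, 1):
    print(f"batch {i}: {batch}")
# 'fix-auth-bug' and 'refactor-auth' share auth.py, so they land in different
# batches; the remaining tasks can run concurrently with either of them.
```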
Detecting and Preventing the Architectural Roots of Bugs
(Drexel University, USA)
Numerous techniques have been proposed to locate buggy files in a code base, but the problem of fixing one bug and unexpectedly affecting other files is persistent and prevalent. Our recent study revealed that buggy files are usually architecturally connected by architecture issues such as unstable interfaces and modularity violations. We aim to detect and prevent these architecture issues, which are the root causes of defects. Our contributions include (1) a new architecture model, Design Rule Space (DRSpace), that can express structural relations, quality, and evolutionary information simultaneously; and (2) a method for automatically extracting defect-prone architecture roots by combining static architecture analysis with data mining of the software revision history. Preliminary application of our approach to dozens of open source and industrial projects has demonstrated its significant potential to inform developers about how software defects should be discovered, examined, and handled.
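The revision-history side of such a combination can be pictured with a small sketch (the commits, file names, and static dependency set are hypothetical, and the real approach is considerably richer): count how often file pairs change in the same commit, and flag pairs that co-change frequently without a structural dependency as candidate modularity violations.

```python
# Co-change mining cross-referenced with static dependencies (illustrative).
from itertools import combinations
from collections import Counter

commits = [
    {"ui/View.java", "core/Model.java"},
    {"ui/View.java", "core/Model.java", "util/Log.java"},
    {"core/Model.java", "db/Store.java"},
    {"ui/View.java", "core/Model.java"},
]
static_dependencies = {("core/Model.java", "db/Store.java")}   # from static analysis

co_change = Counter()
for files in commits:
    for pair in combinations(sorted(files), 2):
        co_change[pair] += 1

for pair, count in co_change.most_common():
    if count >= 2 and pair not in static_dependencies:
        print(f"possible modularity violation: {pair} co-changed {count} times")
```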
Estimating the Effectiveness of Spectrum-Based Fault Localization
(Nanjing University, China)
Spectrum-Based Fault Localization (SBFL) techniques calculate risk values to predict the buggy units in a program, but they may cause heavy manual work when the calculated risk values are not reasonable in some application scenarios. This paper presents a preliminary study on estimating the effectiveness of SBFL before a manual code walkthrough, so that we can decide whether to adopt SBFL for a given application.
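For readers unfamiliar with SBFL, the sketch below computes a suspiciousness ranking from coverage spectra; the paper does not commit to a particular risk formula, so the widely used Ochiai formula and the toy coverage data are only illustrative.

```python
# Rank statements by an SBFL risk value computed from coverage spectra.
import math

# coverage[test] = set of statements executed; `failed` = set of failing tests
coverage = {"t1": {1, 2, 3}, "t2": {1, 3, 4}, "t3": {1, 4}}
failed = {"t2"}

def ochiai(stmt):
    ef = sum(1 for t in failed if stmt in coverage[t])                       # failing tests that cover stmt
    ep = sum(1 for t in coverage if t not in failed and stmt in coverage[t])  # passing tests that cover stmt
    denom = math.sqrt(len(failed) * (ef + ep))
    return ef / denom if denom else 0.0

statements = set().union(*coverage.values())
ranking = sorted(statements, key=ochiai, reverse=True)
print([(s, round(ochiai(s), 2)) for s in ranking])   # most suspicious statements first
```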
Social Network Analysis in Open Source Software Peer Review
(Nara Institute of Science and Technology, Japan)
Software peer review (a.k.a. code review) is regarded as one of the most important approaches to maintaining software quality and productivity. Due to the distributed collaboration and communication nature of Open Source Software (OSS), OSS review differs from traditional industrial review. Unlike other related work, this study investigates OSS peer review processes from a social perspective using social network analysis (SNA). We analyzed the review histories of three typical OSS projects. The results provide hints about the relationships among OSS reviewers, which can help us understand how developers work and communicate with each other.
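To illustrate the kind of analysis involved (the review threads and participant names are invented; the study's own projects and metrics are not reproduced here), one can connect people who participated in the same review and compute a centrality measure to spot core reviewers.

```python
# Build a reviewer-interaction graph and compute degree centrality.
import networkx as nx   # third-party: pip install networkx

reviews = [
    ["alice", "bob"],            # participants of one review thread
    ["alice", "carol", "dave"],
    ["bob", "carol"],
]

g = nx.Graph()
for participants in reviews:
    for i, a in enumerate(participants):
        for b in participants[i + 1:]:
            g.add_edge(a, b)

print(nx.degree_centrality(g))   # e.g. alice and carol emerge as central reviewers
```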
Towards a Theory of Architectural Styles
(TU München, Germany)
Architectural styles and patterns play an important role in software architectures. However, they are usually only stated informally, which may cause problems such as ambiguity and wrong conclusions. A rigorous theory of architectural styles, consisting of (i) mathematical models for each style, (ii) axioms to identify different variants of a style, and (iii) rigorous analyses by means of mathematical proofs, would address these problems. With this work we report on our progress towards such a rigorous theory of architectural styles.
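One possible shape such a formalisation could take (this is an assumption for illustration, not the authors' actual definitions) is to view a style as a signature of component and connector types together with a set of axioms, and to define conformance as satisfaction of those axioms.

```latex
% Illustrative formalisation (assumed, not the authors' definitions).
% A style pairs a signature of component/connector types with axioms;
% an architecture conforms to the style iff it satisfies every axiom.
\[
  \mathcal{S} = (\Sigma, \Phi), \qquad
  A \models \mathcal{S} \;\iff\; A \models \varphi \ \text{ for all } \varphi \in \Phi
\]
% Example axiom for a pipe-and-filter variant: every connector links exactly one
% filter output port to exactly one filter input port.
```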
Methodology and Culture: Drivers of Mediocrity in Software Engineering?
Marian Petre and Daniela Damian
(Open University, UK; University of Victoria, Canada)
Methodology implementation failure is attributed (by management) to developer mediocrity, not to organizational mediocrity (rigidity, or control-driven and process-driven management), nor to a lack of adaptation capability in the methodology. In supporting software construction as a creative process, however, we must promote excellence rather than conformity. We argue that, through principled research, we must pay attention to the interplay between methodology and culture and to the local adaptations needed to make things work, and understand how the two co-evolve and how together they may contribute to software quality.
Known Unknowns: Testing in the Presence of Uncertainty
Sebastian Elbaum and David S. Rosenblum
(University of Nebraska-Lincoln, USA; National University of Singapore, Singapore)
Uncertainty is becoming more prevalent in the software systems we build, introducing challenges in the way we develop software, especially in software testing. In this work we explore how uncertainty affects software testing, how it is managed currently, and how it could be treated more effectively.
, Gail C. Murphy, Emerson Murphy-Hill, and Xavier Blanc
(University of British Columbia, Canada; North Carolina State University, USA; University of Bordeaux, France)
Although software development involves making numerous decisions amongst alternatives, the design and implementation choices made typically become invisible; what a developer sees in the project's artifacts is the end result of all of those decisions. What if, instead, all of the choices made were tracked, and it were easy for a developer to revisit a point where a decision was made and choose another alternative? What if the development environment could detect and suggest alternative choices? What if it were easy and low-cost to try another path? We explore the idea of speculative reprogramming, which could support a what-if environment for the programming stages of software development.
A Variability Perspective of Mutation Analysis
Xavier Devroey, Gilles Perrouin, Maxime Cordy, Mike Papadakis, Axel Legay, and Pierre-Yves Schobbens
(University of Namur, Belgium; University of Luxembourg, Luxembourg; INRIA, France)
Mutation testing is an effective technique for either improving or generating fault-finding test suites. It creates defective or incorrect artifacts of the program under test and evaluates the ability of test suites to reveal them. Despite being effective, mutation is costly since it requires assessing the test cases against a large number of defective artifacts. Even worse, some of these artifacts are behaviourally "equivalent" to the original one and hence unnecessarily increase the testing effort. We adopt a variability perspective on mutation analysis. We model a defective artifact as a transition system with a specific feature selected and consider it a member of a mutant family. The mutant family is encoded as a Featured Transition System, a compact formalism initially dedicated to model checking of software product lines. We show how to evaluate a test suite against the set of all candidate defects by using mutant families. We can evaluate all the considered defects at the same time and isolate some equivalent mutants. We can also assist the test generation process and efficiently consider higher-order mutants.
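The core idea can be sketched as follows (the model, the two mutation features m1 and m2, and the test trace are invented for illustration and greatly simplified relative to real featured transition system semantics): each transition carries a feature expression saying which mutants include it, so a single walk over the shared model tells, for every mutant at once, whether it can still reproduce the original program's expected trace; a mutant that cannot is marked as killed.

```python
# Evaluate a whole mutant family against one test trace (illustrative sketch).
from itertools import product

FEATURES = ["m1", "m2"]        # two mutation features; all-False = original program

# (source, action, target, guard); the guard maps a feature assignment to a bool
transitions = [
    ("s0", "req",  "s1", lambda f: True),            # shared by every variant
    ("s1", "ack",  "s2", lambda f: not f["m1"]),      # removed by mutant m1
    ("s1", "nack", "s2", lambda f: f["m1"]),          # m1 replaces the action instead
    ("s2", "done", "s0", lambda f: not f["m2"]),      # removed by mutant m2
]

def can_run(trace, assignment):
    state = "s0"
    for action in trace:
        step = next((t for s, a, t, g in transitions
                     if s == state and a == action and g(assignment)), None)
        if step is None:
            return False
        state = step
    return True

expected_trace = ["req", "ack", "done"]    # behaviour of the original program
for values in product([False, True], repeat=len(FEATURES)):
    assignment = dict(zip(FEATURES, values))
    label = "+".join(f for f, v in assignment.items() if v) or "original"
    status = "survives" if can_run(expected_trace, assignment) else "killed"
    print(f"{label:10s} -> {status}")
```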
Mining Micro-practices from Operational Data
and Audris Mockus
(Peking University, China; University of Tennessee, USA; Avaya Labs, USA)
Micro-practices are the actual (and usually undocumented or incorrectly documented) activity patterns used by individuals or projects to accomplish basic software development tasks, such as writing code, testing, triaging bugs, or mentoring newcomers. The operational data in software repositories presents the tantalizing possibility of discovering such fine-scale behaviors and using them to understand and improve software development. We propose a large-scale, evidence-based approach that begins by creating a mirror of the projects in the open source universe. The next step would involve inductive generalization from in-depth studies of specific projects on one side and categorization of micro-practices across the entire universe on the other.
Achieving Lightweight Trustworthy Traceability
Jane Cleland-Huang, Mona Rahimi, and Patrick Mäder
(DePaul University, USA; TU Ilmenau, Germany)
Despite the fact that traceability is a required element of almost all safety-critical software development processes, trace data is often incomplete, inaccurate, redundant, conflicting, and outdated. As a result, it is neither trusted nor trustworthy. In this vision paper we propose a philosophical change in the traceability landscape which transforms traceability from a heavyweight process producing untrusted trace links into a lightweight, results-oriented, trustworthy solution. Current traceability practices which retard agility are cast away and replaced with a disciplined, just-in-time approach. The novelty of our solution lies in a clear separation of trusted trace links from untrusted ones, the change in perspective from 'living with' inaccurate traces toward rigorous and ongoing debridement of stale links from the trusted pool, and the notion of synthesizing available 'project exhaust' as evidence to systematically construct or reconstruct purposed, highly focused trace links.
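A minimal sketch of what separating trusted from untrusted links could look like (the evidence sources, weights, threshold, and link data are assumptions made for illustration, not the authors' model): combine several sources of project exhaust into a confidence score and admit a link to the trusted pool only above a threshold.

```python
# Score candidate trace links from multiple evidence sources (illustrative).
EVIDENCE_WEIGHTS = {
    "commit_mentions_requirement": 0.5,   # a commit message cites the requirement ID
    "test_covers_requirement":     0.3,   # an accepted test traces to the requirement
    "reviewer_confirmed":          0.2,   # a human vetted the link
}
TRUST_THRESHOLD = 0.6

candidate_links = [
    {"from": "REQ-12", "to": "brake.c",  "evidence": {"commit_mentions_requirement", "reviewer_confirmed"}},
    {"from": "REQ-12", "to": "logger.c", "evidence": {"test_covers_requirement"}},
]

for link in candidate_links:
    score = sum(EVIDENCE_WEIGHTS[e] for e in link["evidence"])
    pool = "trusted" if score >= TRUST_THRESHOLD else "untrusted (debride or gather more evidence)"
    print(f"{link['from']} -> {link['to']}: score {score:.1f}, {pool}")
```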