ICSE 2013 Workshops
2013 35th International Conference on Software Engineering (ICSE)

2013 1st International Workshop on Data Analysis Patterns in Software Engineering (DAPSE), May 21, 2013, San Francisco, CA, USA

DAPSE 2013 – Proceedings


1st International Workshop on Data Analysis Patterns in Software Engineering (DAPSE)

Title Page


Foreword
Data scientists in software engineering seek insight in data collected from software projects to improve software development. The demand for data scientists with domain knowledge in software development is growing rapidly, and there is already a shortage of such data scientists. Data science is a skilled art with a steep learning curve. To shorten that learning curve, this workshop collected best practices in the form of data analysis patterns, that is, analyses of data that lead to meaningful conclusions and can be reused for comparable data. In the workshop we compiled a catalog of such patterns that will help experienced data scientists communicate better about data analysis. The workshop is targeted at experienced data scientists, researchers, and anyone interested in how to analyze data correctly and efficiently in a community-accepted way.

Building Statistical Language Models of Code
Peter Schulam, Roni Rosenfeld, and Premkumar Devanbu
(CMU, USA; UC Davis, USA)
We present the Source Code Statistical Language Model data analysis pattern. Statistical language models have been an enabling tool for a wide array of important language technologies. Speech recognition, machine translation, and document summarization (to name a few) all rely on statistical language models to assign probability estimates to natural language utterances or sentences. In this data analysis pattern, we describe the process of building n-gram language models over software source files. We hope that by introducing the empirical software engineering community to best practices established over the years in natural language research, statistical language models can become a tool that SE researchers are able to use to explore new research directions.
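
As a concrete illustration of the pattern (a minimal sketch, not the authors' implementation), the following Python snippet lexes Python source with the standard tokenize module and estimates trigram probabilities with add-one smoothing; production language models would use the stronger smoothing methods established in the NLP literature.

    import io, tokenize
    from collections import Counter

    def lex(source):
        # Lex Python source into a flat sequence of token strings.
        return [t.string for t in tokenize.generate_tokens(io.StringIO(source).readline)
                if t.string.strip()]

    def ngram_counts(corpus, n=3):
        # Count n-grams and their (n-1)-token histories across the corpus.
        grams, hists = Counter(), Counter()
        for src in corpus:
            toks = ["<s>"] * (n - 1) + lex(src) + ["</s>"]
            for i in range(len(toks) - n + 1):
                grams[tuple(toks[i:i + n])] += 1
                hists[tuple(toks[i:i + n - 1])] += 1
        return grams, hists

    def prob(gram, grams, hists, vocab):
        # Add-one smoothed estimate of P(last token | preceding tokens).
        return (grams[gram] + 1) / (hists[gram[:-1]] + vocab)

    corpus = ["def add(a, b):\n    return a + b\n"]
    grams, hists = ngram_counts(corpus)
    vocab = len({tok for g in grams for tok in g})
    print(prob(("return", "a", "+"), grams, hists, vocab))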

Commit Graphs
Maximilian Steff and Barbara Russo
(Free University of Bolzano, Italy)
We present commit graphs, a graph representation of the commit history in version control systems. The graph is structured by commonly changed files between commits. We derive two analysis patterns relating to bug-fixing commits and system modularity.
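
One plausible way to realize such a graph (an illustrative sketch; the paper's construction may differ, e.g., by weighting edges) is to connect any two commits that change at least one file in common:

    from collections import defaultdict
    from itertools import combinations

    def commit_graph(commits):
        # Link two commits with an edge whenever they change a file in common.
        by_file = defaultdict(list)          # file -> commits that touch it
        for cid, files in commits.items():
            for f in files:
                by_file[f].append(cid)
        edges = set()
        for cids in by_file.values():
            edges.update(combinations(sorted(cids), 2))
        return edges

    # hypothetical history: commit id -> set of files changed by that commit
    commits = {"c1": {"a.py", "b.py"}, "c2": {"b.py"}, "c3": {"c.py"}}
    print(commit_graph(commits))             # {('c1', 'c2')}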

Concept to Commit: A Pattern Designed to Trace Code Changes from User Requests to Change Implementation by Analyzing Mailing Lists and Code Repositories
Scott McGrath, Kiran Bastola, and Harvey Siy
(University of Nebraska at Omaha, USA)
The concept to commit pattern is used for tracing code changes from user requests (analyzing the mailing list) to change implementation (analyzing the code repository). The analysis is done via text mining of both emails and commit descriptions in four stages. The first stage is identifying a search time window for the mailing list by evaluating a targeted commit's time stamp. Once a window is established, the body of the mailing list is reduced to match the search window. The next stage involves basic text mining processing (tokenization, stemming, and document matrix creation). The final stage is to perform frequency analysis (word cloud, heat map, or dendrogram).
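
A minimal sketch of the stages in Python, assuming the mailing list is available as (timestamp, body) pairs; the 14-day window and the crude tokenizer are illustrative placeholders for the paper's actual choices:

    import re
    from collections import Counter
    from datetime import datetime, timedelta

    def search_window(commit_time, days=14):
        # Stage 1: derive a mailing-list search window from a commit timestamp
        # (the 14-day look-back is an illustrative choice, not the paper's).
        return commit_time - timedelta(days=days), commit_time

    def reduce_list(mails, start, end):
        # Stage 2: keep only messages whose timestamps fall inside the window.
        return [body for ts, body in mails if start <= ts <= end]

    def term_frequencies(docs):
        # Stages 3-4: crude tokenization and term counting; a real pipeline
        # would add stemming and build a document-term matrix before plotting.
        counts = Counter()
        for doc in docs:
            counts.update(re.findall(r"[a-z]+", doc.lower()))
        return counts

    commit_time = datetime(2013, 5, 21)
    mails = [(datetime(2013, 5, 15), "Please fix the crash in the parser"),
             (datetime(2013, 1, 1), "Happy new year")]
    start, end = search_window(commit_time)
    print(term_frequencies(reduce_list(mails, start, end)).most_common(3))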

Data Analysis Anti-patterns in Empirical Software Engineering
Sandro Morasca
(University of Insubria, Italy)
The paper introduces the concept of data analysis anti-patterns, i.e., data analysis procedures that may lead to invalid results and mislead decision makers. Two examples of anti-patterns are presented and discussed.

Effect Size Analysis
Emanuel Giger and Harald C. Gall
(University of Zurich, Switzerland)
When we seek insight in collected data we are most often forced to limit our measurements to a portion of all individuals that can be hypothetically considered for observation. Nevertheless, as researchers, we want to draw more general conclusions that are valid beyond the restricted subset we are currently analyzing. Statistical significance testing is a fundamental pattern of data analysis that helps us to infer conclusions from a subset about the entire set of possible individuals. However, the outcome of such tests depends on several factors. Software engineering experiments often address similar research questions but vary with respect to those factors, for example, they operate on different sample sizes or measurements. Hence, the use of statistical significance alone to interpret findings across studies is insufficient. This paper describes how significance testing can be extended by an analysis of the magnitude, i.e., effect size, of an observation, making it possible to compare the results of different studies.
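
Two widely used effect size measures are easy to compute; the sketch below shows Cohen's d (parametric) and Cliff's delta (non-parametric) and is a generic illustration rather than the paper's own procedure:

    from statistics import mean, stdev

    def cohens_d(xs, ys):
        # Cohen's d: standardized mean difference with a pooled standard deviation.
        nx, ny = len(xs), len(ys)
        pooled = (((nx - 1) * stdev(xs) ** 2 + (ny - 1) * stdev(ys) ** 2)
                  / (nx + ny - 2)) ** 0.5
        return (mean(xs) - mean(ys)) / pooled

    def cliffs_delta(xs, ys):
        # Cliff's delta: non-parametric effect size in [-1, 1].
        gt = sum(x > y for x in xs for y in ys)
        lt = sum(x < y for x in xs for y in ys)
        return (gt - lt) / (len(xs) * len(ys))

    a = [12, 14, 15, 16, 18]   # e.g., defect counts in module group A
    b = [9, 10, 11, 13, 14]    # e.g., defect counts in module group B
    print(cohens_d(a, b), cliffs_delta(a, b))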

Exploring Software Engineering Data with Formal Concept Analysis
Xiaobing Sun, Ying Chen, Bin Li, and Bixin Li
(Yangzhou University, China; Southeast University, China)
Software engineering (SE) data often contains binary relationships between entities and their properties. Users are usually interested in meaningful groupings of these entities and properties. Formal concept analysis (FCA) is a powerful technique that operates on the binary relation between entities and entity properties to infer a hierarchy of concepts. The output of FCA is the concept lattice, where higher-level concepts represent general properties shared by many entities, while lower-level concepts represent entity-specific properties. FCA has been widely and successfully used as a data analysis technique in various SE fields, such as software comprehension and refactoring.
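
The following sketch enumerates all formal concepts of a toy context by brute force; it illustrates the extent/intent machinery only, since practical FCA implementations use efficient algorithms such as NextClosure:

    from itertools import combinations

    def concepts(context):
        # Enumerate all formal concepts (extent, intent) of a binary context
        # given as {entity: set of properties}, by closing every attribute subset.
        props = set().union(*context.values())
        found = set()
        for r in range(len(props) + 1):
            for subset in combinations(sorted(props), r):
                extent = {e for e, ps in context.items() if set(subset) <= ps}
                intent = (set.intersection(*(context[e] for e in extent))
                          if extent else props)
                found.add((frozenset(extent), frozenset(intent)))
        return found

    context = {"List": {"ordered", "mutable"},
               "Tuple": {"ordered"},
               "Set": {"mutable"}}
    for extent, intent in sorted(concepts(context), key=lambda c: -len(c[0])):
        print(sorted(extent), "->", sorted(intent))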

Extracting Artifact Lifecycle Models from Metadata History
Olga Baysal, Oleksii Kononenko, Reid Holmes, and Michael W. Godfrey
(University of Waterloo, Canada)
Software developers and managers make decisions based on the understanding they have of their software systems. This understanding is built up both experientially and through investigating various software development artifacts. While artifacts can be investigated individually, being able to summarize characteristics about a set of development artifacts can be useful. In this paper we propose lifecycle models as an effective way to gain an understanding of certain development artifacts. Lifecycle models capture the dynamic nature of how various development artifacts change over time in a graphical form that can be easily understood and communicated. Lifecycle models enable reasoning about the underlying processes and dynamics of the artifacts being analyzed. In this paper we describe how lifecycle models can be generated and demonstrate how they can be applied to the code review process of a development project.
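
A lifecycle model can be approximated as a state-transition graph with transition frequencies. The sketch below is an illustrative reduction of the idea, assuming artifact histories are available as ordered state lists:

    from collections import Counter

    def lifecycle_model(histories):
        # Aggregate per-artifact state histories into transition frequencies:
        # how often artifacts of this kind move from one state to the next.
        transitions = Counter()
        for states in histories:
            for a, b in zip(states, states[1:]):
                transitions[(a, b)] += 1
        total = sum(transitions.values())
        return {t: n / total for t, n in transitions.items()}

    # hypothetical patch histories recovered from code-review metadata
    histories = [["submitted", "accepted"],
                 ["submitted", "rejected", "resubmitted", "accepted"]]
    for (a, b), p in sorted(lifecycle_model(histories).items()):
        print(f"{a} -> {b}: {p:.2f}")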

Measure What Counts: An Evaluation Pattern for Software Data Analysis
Emmanuel Letier and Camilo Fitzgerald
(University College London, UK)
The 'Measure what counts' pattern consists of evaluating software data analysis techniques against problem-specific measures related to cost and other stakeholders' goals, instead of relying solely on generic metrics such as recall, precision, F-measure, and the area under the Receiver Operating Characteristic curve.
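
The contrast is easy to demonstrate: two hypothetical defect predictors with identical F-measure can differ sharply under a stakeholder-supplied cost model. The figures below are invented for illustration:

    def f_measure(tp, fp, fn):
        # Generic metric: harmonic mean of precision and recall.
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    def expected_cost(tp, fp, fn, cost_fp, cost_fn):
        # Problem-specific metric: cost of chasing false alarms plus the
        # cost of missed problems, with costs supplied by stakeholders.
        return fp * cost_fp + fn * cost_fn

    # two hypothetical defect predictors: identical F-measure, very
    # different costs once a missed defect is priced at 10x a false alarm
    print(f_measure(80, 40, 20), expected_cost(80, 40, 20, cost_fp=1, cost_fn=10))
    print(f_measure(80, 20, 40), expected_cost(80, 20, 40, cost_fp=1, cost_fn=10))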

Parametric Classification over Multiple Samples
Barbara Russo
(Free University of Bolzano, Italy)
This pattern was originally designed to classify sequences of events in log files by error-proneness. Sequences of events trace application use in real contexts. As such, identifying error-prone sequences helps understand and predict application use. The classification problem we describe is typical in supervised machine learning, but the composite pattern we propose investigates it with several techniques to control for data brittleness. Data pre-processing, feature selection, parametric classification, and cross-validation are the major instruments that enable a good degree of control over this classification problem. In particular, the pattern includes a solution for typical problems that occur when data comes from several samples of different populations with different degrees of sparsity.
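
The combination of steps maps naturally onto a scikit-learn pipeline; the sketch below uses synthetic data and illustrative parameter choices (k=10 features, 5 folds) rather than the paper's setup:

    from sklearn.datasets import make_classification
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # stand-in data; in the pattern, rows would be feature vectors derived
    # from event sequences in log files, labeled by error-proneness
    X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                               random_state=0)

    pipeline = Pipeline([
        ("scale", StandardScaler()),                # data pre-processing
        ("select", SelectKBest(f_classif, k=10)),   # feature selection
        ("clf", LogisticRegression(max_iter=1000)), # parametric classifier
    ])
    scores = cross_val_score(pipeline, X, y, cv=StratifiedKFold(n_splits=5))
    print(scores.mean())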

Patterns for Cleaning Up Bug Data
Rodrigo Souza, Christina Chavez, and Roberto Bittencourt
(UFBA, Brazil; UEFS, Brazil)
Bug reports provide insight into the quality of an evolving software system and into its development process. Such data, however, is often incomplete and inaccurate, and thus should be cleaned before analysis. In this paper, we present patterns that help both novice and experienced data scientists discard invalid bug data that could lead to wrong conclusions.
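
The flavor of such cleaning patterns can be sketched as simple filters; the specific rules below (duplicate ids, INVALID resolutions, impossible timestamps) are illustrative examples, not the paper's catalog:

    def clean(bugs):
        # Drop records that commonly distort analyses: reports resolved as
        # DUPLICATE or INVALID, impossible timestamps, and duplicate ids.
        seen, kept = set(), []
        for bug in bugs:
            if bug["resolution"] in {"DUPLICATE", "INVALID"}:
                continue
            if bug["closed"] is not None and bug["closed"] < bug["opened"]:
                continue                 # clock skew or corrupted metadata
            if bug["id"] in seen:
                continue
            seen.add(bug["id"])
            kept.append(bug)
        return kept

    bugs = [{"id": 1, "resolution": "FIXED", "opened": 10, "closed": 20},
            {"id": 2, "resolution": "INVALID", "opened": 10, "closed": 12},
            {"id": 3, "resolution": "FIXED", "opened": 30, "closed": 5}]
    print([b["id"] for b in clean(bugs)])    # [1]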

Patterns for Extracting High Level Information from Bug Reports
Rodrigo Souza, Christina Chavez, and Roberto Bittencourt
(UFBA, Brazil; UEFS, Brazil)
Bug reports record tasks performed by users and developers while collaborating to resolve bugs. Such data can be transformed into higher level information that helps data scientists understand various aspects of the team's development process. In this paper, we present patterns that show, step by step, how to extract higher level information about software verification from bug report data.
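
As an illustration of the step-by-step style (with invented field names, not the paper's schema), the sketch below derives two higher-level verification facts from a bug's raw status-change history:

    def verification_summary(history):
        # Derive higher-level facts from raw status changes: when the bug was
        # first verified and how often it bounced back after verification.
        first_verified, reopened_after_verify = None, 0
        for ts, old, new in history:
            if new == "VERIFIED" and first_verified is None:
                first_verified = ts
            if old == "VERIFIED" and new == "REOPENED":
                reopened_after_verify += 1
        return first_verified, reopened_after_verify

    # hypothetical (timestamp, old_status, new_status) triples for one bug
    history = [(100, "RESOLVED", "VERIFIED"),
               (150, "VERIFIED", "REOPENED"),
               (200, "RESOLVED", "VERIFIED")]
    print(verification_summary(history))     # (100, 1)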

Structural and Temporal Patterns-Based Features
Venkatesh-Prasad Ranganath and Jithin Thomas
(Microsoft Research, India)
In this paper, we propose a data transformation pattern to transform sequential data into a set of binary/categorical features and numerical features to enable data analysis. These features capture both structural and temporal information inherent in sequential data.
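
A minimal sketch of such a transformation, assuming sequences of (event, timestamp) pairs; the chosen bigram vocabulary and gap statistics are illustrative stand-ins for the features the paper proposes:

    def featurize(seq, bigram_vocab):
        # Structural features: presence/absence of selected event bigrams.
        # Temporal features: statistics over gaps between consecutive events.
        events = [e for e, _ in seq]
        times = [t for _, t in seq]
        bigrams = set(zip(events, events[1:]))
        structural = {f"has_{a}->{b}": (a, b) in bigrams for a, b in bigram_vocab}
        gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
        temporal = {"mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
                    "max_gap": max(gaps, default=0.0)}
        return {**structural, **temporal}

    seq = [("open", 0.0), ("read", 1.5), ("close", 2.0)]
    print(featurize(seq, [("open", "read"), ("read", "write")]))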

The Chunking Pattern
David M. Weiss and Audris Mockus
(Iowa State University, USA; Avaya Labs Research, USA)
Chunks are sets of code that have the property that a change that touches a chunk touches only that chunk. The pattern described in this paper defines chunks, indicates their usefulness, and provides an algorithm for calculating them.
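
Under the definition above, files co-changed by any single change must share a chunk, so chunks can be computed as connected components of the co-change relation. The union-find sketch below is one plausible reading, not necessarily the authors' algorithm:

    def chunks(changes):
        # Each change merges the chunks of all files it touches, so chunks
        # fall out as connected components of the co-change relation.
        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]    # path halving
                x = parent[x]
            return x
        for files in changes:
            files = sorted(files)
            root = find(files[0])
            for f in files[1:]:
                parent[find(f)] = root
        groups = {}
        for f in parent:
            groups.setdefault(find(f), set()).add(f)
        return list(groups.values())

    # hypothetical change history: each change is the set of files it touched
    changes = [{"a.c", "b.c"}, {"b.c", "c.c"}, {"d.c"}]
    print(chunks(changes))                   # [{'a.c', 'b.c', 'c.c'}, {'d.c'}]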
