1st ACM SIGSOFT International Workshop on Representation Learning for Software Engineering and Program Languages (RL+SE&PL 2020),
November 13, 2020,
Virtual, USA
Frontmatter
Welcome from the Chairs
Welcome to the 1st International Workshop on Representation Learning for Software Engineering and Program Languages (RL+SE&PL), a virtual event held on November 13, 2020, co-located with ESEC/FSE 2020. The main focus of RL+SE&PL is to accelerate the exposure of the SE and PL communities to the emerging and fast-growing AI research in representation learning, which plays a central role in AI and leads to research results, techniques, and perspectives that challenge the status quo in the discipline.
Papers
A Multi-task Representation Learning Approach for Source Code
Deze Wang, Wei Dong, and
Shanshan Li
(National University of Defense Technology, China)
Representation learning has shown impressive results for a multitude of tasks in software engineering. However, most research still focuses on a single problem. As a result, the learned representations cannot be applied to other problems and lack generalizability and interpretability. In this paper, we propose a multi-task learning approach for representation learning across multiple downstream tasks of software engineering. From the perspective of generalization, we build a shared sequence encoder with a pretrained BERT for the token sequence and a structure encoder with a Tree-LSTM for the abstract syntax tree of the code. From the perspective of interpretability, we integrate an attention mechanism to focus on different representations and set learnable parameters to adjust the relationship between tasks. We also present the early results of our model. The learning process analysis shows that our model achieves a significant improvement over strong baselines.
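As a rough illustration of the abstract's "learnable parameters to adjust the relationship between tasks", one common multi-task formulation (not necessarily the authors' own) weights each task loss L_i by a learnable term s_i, so that minimizing exp(-s_i) * L_i + s_i lets training down-weight harder or noisier tasks:

```python
import math

# Illustrative sketch only: a learnable per-task weighting of losses.
# The weighting scheme below is a common convention, not taken from
# the paper itself; task losses and s_i values are made up.
def combined_loss(task_losses, log_vars):
    return sum(math.exp(-s) * L + s for L, s in zip(task_losses, log_vars))

# With all s_i = 0 every task is weighted equally.
total = combined_loss([0.8, 1.5], [0.0, 0.0])
```

During training, the s_i values are optimized jointly with the model parameters, so the trade-off between tasks is learned rather than hand-tuned.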
@InProceedings{RL+SE&PL20p1,
author = {Deze Wang and Wei Dong and Shanshan Li},
title = {A Multi-task Representation Learning Approach for Source Code},
booktitle = {Proc.\ RL+SE\&PL},
publisher = {ACM},
pages = {1--2},
doi = {10.1145/3416506.3423575},
year = {2020},
}
Publisher's Version
Statistical Machine Translation Outperforms Neural Machine Translation in Software Engineering: Why and How
Hung Phan and Ali Jannesari
(Iowa State University, USA)
Neural Machine Translation (NMT) is the current trend in Natural Language Processing (NLP) for automatically inferring the content of a target language given the source language. The strength of NMT is its ability to learn deep knowledge inside languages through deep learning approaches. However, prior work shows that NMT has drawbacks in NLP and in some research problems of Software Engineering (SE). In this work, we hypothesize that SE corpora have inherent characteristics that pose challenges for NMT compared to state-of-the-art translation engines based on Statistical Machine Translation (SMT). We introduce a problem that is significant in SE and whose characteristics challenge the ability of NMT to learn correct sequences, called Prefix Mapping. We implement and optimize the original SMT and NMT to mitigate those challenges. Our evaluation shows that SMT outperforms NMT for this research problem, which provides potential directions for optimizing current NMT engines for specific classes of parallel corpora. By achieving accuracies from 65% to 90% for code token generation on a 1000-project GitHub code corpus, we show the potential of using MT for token-level code completion.
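The statistical flavor of the approach can be illustrated with a toy maximum-likelihood token mapping learned from aligned pairs. This is a stand-in sketch, not the paper's actual SMT pipeline; the token pairs are invented:

```python
from collections import Counter, defaultdict

# Toy statistical translation table: count co-occurrences of aligned
# source/target tokens, then translate each source token to its most
# frequently observed target.
def train_table(aligned_pairs):
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}

pairs = [("ArrayList", "List"), ("ArrayList", "List"),
         ("ArrayList", "Collection"), ("HashMap", "Map")]
table = train_table(pairs)
```

Real SMT engines add phrase tables, alignment models, and a language model on top of such counts, but the core idea of estimating translation choices from corpus statistics is the same.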
@InProceedings{RL+SE&PL20p3,
author = {Hung Phan and Ali Jannesari},
title = {Statistical Machine Translation Outperforms Neural Machine Translation in Software Engineering: Why and How},
booktitle = {Proc.\ RL+SE\&PL},
publisher = {ACM},
pages = {3--12},
doi = {10.1145/3416506.3423576},
year = {2020},
}
Publisher's Version
Info
A Differential Evolution-Based Approach for Effort-Aware Just-in-Time Software Defect Prediction
Xingguang Yang, Huiqun Yu, Guisheng Fan, and Kang Yang
(East China University of Science and Technology, China)
Software defect prediction technology is an effective method to improve software quality. Effort-aware just-in-time software defect prediction (JIT-SDP) aims to identify more defective changes with limited effort. Although many methods have been proposed for JIT-SDP, the prediction performance of existing prediction models still needs to be improved. To improve the effort-aware prediction performance, we propose a new method called DEJIT based on the differential evolution algorithm. First, we propose a metric called density-percentile-average (DPA), which is used as the optimization objective of models on the training set. Then, we use logistic regression to build models and use the differential evolution algorithm to determine the coefficients of the logistic regression. We conduct empirical research on six open source projects. Empirical results demonstrate that the proposed method significantly outperforms four state-of-the-art supervised models and four unsupervised models.
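A minimal sketch of the differential evolution loop (DE/rand/1/bin) that such an approach builds on; here a toy quadratic stands in for the paper's DPA objective over logistic-regression coefficients, and all hyperparameters are illustrative:

```python
import random

# Minimal differential evolution (DE/rand/1/bin) sketch.  The "sphere"
# objective below is a placeholder for the paper's DPA metric.
def differential_evolution(objective, dim, pop_size=20, F=0.5, CR=0.9,
                           generations=100, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    fitness = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantee one mutated coordinate
            trial = [pop[a][j] + F * (pop[b][j] - pop[c][j])
                     if rng.random() < CR or j == j_rand else pop[i][j]
                     for j in range(dim)]
            f = objective(trial)
            if f <= fitness[i]:  # greedy one-to-one selection
                pop[i], fitness[i] = trial, f
    best = min(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best]

sphere = lambda x: sum(v * v for v in x)
best_x, best_f = differential_evolution(sphere, dim=3)
```

In the paper's setting, the candidate vectors would be logistic-regression coefficient vectors and the objective would be DPA evaluated on the training set.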
@InProceedings{RL+SE&PL20p13,
author = {Xingguang Yang and Huiqun Yu and Guisheng Fan and Kang Yang},
title = {A Differential Evolution-Based Approach for Effort-Aware Just-in-Time Software Defect Prediction},
booktitle = {Proc.\ RL+SE\&PL},
publisher = {ACM},
pages = {13--16},
doi = {10.1145/3416506.3423577},
year = {2020},
}
Publisher's Version
When Representation Learning Meets Software Analysis
Ming Fan, Ang Jia, Jingwen Liu, Ting Liu, and Wei Chen
(Xi'an Jiaotong University, China; State Grid Shaanxi Electric Power Research Institute, China)
In recent years, deep learning has become increasingly prevalent in the field of Software Engineering (SE). In particular, representation learning, which can learn vectors from the syntax and semantics of code, offers great convenience and a boost for downstream tasks such as code search and vulnerability detection. In this work, we introduce two of our applications that leverage representation learning for software analysis: defect prediction and vulnerability detection.
@InProceedings{RL+SE&PL20p17,
author = {Ming Fan and Ang Jia and Jingwen Liu and Ting Liu and Wei Chen},
title = {When Representation Learning Meets Software Analysis},
booktitle = {Proc.\ RL+SE\&PL},
publisher = {ACM},
pages = {17--18},
doi = {10.1145/3416506.3423578},
year = {2020},
}
Publisher's Version
Boosting Component-Based Synthesis with Control Structure Recommendation
Binbin Liu, Wei Dong, Yating Zhang, Daiyan Wang, and Jiaxin Liu
(National University of Defense Technology, China)
Component-based synthesis is an important research field in program synthesis. API-based synthesis is a subfield of component-based synthesis whose component library consists of Java APIs. Unlike existing work in API-based synthesis, which can only generate loop-free programs composed of APIs, the state-of-the-art tool FrAngel can generate programs with control structures. However, for the generation of control structures, it samples all types of control structures at random. Given information about the desired method (such as the method name and input/output types), experienced programmers can form an initial idea of the possible control structures that could be used to implement it. The knowledge about control structures in a method can be learned from high-quality projects. In this paper, we propose a novel deep-learning-based approach for recommending control structures for API-based synthesis. We propose a neural network that jointly embeds the natural language description, method name, and input/output types into high-dimensional vectors to predict the possible control structures of the desired method. We integrate the prediction model into the synthesizer to improve the efficiency of synthesis. We train our model on a codebase of high-quality Java projects from GitHub. The prediction results of the neural network are fed to the API-based synthesizer to guide the sampling process of control structures. Experimental results on 40 programming tasks show that our approach can effectively improve the efficiency of synthesis.
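The integration step, biasing the synthesizer's control-structure sampling with the model's predictions instead of sampling uniformly, can be sketched as follows (structure names and probabilities are illustrative, not the paper's actual categories):

```python
import random

# Hypothetical sketch: a synthesizer that samples control-structure
# kinds weighted by the recommendation model's predicted probabilities,
# rather than uniformly at random.
STRUCTURES = ["if", "if-else", "for", "while", "none"]

def sample_structure(predicted_probs, rng):
    return rng.choices(STRUCTURES, weights=predicted_probs, k=1)[0]

rng = random.Random(42)
# A model that strongly predicts a loop makes "for" dominate the samples.
picks = [sample_structure([0.05, 0.05, 0.8, 0.05, 0.05], rng)
         for _ in range(1000)]
```

Keeping nonzero weight on every structure preserves completeness: the synthesizer can still explore structures the model considers unlikely, just less often.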
@InProceedings{RL+SE&PL20p19,
author = {Binbin Liu and Wei Dong and Yating Zhang and Daiyan Wang and Jiaxin Liu},
title = {Boosting Component-Based Synthesis with Control Structure Recommendation},
booktitle = {Proc.\ RL+SE\&PL},
publisher = {ACM},
pages = {19--28},
doi = {10.1145/3416506.3423579},
year = {2020},
}
Publisher's Version
Towards Demystifying Dimensions of Source Code Embeddings
Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, and
Mohammad Amin Alipour
(University of Houston, USA)
Source code representations are key to applying machine learning techniques for processing and analyzing programs. A popular approach to representing source code is neural source code embeddings, which represent programs with high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although successful, little is known about the contents of these vectors and their characteristics.
In this paper, we present our preliminary results towards better understanding the contents of code2vec neural source code embeddings. In particular, in a small case study, we use the code2vec embeddings to create binary SVM classifiers and compare their performance with handcrafted features. Our results suggest that the handcrafted features can perform very close to the high-dimensional code2vec embeddings, and that information gain is more evenly distributed across the code2vec embedding dimensions than across the handcrafted features. We also find that the code2vec embeddings are more resilient than the handcrafted features to the removal of dimensions with low information gain. We hope our results serve as a stepping stone toward principled analysis and evaluation of these code representations.
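The kind of per-dimension information-gain analysis described above can be sketched as follows. This is a simplified illustration in which each dimension is binarized at its lower median, which is not necessarily the authors' exact procedure:

```python
import math
from collections import Counter

# Information gain of one embedding dimension with respect to binary
# labels: entropy of the labels minus the conditional entropy after
# splitting the examples at the dimension's lower median.
def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    split = sorted(values)[(len(values) - 1) // 2]  # lower median
    left = [l for v, l in zip(values, labels) if v <= split]
    right = [l for v, l in zip(values, labels) if v > split]
    n = len(labels)
    conditional = sum(len(s) / n * entropy(s) for s in (left, right) if s)
    return entropy(labels) - conditional

# A dimension that perfectly separates balanced classes yields 1 bit.
ig = information_gain([0.1, 0.2, 0.9, 1.0], [0, 0, 1, 1])
```

Ranking the embedding dimensions by this score, then retraining classifiers after dropping the lowest-ranked ones, is the style of resilience experiment the abstract refers to.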
@InProceedings{RL+SE&PL20p29,
author = {Md Rafiqul Islam Rabin and Arjun Mukherjee and Omprakash Gnawali and Mohammad Amin Alipour},
title = {Towards Demystifying Dimensions of Source Code Embeddings},
booktitle = {Proc.\ RL+SE\&PL},
publisher = {ACM},
pages = {29--38},
doi = {10.1145/3416506.3423580},
year = {2020},
}
Publisher's Version
Representation Learning for Software Engineering and Programming Languages
Tien N. Nguyen and Shaohua Wang
(University of Texas at Dallas, USA; New Jersey Institute of Technology, USA)
Recently, deep learning (DL) and machine learning (ML) methods have been massively and successfully applied to various software engineering (SE) and programming languages (PL) tasks. The results are promising and exciting, and lead to further opportunities for exploring the amenability of DL and ML to different SE and PL tasks. Notably, the choice of representations on which DL and ML methods operate critically impacts their performance. The rapidly developing field of representation learning (RL) in artificial intelligence is concerned with how we can best learn meaningful and useful representations of data. A broad view of RL in SE and PL includes topics such as deep learning, feature learning, compositional modeling, structured prediction, and reinforcement learning. This workshop will advance the pace of research at the unique intersection of representation learning and SE and PL, which will, in the long term, lead to more effective solutions to common software engineering tasks such as coding, maintenance, testing, and porting. In addition to attracting the community of researchers who usually attend FSE, we have made intensive efforts to attract researchers from the RL (broadly, AI) community to the workshop, especially from very strong groups at local universities and from research labs across the nation.
@InProceedings{RL+SE&PL20p39,
author = {Tien N. Nguyen and Shaohua Wang},
title = {Representation Learning for Software Engineering and Programming Languages},
booktitle = {Proc.\ RL+SE\&PL},
publisher = {ACM},
pages = {39--40},
doi = {10.1145/3416506.3423581},
year = {2020},
}
Publisher's Version