2015 IEEE 31st International Conference on Software Maintenance and Evolution (ICSME), September 29 – October 1, 2015, Bremen, Germany

Is This Code Written in English? A Study of the Natural Language of Comments and Identifiers in Practice
Timo Pawelka and Elmar Juergens
(TU München, Germany; CQSE, Germany)
Abstract: Comments and identifiers are the main source of documentation of source code and are therefore an integral part of the development and maintenance of a program. As English is the world language, most comments and identifiers are written in English. However, if they are written in any other language, a developer without knowledge of that language will perceive the code as almost undocumented or even obfuscated. In the absence of industrial data, academia is not aware of the extent of the problem of non-English comments and identifiers in practice. In this paper, we propose an approach for the language identification of source-code comments and identifiers. Using this approach, we conducted a large-scale study of the natural language of source-code comments and identifiers, analyzing multiple open-source and industry systems. The results show that a significant number of the industry projects contain comments and identifiers in more than one language, whereas none of the analyzed open-source systems has this problem.


