Genetic programming for compiler testing: the experience of a postgraduate student at the ITMO ML laboratory

Viktor Petukhov, a second-year postgraduate student at ITMO University and a tech lead in one of the Kotlin project teams at JetBrains (deals with the Kotlin compiler), decided to combine his work with a detailed study of professional problems in a scientific format and joined ITMO Machine Learning Laboratories… We will tell you what came of it.

Photo by Marc Reichelt / Unsplash.com
Photo by Marc Reichelt / Unsplash.com

History of the issue

Programming languages ​​are the foundation of a vast array of products, services, and critical infrastructure. In its scientific article on this topic, Victor emphasizes that errors in the compiler code can indeed lead to significant consequences. However, the number of all kinds of test cases for them is usually extremely large, so they have to be composed with the involvement of intelligence – human or “artificial”, in order to properly investigate the most error-prone parts of the compiler.

But the automatic generation of such tests has limitations. As a rule, we tend to generate semantically correct code fragments, bypassing the parser (the simplest part of the compiler) and making our way to the most difficult parts: the type inference subsystem, the call resolution mechanism, and so on. If you use a formal model such as a grammar for this, the proportion of semantically correct code fragments obtained will be too small.

FROM using machine learning methods it is possible to improve the situation – to achieve a large number of generated semantically correct examples, and also to try to solve interesting secondary problems. Such as directed code generation – for example, to optimize the compiler runtime or the memory it consumes during compilation. This is how you can test the performance of the compiler.

The challenge of optimizing the performance characteristics of code generation along the way is not a simple addition to random generation. At the moment, not many specialists are working on this problem, so Victor saw an opportunity here for the development of professional and scientific competencies – he decided to propose such a method for generating source code in Kotlin so that “either the compilation time or the memory consumed were maximized.”

How it works

Using genetic programming to test compiler performance implies that with each new generation of code, some target parameter is optimized – for example, memory used.

The code generator, in turn, should also provide as much semantically correct fragments as possible. For this, Victor – together with his colleagues – developed a generator with a cross operator that takes into account the types of available variables.

An example of crossing source code fragments taking into account types
An example of crossing source code fragments taking into account types

The screenshot shows a possible substitution of one piece of code instead of another (crossing two files with source code). Here you can see that such a substitution will be type-compatible, in accordance with the implemented crossover operator. In the place where the code snippet is inserted, there are all the required types or their subtypes that are present inside the inline snippet.

During the experiment, we discovered some performance issues with the Kotlin compiler – for example, an exponential increase in the analysis time of lambdas on the right side of the “+ =” operator as the depth of nesting of these lambdas increases. The compiler team is already tackling this issue. In the screenshot below, we show how the compilation time of the given code fragment changes if we add nested lambdas.

An example of a point problem that was found
An example of a point problem that was found

What’s next

Five people are currently involved in the project. This is a collective work of ITMO University researchers and JetBrains representatives. Colleagues are convinced of the advisability of continuing the research and improving the approach proposed by Victor.

“We would also like to implement the mutation operator in order to get rid of the bias regarding the initial data set and generate fundamentally new language constructs. In addition to genetic programming, we plan to use generative models here, and then compare the results and think about developing a methodology for testing compiler performance “

– Victor Petukhov, second year postgraduate student, programmer – at the Faculty of IT and Programming, programmer at JetBrains and in the Kotlin compiler team.


Additional reading in our blog on Habré:

  • What projects is being prepared by the youth robotics laboratory

  • “We do what few people have done before us”: talking about infochemistry

  • What an interdisciplinary approach to robotics looks like


Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *