Automatic generation of commit messages

Hi! My name is Alexandra Eliseeva, and I am a student at the Computer Science Center. As part of my practice in the fall semester of 2020, I participated in the BERT for Source Code project led by Timofey Bryksin and Yaroslav Sokolov from JetBrains Research. I investigated how the BERT language model can be used to solve the problem of automatically generating commit messages. In this post I will tell you what came out of it and what still needs to be worked on.

About the project

There are many interesting problems in the field of programming language processing whose automatic solution can be useful for creating convenient tools for developers.

The source code of programs differs in many ways from natural language text, but it can also be viewed as a sequence of tokens, so similar methods can be applied to it. For example, the BERT language model is actively used in natural language processing. Its training involves two stages: pre-training on a large set of unlabeled data and fine-tuning for specific tasks on smaller labeled datasets. This approach makes it possible to solve many tasks with very good quality.
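As a purely illustrative sketch of this two-stage scheme (not part of the project itself, with a placeholder checkpoint and a toy classification task), fine-tuning a publicly available pre-trained model with HuggingFace Transformers might look roughly like this:

    # Illustrative sketch of the pre-train / fine-tune scheme; the checkpoint name and
    # the toy classification task are placeholders, not the project's actual setup.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stage 1: checkpoint pre-trained on unlabeled text
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # Stage 2: a few supervised steps on a small labeled dataset (toy examples here).
    texts, labels = ["fix flaky test", "update readme"], [1, 0]
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    batch["labels"] = torch.tensor(labels)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    loss = model(**batch).loss  # the new task-specific head is trained together with BERT
    loss.backward()
    optimizer.step()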

Recent works (1, 2, 3) have shown that if the BERT model is trained on a large dataset of program code, it copes well with several tasks in this area (among them, for example, localizing and fixing incorrectly used variables and generating comments for methods).

The project aims to investigate the use of BERT for other source code tasks. In particular, we focused on the task of automatically generating commit messages.

About the task

Why did we choose this task?

First, version control systems are used in the development of many projects, so a tool that solves this problem automatically could be relevant to a wide range of developers.

Second, we hypothesized that using BERT for this task could lead to good results, for several reasons:

  • in existing works (4, 5, 6), the data is collected from open sources and requires serious filtering, so there are few training examples. This is where BERT's ability to be fine-tuned on small datasets can come in handy;
  • the state-of-the-art result at the time of the project belonged to a model with the Transformer architecture that was pre-trained in a rather specific way on a small dataset (6). It was interesting for us to compare it with a BERT-based model pre-trained differently, but on much more data.

For the semester, I had to do the following:

  • study the subject area;
  • find a dataset and select a representation of the input data;
  • develop a pipeline for training and quality assessment;
  • conduct experiments.

Data

There are several open datasets for this task; I chose the most thoroughly filtered one (5).

The dataset was collected from the top 1000 open-source Java repositories on GitHub. After filtering, about 30 thousand examples remain of the original millions.

The examples themselves are pairs consisting of the output of the git diff command and the corresponding short message in English. It looks something like this:
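The dataset itself is not reproduced here, but a training pair has roughly the following shape (the diff and the message below are invented for illustration):

    # An invented example of one training pair; real dataset examples are not shown here.
    diff_example = """\
    --- a/src/main/java/Utils.java
    +++ b/src/main/java/Utils.java
    @@ -10,7 +10,7 @@
    -    if (value == null) {
    -        return "";
    +    if (value == null || value.isEmpty()) {
    +        return DEFAULT_VALUE;
         }
    """
    message_example = "Handle empty values in Utils"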

Both changes and messages in the dataset are short – no more than 100 and 30 tokens, respectively.

In most of the existing work on automatically generating commit messages, the models are fed a sequence of tokens from the git diff output.

There is another idea: to explicitly extract the two sequences, before and after the changes, and align them using the classical edit distance algorithm, so that the changed tokens always end up in the same positions.
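A minimal sketch of such an alignment, using Python's difflib (which implements a classical subsequence-matching algorithm) and a padding token for insertions and deletions; the actual implementations in prior work may differ:

    # Align the token sequences before and after a change so that modified tokens
    # end up at the same positions; gaps are filled with a special padding token.
    # This is an illustrative sketch, not the exact algorithm from prior work.
    from difflib import SequenceMatcher

    PAD = "<pad>"

    def align(before, after):
        aligned_before, aligned_after = [], []
        for tag, i1, i2, j1, j2 in SequenceMatcher(a=before, b=after).get_opcodes():
            left, right = before[i1:i2], after[j1:j2]
            width = max(len(left), len(right))
            aligned_before += left + [PAD] * (width - len(left))
            aligned_after += right + [PAD] * (width - len(right))
        return aligned_before, aligned_after

    before = ["if", "(", "x", "==", "null", ")", "return", ";"]
    after  = ["if", "(", "x", "==", "null", ")", "throw", "new", "Error", "(", ")", ";"]
    print(*zip(*align(before, after)), sep="\n")  # changed tokens share positions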

Ideally, we would like to try several approaches and understand how they affect the quality of solving this problem. At the initial stage, I used a fairly simple one: the two sequences, before and after the changes, were fed to the model without any alignment.

BERT for sequence-to-sequence tasks

Both the input and the output for the task of automatically generating commit messages are sequences, and their lengths may differ.

To solve such problems, an encoder-decoder architecture is usually used, which consists of two components:

  • the encoder model builds a vector representation based on the input sequence,
  • the decoder model generates an output sequence based on that vector representation.

The BERT model is based on the encoder from the Transformer architecture and is not, by itself, suited to such a task. Several options exist for obtaining a full-fledged sequence-to-sequence model; the simplest is to pair it with some decoder. This approach, with a decoder from the Transformer architecture, has worked well, for example, for neural machine translation (7).
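For example, with HuggingFace Transformers such a model can be assembled from a pre-trained encoder checkpoint and a decoder initialized from the same weights with added cross-attention, in the spirit of (7). A minimal sketch, with illustrative checkpoint names rather than the project's exact configuration:

    # A minimal sketch of building a sequence-to-sequence model around a BERT-style encoder;
    # the decoder is initialized from the same checkpoint and gets randomly initialized
    # cross-attention layers that are trained later.
    from transformers import AutoTokenizer, EncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        "bert-base-uncased",  # encoder: pre-trained BERT
        "bert-base-uncased",  # decoder: same weights, used autoregressively with cross-attention
    )
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id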

Pipeline

To conduct the experiments, I needed code for training such a sequence-to-sequence model and evaluating its quality.

To work with the BERT model, I used HuggingFace's Transformers library, and for the implementation in general, the PyTorch framework.

Since I had little experience with PyTorch at the beginning, I largely relied on existing examples for sequence-to-sequence models of other architectures, gradually adapting them to the specifics of my task. Unfortunately, this approach resulted in a lot of poor-quality code.

At some point, I decided to refactor and practically rewrote the pipeline. The PyTorch Lightning library helped structure the code: it allows you to collect all of the model's main logic in one module and automates many routine steps.
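A stripped-down sketch of what such a module might look like (the method names follow the PyTorch Lightning API; the real pipeline is, of course, more involved):

    # An illustrative LightningModule for a sequence-to-sequence model, not the project's exact code.
    import pytorch_lightning as pl
    import torch
    from transformers import AutoTokenizer, EncoderDecoderModel

    class CommitMessageModule(pl.LightningModule):
        def __init__(self, encoder_name="bert-base-uncased", decoder_name="bert-base-uncased", lr=3e-5):
            super().__init__()
            tok = AutoTokenizer.from_pretrained(encoder_name)
            self.model = EncoderDecoderModel.from_encoder_decoder_pretrained(encoder_name, decoder_name)
            self.model.config.decoder_start_token_id = tok.cls_token_id
            self.model.config.pad_token_id = tok.pad_token_id
            self.lr = lr

        def training_step(self, batch, batch_idx):
            # batch is assumed to hold tokenized diffs and target messages
            out = self.model(input_ids=batch["input_ids"],
                             attention_mask=batch["attention_mask"],
                             labels=batch["labels"])
            self.log("train_loss", out.loss)
            return out.loss

        def validation_step(self, batch, batch_idx):
            out = self.model(input_ids=batch["input_ids"],
                             attention_mask=batch["attention_mask"],
                             labels=batch["labels"])
            self.log("val_loss", out.loss)

        def configure_optimizers(self):
            return torch.optim.AdamW(self.parameters(), lr=self.lr)

    # Training then amounts to:
    # trainer = pl.Trainer(max_epochs=5)
    # trainer.fit(CommitMessageModule(), train_dataloader, val_dataloader)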

Experiments

In the course of the experiments, we wanted to understand whether using a BERT model pre-trained on program code would improve the state-of-the-art result in this area.

Among the BERT models pre-trained on code, only CodeBERT (1) was suitable, since it was the only one whose training data included Java. First, using CodeBERT as an encoder, I tried decoders of different architectures:

  1. A GRU decoder. I assumed that this option would quickly give some kind of baseline result, although GRU is rarely combined with the Transformer architecture, so it was not entirely clear what to expect.
    In the end, no reasonable quality could be obtained, even after tuning several of the hyperparameters that influence training.
  2. GPT-2 (8), a model based on the decoder from the Transformer architecture that is often used for generation tasks, as well as its smaller version, distilGPT-2 (9); a rough sketch of this combination is shown after the list.
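The sketch below shows the CodeBERT encoder paired with a distilGPT-2 decoder; the checkpoint names are real, but the exact way the two input sequences and special tokens were combined in the project may differ:

    # CodeBERT encoder + distilGPT-2 decoder; the decoder's cross-attention layers are
    # added automatically and trained from scratch. Illustrative sketch only.
    from transformers import AutoTokenizer, EncoderDecoderModel

    enc_tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    dec_tok = AutoTokenizer.from_pretrained("distilgpt2")
    model = EncoderDecoderModel.from_encoder_decoder_pretrained("microsoft/codebert-base", "distilgpt2")
    model.config.decoder_start_token_id = dec_tok.bos_token_id
    model.config.pad_token_id = enc_tok.pad_token_id

    # Input in the simple representation described above: code before and after the change,
    # joined by the encoder tokenizer's separator tokens.
    before, after = "return x ;", "return x + 1 ;"
    inputs = enc_tok(before, after, return_tensors="pt")
    generated = model.generate(inputs.input_ids, max_length=30)
    print(dec_tok.decode(generated[0], skip_special_tokens=True))  # meaningful only after fine-tuning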

There was not enough time for further experiments in the fall semester, so I continued them in the winter. We looked at several other ways of representing the input: we tried aligning the sequences before and after the changes, and also simply feeding the git diff as is.

The main results of the experiments are summarized below.

Summing up

In general, the assumption about the benefits of using CodeBERT for this task was not confirmed: in all cases, a Transformer model trained from scratch showed higher quality. The CoreGen model (6) remains the best method in this area: it is also a Transformer, but additionally pre-trained with the objective function proposed by its authors.

Many more ideas could be explored for this problem: for example, trying a data representation based on abstract syntax trees, which is often used when working with program code (10, 11), trying other pre-trained models, or doing some domain-specific pre-training if resources are available. In the spring semester, we focused on a more practical application of the results obtained and worked on autocompletion of commit messages. I will talk about this in the second part 🙂

In conclusion, I want to say that it was really interesting to participate in the project: I immersed myself in a new subject area and learned a lot during this time. The work on the project was very well organized, for which many thanks to my supervisors.

Thank you for your attention!

Sources

  1. Feng, Zhangyin, et al. "CodeBERT: A Pre-Trained Model for Programming and Natural Languages." 2020
  2. Buratti, Luca, et al. "Exploring Software Naturalness through Neural Language Models." 2020
  3. Kanade, Aditya, et al. "Learning and Evaluating Contextual Embedding of Source Code." 2020
  4. Jiang, Siyuan, Ameer Armaly, and Collin McMillan. "Automatically Generating Commit Messages from Diffs Using Neural Machine Translation." 2017
  5. Liu, Zhongxin, et al. "Neural-Machine-Translation-Based Commit Message Generation: How Far Are We?" 2018
  6. Nie, Lun Yiu, et al. "CoreGen: Contextualized Code Representation Learning for Commit Message Generation." 2021
  7. Rothe, Sascha, Shashi Narayan, and Aliaksei Severyn. "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks." 2020
  8. Radford, Alec, et al. "Language Models Are Unsupervised Multitask Learners." 2019
  9. Sanh, Victor, et al. "DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter." 2019
  10. Yin, Pengcheng, et al. "Learning to Represent Edits." 2018
  11. Kim, Seohyun, et al. "Code Prediction by Feeding Trees to Transformers." 2021
