IBM creates CodeNet to teach AI to write code

500 million lines of code in over 55 different programming languages.


Percentage of submissions by language (left) and status (right).

  • The dataset contains 13,916,868 submissions, divided into 4,053 problems, five of which have no submissions.
  • Part of the dataset was assembled from submissions to Google Code Jam from 2008 to 2020.
  • 53.6% (7,460,588) of the submissions were accepted, 29.5% were marked as “wrong answer”, and the rest were rejected for failing to meet runtime or memory requirements.
  • The dataset contains submissions in 55 different languages; 95% of them are written in C++, Python, Java, C, Ruby and C#.
  • C++ is the most common language, with 8,008,527 submissions (57% of the total), of which 4,353,049 were accepted.
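The shares in the list above can be re-derived from the raw counts. The following is a minimal sketch in Python that uses only the numbers quoted in the list; the variable names are chosen here for illustration.

```python
# Counts quoted in the list above.
TOTAL = 13_916_868          # all entries in the dataset
ACCEPTED = 7_460_588        # entries with status "accepted"
CPP_TOTAL = 8_008_527       # C++ entries
CPP_ACCEPTED = 4_353_049    # accepted C++ entries
WRONG_ANSWER_SHARE = 29.5   # percent, as stated in the list


def percent(part: float, whole: float) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


accepted_share = percent(ACCEPTED, TOTAL)
cpp_share = percent(CPP_TOTAL, TOTAL)
cpp_acceptance_rate = percent(CPP_ACCEPTED, CPP_TOTAL)
# Everything that is neither accepted nor "wrong answer".
rest_share = 100.0 - accepted_share - WRONG_ANSWER_SHARE

print(f"accepted overall:    {accepted_share:.1f}%")
print(f"C++ share of total:  {cpp_share:.1f}%")
print(f"C++ acceptance rate: {cpp_acceptance_rate:.1f}%")
print(f"other rejections:    {rest_share:.1f}%")
```

The computed shares agree with the 53.6% and roughly 57% figures in the list. They also show that the C++ acceptance rate is close to the overall acceptance rate.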

“Software is eating the world,” wrote American entrepreneur Marc Andreessen in 2011. Fast forward to today: software runs financial services and healthcare, smartphones and smart homes. Even cars now contain over 100 million lines of code.

However, such large amounts of code are difficult to debug, maintain, and update, especially when enterprises are looking to modernize their legacy software infrastructure. As a result, we are in a new era where it is important to take advantage of modern technologies such as artificial intelligence and hybrid cloud to create new solutions that can modernize the processes in the information technology pipeline.

Enter Project CodeNet: a large dataset aimed at teaching AI to program. It consists of approximately 14 million code samples and approximately 500 million lines of code in over 55 different programming languages, from modern ones like C++, Java, Python and Go to legacy languages such as COBOL, Pascal and Fortran.

But to understand the significance of this dataset, we must first look back in time.

AI’s next frontier: the language of machines

Computer scientists have long been interested in the possibility of computers programming computers. Can AI make it easier to understand, develop, and deploy code, the language of machines? It is possible, but not easy to achieve.

The problem is with rule-based systems.

Take translation from one programming language to another. If it were easy, rule-based systems would work, and programs written in early languages such as COBOL would have been converted by now. But programming languages have context. The meaning of any statement depends on its context, and capturing and translating that context, as with human languages, is difficult and time-consuming.

The larger the program becomes, the more difficult it is to translate. In human language, the context can be limited to a paragraph or so; in code, the context can span multiple code libraries. Context is a challenge for AI.

Roughly speaking, rule-based systems can successfully translate 50 to 60 percent of a program. That part can be translated quite well; the rest usually has to be translated by hand using complex rules.

AI development for code

This is where AI can help because it can act like humans.

Project CodeNet, in particular, can stimulate algorithmic innovation to extract this context using sequence-to-sequence models, like those we use for human languages, to advance machine understanding of code rather than mere machine processing of code.

Project CodeNet is unique for its code samples, drawn from open programming contests over the years. It is unique not only in size and scale, but also in the quality of its metadata and annotations, which carry a rich set of information: code size, memory footprint, CPU run time, and a status that indicates acceptance or the type of error.

More than 90 percent of the problems come with a problem description containing a concise problem statement and a specification of the input and output formats. For more than half of the programming problems (that is, seven million code samples), we also curated sample inputs and outputs from the problem description. These are key to determining whether two code samples in different languages are equivalent, which can in turn support reinforcement learning methods for code translation.

We provide them as part of the dataset, which is a handy Project CodeNet feature. Users can run the accepted code samples to extract additional metadata and to validate the output of generative AI models. This will allow researchers to program intent equivalence when translating from one programming language to another.
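The idea of using sample inputs and outputs to compare two code samples can be sketched as follows. This is a minimal illustration in Python, not the tooling shipped with Project CodeNet: the helper names, the two toy submissions and the sample inputs are all assumptions made up for this example. Agreement on a few samples is only evidence of equivalence, not proof.

```python
import subprocess
import sys


def run_program(cmd, stdin_text, timeout=5.0):
    """Run a command, feed it stdin_text, and return its stripped stdout."""
    result = subprocess.run(
        cmd,
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the program exits with an error
    )
    return result.stdout.strip()


def behave_alike(cmd_a, cmd_b, sample_inputs):
    """Return True if both commands print the same output for every input."""
    return all(
        run_program(cmd_a, text) == run_program(cmd_b, text)
        for text in sample_inputs
    )


# Two hypothetical submissions to a toy problem: "print the sum of two integers".
# Both are Python one-liners here so the sketch runs anywhere; in practice the
# commands could invoke programs written in different languages.
SUBMISSION_A = "a, b = map(int, input().split()); print(a + b)"
SUBMISSION_B = "import sys; print(sum(int(x) for x in sys.stdin.read().split()))"

cmd_a = [sys.executable, "-c", SUBMISSION_A]
cmd_b = [sys.executable, "-c", SUBMISSION_B]

# Hypothetical sample inputs, standing in for those of a problem description.
samples = ["1 2\n", "10 -3\n"]

print(behave_alike(cmd_a, cmd_b, samples))
```

The same comparison could be used to check a machine-translated program against the original it was translated from.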

The rich metadata and the variety of code samples and of the problems they solve open Project CodeNet to a multitude of use cases. The dataset can be used for code search and clone detection. The code samples in Project CodeNet are labeled with their acceptance status, so we can explore AI techniques to distinguish correct code from problematic code.

Project CodeNet metadata also makes it possible to track how a submission evolves from a problematic one to an accepted one, which can be used to explore automatic code correction. Each code sample is labeled with its CPU run time and memory footprint, which is useful for regression studies and prediction.
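Tracking a submission from rejected to accepted could look like the sketch below. It is written in Python under stated assumptions: the record fields (`submission_id`, `user_id`, `problem_id`, `date`, `status`) and the status strings are hypothetical stand-ins and should be adapted to the actual metadata files.

```python
from collections import defaultdict


def fix_pairs(records):
    """Pair each rejected submission with the accepted one that directly follows it.

    Submissions are grouped by (user, problem) and ordered by date. A pair
    (rejected_id, accepted_id) is emitted whenever a non-accepted submission
    is immediately followed by an accepted one in the same group.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["user_id"], rec["problem_id"])].append(rec)

    pairs = []
    for submissions in groups.values():
        ordered = sorted(submissions, key=lambda rec: rec["date"])
        for earlier, later in zip(ordered, ordered[1:]):
            if earlier["status"] != "Accepted" and later["status"] == "Accepted":
                pairs.append((earlier["submission_id"], later["submission_id"]))
    return pairs


# Hypothetical metadata rows, invented for this example.
records = [
    {"submission_id": "s1", "user_id": "u1", "problem_id": "p1", "date": 1, "status": "Wrong Answer"},
    {"submission_id": "s2", "user_id": "u1", "problem_id": "p1", "date": 2, "status": "Accepted"},
    {"submission_id": "s3", "user_id": "u2", "problem_id": "p1", "date": 1, "status": "Accepted"},
    {"submission_id": "s4", "user_id": "u1", "problem_id": "p2", "date": 3, "status": "Runtime Error"},
]

print(fix_pairs(records))
```

Each resulting pair is a candidate "before and after" example for studying automatic code correction.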

Given the abundance of programs written in many languages, we believe that Project CodeNet can serve as a benchmark dataset for source-to-source translation and do for AI and code what the ImageNet dataset did for computer vision years ago.

Modernizing and operating software infrastructure is also important from a business point of view. We touched on this last year when IBM announced several new capabilities, including IBM Watson AIOps and Accelerator for Application Modernization, that automate the information technology pipeline.

For example, a major automotive customer asked IBM to help modernize a $200 million asset consisting of 3,500 multi-generation Java files. These files contained over one million lines of code, developed over ten years with several generations of Java technology.

It was complex, monolithic application code that was not suitable for cloud environments. By applying our AI for code stack, we reduced the year-long code migration process to four weeks, modernizing the application and building over 25 new cloud microservices by refactoring the legacy monolithic code.

Our team is proud to provide researchers and developers with a dataset and a set of technologies that are easy to use and understand, while helping to design algorithms that will advance AI for code. It is our hope that Project CodeNet will deliver business value as businesses embark on their IT modernization journey.

Open Project CodeNet on GitHub and read the preprint.

