Interviewing ChatGPT-4 for a developer role

Increasingly, we hear that neural networks based on large language models will replace developers and write code in their place. But before a neural network can replace a developer, the language model must first… pass an interview! So why not check whether it can?

Such a frivolous thought occurred to me after reading several posts claiming that ChatGPT successfully solves tasks typical of job interviews. However, ChatGPT is known to have a couple of peculiarities:

  • its answers contain an element of chance: the same question can receive both a correct and an incorrect answer;

  • ChatGPT is trained on Internet data published up to September 2021. If a problem statement and its solution were available before that date, the problem is probably easier for the model, since it may have simply memorized the solution.

I was curious how much of ChatGPT's results reflect a genuine ability to solve algorithmic problems, and how much comes down to randomness or simply "remembering the solution".

The experiment plan (briefly)

I decided to select several dozen problems from LeetCode along two independent axes: easy / hard and published before / after September 2021. I randomly chose 13 problems in each category, 52 problems in total. The statement of each problem was fed to ChatGPT-4, together with explanations matching an "algorithmic interview" setting. From the model, I expected Python code that passes the automatic tests for the problem. If the code didn't pass the tests, I gave the model a single additional attempt to fix it, as sketched below.
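To make the procedure concrete, here is a minimal sketch of the evaluation loop. The helpers `ask_model` and `passes_tests` are hypothetical placeholders for "send a prompt to ChatGPT-4" and "run LeetCode's automatic tests"; they are not a real API.

```python
from typing import Callable

def evaluate(
    problem_statement: str,
    ask_model: Callable[[str], str],      # prompt -> Python source code (hypothetical)
    passes_tests: Callable[[str], bool],  # source code -> tests passed? (hypothetical)
) -> bool:
    """One problem: a first attempt, then at most one retry."""
    prompt = (
        "Imagine you are at an algorithmic interview. "
        "Solve the following problem in Python:\n" + problem_statement
    )
    code = ask_model(prompt)
    if passes_tests(code):
        return True
    # The single allowed correction: the only new information given
    # is the fact that the previous solution did not pass the tests.
    retry = prompt + "\nYour previous solution did not pass the tests. Please fix it."
    return passes_tests(ask_model(retry))
```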

Results

My experiment produced the following acceptance rates (that is, the percentage of correct solutions), counting the single additional attempt:

                         Easy    Hard
Before September 2021     50%     23%
After September 2021      36%      0%
Overall                   42%     12%

In only one of the 25 erroneous answers did ChatGPT actually manage to improve the solution, that is, to fix code that was not working correctly. It is unclear why the correct answer was not produced right away, since the request to improve the solution contained no additional information (other than the fact that the previously proposed solution did not work).

Interpretation of results

The results would be unsurprising if we were talking about a person: easy problems are easier to solve than hard ones, and previously seen problems are easier than new ones. Intuitively, the same explanation should also hold for a language model's results.

The result looks rather modest for real live coding. Imagine that an interview at a fictitious company consists of two easy problems, which might correspond to a junior or trainee vacancy. The chance of ChatGPT-4 successfully passing such an interview is about 18%. If the interview consists of one easy and one hard problem, the chances drop to about 5%.
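These estimates are just independent multiplication of the overall acceptance rates from the table above; a quick check in Python:

```python
# Chance of passing a two-task interview, assuming the outcomes of
# the tasks are independent and using the overall rates from the table.
p_easy, p_hard = 0.42, 0.12

print(f"two easy tasks:      {p_easy * p_easy:.0%}")  # ~18%
print(f"one easy, one hard:  {p_easy * p_hard:.0%}")  # ~5%
```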

On the other hand, this result seems incredibly cool for a general-purpose language model that was never specifically trained to write code. Most likely, in the very near future we will see specialized models that significantly outperform ChatGPT-4 at live coding.

I also want to believe that as language models get better at solving algorithmic problems, interviews will become less like competitive programming contests.

Links

Pivot table with the results of my experiment

Text results and more details about the experiment plan
