Testing ChatGPT’s logic on simple tasks

Tech news is full of ChatGPT success stories: a smart bot has passed the exams at yet another university, or many people will soon be out of work because a generative system based on a large language model will replace them. Surely many people have wanted to test ChatGPT’s capabilities for themselves and find out whether it is really as smart as the press says. If you have had that urge too, this note is for you.

The main question is how to actually test it. Whatever set of questions we pick, the assessment will be subjective: another tester would have chosen different questions, the ones interesting to them personally, and would have gotten different results. Keep this subjectivity in mind and make allowances for it. As a programmer, I am naturally most interested in how ChatGPT handles logic and how accurately it “understands” the question asked. The test also requires questions whose answers can be evaluated unambiguously and formally, so a request to write a poem or an essay will not do. The most logical choice seemed to be a set of school problems, scored by the percentage of successful solutions; that at least looks like a metric. The tasks were picked from the Internet with one selection criterion: the wording had to be short and unambiguous. Here is an example of such a task:

There are 16 candies. How do you divide them between Kolya and Petya so that Kolya has two more candies than Petya?
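(For reference, the expected answer is plain arithmetic: Kolya gets (16 + 2) / 2 = 9 candies and Petya gets 16 − 9 = 7, so the difference is exactly 2.)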

The rest of the tasks are just as easy; the complete list is given at the end. Given the news that ChatGPT “successfully passes university exams”, it is reasonable to expect it to solve simple school problems with a probability close to 100%. A few more points:

  • Since the Telegram channels dedicated to AI write that it is better to ask questions in English (supposedly the bot answers more accurately that way), the tasks were posed in English. Comparing the results later, I saw no difference in answer quality when asking the same questions in Russian.

  • Since the channels dedicated to AI write that “prompting” greatly affects answer quality, every task was given the following prefix: “Solve a mathematical problem. Let’s think step by step. Task:”. With this prefix ChatGPT really does reason step by step and explains the result, although it often reasons in steps even without it. I tried other prompts but noticed no difference in answer quality.

  • Another point, one I have never seen mentioned in the channels dedicated to AI: if you ask the same task several times (clearing the context each time), the answers often differ. One time the bot answers “1.5 diggers”, the next time “2.5 diggers”. So asking a question once is not enough for an evaluation: I had to ask each task several times (10 attempts for each of the 10 tasks) and compute the percentage of correct answers. A sketch of this loop is given right after this list.

  • The previous point has another important consequence. When someone shows you a screenshot of GPT’s answer to prove how smart (or, conversely, how stupid) it is, you do not know how many attempts the author made to get such a good (or bad) answer. And even if the author got that answer on the first try, it does not mean the answer is typical rather than a rare statistical outlier. So examples like these are hardly convincing evidence of anything. By the way, this applies to more than just GPT.

  • One more point: 10 attempts are obviously far too few for serious statistics; another 10 attempts could give a very different result. But this is not a serious study, just a quick back-of-the-envelope test, and it does not ask for more.

  • Analyzing the network’s erroneous responses, I refined the wording of some tasks. For example, the problem about the number of checkers games needed the clarification “Each game of checkers is played by two people”. The percentage of correct answers immediately went up; interestingly, not to 100% (ChatGPT still sometimes ignored the clarification), but noticeably higher. Adding such a hint could arguably be considered unfairly fitting the answer, but let it be a small handicap in ChatGPT’s favor.

  • An interesting detail, by the way: if you ask the standalone question “How many people play a game of checkers?”, the bot confidently answers that two people do. Yet in a mathematical problem that requires this knowledge, the bot forgets it at random. Perhaps this suggests that the logical connections built during the model’s training are rather fragmented.
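For reference, the whole testing loop amounts to something like the minimal sketch below. It assumes the official OpenAI Python client and an OPENAI_API_KEY in the environment; the substring check for “correct” is my crude stand-in, since the real answers were graded by hand.

```python
# Minimal testing-loop sketch: ask each task several times in a fresh
# context and count correct answers. Assumes `pip install openai` and
# OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

PREFIX = "Solve a mathematical problem. Let's think step by step. Task: "
ATTEMPTS = 10

def run_task(task: str, expected: str, model: str = "gpt-3.5-turbo") -> int:
    """Ask one task ATTEMPTS times, each time in a cleared context."""
    correct = 0
    for _ in range(ATTEMPTS):
        # Each call is a brand-new single-message conversation,
        # i.e. the context is cleared between attempts.
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PREFIX + task}],
        )
        answer = response.choices[0].message.content or ""
        if expected in answer:  # crude stand-in for manual grading
            correct += 1
    return correct

tasks = [
    ("There are 16 candies. How do you divide them between Kolya and Petya "
     "so that Kolya has two more candies than Petya?", "9"),
    # ... the remaining nine tasks from the appendix
]

scores = [run_task(task, expected) for task, expected in tasks]
# With 10 tasks of 10 attempts each, the sum of the scores is the percentage.
print(scores, sum(scores))
```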

But let’s move on to the test results. Below is a list where each number is the number of correct answers out of 10 attempts for each task: 3, 10, 6, 0, 1, 3, 6, 4, 6, 2. That is, the first task got 3 correct answers out of 10, the second got 10 out of 10, and so on.
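Summing these scores: 3 + 10 + 6 + 0 + 1 + 3 + 6 + 4 + 6 + 2 = 41 correct answers out of 100 attempts in total.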

What are the conclusions? First, the percentage of correct answers varies wildly between tasks (there are even the extreme values of 0 and 10 correct answers out of 10), although the tasks seem quite close in difficulty.

The total percentage of correct answers in this test is 41%. Is that a lot or a little? Judging by the Telegram channels about AI, people have become quite polarized in their assessments of ChatGPT’s capabilities. Some find examples where the bot fails the simplest task and draw far-reaching conclusions about its weakness, at least in its current implementation. From the point of view of such “critics”, 41% on simple school tasks is a complete failure, especially given the overheated expectations built up by the news about passed university exams and other incredible successes.

There is another category of people who are, on the contrary, maximally loyal to ChatGPT and inspired by its capabilities. From their point of view, 41% is a very cool result: until recently the idea that a machine was about to learn to understand the logic of human speech, reason, and solve problems sounded like pure fantasy. And we are only at the beginning of the journey!

Finally, a small twist held in store: the results above were obtained on the GPT-3.5 language model, but GPT-4 is already out. Moreover, in the channels I saw the opinion that “Four compared to version 3.5 is like a professor compared to a junior high school student!”. OK, let’s assume so. We have a test at hand, so let’s run it on version 4 and compare the results. Here they are: 9, 10, 10, 1, 2, 10, 6, 7, 10, 7. Final percentage: 72%.
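(With the harness sketched earlier, rerunning the test is just a matter of passing a different model name, e.g. run_task(..., model="gpt-4") in my sketch, assuming access to that model.)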

The assessment again depends on the loyalty of the “appraiser”. Critical citizens will say: “Pfft… 72 percent on simple school tasks is no longer a disaster, but still rather weak!”. Loyal citizens will say: “A total success! Almost doubling in a single version is unbelievably cool, and the next version will be even better!”.

While you decide which of these positions is closer to yours, here is one more twist, or rather a small addition. I found an excellent set of problems on the Internet, this time without calculations, purely from the realm of formal logic. Below is an example of such a task (the prompt asking to explain the reasoning was, of course, used here as well):

Task: People who are either tall, or heavy, or both tall and heavy do not suit us. George suits us.

Answer options:

A. George is not tall
B. George is heavy
C. George is tall but not heavy
D. None of the above
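Incidentally, this particular example can be checked mechanically. Below is a small brute-force sketch of my reading of the task, not an official answer key: enumerate all tall/heavy combinations, keep those consistent with the premises, and see which option holds in every remaining case.

```python
# Brute-force check of the example logic task: "tall or heavy does not
# suit us" plus "George suits us" constrains George's possible attributes.
from itertools import product

# All (tall, heavy) combinations consistent with the premises.
consistent = [
    (tall, heavy)
    for tall, heavy in product([False, True], repeat=2)
    if not (tall or heavy)  # George suits us, so he is not tall-or-heavy
]

options = {
    "A. George is not tall": lambda tall, heavy: not tall,
    "B. George is heavy": lambda tall, heavy: heavy,
    "C. George is tall but not heavy": lambda tall, heavy: tall and not heavy,
}

for label, holds in options.items():
    verdict = "follows" if all(holds(t, h) for t, h in consistent) else "does not follow"
    print(f"{label}: {verdict}")
```

The only consistent case is “not tall, not heavy”, which matches the De Morgan reading of the premise: not (tall or heavy) = not tall and not heavy. So only option A follows.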

Such tasks seem purpose-built for assessing an AI’s ability to understand a question and to reason. Answers were graded not only on whether the final letter was correct: all the logical conclusions were checked carefully. An answer containing a gross logical error was not counted, and if the reasoning was not rigorous enough, the score was lowered proportionally.

GPT-3.5 model results: 5, 0, 0, 3.5, 0, 1.3, 1, 0, 0. Total: 10.8%

GPT-4 model results: 10, 7.8, 9.3, 7.2, 9.5, 8.7, 4.5, 10, 10. Total: 77%

On the logic tasks the difference is no longer a factor of 2 but a factor of 7, which is impressive.

We look forward to new versions of the models, even stronger in mathematics, logic, and common sense. All the best!

Appendix (task texts):

Task: Lyova, Gena, Vasya, Tolya, and Misha had three drums and two trumpets between them. Which musical instrument did each boy have, if Gena, Lyova, and Misha had identical instruments?

Task: There are 16 candies. How do you divide them between Kolya and Petya so that Kolya has two more candies than Petya?

Task: A book costs one dollar plus half the cost of the book. How many dollars does the book cost?

Task: Vanya has as many brothers as sisters, while his sister has half as many sisters as brothers. How many sisters and how many brothers are there in this family?

Task: Several ducks swim one after another. If you take one of these ducks, there are two ducks in front of it. If you take another of these ducks, there are two ducks behind it. If you take a third of these ducks, there is one duck in front of it and one duck behind it. What is the minimum number of ducks that fits this description?

Task: Two trains 200 km apart start moving toward each other, each at a speed of 50 km/h. A fly starts from one train and flies toward the other at a speed of 75 km/h. Having reached the other train, the fly turns back toward the first, and so on until the trains meet. What distance will the fly have flown in that time?

Task: A full bucket of milk weighs 10 kilograms. The same bucket filled halfway weighs 6 kilograms. How much does the empty bucket weigh?

Task: Kolya, Vasya, and Borya played checkers. Each of them played two games. How many games were played in total? Note that each game of checkers is played by two people.

Task: Two years ago, the brother’s age plus the sister’s age was 15 years. The sister is now 13 years old. In how many years will the brother be 9 years old?

Task: A city has two types of inhabitants: liars, who always lie, and knights, who always tell the truth. A traveler met two residents of the city, and the first of them said: “At least one of us is a liar.” Which of the two is a liar and which is a knight?
