LLMs as a universal “master key” for students – are they really all that good?

Introduction

Let's immediately outline what will not happen here:

– discussions about what “intelligence” is and whether ChatGPT and others like them have it;

– categorical statements along the lines of “never / in five years we will all be driven out onto the street by cheap AI agents”;

– debates about which is better – R, Python, Julia, a calculator, an abacus, or mental arithmetic;

– loud conclusions about which LLM is the coolest.

Let me also sketch the setting: I teach at a university, and for the fourth year I have been teaching courses related to data analysis; my main research tool, and the tool I teach, is R. My students vary widely – bachelor's and master's students in economics and in information sciences – but for almost all of them my course is their first exposure to computer-based data analysis. The program is roughly the same for everyone, and we go through it in sequence: working with data, graphics, statistical tests, and classical algorithms for regression, classification, and clustering problems.

That's it for the preamble. Now to the main story.

Situation

Sometime around October, some of my students started submitting very strange solutions. Checking them, I got the feeling that the intelligence of the author was oscillating like a sine wave with a period of a few minutes, or even seconds: a complete, adequate solution with conclusions for one task would be followed by an answer showing that the student had not even tried to understand the wording of the next one, while the overall style and tone of the text was quite unlike that of a typical student. It quickly became clear that this was GPT style in its worst form: the person simply pastes the assignment into an LLM and then pastes the answer back, without even trying to edit it. Naturally, those works were returned with the appropriate comments, and the following questions came to the fore:

1. Where is the acceptable limit for using LLMs when solving assignments?

2. In general, how well can LLMs solve entry-level data analysis tasks?

3. How accurately can a teacher distinguish a human solution from a computer solution?

The answer to the first question came quickly enough: there is no such limit. What matters is that the person (alone or together with an AI) completes the task accurately and correctly and demonstrates an understanding of what was done through well-written conclusions.

The need to answer the second and third questions gave rise to the idea of an experiment.

Experiment

I asked one of my students (I picked, as you would expect, a smart and lazy one) to solve the problems using whatever free LLMs he could get his hands on. He took six: Gemini 1.0 Pro, Copilot (Bing), YandexGPT 3 Pro, GPT-3.5 Turbo, GigaChat, and BLACKBOX (he somehow even managed to feed the dataframe into them).
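
One plausible way to hand a dataframe to a chatbot (an illustration only, not necessarily how it was actually done here) is to print part of it as CSV text and paste that text into the prompt:

# Illustration only: print the first rows of the dataframe as CSV text
# that can be copied and pasted into a chat prompt.
write.csv(head(DF, 20), row.names = FALSE)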

The tasks are here:

1. Select the observations from district A from the dataframe and test the hypothesis that the distribution of the number of fire crews follows a negative binomial distribution. Draw conclusions.

2. Select the observations of fires in districts with large rivers from the dataframe and test the hypothesis that the distribution of the amount of material damage follows a uniform distribution. Draw conclusions.

3. Investigate the type of distribution of the ratio of the number of fire crews to the number of units of equipment. Draw conclusions.

4. Test the hypothesis that the mean values of the fire area variable are equal in districts with and without forestry. Draw conclusions.

5. Test the hypothesis that the medians of the number of units of equipment are equal in districts with and without large lakes. Draw conclusions.

6. Test the hypothesis that the variances of the fire area variable are equal in districts with and without mountains. Draw conclusions.

7. Test the hypothesis that the variances of the number of units of equipment are equal across all districts under consideration. Draw conclusions.

8. Select the observations for 2016 from the dataframe and test the hypothesis that the proportions of the variables “presence of mountains” and “presence of clay soils” are equal. Draw conclusions.

9. Test the hypothesis that the mean and the variance of the amount of material damage are simultaneously equal in districts D and B. Draw conclusions.

10. Select the observations for 2012 from the dataframe and test the hypothesis that the variances of the amount of material damage are equal in districts with and without clay soils. Draw conclusions.

The dataset (synthetic!) is a table with the following variables (a small illustrative sample is sketched after the list):

District – District;

Height – Average height above sea level;

Rivers – Presence of large rivers;

Mountain – Presence of mountains;

Lakes – Presence of large lakes;

Clay_Soils – Presence of clay soils;

Forestry – Presence of forestry areas;

Area – Forest fire area, hectares;

Time – Time until the fire was extinguished, h;

Damage – Amount of material damage from fire, thousand rubles;

Experses – Amount of fire extinguishing costs, thousand rubles;

Precipitation – Total precipitation for the month, mm;

Bridges – Number of fire crews involved in extinguishing the fire;

Equimpent – Number of units of equipment involved in extinguishing the fire;

Class – Fire class.
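
To make the structure concrete, here is a minimal sketch that generates a dataframe with the same columns. All values and ranges are invented purely for illustration; only the column names follow the list above (a Year column is added because Tasks 8 and 10 filter on it, even though it is not in the list):

# Purely illustrative sketch of the synthetic dataset's structure;
# values and ranges are made up, only the column names match the list above.
library(tibble)
set.seed(42)
n <- 10
DF_example <- tibble(
  District      = sample(LETTERS[1:4], n, replace = TRUE),
  Height        = round(runif(n, 100, 900)),       # average height above sea level
  Rivers        = rbinom(n, 1, 0.5),               # 1 = large rivers present
  Mountain      = rbinom(n, 1, 0.3),
  Lakes         = rbinom(n, 1, 0.4),
  Clay_Soils    = rbinom(n, 1, 0.5),
  Forestry      = rbinom(n, 1, 0.5),
  Area          = round(runif(n, 1, 500), 1),      # hectares
  Time          = round(runif(n, 1, 48), 1),       # hours
  Damage        = round(runif(n, 1, 3343)),        # thousand rubles
  Experses      = round(runif(n, 1, 1000)),        # thousand rubles (column name as in the dataset)
  Precipitation = round(runif(n, 0, 200)),         # mm per month
  Bridges       = rpois(n, 4),                     # number of fire crews (column name as in the dataset)
  Equimpent     = rpois(n, 6),                     # units of equipment (column name as in the dataset)
  Class         = sample(c("A", "B", "C"), n, replace = TRUE),
  Year          = sample(2010:2020, n, replace = TRUE)  # used in Tasks 8 and 10
)
head(DF_example)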

The expected solution would look something like this:

#### Task 1 ####
library(tidyverse)
library(fitdistrplus)
DF_select <- DF %>% filter(District == 'A')
gofstat(fitdist(DF_select$Bridges, "nbinom"))
# The p-value is below 0.05 (0.0005783625), so we reject the hypothesis that the distribution
# of the number of fire crews in district A follows the negative binomial distribution

#### Task 2 ####
library(EDFtest)
DF_select <- DF %>% filter(Rivers == 1)
gof.uniform(DF_select$Damage) # does not work
fitdist(DF_select$Damage, "unif")
plotdist(DF_select$Damage, "unif", para = list(min=1, max=3343))
# The plots of the theoretical and empirical densities show that the distribution
# of the amount of material damage in districts with large rivers is not uniform

#### Task 3 ####
DF_select <- DF %>% mutate(Ratio = Bridges / Equimpent)
descdist(DF_select$Ratio)
# Candidate hypotheses: gamma, exponential, and beta distributions
gof.gamma(DF_select$Ratio) # does not work
gof.gamma.bootstrap(DF_select$Ratio, M=1000)
# Does not follow a gamma distribution
gof.exp.bootstrap(DF_select$Ratio, M=1000) 
# Does not follow an exponential distribution
fitdist(DF_select$Ratio/max(DF_select$Ratio), "beta", method = "mme")
plotdist(DF_select$Ratio/max(DF_select$Ratio), "beta", para = list(shape1=1.107145, shape2=5.665215))
# A beta distribution is doubtful
# The distribution of the ratio of the number of fire crews to the number of units of equipment
# does not follow a beta, gamma, or exponential distribution, but is most likely a mixture of them

#### Task 4 ####
library(doex)
AF(DF$Area, DF$Forestry)
# The p-value is below 0.05, so we reject the hypothesis that the mean fire area is equal
# in districts with and without forestry

#### Task 5 ####
library(BSDA)
Sample_1 <- DF[DF$Lakes==1,"Equimpent"]
Sample_2 <- DF[DF$Lakes==0,"Equimpent"]
SIGN.test(sample(Sample_1,length(Sample_2),replace = TRUE),Sample_2)
# The p-value is below 0.05, so we reject the hypothesis that the medians of the number of units
# of equipment are equal in districts with and without large lakes

#### Task 6 ####
Sample_1 <- DF[DF$Mountain==1,"Area"]
Sample_2 <- DF[DF$Mountain==0,"Area"]
var.test(Sample_1,Sample_2)
# The p-value is above 0.05, so we do not reject the hypothesis that the variances of the fire area
# are equal in districts with and without mountains

#### Task 7 ####
library(stats)
bartlett.test(DF$Equimpent, DF$District)
# The p-value is above 0.05, so we do not reject the hypothesis that the variances of the number
# of units of equipment are equal across the districts

#### Task 8 ####
DF_select <- DF %>% filter(Year == 2016)
prop.test(table(DF_select$Mountain,DF_select$Clay_Soils)) # throws an error
table(DF_select$Mountain,DF_select$Clay_Soils)
# Since the observed subsample contains no fires in mountainous districts,
# the hypothesis about the equality of proportions cannot be tested

#### Task 9 ####
library(SHT)
mvar2.LRT(DF[DF$District=="B","Damage"],DF[DF$District=="D","Damage"])
# The p-value is below 0.05, so we reject the hypothesis that the mean and the variance
# of the material damage are simultaneously equal in districts B and D

#### Task 10 ####
DF_select <- DF %>% filter(Year == 2012)
fligner.test(DF_select$Damage, DF_select$Clay_Soils)
# The p-value is above 0.05, so we do not reject the hypothesis that the variances of the material
# damage are equal in districts with and without clay soils in 2012
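
A side note on Task 2, where gof.uniform() errors out and the expected solution falls back on a graphical check: one possible numerical cross-check (a sketch of mine, not part of the graded solution) is a Kolmogorov–Smirnov test against a uniform distribution with the fitted bounds. Because min and max are estimated from the same sample, the p-value is only approximate:

# Illustrative alternative for Task 2 (not part of the expected solution):
# compare Damage in river districts against a uniform distribution with fitted bounds.
library(tidyverse)
library(fitdistrplus)
DF_rivers <- DF %>% filter(Rivers == 1)          # the same subset as in Task 2
fit_unif  <- fitdist(DF_rivers$Damage, "unif")   # estimates the min and max parameters
ks.test(DF_rivers$Damage, "punif",
        min = fit_unif$estimate["min"],
        max = fit_unif$estimate["max"])
# Caveat: the uniform bounds are estimated from the same data, so the p-value is
# approximate; a very small p-value still argues against uniformity.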

The answer from the best of the 6 models looked like this:

Experiment results

The evaluation results for all the models are presented in the table below (incidentally, the file names were anonymized before grading, so I did not know whether I was checking a student's solution or an LLM's).

To my immense pleasure, all the models scored less than half of the possible points, which is an “unsatisfactory” grade. The best student scored 7.8 points out of 10, which is a “good.”

Conclusions

In short: free LLMs can solve individual entry-level data analysis problems well, but they cannot cope with sets of such tasks (I realize this conclusion sounds banal, but it is true). Most often they give themselves away in the conclusions, which tend to be verbose and watery.

If you have already put your hands on the keyboard to write (or even prove) that GPT-4 (which was updated just today), or something yet to come, will crack these assignments like nuts – don't bother, I am not going to argue with you.

Personally, I figured out the main thing for myself: how to redesign the tasks when I need them to be “hard to solve” for LLMs. That is the first takeaway.

Second, it seems that in the wake of the “education is a service” thesis, too many people have forgotten a basic idea: from the student's point of view, education is a process of cognition, of learning something new (forming new neural connections, and so on). And the result of this can be verified extremely simply, by a classic conversation between teacher and student. A simple conversation on the topic, without any “but this has to be memorized and learned,” reveals the student's result and even the degree of effort put in.

Third, I am actually glad that LLMs have appeared. With their help, the best students will achieve more, and the worst, on the contrary, less (because they will procrastinate until the last minute: “why bother, the bot will do everything for me”). I very much hope that LLMs will push tests out of the system for assessing learning outcomes (because tests are not robust to LLMs), and that the classical work of a teacher will come to be valued more.

And, well, I simply enjoy coming up with new tasks and task sets myself.

Thank you to everyone who has read this far, and I express my gratitude to my collaborator in this work, Ivan Ivanovich B., a third-year student. I think it will be interesting to repeat the experiment in a year.
