Human vs GPT-4

The use of large AI language models has become extremely common in recent years. The popularity of ChatGPT has sparked a sea of discussions about the proper use of such systems, on both the practical and the ethical side of the issue. When evaluating a particular AI, scientists compare its capabilities with those of the human brain. For example, scientists from the University of Arkansas (USA) conducted a study comparing the creative thinking of humans and ChatGPT-4. What parameters were compared, how did ChatGPT perform, and what conclusions can be drawn from the results of this study? We will find the answers to these questions in the scientists' report.

Basis of the study

The emergence of ChatGPT, a natural language processing (NLP) model developed by OpenAI, has caused a lot of discussion about the benefits and harms of artificial intelligence. OpenAI's Generative Pre-Trained Transformer (GPT) is a type of machine learning model that specializes in pattern recognition and prediction. It is trained using reinforcement learning from human feedback (RLHF) so that ChatGPT's responses are indistinguishable from human ones.

OpenAI recently touted its new model, GPT-4, as "more creative" than previous versions. Creativity, as a phenomenological construct, has not been immune to the influence of AI. For example, researchers have begun evaluating AI models on their ability to make appropriate design decisions and to reason about them. These assessments focus on convergent thinking, i.e., identifying the single optimal solution to a predefined problem. Although convergent thinking emphasizes one optimal solution, it does not rule out original or non-obvious solutions. However, convergent thinking tasks, by their nature, typically do not allow for flexible or divergent thinking. In contrast, divergent thinking involves generating multiple creative solutions to a problem.

When researching creativity, scholars typically focus on the divergent dimension (rather than the convergent one), considering the associative mechanisms that underlie people's ability to generate creative solutions (i.e., creativity). In particular, divergent thinking is considered an indicator of a person's creative potential, but it does not guarantee creative achievement; it points to future ability rather than being an immediate trait that determines whether a person is creative. Accordingly, a person's creative potential is assessed using divergent thinking tasks such as "alternative uses" and "consequences". Divergent thinking tasks can be scored on three parameters: fluency (number of responses), originality (novelty of responses), and elaboration (length/detail of responses). Each response is scored on each parameter, and these scores are used to assess individual differences in divergent creativity, in other words, a person's creative potential.

The prevalence of AI has led scientists to try to compare the creative potential of humans and artificial intelligence. On the one hand, some researchers argue that the human cognitive mechanisms involved in creative tasks are absent in AI, and therefore AI creativity can only ever be an artificial imitation of creativity. On the other hand, computational creativity involves parallel networks that mirror the iterative, deliberative, and generative processes people go through when searching for creative solutions. Although these processes have been shown to help in finding creative solutions, they do not guarantee success: a person may become fixated on an idea, which can block the generation of other creative solutions. A machine does not experience this phenomenon metacognitively, since it is trained through computation. A machine's fixation on a specific solution reflects the outcome of its training (computational processes), not its creative potential.

How are machines able to determine what is creative? This is another important question with no definitive answer yet. Currently, AI's inability to explicitly define what is considered creative, and why, is compensated for by human assistance. For example, human intervention is needed to supply relevant training data and to shape the output so that it becomes more linguistically natural. This computational limitation suggests that AI is incapable of divergent thinking due to a lack of metacognitive processes (i.e., evaluation, motivation to perform tasks), since AI cannot generate creative ideas, or even reproduce existing ones, without human intervention (i.e., input).

The question of AI creativity lies not only in the realm of computational capability, but also in that of philosophical perception. Many results of creative AI work are judged by humans as less creative than the work of real people, even though, for example, paintings produced by AI and by humans can be practically identical. The knowledge that a given painting was created by AI becomes a decisive factor in its evaluation.

In the work we are considering today, scientists decided to compare human creative thinking and ChatGPT-4. In particular, divergent thinking was assessed.

Preparing for the study

151 people took part in the study. Each individual interaction with ChatGPT-4 was counted as a separate AI test subject. As a result, 151 interactions were conducted to ensure that there were equal numbers of human and AI participants.

The alternative uses task (AUT) was used to assess divergent thinking. In this task, participants were presented with an object ("fork" or "rope") and asked to come up with as many creative uses for it as possible. Responses were scored on fluency (number of responses), originality (uniqueness of responses), and elaboration (number of words per valid response). Participants were given 3 minutes to come up with answers for each object.

Because the goal was to control for fluency, the researchers removed quantity cues from the instructions given to GPT-4. Likewise, GPT-4 needed no time limit, unlike the human participants, since the researchers specified the exact number of responses required.

The consequences task (CT) is part of the verbal section of the Torrance Tests of Creative Thinking (TTCT); it presents prompts describing hypothetical scenarios (e.g., what would happen if people no longer needed sleep?). As with the AUT, people had to list as many consequences as possible within a given time period. Responses were scored on fluency (number of responses), originality (uniqueness of responses), and elaboration (number of words per response).

Participants were given two statements, shown independently of each other: "imagine people no longer needing sleep" and "imagine people walking on their hands." As with the AUT, fluency and timing parameters were omitted from the GPT instructions.

The divergent association task (DAT) measures divergent, verbal semantic creativity. In this task, participants were asked to come up with 10 nouns that were as different from each other as possible; the nouns could not be proper nouns or technical terms. Semantic distance is computed for every pair of the 10 nouns using cosine distance, and the average distance across all pairwise comparisons is multiplied by 100 to produce the final DAT score. Higher scores indicate greater distances (i.e., less similar words). The task instructions were the same for human and GPT-4 participants, and there was no time limit. The average human response time was 126.19 seconds, and the average DAT score was 76.95. Participants who provided fewer than 7 responses were excluded from the analysis.
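As a rough sketch of how a DAT score can be computed (the average pairwise cosine distance between word vectors, scaled by 100), the snippet below uses a placeholder embed() function that returns random vectors in place of a real embedding model such as GloVe; it illustrates the arithmetic only, not the actual scoring service.

```python
# Hedged sketch of DAT scoring; embed() is a placeholder, not a real model.
from itertools import combinations
import numpy as np

def embed(word: str) -> np.ndarray:
    # Placeholder: real DAT scoring uses vectors from a pretrained embedding
    # model (e.g., GloVe). Deterministic random vectors keep the sketch runnable.
    rng = np.random.default_rng(abs(hash(word)) % (2**32))
    return rng.normal(size=300)

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def dat_score(nouns: list[str]) -> float:
    # Average cosine distance over all pairwise comparisons, scaled by 100.
    vectors = [embed(n) for n in nouns]
    distances = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
    return 100.0 * float(np.mean(distances))

words = ["cat", "volcano", "algebra", "spoon", "democracy",
         "mist", "guitar", "protein", "harbor", "irony"]
print(dat_score(words))
```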

Research results

Both human and GPT-4 responses were processed to remove incomplete or irrelevant responses in both experiments. For AUT, 0.96% of responses were removed, and for CT, 4.83% were removed. The same procedure was performed for GPT responses: <0.001% removed for AUT and CT.

Traditional methods for scoring divergent thinking tasks require human raters (i.e., the score of a response is determined by consensus among multiple raters). In this study, the researchers used the Open Creativity Scoring (OCS) tool to automate originality scoring objectively, assigning each response a semantic distance (uniqueness) score. Unlike human scoring, which is affected by many factors (e.g., fatigue, bias, time pressure) that can introduce confounds, automated tools such as OCS bypass these human-centric issues and have been found to correlate highly with human ratings.

The OCS tool was used to score the AUT and CT tasks. Specifically, its semantic distance estimator, built on the GloVe 840B word-embedding model, evaluates the originality of an answer by representing the prompt and the response as vectors in semantic space and computing the cosine of the angle between them. The OCS tool also scores elaboration using a stop-list method. The prompts for the AUT were "rope" and "fork," and for the CT they were "people don't sleep" and "people walk on their hands."
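For illustration only (this is not the OCS implementation), the sketch below scores originality as the cosine distance between averaged word vectors for the prompt and the response, and elaboration as a stop-list-filtered word count. The embedding file path, the loader, and the tiny stop list are all assumptions made for the example.

```python
# Hedged sketch of semantic-distance originality and stop-list elaboration.
import numpy as np

STOP_WORDS = {"a", "an", "the", "to", "or", "as", "you", "could", "use"}  # toy stop list

def load_vectors(path: str) -> dict[str, np.ndarray]:
    # Reads a plain-text embedding file ("word v1 v2 ... vN" per line); the
    # path and format are assumptions, standing in for GloVe-style vectors.
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def mean_vector(text: str, vectors: dict[str, np.ndarray]) -> np.ndarray:
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def originality(prompt: str, response: str, vectors: dict[str, np.ndarray]) -> float:
    # Cosine distance between prompt and response vectors: higher = more original.
    u, v = mean_vector(prompt, vectors), mean_vector(response, vectors)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def elaboration(response: str) -> int:
    # Stop-list method: count the words that are not on the stop list.
    return sum(1 for w in response.lower().split() if w not in STOP_WORDS)
```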

As expected, an independent-samples t-test revealed no significant differences in overall verbal fluency between humans and GPT-4. To assess the originality of responses using semantic distance, the researchers ran an analysis of variance with group (human, GPT-4) and prompt (fork, rope) as factors. The model revealed significant main effects of both group and prompt on originality, as well as a significant group × prompt interaction. Specifically, both samples had higher originality scores for the prompt "fork" than for "rope," but GPT-4 scored higher on originality regardless of the prompt. Follow-up analyses showed that all pairwise comparisons differed significantly, with the exception of human responses to "fork" versus GPT-4 responses to "rope." Overall, GPT-4 was more successful than humans at producing varied answers and showed higher originality, although the size of its advantage depended on the prompt (graph below).
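A minimal sketch of these analyses in Python, assuming a hypothetical long-format table with columns group, prompt, fluency, and originality (this is not the authors' analysis code), might look like this:

```python
# Hedged sketch: independent-samples t-test plus a 2 (group) x 2 (prompt) ANOVA.
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("aut_scores.csv")  # placeholder file with hypothetical columns

# t-test on fluency: humans vs GPT-4.
human = df.loc[df["group"] == "human", "fluency"]
gpt4 = df.loc[df["group"] == "gpt4", "fluency"]
print(stats.ttest_ind(human, gpt4))

# ANOVA on originality with the group x prompt interaction term.
model = smf.ols("originality ~ C(group) * C(prompt)", data=df).fit()
print(anova_lm(model, typ=2))
```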



Image #1

Next, the scientists compared the elaboration of human and GPT-4 responses. Fluency differs from elaboration in that fluency counts every distinct answer contained in a single sentence: for example, the response "you could use a fork to knit or as a hair comb" contains 2 original answers, while its elaboration is 12 (the word count of "you could use a fork to knit or as a hair comb"). An independent t-test indicated that elaboration was significantly higher for GPT-4 than for humans.

The scientists then began analyzing the CT test results. To evaluate the originality of responses using semantic distance measures, the researchers conducted an analysis of variance (group: human, GPT; prompt: “no more sleep,” “walk on your hands”).

There was a significant group × prompt interaction. In particular, originality was slightly higher for the hand-walking prompt in the GPT sample, whereas the human sample showed no significant difference in originality between the two prompts. As in the previous test, GPT-4 was more successful than humans at producing varied answers and showed higher originality, although the size of its advantage depended on the prompt (graph below).



Image #2

The scientists then calculated the difference in response elaboration between humans and GPT-4. An independent t-test showed that elaboration was significantly higher in the GPT-4 sample than in the human sample.

Next, the scientists analyzed the DAT results. Humans produced a higher number of unique words (n = 523), representing 69.92% of their total responses, compared to GPT's unique words (n = 152), which made up 47.95% of its total responses. Overlap between the two groups' responses totaled 9.11% (n = 97). Words that appeared exclusively in human responses accounted for 87.03% (n = 651), compared to 69.40% (n = 220) for GPT.
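One way such unique-word and overlap counts could be computed is with simple set operations; the word lists below are placeholders, and the denominators used for the percentages in the paper may differ from this sketch.

```python
# Hedged sketch of the unique-word and overlap counts using set operations.
human_words = ["cat", "volcano", "algebra", "spoon"]  # placeholder data
gpt_words = ["cat", "nebula", "harbor", "spoon"]      # placeholder data

human_set, gpt_set = set(human_words), set(gpt_words)
shared = human_set & gpt_set        # words produced by both groups
human_only = human_set - gpt_set    # words that appear only in human responses
gpt_only = gpt_set - human_set      # words that appear only in GPT responses

print(f"unique human words: {len(human_set)} of {len(human_words)} responses")
print(f"unique GPT words:   {len(gpt_set)} of {len(gpt_words)} responses")
print(f"shared: {len(shared)}, human-only: {len(human_only)}, gpt-only: {len(gpt_only)}")
```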

Differences in semantic distance scores between human and GPT-4 DAT responses were then calculated. An independent-samples t-test showed that GPT responses had higher semantic distances than human responses. Although human participants produced a wider range of unique responses, this greater uniqueness in verbal output did not appear to translate into higher semantic distance scores at the group level.

For more detailed information, I recommend taking a look at the scientists' report.

Epilogue

In the work we reviewed today, scientists conducted a series of tests, the purpose of which was to determine the degree of creativity of GPT-4. The main criterion for this assessment was divergent thinking, which is associated with the generation of a unique and creative solution to a particular problem.

In these experiments, several tasks were presented to humans and GPT-4: alternative uses (AUT), consequences (CT), and divergent associations (DAT). The first task asked for the most creative possible uses of two objects (a fork and a rope). The second required participants to come up with as many consequences as possible for a given premise (people no longer need sleep; people walk on their hands). The third asked for words that were as different from each other as possible. The test results were scored on fluency (number of answers), originality (uniqueness of answers), and elaboration (length/detail of answers).

Analysis of the results showed that GPT-4 gave more original and more detailed answers. However, as the scientists note, this study aimed to measure the creative potential of GPT-4, which is a separate parameter from creative skill. Artificial intelligence, unlike humans, does not have free will and therefore depends directly on human help; consequently, AI creativity is in a constant state of stagnation. It is also worth noting that people took a more considered approach to the tasks, trying to give answers that did not stray too far beyond the limits of reality. GPT-4 has no such constraint, which is one reason it was able to provide more answers. Another factor that sets GPT-4 apart from humans is motivation. The people in the study may not have been maximally motivated to generate more creative responses, whereas GPT-4, roughly speaking, does not know what motivation is, which is also why it produced so many answers.

The main conclusion of this study is that large language models, of which GPT-4 is one, are rapidly evolving. Whether they will replace humans in creative work or become a helpful tool is difficult to say. In any case, everything depends on how exactly humans choose to use AI.

