GPT – technology for good

People have used every powerful new technology for both good and evil: when gunpowder was invented, some made fireworks and salutes from it, and others made weapons and bombs; when we learned to split the atom, we began to produce cheap electricity as well as bombs capable of destroying cities and countries. The technology I will describe in this article can invent new drugs without side effects, recognize diseases before any doctor, write interesting books, and help humanity solve many problems; but in the wrong hands it can also create dangerous substances, write computer viruses, manipulate people, and much more. Pandora's box has already been opened, and many companies and ordinary people around the world already have this technology, so my task is to simplify the understanding of this miracle. I hope you will use the knowledge you gain only for good!

My name is Georgy Gashokin, and I have been programming since I was seven years old…

In the world of rapidly evolving technologies, big data, and artificial intelligence, models like GPT occupy a special place. They can generate text, translate languages, write various types of creative content, and answer your questions informatively. But like any powerful tool, GPT can be used for good or evil.

In this article, we'll walk through my code that demonstrates how to create and train a simplified version of a GPT model, and discuss the ethical issues associated with using such technologies.

Getting to the point:

Part 1: Loading and Processing Data

import time
import joblib
import pandas as pd
import tensorflow as tf  # tested with TF 2.5.2
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization, ActivityRegularization, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
from tensorflow.keras.callbacks import ReduceLROnPlateau, LearningRateScheduler, EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import TimeDistributed
import tensorflow_datasets as tfds
from tensorflow_addons.optimizers import LAMB
from tensorflow.keras.regularizers import l1, l2
from transformers import AutoTokenizer
from tensorflow.keras.models import load_model

# Load the data
with open("vopros_otvet.txt", "r", encoding="utf-8") as file:
    data = file.read()

in_text_max = 128       # Maximum length of the input sequence
out_text_max = 300      # Maximum length of the generated text

# model parameters      gpt3     gpt4
embedding_dim = 16      # 768    # 2048 (dimensionality of the embedding vectors)
num_heads = 3           # 12     # 64   (number of attention heads in MultiHeadAttention)
dense_dim = 8           # 3072   # 4096 (number of neurons in the Dense layers)
num_layers = 3          # 96     # 96   (number of Transformer layers)
dropout = 0.3           # 0.3           (dropout probability)

train_text_but = 1     # Training flag (1 - train the model, 0 - skip training)
generate_text_but = 1  # Generation flag (1 - generate text, 0 - skip generation)

In this code block we import the necessary libraries, load the data from a file, and define the main model parameters. Next to each parameter I noted the values those parameters most likely have in the real GPT-3 and GPT-4. If you have little VRAM, set the parameters lower so that the model fits entirely into the video card's memory; perhaps in another article I will build a model that loads into regular RAM instead…

As for the file vopros_otvet.txt: it is needed to train the model. In it, I saved a set of questions and answers in the following format:

hello! <VOP> 
Hello! How can I help you? <END> 

Which AK caliber is preferable, 5.45 or 7.62? <VOP> 
As with any choice, the answer to this question depends on the specific circumstances and the task at hand.
The 5.45 mm caliber was developed in the USSR in the 1970s as a response to the American 5.56 mm caliber used in the M16 assault rifle. 5.45 mm reduces recoil and improves accuracy at long range, which makes it preferable for shooting at long distances. In addition, 5.45 mm cartridges are lighter and more compact, which lets you carry more ammunition.
The 7.62 mm caliber is more traditional for automatic weapons and is used in the AK-47 and AKM. It has greater stopping power, which makes it preferable at close range.
So, if you need to shoot at long distances with higher accuracy, 5.45 mm is preferable; otherwise 7.62 mm may be the better choice. In any case, the choice of caliber should be justified and depend on the specific task and conditions. <END> 

Note that the question ends with <VOP> and the answer ends with <END>. This is important: the neural network learns that after <VOP> it must generate an answer, and as soon as you see that the network has written <END> during generation, it has finished its thought and it is time to stop it, otherwise it will continue to generate nonsense…
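The generation loop in Part 6 below does not actually perform this stop check itself, so it is worth adding one. A minimal sketch, assuming the output_word variable from that loop:

# Minimal sketch of an early-stop check for the generation loop in Part 6:
# as soon as the model emits the end-of-answer token, stop generating.
if output_word == "<END>":
    break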

Part 2: Tokenization

A tokenizer is a text-processing tool that breaks text into individual elements called tokens. In short, the tokenizer looks at all the text in the vopros_otvet.txt file, breaks it into simple repeating parts (prefixes, suffixes, and roots of words, individual letters and punctuation marks), and assigns each token a numeric ID. This makes the text much easier for the neural network to work with, since it only understands numbers.

# Tokenization
special_tokens = ["<START>", "<END>","<VOP>","<сrypto-15:>"] # Special tokens
oov_token = "<OOV>"                        # Token for words missing from the vocabulary
tokenizer = Tokenizer(lower=False,         # Do not lowercase the text
                      split=" ",         # Split on spaces
                      filters="",        # Do not filter out any characters
                      oov_token=oov_token, # Token for unknown words
                      #char_level=True     # Character-level tokenization is disabled
                      )

tokenizer.fit_on_texts(data.split('\n'))   # Fit the tokenizer on the data
# Add the special tokens to word_index (indices start at 1, so offset by len + 1
# to avoid colliding with the last existing index)
tokenizer.word_index.update({tok: len(tokenizer.word_index) + idx + 1 for idx, tok in enumerate(special_tokens)})

# Register the special tokens in word_counts
for tok in special_tokens:
    tokenizer.word_counts[tok] = 1
# Update the vocabulary size
tokenizer.num_words = len(tokenizer.word_index) + 1
joblib.dump(tokenizer,'tokenizer.joblib') # Save the tokenizer


total_words = len(tokenizer.word_index) + 1 # Total number of words in the vocabulary
print(tokenizer.word_index) # Print the tokenizer vocabulary
print(total_words) # Print the total number of words


Description:

In this block of code, we create a Tokenizer and train it on our data.

  • special_tokens is a list of special tokens that will be added to the dictionary.

  • oov_token is the token that will be used for words that do not appear in the dictionary.

  • Tokenizer is a class from the keras.preprocessing.text library that is used to tokenize text.

  • fit_on_texts is a method of the Tokenizer class that is used to train the tokenizer on data.

  • word_index is a dictionary that maps words to their numeric indices.

  • word_counts is a dictionary that maps words to their frequency of occurrence in the data.

  • num_words caps the vocabulary size the tokenizer will use; here it is set so that every word in the dictionary is kept.

  • joblib.dump is a function that is used to save the tokenizer to a file.
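
As a quick sanity check, you can feed the trained tokenizer a single line and look at the IDs it returns. A small sketch (the specific numbers depend entirely on your training file, so the IDs in the comment are purely illustrative):

sample = "hello! <VOP>"
ids = tokenizer.texts_to_sequences([sample])[0]  # e.g. [37, 1205] - illustrative IDs only
print(ids)                                # the numbers the network will actually see
print(tokenizer.word_index.get("<VOP>"))  # ID assigned to the special question token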

Part 3: Preparing training data

input_sequences = []
for line in data.split("\n"):
    token_list = tokenizer.texts_to_sequences([line])[0] # Convert the text to a sequence of token IDs
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[: i + 1] # Build the n-grams
        input_sequences.append(n_gram_sequence)


max_sequence_len = in_text_max
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding="pre")) # Pad the sequences with zeros

# Split into x and y
data = [] # Free the raw text, it is no longer needed
xs, labels = input_sequences[:, :-1], input_sequences[:, -1] # Split into inputs and labels
input_sequences = [] # Free the padded sequences as well
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words) # One-hot encode the labels


This block of code is responsible for preparing the data for training the model.

  • texts_to_sequences – a tokenizer method that converts text into a numeric sequence.

  • n-grams are sequences of n words.

  • pad_sequences is a function from the keras.preprocessing.sequence library that is used to pad sequences with zeros to a given length.

  • xs is the input array.

  • labels is an array of labels.

  • to_categorical is a function from the keras.utils library that is used to convert labels to one-hot encoding.
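
To make the n-gram expansion concrete, suppose one line of the file tokenizes to the hypothetical IDs [4, 7, 2, 9]. The loop above then produces one training pair per prefix:

token_list = [4, 7, 2, 9]  # hypothetical token IDs for one line
ngrams = [token_list[: i + 1] for i in range(1, len(token_list))]
print(ngrams)  # [[4, 7], [4, 7, 2], [4, 7, 2, 9]]
# After pre-padding, the last token of each row becomes the label:
#   x = [0, ..., 4]        y = 7
#   x = [0, ..., 4, 7]     y = 2
#   x = [0, ..., 4, 7, 2]  y = 9
# So the model always learns to predict the next token from everything before it.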

Part 4: Creating and Compiling the Model

if train_text_but == 1:

    # Build the model
    print(max_sequence_len-1) # Input length the model expects
    input_layer = Input(shape=(max_sequence_len-1,)) # Input layer
    x = tf.keras.layers.Embedding(total_words, embedding_dim)(input_layer) # Embedding layer
    for i in range(num_layers):
        # Chain the blocks: each Transformer block takes the previous block's output
        transformer_layer = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)(x, x) # MultiHeadAttention layer
        transformer_layer = tf.keras.layers.BatchNormalization()(transformer_layer) # BatchNormalization layer
        transformer_layer = tf.keras.layers.Dropout(dropout)(transformer_layer) # Dropout layer
        transformer_layer = tf.keras.layers.ActivityRegularization(l1=0.001, l2=0.001)(transformer_layer) # ActivityRegularization layer

        dense_layer = tf.keras.layers.Dense(dense_dim, activation='relu')(transformer_layer) # Dense layer
        x = tf.keras.layers.Dropout(dropout)(dense_layer) # Dropout layer
    flatten_layer = tf.keras.layers.Flatten()(x) # Flatten layer (applied once, after the stack)
    output_layer = tf.keras.layers.Dense(total_words, activation='softmax')(flatten_layer) # Output layer

    model = Model(inputs=input_layer, outputs=output_layer) # Build the Model

    # Compile the model
    optimizer = LAMB(learning_rate=0.001) # LAMB optimizer
    model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=['accuracy']) # Compile the model
    model.summary() # Print the model summary


Here we create a model based on MultiHeadAttention layers and compile it using the LAMB optimizer.
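
One caveat: a real GPT is a decoder-only Transformer with a causal mask, so each position attends only to itself and earlier tokens, while the MultiHeadAttention calls above attend in both directions. Since we only ever predict the single next token from a whole prefix, this still trains, but if you want GPT-style attention and are on TensorFlow 2.10 or newer (our imports target 2.5, where this argument does not exist), a sketch would be:

# Sketch, assuming TF >= 2.10: causal (left-to-right) self-attention.
attention_output = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=embedding_dim)(x, x, use_causal_mask=True)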

Part 5: Training the Model

    reduce_lr = ReduceLROnPlateau(monitor="loss",
                                  factor=0.98,
                                  patience=3,
                                  min_lr=0.0000001,
                                  verbose=1
                                  ) # Callback that lowers the learning rate on a plateau
    checkpointer = ModelCheckpoint(
                        filepath="checkpointer.ckpt",
                        monitor="loss", verbose=1, save_weights_only=True) # Callback that saves the model weights
    call_acc = tf.keras.callbacks.ModelCheckpoint(
                        filepath="testing_accuracy.ckpt",
                        monitor="accuracy", verbose=1, save_best_only=True, mode="max") # Callback that keeps the best-accuracy model
    call_loss = tf.keras.callbacks.ModelCheckpoint(
                        filepath="testing_loss.ckpt",
                        monitor="loss", verbose=1, save_best_only=True, mode="min") # Callback that keeps the lowest-loss model
    early_stopping = EarlyStopping(monitor="loss", patience=50, restore_best_weights=True) # Callback that stops training when the loss stops improving

    def schedule(epoch, lr):
        if epoch % 1 == 0:
            print('lr- ', round(lr, 8))
            if lr > 0.000001:
                lr = lr * 0.995
        return lr
    lr_scheduler = LearningRateScheduler(schedule) # Callback that decays the learning rate on a schedule

    # Train the model
    model.fit(xs, ys,
              epochs=100,
              verbose=1,
              batch_size=32,
              callbacks=[
                        lr_scheduler,
                        reduce_lr,
                        early_stopping,
                    ],
              shuffle=True) # Train the model
    model.save('GPT-3-my.h5') # Save the full model
    model.save_weights('model_weights_part_{}.h5'.format(1)) # Save the model weights


Description:

This block of code is responsible for training the model.

  • ReduceLROnPlateau is a callback that reduces the learning rate if the monitored quantity value has stopped improving.

  • ModelCheckpoint is a callback that saves the model weights.

  • EarlyStopping is a callback that stops training if the monitored quantity value has stopped improving.

  • LearningRateScheduler is a callback that changes the learning rate on a schedule.

  • fit is a method of the Model class that is used to train the model.

  • epochs is the number of training epochs.

  • verbose is the level of detail of the information output during training.

  • batch_size is the batch size.

  • callbacks is a list of callback functions.

  • shuffle is a flag that specifies whether to shuffle the data before each epoch.

  • save is a method of the Model class which is used to save the model.

  • save_weights is a method of the Model class that is used to save the model weights.
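
Note that checkpointer, call_acc, and call_loss are defined above but never passed to fit, so those checkpoint files are not actually written. A minimal sketch of wiring them in and restoring weights after an interrupted run (using the file names defined above):

# Add the checkpoint callbacks so the weight files are actually written:
#   callbacks=[lr_scheduler, reduce_lr, early_stopping, checkpointer, call_acc, call_loss]
# Later, rebuild the model (Part 4) and restore the last saved weights:
model.load_weights("checkpointer.ckpt")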

Part 6: Text Generation

if generate_text_but == 1:
    model = load_model('GPT-3-my.h5') # Load the model
    tokenizer = joblib.load('tokenizer.joblib') # Load the tokenizer
    # Text generation
    def generate_text(seed_text, next_words, model, max_sequence_len):
        for _ in range(next_words):
            token_list = tokenizer.texts_to_sequences([seed_text])[0] # Convert the text to a sequence of token IDs
            token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre') # Pad the sequence with zeros
            predicted = model.predict(token_list, verbose=0) # Predict the next word
            output_word = ""
            for word, index in tokenizer.word_index.items():
                if index == np.argmax(predicted): # Find the word with the highest probability
                    output_word = word
                    print(output_word)
                    break
            seed_text += " " + output_word # Append the predicted word to the text
            # time.sleep(1)
        return seed_text.title() # Return the generated text (title-cased)


    text_in = 'Как сделать LSD? ' # Seed text
    generated_text = generate_text(text_in, out_text_max, model, max_sequence_len) # Generate the text
    print(generated_text) # Print the generated text


Here we load the trained model and tokenizer and then generate text based on the given seed text.

Description:

This block of code is responsible for generating text.

  • load_model is a function from the keras.models library that is used to load the model.

  • generate_text is a function that generates text.

  • seed_text is the initial text.

  • next_words is the number of words to generate.

  • predict is a method of the Model class that is used to make predictions.

  • argmax is a function from the numpy library that returns the index of the maximum element in an array.
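
A small aside: scanning word_index for every generated word is O(vocabulary size). The Keras Tokenizer also keeps a reverse dictionary, index_word, built during fit_on_texts; since our special tokens were added to word_index by hand, they need to be mirrored into it first. A sketch:

# Sketch: constant-time reverse lookup instead of scanning word_index.
tokenizer.index_word.update({idx: tok for tok, idx in tokenizer.word_index.items()})
predicted_id = int(np.argmax(predicted))
output_word = tokenizer.index_word.get(predicted_id, oov_token)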

GPT: Good and Evil

Now that we've got the code out of the way, let's get back to the ethical issues. As mentioned, GPT and similar models have huge potential for both good (education, medicine, science, art) and bad (disinformation, fraud, cyberbullying, plagiarism).

Conclusion

GPT is a powerful tool, and its impact on the world depends on how we use it. It is important to be mindful of the ethical aspects of developing and using artificial intelligence, and to strive to use technologies like GPT responsibly.

Questions for discussion:

  • What other examples of GPT being used for good and evil can you think of?

  • How can GPT and similar technologies be prevented from being abused?

  • What is the role of ethics in the development and application of artificial intelligence?

Let's discuss these important questions together and find ways to use GPT and other powerful technologies of the future responsibly.
