Text Generator Update

Hi! I decided to return to building a model based on recurrent neural networks. I find this very interesting, because I personally like not so much using a neural network for its intended purpose as watching how it learns under different settings.

I decided to greatly simplify using the code by exposing many settings and marking which ones can be changed and tweaked to suit real practical needs, if any. I also made a nice progress bar for text generation, because that is the only stage where the program prints a lot of output.

To begin with, I decided to update the layers themselves. I realized that adding extra layers with special properties is generally not necessary, because the goal is a neural network model that runs on a regular computer. An attention layer, for example, greatly increases training time, even if it makes the model more accurate.

This is what creating the layers looks like:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Function that builds the model
def create_model(total_words, max_sequence_len):
    model = Sequential()
    model.add(Embedding(total_words, 100, input_length=max_sequence_len-1)) # Embedding size can be changed
    model.add(LSTM(100)) # Number of units can be changed
    for _ in range(2): # Number of hidden Dense layers can be changed
        model.add(Dense(100)) # Number of units can be changed
    model.add(Dense(total_words, activation='softmax'))
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
    return model
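
To get a feel for how the layer sizes affect what a regular computer has to handle, here is a rough sketch that counts trainable parameters for this architecture using the standard formulas for Embedding, LSTM and Dense layers. The helper names and the vocabulary size of 60 characters are my own assumptions for illustration; your dataset will have a different number.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

def model_params(vocab_size, width, hidden_dense=2):
    # Embedding -> LSTM -> hidden Dense layers -> softmax output
    total = embedding_params(vocab_size, width)
    total += lstm_params(width, width)
    total += hidden_dense * dense_params(width, width)
    total += dense_params(width, vocab_size)
    return total

vocab_size = 60  # assumed number of distinct characters, for illustration only
for width in (100, 700):
    print(width, model_params(vocab_size, width))
```

Under these assumptions, going from 100 to 700 units multiplies the parameter count by roughly 44, which is why the layer width is the first thing to reduce on a slow machine.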

The training of the model has not changed much, except for the part where the text dataset is specified. The previous article showed me that readers may not understand what kind of text is needed. I will say right away: train your model on any text! Just enter a query in Google, for example, “What does it mean to procrastinate”, and save the answer in a text file. The main thing is that the text is longer than 200 characters (more precisely, longer than max_sequence_len, otherwise no training windows can be cut from it); its maximum size is theoretically unlimited. Here is the code for the training function:

# Function that trains the neural network
def train_model(TextData, max_sequence_len):
    tokenizer = Tokenizer(char_level=True)
    tokenizer.fit_on_texts(TextData)
    total_chars = len(tokenizer.word_index) + 1

    input_sequences = []
    for i in range(0, len(TextData) - max_sequence_len, 1):
        sequence = TextData[i:i + max_sequence_len]
        input_sequences.append(sequence)

    input_sequences = tokenizer.texts_to_sequences(input_sequences)
    input_sequences = np.array(input_sequences)
    xs, labels = input_sequences[:, :-1], input_sequences[:, -1]
    ys = tf.keras.utils.to_categorical(labels, num_classes=total_chars)

    model = create_model(total_chars, max_sequence_len)
    epochs = 0 # Counts completed epochs
    while True:
        history = model.fit(xs, ys, epochs=1, verbose=1)
        epochs += 1
        accuracy = history.history['accuracy'][0]
        if accuracy > 0.7: # Adjustable parameter
            break

    model.save('path_to_model_file/text_generation_model.h5') # Path where the model file is saved
    return model, tokenizer
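
The loop over TextData above cuts the text into overlapping windows, and the last character of each window becomes the label to predict. Here is a minimal sketch of that idea on plain strings, without the tokenizer; the function name make_windows and the sample text are my own, chosen only for illustration.

```python
def make_windows(text, max_sequence_len):
    """Split text into (input, label) pairs: the first max_sequence_len - 1
    characters of each window are the input, the last one is the label."""
    pairs = []
    # Same range as in train_model: the window start moves one character at a time
    for i in range(len(text) - max_sequence_len):
        window = text[i:i + max_sequence_len]
        pairs.append((window[:-1], window[-1]))
    return pairs

pairs = make_windows("hello world", 5)
for context, target in pairs:
    print(repr(context), "->", repr(target))
```

If the text is not longer than max_sequence_len, the list comes back empty, so there would be nothing to train on.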

Note that the model is trained until the training accuracy exceeds 0.7. Of course, you can change this threshold if you need to.
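
One thing to keep in mind is that a loop like this never stops if the accuracy never reaches the threshold. Below is a possible sketch of the same "train until good enough" idea with an upper limit on epochs. The names train_until and max_epochs are hypothetical and not part of my program, and the accuracies are made-up numbers standing in for real training; in the real code, fit_one_epoch would wrap a single-epoch model.fit call and return its accuracy.

```python
def train_until(fit_one_epoch, target_accuracy=0.7, max_epochs=50):
    """Call fit_one_epoch() until it returns an accuracy above target_accuracy
    or max_epochs is reached. Returns (epochs_run, last_accuracy)."""
    accuracy = 0.0
    for epoch in range(1, max_epochs + 1):
        accuracy = fit_one_epoch()
        if accuracy > target_accuracy:
            return epoch, accuracy
    return max_epochs, accuracy

# Stand-in for real training: replays made-up accuracies, one per epoch
fake_accuracies = iter([0.2, 0.5, 0.65, 0.72, 0.8])
epochs_run, final_accuracy = train_until(lambda: next(fake_accuracies))
print(epochs_run, final_accuracy)
```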

Text generation has received a major update: it now uses temperature, which controls how varied the generated text is. Lower values make the output more predictable, higher values make it more diverse. Here is the code:

# Text generation function
def generate_text(seed_text, next_chars, model, max_sequence_len, tokenizer, temperature=0.7): # Adjust the temperature parameter.
    generated_text = seed_text
    for _ in tqdm(range(next_chars), desc="Generating text"):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')

        predicted_probs = model.predict(token_list, verbose=0)[0]
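        # float64 keeps the probabilities summing to 1 for np.random.choice; the small constant avoids log(0)
        predicted_probs = np.log(predicted_probs.astype('float64') + 1e-10) / temperature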
        exp_preds = np.exp(predicted_probs)
        predicted_probs = exp_preds / np.sum(exp_preds)
        predicted = np.random.choice(len(predicted_probs), p=predicted_probs)
        output_char = tokenizer.index_word.get(predicted, "")
        seed_text += output_char
        generated_text += output_char

    return generated_text
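
To see what the temperature actually does to the probabilities, here is a small standalone sketch of the same rescaling in plain Python. The function name apply_temperature and the three probabilities are made up for illustration; the real values come from model.predict.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: divide log-probabilities by the
    temperature, then renormalize with a softmax."""
    eps = 1e-12  # avoids log(0) for characters the model rules out
    scaled = [math.log(p + eps) / temperature for p in probs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]  # made-up next-character probabilities
for t in (0.2, 0.7, 1.0, 1.5):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

At a temperature of 1.0 the distribution stays as it was; below 1.0 the most likely character takes an even larger share, and above 1.0 the shares move closer together.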

Using the model hasn't changed much; only minor details were adjusted. Here's the code:

# Does the user want to train a new model?
train_new_model = input("Хотите обучить новую модель? (да/нет): ") # "Do you want to train a new model? (yes/no)"
if train_new_model.lower() == "да": # "да" means "yes"
    # Load the dataset
    with open('path_to_dataset', 'r', encoding='utf-8') as file:
        TextData = file.read().replace('\n', ' ')
    max_sequence_len = 100 # Adjust (must match the value below)
    model, tokenizer = train_model(TextData, max_sequence_len)
else:
    # Load the trained model
    model = load_model('path_to_model_file/text_generation_model.h5')
    tokenizer = Tokenizer(char_level=True)
    with open('path_to_dataset', 'r', encoding='utf-8') as file:
        TextData = file.read().replace('\n', ' ')
    tokenizer.fit_on_texts(TextData)
    max_sequence_len = 100 # Adjust (must match the value above)

# Generate and print text
while True:
    seed_text = input("Вы: ") # "You: "
    next_chars = 200 # Adjust the length of the generated text
    generated_text = generate_text(seed_text, next_chars, model, max_sequence_len, tokenizer)
    print("ИИ: ", generated_text) # "AI: "
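
The loading branch above rebuilds the tokenizer from the dataset and repeats max_sequence_len by hand, so both have to match what was used in training. One possible alternative, which is not what my program does, is to store these settings in a small JSON file next to the model. The function names, the file name and the tiny vocabulary below are hypothetical; in the real code the mapping would be tokenizer.word_index.

```python
import json
import os
import tempfile

def save_settings(path, char_index, max_sequence_len):
    # Store the character-to-index mapping and the window length together
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"char_index": char_index,
                   "max_sequence_len": max_sequence_len}, f, ensure_ascii=False)

def load_settings(path):
    # Read back the same two values that save_settings wrote
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data["char_index"], data["max_sequence_len"]

char_index = {" ": 1, "о": 2, "е": 3, "а": 4}  # tiny made-up vocabulary
path = os.path.join(tempfile.gettempdir(), "generator_settings.json")
save_settings(path, char_index, 100)
restored_index, restored_len = load_settings(path)
print(restored_index == char_index, restored_len)
```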

That's it! The model is ready. Now you have a program for training and using text generators. Just in case, here is the full code, as I have it in my Google Colab. Note that it uses larger settings (700 units and a sequence length of 200) than the snippets above:

import tensorflow as tf
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
from tqdm import tqdm

# Build the model
def create_model(total_words, max_sequence_len):
    model = Sequential()
    model.add(Embedding(total_words, 700, input_length=max_sequence_len-1))
    model.add(LSTM(700))
    for _ in range(2):
        model.add(Dense(700))
    model.add(Dense(total_words, activation='softmax'))
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
    return model

# Train the model
def train_model(TextData, max_sequence_len):
    tokenizer = Tokenizer(char_level=True)
    tokenizer.fit_on_texts(TextData)
    total_chars = len(tokenizer.word_index) + 1

    input_sequences = []
    for i in range(0, len(TextData) - max_sequence_len, 1):
        sequence = TextData[i:i + max_sequence_len]
        input_sequences.append(sequence)

    input_sequences = tokenizer.texts_to_sequences(input_sequences)
    input_sequences = np.array(input_sequences)
    xs, labels = input_sequences[:, :-1], input_sequences[:, -1]
    ys = tf.keras.utils.to_categorical(labels, num_classes=total_chars)

    model = create_model(total_chars, max_sequence_len)
    epochs = 0 # Counts completed epochs
    while True:
        history = model.fit(xs, ys, epochs=1, verbose=1)
        epochs += 1
        accuracy = history.history['accuracy'][0]
        if accuracy > 0.7:
            break

    model.save('/content/drive/MyDrive/Colab Notebooks/text_generation_model.h5')
    return model, tokenizer

# Text generation
def generate_text(seed_text, next_chars, model, max_sequence_len, tokenizer, temperature=0.7):
    generated_text = seed_text
    for _ in tqdm(range(next_chars), desc="Generating text"):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')

        predicted_probs = model.predict(token_list, verbose=0)[0]
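        # float64 keeps the probabilities summing to 1 for np.random.choice; the small constant avoids log(0)
        predicted_probs = np.log(predicted_probs.astype('float64') + 1e-10) / temperature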
        exp_preds = np.exp(predicted_probs)
        predicted_probs = exp_preds / np.sum(exp_preds)
        predicted = np.random.choice(len(predicted_probs), p=predicted_probs)
        output_char = tokenizer.index_word.get(predicted, "")
        seed_text += output_char
        generated_text += output_char

    return generated_text

# Does the user want to train a new model?
train_new_model = input("Хотите обучить новую модель? (да/нет): ") # "Do you want to train a new model? (yes/no)"
if train_new_model.lower() == "да": # "да" means "yes"
    # Load the dataset
    with open('/content/drive/MyDrive/Colab Notebooks/TextData.txt', 'r', encoding='utf-8') as file:
        TextData = file.read().replace('\n', ' ')
    max_sequence_len = 200
    model, tokenizer = train_model(TextData, max_sequence_len)
else:
    # Load the model
    model = load_model('/content/drive/MyDrive/Colab Notebooks/text_generation_model.h5')
    tokenizer = Tokenizer(char_level=True)
    with open('/content/drive/MyDrive/Colab Notebooks/TextData.txt', 'r', encoding='utf-8') as file:
        TextData = file.read().replace('\n', ' ')
    tokenizer.fit_on_texts(TextData)
    max_sequence_len = 200

# Use the model
while True:
    seed_text = input("Вы: ") # "You: "
    next_chars = 200
    generated_text = generate_text(seed_text, next_chars, model, max_sequence_len, tokenizer)
    print("ИИ: ", generated_text) # "AI: "

Example of use:

Хотите обучить новую модель? (да/нет): да
1730/1730 ━━━━━━━━━━━━━━━━━━━━ 13s 6ms/step - accuracy: 0.2180 - loss: 2.8713
1730/1730 ━━━━━━━━━━━━━━━━━━━━ 11s 7ms/step - accuracy: 0.3566 - loss: 2.2326
1730/1730 ━━━━━━━━━━━━━━━━━━━━ 12s 7ms/step - accuracy: 0.4425 - loss: 1.9001
1730/1730 ━━━━━━━━━━━━━━━━━━━━ 11s 7ms/step - accuracy: 0.5110 - loss: 1.6469
1730/1730 ━━━━━━━━━━━━━━━━━━━━ 12s 7ms/step - accuracy: 0.5526 - loss: 1.4746
1730/1730 ━━━━━━━━━━━━━━━━━━━━ 12s 7ms/step - accuracy: 0.6003 - loss: 1.3076
1730/1730 ━━━━━━━━━━━━━━━━━━━━ 12s 7ms/step - accuracy: 0.6386 - loss: 1.1710
1730/1730 ━━━━━━━━━━━━━━━━━━━━ 11s 7ms/step - accuracy: 0.6808 - loss: 1.0347
1730/1730 ━━━━━━━━━━━━━━━━━━━━ 11s 6ms/step - accuracy: 0.7113 - loss: 0.9224
1730/1730 ━━━━━━━━━━━━━━━━━━━━ 11s 7ms/step - accuracy: 0.7511 - loss: 0.7948
WARNING:absl:You are saving your model as an HDF5 file via `model.save()` or `keras.saving.save_model(model)`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')` or `keras.saving.save_model(model, 'my_model.keras')`. 
Generating text: 100%|██████████| 200/200 [00:11<00:00, 16.95it/s]
ИИ:  Привет! синий красты.  как решений оставило ключевы и рекомендации) и объединённая сильные эмоции овищение ответом ответить, что высокиз атаки из ключевы говорм. понельно заранием произойдет с их ключевых на
Вы: ...
...

Хотите обучить новую модель? (да/нет): нет
WARNING:absl:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
Вы: Привет!
Generating text: 100%|██████████| 200/200 [00:15<00:00, 13.10it/s]
ИИ:  Привет! как работы системы болезной информативной обмене.  по многие напочтить атаки на предсказывает аборты нейросетей обмена имеют длину волны разрабатывает в этом ких моменту.  притворяются под количество
Вы: Привет!
Generating text: 100%|██████████| 200/200 [00:11<00:00, 17.00it/s]
ИИ:  Привет! рителюный центр: «группа chatgpt файлы — пользователей, скрытые лицы и развитие до движения, филишибы, используемые он обучение нейросеть том, чтобы изображения нейросеть могут матьсить и позволило с
...

I recommend running the program in Google Colab. Training is fast there, and your result may be better than mine.

I followed the TensorFlow instructions and ChatGPT helped me fix the errors so I could build this model.

I am grateful to everyone, even those who just came here to copy the code. I hope you like this program and that others will find practical use for it or upgrade it into a full-fledged chatbot.

Good luck to everyone and see you in new articles with code!
