A simple explanation with examples using spaCy

What is this NER of yours?

Named Entity Recognition (NER) is an NLP (Natural Language Processing) task aimed at identifying fragments of text that belong to classes such as people's names, organization names, dates, locations, amounts of money, and any other classes a model can be trained to detect.

How NER Works

Now let's look at the steps performed by a named entity recognition (NER) system:

  1. Text tokenization: The text is broken down into individual words or tokens.

  2. Feature extraction: Each token is assigned features that describe it and its context, such as capitalization, word shape, prefixes and suffixes, and the neighboring tokens.

  3. Application of the model: The model analyzes the features of each token and determines whether it is a named entity.

  4. Combining results: The results of token analysis are combined to form named entities, which are assigned appropriate class labels, such as PERSON, ORG, LOC, or DATE.

  5. Post-processing: Additional processing is performed to refine the results and correct errors.

Application

NER is used when the element you need cannot be extracted with synonym lists, regular expressions, and similar tools, and writing explicit search rules would be more complex than training a model.

NER is an important component of many NLP applications, such as information extraction, sentiment analysis, and question-answering systems (voice assistants, SMS and phone-call processing, and many others).

Python libraries for NER

Of the many existing libraries, I would highlight these:

  • spaCy — provides NER, POS tagging, dependency parsing, word vectors, and more. Notably, its pre-trained models include ones that support Russian.

  • StanfordCoreNLP — provides a simple API for text-processing tasks such as tokenization, part-of-speech tagging, named entity recognition, constituency and dependency parsing, and more.

SpaCy

We will focus on this library because it is extremely simple to use and ships with pre-trained Russian models out of the box.

Installing SpaCy

GPU version (recommended for better performance)

The example below is for NVIDIA cards. Before installing the libraries, you will need to install the CUDA toolkit from NVIDIA.

PyPi:

pip install 'spacy[cuda12x]'
pip install cupy-cuda12x

Anaconda

conda create -n NER python=3.10.9 spacy spacy-transformers cupy -c conda-forge
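
Once everything is installed, it is worth checking that spaCy can actually see the GPU. A minimal sketch (assuming CUDA and cupy were installed as above):

import spacy

# Switches spaCy's ops to the GPU and returns True if cupy can see a CUDA device,
# otherwise leaves it on the CPU and returns False
if spacy.prefer_gpu():
    print("spaCy will run on the GPU")
else:
    print("No usable GPU found, spaCy will fall back to the CPU")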

CPU version (only if there is no video card)

PyPi:

pip install spacy

Anaconda

conda create -n NER python=3.10.9 spacy -c conda-forge

Code example

In the code example below we load a pre-trained Russian model.

Model installation:

python -m spacy download ru_core_news_sm

Test code:

import spacy

# Load the small pre-trained Russian pipeline
nlp = spacy.load("ru_core_news_sm")
# The sentence means "Apple is considering buying a British startup for 1 billion dollars"
doc = nlp("Apple рассматривает возможность покупки британского стартапа за 1 миллиард долларов")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Output:

Apple PROPN nsubj
рассматривает VERB ROOT
возможность NOUN obj
покупки NOUN nmod
британского ADJ amod
стартапа NOUN nmod
за ADP case
1 NUM nummod
миллиард NOUN nummod:gov
долларов NOUN obl
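
The output above shows part-of-speech tags and dependency labels. Since our topic is NER, here is a small additional sketch (my own addition) that prints the entities the same pipeline finds; for this sentence the model should at least tag «Apple» as an organization, though the exact output depends on the model version:

import spacy

nlp = spacy.load("ru_core_news_sm")
doc = nlp("Apple рассматривает возможность покупки британского стартапа за 1 миллиард долларов")
for ent in doc.ents:
    print(ent.text, ent.label_)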

Preparation for training

We've gotten acquainted with a pre-trained model; now it's time to learn how to train one of our own.

The concept of training

Training is an iterative process in which the model's predictions are compared with reference values to estimate the loss. The loss is then used to calculate the gradients of the weights via backpropagation. The gradients show how the weights should be adjusted so that the model's predictions become more accurate.
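
To make the loop concrete, here is a deliberately simplified sketch of a single-weight gradient-descent loop; it is illustrative only and has nothing to do with spaCy's internals:

# Toy example: fit y = w * x to one data point with plain gradient descent.
x, y_true = 2.0, 10.0   # a single training example
w = 0.0                 # model weight, starts untrained
learning_rate = 0.1

for step in range(20):
    y_pred = w * x                     # model prediction
    loss = (y_pred - y_true) ** 2      # squared-error loss
    grad = 2 * (y_pred - y_true) * x   # gradient of the loss w.r.t. w
    w -= learning_rate * grad          # adjust the weight against the gradient

print(round(w, 3))  # approaches 5.0, i.e. y_true / x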

It is worth noting that spaCy guards against overfitting: as soon as the score on the dev set stops improving (controlled by the patience setting), training is interrupted.

Config for training

The config.cfg file used to train a spaCy model contains various parameters and settings that let you adapt the model to a specific task. Let's take a closer look at what this file may include:

1. Data:

This section contains information about the datasets that will be used to train, validate, and test the model. For example:

  • Paths to training and test dataset files.

  • Data format (JSON, CSV, etc.)

2. Model components:

This section describes the NLP components that are used in the model, such as:

  • Tokenizer

  • Lemmatizer

  • Part-of-speech tagger (Tagger)

  • Named entity recognizer (NER)

  • Dependency parser (Parser)

3. Optimizers:

Optimization methods and training parameters are described here, such as:

  • Optimizer type (Adam, SGD)

  • Optimizer parameters (learning rate, regularization coefficients, etc.)

4. Hyperparameters:

This section includes parameters that control the training process, such as the dropout rate, the batch size, and the maximum number of epochs or training steps.

5. Vector representations:

This section defines the word vector representations (Word2Vec, GloVe, etc.) that will be used to initialize the embeddings.

6. Environment settings:

This is where runtime settings are located, such as the GPU allocator and the random seed.

7. Logging settings:

This section may include parameters for logging the learning process:

  • Logging level (DEBUG, INFO, WARNING)

  • Log format and paths to them

8. Metrics:

This describes the metrics that will be used to evaluate the model, such as precision, recall, and F-score, and their weights in the final score.

An example of part of a config.cfg file:

[paths]
train = "path/to/train_data"
dev = "path/to/dev_data"
vectors = "path/to/vectors"

[nlp]
lang = "ru"
pipeline = ["tok2vec","ner"]
batch_size = 1000

[components.tok2vec]
factory = "tok2vec"

[training.optimizer]
@optimizers = "Adam.v1"

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

Creating a training configuration (CPU/GPU)

The official spaCy documentation has a configuration generator, but in my experience a config taken straight from it rarely gets accuracy above 64%.

So below are my configs for CPU and GPU; models trained with them showed an average accuracy of around 99%.
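
For comparison, the stock configuration mentioned above can also be produced entirely from the command line with spaCy's standard CLI (the file names here are just examples):

# Generate a complete starter config for a Russian NER pipeline
python -m spacy init config base_config.cfg --lang ru --pipeline ner

# Or, if you downloaded a partial config from the quickstart widget, fill in the defaults
python -m spacy init fill-config base_config.cfg config.cfg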

GPU

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "ru"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = true
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = true
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = true

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

CPU

[paths]
train = null
dev = null
vectors = null

[system]
gpu_allocator = null

[nlp]
lang = "ru"
pipeline = ["tok2vec","ner"]
batch_size = 1000

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 1000, 2500, 2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

[initialize]
vectors = ${paths.vectors}

Differences from stock configuration

The configs differ from the stock configuration only in the GPU version, where gold preprocessing (gold_preproc = true in the [corpora] sections) is enabled. If your datasets are carefully annotated and you are confident in their quality, enabling gold preprocessing is recommended to improve the accuracy and reliability of the model.

Creating datasets

To train the model, we need to annotate data and create datasets, which is what we'll do now. First, we should decide on the classes our model will detect, as well as find the data that will be annotated. Each class should have at least 1,000 rows, otherwise recognition accuracy may suffer from insufficient data.

Search for data on markup

I highlight three main places where we can find data for markup:

  • HuggingFace — a huge amount of data for any occasion. Even though there are very few ready-made NER datasets, you can take, say, a CSV, extract data from it, and annotate it yourself

  • Kaggle – similar to the first point

  • Google Dorking — searching information indexed by the search engine. You can dig up files with the necessary information; for example, this is how I collected company names for annotation

Data Markup

Now let's proceed directly to data markup.

Before annotating, the previously collected data should be gathered into a single txt file.

  1. Go to the annotation website

  2. Open the file and set up the classes

  3. Annotate the data to the end; a single row can contain one or more classes, depending on the data format and its content

  4. Download the annotated data via Annotations -> Export

After completing the steps above, a file named annotations.json will be downloaded.
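
Judging by the conversion script in the next section, the exported file is expected to have roughly this shape (an illustrative sketch with made-up text, spans, and labels, not the tool's exact output):

{
  "annotations": [
    ["ООО Ромашка открыла офис в Москве", {"entities": [[0, 11, "ORG"], [27, 33, "LOC"]]}],
    ...
  ]
}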

Data conversion

As we can see, the downloaded file is in JSON format, which does not suit us directly. It needs to be transformed into a format spaCy supports so that we can then build the datasets. The following Python code can help us with this:

import json

# Read the source JSON
data = open(input('Enter the path to the JSON file: ')).read()

# Load the data from JSON
json_data = json.loads(data)

# Convert the data into the required format
converted_data = []
for item in json_data["annotations"]:
    text, annotation = item
    entities = annotation["entities"]
    converted_data.append((text, entities))

# Save the result to training_data.txt
with open("training_data.txt", "w", encoding="utf-8") as f:
    for entry in converted_data:
        f.write(f"{entry}\n")
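
After running this, each line of training_data.txt contains one tuple of text plus its entity spans; with the made-up example above, a line would look something like this:

('ООО Ромашка открыла офис в Москве', [[0, 11, 'ORG'], [27, 33, 'LOC']])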

Creating datasets

Here we come to the final stage of data preparation for training the model. To convert training_data.txt into datasets, we need to run one more piece of code, here it is:

import spacy
from spacy.tokens import DocBin
from collections import defaultdict
import math
import random

# Create a blank Russian spaCy pipeline
nlp = spacy.blank("ru")

# Function for converting the data into spaCy's format
def convert_to_spacy_format(data, nlp):
    doc_bin = DocBin()
    for item in data:
        text, annotations = item
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annotations:
            span = doc.char_span(start, end, label=label)
            if span:
                ents.append(span)
        doc.ents = ents
        doc_bin.add(doc)
    return doc_bin

# Read the data to convert from the file
data = [
    eval(line.strip().rstrip(","))
    for line in open("training_data.txt", encoding="utf-8")
]

# Group the data by class
class_data = defaultdict(list)
for item in data:
    if isinstance(item, tuple) and len(item) == 2:
        _, annotations = item
        for _, _, label in annotations:
            class_data[label].append(item)
            break  # Assume each text belongs to a single class

# Set the ratios for the train, dev and test sets
train_ratio = 0.7
dev_ratio = 0.2
test_ratio = 0.1

train_data = []
dev_data = []
test_data = []

# Calculate the number of items for each class
for label, items in class_data.items():
    n_items = len(items)
    n_train = math.ceil(train_ratio * n_items)
    n_dev = math.ceil(dev_ratio * n_items)
    n_test = n_items - n_train - n_dev  # Whatever remains goes into the test set

    # Add the data to the sets
    train_data.extend(items[:n_train])
    dev_data.extend(items[n_train:n_train + n_dev])
    test_data.extend(items[n_train + n_dev:])

# Shuffle the data inside each set for a random distribution
random.shuffle(train_data)
random.shuffle(dev_data)
random.shuffle(test_data)

# Convert the data into spaCy format and save it
train_doc_bin = convert_to_spacy_format(train_data, nlp)
train_doc_bin.to_disk("train.spacy")

dev_doc_bin = convert_to_spacy_format(dev_data, nlp)
dev_doc_bin.to_disk("dev.spacy")

test_doc_bin = convert_to_spacy_format(test_data, nlp)
test_doc_bin.to_disk("test.spacy")

At the end we will have three files: train.spacy (the data the model will be trained on, 70% of all data), dev.spacy (the data used to validate the model's predictions during training, 20% of all data), and test.spacy (the data for a final benchmark, 10% of all data).
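
If you want to sanity-check what ended up in the files, a quick sketch like this (my own addition) loads a DocBin back and counts the documents and entities:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("ru")
doc_bin = DocBin().from_disk("train.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))
print(f"documents: {len(docs)}, entities: {sum(len(d.ents) for d in docs)}")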

Model training

So, we finally have the three dataset files and the config; training can begin.

GPU:

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy -g 0

CPU:

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

(Screenshot: the model training process)

Great, training has started. Now let's figure out what each column means:

  1. # – the ordinal number of the iteration in the model training process.

  2. LOSS TOK2VEC – the loss (error) while training the tok2vec component. This component is used to vectorize the tokens (words) in the model.

  3. LOSS NER – the loss (error) while training the NER (Named Entity Recognition) component, which is responsible for recognizing named entities in the text.

  4. ENTS_F – the F1-score for named entity recognition, the harmonic mean of precision and recall.

  5. ENTS_P – the precision of named entity recognition: the proportion of correctly classified named entities among all predicted ones.

  6. ENTS_R – the recall of named entity recognition: the ratio of correctly classified named entities to the total number of real named entities.

  7. SCORE – the overall score of the model at a given training iteration. It is computed from the metrics weighted in [training.score_weights] (in the GPU config above, ents_f has weight 1.0, so SCORE effectively equals the F-score). Multiplying this value by 100 gives the model's success rate in percent.

At the end of the process, the best model will be saved at ./output/model-best.
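
The test.spacy file we set aside has not been used yet; it can be fed to spaCy's built-in evaluation command to benchmark the finished model (standard CLI, the metrics path is just an example):

python -m spacy evaluate ./output/model-best ./test.spacy --output ./metrics.json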

Using a ready-made model

Using the model is quite simple. Create a file for_model.txt and enter, line by line, the values you want the model to process. Then run the following code:

import spacy
import openpyxl

# Load the trained model from the output directory
nlp_ru = spacy.load("./output/model-best")

f = open("for_model.txt", encoding="utf-8")
output_xlsx_file = "output_data.xlsx"


# Function for extracting entities of a given type from a document
def extract_entities(doc, entity_type):
    return [ent.text for ent in doc.ents if ent.label_ == entity_type]


# Get the list of entity types known to the model
entity_types = nlp_ru.pipe_labels["ner"]

# Create a workbook and worksheet for the xlsx file
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "NER Output"

# Write the column headers
ws.append(["Input Text"] + entity_types)

for raw_line in f:
    line = raw_line.strip()
    doc_ru = nlp_ru(line)
    row_data = [line]

    # Extract the entities of each type found in the line
    for entity_type in entity_types:
        entities = extract_entities(doc_ru, entity_type)

        # Add the deduplicated entities to the column for their class
        row_data.append(", ".join(list(set(entities))))

    # Write the row to the xlsx file
    ws.append(row_data)

# Close the input file
f.close()

# Save and close the workbook
wb.save(output_xlsx_file)
wb.close()

print(f"The data has been saved to {output_xlsx_file}")

When the code finishes running, a file named output_data.xlsx will be created.

That's all!
