Transfer Learning Guide for Beginners

Transfer learning (TL) is a machine learning technique in which a model trained for one task is reused for another, related task.

Let's say a person knows how to play the guitar and wants to learn the ukulele. Their existing skills will help them pick it up faster because the playing techniques are similar. Transfer learning works the same way: instead of training a model from scratch, it reuses the knowledge stored in saved pre-trained models.

Together with expert Maria Zharova, we explain how TL works, where it is used, and walk through specific cases.

Maria Zharova

Data Scientist, Alfa-Bank

How transfer learning differs from regular learning

In essence, transfer learning is retraining existing ML models to solve new problems. There are several differences between classical ML and TL:

| | Classical Machine Learning (ML) | Transfer Learning (TL) |
| --- | --- | --- |
| Training | From scratch | Based on a pre-trained model |
| Computational costs | High | Low |
| Required data volume | Large | Small |
| Use of knowledge | Each model is trained independently | Uses knowledge from a pre-trained model |
| Time to optimum performance | Long | Short |
| Efficiency | Less effective with limited resources and data | More effective with limited resources and data |

Transfer learning helps:

  • Save resources. Since the model does not need to be trained from scratch, labor costs and computing power requirements go down — resources that a company, and especially an individual specialist working on a pet project, may not have.

  • Improve the quality of results. On a limited dataset for a specific task, TL models can produce better results than models trained conventionally.

  • Speed up training. Since the pre-trained model already knows the general features, retraining it on a small dataset is much faster.

How Transfer Learning Developed

In 1976, S. Bozinovski and A. Fulgosi published an article about the transfer of knowledge between neural networks. This is one of the early examples of research in the field that would later become known as Transfer Learning. However, the term itself was not formulated and popularized until much later.

The modern understanding of TL was formed when scientists began to actively explore and use knowledge transfer between different tasks and models. In 2012, AlexNet, a neural network architecture developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet Large-Scale Visual Recognition Challenge. This neural network, pre-trained on a large dataset, could then be further trained on a new one.

In 2014, Google introduced the Inception architecture (GoogLeNet), which also used transfer learning ideas to improve performance in computer vision (CV) tasks.

In 2018, Google introduced BERT (Bidirectional Encoder Representations from Transformers), a model pre-trained on a huge volume of text and then further trained to perform specific natural language processing tasks.

Principles and mechanism of transfer learning

Before using transfer learning, you need to conduct an analysis and find out:

  • Does the model need additional training? Will TL actually be effective, or could it worsen the results?

  • What part of the knowledge can be transferred from the pre-trained model to the final one so that it can perform the task? Not all of the knowledge in the original model may be useful. For example, in computer vision tasks, the model may carry over common features, such as shapes and textures, that are needed to recognize different objects. You will have to experiment to determine which parts of the knowledge can and should be carried over.

  • How to transfer knowledge from the original model? Different tasks and subject areas may require different transfer methods, so before applying TL you need to choose the one that will give the best results.

The training itself is based on two principles:

  1. Model pre-training: First, the model is trained on a large data set that is not necessarily related to a specific task, but provides a lot of information for training. For example, in the case of image processing, the model might be trained on a large set of photographs of various objects.

  2. Adaptation to a new task (fine-tuning): After pre-training, the model is further trained on a more specialized dataset. For example, if it needs to recognize plants from the Red Book, the model will be further trained on images of these plants (see the sketch below).
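
A minimal Keras sketch of these two principles: the pre-training has already been done by someone else, so only the fine-tuning step is written out here. MobileNetV2 stands in for any pre-trained network, and the 5-class plant dataset `plant_train` is a hypothetical placeholder:

import tensorflow as tf

# Principle 1: take a model already pre-trained on a large, general dataset (ImageNet)
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # keep the general knowledge intact for now

# Principle 2: adapt (fine-tune) it to a specialized task, e.g. 5 Red Book plant species
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(plant_train, epochs=5)  # plant_train: a hypothetical small tf.data.Dataset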

How it works inside the model

ML models consist of layers — sets of neurons that perform certain calculations on the input data. Each layer extracts features of a different level of complexity from the data:

  • Initial layers: process the source data and extract basic features.

  • Deep layers: process the data received from the initial layers and reveal more complex and abstract patterns, such as the shapes of objects and their relationships.

  • Last layer: responsible for classification into the source classes — the categories and labels of data objects (for example, cat or dog). The short sketch below shows this layered structure.
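
A quick way to see this structure is to print a few layers of a pre-trained network (a sketch using ResNet50 from Keras; the exact layer names depend on the library version):

from tensorflow.keras.applications import ResNet50

model = ResNet50(weights="imagenet")  # full model with the 1000-class ImageNet head

# Initial layers: convolutions that extract basic features (edges, colors)
for layer in model.layers[:3]:
    print("initial:", layer.name)

# Last layer: the classifier that maps features to the source classes
print("last:", model.layers[-1].name)  # e.g. "predictions", a Dense layer with 1000 outputs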

In ML there is also the concept of a domain:

  • Subject area (domain) — a field of knowledge in which a machine learning model is applied (e.g. medicine, finance, natural language processing).

  • Data domain — the space or set of values that the data used to train the model can take (for example, age from 0 to 80 years).

In the context of transfer learning, a domain is the area of knowledge or dataset on which a model is trained (source domain) and in which it will be applied after additional training (target domain).

Examples of images from different domains. Source

These domains may be from different subject areas and contain data with different characteristics. Transfer learning involves transferring knowledge from a source domain to a target domain. This process involves adaptation of model layers (“freezing”, replacement and retraining) to a new data domain and, possibly, to a new subject domain, taking into account the differences between them.

For example, if a model was trained on human X-rays (source domain), it can be adapted to analyze animal X-rays (target domain). Despite the difference in domains (medical vs. veterinary), in both cases the data is medical scan images, and the underlying principles of image processing may be similar.

Transfer learning using CV as an example

Let's look at the learning process using computer vision as an example. It can be divided into seven main stages:

  1. Getting a pre-trained model

    The first step is to obtain a model pre-trained on a large dataset such as ImageNet. In other words, take a model that has already passed pre-training. It can already recognize general visual features such as shapes, textures, and objects, and can also distinguish more specific parameters.

Ready-made datasets for training on ImageNet. Source
  2. Defining the base model

    The model is prepared for further training: its architecture is analyzed and it is determined which changes will need to be made — which layers can be replaced (if necessary) or further trained for the target domain.

    Basic hyperparameters are also set up — those defined before training begins so that the training process can then be controlled. These include the batch size and the learning rate (see the sketch below). Then the software and hardware are prepared.
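
A sketch of how such basic hyperparameters might be set before fine-tuning (the specific values below are illustrative assumptions, not recommendations):

import tensorflow as tf

BATCH_SIZE = 32        # how many samples are processed per training step
LEARNING_RATE = 1e-4   # a small learning rate is common when fine-tuning
EPOCHS = 10            # number of passes over the new dataset

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)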

The stages that follow make up fine-tuning (additional training):

  3. “Freezing” layers

    First, the selected initial and deep layers of the model, which have already been trained to recognize certain features, are “frozen.” They will remain unchanged during further training.

    “Freezing” allows you to preserve the acquired knowledge of the existing layers and focus further training on the last layers of the model, which will be adapted to the specifics of the new task (a short Keras sketch follows the expert comment below).

How do you know which layer to freeze? Formally, the choice of layers to freeze depends on the task. There is a simple rule: the more data for a new task, the more layers can be unfrozen. If there is little new data, it is better to leave more initial layers frozen.

Maria Zharova, Data Scientist at Alfa-Bank
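
In Keras, freezing amounts to switching off the `trainable` flag of the chosen layers. A minimal sketch, assuming a ResNet50 base; how many layers to unfreeze depends on the amount of new data, as noted above:

from tensorflow.keras.applications import ResNet50

base_model = ResNet50(weights="imagenet", include_top=False)

# Little new data: freeze every pre-trained layer
base_model.trainable = False

# More new data: unfreeze only the last few layers instead
# for layer in base_model.layers[:-10]:
#     layer.trainable = False
# for layer in base_model.layers[-10:]:
#     layer.trainable = True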

  4. Modification of the last layers

    Once the required layers have been “frozen”, the last layers need to be replaced or new ones added so that the model can solve the new problem. The number of object classes may differ, so the old output layer is replaced with a new one.

    For example, you can add a new layer with 10 neurons that will be responsible for classifying the target domain into 10 classes, as in the sketch below.
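
A sketch of such a replacement in Keras: the frozen base is kept, and a new output layer with 10 neurons is attached for the 10 target classes (names and sizes here are illustrative):

from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

base_model = ResNet50(weights="imagenet", include_top=False, pooling="avg")
base_model.trainable = False  # frozen layers from the previous step

# New "head": a layer with 10 neurons for the 10 classes of the target domain
outputs = layers.Dense(10, activation="softmax")(base_model.output)
model = models.Model(inputs=base_model.input, outputs=outputs)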

  5. Continuing network training

    At this stage, the model continues training with a new set of data to adapt to the specific task.

  6. Assessing the accuracy of the model

    A test dataset is used to evaluate the accuracy of the model — how well it copes with the task. Usually several model variants are trained in parallel and compared with each other (see the sketch below).
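
In Keras this assessment is a single call on held-out data; a brief sketch assuming `test_images` and `test_labels` were set aside earlier:

# Evaluate the fine-tuned model on data it never saw during training
test_loss, test_accuracy = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_accuracy:.3f}")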

  7. Selecting the best model and integration

    After comparing several models, the one that performs best is selected and integrated into the system to solve real problems.

Transfer learning approaches and technologies

Basic approaches to TL

Homogeneous Transfer Learning — used when the source and target domains have the same or very similar data structure and solve similar problems.

Example: A model trained to recognize objects in photographs can be adapted to recognize objects in videos. Despite the difference in data format, the recognition task remains similar because videos consist of sequences of images.

Instance Transfer is a transfer learning method that selects and reuses individual examples (instances) from a source dataset. Even if the data from the source and target domains is slightly different, similar examples can help the model adapt faster and perform better on the new task.

Example: The model is trained to recognize photos of products from an online store, such as vegetables, fruits, food packaging, etc. Now we need to adapt it to recognize dishes in photos from restaurants. Although the images differ in style and context, individual products, such as vegetables or meat, can still be recognizable to the model.

Feature Representation Transfer — a method in which a model trained on one dataset uses its ability to extract important features (characteristics) to work with new data.

Example: If a model has learned to recognize common textures and shapes in images of cars (source domain), it can use those skills to recognize textures and shapes in new images, such as furniture or clothing (target domain). A minimal sketch follows.
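
A minimal sketch of feature representation transfer: a network pre-trained on ImageNet is used purely as a feature extractor, and its output vectors can feed a separate, simpler model for the target domain (the random batch below stands in for real furniture or clothing images):

import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# Pre-trained network without its classification head, used purely as a feature extractor
extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

# Placeholder batch standing in for real target-domain images (furniture, clothing, ...)
images = np.random.randint(0, 255, size=(4, 224, 224, 3)).astype("float32")
features = extractor.predict(preprocess_input(images))
print(features.shape)  # (4, 2048): one feature vector per image for a downstream classifier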

Popular Transfer Learning Models

  • ResNet (Residual Neural Network) — a neural network architecture designed to efficiently train deep networks. Its key feature is residual connections, which allow information to be passed on while bypassing one or more intermediate layers.

  • Transformers — a neural network architecture that has become a standard for natural language processing (NLP) and other sequential data tasks. At the core of transformers is an attention mechanism that allows the model to focus on important parts of the input data while ignoring less important ones.

    Examples of transformers are BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). Both models use feature representation transfer (FRT) to extract and use useful features from text.

Transfer learning tools

  • TensorFlow — an open-source library for machine learning and deep learning developed by Google. It supports many transfer learning methods and provides convenient tools for retraining pre-trained models.

    TensorFlow Hub — a platform for searching for pre-trained models.

    TensorFlow Model Garden — a collection of models for various tasks, including models for transfer learning.

  • PyTorch — a deep learning framework developed by Facebook. Provides tools for transfer learning and working with pre-trained models.

    torchvision.models — a package of models for computer vision tasks with architectures such as AlexNet, ResNet, DenseNet, etc.

    PyTorch Hub — a repository of pre-trained models for various tasks.

  • Hugging Face — a library that initially specialized in models for natural language processing. Now it provides tools for transfer learning of models from a wide range of areas — CV, NLP, audio.

    Transformers library — support for pre-trained models for various tasks.

    Tokenizers — tokenizers for text preprocessing. These components transform text into data that the model will then process.

  • Keras — a high-level API for deep learning that is part of TensorFlow. It offers a simple and convenient syntax for creating and retraining models.

    tf.keras.applications — a set of pre-trained models such as VGG, ResNet, Inception for computer vision tasks.

    Preprocessing and data augmentation (the ImageDataGenerator class) — tools for data preparation and augmentation (generating new data based on existing data); a minimal sketch follows.
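
For instance, a minimal augmentation setup with this class might look like the following (parameter values are illustrative):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Generate modified copies of training images on the fly to enlarge a small dataset
datagen = ImageDataGenerator(rescale=1.0 / 255, rotation_range=20, horizontal_flip=True, zoom_range=0.1)
# train_generator = datagen.flow_from_directory("data/train", target_size=(224, 224), batch_size=32)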

For computer vision, RoboFlow — an open repository of pre-trained models — is also used. The easiest way to try transfer learning from scratch is with Hugging Face or RoboFlow: these databases contain quite a lot of pre-trained models for standard tasks, along with interesting tutorials.

If you want to delve deeper into the topic and tackle more complex problems, first thoroughly study the principles of neural networks and at least one of the libraries for working with them — PyTorch or TensorFlow. They also offer many pre-trained weights, let you flexibly select layers for “freezing” and additional training, and allow you to build your own architectures.

It is also important to know how to preprocess data before feeding it to the model. Useful libraries: OpenCV, Albumentations (CV); NLTK, spaCy (NLP); Librosa (audio).

Maria Zharova, Data Scientist at Alfa-Bank

Where Transfer Learning is Used

Natural Language Processing (NLP)

Transformer models such as BERT and GPT are first trained on large volumes of text to learn context and grammar. After that, they are further trained on specific tasks to achieve a deeper understanding of the text.

Examples:

  • Sentiment analysis of text. BERT models, for example, can be further trained to analyze sentiment in product reviews, so they can distinguish between positive and negative opinions (see the sketch after this list).

  • Classification of texts. GPT models can be adapted to classify texts into categories such as news, scientific publications, or poetry. This helps improve the automatic sorting and analysis of large volumes of text.
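
The quickest way to see this kind of adaptation in action is the Hugging Face `pipeline` helper, which downloads a model already fine-tuned for sentiment analysis (the default English model is chosen automatically):

from transformers import pipeline

# Downloads a model already fine-tuned for English sentiment analysis
classifier = pipeline("sentiment-analysis")
print(classifier("This product exceeded my expectations!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]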

Computer Vision (CV)

For classification problems in CV, models such as Inception v4, pre-trained on large datasets such as ImageNet and CIFAR, are used. They can be easily retrained for more specific tasks. This is especially useful when we have a lot of data for one task but little for another (as in medicine, for example).

This is what importing a pre-trained Inception v4 model looks like. Source
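
The referenced screenshot is not reproduced here; as a rough substitute, a Keras sketch of importing a related pre-trained Inception model might look like this (tf.keras.applications ships InceptionV3 rather than v4, so that is what is shown):

from tensorflow.keras.applications import InceptionV3

# Pre-trained on ImageNet; the classification head is dropped so a new one can be attached
base_model = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
print(len(base_model.layers), "layers loaded with pre-trained weights")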

Examples:

  • Medical images. For example, models such as ResNet or VGG can be further trained to detect diseases in X-ray images.

  • Object recognition. Models can also be adapted to recognize specific objects, such as vehicles in surveillance footage.

Speech recognition

Speech recognition models such as DeepSpeech or Wav2Vec are trained on large volumes of audio data and can handle different accents and pronunciations. After that, they can also be further trained for specific tasks, as in the sketch below.
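
A sketch of loading such a pre-trained model from Hugging Face before any domain-specific fine-tuning (the checkpoint name is one publicly available example):

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Wav2Vec2 checkpoint pre-trained and fine-tuned on 960 hours of English speech
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
# From here the model can be further fine-tuned on medical or legal recordings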

Application examples:

  • Medicine. Speech models can be trained to answer questions from medical practice. In the future, this could power special digital assistants for doctors and patients.

  • Legal consultations. Models can be further trained to work with legal terminology, which will help create accurate transcriptions of court hearings and other legal documents.

As for tabular data, classical ML algorithms are more effective here, and transfer learning is generally not used. This is because such data is individual and well structured, so approaches like “detect general contours in an image” or “create a vector representation of a text” are not needed.

Maria Zharova, Data Scientist at Alfa-Bank

Application of TL to specific cases

Classification of images of cats and dogs

  1. Selecting and loading a pre-trained model

    Tool: TensorFlow, PyTorch.

    Load a pre-trained model such as ResNet or VGG, which is already trained on the ImageNet dataset.

  2. Replacing the last layer

    Replace the last layer of the model with a new one that has two outputs (one for cats and one for dogs). For example, in Keras it might look like this:

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D


# Base model pre-trained on ImageNet, loaded without its original 1000-class head
base_model = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False  # freeze the pre-trained layers, as described above

# New head: pooling, a hidden layer, and two outputs (cat and dog)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(2, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)
  3. Training a model on a new task

    Retrain the model on a dataset of images of cats and dogs. For example, using Keras:

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=10, validation_data=(val_images, val_labels))
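
After training, the adapted model can be used for prediction. A short usage sketch, assuming `img_array` is a single image already resized to 224×224 and preprocessed like the training data:

import numpy as np

probs = model.predict(np.expand_dims(img_array, axis=0))[0]
label = "cat" if np.argmax(probs) == 0 else "dog"  # assumes label 0 = cat, 1 = dog in the training data
print(label, probs)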

Sentiment Analysis of Reviews

  1. Selecting and loading a pre-trained model

    Tool: Hugging Face Transformers.

    Load a pre-trained model such as BERT, which has already been trained on a large corpus of texts:

from transformers import BertTokenizer, BertForSequenceClassification


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
  2. Preparing data for training

    Tool: Hugging Face Datasets, TensorFlow, PyTorch.

    Prepare a dataset with reviews and convert the text data into a format the BERT model can work with:

# train_dataset is a Hugging Face Dataset whose "text" column contains the reviews
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)


train_dataset = train_dataset.map(tokenize_function, batched=True)
  3. Retraining the model on a new task

    Tool: Hugging Face Transformers, TensorFlow, PyTorch.

    Train a model on review data so it can classify reviews by sentiment:

from transformers import Trainer, TrainingArguments


training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)


trainer.train()
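
Once training finishes, the fine-tuned model can classify new reviews. A brief usage sketch with PyTorch tensors (which class index means "positive" depends on how the training labels were encoded):

import torch

inputs = tokenizer("The delivery was fast and the quality is great!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()  # e.g. 0 = negative, 1 = positive (assumed encoding)
print(predicted_class)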

Useful resources for study

  • TensorFlow and PyTorch documentation — lots of examples and training tutorials.

  • “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville — a book to get acquainted with the basics of deep learning, including transfer learning. Great for building a foundation in this area.

  • “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems” by Aurélien Géron — a book that covers the fundamentals of machine learning and explains how transfer learning is applied in practice to real-world problems.

  • Hugging Face — mini-courses on working with NLP, audio, and CV.

Let's sum it up

  • TL allows you to use knowledge gained from one task to improve performance on another, similar task.

  • Training models from scratch requires a lot of data, while TL uses already trained models and adapts them to new tasks, which saves time and resources.

  • TL is widely used in CV, NLP and speech recognition.

  • The TL process consists of obtaining a pre-trained model, creating a base model, “freezing” the layers, replacing the last layer, and then training to the final result.

  • For TL, tools such as TensorFlow, Keras, PyTorch, and Hugging Face Transformers are used, among others.


Tomsk State University, together with Skillfactory, has developed a Master's program for those who want to study neural networks in depth. Students will master current technologies and advanced Data Science tools, and will immerse themselves in the process of creating intelligent systems: from design and training to implementation in real business processes.
