FP64, FP32, FP16, BF16 and FP8 – understanding the main types of floating point numbers

FP64: Maximum precision for scientific calculations

FP64, the 64-bit floating point format, is used where even the slightest error can lead to incorrect results. In areas such as the space industry, satellite trajectory modeling or fluid dynamics calculations, even a small deviation can have very serious consequences.

Task: Let's consider the problem of numerical integration to calculate the trajectory of a space object moving under the influence of Earth's gravity. For such problems, it is critical to use maximum precision to avoid accumulated errors that could lead to incorrect orbit predictions.

Example code for numerical integration using FP64:

import torch

# Satellite position (m) and velocity (m/s), stored in FP64
initial_position = torch.tensor([7.0e6, 0.0, 0.0], dtype=torch.float64)
initial_velocity = torch.tensor([0.0, 7.12e3, 0.0], dtype=torch.float64)
mass_earth = 5.972e24  # kg
G = 6.67430e-11        # gravitational constant, m^3 kg^-1 s^-2

# Gravitational acceleration acting on the satellite
def compute_acceleration(position, mass, g=G):
    distance = torch.norm(position)
    return -g * mass / distance**3 * position

# One step of the second-order Runge-Kutta (midpoint) method
def runge_kutta_step(position, velocity, mass, time_step):
    k1_v = compute_acceleration(position, mass) * time_step
    k1_x = velocity * time_step
    k2_v = compute_acceleration(position + k1_x / 2, mass) * time_step
    k2_x = (velocity + k1_v / 2) * time_step
    return position + k2_x, velocity + k2_v

# Apply one integration step
time_step = 1.0  # time step in seconds
position, velocity = runge_kutta_step(initial_position, initial_velocity, mass_earth, time_step)

print(f"Position: {position}, Velocity: {velocity}")

Why is FP64 needed here? In this task, even minimal errors must not grow into significant deviations, because orbital errors accumulate with every integration step. FP64 provides roughly 15-16 significant decimal digits (versus about 7 for FP32), which minimizes rounding errors when running numerical methods such as the Runge-Kutta step above.
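To make the rounding-error argument concrete, here is a small illustration that is not part of the original task: the same decimal value and the same arithmetic handled in FP32 and FP64.

import torch

# 0.1 cannot be stored exactly in binary floating point; the
# representation error is far larger in FP32 than in FP64.
print(f"FP32: {torch.tensor(0.1, dtype=torch.float32).item():.20f}")
print(f"FP64: {torch.tensor(0.1, dtype=torch.float64).item():.20f}")

# Adding a small number to a much larger one: FP32 loses the
# contribution entirely, FP64 keeps it.
big32 = torch.tensor(1.0e8, dtype=torch.float32)
big64 = torch.tensor(1.0e8, dtype=torch.float64)
print((big32 + 1.0) - big32)   # tensor(0.) -- the +1 is lost in FP32
print((big64 + 1.0) - big64)   # tensor(1., dtype=torch.float64)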

FP32: balance between accuracy and speed

FP32 is the standard 32-bit format used in most everyday tasks such as graphics rendering, image processing and neural network training. It offers sufficient accuracy at high performance, making it the default choice when speed matters more than the last few digits of precision.

Task: Matrix multiplication for real-time image processing. Such tasks require fast calculations, and the accuracy of FP32 is sufficient to obtain high-quality results.

Example code for matrix multiplication using FP32:

import torch

# Initialize matrices in float32
matrix_a = torch.tensor([[1.0, 2.0], [3.0, 4.0]], dtype=torch.float32)
matrix_b = torch.tensor([[5.0, 6.0], [7.0, 8.0]], dtype=torch.float32)

# Perform the matrix multiplication
result = torch.matmul(matrix_a, matrix_b)

print(f"Matrix multiplication result with FP32: \n{result}")

Why is FP32 needed here? In tasks related to rendering or graphics processing, calculations often have to run in real time, for example when creating visual effects or processing video, and FP32 provides a good balance between accuracy and performance. Of course, Python is an unlikely choice for real-time work; it is used here only for consistency with the other examples, while C/C++/Rust/Zig would be better suited in practice.
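As a rough illustration of the performance side of that balance (the matrix size and timing code here are illustrative additions, not from the original article), you can time the same multiplication in FP32 and FP64:

import time
import torch

# Two largish square matrices, once in FP32 and once in FP64
size = 2048
a32 = torch.randn(size, size, dtype=torch.float32)
b32 = torch.randn(size, size, dtype=torch.float32)
a64 = a32.double()
b64 = b32.double()

def time_matmul(a, b, repeats=10):
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    return (time.perf_counter() - start) / repeats

print(f"FP32 matmul: {time_matmul(a32, b32) * 1000:.1f} ms")
print(f"FP64 matmul: {time_matmul(a64, b64) * 1000:.1f} ms")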

FP16: Accelerated Data Processing

FP16 is a 16-bit format that speeds up calculations considerably by reducing precision, usually without a noticeable loss in the quality of the result. It is actively used in machine learning and neural network workloads, where fast processing of large volumes of data is important.

Task: Training a neural network for image classification. FP16 allows you to speed up model training without significant loss of quality.

Example code for working with FP16:

import torch
import torch.nn as nn
import torch.optim as optim

# A simple neural network for classification
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

# Initialize the model in FP16 and move it to the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleNet().to(device, dtype=torch.float16)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Dummy training data
data = torch.randn(64, 784, dtype=torch.float16).to(device)
target = torch.randint(0, 10, (64,), dtype=torch.long).to(device)

# One training step
optimizer.zero_grad()
output = model(data)
loss = nn.CrossEntropyLoss()(output, target)
loss.backward()
optimizer.step()

print(f"Loss: {loss.item()}")

Why is FP16 needed here? When training large neural networks on huge datasets, processing time plays a crucial role. FP16 can substantially speed up training on the GPU while halving the memory occupied by weights and activations, which is especially useful when training deep models with large amounts of data.
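In practice, training purely in FP16 as in the snippet above can be numerically fragile, so FP16 compute is usually combined with FP32 master weights. A minimal sketch of that approach using PyTorch's automatic mixed precision, assuming a CUDA device and reusing the SimpleNet class defined above:

import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda")
model = SimpleNet().to(device)            # parameters stay in FP32
optimizer = optim.Adam(model.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler()      # rescales gradients to avoid FP16 underflow

data = torch.randn(64, 784, device=device)
target = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():           # selected ops in the forward pass run in FP16
    output = model(data)
    loss = nn.CrossEntropyLoss()(output, target)
scaler.scale(loss).backward()             # backward pass on the scaled loss
scaler.step(optimizer)                    # unscales gradients, then takes the optimizer step
scaler.update()

print(f"Loss: {loss.item()}")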

BFLOAT16: optimization for inference

BFLOAT16 is a 16-bit format often used for inference, that is, for running already trained models (it is also widely used for training, precisely because it keeps FP32's exponent range). It noticeably speeds up data processing without significant losses in accuracy, which is especially useful in tasks related to real-time data analysis.

Task: Model inference for image processing using BFLOAT16 to speed up calculations.

Example code for inference using BFLOAT16:

import torch
import torch.nn as nn

# Input data in BFLOAT16 (requires a CUDA device)
input_data = torch.randn(64, 784, dtype=torch.bfloat16).cuda()

# A dummy model for inference
class SimpleInferenceNet(nn.Module):
    def __init__(self):
        super(SimpleInferenceNet, self).__init__()
        self.fc = nn.Linear(784, 10)

    def forward(self, x):
        return self.fc(x)

# Run inference with the model cast to BFLOAT16
model = SimpleInferenceNet().cuda().bfloat16()
output = model(input_data)

print(f"Inference result: {output}")

Why is BFLOAT16 needed here? This format speeds up inference, which is especially important in real-time tasks such as processing data streams or artificial intelligence systems in autonomous cars. BFLOAT16 keeps the same 8-bit exponent as FP32, so it covers the same range of values in half the memory; what it gives up is mantissa precision.
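A small illustration of that range difference (an addition, not part of the original example): a value of 1e30 still fits in BFLOAT16, but overflows FP16, whose maximum is about 65504.

import torch

value = torch.tensor(1.0e30, dtype=torch.float32)

# FP16 has a 5-bit exponent and tops out around 65504, so 1e30 overflows to inf
print(value.to(torch.float16))    # tensor(inf, dtype=torch.float16)

# BFLOAT16 shares FP32's 8-bit exponent, so the cast stays finite and close to 1e30
print(value.to(torch.bfloat16))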

FP8: Maximum performance for inference

(Figure: representation of the number 0.3952 in different floating point formats.)
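Since the figure itself is not reproduced here, a quick sketch (added for illustration) shows the same idea in code: the value 0.3952 as each format actually stores it.

import torch

value = 0.3952

# On PyTorch >= 2.1 you can also add torch.float8_e4m3fn to this tuple
for dtype in (torch.float64, torch.float32, torch.float16, torch.bfloat16):
    stored = torch.tensor(value, dtype=dtype)
    # Cast back to float64 to display the value the format actually holds
    print(f"{str(dtype):20s} -> {stored.double().item():.10f}")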

FP8 is a newer format (commonly in two variants, E4M3 and E5M2) designed to perform operations at maximum speed with minimal resources. It is well suited to inference workloads where raw speed matters more than precision, such as computer vision or real-time object recognition.

Task: Performing inference using FP8 to recognize objects in an image.

Example code:

import torch
import torch.nn as nn

# Simulate FP8 by a round trip through torch.float8_e4m3fn
# (available in PyTorch >= 2.1): the data is quantized to FP8
# and cast back to float32 so regular layers can consume it.
def to_fp8(tensor):
    return tensor.to(torch.float8_e4m3fn).to(torch.float32)

# Input data
input_data = torch.randn(64, 784).cuda()
input_data_fp8 = to_fp8(input_data)

# The simplest possible model for inference
class SimpleFP8Net(nn.Module):
    def __init__(self):
        super(SimpleFP8Net, self).__init__()
        self.fc = nn.Linear(784, 10)

    def forward(self, x):
        return self.fc(x)

# Run inference on the FP8-quantized input
model = SimpleFP8Net().cuda()
output = model(input_data_fp8)

print(f"Result using FP8: {output}")

Why is FP8 needed here? FP8 is ideal for applications where some accuracy can be traded for maximum processing speed. This is relevant for inference in artificial intelligence systems that must process huge volumes of data almost instantly with minimal resources.
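To compare the formats side by side, torch.finfo reports the key characteristics of each dtype. This snippet is an addition for illustration; the FP8 dtypes require PyTorch 2.1 or newer.

import torch

dtypes = [torch.float64, torch.float32, torch.bfloat16, torch.float16,
          torch.float8_e4m3fn, torch.float8_e5m2]

print(f"{'dtype':22s} {'bits':>4s} {'max':>12s} {'smallest normal':>16s} {'eps':>10s}")
for dtype in dtypes:
    info = torch.finfo(dtype)
    print(f"{str(dtype):22s} {info.bits:4d} {info.max:12.3e} {info.tiny:16.3e} {info.eps:10.3e}")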

Conclusion

Each floating point type – be it FP64, FP32, FP16, BFLOAT16 or FP8 – has its own niche and should be chosen for the task at hand: FP64 for scientific calculations, FP32 for the balance between performance and accuracy, FP16 for training neural networks, and BFLOAT16 and FP8 for inference. Modern accelerators such as NVIDIA data-center GPUs, Radeon Instinct and Intel GPU Max support most of these formats (FP8 only on the newest generations), allowing you to make the most of the GPU for each specific task.

What type of float do you use in your projects and why? Share in the comments, it will be interesting to discuss!
