Setting up Linux to train models on a GPU

Well, the hardware assembly is complete! My GPU rig is built and waiting for orders. But, of course, building the PC is only the beginning: now the system has to be taught to work with this beast by installing Linux drivers, CUDA and other joys. And this, as we know, can turn into another quest: if everything does not work perfectly right away, a "show of unpredictable problems" is guaranteed to begin. I'm not a big fan of configuring and reconfiguring, and I'm no great Linux expert, but I have to tinker with settings from time to time, so I decided to write everything down as scripts right away, to simplify the process and make rollback possible. The result is a set of scripts that "will do everything for you"; their description is available here. If you're lucky, they won't even break the system (just kidding, they definitely will).

Three steps to success

I won't cover the Linux installation itself; it is well documented elsewhere. I'll only say that I chose Ubuntu 24.04 Desktop as the base (a desktop environment is sometimes required) and then configured the system to suit my needs.

For ease of setup, I divided the installation into three parts, each solving its own set of problems, which makes the process more flexible and convenient:

  1. Remote access – enables SSH and basic security for connecting to the machine.

  2. Drivers and CUDA – the key to harnessing the GPU's power; without them your hardware is simply useless.

  3. Development tools – Docker, Jupyter and other nice little things that make writing and testing code comfortable and safe.

For each step I wrote scripts that install, remove, or manage the corresponding components. The settings for each step live in config.env files; the details are in the readme.
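For orientation, a config.env for such a stage might look something like the fragment below. The variable names here are purely illustrative guesses, not the actual contents of the repository's files; check the readme for the real settings.

```
# hypothetical config.env fragment – names are illustrative only
SSH_PORT=22
ENABLE_UFW=true
ENABLE_RDP=false
CUDA_VERSION=12.0
EXTRA_PYTHON_VERSION=3.11
```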

First step: remote access

I use my PC as a home server, but I occasionally need its desktop environment; otherwise I could have installed the server version of Linux. In general, the PC sits in the dark without a monitor, and everything running on it must be accessible remotely. So the first step is to set up remote access. For this purpose the following are provided:

  • SSH – for a secure connection to the server.

  • UFW (Uncomplicated Firewall) – to protect the network.

  • RDP – for remote desktop access.

  • VNC – also for graphical access.

  • Samba – for sharing files on the network.
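On the SSH side, the "security" part usually boils down to a few lines in /etc/ssh/sshd_config. A typical hardening fragment is shown below; this listing is generic, and the scripts may configure things differently:

```
# /etc/ssh/sshd_config – common hardening options
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
```

Disable password authentication only after your key is already in ~/.ssh/authorized_keys, otherwise you will lock yourself out of the headless box.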

Detailed readme for the first stage.

Second step: NVIDIA and CUDA drivers

Now we get to the part this whole thing was started for. After all, I needed the GPU, and that means there is no getting around the NVIDIA drivers.

So, what do we install:

  • NVIDIA drivers – so that the video card finally understands what is wanted from it.

  • CUDA – without it there is no parallel-computing magic when training networks.

  • cuDNN – a library of primitives for deep learning tasks.

  • Python – for development; in my case the Ubuntu distribution already shipped Python 3.12, but I also needed a second, older version, 3.11.
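Before running the installation scripts, it can be useful to see what is already on the machine. A minimal sketch (not part of the scripts described here):

```shell
#!/usr/bin/env bash
# Report which parts of the GPU stack are already installed.
for cmd in nvidia-smi nvcc python3; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found at $(command -v "$cmd")"
  else
    echo "$cmd: missing"
  fi
done
```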

We adjust the config and run the scripts; if you're lucky, you won't get a sudden reboot into a black screen (which, by the way, also looks quite minimalistic and stylish). If it happens anyway, then maybe you are simply Malevich?

Those whose installation succeeded, let's move on. Checking the NVIDIA software:

$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

If the output of the next command shows exactly your GPU, your karma is clean and everything is still ahead; if not, it's time to reconsider your life priorities. I got lucky.

$ nvidia-smi

Fri Sep 27 17:01:20 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti     Off | 00000000:01:00.0 Off |                  Off |
|  0%   41C    P8              15W / 450W |   4552MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2441      C   python                                     4546MiB |
+---------------------------------------------------------------------------------------+
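For scripting, the same data is easier to pull from nvidia-smi in machine-readable form. The sketch below uses standard nvidia-smi query fields and falls back to a message on machines without the driver:

```shell
# CSV is easier to parse than the default table output
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version,memory.total,utilization.gpu \
             --format=csv,noheader
else
  echo "nvidia-smi: not installed"
fi
```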

Well, the icing on the cake is to check whether your GPU is really ready to work for the benefit of science. Use the following code (don't forget to install PyTorch first):

import torch

# basic check that PyTorch can see the GPU
print("CUDA available: ", torch.cuda.is_available())
print("Number of available GPUs: ", torch.cuda.device_count())

The result should be:

python test_gpu.py
CUDA available:  True
Number of available GPUs:  1

If the output confirms that CUDA is available, then the setup was successful and you are ready to dive into the world of deep learning at GPU speed. Well, or at least start to figure out what else went wrong.
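The same check can be made to degrade gracefully instead of dying with an ImportError when PyTorch is missing; a sketch, assuming python3 is on PATH:

```shell
# One-shot check that reports status instead of crashing
python3 - <<'PY'
try:
    import torch
except ImportError:
    torch = None

if torch is None:
    print("torch: not installed")
elif torch.cuda.is_available():
    print("torch", torch.__version__, "| GPUs:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0))
else:
    print("torch", torch.__version__, "| CUDA not available")
PY
```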

Detailed readme for the second stage.

Third step: development tools

After the first two stages, remote access is configured, the drivers are installed, and CUDA is working. What's next? Next you need an environment to work in: to train your models, run them for testing, and generally load up all the CPU/GPU cores and memory the hardware has. This is where the scripts come in; in my case they install the minimal set of components I need, namely:

  • Git – version control system.

  • Docker – containerization platform.

  • Jupyter – isn't it every developer's dream to see their mistakes right away in the browser?

  • Ray – a platform for those who decided that one GPU is boring and it's time to scale up.
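Once Docker and the NVIDIA Container Toolkit are in place, whether containers actually see the GPU can be checked with a single run. The image tag below is just an example; pick one matching your CUDA version:

```shell
# Smoke test: run nvidia-smi inside a CUDA base image.
# Requires the NVIDIA Container Toolkit to be configured for Docker.
if command -v docker >/dev/null 2>&1; then
  docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi \
    || echo "docker: GPU passthrough is not working yet"
else
  echo "docker: not installed"
fi
```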

Detailed readme for the third stage.

Conclusion

Surely it can be done better, cooler, and so on, but I hope my scripts will save someone time preparing a PC for model training, and will provoke in someone a healthy or unhealthy reaction. I will be glad for the first, thank the second, and pity the third. Next time I plan to talk about installing LLM models.
