An external video card for a backend developer, or how to get your best friend to stop being dumb and start helping (part 1)

Hi all! My name is Nikolai Pervukhin. I am a passionate GoLang developer working at Ozon Bank in the KYC services development group.

Most articles about external video cards are devoted to chasing FPS in games. Here I want to focus on how to make one benefit the developer.

Like many remote workers, I work on a company-provided laptop. You could choose Win/Mac/Linux – and, of course, I chose Linux. I got a Lenovo T14 Gen 2 with an Intel i7 processor and integrated Iris Xe Graphics.

“Apart from Ubuntu installed on a laptop, a terminal and Vim, a true backend guru doesn’t need anything else,” some will say.

I have not yet mastered that level of asceticism: I want to work comfortably in an IDE, so that CRUDs and proto files turn into great code. And since I am a human, and therefore the top of the software-digital food chain, I should generate the ideas while the machine does the main work for me (in my dreams).

First attempts to add humanity to a laptop

I catch myself thinking that the companion who goes with me through all the thorns of debugging and shares the pain of code review (a.k.a. my laptop) is not an android but merely an advanced typewriter. Given the global craze for generative AI, we will try to “get smarter” together with the laptop.

Subjectively, I am not a fan of cloud AI: sending your code off somewhere and hoping for confidentiality… it does not sound very safe – and remember, the security service is always on guard. So only local solutions are suitable for bringing a digital friend to life.

I studied AI back at the institute some 30 years ago, but that knowledge has long gone stale and is limited to the theoretical foundations of the perceptron – in short, not much use. So I wanted the fastest and simplest solution possible. After going through several options, I settled on the Ollama project, which can be run via Docker.

How surprised I was when everything actually installed and worked without problems (well, almost).

This is how you can start the Ollama service with one command:

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

I knocked on port 11434 (it is used to talk to the models via the API), and the service announced that it was up: “Ollama is running.”
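
For a quick check from code rather than the browser, here is a minimal Go sketch of that same knock on the port (assuming Ollama is listening on the default localhost:11434, as in the docker command above): a GET to the root simply returns the string “Ollama is running”.

// A minimal health check: Ollama answers plain HTTP on port 11434,
// and a GET to the root returns "Ollama is running".
package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    resp, err := http.Get("http://localhost:11434/")
    if err != nil {
        log.Fatalf("ollama is not reachable: %v", err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(body)) // expected: Ollama is running
}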

It should be clarified that without an installed model, Ollama is as pure as a child's mind. In my case, I needed not just a basic language model that could help with a recipe for pork wings, but a specialized coding one, trained on scripts, code and other things that only you and I would understand.

A little about the models

A caveat: I did not dive deep, I only skimmed the surface. Here are the models I managed to find: Code Llama, StarCoder, DeepSeek Coder and, of course, Codestral. There are many similar models, and evaluating the pros and cons of each takes a lot of time, so I will describe my experience with the ones I tried personally.

Codestral from Mistral AI

This model was trained on 80 programming languages and, according to its creators, it is significantly better in terms of relevance than the base models that existed before it.

The model is like a fine Rolls-Royce: unhurried, but very high quality. I was disappointed by its low speed, but it can be sped up a little – more on that later.

StarCoder2

This is actually three models of different sizes (3B, 7B and 15B), trained on 600+ programming languages, as well as on natural-language data from Wikipedia, Arxiv and GitHub. Even the largest, 15B, works many times faster than Codestral, but it is quite likely to produce complete nonsense – an utterly irrelevant result.

DeepSeek-Coder

This is probably that very compromise between speed and adequacy. DeepSeek Coder is also a family of models, four of them in different sizes: 1.3B, 5.7B, 6.7B and 33B. It was trained on a mix of natural language (18% of the data, mostly English) and code (82%). The result is fast and quite relevant.

For now I am using Codestral, which is powerful and at the same time simple enough for my tasks, and DeepSeek Coder 6.7B, which is a little smaller but faster.

To download and install the model you need to run:

docker exec -it ollama ollama run codestral:22b
docker exec -it ollama ollama run deepseek-coder:6.7b

The download is about 16 GB – make sure you have enough free space in advance. I would also recommend installing the nomic-embed-text model right away; the plugin in GoLand will need it:

docker exec -it ollama ollama run nomic-embed-text
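
To see what this model is for, you can request an embedding from Ollama directly; the Continue plugin does essentially the same thing when it indexes your project. A minimal Go sketch, assuming the default local address:

// Ask the local Ollama service to embed a snippet of text with
// nomic-embed-text; the response is a single numeric vector.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

type embeddingRequest struct {
    Model  string `json:"model"`
    Prompt string `json:"prompt"`
}

type embeddingResponse struct {
    Embedding []float64 `json:"embedding"`
}

func main() {
    payload, _ := json.Marshal(embeddingRequest{
        Model:  "nomic-embed-text",
        Prompt: "func main() { fmt.Println(\"hello\") }",
    })

    resp, err := http.Post("http://localhost:11434/api/embeddings",
        "application/json", bytes.NewReader(payload))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var out embeddingResponse
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        log.Fatal(err)
    }
    // nomic-embed-text produces a 768-dimensional vector
    fmt.Printf("got a vector of %d dimensions\n", len(out.Embedding))
}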

You can see which models are installed with the command:

docker exec -it ollama ollama list

The result looks something like this:

NAME                   	ID          	SIZE  	MODIFIED
deepseek-coder:6.7b    	ce298d984115	3.8 GB	10 minutes ago	
starcoder2:15b         	20cdb0f709c2	9.1 GB	42 hours ago  	
starcoder2:latest      	f67ae0f64584	1.7 GB	2 days ago    	
nomic-embed-text:latest	0a109f422b47	274 MB	5 days ago    	
codestral:22b          	fcc0019dcee9	12 GB 	5 days ago
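
Before wiring up the IDE, you can check that a model actually answers over the same HTTP API the plugin will use. Here is a minimal Go sketch that sends one non-streaming prompt to the deepseek-coder:6.7b model pulled above (the address matches the docker command):

// Send one prompt to /api/generate and print the model's answer.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

type generateRequest struct {
    Model  string `json:"model"`
    Prompt string `json:"prompt"`
    Stream bool   `json:"stream"`
}

type generateResponse struct {
    Response string `json:"response"`
}

func main() {
    payload, _ := json.Marshal(generateRequest{
        Model:  "deepseek-coder:6.7b",
        Prompt: "Write a Go function that reverses a string.",
        Stream: false, // return the whole answer as a single JSON object
    })

    resp, err := http.Post("http://localhost:11434/api/generate",
        "application/json", bytes.NewReader(payload))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var out generateResponse
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        log.Fatal(err)
    }
    fmt.Println(out.Response)
}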

Now you are ready to use the model directly from GoLand. To communicate with the service, you need to install the Continue plugin (continue.dev). There is also a version for VS Code if for some reason GoLand is not your thing. At the time of writing, the plugin works stably with GoLand 2024.3 EAP; in earlier versions there were problems that could be solved by switching GoLand to an older runtime.

After installation, open the plugin settings right away (the gear at the bottom of the plugin window) and point the config at the local service.

Here is an example of my config for working with the local Ollama service and the Codestral and DeepSeek Coder models:

{
 "models": [
   {
     "title": "Codestral",
     "apiBase": "http://localhost:11434/",
     "provider": "ollama",
     "model": "codestral:22b",
     "contextLength": 2048
   },
   {
     "title": "Deepseek-Coder",
     "apiBase": "http://localhost:11434/",
     "provider": "ollama",
     "model": "deepseek-coder:6.7b"
   }
 ],
 "tabAutocompleteModel": {
   "title": "Deepseek-Coder",
   "provider": "ollama",
   "model": "deepseek-coder:6.7b",
   "apiBase": "http://localhost:11434/"
 },
   "contextProviders": [
   {
     "name": "diff",
     "params": {}
   },
   {
     "name": "folder",
     "params": {}
   },
   {
     "name": "codebase",
     "params": {}
   }
 ],
 "embeddingsProvider": {
   "title": "embeding",
   "provider": "ollama",
   "model": "nomic-embed-text:latest",
   "apiBase": "http://localhost:11434/"
 }
}

After restarting GoLand, the model can be used – but only if you have a lot of free time:

  1. Initializing the model (the first request) took 10–15 minutes on my computer. The model is loaded into RAM and feels quite at home there, except that it uses almost all of it…

  2. Output speed is roughly one word every 5 seconds. Usually a strict boss is what speeds up development, but in this case an accelerator will do. In our case, an external graphics accelerator.

Testing AI together with a video card

The technical side of connecting a video card will get a separate part of this article. For now, assume that I have already physically connected the video card and installed the drivers, CUDA and the Docker add-on for forwarding the video card into containers. So we continue working with Ollama.

We launch Ollama in Docker with the video card (an extra parameter forwards all video cards into the container):

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

In the container logs there will be an entry about the use of the video card:

.. library=cuda compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3060" total="11.8 GiB" available="10.6 GiB"

And now, a real miracle! The model works quite quickly. We make requests directly from GoLand:

  1. Initialization (first request) takes about 20 seconds

  2. Request processing is much faster (depending on the model, comparable to the speed of cloud AI)

The GPU load at the time of a request can be monitored with nvtop:

As the graph shows, video-memory utilization is high while GPU utilization is moderate.

For even more speed you can adjust the context size: the smaller it is, the faster the model works. Keep in mind, though, that the model then has less context to reason over, and the answer will be less relevant. This is especially noticeable with large models like Codestral. For my video card I have so far chosen almost the minimum size – 2048.

"models": [
 {
   "title": "Codestral",
   "apiBase": "http://localhost:11434/",
   "provider": "ollama",
   "model": "codestral:22b",
   "contextLength": 2048
 }
] ...

Usage examples

I was pleasantly surprised that an untrained person like me could start working with the model quite naturally right away.

For example, you can select a fragment in the code, add it to the context and ask directly in Russian: “What does this code do?”

You can also select a function and have a test written for it. This is the part I especially liked, despite all my love for tests:

For something more complex, you can try generating something new, for example: “Create a module for storing data in a Calls table with an identifier and a name, in Go, using the go-jet framework” (the prompt was in Russian). All these clarifications matter: if you do not specify the language, Python is chosen automatically; if you do not mention jet, you get GORM, and so on. But even with this task the model copes well:
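
To give an idea of what such a go-jet module might look like, here is a rough, hypothetical sketch for a Calls(id, name) table. It is not the model's actual output, and the import paths of the jet-generated table/model packages are made up:

// A hypothetical sketch of a go-jet storage module for a Calls table
// (id, name). The jet code generator is assumed to have produced the
// table metadata and model structs under the made-up paths below.
package calls

import (
    "database/sql"

    "github.com/go-jet/jet/v2/postgres"

    "example.com/project/gen/myapp/public/model" // hypothetical jet-generated models
    "example.com/project/gen/myapp/public/table" // hypothetical jet-generated table metadata
)

// Save inserts a single row into the Calls table.
func Save(db *sql.DB, id int64, name string) error {
    stmt := table.Calls.
        INSERT(table.Calls.ID, table.Calls.Name).
        VALUES(id, name)

    _, err := stmt.Exec(db)
    return err
}

// GetByID loads one row by its identifier.
func GetByID(db *sql.DB, id int64) (model.Calls, error) {
    var dest model.Calls

    stmt := postgres.
        SELECT(table.Calls.AllColumns).
        FROM(table.Calls).
        WHERE(table.Calls.ID.EQ(postgres.Int(id)))

    err := stmt.Query(db, &dest)
    return dest, err
}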

On the practical side, I tried adding a large chunk of code to the context and asking for a code review. I liked that the language of the answers was quite lively (look what it can do, the soulless machine!)

This is what a response from the StarCoder model looks like – quite “clear”, as if you were talking to someone outside your field:

Here are several video examples of working with the Codestral and DeepSeek Coder models. I used the same query on both:

Codestral gives a very neat, but rather slow answer:

DeepSeek Coder is fast and quite relevant:

Conclusions on the AI part:

Well, although Ollama does not yet pick up a ticket in Jira by itself and commit the fix for me (I am not even sure whether that would be good or bad), the usefulness of the model is already obvious.

Even a beginner can put the model to use. For example, to:

  • generate simple tests,

  • look for examples of using frameworks,

  • refactor code, summarize what is written in a particular commit, etc.

Context matters. The model operates on your request plus whatever you give it as input, be it a piece of code, a file or several files. The Continue plugin automatically indexes the project when it is opened, so you can use the keywords @codebase, @folder and @file in queries to narrow the context. You can ask how files are related – with Go this is especially relevant when there are 100,500 interfaces for a single implementation. The more precisely you formulate your request, the more relevant the results become. Keep in mind that the context size is limited and affects performance.

You need to learn to formulate queries and to experiment with models. Just as we once intuitively learned to phrase Google searches correctly, the same applies here. The skill of writing good prompts is honed only with practice: it is important to state clearly what needs to be done (something that is sometimes so sorely missing in tasks from analysts, right?)

You can generate large parts of a project using various technologies and languages. Let me remind you: the model knows 80 programming languages – do you?
