
My name is Anton Grashchenkov, and I develop in Java at Alfa-Bank. I have been interested in programming since school: I have written in many different languages, from Pascal to TypeScript, simply because I enjoy it. In this article, I will show examples of the tasks for which I use local models. Yes, there are plenty of cloud tools available, such as ChatGPT, Copilot, or YandexGPT. However, you can also run such a model locally, and it is extremely easy to do.

After all, if you want to, then why not?

Introduction

The development of AI, and of generative networks in particular, is about as much of a paradigm shift as the invention of spreadsheets. If you recall the history, spreadsheets made it possible to perform in minutes complex calculations that previously required weeks of painstaking work. Neural networks offer us a similar boost in efficiency.

So just as it is now hard to find anyone in IT who cannot use a spreadsheet, soon there will be hardly anyone who cannot use AI effectively.

And if the case for AI in general is clear, why local models?

The benefits of working locally

  • Offline. Local models work offline, so you don't need the Internet to use them. On the road, on a plane, on a train, out in the country – anywhere you can bring your laptop, you can use AI.

  • Privacy. If you use a local model, none of your data is sent to any servers. For some, this is important.

  • Company security policy. This point follows directly from the previous one. To work with code, that very code often has to be sent as part of the request context, and your company may explicitly prohibit this.

  • It's free. With a local model, you don't have to pay for API tokens or sign up for paid subscriptions. You only pay for the laptop and the electricity, of course.

Disadvantages of working locally

Naturally, local models have a number of very significant limitations.

  • Resources. The smarter the model, the more resources it consumes – and let me emphasize, they “eat” A LOT. So you may need to upgrade your “work” computer (and the video card along with it).

  • Speed. It is not great. If you ask GPT something, you expect an instant answer; a local model will make you wait a while, especially with a large context and weak hardware.

  • Accuracy. Not only will you have to wait, but the accuracy of the answer may not be high, so you should always double-check what the model tells you. Local models are not very accurate and can hallucinate quite a bit, something that happens far less with the big networks.

  • Language. The available models are usually trained mostly on English-language data, so it is better to communicate with them in English. You can also write queries in Russian, but the accuracy may suffer. It is probably better to learn English.

Now to the tools.

We will use Ollama. It allows us to download different models and run them locally. The list of available models is on the Ollama website; there are many of them, and they are constantly updated, so follow the news and try something new.

If you want a nice graphical interface and additional features, such as loading documents into the context or using vector databases, you can install AnythingLLM.

And of course, as developers, it is important to us that our chat works inside the IDE we use. I really like Continue.dev: it is available for both IntelliJ IDEA and VS Code.

Let's see how to set everything up and start using it.

Installation and configuration

The first thing we need to do is install Ollama. On Mac, it's pretty simple:

brew install ollama

For other operating systems everything is also quite simple. For example, on Linux it is enough to run:

curl -fsSL https://ollama.com/install.sh | sh

After that, you will need to download the models. There are many of them available on the official website – it is enough to know their names.

You can start, for example, with these four:

  • llama3 is a general-purpose model from Meta;

  • mistral is a general-purpose model from Mistral AI;

  • codellama is a model for working with code from Meta;

  • codegemma is a model for working with code from Google.

Models differ in the number of parameters. The more parameters, the smarter the model, but for running locally, models with 7-8 billion parameters (7B-8B) are the sweet spot.

The models are installed very simply – with the command:

ollama pull llama3

Here llama3 is the name of the model from the site.
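By default, pulling a model by its short name gives you whatever tag the site marks as the default (for the models above, typically the 7-8B variant). Model pages also list explicit size tags, so you can pin a specific size by pulling it by tag. The tags below only illustrate the convention and may differ on the site:

ollama pull llama3:8b
ollama pull codegemma:7b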

We install several models at once because different models have different strengths. For example, Mistral and llama3 work well with plain text (summarization, generating text from a given template), while codellama and codegemma specialize in working with code.
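To check which models are already downloaded (and how much disk space they occupy), run:

ollama list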

In order for Ollama to start working with our local clients, we need to run it in server mode. This can be done with the command:

ollama serve

Or you can run it in service mode, which is even better. On a Mac, the server is launched in service mode using the command:

brew services start ollama

This way, Ollama will always be up and running.
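To make sure the server is actually up, you can poke it with a plain request. By default it listens on port 11434 and answers a GET on the root path with a short status message along the lines of “Ollama is running”:

curl http://localhost:11434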

The last piece of configuration is Continue.dev, the plugin for our IDE. Its settings are stored in ~/.continue/config.json; it is enough to change the models section like this:

"models": [
  {
    "title": "Ollama",
    "provider": "ollama",
    "model": "AUTODETECT"
  }
]

We use the ollama provider and the model name. You can list all the models you want to use explicitly, or leave AUTODETECT – then you can choose from everything you have downloaded and installed.
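For comparison, an explicit list might look like this – a sketch using a couple of the models pulled earlier; the titles are arbitrary labels shown in the plugin's model picker:

"models": [
  {
    "title": "Llama 3",
    "provider": "ollama",
    "model": "llama3"
  },
  {
    "title": "CodeLlama",
    "provider": "ollama",
    "model": "codellama"
  }
]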

That's all for the settings – you can start using it.

How to use

There are several ways to use Ollama.

  1. You can run the model using the command ollama run llama3. In this case, we will switch to the interactive chat mode: we write our question and receive an answer.

  2. You can ask questions in non-interactive mode. To do this, pass the question as a parameter: ollama run llama3 'what is java?' (see the one-liner right after this list).

  3. You can also write your own agents by communicating with Ollama using the REST API.

  4. And of course, you can use the plugin in your favorite IDE. It provides a chat window where you can type queries.
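Since non-interactive mode is just another shell command, it combines nicely with the usual shell machinery. For example, you can drop a file into the prompt via command substitution (Main.java here is only a placeholder name):

ollama run codellama "Explain this code: $(cat Main.java)"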

Now let's see in which scenarios we can use Ollama.

#1. Working with utilities whose query format I don't remember well

For example, I don't remember the jq query format very well, and when I have to work with it, I always read the documentation.

Let's say I received a dump of some topics in a JSON file, and I want to keep only the topics that have more likes than dislikes.

[
  {
    "name": "How to install an LLM locally",
    "like": 2,
    "dislike": 1
  },
  {
    "name": "Writing bubble sort in 3 different ways",
    "like": 1,
    "dislike": 5
  },
  {
    "name": "How to delegate all your work to AI and earn a million an hour",
    "like": 9,
    "dislike": 2
  },
  {
    "name": "Can you program right in bash?",
    "like": 4,
    "dislike": 4
  },
  {
    "name": "Hooking ChatGPT up to vim",
    "like": 7,
    "dislike": 3
  },
  {
    "name": "Automatic conversion of JavaScript to Java",
    "like": 9,
    "dislike": 0
  }
]

So, let's launch the model:

ollama run llama3

Let's formulate a request:

I have an array of elements like { "name": "My Name", "like": 2, "dislike": 1 }. Write a jq query to filter elements which have likes more than dislikes, show only element's name.

Different models respond differently to the same prompts. Sometimes you get something weird, sometimes something good. But llama3, on the whole, copes well. Working with language models is not like programming: the same “code” can give different results.

In the end, I received this answer.

You can use the following `jq` command:

```bash
jq '.[] | select(.like > .dislike) | .name' your_array.json
```

Here's how it works:

- `.[]` iterates over each object in the array.
- `select(.like > .dislike)` filters out objects where likes are not greater than
dislikes. The `.` refers to the current object being processed.
- `.name` extracts and prints only the "name" value from the filtered objects.

Replace `your_array.json` with the actual path to your JSON file or input string.

And this is the correct command, you can see for yourself!
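For the record, running that command against the sample file above (saved here as topics.json – the file name is my own) prints exactly the topics where likes outnumber dislikes:

jq '.[] | select(.like > .dislike) | .name' topics.json

"How to install an LLM locally"
"How to delegate all your work to AI and earn a million an hour"
"Hooking ChatGPT up to vim"
"Automatic conversion of JavaScript to Java"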

#2. Working with utilities for which I don't remember the format of the configuration files well

These could be, for example, Dockerfile, Makefile or docker-compose.yaml. And instead of googling the syntax every time, you can use Ollama.

ollama run codegemma 'Write a docker-compose file. It should contain mongodb and minio. Map their ports to local machine. Add volumes to store data. Do not provide an explanation, just code.' > docker-compose.yaml

There are a couple of interesting points here.

  • First, I launched Ollama not in interactive mode, but simply as a command. This makes it possible to use Ollama to automate routine work. For example, you can generate test data or some class skeletons.

  • Second, the directive Do not provide an explanation, just code. I ask the model not to generate an explanation of what it did, so that I can redirect the response straight to a file and get a finished result. However, sometimes it still wraps the response in backticks to format it as a Markdown code block (an easy fix for this follows right after the list).
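When that happens, the stray fence lines are easy to strip in the same pipeline: a minimal sketch that simply drops any line starting with three backticks before the output reaches the file (the prompt is abbreviated here):

ollama run codegemma '<the same prompt as above>' | sed '/^```/d' > docker-compose.yaml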

And this is what I got as a result.

version: "3.8"

services:
  mongodb:
    image: mongo:latest
    restart: unless-stopped
    ports:
      - 27017:27017
    volumes:
      - mongodb_data:/data/db

  minio:
    image: minio/minio:latest
    restart: unless-stopped
    ports:
      - 9000:9000
      - 9001:9001
    volumes:
      - minio_data:/data

volumes:
  mongodb_data:
  minio_data:

Now let's look at something closer to code.

#3. Reading the code

Of course, I write the backend, but every now and then I have to look at what's going on in the frontend. If you don't know JavaScript/TypeScript very well, it can be quite difficult to read. And our assistant can help with this problem.

This time we'll use Visual Studio Code. Select all the code, press Cmd-L and ask to explain how the code works:

explain this code

The code itself will be added to the request context, and after thinking a little, the model will provide its explanations.

The model describes quite accurately that this is code for access control based on whitelists and that MongoDB with the Mongoose library is used to store users.

It describes quite a lot of details that are not spelled out in the code – that is, they are there, but only implicitly. We programmers know that well-chosen variable names carry half the meaning of a program, and the model picks up on that meaning. Not as well as we do, of course, but approximately.

#4. Own AI agents

And of course, the most interesting thing is to use the API to develop your own AI agents that can be assigned to perform routine actions. From a programming point of view, everything is extremely simple: Ollama provides a REST API (by default on port 11434), in which a couple of endpoints are enough for us. Let's see how this can be used.

You've probably come across modern terminal emulators like Warp that have a built-in AI assistant. It converts natural language queries into commands that can be executed in the terminal. Let's do the same, but for any terminal and using a local LLM.

First, let's look at the API. There is detailed documentation on GitHub, so it won't be hard to figure out. We need the api/generate endpoint, since we are not going to talk to the model in dialog mode but want a single answer.
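Before writing any code, the endpoint is easy to try by hand. A minimal request looks roughly like this, with streaming disabled so the whole answer arrives as a single JSON object:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "What is Java? Answer in one sentence.",
  "stream": false
}'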

You can generate the main part of the code using Ollama itself. It is faster, and at the same time it demonstrates how you can use a model to write programs in a language you don't know very well. We will use Go, and since I don't write in it very often, Ollama will have to help. The prompt is quite long and detailed, but that lets you get code as close as possible to what you want.

Write a program in go language. Generate only code, no explanations. Write a program to send http post request to local ollama server on port 11434. Endpoint is /api/generate. Payload is in json format and has fields model, prompt, system and stream. Get prompt from command line arguments, system and model from const. Set stream to false. Response is in json format and looks like

{
  "model": "llama3",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "response": "",
  "done": true,
  "context": [1, 2, 3],
  "total_duration": 10706818083,
  "load_duration": 6338219291,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 130079000,
  "eval_count": 259,
  "eval_duration": 4232710000
}.

Create and use struct types for the payload and response. Unmarshall the response and print "response" field.

As a result, we get a template for our agent. It is, of course, not ideal, but it compiles, works, solves the “blank page” problem, and can be modified however we like.

Here's what I got:
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

type Payload struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	System string `json:"system"`
	Stream bool   `json:"stream"`
}

type Response struct {
	Model              string    `json:"model"`
	CreatedAt          time.Time `json:"created_at"`
	Response           string    `json:"response"`
	Done               bool      `json:"done"`
	Context            []int     `json:"context"`
	TotalDuration      int64     `json:"total_duration"`
	LoadDuration       int64     `json:"load_duration"`
	PromptEvalCount    int       `json:"prompt_eval_count"`
	PromptEvalDuration int64     `json:"prompt_eval_duration"`
	EvalCount          int       `json:"eval_count"`
	EvalDuration       int64     `json:"eval_duration"`
}

func main() {
	// Get prompt from command line arguments
	prompt := os.Args[1]

	// Set system and model from const
	system := "llama"
	model := "llama3"

	// Set stream to false
	stream := false

	// Create payload
	payload := Payload{
		Model:  model,
		Prompt: prompt,
		System: system,
		Stream: stream,
	}

	// Marshal payload to JSON
	jsonPayload, err := json.Marshal(payload)
	if err != nil {
		log.Fatal(err)
	}

	// Send HTTP POST request to local ollama server on port 11434
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewBuffer(jsonPayload))
	if err != nil {
		log.Fatal(err)
	}

	// Read response body
	body, err := io.ReadAll(resp.Body)
	println(string(body))
	if err != nil {
		log.Fatal(err)
	}

	// Unmarshal JSON response
	var response Response
	err = json.Unmarshal(body, &response)
	if err != nil {
		log.Fatal(err)
	}

	// Print "response" field
	fmt.Println(response.Response)
}

Let's get down to the final modifications.

The first thing to do is change the system prompt, that is, the instructions on how the model should behave. Our model must know what we want from it and in what form. In our case, the model plays the role of a system administrator who writes Bash commands that do what I ask.

And, of course, only commands are needed, without explanations. It turns out something like this.

You are a system administrator. Write bash commands that do what I ask. Generate the commands and only the commands, no explanations or formatting. I want to copy your response directly to a terminal emulator and execute it.

Despite the request not to format the response, the model sometimes returns it wrapped in backticks, formatting it as a Markdown code block. So let's add a little processing that removes these backticks from the response.

And of course, we'll do a little refactoring, more for our own pleasure. Didn't I mention that I love writing code?

In the end I got this:
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"os/exec"
	"strings"
)

type Config struct {
	model  string
	system string
	url    string
}

func config() Config {
	return Config{
		model:  "codellama",
		system: "You are a system administrator. Write bash commands that do what I ask. Generate the commands and only the commands, no explanations or formatting. I want to copy your response direct to a terminal emulator and execute it.",
		url:    "http://localhost:11434/api/generate",
	}
}

type Response struct {
	Model              string `json:"model"`
	CreatedAt          string `json:"created_at"`
	Response           string `json:"response"`
	Done               bool   `json:"done"`
	Context            []int  `json:"context"`
	TotalDuration      int64  `json:"total_duration"`
	LoadDuration       int64  `json:"load_duration"`
	PromptEvalCount    int    `json:"prompt_eval_count"`
	PromptEvalDuration int64  `json:"prompt_eval_duration"`
	EvalCount          int    `json:"eval_count"`
	EvalDuration       int64  `json:"eval_duration"`
}

type Payload struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	System string `json:"system"`
	Stream bool   `json:"stream"`
}

func main() {
	// Get the prompt from command line arguments
	prompt := os.Args[1]

	resp, err := sendAiRequest(prompt)
	if err != nil {
		fmt.Println("Can not generate prompt: ", err)
		return // nothing to parse if the request failed
	}
	cmd := getCommand(resp.Response)
	fmt.Print(cmd)
}

func sendAiRequest(prompt string) (*Response, error) {
	cfg := config()

	payload := Payload{
		Model:  cfg.model,
		Prompt: prompt,
		System: cfg.system,
		Stream: false,
	}

	// Marshal the payload to JSON
	jsonPayload, err := json.Marshal(payload)
	if err != nil {
		fmt.Println("Error marshaling payload: ", err)
		return nil, err
	}

	// Send HTTP POST request to local ollama server on port 11434
	client := &http.Client{}
	req, err := http.NewRequest("POST", cfg.url, bytes.NewBuffer(jsonPayload))
	if err != nil {
		fmt.Println("Error creating request: ", err)
		return nil, err
	}
	req.Header.Add("Content-Type", "application/json")

	res, err := client.Do(req)
	if err != nil {
		fmt.Println("Error sending request: ", err)
		return nil, err
	}
	defer res.Body.Close()

	// Read response body
	body, err := io.ReadAll(res.Body)
	if err != nil {
		fmt.Println("Error reading response: ", err)
		return nil, err
	}

	// Unmarshal JSON response
	var response Response
	err = json.Unmarshal(body, &response)
	if err != nil {
		fmt.Println("Error unmarshaling response: ", err)
		return nil, err
	}

	return &response, nil
}

func getCommand(raw string) string {
	cmd := ""
	scanner := bufio.NewScanner(strings.NewReader(raw))
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "```") {
			continue // skip lines starting with ```
		}
		cmd += line + " "
	}
	cmd = strings.TrimSpace(cmd)
	if strings.HasPrefix(cmd, "`") && strings.HasSuffix(cmd, "`") {
		cmd = strings.TrimPrefix(cmd, "`")
		cmd = strings.TrimSuffix(cmd, "`")
	}
	return cmd
}

Let's compile the result with go build and check that our agent (I called it cli-ai) works.
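A minimal check from the shell, assuming the code lives in main.go; the binary name and the test request are my own, and the generated command will vary from run to run:

go build -o cli-ai main.go
./cli-ai 'show the total size of the current directory'

On a good run it prints a single ready-to-paste command, something like du -sh . for the request above.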

Works!

All that's left is to configure the shell. I'm using zsh, so let's add a function that calls our agent and bind it to Ctrl-A, i by adding the following code to ~/.zshrc:

function _ai() {
  prompt="'$BUFFER'"
  zle -R "Thinking about $prompt..."
  cmd=$(/Users/bidzo/develop/bidzo/go/ai-cli/cli-ai $prompt)
  BUFFER="$cmd"
  zle end-of-line
}
zle -N _ai
bindkey "^ai" _ai

We restart zsh and watch how artificial intelligence wakes up in the shell.
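To give a feel for the result: you type a plain-English request straight into the command line, press Ctrl-A, i, and the buffer is replaced with a command. The example below is purely illustrative – the request is mine and the exact answer depends on the model:

# typed into the prompt line:
find all files larger than 100 megabytes in my home directory
# what the buffer becomes after Ctrl-A, i (one possible answer):
find ~ -type f -size +100M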

As you can see, writing your own simple agent is not difficult, and in any language you like. This approach can be used to automate any routine, the main thing is to clearly understand what result you want to get.

Summary

As you can see, at their current level local models are still far from being able to write anything decent on their own, without constant help from a developer. The resulting code often not only doesn't work, it doesn't even compile.

However, they can help quickly put together the skeleton of an application and the necessary classes, and they often give interesting ideas worth developing. That is extremely useful, especially if you have no Internet access ;). It is not for nothing that Copilot is called a copilot and not an autopilot: this thing helps us write code, but it does not write it for us yet.

The programmer still has to:

  • Formulate the task extremely precisely and clearly. Without this, there will definitely be no result. AI is not human: instead of asking about an unclear point, it will fill in the gap itself, and most likely not the way we would like.

  • Check the result.

  • Polish the resulting code by hand until it actually runs. Because, surprise, the model does not understand anything about programming – it only knows how to arrange words so that the result looks like a program. So it still won't work without a programmer.
