Distributed llama.cpp inference via RPC

Greetings, Habr readers!

The idea for this publication has been brewing in my head for a long time. One of my hobbies is distributed computing, another is neural networks, and I have long been haunted by the idea of running LLM inference on several computers at once, so that they all work on the same model in parallel.

After googling for a while I found out that the LocalAI project has supported this feature for quite some time. Without thinking twice, I rolled it out on several computers, did all the necessary configuration, linked all the instances into a single system and, to put it mildly, was disappointed: the solution turned out to be "fatally insufficient". The Docker image was not built optimally, it was huge and available only for amd64, a web interface that could not be disabled came bundled with the project, the selection of models was meager, some of the available LLMs did not work in RPC mode, all embedding models also refused to run in this mode, and so on and so forth.

After fiddling around a bit more, I looked into the source code, found a mention of the llama.cpp project, and then a call to the rpc-server binary. And so I ended up on the llama.cpp/examples/rpc page, and off it went…

Brief(?) overview

First, let's ask GigaChat what the RPC protocol is:

The RPC (Remote Procedure Call) protocol allows programs to call functions or procedures in another address space, on remote hosts, or on independent systems on the same host. It includes a network protocol for client-server data exchange and an object serialization language for encoding data as it is transmitted over a network.
There are various implementations of RPC, including SOA, CORBA, and DCOM. TCP and UDP are often used for the transport layer, but HTTP-based implementations also exist. Examples of RPC implementations include XML-RPC, which uses XML to encode messages and HTTP as a transport mechanism, and gRPC, which uses HTTP/2 and Protocol Buffers to describe interfaces. RPC is widely used in various network services, including NFS.

In the llama.cpp project, this protocol is implemented in a client-server fashion: utilities such as llama-server, llama-cli, llama-embedding and so on act as RPC clients, while the specialized rpc-server binary acts as the RPC server.

Very briefly, it all works like this:

  1. An RPC client, say llama-server, receives a list of RPC servers and a model via command-line arguments at startup;

  2. The RPC client reads the model and then “slices” its layers so that they are evenly distributed among all RPC servers;

  3. Next, the RPC client distributes the layers across the servers and starts the inference.

In general terms, this whole scheme will look like this:

RPC system diagram

At the same time, rpc-server can be built for different backends: different processor architectures, with or without support for particular features. For example, you can build one RPC server for x86_64 with CUDA support, a second for x86_64 without CUDA, and a third, say, for ARM64 to run on a RepkaPi 3, and the RPC client will happily work with all of them and perform inference.
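As a rough sketch of how that looks in practice (the host names and layer count here are illustrative; the flags are the same ones used throughout this article):

# on an x86_64 machine with CUDA (hypothetical host: gpu-box)
./rpc-server --host 0.0.0.0 --port 50052

# on an ARM64 microcomputer (hypothetical host: repka)
./rpc-server --host 0.0.0.0 --port 50052

# on the client machine: the model layers get split across both servers
./llama-cli --model ./TinyLlama-1.1B-q4_0.gguf \
    --rpc gpu-box:50052,repka:50052 \
    --gpu-layers 99 \
    --prompt "Hello"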

Building binaries

Having carefully studied the build instructions for both the server and the clients, I concluded that solving the problem requires at least four binaries:

  • llama-cli – a command line utility that allows you to run LLM inference;

  • llama-embedding – a command line utility that allows you to run inference of embedding models;

  • llama-server – this is a very simple API server that can work both in LLM inference mode and in embedding model inference mode;

  • rpc-server – a binary that will run on remote machines and perform all the inference work.

Now, very briefly: building llama.cpp can be done in three easy steps.

  1. We install the packages required for the build:

apt install -fyq bash wget git make g++
  2. We clone the repository to our host and change into the source directory:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
  3. We start the compilation (the instructions give an example using cmake, but I prefer make):

GGML_RPC=ON make llama-server llama-cli llama-embedding rpc-server libggml.so libllama.so

Important: before running make, set the GGML_RPC=ON environment variable (you can do it via export, but I find the inline form more convenient); this variable enables the parts of the build that add RPC support.

Once compilation is complete, the executable binaries listed in the make command will appear in the source directory.
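A quick way to check the result is simply to list the built artifacts, for example:

ls -lh llama-server llama-cli llama-embedding rpc-server libggml.so libllama.so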

Building Docker images

The ability to compile binaries for different architectures is certainly useful, but what if we have, say, a dozen computers and virtual machines, or a Kubernetes cluster? Surely we are not going to run the compilation on every node? Of course not! Instead, we will follow DevOps practice and package the binaries into Docker images.

For the sake of unification, the Ubuntu 22.04 LTS library image was chosen as the base, since it is also used in the nvidia/cuda base containers.

To implement the project I decided to use a multi-stage build consisting of two stages.

The first stage downloads everything needed for compilation and performs the build itself:

FROM ubuntu:22.04 AS builder
WORKDIR /app

ARG LLAMACPP_REPO="https://github.com/ggerganov/llama.cpp.git"
ARG LLAMACPP_VERSION="master"

# Install dependencies
RUN apt update -q \
 && apt install -fyq bash wget git make g++ \
 && apt clean

# Clone repo
RUN git clone --branch "$LLAMACPP_VERSION" --depth 1 "$LLAMACPP_REPO"

# Build binaries
WORKDIR /app/llama.cpp
RUN GGML_RPC=ON make -j$(nproc) llama-server llama-cli llama-embedding rpc-server libggml.so libllama.so

In the second stage, the built binaries are copied into a clean base image:

FROM ubuntu:22.04
WORKDIR /app

# Install basic dependencies
RUN apt update -q \
 && apt install -fyq libgomp1 \
 && apt clean

# Create folders
RUN mkdir -pv /app/models

# Copy compiled tools  
COPY --from=builder /app/llama.cpp/libllama.so /usr/lib/x86_64-linux-gnu
COPY --from=builder /app/llama.cpp/libggml.so /usr/lib/x86_64-linux-gnu
COPY --from=builder /app/llama.cpp/rpc-server .
COPY --from=builder /app/llama.cpp/llama-cli .
COPY --from=builder /app/llama.cpp/llama-embedding .
COPY --from=builder /app/llama.cpp/llama-server .

# Init entrypoint  
ADD entrypoint.sh .  
ENTRYPOINT ["/app/entrypoint.sh"]

The full Dockerfile is in the GitHub repository.
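For a quick single-architecture build on the local machine (the image tag here is arbitrary), an ordinary docker build from the directory containing the Dockerfile and entrypoint.sh is enough; cross-platform builds are covered below:

docker build -t llama.cpp-rpc:local .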

Building Docker images with CUDA support

There are no fundamental differences from the Ubuntu-based Dockerfile, except that the first build stage uses the nvidia/cuda:devel image and the second stage uses nvidia/cuda:runtime.

# Stage 1
FROM nvidia/cuda:12.5.1-devel-ubuntu22.04 AS builder
# Stage 2
FROM nvidia/cuda:12.5.1-runtime-ubuntu22.04

The full Dockerfile.cuda is in the GitHub repository.

About entrypoint.sh

Since I wanted to build a universal container that could be used in various modes, I had to implement a special entrypoint.sh script that is executed every time the container starts.

According to the plan, the container will operate in the following modes:

backend
The mode in which rpc-server is started; the launch command looks like this:

rpc-server --host "0.0.0.0" --port "50052" --mem "1024"

Notice the somewhat unusual --mem option: it specifies how much memory (in megabytes) this RPC server is allowed to use. If rpc-server is built with CUDA, the parameter refers to the amount of VRAM (video memory); if it is built without CUDA support, it refers to system RAM.

server
The mode in which llama-server is started: a simple API server that lets you interact with large (and small) language models as well as embedding models. The launch command looks like this:

llama-server --host "0.0.0.0" --port "8080" --model "/app/models/TinyLlama-1.1B-q4_0.gguf" --gpu-layers 99 --rpc backend01:50052,backend02:50052

Pay attention to the --gpu-layers option: normally it sets the maximum number of layers that can be offloaded to GPU memory, but when --rpc is specified its behavior changes, and it sets how many layers can be offloaded to the RPC servers.

In the --rpc option we list, separated by commas, the hosts and ports of the RPC servers that the RPC client will connect to.

none
A special mode that just runs sleep inf, so that you can attach to the container and manually launch llama-cli or, say, llama-embedding.
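For example, you could then attach to such a container and run inference by hand (the container name here is hypothetical):

docker exec -it llama-rpc-idle /app/llama-cli \
    --model /app/models/TinyLlama-1.1B-q4_0.gguf \
    --prompt "Building a website can be done in 10 simple steps:" \
    --n-predict 64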

Putting all of this together in one script gives us a universal entrypoint.sh.
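The actual script lives in the GitHub repository; below is only a minimal sketch of the idea, using the APP_MODE, APP_MEM, APP_MODEL and APP_RPC_BACKENDS variables that also appear in the Docker Compose example later in the article (APP_BIND, APP_PORT and APP_GPU_LAYERS are my own additions for illustration):

#!/usr/bin/env bash
set -e

case "${APP_MODE:-none}" in
  backend)
    # RPC worker: serves its share of the model layers over the network
    exec /app/rpc-server \
      --host "${APP_BIND:-0.0.0.0}" \
      --port "${APP_PORT:-50052}" \
      --mem "${APP_MEM:-1024}"
    ;;
  server)
    # API server: loads the model and spreads layers across the RPC backends
    exec /app/llama-server \
      --host "${APP_BIND:-0.0.0.0}" \
      --port "${APP_PORT:-8080}" \
      --model "${APP_MODEL}" \
      --gpu-layers "${APP_GPU_LAYERS:-99}" \
      --rpc "${APP_RPC_BACKENDS}"
    ;;
  none)
    # Idle mode: keep the container alive for manual runs of llama-cli etc.
    exec sleep inf
    ;;
  *)
    echo "Unknown APP_MODE: ${APP_MODE}" >&2
    exit 1
    ;;
esac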

Cross-platform Docker image building

One of the nice features of the Ubuntu library image is that it is available for multiple processor architectures; the ones that mattered to me were amd64, arm64 and arm/v7. The first is an obvious choice, while the last two are needed so that an RPC server can run on microcomputers. The nvidia/cuda container, by contrast, is published only for amd64 and arm64.

The build itself is done with docker buildx, a plugin that extends Docker's basic functionality; in our case we only care about its ability to cross-build containers, since the ARM64 images are to be built on an x86_64 processor.

So, first let's create a buildx builder; let's call it, say, my_builder.

docker buildx create --name my_builder --driver=docker-container

Next, let's assume that the Dockerfile and entrypoint.sh are located in a directory called llama.cpp:

docker buildx build --builder=my_builder --platform=linux/amd64,linux/arm64,linux/arm/v7 --build-arg LLAMACPP_VERSION=master ./llama.cpp/

Here the build targets three architectures, and the version used is HEAD of the master branch of the llama.cpp repository.

By adding the --tag=${owner}/${repo}:${tag} and --push options we can tag the images and push them to a registry.
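For example (the tag here reuses the image name from the Compose example below):

docker buildx build --builder=my_builder \
    --platform=linux/amd64,linux/arm64,linux/arm/v7 \
    --build-arg LLAMACPP_VERSION=master \
    --tag=evilfreelancer/llama.cpp-rpc:latest \
    --push \
    ./llama.cpp/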

A full example of building and publishing the containers via GitHub Actions can be found in the GitHub repository.

Launch via Docker Compose

So, let's say we've built a few containers, pushed them to Docker Hub, and now we want to run all this on our hardware. Say we have two servers: on one we can use a GPU, but only 1 GB of VRAM; on the other there is no GPU and we can use only 2 GB of RAM. We plan to run the TinyLlama 1.1B model on them, with the user interacting through the API server.

In general, such a scheme will look like this:

Scheme of two RPC servers and one RPC client

As a result, we will get the following docker-compose.yml

version: "3.9"

services:

  main:
    image: evilfreelancer/llama.cpp-rpc:latest
    restart: unless-stopped
    volumes:
      - ./models:/app/models
    environment:
      APP_MODE: server
      APP_MODEL: /app/models/TinyLlama-1.1B-q4_0.gguf
      APP_RPC_BACKENDS: backend-cuda:50052,backend-cpu:50052
    ports:
      - "127.0.0.1:8080:8080"

  backend-cpu:
    image: evilfreelancer/llama.cpp-rpc:latest
    restart: unless-stopped
    environment:
      APP_MODE: backend
      APP_MEM: 2048

  backend-cuda:
    image: evilfreelancer/llama.cpp-rpc:latest-cuda
    restart: "unless-stopped"
    environment:
      APP_MODE: backend
      APP_MEM: 1024
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [ gpu ]

Next, you need to create a models directory next to docker-compose.yml and download the TinyLlama-1.1B-q4_0.gguf file into it.
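One possible way to do that (the Hugging Face repository and file name below are an assumption on my part; any GGUF build of TinyLlama will do, as long as the target file name matches the one in docker-compose.yml):

mkdir -p ./models
# example source; replace with whichever GGUF quantization you prefer
wget -O ./models/TinyLlama-1.1B-q4_0.gguf \
    "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_0.gguf"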

We start the Compose stack with:

docker compose up -d
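To check that the backends and the API server have come up, you can watch the logs, for example:

docker compose logs -f main backend-cpu backend-cuda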

We wait a little, and once the stack is up we can try running inference via curl:

curl \
    --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:"}'

The response will be something like this:

llama.cpp server response

What's next?

In principle, the project is already usable: it has everything you need, and whatever is missing can be added later without much effort.

What I would like to draw your attention to is a small PR to the ollama project (still unmerged at the time this article was published) and a related discussion in the ollama issue tracker. In short, the developers want to add the ability to perform distributed inference on RPC backends, similar to what was demonstrated in this publication. So in the future I plan to try to make ollama friends with my Docker containers.

I also plan to use these containers in Kubernetes, so in the near future I will most likely prepare a k8s operator, or at least a Helm chart deployment, to simplify rolling the servers out across nodes.

I also have quite a few microcomputers on my shelf, as well as two special TuringPi v1 boards for clustering Raspberry Pi CM3 modules. I plan to run experiments on them in the future too, which is exactly why arm/v7 appears among the listed container architectures.

In general, things are moving along steadily; if only there were more time…

And with that I take my leave. Thank you for reading to the end; if you are interested in what happens to this project next, I invite you to my Telegram channel @evilfreelancer.

