Distributed inference of llama.cpp via RPC
Greetings, Habr readers!
The idea for this publication has been spinning around in my head for a long time. One of my hobbies is distributed computing, and another is neural networks, so I have long been haunted by the idea of running LLM inference on several computers at once, so that they all work on the same model in parallel.
After googling for a while, I found out that the LocalAI project has supported this feature for quite some time. Without thinking twice, I rolled it out on several computers, completed all the necessary settings to link the instances into a single system, and was, to put it mildly, disappointed: the solution turned out to be "fatally insufficient". The Docker image was not built optimally: it was huge and available only for amd64; the project shipped with a web interface that could not be disabled; the selection of models was meager; some of the available LLMs did not work in RPC mode; all embedding models also refused to run in this mode; and so on and so forth.
After fiddling around a bit more, I looked into the source code and found a mention of the llama.cpp project, and then of the rpc-server binary it invokes. And so I landed on the llama.cpp/examples/rpc page, and that is where it all took off…
Brief(?) overview
Let's first ask GigaChat what the RPC protocol is:
The RPC (Remote Procedure Call) protocol allows programs to call functions or procedures in another address space, on remote hosts, or on independent systems on the same host. It includes a network protocol for client-server data exchange and an object serialization language for encoding data as it is transmitted over a network.
There are various implementations of RPC, including SOA, CORBA, and DCOM. TCP and UDP are often used for the transport layer, but HTTP-based implementations also exist. Examples of RPC implementations include XML-RPC, which uses XML to encode messages and HTTP as a transport mechanism, and gRPC, which uses HTTP/2 and Protocol Buffers to describe interfaces. RPC is widely used in various network services, including NFS.
In the llama.cpp project, this protocol is implemented in a client-server format: utilities such as llama-server, llama-cli, llama-embedding and so on act as RPC clients, while the specialized rpc-server binary acts as the RPC server.
Very briefly, it works like this:

An RPC client, say llama-server, receives a list of RPC servers and a model via command-line arguments at startup;
The RPC client reads the model and "slices" its layers so that they are distributed evenly across all the RPC servers;
The RPC client then uploads the layers to the servers and starts inference.
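To illustrate the slicing step, here is a small sketch of an even split of layers across backends. The hostnames and the layer count are made-up examples, and the real distribution logic lives inside llama.cpp; this is only meant to show the arithmetic:

```shell
#!/bin/sh
# Illustration only: evenly dividing a model's layers across RPC
# backends. Hostnames and the layer count are made-up examples;
# llama.cpp performs the actual split internally.
TOTAL_LAYERS=32
BACKENDS="backend01:50052 backend02:50052 backend03:50052"

# Count the backends
N=0
for b in $BACKENDS; do N=$((N + 1)); done

BASE=$((TOTAL_LAYERS / N))    # layers every backend gets at minimum
EXTRA=$((TOTAL_LAYERS % N))   # remainder spread over the first backends

i=0
for b in $BACKENDS; do
  COUNT=$BASE
  if [ "$i" -lt "$EXTRA" ]; then COUNT=$((COUNT + 1)); fi
  echo "$b -> $COUNT layers"
  i=$((i + 1))
done
```

With 32 layers and three servers, the first two backends receive 11 layers each and the third receives 10.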
In general terms, the whole scheme looks like this:
At the same time, rpc-server can be built for different backends: for different processor architectures and with support for particular features. Say, you can build one RPC server for x86_64 with CUDA support, a second for x86_64 without CUDA, and a third, say, for ARM64 to run on a Repka Pi 3 and… the RPC client will happily work with all of them and perform inference.
Building binaries
Having carefully studied the build instructions for both the server and the clients, I concluded that at least four binaries are needed to solve the task:

llama-cli – a command-line utility for running LLM inference;
llama-embedding – a command-line utility for running inference of embedding models;
llama-server – a very simple API server that can work both in LLM inference mode and in embedding-model inference mode;
rpc-server – the binary that will run on the remote machines and do all the inference work.
Very briefly, building llama.cpp can be done in three easy steps.
We install the packages required for the build:
apt install -fyq bash wget git make g++
We clone the repository to our host and go to the directory with the sources:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
We start the compilation (the instructions give an example using cmake, but I prefer make):
GGML_RPC=ON make llama-server llama-cli llama-embedding rpc-server libggml.so libllama.so
Importantly, the environment variable GGML_RPC=ON must be set before invoking make (export works too, but I find the inline form more convenient); this variable enables the parts of the build that add RPC support.
Once the compilation is complete, the executable binaries will appear in the llama.cpp directory.
Building Docker images
The ability to compile binaries for different architectures is certainly a useful thing, but what if we have, say, a dozen computers and virtual machines, or a Kubernetes cluster? Surely we won't run the compilation on every node? Of course not! Instead, we will apply DevOps practices and package the binaries into Docker images.
For the sake of unification, the Ubuntu 22.04 LTS library image was chosen as the base, since it is also used in the nvidia/cuda base containers.
For this project I decided to use a multi-stage build divided into two stages.
The first stage downloads everything needed for compilation and performs the compilation itself:
FROM ubuntu:22.04 AS builder
WORKDIR /app
ARG LLAMACPP_REPO="https://github.com/ggerganov/llama.cpp.git"
ARG LLAMACPP_VERSION="master"
# Install dependencies
RUN apt update -q \
&& apt install -fyq bash wget git make g++ \
&& apt clean
# Clone repo
RUN git clone --branch "$LLAMACPP_VERSION" --depth 1 "$LLAMACPP_REPO"
# Build binaries
WORKDIR /app/llama.cpp
RUN GGML_RPC=ON make -j$(nproc) llama-server llama-cli llama-embedding rpc-server libggml.so libllama.so
In the second stage, the built binaries are copied into a clean base image:
FROM ubuntu:22.04
WORKDIR /app
# Install basic dependencies
RUN apt update -q \
&& apt install -fyq libgomp1 \
&& apt clean
# Create folders
RUN mkdir -pv /app/models
# Copy compiled tools
COPY --from=builder /app/llama.cpp/libllama.so /usr/lib/x86_64-linux-gnu
COPY --from=builder /app/llama.cpp/libggml.so /usr/lib/x86_64-linux-gnu
COPY --from=builder /app/llama.cpp/rpc-server .
COPY --from=builder /app/llama.cpp/llama-cli .
COPY --from=builder /app/llama.cpp/llama-embedding .
COPY --from=builder /app/llama.cpp/llama-server .
# Init entrypoint
ADD entrypoint.sh .
ENTRYPOINT ["/app/entrypoint.sh"]
The full Dockerfile is available in the GitHub repository.
Building Docker images with CUDA support
There are no fundamental differences from the Ubuntu-based Dockerfile, except that the first build stage uses an nvidia/cuda:devel container and the second stage uses nvidia/cuda:runtime.
# Stage 1
FROM nvidia/cuda:12.5.1-devel-ubuntu22.04 AS builder
# Stage 2
FROM nvidia/cuda:12.5.1-runtime-ubuntu22.04
The full Dockerfile.cuda is available in the GitHub repository.
About entrypoint.sh
Since I wanted to build a universal container usable in various modes, I had to implement a special entrypoint.sh script that runs every time the container starts.
According to the plan, the container will operate in the following modes:
backend
The mode that starts rpc-server; the command to start the server looks like this:
rpc-server --host "0.0.0.0" --port "50052" --mem "1024"
Note the somewhat unusual --mem option: it specifies how much memory (in megabytes) this RPC server may use. If the rpc-server is built with CUDA support, the parameter limits VRAM (video memory); if built without CUDA, it limits system RAM.
server
The mode that starts llama-server, a simple API server that lets you interact with large (and small) language models and embedding models; the launch command looks like this:
llama-server --host "0.0.0.0" --port "8080" --model "/app/models/TinyLlama-1.1B-q4_0.gguf" --gpu-layers 99 --rpc backend01:50052,backend02:50052
Pay attention to the --gpu-layers option here: under normal circumstances it sets the maximum number of layers that can be offloaded to GPU memory, but when the --rpc option is specified its behavior changes, and it sets how many layers may be offloaded to the RPC servers.
In the --rpc option we list, comma-separated, the hosts and ports of the RPC servers the RPC client should connect to.
none
A special mode that runs sleep inf, so that you can attach to the container and manually launch llama-cli or, say, llama-embedding.
Putting all of this into a single script yields a universal entrypoint.sh.
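A minimal sketch of what such a mode-dispatching script could look like (the real entrypoint.sh lives in the GitHub repository; APP_PORT and APP_GPU_LAYERS are hypothetical variable names I added for illustration, while APP_MODE, APP_MODEL, APP_MEM and APP_RPC_BACKENDS match the docker-compose example shown later in the article):

```shell
#!/bin/sh
# Sketch of a mode-dispatching entrypoint; the real entrypoint.sh is
# in the repository. APP_PORT and APP_GPU_LAYERS are hypothetical
# names added for illustration.
APP_MODE="${APP_MODE:-backend}"

case "$APP_MODE" in
  backend)
    CMD="./rpc-server --host 0.0.0.0 --port ${APP_PORT:-50052} --mem ${APP_MEM:-1024}"
    ;;
  server)
    CMD="./llama-server --host 0.0.0.0 --port ${APP_PORT:-8080} \
--model ${APP_MODEL:?APP_MODEL is required} \
--gpu-layers ${APP_GPU_LAYERS:-99} \
--rpc ${APP_RPC_BACKENDS:?APP_RPC_BACKENDS is required}"
    ;;
  none)
    CMD="sleep inf"
    ;;
  *)
    echo "Unknown APP_MODE: $APP_MODE" >&2
    exit 1
    ;;
esac

echo "Starting: $CMD"
# The real script would hand control to the command: exec $CMD
```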
Cross-platform Docker image building
One of the nice features of the Ubuntu library image is that it is published for multiple processor architectures; the ones that mattered to me were amd64, arm64 and arm/v7. Why I need the first is clear, and the last two are needed so that the RPC server can run on microcomputers. The nvidia/cuda container, however, is shipped only for amd64 and arm64.
The build itself will be performed with docker buildx, a plugin that extends Docker's basic functionality; in our case, what matters is its ability to cross-build containers, since the ARM64 images will be built on an x86_64 processor.
So, first let's create a buildx builder; let's call it, say, my_builder:
docker buildx create --name my_builder --driver=docker-container
Next, assuming that the Dockerfile and entrypoint.sh are located in a directory called llama.cpp:
docker buildx build --builder=my_builder --platform=linux/amd64,linux/arm64,linux/arm/v7 --build-arg LLAMACPP_VERSION=master ./llama.cpp/
Here the build is performed for three architectures, using the HEAD of the master branch of the llama.cpp repository.
By adding the --tag=${owner}/${repo}:${tag} and --push options, we can tag the images and push them to a registry.
A complete example of building and publishing the containers with GitHub Actions is available in the GitHub repository.
Launch via Docker Compose
So, suppose we have built the containers, pushed them to Docker Hub, and now want to run all of this on our hardware. Say we have two servers: one has a video card, but only 1 GB of VRAM is available, while the other has no video card and can spare only 2 GB of RAM. We plan to run the TinyLlama 1.1B model on them, with the user interacting with the API server.
In general, such a scheme will look like this:
As a result, we get the following docker-compose.yml:
version: "3.9"
services:

  main:
    image: evilfreelancer/llama.cpp-rpc:latest
    restart: unless-stopped
    volumes:
      - ./models:/app/models
    environment:
      APP_MODE: server
      APP_MODEL: /app/models/TinyLlama-1.1B-q4_0.gguf
      APP_RPC_BACKENDS: backend-cuda:50052,backend-cpu:50052
    ports:
      - "127.0.0.1:8080:8080"

  backend-cpu:
    image: evilfreelancer/llama.cpp-rpc:latest
    restart: unless-stopped
    environment:
      APP_MODE: backend
      APP_MEM: 2048

  backend-cuda:
    image: evilfreelancer/llama.cpp-rpc:latest-cuda
    restart: unless-stopped
    environment:
      APP_MODE: backend
      APP_MEM: 1024
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [ gpu ]
Next, alongside docker-compose.yml, create a models directory and download the TinyLlama-1.1B-q4_0.gguf file into it.
We launch the stack with the command:
docker compose up -d
Then we wait a while, and once the stack is up we can try running inference via curl:
curl \
--request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Building a website can be done in 10 simple steps:"}'
The response will be something like this:
What's next?
In principle, the project is already usable: it has everything needed, and whatever is missing can be added later without much effort.
What I would like to draw your attention to: there is a small PR to the ollama project (still unmerged at the time of publication) and a related discussion in the ollama issue tracker. In short, the developers want to add the ability to perform distributed inference on RPC backends, similar to what was demonstrated in this publication. So in the future I plan to try to make ollama work with my Docker containers.
I also plan to use these containers in Kubernetes, so in the near future I will most likely prepare a k8s operator, or at least a Helm-chart deployment, to simplify rolling the servers out across the nodes.
I also have quite a few microcomputers on my shelf, as well as two special TuringPi v1 boards for clustering Raspberry Pi CM3 modules; I plan to experiment with them in the future as well, which is exactly why arm/v7 is present among the listed container architectures.
In general, things are progressing steadily, if only there were enough time…
With this I take my leave. Thank you for reading the article to the end; if you are interested in what happens to this project next, I invite you to my Telegram channel @evilfreelancer.