How to cope with the load on Black Friday? Autoscaling inference in Kubernetes
Use navigation if you don't want to read the full text:
→ Load in ML Production
→ How Node Autoscaling Works in K8s
→ Autoscaling gpt2 on vLLM
→ Conclusion
Load in ML Production
The best way to talk about the load on ML and inference services is with ChatGPT. To support its infrastructure, OpenAI uses 3,617 HGX A100 servers. This allows it to serve from 100 to 500 million monthly active users (MAU).
If you look at the statistics for the last 90 days of ChatGPT operation, you can see that even such an IT giant cannot always cope with incoming traffic – note the red lines of service availability.
OpenAI service availability status panel.
Inference itself is almost no different from a regular web service. The user sends a request to the endpoint, the model makes a prediction based on the request and returns a response in the same format – for example, JSON. To cope with a large load, more replicas must be deployed, and more replicas require more free resources. The cloud is a good fit for such systems, since it has spare resources for additional incoming load (though of course not in all cases).
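As a minimal illustration of that contract (the handler and "model" here are stand-ins, not a real service):

```python
import json

def predict(prompt: str) -> str:
    """Stand-in for a real model; a production service would run inference here."""
    return f"echo: {prompt}"

def handle_request(body: str) -> str:
    """Takes a JSON request, runs the 'model', returns a JSON response."""
    request = json.loads(body)
    response = {"model": "stub", "output": predict(request["prompt"])}
    return json.dumps(response)

print(handle_request('{"prompt": "hello"}'))
```

Scaling such a service horizontally is just a matter of running more copies of the handler behind a balancer – which is exactly what Kubernetes automates for us below.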
Now suppose we want to deploy inference in production, and with a GPU in the cloud – for example, on top of Selectel Managed Kubernetes (MKS). Let's figure out what we'll have to deal with.
How Node Autoscaling Works in K8s
The initial state of our system is a deployed Managed Kubernetes cluster with one GPU node. The node runs an inference service that accepts HTTP requests – for example, the gpt2 model.
The GPU operator is responsible for supporting services for working with video cards. Read more about it in my previous article.
Initial state of the system during autoscaling.
Next, we send traffic to our replica and notice that clients have started receiving responses from the inference with a delay of more than one second. What happens next in our system? Let's take a closer look.
Horizontal Pod Autoscaler
HPA (Horizontal Pod Autoscaler) comes into play. We have configured it in advance with the requirement that request latency must not exceed one second. As soon as it does, the system deploys another replica of our service.
A new replica has been added.
When a new replica is started, it requests the resource nvidia.com/gpu: 1, which indicates the presence of a GPU on the node. In this case, we have no available node with this resource.
K8s autoscaler
In the Selectel cloud, node autoscaling is implemented with a fork of the Cluster Autoscaler. The autoscaler checks the availability of resources – CPU, RAM, etc. – and watches for the nvidia.com/gpu resource that the new replica is missing.
A new node has appeared.
The autoscaler brings up a node from the base image in the group where the new replica is scheduled. The time to deploy a new node depends on the size of the selected flavor – usually up to five minutes. Then the installation of K8s services begins.
Managed Kubernetes Services
At this stage, the necessary K8s services are installed on the new node in the form of systemd units: containerd, kubelet, mk-node-adm, mk-node-health. This takes up to a minute.
GPU operator
Since we are working with GPUs, the node needs to be prepared. The GPU operator installs the necessary drivers and toolkits and configures the device plugin. The latter is what exposes the nvidia.com/gpu resource for our new replica.
Drivers are installed on the node.
After all the drivers are installed, the node is ready – a replica can now be scheduled on it. This takes about three minutes.
Image pulling
The image is being pulled to the new node. The time depends on the image size, the channel bandwidth, and the computing power available to extract the image to the node.
Inference is allocated to a new node.
For a 20 GB image – which is quite common in ML – the pull will take about six minutes (over a 1 Gbps channel).
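As a rough sanity check of that figure, here is the network-transfer part of the estimate only; layer decompression and extraction account for the remaining minutes:

```python
def transfer_seconds(image_gb: float, link_gbps: float) -> float:
    """Pure network transfer time: gigabytes -> gigabits -> seconds."""
    return image_gb * 8 / link_gbps

# A 20 GB image over a 1 Gbps channel: the wire alone takes ~2.7 minutes.
seconds = transfer_seconds(20, 1.0)
print(f"{seconds:.0f} s ~= {seconds / 60:.1f} min")  # 160 s ~= 2.7 min
```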
This is quite a long stage for a regular pull, isn't it? In my Telegram channel I described possible ways to optimize image pull time, so drop in for a chat. I also plan to cover the optimization options in more detail in the next article.
Now let's move on to practice and try to build our own high-load inference service.
Autoscaling gpt2 on vLLM
Let's walk through the example I showed at the webinar. All the code is in the repository, so feel free to reuse it.
What components do we need?
Infrastructure
In the webinar, I deployed a Managed Kubernetes cluster using Terraform. If you are familiar with this tool, it will be easy to take the code from the repository and deploy the cluster.
Here we will see how to deploy Managed Kubernetes in the cloud with the autoscaling option via the control panel. Overall this is no different from the usual flow, so I will only highlight the specifics.
1. Create a cluster and specify default settings:
2. Specify the region, K8s version and cluster fault tolerance. When deploying, select a node group and specify autoscaling:
With the Autoscaling option enabled, you can select from 2 to 20 nodes per group. Quotas can be raised individually through support.
3. In the node configuration, select a flavor with a GPU, for example Tesla T4:
Also, in the node configuration we check the option Install node without GPU drivers, so that we can install the GPU operator ourselves.
So, our cluster is ready! Now let's install the necessary services.
System services
gpu operator
This Helm chart was discussed in detail in the previous article. Here it is needed to install drivers and toolkits and to label GPU resources on our nodes.
1. Use the following values for the Helm chart:
driver: # installs the driver on the node
  enabled: true
  version: "550.54.15" # driver version to install
toolkit: # overwrites the containerd config
  enabled: true
devicePlugin: # labels our GPU resources as nvidia.com/gpu
  enabled: true
dcgmExporter: # needed to export GPU metrics to Prometheus
  enabled: true
2. Install the gpu-operator with the following command:
helm upgrade --install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator -f gpu-operator/values.yaml
prometheus stack
1. The Prometheus and Grafana service stack is needed to track our traffic on dashboards. We install the chart with the following values:
prometheus:
  prometheusSpec: # these settings enable automatic ServiceMonitor discovery
    podMonitorSelectorNilUsesHelmValues: false
    probeSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
grafana: # default Grafana settings
  grafana.ini:
    analytics:
      check_for_updates: true
    grafana_net:
      url: https://grafana.net
    log:
      mode: console
    paths:
      data: /var/lib/grafana/
      logs: /var/log/grafana
      plugins: /var/lib/grafana/plugins
      provisioning: /etc/grafana/provisioning
helm upgrade --install prometheus-stack prometheus-community/kube-prometheus-stack -f prometheus-stack/values.yaml
2. Next, open Grafana via port forward and go to the web interface:
kubectl port-forward <service/grafana> 3000:3000 --namespace=<grafana-namespace>
prometheus adapter
Needed to convert Prometheus metrics to custom K8s metrics. We'll talk about it in more detail later.
Manifests for our inference
To demonstrate how inference works, we will use vLLM. Deploying models with it is quite simple: just specify the model name from the list of available models, for example on Hugging Face. In our case it's gpt2, so we don't spend a lot of time loading weights. vLLM is also convenient because it ships with inference metrics and a Swagger UI for testing out of the box.
We put all the manifests into one folder, vllm/ha. You can deploy them with the command:
kubectl apply -f vllm/ha
Now let's look at each manifest.
vLLM deployment
Our gpt2 deployment manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vllm-app
  name: vllm
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-app
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: vllm-app
    spec:
      containers:
      - command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - gpt2
        image: vllm/vllm-openai:latest
        name: vllm-openai
        ports:
        - containerPort: 8000
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 40
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 40
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
      volumes:
      - emptyDir: {}
        name: cache-volume
Service load balancer
To access the inference from the Internet, we will use a Selectel load balancer. It is enough to deploy the following manifest:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vllm-app
  name: vllm-openai-svc
  namespace: default
spec:
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: vllm-app
  type: LoadBalancer
Service monitor
Needed to collect Prometheus metrics from our inference. After deploying the monitor to Prometheus, the system will automatically add a new target and start collecting data.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    serviceMonitorSelector: vllm-prometheus
  name: vllm-prometheus
spec:
  endpoints:
  - interval: 10s
    targetPort: 8000
    path: /metrics
  selector:
    matchLabels:
      app: "vllm-app"
HorizontalPodAutoscaler
Needed to configure autoscaling of our replicas. We specify a custom metric as the target and set a threshold:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_request_latency_seconds
      target:
        type: AverageValue
        averageValue: 200m # 200 ms of inference latency
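Under the hood, the scaling decision follows the standard HPA formula: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue), clamped to minReplicas/maxReplicas. A small sketch with illustrative numbers:

```python
from math import ceil

def desired_replicas(current: int, metric: float, target: float,
                     min_r: int = 1, max_r: int = 3) -> int:
    """The standard HPA formula: ceil(current * metric / target), clamped to bounds."""
    return max(min_r, min(max_r, ceil(current * metric / target)))

# Average latency of 0.45 s against the 0.2 s (200m) target with one replica:
print(desired_replicas(1, 0.45, 0.2))  # 3
```

So once the measured latency crosses the 200 ms target, HPA starts adding replicas until the per-pod average drops back under the threshold or maxReplicas is reached.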
I think many may wonder what this custom K8s metric is. Let's look at how the Prometheus adapter works.
Making Custom Metrics with Prometheus Adapter
Why is this necessary?
Scaling is driven by K8s metrics. The adapter turns Prometheus metrics into custom K8s metrics via an API declaration. I have done this manually before, in the article about GPU sharing. Prometheus adapter lets you automate the process via a Helm chart.
Implementation
1. We use the following values:
namespaceOverride: default
prometheus:
  url: http://prometheus-stack-kube-prom-prometheus
  port: 9090
rules:
  custom:
  - seriesQuery: 'vllm:e2e_request_latency_seconds_sum{namespace!="",pod!="",model_name="gpt2"}'
    resources:
      overrides:
        namespace:
          resource: "namespace"
        pod:
          resource: "pod"
    name:
      matches: "vllm:e2e_request_latency_seconds_sum"
      as: "vllm_request_latency_seconds"
    metricsQuery: 'rate(vllm:e2e_request_latency_seconds_sum{<<.LabelMatchers>>}[1m])/rate(vllm:e2e_request_latency_seconds_count{<<.LabelMatchers>>}[1m])'
2. Deploy the Helm chart using the following command:
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter -f vllm/prometheus-adapter.yaml
A custom metric is created with the metricsQuery formula. It works much like selecting metrics in Prometheus with a PromQL query; the only difference is that you additionally specify the <<.LabelMatchers>> attribute, which filters metrics by pod and namespace. With this formula we create the custom metric vllm_request_latency_seconds, which HPA will use for scaling.
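To see what the formula computes: dividing the rate of the latency-sum counter by the rate of the request-count counter gives the average latency per request over the window. An illustrative sketch (the counter values are made up):

```python
def avg_latency(sum_now: float, sum_prev: float,
                count_now: int, count_prev: int, window_s: int = 60) -> float:
    """Mirrors rate(latency_sum[1m]) / rate(latency_count[1m]): the per-second
    growth of total latency divided by the per-second growth of request count
    equals the mean latency per request over the window."""
    rate_sum = (sum_now - sum_prev) / window_s
    rate_count = (count_now - count_prev) / window_s
    return rate_sum / rate_count

# 30 requests in the last minute took 7.5 s in total -> 0.25 s on average
print(avg_latency(sum_now=107.5, sum_prev=100.0, count_now=130, count_prev=100))
```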
Checking the inference
Once the manifests are deployed, we can go to Swagger and query the model. Swagger will be available at the load balancer IP address on port 8000.
After all the manipulations we will get approximately the following answer:
As a result, we get code 200 and a response from gpt2 – not the most coherent, but that's the model for you; the point is that the inference works.
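The same request can be made without Swagger, against vLLM's OpenAI-compatible /v1/completions endpoint. A sketch using only the standard library (the balancer address is a placeholder you must substitute):

```python
import json
from urllib import request

def completion_request(prompt: str, model: str = "gpt2", max_tokens: int = 32) -> dict:
    """Payload for vLLM's OpenAI-compatible /v1/completions endpoint."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def query(base_url: str, prompt: str) -> dict:
    """POST the payload; base_url is the load balancer IP, e.g. http://1.2.3.4:8000."""
    body = json.dumps(completion_request(prompt)).encode()
    req = request.Request(f"{base_url}/v1/completions", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# query("http://<loadbalancer_ip>:8000", "Black Friday is")  # run against your cluster
print(completion_request("Black Friday is")["model"])  # gpt2
```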
We apply the load
To track traffic, we use a dashboard from the official vLLM repository.
We need the E2E Request Latency graph, we will use it to track the average request delay.
We will generate the load with the GenAI-Perf client from NVIDIA. It was built on top of the perf client specifically for LLM testing. Run the SDK container and launch the test:
docker run --net host -it -v /tmp:/workspace nvcr.io/nvidia/tritonserver:24.05-py3-sdk
genai-perf -m gpt2 --service-kind openai --endpoint v1/completions --concurrency 50 --url <loadbalancer_ip>:8000 --endpoint-type completions --num-prompts 100 --random-seed 123 --synthetic-input-tokens-mean 20 --synthetic-input-tokens-stddev 0 --tokenizer hf-internal-testing/llama-tokenizer --measurement-interval 1000 -p 100000
GenAI-Perf generates the queries for gpt2 itself and stores them in the file artifacts/gpt2-openai-completions-concurrency50/llm_inputs.json.
After some time, we will see HPA bring up a new replica that requires nvidia.com/gpu. Then the autoscaling magic described above begins.
As soon as the new node is up, the drivers are installed, and the replica is scheduled onto the node, we can see the traffic change in Grafana.
Here is an example graph: after the new replica appeared, token generation on the old one dropped almost by half:
Conclusion
In this article, we looked at how to implement inference autoscaling in practice, what stages it consists of, and what components are needed. But that's not all: I received a list of questions at the webinar and will try to answer them here.
What to do if you don't have a rich GPU park? Autoscaling can be implemented with a single GPU. Read my articles about GPU sharing, MIG, Timeslicing and MPS.
Why use K8s for ML production if you can just deploy large VMs? K8s is a production platform for any service, including inference. It frees you from orchestration issues and provides zero-downtime deployments, resource management, and service isolation.
How to do A/B testing of inferences? We use canary deployments of our inference services: first we test a new model on a certain percentage of traffic, then send full traffic to it. We do this with Istio. A full-fledged A/B test cannot be implemented this way, since there is no control over a specific user group, but you can load-test a new version of the inference.
Is it possible to use two or more video cards in one pod? In our cloud, you can select a node flavor with two or more video cards. The NVIDIA device plugin will then label more than one nvidia.com/gpu resource on the node. Keep in mind that a pod can only use the video cards of the node it is scheduled on – you cannot combine video cards from different nodes.