How to cope with the load on Black Friday? Autoscaling inference in Kubernetes

I'm a DevOps engineer in the Data/ML Products team at Selectel. In this article, I will explain why you need autoscaling of GPU resources, how to configure traffic-based scaling of replicas in Kubernetes, and how to build your own high-load ChatGPT.

Use navigation if you don't want to read the full text:

Load in ML Production
How Node Autoscaling Works in K8s
Autoscaling chatGPT2 on vLLM
Conclusion


Load in ML Production


The best way to talk about the load on ML or inference services is with ChatGPT. To support its infrastructure, OpenAI uses 3,617 HGX A100 servers. This allows it to serve from 100 to 500 million monthly active users (MAU).

If you look at the statistics for the last 90 days of ChatGPT operation, you can see that even such an IT mastodon cannot always cope with incoming traffic – note the red lines on the service availability chart.

OpenAI service availability status panel.

Inference itself is almost no different from a regular web service. The user sends a request to an endpoint, the model makes a prediction based on the request and returns a response in the same format – for example, JSON. To cope with a large load, you need to deploy more replicas, and more replicas require more free resources. The cloud is perfect for such systems, since it has spare capacity for additional incoming load (though, of course, not in all cases).

Now suppose we want to run inference in production, using GPUs in the cloud – for example, on Selectel Managed Kubernetes (MKS). Let's figure out what we'll have to deal with.

How Node Autoscaling Works in K8s


The initial state of our system is a deployed Managed Kubernetes cluster with a single GPU node. The node runs an inference service – for example, the gpt2 model – to which HTTP requests can be sent.

The GPU operator is responsible for supporting services for working with video cards. Read more about it in my previous article.

Initial state of the system during autoscaling.

Next, we send traffic to our replica and notice that clients start receiving responses from the inference with a latency of more than one second. What happens next with our system? Let's take a closer look.

Horizontal Pod Autoscaler

HPA (Horizontal Pod Autoscaler) comes into play. We have previously configured it with the requirement that request latency should not exceed one second. As soon as the threshold is exceeded, the system deploys another replica of our service.
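
For reference, the HPA controller calculates the desired number of replicas from the ratio of the current metric value to the target – this is the standard formula from the Kubernetes documentation:

desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)

For example, with one replica, a one-second latency target, and an observed average latency of two seconds, HPA will request two replicas.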

A new replica has been added.

When a new replica is started, it requests the resource nvidia.com/gpu=1, which indicates the presence of a GPU on the node. In this case, we do not have an available node with this resource.

K8s autoscaler

In the Selectel cloud, we use a fork of this repository to implement node autoscaling. The autoscaler checks the availability of resources – CPU, RAM, etc. – and sees that the nvidia.com/gpu resource needed by the new replica is missing.

A new node has appeared.

The autoscaler raises a node from the base image in the group in which the new replica is to be deployed. The time it takes to deploy a new node depends on the size of the selected flavor – usually up to five minutes. Then the installation of K8s services begins.

Managed Kubernetes Services

At this stage, the necessary K8s services are installed on the new node in the form of systemd units: containerd, kubelet, mk-node-adm, mk-node-health. This takes up to a minute.
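
If you have SSH access to the node, you can check that these units are up with systemctl, for example:

systemctl status containerd kubelet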

GPU operator

Since we are working with GPUs, the node needs to be prepared. The GPU operator installs the necessary drivers and toolkits and configures the device plugin. The latter is what exposes the nvidia.com/gpu resource for our new replica.

Drivers are installed on the node.

After all drivers are installed, the node is ready – now a replica can be scheduled on it. This takes about three minutes.
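
To make sure the node can actually accept the replica, you can check that nvidia.com/gpu appears among its allocatable resources (substitute the name of the new node):

kubectl describe node <new-node-name> | grep nvidia.com/gpu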

Image pulling

The image is being pulled to the new node. The time depends on the image size, the channel bandwidth, and the computing power available to extract the image to the node.

Inference is allocated to a new node.

For a 20 GB image – which is quite common in ML – the pull takes about six minutes (over a 1 Gbps channel).

This is quite a long stage for a regular pull, isn't it? In my Telegram channel I described possible ways to optimize image pull time, so drop in for a chat. I also plan to discuss the optimization options in more detail in the next article.

Now let's move on to practice and try to build our own high-load inference service.

Autoscaling chatGPT2 on vLLM


Let's look at the example I showed at the webinar. All the code is located in the repository, so feel free to reuse it.

What components do we need?

Infrastructure

In the webinar I deployed a Managed Kubernetes cluster using Terraform. If you are familiar with this tool, it will be easy to take the code from the repository and deploy the cluster.

We will see how to deploy Managed Kubernetes in the cloud with the autoscaling option via the control panel. In general, this is no different from the usual flow, so I will only show the features.

1. Create a cluster and specify default settings:

2. Specify the region, K8s version and cluster fault tolerance. When deploying, select a node group and specify autoscaling:

When the Autoscaling option is enabled, you can select from 2 to 20 nodes in one group. Quotas can be increased individually through support.

3. In the node configuration, select a flavor with a GPU, for example Tesla T4:

Also, when selecting the node configuration, we enable the Install node without GPU drivers option in order to install the GPU operator ourselves.

So, our cluster is ready! Now let's install the necessary services.

System services

gpu operator

This Helm chart was discussed in detail in the previous article. Here it is needed to install drivers and toolkits and to label GPU resources on our nodes.

1. Use the following values for the Helm chart:

driver: # installs the driver on the node
  enabled: true
  version: "550.54.15" # driver version to install
toolkit: # overwrites the containerd config
  enabled: true
devicePlugin: # labels our GPU resources as nvidia.com/gpu
  enabled: true
dcgmExporter: # needed to export GPU metrics to Prometheus
  enabled: true
2. Install the gpu-operator with the following command:

helm upgrade --install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator -f gpu-operator/values.yaml
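
As a quick sanity check, you can verify that the operator components have started:

kubectl get pods -n gpu-operator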

prometheus stack

1. The Prometheus and Grafana stack is needed to track our traffic on dashboards. We install the chart with the following values:

prometheus:
  prometheusSpec: # these settings are needed to automatically pick up ServiceMonitors
    podMonitorSelectorNilUsesHelmValues: false
    probeSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
grafana: # default Grafana settings
  grafana.ini:
    analytics:
      check_for_updates: true
    grafana_net:
      url: https://grafana.net
    log:
      mode: console
    paths:
      data: /var/lib/grafana/
      logs: /var/log/grafana
      plugins: /var/lib/grafana/plugins
      provisioning: /etc/grafana/provisioning
Deploy the chart with the following command:

helm upgrade --install prometheus-stack prometheus-community/kube-prometheus-stack -f prometheus-stack/values.yaml

2. Next, open Grafana via port forward and go to the web interface:

kubectl port-forward <service/grafana> 3000:3000 --namespace=<grafana-namespace>
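
If you kept the chart defaults, the Grafana admin password can be read from the secret created by the release (the secret name depends on your release name; here I assume prometheus-stack):

kubectl get secret prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d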

prometheus adapter

Needed to convert Prometheus metrics to custom K8s metrics. We'll talk about it in more detail later.

Manifests for our inference

To demonstrate how inference works, we will use the vLLM framework. Deploying a model is quite simple: just specify its name from the list of available models, for example on Hugging Face. In our case it is gpt2, so that we don't spend much time loading weights. vLLM is also convenient because it ships inference metrics and a Swagger UI for testing out of the box.

We put all the manifests into one folder, vllm/ha. You can deploy them with the command:

kubectl apply -f vllm/ha

Now let's look at each manifest.

vLLM deployment

Our ChatGPT 2 deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vllm-app
  name: vllm
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-app
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: vllm-app
    spec:
      containers:
      - command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - gpt2
        image: vllm/vllm-openai:latest
        name: vllm-openai
        ports:
        - containerPort: 8000
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 40
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 40
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
      volumes:
      - emptyDir: {}
        name: cache-volume

Service load balancer

To access the inference from the Internet, we will use a Selectel load balancer. It is enough to deploy the following manifest:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: vllm-app
  name: vllm-openai-svc
  namespace: default
spec:
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: vllm-app
  type: LoadBalancer
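
Once the Service is created, you can watch for the external IP assigned by the load balancer – we will need it later for Swagger and load testing:

kubectl get svc vllm-openai-svc -n default -w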

Service monitor

Needed for Prometheus to collect metrics from our inference. After the monitor is deployed, Prometheus will automatically add a new target and start collecting data.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    serviceMonitorSelector: vllm-prometheus
  name: vllm-prometheus
spec:
  endpoints:
  - interval: 10s
    targetPort: 8000
    path: /metrics
  selector:
    matchLabels:
      app: "vllm-app"

HorizontalPodAutoscaler

Needed to set up autoscaling of our replicas. We specify a custom metric as the target and a threshold value:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_request_latency_seconds
      target:
        type: AverageValue
        averageValue: 200m # 200 ms of inference latency

I think many may wonder what this custom K8s metric is. Let's look at how the Prometheus adapter works.

Making Custom Metrics with Prometheus Adapter

Why is this necessary?

Scaling in HPA is done by K8s metrics. The adapter turns Prometheus metrics into custom Kubernetes metrics via an API declaration. I have previously done this by hand in the article about GPU sharing. The Prometheus adapter lets you automate this process via a Helm chart.

Implementation

1. We use the following values:

namespaceOverride: default
prometheus:
  url: http://prometheus-stack-kube-prom-prometheus
  port: 9090
rules:
  custom:
    - seriesQuery: 'vllm:e2e_request_latency_seconds_sum{namespace!="",pod!="",model_name="gpt2"}'
      resources:
        overrides:
          namespace:
            resource: "namespace"
          pod:
            resource: "pod"
      name:
        matches: "vllm:e2e_request_latency_seconds_sum"
        as: "vllm_request_latency_seconds"
      metricsQuery: 'rate(vllm:e2e_request_latency_seconds_sum{<<.LabelMatchers>>}[1m])/rate(vllm:e2e_request_latency_seconds_count{<<.LabelMatchers>>}[1m])'

2. Deploy the Helm chart using the following command:

helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter -f vllm/prometheus-adapter.yaml

To create a custom metric, the metricsQuery formula is used. The principle is similar to selecting metrics in Prometheus with a PromQL query. The only difference is that you additionally specify the <<.LabelMatchers>> attribute, by which metrics are filtered by pod and namespace. With this formula, we create the custom metric vllm_request_latency_seconds, which HPA will use for scaling.
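
Once the adapter is deployed, you can check that the custom metric is actually exposed through the custom metrics API (assuming the inference runs in the default namespace):

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm_request_latency_seconds"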

Checking the inference

Once the manifests are deployed, we can go to Swagger and query the model. Swagger will be available at the load balancer IP address on port 8000.
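
If you prefer the terminal to Swagger, the same OpenAI-compatible endpoint can be queried with curl – a minimal sketch, substitute your load balancer IP:

curl http://<loadbalancer_ip>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt2", "prompt": "Black Friday is", "max_tokens": 32}'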

After all the manipulations we will get approximately the following answer:

As a result, we get code 200 and a response from gpt2 – not the most coherent one, like the model itself, but the inference works.

We apply the load

To track traffic we use a dashboard from the official vLLM repository.

We need the E2E Request Latency graph; we will use it to track the average request latency.

We will apply the load using the GenAI-Perf client from NVIDIA. It was developed on the basis of the perf client specifically for LLM testing.

We specify the model and the number of concurrent users via --concurrency. If you change --concurrency from 50 to 100, the average latency varies from 200 to 400 ms.

docker run --net host -it -v /tmp:/workspace nvcr.io/nvidia/tritonserver:24.05-py3-sdk

genai-perf -m gpt2 --service-kind openai --endpoint v1/completions --concurrency 50 \
  --url <loadbalancer_ip>:8000 --endpoint-type completions --num-prompts 100 --random-seed 123 \
  --synthetic-input-tokens-mean 20 --synthetic-input-tokens-stddev 0 \
  --tokenizer hf-internal-testing/llama-tokenizer --measurement-interval 1000 -p 100000

GenAI-Perf itself generates requests for gpt2 and stores them in the file artifacts/gpt2-openai-completions-concurrency50/llm_inputs.json.

After a while, we can see HPA raise a new replica that requires nvidia.com/gpu. Then the autoscaling magic described above begins.
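
You can also follow the scaling progress from the terminal – the current metric value and the number of replicas are visible in the HPA status:

kubectl get hpa vllm-hpa -w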

As soon as the new node is up, the drivers are installed, and the replica is scheduled on it, we can see the traffic change in Grafana.

Here is an example graph: after the new replica appears, token generation on the old one drops almost by half:

Conclusion

In this article, we looked at how to implement inference autoscaling in practice, what stages it consists of, and what components are needed. But that's not all: I received a list of questions at the webinar, and I'll try to answer them here.

What to do if you don't have a large GPU fleet? Autoscaling can be implemented even with a single GPU. Read my articles about GPU sharing, MIG, time-slicing, and MPS.

Why use K8s for ML production if you can deploy large VMs? K8s is a production platform for any service, including inference. It frees you from orchestration issues and provides zero-downtime deployments, resource management, and service isolation.

How do you do A/B testing of inference? We use canary deployments of our inference services. First, we test a new model on a certain percentage of traffic, then we send full traffic to it. We do this using Istio. A full-fledged A/B test cannot be implemented this way, since there is no control over a specific user group, but it does let us load test a new version of the inference.

Is it possible to use two or more video cards in one pod? In our cloud, you can select a node flavor with two or more video cards. The NVIDIA device plugin will then expose more than one nvidia.com/gpu resource on the node. Keep in mind that a pod can only use the video cards of the node it is scheduled on – you cannot combine video cards from different nodes.
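
For illustration, a container requesting two GPUs on such a node declares them in its resource limits – a minimal fragment of the pod spec might look like this:

resources:
  limits:
    nvidia.com/gpu: "2"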
