How artificial intelligence helps train DevOps engineers

Hi all! I'm Evgeniy Torin, a DevOps engineer at Flant. I have long wanted to try artificial intelligence in action and had even generated pictures for my presentations as an experiment. But I was only able to fully immerse myself in this area when a practical need arose.

Once I was asked to onboard newcomers to working with Kubernetes, that is, to teach them how to solve the problems that come up. We have an onboarding workflow and even our own training course, but I wanted to introduce a new tool to help with this task. AI and LLMs (Large Language Models) are especially popular right now, so I decided to apply them to the onboarding problem. In other words, I decided to hand this work over to the machines.

One of the new tools in the field of AI that is related to the operation of Kubernetes is k8sgpt – a chatbot that appeared in the spring of 2023. It helps you get information, advice and solutions on various issues related to Kubernetes. For comparison, you can find similar utilities in the article about open source tools based on OpenAI for Kubernetes.

In this article I will tell you what the k8sgpt utility is, how to install it, what useful options there are and how to connect AI to it. I will show you what level of diagnostics can be carried out using this tool and what kind of chatbot responses you can count on.

To do this, I tried out various available integrations with AI, several models, including a local one. All with examples, launch commands and detailed logs.

What is k8sgpt

k8sgpt is used to scan Kubernetes clusters, diagnose problems and find ways to fix them. It draws on extensive SRE (Site Reliability Engineering) experience: based on it, k8sgpt extracts the most relevant information and supplements it with the help of artificial intelligence.

For example, a pod with the Pending status appeared in the cluster. And if the cause of this problem does not come to mind, you can run the k8sgpt utility and get an explanation, a list of possible problems, and perhaps even commands for further diagnosis.

This is how k8sgpt helps debug problems in the “cube” (Kubernetes). On top of the standard request, the utility adds the desired response format and the context of the problem. Of course, you could compose such a chat request yourself, but k8sgpt simplifies and speeds up the process.

How to install k8sgpt

If you don’t want to install the utility, you can start getting acquainted with how it works in the sandbox. If you want to fully understand the tool, you need to install it. There are two main installation options:

  • As a simple console utility – when we call it, we get a response.

  • As a K8s operator in the cluster: the utility runs in the background and writes its results into separate CRs (Custom Resources) of type Result. This option suits cases where you need a history of checks, or when you need to automate catching a problem that cannot be caught with the manual, console-utility approach.

In this article I will discuss only the first installation option. An option with an operator can be found in this article.

1. Install the utility on the K8s cluster master:

curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.40/k8sgpt_amd64.deb

dpkg -i k8sgpt_amd64.deb
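To make sure the binary was installed correctly, you can check its version:

```
# quick sanity check after installation
k8sgpt version
```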

2. I register an integration with AI (hereinafter referred to as the backend, or simply “back”). The utility currently supports several backend options; the ones I use in this article are OpenAI, Cohere and LocalAI, plus the NoopAI stub. You can see the full list on the website.

Note

NoopAI is a fake AI, a stub for testing the utility. It responds like this:

I am a noop response to the prompt ...

Instead of the ellipsis, the error text is substituted: the output of the analyze module, more on which below.
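If you just want to see the request/response mechanics without any API key, you can run the analysis against this stub like against any other backend (a sketch, assuming noopai is registered via k8sgpt auth add in the same way as the real backends below):

```
# the "answer" will simply echo back the analyzer's findings
k8sgpt analyze -b noopai --explain --namespace=dev
```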

I register the first working backend, from OpenAI. You can get an API key by following the link. In the backend registration command you need to specify the actual backend name, openai, and a model name, for example gpt-3.5-turbo:

k8sgpt auth add -b openai -m gpt-3.5-turbo
Enter openai Key:

From here on it is assumed that the config ~/.kube/config already exists and the cluster API is reachable; after that, the utility is ready to work.

You can read more about installation and first launch in the official documentation.

What features and options does k8sgpt have?

Before analyzing how the utility works, let’s consider its main functions. Below you can see some of the main commands:

Available commands:
  analyze     Find problems in your Kubernetes cluster
  cache       Work with the analysis results cache
  filters     Manage filters for analyzing Kubernetes resources
  e, explain  Ask the AI backend for help solving the problem

Let’s say I use this command to ask the utility what is happening in the cluster:

k8sgpt analyze -b openai --explain --no-cache --namespace=dev --filter Pod
  • analyze (analyzer) collects data about problems in the cluster.

  • The --explain option sends the findings to the AI backend. Without it, the utility does not use AI and runs only the internal analyzer, essentially a collection of SRE practices for debugging a specific problem, where the resource status determines the list of further diagnostic actions.

  • The --no-cache option is added so that the result of the previous analysis is not taken from the cache.

  • The filters --namespace=dev and --filter Pod (short form -f) restrict the analysis to a specific namespace and to resources of type Pod.

For example, in the request below I did not use AI, but in the response the utility provided structured information on the problems found, which can be used to guess the reasons for the errors:

# k8sgpt analyse --namespace=dev
AI Provider: AI not used; --explain not set

0 dev/web-0(StatefulSet/web)
- Error: 0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector,1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
...

And if you add the -d option to the command, the utility appends a description from the official documentation to each error in the Kubernetes Doc section:

# k8sgpt analyze --namespace=dev --filter Service -d
AI Provider: AI not used; --explain not set

0 dev/web-0(web-0)
- Error: Service has no endpoints, expected label app.kubernetes.io/name=local-ai
  Kubernetes Doc: Route service traffic to pods with label keys and values matching this selector. If empty or not present, the service is assumed to have an external process managing its endpoints, which Kubernetes will not modify. Only applies to types ClusterIP, NodePort, and LoadBalancer. Ignored if type is ExternalName. More info: https://kubernetes.io/docs/concepts/services-networking/service/

So even without connecting an AI backend, the utility can be useful. The --explain option sends the errors it finds to the specified backend to look for a solution. From here on I always use --explain and compare the combined answer: the result of the analyzer module for a specific resource plus the response of a specific backend with the model running behind it.

Thanks to the built-in analyzer, the AI proposes fewer false and made-up theories.

Limiting the number of requests

The utility supports two ways of managing the number of requests: filters by resource type via the --filter option, and --max-concurrency, which sets how many parallel requests are sent to the “cube” API:

  -m, --max-concurrency int   Maximum number of concurrent requests to the Kubernetes API server (default 10)

This option will help not create excessive load if, for example, you need to analyze the resources of the entire cluster or namespace at once.

And to reduce the number of requests to the AI backend (the OpenAI API, for example, has rate limits and may return error 429 Too Many Requests), only filters will help.
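Putting the two mechanisms together, a request limited both by resource type and by concurrency against the cluster API might look like this (the values are purely illustrative):

```
# analyze only Pods in the dev namespace, with at most 2 parallel requests to the Kubernetes API
k8sgpt analyze -b openai --explain --namespace=dev --filter Pod --max-concurrency 2
```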

Russian language

Most models were trained primarily on English-language texts, but k8sgpt also works in Russian, though only as a translation. To get it, add the option --language "russian" when checking a problem. It only works when an external backend is used, that is, with the --explain option.
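For example, a request whose explanation should come back in Russian might look like this (a sketch built from the options already shown above):

```
# --language only affects the AI part of the answer, so --explain is required
k8sgpt analyze -b openai --explain --namespace=dev --filter Pod --language "russian"
```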

Interactive mode

k8sgpt also lets you clarify your question, ask about command syntax and so on: an interactive mode. You can start it with the --interactive flag:

# k8sgpt analyze -b openai --no-cache -m 1 --namespace=dev --filter Ingress --explain --interactive
...

0 dev/torin-test(torin-test)
- Error: Ingress uses the ingress class nginxa which does not exist.
...
Solution: 
1. Verify the existence of the specified ingress class (nginxa).
...

Interactive mode enabled [type exit to close.]
Input text: Write me command to fix ingress class name

To fix the ingress class name, you can use the following command:

```
kubectl edit ingress dev/torin-test
```

This command will open the Ingress resource in an editor where you can update the specified ingress class from "nginxa" to the correct value. Make sure to save the changes after editing the file.

From here, a huge space opens up for talking to the AI.

How to work with external AI backends

Let's test k8sgpt in practice using external AI backends. To do this, I simulated several problems with cluster resources and looked at what tips the utility offered for resolving them.

Pod in Pending status

Through .spec.affinity.nodeAffinity I limited the nodes on which the pod is allowed to run. If the conditions are not met, the pod stays in the Pending status.
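A minimal sketch of the kind of constraint I mean (the label key and value here are hypothetical, not the ones from my cluster):

```
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role/some-missing-role   # no node carries this label, so the pod cannot be scheduled
              operator: In
              values:
                - "true"
```

Let's check this problem using the utility: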

# k8sgpt analyse --explain --namespace=dev

 100% |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (2/2, 16 it/min)        
AI Provider: openai

0 dev/web-0(StatefulSet/web)
- Error: 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {dedicated: kafka}, that the pod didn't tolerate, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

Solution: 
1. Update the pod's node affinity/selector to match a node that is available.
2. Remove the taint {dedicated: kafka} from the node or add toleration in the pod's spec to tolerate the taint.
3. Remove the taint {node-role.kubernetes.io/master: } from the node or add toleration in the pod's spec to tolerate the taint.

In its answer, k8sgpt first of all suggests checking that the specified selectors (nodeSelector) and affinity are correct, which is a sound assumption. The further recommendations to remove the taints can also be useful, since mistakes are often made there as well.

Pod in ErrImagePull status

The next error is related to the inability to pull the image. In this example I changed the name of the current image to a non-existent one, and this is how the utility with the base gpt-3.5-turbo model from OpenAI responded:

# k8sgpt analyze --explain -m 1 --namespace=dev 
 100% |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (2/2, 13 it/min)        
AI Provider: openai
0 dev/web-0(StatefulSet/web)
- Error: Back-off pulling image "registry.k8s.io/alpine:3.88"
Error: The Kubernetes container is experiencing a back-off while trying to pull the image "registry.k8s.io/alpine:3.88".
Solution: 
1. Check the image repository registry for any issues or downtime.
2. Verify the image name and tag are correct.
3. Ensure there is enough disk space on the Kubernetes node.
4. Restart the Kubernetes cluster or pod to retry the image pull.
5. If the issue persists, check network connectivity to the image registry.
6. Consider using a different image or version if the problem persists.

In its response, the model suggests first checking whether the registry is working at all, and then verifying the tag name. If that does not help, it also suggests checking the disk space for the image, rebooting the cluster or pod, checking connectivity between the cluster and the image registry, and, finally, trying another image.

The model’s train of thought is correct, but the proposal to reboot the cluster will lead to the inaccessibility of everything running in it. Therefore, you still need to follow the recommendations with an eye to the scale of the possible consequences.

If you look at the documentation, the utility parses output from several sources, including kubectl get events. That output looks like this:

# kubectl -n dev get events --sort-by='.metadata.creationTimestamp'
LAST SEEN   TYPE      REASON                   OBJECT            MESSAGE
20s         Normal    Pulling                  pod/web-0         Pulling image "registry.k8s.io/alpine:3.88"
20s         Warning   Failed                   pod/web-0         Error: ErrImagePull
20s         Warning   Failed                   pod/web-0         Failed to pull image "registry.k8s.io/alpine:3.88": rpc error: code = NotFound desc = failed to pull and unpack image "registry.k8s.io/alpine:3.88": failed to resolve reference "registry.k8s.io/alpine:3.88": registry.k8s.io/alpine:3.88: not found
5s          Normal    BackOff                  pod/web-0         Back-off pulling image "registry.k8s.io/alpine:3.88"
5s          Warning   Failed                   pod/web-0         Error: ImagePullBackOff

This confirms that the utility relies on cluster data, meaning it sees the actual problem and offers a solution tailored to the situation at hand.

Now let's test the utility on the same problem, but using a third-party AI backend, Cohere. The model description is available via this link. You can add a new model not only with k8sgpt auth add -b cohere -m command-nightly, but also by editing the configuration file /root/.config/k8sgpt/k8sgpt.yaml. Registration is easy, no phone number is required for confirmation, and even the free tier is comfortable enough for sending requests.
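For reference, after adding the backend, the corresponding fragment of /root/.config/k8sgpt/k8sgpt.yaml looks roughly like this (the key value is a placeholder, and the exact set of fields depends on the utility version):

```
ai:
    providers:
        - name: cohere
          model: command-nightly
          password: <your Cohere API key>
```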

The Cohere model dealt with the bad tag name problem as follows:

# k8sgpt analyze --explain -m 1 --no-cache --namespace=dev
 100% |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (1/1, 17 it/min)        
AI Provider: cohere

0 dev/test-6645f4b985-rbm7z(Deployment/test)
- Error: Back-off pulling image "alpine:3.88"
 Error: Kubernetes is having trouble pulling the image "alpine:3.88". 
This could be due to a few reasons, such as a network issue, 
a problem with the image repository, or a slow download.
Solution: Check the Kubernetes logs for more details on the error. 
You can also try to pull the image manually to see if there are any issues 
with the repository or download.
If the issue persists, try restarting the Kubernetes node or cluster.

The prompt seems to be different here: for example, there is no numbered list of recommendations. The approach to the solution also differs slightly: at the top level it suggests checking for network issues, registry problems or a slow download. It also recommends looking at the K8s logs and trying to pull the image manually. And if nothing helps at all, it suggests restarting the cluster, whose consequences we discussed earlier.

The solution to the same problem from OpenAI's flagship gpt-4o model is essentially no different from the gpt-3.5-turbo answer:

0 dev/test-6645f4b985-rbm7z(Deployment/test)
- Error: Back-off pulling image "alpine:3.88"
Error: Kubernetes is unable to pull the Docker image "alpine:3.88".
Solution:
1. Verify the image name and tag are correct.
2. Check if the image exists in the repository.
3. Ensure your Kubernetes nodes have internet access.
4. Try pulling the image manually using `docker pull alpine:3.88`.
5. Update the image tag if necessary.

Comparing two backends and three models here, two of which recommend restarting the cluster with equal insistence, it is logical to pick the last one, gpt-4o; I also like OpenAI's visually and technically structured answers better. On the other hand, Cohere is easier to register with and works without a VPN.

Pod in ErrImagePull status: wrong registry password

Now I keep throwing image-availability problems at the utility. This time I change the login and password for accessing gitlab.registry, the local registry into which the alpine image was pushed, so the tag has changed as well. I test k8sgpt with the gpt-3.5-turbo model from OpenAI and get this response:

# k8sgpt analyze --explain -m 1 --no-cache --namespace=dev 
 100% |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (2/2, 8 it/min)         
AI Provider: openai

1 dev/web-0(StatefulSet/web)
- Error: Back-off pulling image "registry.k8s.io/alpine:7c36eb2e27a23465498b483c1fc02c94142b31919e51bdede2146e6a-1690898126442"
Error: The Kubernetes is experiencing a back-off while pulling the image "registry.k8s.io/alpine:7c36eb2e27a23465498b483c1fc02c94142b31919e51bdede2146e6a-1690898126442".

Solution: 
1. Check the network connectivity to the image registry.
2. Verify the image name and tag are correct.
3. If using custom image repository, ensure proper authentication is configured.

The tips are: check the network, the tag name, and the credentials for connecting to the private registry. The last one is correct, but there is a caveat: after some time the list of recommendations changed:

1 dev/web-0(StatefulSet/web)
- Error: Back-off pulling image "registry.k8s.io/alpine:7c36eb2e27a23465498b483c1fc02c94142b31919e51bdede2146e6a-1690898126442"
Error: There is an issue with pulling the specified image from the registry.

Solution: 
1. Check if the image name and tag are correct.
2. Verify if the registry URL is accessible.
3. Ensure that you have the necessary permissions to pull the image.
4. Make sure there is an active internet connection.
5. Restart the Kubernetes pod or deployment.

The changing list of recommendations may seem strange; it is explained by the fairly high temperature used when querying the model, the parameter responsible for the randomness of the generated text. There was even an issue opened about this, and starting with utility version v0.3.18 you can set this parameter. The default is 0.7, which makes the model more creative in its responses. The temperature setting itself lives in the configuration file:

cat /root/.config/k8sgpt/k8sgpt.yaml

ai:
    providers:
        - name: openai
          model: gpt-3.5-turbo
          password: ...
          temperature: 0.7
          topp: 0.5
          maxtokens: 2048
...

But blaming that parameter is just a guess. The changing recommendation text can also be attributed to new events in the cluster and to the --no-cache option, which makes each new request without taking the previous result into account.

The utility's new answer recommends checking the tag name, that the registry URL is correct, and that you have the rights needed to pull the image, which can matter in some cases. It also recommends checking the Internet connection and, by now a tradition, if all else fails, restarting, though this time the pod or deployment rather than the whole cluster.

The updated list of recommendations turned out to be broader, and most of the proposed solutions have a right to exist. At the same time, the option I had in mind, checking the login and password, disappeared, even though all this time the pod description (kubectl -n dev describe pod web-0) showed an obvious access error:
failed to authorize: failed to fetch anonymous token: unexpected status: 403 Forbidden.
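For reference, the fix the model keeps missing boils down to repairing the registry credentials, roughly like this (the secret name and credentials here are hypothetical):

```
# recreate the pull secret with valid credentials...
kubectl -n dev create secret docker-registry regcred \
  --docker-server=gitlab.registry \
  --docker-username=<login> \
  --docker-password=<password>

# ...and reference it from the workload
kubectl -n dev patch statefulset web \
  -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'
```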

Unfortunately, the model does not always produce a fully satisfactory answer. But this is an artificial example in which I know what I broke; in a real situation the model's options are a starting point, and after trying them you can ask again. In fairness, this time the gpt-4o model handled the task better than its predecessor:

Solution:
1. Verify image registry URL and credentials.
2. Ensure your Kubernetes secret for the registry is correctly configured.
3. Check OAuth token endpoint and permissions.
4. Retry pulling the image.

Pod in CrashLoopBackOff status

I simulate the next problem: I make a mistake in the pod's launch command, for example calling bash, which does not exist in this image.
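A sketch of the container fragment that produces this (close to what I used, but treat it as illustrative):

```
containers:
  - name: alpine
    image: alpine:3.6
    command: ["bash", "-c", "sleep 3600"]   # alpine ships only sh, so the container fails to start
```

As a result, the pod goes into CrashLoopBackOff, and here is the answer I get from k8sgpt: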

# kubectl -n dev get pod
NAME                    READY   STATUS             RESTARTS      AGE
test-6dbdf798f9-skb8t   0/1     CrashLoopBackOff   4 (22s ago)   2m2s

# k8sgpt analyze --explain --no-cache --namespace=dev
 100% |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (2/2, 7 it/min)         
AI Provider: openai

1 dev/test-6dbdf798f9-skb8t(Deployment/test)
- Error: back-off 1m20s restarting failed container=alpine pod=test-6dbdf798f9-skb8t_dev(9e727950-e020-4422-b4d5-36e725933a26)
Error: The container "alpine" in the "test-6dbdf798f9-skb8t_dev" pod is failing and restarting with a back-off time of 1 minute and 20 seconds.

Solution: 
1. Check the logs of the "alpine" container to identify the cause of the failure.
2. Make necessary changes to fix the issue.
3. Restart the pod to apply the changes and validate if the "alpine" container starts successfully.

The utility recommends looking at the container logs, fixing the problem and restarting, without any specifics.

Looking at the events:

19s         Warning   Failed                   pod/test-6dbdf798f9-skb8t         Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "bash": executable file not found in $PATH: unknown

The actual problem, exec: "bash": executable file not found in $PATH: unknown, is not mentioned in the solution at all.

The same error, and a new analysis attempt shows that the answer is still not static: there is some variability. But even changing the temperature to 0.7 does not radically change the quality of the answer:

# k8sgpt analyze --explain --namespace=dev
 100% |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (2/2, 15 it/min)        
AI Provider: openai

1 dev/test-6dbdf798f9-skb8t(Deployment/test)
- Error: back-off 2m40s restarting failed container=alpine pod=test-6dbdf798f9-skb8t_dev(9e727950-e020-4422-b4d5-36e725933a26)
Error: The container "alpine" in the pod "test-6dbdf798f9-skb8t_dev" failed to restart after a back-off period.

Solution:
1. Check the logs of the pod to identify the specific error causing the container to fail.
2. Update the container configuration or troubleshoot the issue causing the failure.
3. Apply the necessary changes to resolve the error.
4. Restart the pod to check if the issue has been resolved.

Model 4o suggested:

Solution:
1. Check container logs: `kubectl logs test-6dbdf798f9-skb8t -c alpine`
2. Describe the pod for more details: `kubectl describe pod test-6dbdf798f9-skb8t`
3. Fix any issues found in the logs or description.
4. Restart the pod: `kubectl delete pod test-6dbdf798f9-skb8t` (it will be recreated).

As expected, the new recommendations differ in form but are identical in essence: check the logs, update the container configuration, find and fix the problem, restart the pod. The difference is that the newer model offered specific debugging commands.

Pod in OOMKilled status

Another common error. I emulate it by launching a pod with the following command:

command:
            - bash
            - "-c"
            - "for i in {1..100};do echo \" $i s waiting...\" && sleep 1;done\necho \"some strange command\"\ntail /dev/zero\nEOF \n"
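Here tail /dev/zero consumes memory without bound. For the pod to be killed specifically with the OOMKilled reason, I assume a memory limit on the container along these lines (a hypothetical fragment, not shown in the manifest above):

```
resources:
  limits:
    memory: 128Mi   # the kernel kills the process once the container exceeds this limit
```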

The events show the pod restarting in a loop:

52s         Normal    Started                  pod/test-858c7f9f7d-tp69t         Started container alpine
52s         Normal    Pulled                   pod/test-858c7f9f7d-tp69t         Container image "alpine:3.6" already present on machine
52s         Normal    Created                  pod/test-858c7f9f7d-tp69t         Created container alpine
4s          Warning   BackOff                  pod/test-858c7f9f7d-tp69t         Back-off restarting failed container

It is impossible to judge the problem from the events alone: OOM is not explicitly mentioned there. But it can be discovered by running describe:

# kubectl -n dev describe pod test-858c7f9f7d-tp69t
...      
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled

I also briefly caught this information in the pod status:

# kubectl -n dev get pod -w
NAME                    READY   STATUS    RESTARTS      AGE
test-858c7f9f7d-tp69t   1/1     Running   2 (25s ago)   49s
test-858c7f9f7d-tp69t   0/1     OOMKilled   2 (26s ago)   50s
test-858c7f9f7d-tp69t   0/1     CrashLoopBackOff   2 (13s ago)   63s
test-858c7f9f7d-tp69t   1/1     Running            3 (29s ago)   79s
test-858c7f9f7d-tp69t   0/1     OOMKilled          3 (40s ago)   90s
test-858c7f9f7d-tp69t   0/1     CrashLoopBackOff   3 (13s ago)   102s

The chat gives the following recommendations:

# k8sgpt analyze --explain --no-cache --namespace=dev -f Pod
   0% |                                                                                                                                          | (0/1, 0 it/hr) [0s:0s]^ 100% |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (1/1, 7 it/min)        
AI Provider: openai

0 dev/test-858c7f9f7d-tp69t(Deployment/test)
- Error: back-off 5m0s restarting failed container=alpine pod=test-858c7f9f7d-tp69t_dev(2dcd7352-a56c-4c87-a74f-88f964b60d02)
Error: The alpine container in pod test-858c7f9f7d-tp69t_dev failed to start and is continuously restarting with a back-off period of 5 minutes.
Solution: 
1. Check the logs of the pod using `kubectl logs pod/test-858c7f9f7d-tp69t_dev`
2. Identify the cause of the failure in the logs.
3. Fix the issue with the alpine container.
4. Update the pod using `kubectl apply -f <pod_yaml_file>` to apply the changes.
5. Monitor the pod using `kubectl get pods -w` to ensure it is running without restarts

You need to find the reason, make changes, restart the pod and keep monitoring. Repeated runs with different delays and attempts to catch the error itself were unsuccessful; it turns out that k8sgpt may miss intermittent errors. Perhaps the operator version of the utility would suit this case better.

Errors with Ingress

Next comes a fairly simple typo. I will show which one below; for now, I launch a pod and hook it up to a kind: Ingress resource. The events show no errors:

# kubectl -n dev get events
LAST SEEN   TYPE     REASON    OBJECT                      MESSAGE
6m          Normal   Pulled    pod/test-548bf544cd-nttq5   Container image "alpine:3.6" already present on machine
6m          Normal   Created   pod/test-548bf544cd-nttq5   Created container alpine
9m24s       Normal   Sync      ingress/torin-test          Scheduled for sync

The describe output does not give an obvious pointer either:

# kubectl -n dev describe ingress
Name:             torin-test
Labels:           <none>
Namespace:        dev
Address:          95.217.82.184
Default backend:  default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
TLS:
  torin-test-tls terminates torin-test.sandbox.echo-shared.hf.flant.com
Rules:
  Host                                         Path  Backends
  ----                                         ----  --------
  torin-test.sandbox.echo-shared.hf.flant.com  
                                               /   torin-test:app (<error: endpoints "torin-test" not found>)
Annotations:                                   kubernetes.io/ingress.class: nginx
                                               nginx.ingress.kubernetes.io/force-ssl-redirect: true
                                               nginx.ingress.kubernetes.io/ssl-redirect: true
Events:
  Type    Reason  Age                From                      Message
  ----    ------  ----               ----                      -------
  Normal  Sync    10m (x2 over 10m)  nginx-ingress-controller  Scheduled for sync

And if you have no idea where to look after this, k8sgpt can suggest a possible problem:

# k8sgpt analyze --explain -m 1  --namespace=dev 
 100% |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (2/2, 20 it/min)        
AI Provider: openai

0 dev/torin-test(torin-test)
- Error: Ingress uses the ingress class nginxa which does not exist.
- Error: Ingress uses the service dev/torin-test which does not exist.
- Error: Ingress uses the secret dev/torin-test-tls as a TLS certificate which does not exist.
Error: Ingress uses the ingress class nginxa which does not exist. Ingress uses the service dev/torin-test which does not exist. Ingress uses the secret dev/torin-test-tls as a TLS certificate which does not exist.

Solution: 
1. Check the spelling and syntax of the ingress class "nginxa" to ensure it matches a valid ingress class name.
2. Verify if the service "dev/torin-test" exists. If not, create the service with the required specifications.
3. Confirm if the secret "dev/torin-test-tls" exists. If not, create the secret with the correct TLS certificate information.

And already with the very first point the utility offers the right solution: there is a typo in the ingress class name, and the class nginxa does not exist in the cluster:

# kubectl get ingressclasses.networking.k8s.io --no-headers -o custom-columns=":metadata.name"
nginx

It should be noted here that the error Error: Ingress uses the ingress class nginxa which does not exist was found by the analyze step, that is, by the k8sgpt analyzer working on cluster data, before the AI was involved. So it is not surprising that in the Solution section the AI also stresses this problem and offers the right fix for it.
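A one-line fix for this typo could look roughly like this (assuming the class is set via spec.ingressClassName; if it is set through the deprecated kubernetes.io/ingress.class annotation, edit the annotation instead):

```
kubectl -n dev patch ingress torin-test --type=merge \
  -p '{"spec":{"ingressClassName":"nginx"}}'
```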

Regarding the typo in the resource name, an unexpectedly detailed answer with step-by-step recommendations and commands came from the Cohere backend with the command-nightly model (a full list of current models is available in the documentation), even without the -d option:

# k8sgpt analyze -b cohere --explain -m 1  --namespace=dev
 100% |██████████████████████████████████████████████████████████████████████████████████████████████| (2/2, 5 it/min)         
AI Provider: cohere

1 dev/torin-test(torin-test)
- Error: Ingress uses the ingress class nginxa which does not exist.
- Error: Ingress uses the service dev/torin-test which does not exist.
- Error: Ingress uses the secret dev/torin-test-tls as a TLS certificate 
which does not exist.
 Error: Ingress uses the ingress class nginxa which does not exist. 
Ingress uses the service dev/torin-test which does not exist. 
Ingress uses the secret dev/torin-test-tls as a TLS certificate 
which does not exist.

Solution: To resolve this error, you can follow these steps:

1. Verify the ingress class: Make sure that the ingress class `nginx` 
exists in the Kubernetes cluster. You can use the following command to check:

```
kubectl get ingressclass nginxa
```

2. Verify the service: Make sure that the service `dev/torin-test` 
exists in the Kubernetes cluster. You can use the following command to check:

```
kubectl get service dev/torin-test
```

3. Verify the secret: Make sure that the secret `dev/torin-test-tls` 
exists in the Kubernetes cluster. You can use the following command to check:

```
kubectl get secret dev/torin-test-tls
```

If any of these resources do not exist, you can create them with the 
following commands:

```
kubectl apply -f ingress.yaml
kubectl apply -f service.yaml
kubectl apply -f secret.yaml
```

4. Update the ingress: Once you have verified that the required 
resources exist, you can update the ingress to use the correct resources. 
You can use the following command to patch the ingress:

```
kubectl patch ingress torin-ingress --type="json" 
--patch="{"spec": {"ingressClassName": "nginx", 
"defaultBackend": {"serviceName": "torin-test", "servicePort": 80}}"
```

5. Verify the ingress: After updating the ingress, you can verify that it is 
using the correct resources by running the following command:

```
kubectl get ingress torin-ingress
```

Output:
```
NAME           CLASS    HOSTNAME   ADDRESS        PORTS   AGE
torin-ingress  nginxa                 127.0.0.1      80      1d3h
```

This output shows that the ingress `torin-ingress` is using the correct 
ingress class (`nginx`), service (`torin-test`), and secret (`torin-test-tls`).

What the Cohere backend suggests:

  1. Check for the presence of an ingress class with the same name.

  2. Check the availability of the service and secret.

  3. Change the ingress settings taking into account the found resource names.

  4. Check that the ingress settings are correct.

And, as I already said, all of this comes with specific commands.

How to work with local backends

Using external backends may seem unsafe, since the requests explicitly include the names of namespaces, pods and other cluster resources. A natural question is whether sensitive data, such as secrets, is being sent too. For such cases, the utility can anonymize the data it sends via the --anonymize option. For example, this is what a request looks like without --anonymize:

# k8sgpt analyze -b localai  --no-cache -m 1 --namespace=dev --filter Ingress --explain

...
--- Ingress uses the ingress class nginxa which does not exist. Ingress uses the service dev/torin-test which does not exist. Ingress uses the secret dev/torin-test-tls as a TLS certificate which does not exist. ---.
...

Here, after the command, the utility's request is shown as logged on the backend side. The logs are taken from the localai backend, an example of which will be launched below.

And this is what the same request looks like from the back end, but with the option --anonymize:

# k8sgpt analyze -b localai --no-cache -m 1 --namespace=dev --filter Ingress --explain --anonymize
...
--- Ingress uses the ingress class ZWIyfD5S which does not exist. Ingress uses the service Rkwv/c0gnUmJZNjhPIw== which does not exist. Ingress uses the secret ZUUx/LVdhSF9rbF0xNEIjLVA= as a TLS certificate which does not exist. ---.
...

You can see that the resource names are masked in this case. The difference is not visible in the utility's response to the user, which is why the backend logs were needed to demonstrate how the option works.

If you need an even more secure way to use k8sgpt, consider running your own backend (-b localai).

For a cluster, the LocalAI solution may be a good fit: it is an application that lets you run AI models without an Internet connection. Here it is easy to confuse the LocalAI project with the k8sgpt backend of the same name (-b localai). The latter is just a generic way to connect any backend that serves a model: to hook up the localai backend, the utility only needs a URL and an access key. LocalAI is one possible implementation of such a backend; the model provider could be any other application.

Running a local model using LocalAI

To install LocalAI I used werf rather than pure Helm: I edited the Helm charts and created werf.yaml and .gitlab-ci.yml to roll it out to the K8s cluster in a convenient way. You can use pure Helm or any other analogue instead.

values.yaml looks like this:

deployment:
  env:
    ...
    preload_models: '[{ "url": "github:go-skynet/model-gallery/gpt4all-j.yaml", "overrides": { "parameters": { "model": "ggml-gpt4all-j" }}, "files": [ { "uri": "https://gpt4all.io/models/ggml-gpt4all-j.bin", "sha256": "acd54f6da1cad7c04c48b785178d686c720dcbe549903032a0945f97b1a43d20", "filename": "ggml-gpt4all-j" }]}]'

Here it is worth paying attention to the model name and the resources. The default model is ggml-gpt4all-j (an example of launching another model comes later). I ran the experiments on a K8s cluster with the worker node available at the time, 4 CPU and 8 GB of memory, and gave the pod almost all available resources (I did not have a GPU node at hand, but I will try one in the next part):

resources:
  requests:
    cpu: 3500m
    memory: 7Gi

And a disk with enough space for the model:

persistence:
    pvc:
      enabled: true
      size: 20Gi

Looking ahead, the graphs show the resource utilization during request processing and how long it takes to prepare a response:

7.45 GB of memory and 100% CPU: almost all the resources available on the node are used, since no limits were set on the pod.

You can also customize the prompt template via values.yaml variables:

promptTemplates:
   ggml-gpt4all-j.tmpl: |
     The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
     ### Prompt:
     {{.Input}}
     ### Response:

It is also visible in the pod logs while a request is being processed:

Prompt (after templating): The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
### Prompt:
Simplify the following Kubernetes error message delimited by triple dashes written in --- english --- language; --- Ingress uses the ingress class ZWIyfD5S which does not exist. Ingress uses the service Rkwv/c0gnUmJZNjhPIw== which does not exist. Ingress uses the secret ZUUx/LVdhSF9rbF0xNEIjLVA= as a TLS certificate which does not exist. ---.
	Provide the most possible solution in a step by step style in no more than 280 characters. Write the output in the following format:
	Error: {Explain error here}
	Solution: {Step by step solution here}
### Response:

Here you can clearly see which part of the request comes from the LocalAI backend's template and which part from the k8sgpt utility, the {{.Input}} placeholder. The other values.yaml parameters were left at their defaults.

If everything is fine with the resources for the pod, there is enough space on the runner to download the image and imagePullSecrets are in place, then everything will start the first time:

gitlab-runner:~$ docker image ls
REPOSITORY                   TAG       IMAGE ID       CREATED      SIZE
quay.io/go-skynet/local-ai   latest    ba9a2b1bff11   2 days ago   14GB

In this case, you can see in the log:

# kubectl -n dev logs local-ai-dev-6b67766d44-59cxh -f
...
12:42PM DBG File "ggml-gpt4all-j" downloaded and verified
12:42PM DBG Prompt template "gpt4all-completion" written
12:42PM DBG Prompt template "gpt4all-chat" written
12:42PM DBG Written config file /models/gpt4all-j.yaml


 ┌───────────────────────────────────────────────────┐ 
 │                   Fiber v2.48.0                   │ 
 │               http://127.0.0.1:8080               │ 
 │       (bound on host 0.0.0.0 and port 8080)       │ 
 │                                                   │ 
 │ Handlers ............ 55  Processes ........... 1 │ 
 │ Prefork ....... Disabled  PID ................ 14 │ 
 └───────────────────────────────────────────────────┘ 

After successfully deploying the LocalAI image and getting the model running inside the pod, you need to register the new backend in k8sgpt:

k8sgpt auth add -b localai -u local-ai-dev.dev.svc.cluster.local:8080/v1 -m ggml-gpt4all-j
k8sgpt auth default -p localai
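You can check which backends are configured and which one is now the default by listing the providers (the output format depends on the utility version):

```
k8sgpt auth list
```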

A k8sgpt request to the local backend with a Pod filter:

k8sgpt analyze -b localai --explain --no-cache --namespace=dev -f Pod

And, as I showed above, once the request lands in our local pod with the model, you can see exactly how k8sgpt works its magic, that is, what prompt it wraps around the initial error report produced by the analyze step:

...
### Prompt:
Simplify the following Kubernetes error message delimited by triple dashes written in --- english --- language; --- Back-off pulling image "alpine:3.88" ---.
	Provide the most possible solution in a step by step style in no more than 280 characters. Write the output in the following format:
	Error: {Explain error here}
	Solution: {Step by step solution here}

...

This is what the LocalAI + ggml-gpt4all-j backend replied to the problem with the non-existent tag:

# k8sgpt analyze -b localai --explain --no-cache --namespace=dev -f Pod
 100% |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (1/1, 2 it/min)         
AI Provider: localai

0 dev/test-6645f4b985-t2ntj(Deployment/test)
- Error: Back-off pulling image "alpine:3.88"

  I'm sorry, as an AI language model, it is not within my programming to provide step-by-step solutions or explain errors. However, I can provide the most common response that could occur. In this case, it seems like an error message about pulling the image "alpine:3.88" is being displayed, but there may be no solution to it.

The answer is unhelpful, but honest: this model cannot provide a step-by-step solution, but suggests paying attention to the error downloading the image.

The following answer was obtained on the same model, but with a temperature of 0.9:

# k8sgpt analyze -b localai --explain --no-cache --namespace=dev -f Pod
 100% |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (1/1, 1 it/min)         
AI Provider: localai

0 dev/test-6645f4b985-t2ntj(Deployment/test)
- Error: Back-off pulling image "alpine:3.88"
The error message is "back-off pulling image alpine:3.88". The possible cause of the error is a network issue or an incorrect configuration. To resolve the error, you need to check your network connection and make sure that the image has been successfully pushed to your server. Once the image is uploaded, you can specify it as a dependency in your project's package.json file and run "npm install" to include it in your project. Alternatively, you can also use a CI/CD tool such as Jenkins or Travis CI to automate the build process and ensure that all dependencies are up-to-date.

Here the answer turned out to be more detailed. For example, it suggests checking the network connection and making sure the image in question is available on the server. At the same time, the model also misleads: the advice about package.json and npm install is clearly irrelevant to the problem. On the plus side, it does not suggest restarting the cluster.

As you can see from the examples above, the local model approach works. But it needs a node with more computing power, since you cannot wait hours for recommendations when working with the utility interactively. This option may be better suited, for example, to running the utility as an operator. Also, when using the LocalAI application in the cluster, I could not find a way to use one of the key features of K8s: running several pods with the model for parallel computing. Most likely this way of launching the model is simply not supported.

Local model repository

Then I wanted to try launching another model. The community has prepared several ready-made options in the gallery, for example the model used above. Or you can try a well-known model such as openllama_7b; for that, change or add the preload_models parameter in values.yaml:

preload_models: '[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo"}]'

But let’s say I wanted to try a model that is not in the gallery, that is, not among the available templates for LocalAI. Or I have my own trained model. In this case, a local template gallery can help. To do this, you can do the following: download the model from a public repository or train your own, create a template and make the template file available (for example, through a separate ingress of the same cluster). After this, you need to point the LocalAI application to your prepared yaml file with the downloaded model.

Here is an example of such a template file, llama_2_7b_chat_gglm.yaml:

name: "llama-2-7b-chat-gglm"

description: |
  ...

license: ...
urls:
- https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

config_file: |
    backend: llama
    parameters:
      model: llama-2-7b-chat.ggmlv3.q6_K.bin
      top_k: 80
      temperature: 0.2
      top_p: 0.7
    context_size: 1024
    template:
      completion: openllama-completion
      chat: openllama-chat
files:
    - filename: "llama-2-7b-chat.ggmlv3.q6_K.bin"
      sha256: "e32c8f063b357001a1da0431778b40a78aa71dd664561ff14c51f18556381818"
      uri: "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/blob/main/llama-2-7b-chat.ggmlv3.q6_K.bin"

prompt_templates:
- name: "openllama-completion"
  content: |
    Q: Complete the following text: {{.Input}}\nA: 
- name: "openllama-chat"
  content: |
    Q: {{.Input}}\nA:

The remaining parameters can be adjusted; in my case, assume they were left at their defaults.

I add a link to the local template to the LocalAI application's values.yaml and restart the deployment:

preload_models: '[{"url": "https://gallery.sandbox.echo-shared.hf.flant.com/llama_2_7b_chat_gglm.yaml", "name": "llama_2_7b_chat_gglm"}]'
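Before restarting the deployment, it is worth making sure the template file is actually reachable at that URL (the URL is the one from the preload_models line above):

```
# the file must be downloadable from inside the cluster as well
curl -sf https://gallery.sandbox.echo-shared.hf.flant.com/llama_2_7b_chat_gglm.yaml | head -n 5
```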

Since, naturally, I did not have enough dedicated test capacity to run this model, I cannot show its answers here. Recommended capacities are usually listed in the model description; for example, for llama_2_7b the required resources are indicated in the Hugging Face repository here.

From the above we can conclude that models trained on a small amount of data, or quantized models such as ggml-gpt4all-j, are often not enough for k8sgpt to generate a useful recommendation. Quantization reduces the memory needed to store the model and speeds up inference by lowering the computational load, but it also costs accuracy and degrades the quality of the results. To get better results you need more power; I will try to cover that in future articles.

Conclusions

The most important conclusion I drew for myself: the k8sgpt utility can serve as a good aid for onboarding. Even without an AI backend, the utility has useful built-in analyzers of cluster resource status, which in some cases are quite informative. On top of that, it can pull in information from the K8s documentation without leaving the console.

That said, a beginner can usually get by with the most basic diagnostic commands (kubectl explain, describe, get events, logs), and a favorite search engine can find solutions to the problems shown in this article. Still, k8sgpt's answers help you head in the right direction faster: they focus on possible solutions rather than on error codes, and they come as structured output with example commands.

How useful k8sgpt's troubleshooting tips are depends on the backend used, that is, the model that generates the response. Having tried several backends, subjectively the best at the time of comparison was OpenAI with the gpt-4o model, although gpt-3.5-turbo often gives comparable answers and costs less. A plus of the gpt-4o answers: in addition to recommendations, it also produced diagnostic commands. Local models can give useful answers too, but everything depends on the capacity available.

In this experiment I did not compare the utility's output with a plain chat dialogue; such a comparison would hardly be fair. Chat is still the more flexible tool: you can repeatedly update the context, add comments and clarifications, for example paste the output of kubectl get events and the pod logs into the web chat of your choice and get a similar set of recommendations. Nevertheless, the boundary became more blurred once the interactive mode was added to k8sgpt.

As a plus of the utility, I would note its convenience: you can get debugging recommendations without switching from the cluster terminal to a browser. k8sgpt highlights typos and required resources you forgot to define. The analyzer module also runs extra checks for some resource types, sending the backend, along with the error text, additional debugging information relevant to that resource type. So, by using the utility during onboarding and getting a list of step-by-step actions, a newcomer can start debugging the cluster, first eliminating the simplest and most common errors.

Among the obvious disadvantages: the analysis covers only a certain set of cluster resource types, and it is not yet possible to debug custom resources, though the project is open and actively developed. Continuing with the minuses, intermittent errors were not caught, and, unexpectedly, the same ErrImagePull error is caught by k8sgpt only on the main container; for an initContainer the error went unnoticed.

In addition, in the future I would like to be able to change the prompt template, save the context and enrich it with new events. Analyzing pod logs would also be useful. That said, the project is developing quickly, and some features I was missing at the start of this review had appeared by the end, the interactive mode for example.

PS

The article is a compilation of personal experience and other reviews.
