How we look for degradation on nodes in Kubernetes clusters

Hi! I'm an engineer in the IaaS unit of the Infrastructure Development department at Avito. This article is about detecting degradation on nodes in Kubernetes clusters. I'll tell you about the tool we use and show the dashboard where we monitor the state of all our nodes.

Causes of degradation on nodes

Avito's infrastructure consists of thousands of bare-metal servers, most of which are grouped into dozens of Kubernetes clusters. At this scale, failures of individual Kubernetes nodes are a routine event. The causes vary: from a faulty memory module to problems with the container runtime.

If a node fails completely, Kubernetes handles the failure on its own and the workload is affected minimally. Partial degradation is worse: the node can stay in a bad state for a long time, and every service running on it suffers.

Typically, events unfold like this: a developer notices that their microservice is degrading, traces it to a specific node, and comes to us. We diagnose and repair the node. However, some problems kept recurring on different nodes, and every time they required our intervention.

An illustrative example of this kind of problem is loss of synchronization between the kubelet and Docker. If the kubelet fails to complete SyncLoop() within three minutes, it writes PLEG is not healthy to its log. In practice, this is a symptom of partial degradation.

In our experience, several messages of this type in the logs mean that people will soon come to complain about the node. Unfortunately, with the versions of the Kubernetes components we run, this problem has no permanent fix. But there is a decent workaround: systemctl restart docker. The operation does not affect the workload, and after it the problem either stops reproducing entirely or reproduces very slowly.

Incidentally, exactly the same fix is applied automatically in Google Kubernetes Engine clusters. We had to do it by hand.

We, of course, thought about automation too. Ideally, we wanted a system in which all known problems are detected and fixed automatically.

For the past six months I have been implementing Auto Healing mechanics. They now run in all Avito production and staging clusters. This article covers the first stage of Auto Healing: automatic detection of degradation.

Tool selection

Before this project, we used a home-grown service called k8s-node-acceptance to detect problems on nodes. It was Python code that ran checks at a given interval: docker registry availability, the status of infrastructure services on the node, and so on. The results were written to node conditions alongside the default conditions such as Ready, DiskPressure, etc. It worked reasonably well but had several drawbacks:

  • did not provide metrics;

  • consumed quite a lot of resources;

  • heavily loaded kube-apiserver, because node conditions were updated after every check run.

After a bit of research, I proposed replacing it with Node Problem Detector (NPD), part of the Kubernetes project. It is used by default in Google Kubernetes Engine and Amazon Elastic Kubernetes Service clusters. The service is written in Go, exposes metrics, can parse logs, ships with many checks out of the box, and is easy to extend. Experience has shown it was a good choice.

Node Problem Detector

Before rolling out NPD, several fundamental decisions had to be made. The first: run it as a pod or as a systemd unit? Each approach has its pros and cons.

Running it in a pod makes it easier to update and to monitor the state of NPD itself; it is far more flexible than baking it into the host image. It also lets us write checks from inside the pod network. On the other hand, the probability that the NPD pod will not start is higher than for a daemon, and running in a pod makes it impossible to use healthchecker, the NPD utility that checks the health of the kubelet and docker systemd units (or at least requires running with elevated privileges). After weighing the pros and cons, I settled on the first option.
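
As an aside, to make the privilege problem concrete: a check over systemd units ultimately has to talk to the host's systemd, roughly as in the minimal sketch below (a hypothetical illustration, not the NPD healthchecker itself), and that is exactly what is hard to allow from inside a pod without elevated privileges.

package main

import (
    "fmt"
    "os"
    "os/exec"
)

// Hypothetical sketch: verify that the given systemd units are active.
// Running this requires access to the host's systemd, which is why this
// kind of check favours the systemd-unit deployment model over a pod.
func main() {
    units := []string{"kubelet", "docker"}
    if len(os.Args) > 1 {
        units = os.Args[1:]
    }

    failed := 0
    for _, unit := range units {
        // `systemctl is-active --quiet` exits non-zero if the unit is not active.
        if err := exec.Command("systemctl", "is-active", "--quiet", unit).Run(); err != nil {
            fmt.Printf("UNIT %s IS NOT ACTIVE\n", unit)
            failed++
        }
    }
    if failed > 0 {
        os.Exit(1)
    }
    fmt.Println("all checked units are active")
}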

We use vanilla NPD with custom checks. The code lives in an internal repository and is compiled by CI/CD into a minimal docker image. The image is then delivered to all clusters with an ArgoCD ApplicationSet.

NPD's principle of operation is very simple: it is given paths to check manifests, which describe the check type, the run interval, and either a path to logs (it understands kmsg, journald, and free-form log files) or a path to a script with a custom check.

An example check manifest:

{
  "plugin": "custom",
  "pluginConfig": {
      "invoke_interval": "1m",
      "timeout": "1m",
      "max_output_length": 80,
      "concurrency": 1,
      "skip_initial_status": true
  },
  "source": "journalctl-custom-monitor",
  "metricsReporting": true,
  "conditions": [
      {
      "type": "PLEGisUnhealthy",
      "reason": "PLEGisHealthy",
      "message": "PLEG is functioning properly"
      }
  ],
  "rules": [
      {
      "type": "permanent",
      "condition": "PLEGisUnhealthy",
      "reason": "PLEGisUnhealthy",
      // Here we use a custom alternative to logcounter that comes with NPD
      // as our version runs much faster on large logs
      "path": "/home/kubernetes/bin/journalcounter",
      "args": [
          "--identifier=kubelet",
          "--lookback=10m",
          "--count=3",
          "--pattern=PLEG is not healthy: pleg was last seen active"
      ],
      "timeout": "1m"
      }
  ]
}
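
The journalcounter binary itself is not shown in the article. Purely as an illustration, a counter along these lines might look like the sketch below (my own minimal sketch, not our actual implementation; it assumes journalctl is available to the check and mirrors the flag names from the manifest above). The key point is the contract: the check exits non-zero once the pattern has been seen often enough, which NPD interprets as a failed check.

package main

import (
    "bufio"
    "flag"
    "fmt"
    "os"
    "os/exec"
    "strings"
    "time"
)

func main() {
    identifier := flag.String("identifier", "", "syslog identifier to filter by (e.g. kubelet)")
    lookback := flag.Duration("lookback", 10*time.Minute, "how far back to search the journal")
    count := flag.Int("count", 1, "number of matches considered a problem")
    pattern := flag.String("pattern", "", "substring to search for")
    flag.Parse()

    // Ask journalctl only for the relevant slice of the journal
    // instead of reading the whole log.
    since := time.Now().Add(-*lookback).Format("2006-01-02 15:04:05")
    cmd := exec.Command("journalctl",
        "--identifier="+*identifier,
        "--since="+since,
        "--no-pager",
        "--output=cat",
    )
    out, err := cmd.StdoutPipe()
    if err != nil {
        fmt.Printf("failed to open journalctl pipe: %v\n", err)
        os.Exit(1)
    }
    if err := cmd.Start(); err != nil {
        fmt.Printf("failed to start journalctl: %v\n", err)
        os.Exit(1)
    }

    matches := 0
    scanner := bufio.NewScanner(out)
    for scanner.Scan() {
        if strings.Contains(scanner.Text(), *pattern) {
            matches++
        }
    }
    _ = cmd.Wait()

    // NPD custom-plugin contract: a non-zero exit code means a problem was found.
    if matches >= *count {
        fmt.Printf("FOUND %d OCCURRENCES OF PATTERN IN THE LAST %v\n", matches, *lookback)
        os.Exit(1)
    }
    fmt.Printf("found %d occurrences, below the threshold of %d\n", matches, *count)
}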

Scripts can be written in any language; the only requirement is a zero exit code if the check passes and a non-zero one if a problem is detected. There was a temptation to simply move some checks over from k8s-node-acceptance, packaging them as separate Python scripts. That turned out not to be a great idea.

NPD runs checks independently of one another, which means each of them starts its own Python interpreter. They ran poorly, so I rewrote the checks in Go and added a few new ones of my own. For comparison: with the Python checks NPD did not always fit into a 500m CPU limit, while with the Go checks it rarely consumes more than 30m.

Below are two examples of scripts we use.

A simple check to determine whether TCP connections can be established to the selected hosts:

package main

import (
    "fmt"
    "net"
    "os"
    "strings"
    "time"
)

const TIMEOUT = 2 * time.Second

func checkTCPConnect(endpoints []string) (bool, string) {
    errors := 0

    for _, endpoint := range endpoints {
        parts := strings.Split(endpoint, ":")
        if len(parts) != 2 {
            return false, fmt.Sprintf("INVALID ENDPOINT FORMAT: %s", endpoint)
        }

        conn, err := net.DialTimeout("tcp", endpoint, TIMEOUT)
        if err != nil {
            errors++
            continue
        }
        // Close right away instead of deferring, so connections from earlier
        // iterations are not kept open until the program exits.
        conn.Close()
    }

    endpointString := strings.Join(endpoints, ", ")
    if errors == len(endpoints) {
        // We use uppercase writing to make errors more noticeable among node conditions
        return false, fmt.Sprintf("TIMEOUT TO ENDPOINTS: %s", strings.ToUpper(endpointString))
    }
    return true, fmt.Sprintf("connected to at least one endpoint: %s", endpointString)
}

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Usage: tcp-connect address1:port1 address2:port2 ...")
        os.Exit(1)
    }
    endpoints := os.Args[1:]
    result, msg := checkTCPConnect(endpoints)
    fmt.Println(msg)
    if !result {
        os.Exit(1)
    }
}

A slightly more complex script connects to the bird socket and checks that calico has at least one active BGP peer. Most of the code is borrowed from calicoctl, calico's native CLI utility (essentially the script does the same thing as running calicoctl node status):

package main

import (
    "bufio"
    "errors"
    "fmt"
    "net"
    "os"
    "reflect"
    "regexp"
    "strings"
    "time"

    log "github.com/sirupsen/logrus"
)

// Timeout for querying BIRD.
var birdTimeOut = 4 * time.Second

// Expected BIRD protocol table columns
var birdExpectedHeadings = []string{"name", "proto", "table", "state", "since", "info"}

// bgpPeer is a structure containing details about a BGP peer
type bgpPeer struct {
    PeerIP   string
    PeerType string
    State    string
    Since    string
    BGPState string
    Info     string
}

// Check for Word_<IP> where every octet is separated by "_", regardless of IP protocol
// Example match: "Mesh_192_168_56_101" or "Mesh_fd80_24e2_f998_72d7__2"
var bgpPeerRegex = regexp.MustCompile(`^(Global|Node|Mesh)_(.+)$`)

// Mapping the BIRD/GoBGP type extracted from the peer name to the display type
var bgpTypeMap = map[string]string{
    "Global": "global",
    "Mesh":   "node-to-node mesh",
    "Node":   "node specific",
}

func checkBGPPeers() (bool, string) {
    // Show debug messages
    // log.SetLevel(log.DebugLevel)

    // Try connecting to the bird socket in `/var/run/calico/` first to get the data
    c, err := net.Dial("unix", "/var/run/calico/bird.ctl")
    if err != nil {
        // If that fails, try connecting to bird socket in `/var/run/bird` (which is the
        // default socket location for bird install) for non-containerized installs
        c, err = net.Dial("unix", "/var/run/bird/bird.ctl")
        if err != nil {
            return false, "ERROR: UNABLE TO OPEN BIRD SOCKET"
        }
    }
    defer c.Close()

    // To query the current state of the BGP peers, we connect to the BIRD
    // socket and send a "show protocols" message. BIRD responds with
    // peer data in a table format
    //
    // Send the request
    _, err = c.Write([]byte("show protocols\n"))
    if err != nil {
        return false, "UNABLE TO WRITE TO BIRD SOCKET"
    }

    // Scan the output and collect parsed BGP peers
    peers, err := scanBIRDPeers(c)

    if err != nil {
        // If "read unix @->/var/run/calico/bird.ctl: i/o timeout" then skip check
        // This error usually means that it is very high LA on node
        if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
            return true, fmt.Sprintf("Skipping because of: %v", err)
        } else {
            return false, fmt.Sprintf("ERROR: %v", err)
        }
    }

    // If no peers were returned then just print a message
    if len(peers) == 0 {
        return false, "CALICO HAS NO BGP PEERS"
    }

    for _, peer := range peers {
        log.Debugf("peer %s is in state %s", peer.PeerIP, peer.BGPState)
        if peer.BGPState == "Established" {
            return true, "calico bird has at least one peer with an established connection"
        }
    }

    return false, "NO CONNECTION TO BGP PEERS"

}

func scanBIRDPeers(conn net.Conn) ([]bgpPeer, error) {
    ipSep := "."

    // The following is sample output from BIRD
    //
    //  0001 BIRD 1.5.0 ready.
    //  2002-name     proto    table    state  since       info
    //  1002-kernel1  Kernel   master   up     2016-11-21
    //       device1  Device   master   up     2016-11-21
    //       direct1  Direct   master   up     2016-11-21
    //       Mesh_172_17_8_102 BGP      master   up     2016-11-21  Established
    //  0000
    scanner := bufio.NewScanner(conn)
    peers := []bgpPeer{}

    // Set a time-out for reading from the socket connection
    err := conn.SetReadDeadline(time.Now().Add(birdTimeOut))
    if err != nil {
        return nil, errors.New("failed to set time-out")
    }

    for scanner.Scan() {
        // Process the next line that has been read by the scanner
        str := scanner.Text()

        log.Debug(str)

        if strings.HasPrefix(str, "0000") {
            // "0000" means end of data
            break
        } else if strings.HasPrefix(str, "0001") {
            // "0001" code means BIRD is ready
        } else if strings.HasPrefix(str, "2002") {
            // "2002" code means start of headings
            f := strings.Fields(str[5:])
            if !reflect.DeepEqual(f, birdExpectedHeadings) {
                return nil, errors.New("unknown BIRD table output format")
            }
        } else if strings.HasPrefix(str, "1002") {
            // "1002" code means first row of data
            peer := bgpPeer{}
            if peer.unmarshalBIRD(str[5:], ipSep) {
                peers = append(peers, peer)
            }
        } else if strings.HasPrefix(str, " ") {
            // Row starting with a " " is another row of data
            peer := bgpPeer{}
            if peer.unmarshalBIRD(str[1:], ipSep) {
                peers = append(peers, peer)
            }
        } else {
            // Format of row is unexpected
            return nil, errors.New("unexpected output line from BIRD")
        }

        // Before reading the next line, adjust the time-out for
        // reading from the socket connection
        err = conn.SetReadDeadline(time.Now().Add(birdTimeOut))
        if err != nil {
            return nil, errors.New("failed to adjust time-out")
        }
    }

    return peers, scanner.Err()
}

// Unmarshal a peer from a line in the BIRD protocol output. Returns true if
// successful, false otherwise
func (b *bgpPeer) unmarshalBIRD(line, ipSep string) bool {
    columns := strings.Fields(line)
    if len(columns) < 6 {
        log.Debug("Not a valid line: fewer than 6 columns")
        return false
    }
    if columns[1] != "BGP" {
        log.Debug("Not a valid line: protocol is not BGP")
        return false
    }

    // Check the name of the peer is of the correct format.  This regex
    // returns two components:
    // -  A type (Global|Node|Mesh) which we can map to a display type
    // -  An IP address (with _ separating the octets)
    sm := bgpPeerRegex.FindStringSubmatch(columns[0])
    if len(sm) != 3 {
        log.Debugf("Not a valid line: peer name '%s' is not correct format", columns[0])
        return false
    }
    var ok bool
    b.PeerIP = strings.Replace(sm[2], "_", ipSep, -1)
    if b.PeerType, ok = bgpTypeMap[sm[1]]; !ok {
        log.Debugf("Not a valid line: peer type '%s' is not recognized", sm[1])
        return false
    }

    // Store remaining columns (piecing back together the info string)
    b.State = columns[3]
    b.Since = columns[4]
    b.BGPState = columns[5]
    if len(columns) > 6 {
        b.Info = strings.Join(columns[6:], " ")
    }

    return true
}

func main() {
    var message string
    var result bool

    result, message = checkBGPPeers()

    fmt.Println(message)

    if !result {
        os.Exit(1)
    }
}

In addition, we use two sets of checks that ship with NPD: docker-monitor and kernel-monitor. They parse the journal and kmsg respectively, looking for things like the file system being remounted read-only, memory errors, and so on.

Types of NPD checks

NPD supports two types of checks: permanent and temporary. The results of the first are reflected in node conditions and in the problem_gauge metric; the results of the second only in the problem_counter metric. Almost all of our checks are of the first type because:

  • the Auto Healing mechanics react specifically to conditions;

  • if there is some degradation on a node, we want to see it right in the output of kubectl describe node.

When designing checks, I started from the principle that each of them must meet one of two requirements: either unambiguously detect a problem we already know about, or give a clear picture of some specific condition.

An example of a check that unambiguously detects a known problem is PLEGisUnhealthy. It is written to trigger only when there is a problem that requires intervention. That is why we search the logs for the pattern PLEG is not healthy: pleg was last seen active, not just PLEG is not healthy. The shorter pattern would also match the following log entry:

skipping pod synchronization - [container runtime status check may not have completed yet., PLEG is not healthy: pleg has yet to be successful.],

And that entry, as a rule, does not indicate a problem. Additionally, the check only fails if the pattern occurs at least three times within the last 10 minutes.

An example of a check that gives a clear picture of a specific condition is RegistryIsNotAvailable. It checks whether the docker registry is reachable from the node. If the check fails, all we can say for sure is that the registry is not accessible from that node. If other network checks fail along with it, the problem is most likely the node's network connectivity.

If RegistryIsNotAvailable fails on many nodes at once, we can assume the problem is in the registry itself or in the network path to it. The purpose of such checks is to give an accurate answer to a simple question, for example, "is the registry available from the node?". They are sometimes a great help in diagnosis, but they should not be used for Auto Healing, since such checks can fail for many different reasons.
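
The check itself is not listed in the article; as an illustration only, a minimal registry probe could look like the sketch below (my assumptions: the registry URL is passed as an argument, the registry speaks the Docker Registry HTTP API v2, and any HTTP response, even 401, counts as reachable):

package main

import (
    "fmt"
    "net/http"
    "os"
    "time"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Println("Usage: registry-check https://registry.example.com")
        os.Exit(1)
    }
    registry := os.Args[1]

    client := &http.Client{Timeout: 5 * time.Second}
    // Any HTTP response (including 401 Unauthorized) proves that the registry
    // is reachable from the node; only a transport-level error fails the check.
    resp, err := client.Get(registry + "/v2/")
    if err != nil {
        fmt.Printf("REGISTRY IS NOT AVAILABLE: %v\n", err)
        os.Exit(1)
    }
    defer resp.Body.Close()

    fmt.Printf("registry responded with HTTP %d\n", resp.StatusCode)
}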

Improved observability

Before NPD, we had no dashboard that let us monitor node problems across all clusters at once. We could use metrics derived from node conditions, but that did not work very well: the information arrived through kube-apiserver with a noticeable lag, and the PromQL queries were sluggish. With NPD metrics everything is fast, and if NPD detects a problem, we learn about it with minimal delay.

I figured that if we enriched these metrics with other useful information, we could build a dashboard giving a fairly complete picture of what is happening with all our Kubernetes nodes. That is how the dashboard appeared that today serves as our team's main observability tool for nodes.

The dashboard consists of several sections. The first gives an overview of the current state of the nodes.

It displays:

  • the total number of nodes on which problems were detected;

  • the total number of different types of problems detected;

  • a table with the types of problems and the names of the clusters in which they were found;

  • a table with problem nodes and the following information about them:

    • failed NPD and kubelet checks,

    • node status (Ready/NotReady/Unknown),

    • whether the node is flapping (has it transitioned from Ready to NotReady and back within the last 20 minutes),

    • whether the node is cordoned.

The top three panels let you quickly see when a problem is becoming widespread. In those rare cases the first counter, on the left, usually goes into the red zone while the second does not, since the same problem is being detected on all nodes. The table on the right shows which check failed and on how many nodes. If a large number of nodes share the same problem, the counter on the right side of the table turns red.

Below is the second section. It opens with a large panel with a history of states. A node appears on this panel if it went into NotReady or hit the NPD radar.

By hovering the mouse over any of the blocks, you can see the names of the checks that failed at that moment. But you can get a general idea without that, based on the color of the rectangles:

  • red means critical problems like Kernel Deadlock or node transition to NotReady;

  • blue and its shades are different types of network problems;

  • purple – expiring certificates and the like;

  • yellow – too high load average.

If several checks fail at once, the color is determined by the most critical one. By the way, the panel is built with the Statusmap plugin written by the folks at Flant. Unfortunately, it has not been maintained for a long time. The plugin has a couple of unpleasant bugs, but overall it works fine.

Below is another panel that displays the number of problems of all types in each of the clusters. This is what it usually looks like:

And this is what an incident looks like:

The graph shows a network failure that caused several checks to fail on all nodes in several clusters. In this case the dashboard did not reveal the root cause, but it allowed us to immediately localize the problem to specific clusters.

Under this panel there are several more sections with various kinds of useful information:

  • Auto Healing attempts;

  • a list of cordoned nodes;

  • NPD status on the nodes (whether it is running and whether it is up to date), and so on.

Benefits and further development

NPD has already played a major role in a couple of incidents. We have an alert that fires if the same check fails on more than 20% of nodes at once. Thanks to it, we learned about the incidents some time before the requests from other teams started coming in.

The alert contains a link to the dashboard shown above. In two cases the dashboard let us immediately understand which infrastructure component was failing and start fixing the problem right away. Afterwards we used it to confirm that the degradation had stopped.

The dashboard also turned out to be useful in less dramatic situations. Now, if we want to check the status of a particular node, we first look to see if it appeared on the NPD radar.

Of course, we will keep improving all of the mechanics described here. The infrastructure evolves rapidly: problems that occur regularly today will disappear tomorrow, while new ones will appear and require new NPD checks. The maintenance effort, however, is small.

The dashboard is also gradually evolving. For example, the main panel with the list of problem nodes now shows when Auto Healing mechanics are being applied to any of them. In the future we may add problems that can be unambiguously detected from metrics, which will raise node observability in the clusters to an even higher level.

Results

In this article we discussed the mechanics of detecting degradation with Node Problem Detector and looked at how to extend its standard set of checks. We also assessed how useful the tool is for the observability of Kubernetes clusters.

That's all on detecting degradation. If you'd like to discuss implementation details, ask away in the comments. Thank you for reading!

Subscribe to the AvitoTech channel on Telegram, where we share more about our engineers' professional experience, our projects and work at Avito, and announce meetups and articles.
