The thorny path to eBPF, or how we implemented Cilium in Deckhouse

Not so long ago, we decided to add Cilium support to our Kubernetes platform, Deckhouse. However, while developing the cni-cilium module, we unexpectedly ran into difficulties that we even had to turn to the project's authors to overcome. Now that the module has been successfully brought to a working state, we can take a breath and share our impressions of the experience and of the product in general.

Foreword

Cilium is an Open Source project that provides networking, security, and observability for cloud native environments such as Kubernetes and other container orchestration platforms. Cilium is based on a Linux kernel technology called eBPF, which makes it possible to dynamically inject powerful security, visibility, and network management logic into the kernel of the OS.

The idea behind Deckhouse is to offer users a ready-to-use Kubernetes with minimal effort and maximum automation. Previously, we used two modules to implement networking in a cluster: flannel and simple-bridge. However, they have significant limitations, such as relying on iptables (which is slow) and being unable to enforce policies between cluster nodes (policies are only available between Pods and Services). Cilium support was meant to remove these limitations. (It first appeared in the Deckhouse v1.33 release, which we recently covered.)

Cilium provides three main advantages: speed (thanks to eBPF), observability (thanks to Hubble), and security (Network Policies). In each of these areas, we ran into issues that required extra effort to achieve the desired result. Let's take a closer look at them, but first, a brief digression into what eBPF is and how it works.

An excursion into eBPF and its connection with Cilium

eBPF is a Linux kernel technology that allows you to run sandboxed programs inside the operating system kernel. It is used to safely and efficiently extend the capabilities of the kernel without having to modify its source code or load kernel modules. Application developers can use eBPF programs to add extra functionality to the OS at runtime. At the same time, the operating system guarantees safety and execution efficiency, as if the code were natively compiled, with the help of a JIT compiler and a verification engine.
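To make this more concrete, below is a minimal sketch of an eBPF program (an illustration, not Cilium code) that attaches to the XDP hook and counts incoming packets in an eBPF map. It is compiled to eBPF bytecode with clang/LLVM, and on load the kernel verifier checks it before the JIT compiler translates it to native code:

// xdp_count.c - a minimal illustrative eBPF program (not Cilium code).
// Build with: clang -O2 -g -target bpf -c xdp_count.c -o xdp_count.o
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// An eBPF map: a single-slot array shared between the kernel and user space.
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} packet_count SEC(".maps");

// Attached to the XDP hook: runs for every packet arriving on the interface.
SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *value = bpf_map_lookup_elem(&packet_count, &key);
    if (value)
        __sync_fetch_and_add(value, 1);
    return XDP_PASS; // let the packet continue up the stack
}

char LICENSE[] SEC("license") = "GPL";

The same pattern, programs attached to kernel hooks plus eBPF maps for state and for communicating with user space, is what Cilium builds its entire datapath on.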

Cilium actively uses this technology to develop new ways of working with the network. Familiar tools like iptables/netfilter have a 25-year history of development behind them that has brought them to a seemingly finished form which no longer needs tweaks or improvements. They are used absolutely everywhere, from legendary home routers like the D-Link DIR-300 to millions of Kubernetes clusters running kube-proxy with the iptables backend. At the same time, in some cases we can safely say that they are inefficient, and that things could be done… differently.

By using Cilium, you are betting on something that does not have 25 years of history behind it. The project is constantly evolving and making progress, not only in its own eBPF programs and their LLVM compiler, but also in the networking and eBPF subsystems of the Linux kernel itself. This is one more argument in favor of trying to figure out this technology and adopting it in your cluster.

For a deeper dive into the topic, I recommend reading this article. It covers the details of a network packet's life in Cilium, how the Pod-to-Service traffic path is determined, and the BPF processing logic.

And now, on to the adventures we had to face.

The adventure begins: forgetful conntrack

Conntrack is a connection-tracking mechanism that is an important part of any network filter. The one in the Cilium agent is similar to those in any stateful firewall, such as conntrack in netfilter. It determines whether packets belong to the same flow, which makes it possible to apply the additional, CPU-intensive processing (policy evaluation, NAT decisions) only to the first packet of a flow.
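The layout of Cilium's actual conntrack map is more involved, but the idea can be sketched in libbpf-style C as follows (a simplified, hypothetical fragment; the names and field layout are illustrative, not Cilium's real definitions):

// ct_sketch.c - simplified illustration of a conntrack-style eBPF map;
// names and field layout are illustrative, not Cilium's actual CT map.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct ct_key {                 // the flow's 5-tuple
    __u32 src_ip;
    __u32 dst_ip;
    __u16 src_port;
    __u16 dst_port;
    __u8  proto;
    __u8  pad[3];
};

struct ct_entry {               // state cached after the first packet
    __u64 last_seen;            // used to expire idle flows
    __u8  verdict;              // e.g. the cached policy decision
    __u8  pad[7];
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 65536);
    __type(key, struct ct_key);
    __type(value, struct ct_entry);
} ct_map SEC(".maps");

// On each packet: a hit means the expensive policy/NAT logic can be skipped.
static __always_inline struct ct_entry *ct_lookup(struct ct_key *key)
{
    return bpf_map_lookup_elem(&ct_map, key);
}

A lookup hit means the verdict cached after the first packet can be reused; a miss means the packet is treated as the start of a new flow, which is exactly what made the behavior described below so painful.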

We encountered strange behavior: the conntrack table stored in an eBPF map was wiped every time the Cilium agent restarted. Since conntrack then knew nothing about the packets the client and server were already exchanging, the whitelist-based policy discarded all packets sent by the server back to the client.

Here is an example of the parsing of a single packet (as shown by cilium monitor -D -vv after enabling the debug-verbose: datapath option):

Ethernet        {Contents=[..14..] Payload=[..118..] SrcMAC=00:00:5e:00:01:00 DstMAC=d0:0d:ba:34:90:bc EthernetType=IPv4 Length=0}
IPv4    {Contents=[..20..] Payload=[..98..] Version=4 IHL=5 TOS=0 Length=2519 Id=54017 Flags=DF FragOffset=0 TTL=63 Protocol=TCP Checksum=18846 SrcIP=10.160.128.31 DstIP=10.160.128.34 Options=[] Padding=[]}
TCP     {Contents=[..32..] Payload=[..66..] SrcPort=2379(etcd-client) DstPort=24824 Seq=2205186449 Ack=1078895699 DataOffset=8 FIN=false SYN=false RST=false PSH=true ACK=true URG=false ECE=false CWR=false NS=false Window=368 Checksum=8011
 Urgent=0 Options=[TCPOption(NOP:), TCPOption(NOP:), TCPOption(Timestamps:388221802/215800007 0x1723cb6a0cdcd8c7)] Padding=[]}
  Packet has been truncated
  Failed to decode layer: No decoder for layer type Payload
CPU 01: MARK 0x0 FROM 137 from-network: 2533 bytes (128 captured), state new, interface eth0, orig-ip 0.0.0.0
CPU 01: MARK 0x0 FROM 137 DEBUG: Successfully mapped addr=10.160.128.31 to identity=7
CPU 01: MARK 0x0 FROM 137 DEBUG: Successfully mapped addr=10.160.128.34 to identity=1
CPU 01: MARK 0x0 FROM 137 DEBUG: Conntrack lookup 1/2: src=10.160.128.31:2379 dst=10.160.128.34:24824
CPU 01: MARK 0x0 FROM 137 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=0
CPU 01: MARK 0x0 FROM 137 DEBUG: CT verdict: New, revnat=0
CPU 01: MARK 0x0 FROM 137 DEBUG: Successfully mapped addr=10.160.128.31 to identity=7
CPU 01: MARK 0x0 FROM 137 DEBUG: Policy evaluation would deny packet from 7 to 1

The log shows that etcd sends a TCP ACK toward an ephemeral port of kube-apiserver. The conntrack table is empty, so the eBPF program tries to create a new entry in it. The policy module therefore sees a new connection originating from etcd, which is not allowed by the policies. Nor can it be allowed, because the port on the apiserver side is ephemeral (kube-apiserver initiates the TCP connection to etcd).

We initiated the analysis of this problem in an issue on GitHub, which eventually led to the necessary fix. Interacting with the developers directly in the Cilium project's Slack greatly accelerated the process and left us with a very positive impression of the community.

But our challenges in implementing Cilium did not end there…

Network policies that are hard to beat

Cilium provides quite flexible network security policies. For example:

  • the ability to match traffic at L3/L4 (and even L7), as well as by entity identity, since Cilium can uniquely identify the parties exchanging traffic (see the example after this list);

  • policies not only for traffic between Pods, but for all traffic present on the host, be it traffic to hostNetwork Pods or to system daemons launched via systemd.
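For instance, an identity-aware rule with an L7 match might look like this (a hedged sketch; the names and labels are hypothetical):

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-get-from-frontend    # hypothetical policy name
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend              # matched by identity, not by IP address
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:                      # L7 match: only GET requests to /api/...
        - method: GET
          path: "/api/.*"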

Despite this, policies sometimes cause problems when they clash with the specifics of how Kubernetes objects are implemented.

This is exactly the kind of problem we ran into: Cilium did not allow traffic to healthCheckNodePorts, which Kubernetes allocates only for type: LoadBalancer Services with externalTrafficPolicy: Local. This forced us to specify these ports manually and also to allow them in CiliumClusterwideNetworkPolicies.
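As an illustration, the workaround boils down to pinning the health-check port in the Service and explicitly allowing it in a clusterwide policy. A hedged sketch (hypothetical names and port number; host policies also assume Cilium's host firewall is enabled):

apiVersion: v1
kind: Service
metadata:
  name: my-app                      # hypothetical Service
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  healthCheckNodePort: 32100        # pinned manually instead of auto-allocation
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-lb-health-checks
spec:
  nodeSelector: {}                  # apply to all nodes
  ingress:
  - fromEntities:
    - world                         # the cloud LB health checker comes from outside
    toPorts:
    - ports:
      - port: "32100"               # must match healthCheckNodePort above
        protocol: TCP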

DSR, or preserving the client IP address

DSR (Direct Server Return) is a great thing to use with externalTrafficPolicy: Cluster, since it preserves the client IP address and avoids an extra hop on the packet's return path. It is enabled via the --set loadBalancer.mode=dsr parameter.
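In terms of the upstream Cilium Helm chart, the relevant values look roughly like this (a sketch; check the documentation for your version, since DSR also depends on native routing and the kube-proxy replacement):

# values.yaml fragment for the upstream Cilium Helm chart (sketch)
loadBalancer:
  mode: dsr                    # same as --set loadBalancer.mode=dsr
tunnel: disabled               # DSR works in native routing mode, not with tunneling
kubeProxyReplacement: strict   # DSR relies on Cilium's kube-proxy replacement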

The figures below show how traffic flows with and without DSR:

Despite the advantages it provides, we also ran into several drawbacks:

  1. DSR stops working for nodePort traffic directed to Pods. The problem is already known and described in the corresponding PR.

  2. DSR does not work at all for hostNetwork Pods. This problem is also known.

We fixed the first case by including the aforementioned PR in the Cilium build shipped with Deckhouse. The second one has no simple solution, because it requires a significant rework of the bpf_host.c eBPF program.

Therefore, at the moment, hostNetwork Pods in Deckhouse will only work with DSR when externalTrafficPolicy: Local is used.

Hubble Performance

Hubble is a tool that provides observability for your services. It allows you to transparently monitor their communication and behavior, as well as the network infrastructure. Hubble provides visibility at the node level, the cluster level, or even across multiple clusters in a multi-cluster (Cluster Mesh) scenario. You can read more about its capabilities and how it interacts with Cilium in the official documentation.

Apart from Hubble Relay, which aggregates data, and Hubble UI, which nicely visualizes packets and the policy decisions made on them, Hubble is also embedded in every cilium-agent. It is the agent that exports information about each individual packet for further processing and visualization.

In one cluster, where Istio was running and the number of packets and concurrent TCP connections went through the roof, we encountered very high CPU consumption by cilium-agent (even though the documentation promises an overhead of no more than 1–15%).

The graph below shows the effect of disabling Hubble in cilium-agent on this cluster of ~30 nodes:

We have not dug into this problem yet, considering it sufficient at this stage to provide a setting that lets you disable Hubble in cilium-agent.
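In terms of the upstream Helm chart, disabling Hubble amounts to roughly the following (a sketch; Deckhouse exposes this through its own module configuration):

# values.yaml fragment for the upstream Cilium Helm chart (sketch)
hubble:
  enabled: false   # stops the per-packet export in every cilium-agent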

Junk monitoring

During the initial rollout, we selected several key metrics by which to judge the state of the Cilium agent and to trigger alerts. At one point, the cilium_controllers_failing metric fired on several clusters, and cilium status --verbose showed a problem with updating the cilium_throttle eBPF map. There were no logs, and attempts to figure anything out with the cilium CLI also failed.

A detailed analysis revealed that the error occurring when deleting elements of the bandwidth controller's eBPF map was not handled in any way, and no message was written to the log. The metric, however, was still generated, which caused the confusion.

We proposed a fix to Cilium, and it was accepted into the upstream.
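For reference, an alert on this metric can be expressed roughly as follows (a hedged sketch using the Prometheus Operator's PrometheusRule CRD; the name, threshold, and duration are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cilium-agent-alerts       # hypothetical name
spec:
  groups:
  - name: cilium
    rules:
    - alert: CiliumControllersFailing
      expr: cilium_controllers_failing > 0   # gauge exported by cilium-agent
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Cilium agent has failing controllers on {{ $labels.instance }}"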

Conclusions

Beginners should be warned against mindlessly adopting Cilium without sufficient experience with Kubernetes, Linux, and networking. Unfortunately, skills in working with the Linux network subsystems will only help you deal with the symptoms of potential problems, not solve them. Where previously you could simply fix a rule in iptables, now you have to debug eBPF programs attached to various kernel hooks (XDP, tc, cgroup) and analyze whether the cilium-agent, written in Go, puts the correct values into (or removes them from) the eBPF maps. And do not forget about possible bugs in the kernel and in LLVM, which we have not covered here so as not to pile it on. Therefore, beginners are better off sticking with a simpler way of implementing the network (such as kube-router).

Although Cilium has a rather high barrier to entry, it is a powerful and uniquely useful tool that adds many capabilities to cluster management. Under the hood, eBPF enables new kinds of security policies, detailed monitoring of network interactions between cluster services, and faster ways of routing packets. All this makes Cilium a very interesting project worth paying attention to.

At the same time, both Cilium and eBPF have nuances that call for deep diagnostics and subsequent improvements. Despite the possible inconveniences of adopting these technologies, we at Flant believe in their future, and the appearance of the Cilium module in Deckhouse is a logical step towards it.

