How I broke and fixed a Kubernetes cluster running on a Raspberry Pi

I was playing around with updates and it ended in catastrophe: all the nodes suddenly stopped seeing their network interfaces, and no matter how hard I fought, I could not revive the cluster.

My home setup had grown into a mature cluster of six nodes (all thanks to my wife, who knew exactly what to give me for my birthday – a Raspberry Pi, of course, and more than one!), and I was faced with a choice: either follow my own instructions from the article on installing a Kubernetes cluster on a Raspberry Pi once again, or apply some systems engineering (DevOps and SRE) and fully automate rebuilding the cluster and managing it. This article can be considered a direct follow-up to my first article, How I Build a Raspberry Pi-based Kubernetes Home Cluster.


Time and labor considerations

At the time, I saw two options, and I mulled them over late into the night. Both were time-consuming.

Option 1

Get up early in the morning, repeat all the operations manually according to the instructions from my own article, and maybe improve a few things along the way.

But this method, as I have learned repeatedly, is fraught with errors and blunders, mostly caused by mistyped data or a missed step or two, followed by a lot of time spent figuring out what the hell could have happened and why the thing doesn't work yet again.

Then everything gets wiped back to zero and started over. I've done this more than once. But in the end, no one gets hurt by trying again.

Option 2

Get up early in the morning and start coding from scratch, warmed by the thought that the resulting solution will be useful not just this once but many, many times in the future. So the task is to rework and reassemble the entire cluster so that the whole process can be reproduced as easily as possible.

Of course, this would take much longer. But there are long-term advantages: I would no longer need to babysit the cluster itself, I would know the solution is stable, and I could move all of my home infrastructure onto it.

Chronicle of automation. Start

First of all, I worked through the basics and identified a set of principles that I followed throughout the entire process. The environment I ended up with is quite unusual, but the code is logically separated into parts, so it should be fairly easy for you to change individual pieces or comment out entire sections of the files to disable particular functions.

Principle 1. Configuring a Kubernetes cluster on a Raspberry Pi is performed in three stages: configuring a memory card, configuring nodes at the system level, and deploying Kubernetes resources.

Principle 2. My old Intel NUC runs an NFS server connected to DROBO storage. It is tempting to use it as permanent shared storage for all the nodes.

Principle 3. The Raspberry Pi cluster is running on my home VLAN, so I don’t really care about security. All services and nodes should be easily accessible without any trickery with names and passwords.

So, with this in mind, I started programming my little Frankenstein. To reproduce the results (that is, to get the system working), you will need:

  • A Mac (to format the card). If I ever take the time to set up a Linux VM, I will try to update the platform detection script.

  • Ansible (I used version 2.10.6).

  • Terraform (I used version 0.13.4, but 0.14.8 will work too).

  • Make, a handy tool that saves you from juggling parameters by hand.

RPi cluster. First step

The Raspberry Pi uses a memory card as a hard drive. It may not be the optimal or fastest read/write solution, but it should be sufficient for games and hobby projects.

What happens in the first step?

  • The memory card is being formatted.

  • The memory card is divided into two partitions: 1 GB plus whatever remains.

  • The Alpine Linux image is copied to the memory card.

  • A system overlay is created.

The system overlay sets eth0 to promiscuous mode, which is required for MetalLB to work, and allows SSH connections to the Raspberry Pi nodes without a password.
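
For reference, on Alpine the relevant piece of the overlay boils down to something like the snippet below; the real file is generated by the script, so treat this as an illustrative sketch rather than the exact shipped contents.

# /etc/network/interfaces (sketch of the overlay's network settings)
auto eth0
iface eth0 inet dhcp
    # MetalLB announces virtual IPs on this interface, hence promiscuous mode
    up ip link set eth0 promisc on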

Important: check the source of 001-prepare-card.sh and make sure /dev/disk5 is really the inserted memory card, otherwise data may be lost.
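
To give an idea of what the script does under the hood, here is a rough macOS sketch of the same steps; the file names and the Alpine image name are examples, and the real 001-prepare-card.sh in the repository is the source of truth.

# Hypothetical sketch of the card preparation (not the actual 001-prepare-card.sh)
DISK=/dev/disk5                     # double-check this is really the memory card!
diskutil unmountDisk $DISK
# two partitions: a 1 GB FAT32 boot partition plus whatever remains
diskutil partitionDisk $DISK 2 MBR "MS-DOS FAT32" BOOT 1G "Free Space" SYSTEM R
# copy the Alpine Linux image onto the boot partition (file name is an example)
tar -xzf alpine-rpi-aarch64.tar.gz -C /Volumes/BOOT
# add the system overlay with the SSH and promiscuous-mode settings
cp headless.apkovl.tar.gz /Volumes/BOOT/
diskutil unmountDisk $DISK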

Result: preparing six memory cards will take about one minute.

RPi cluster. Second step

The fun begins. So, you've inserted the memory cards into the Raspberry Pis, connected all the cables (network and power), and booted the system. Now you need to find out the devices' IP addresses. You can do this either by connecting a screen to each of them and running ifconfig eth0, or by logging into your router and checking there. Enter the appropriate values in the pi-hosts.txt file.

[masters]
pi0 ansible_host=192.168.50.132 # Pi0

[workers]
pi1 ansible_host=192.168.50.135 # Pi1
pi3 ansible_host=192.168.50.60  # Pi3
pi4 ansible_host=192.168.50.36  # Pi4
pi2 ansible_host=192.168.50.85  # Pi2
pi5 ansible_host=192.168.50.230 # Pi5

Important: some programs may require the hostname pi0 to run.

Add the following entry to your ~/.ssh/config file; it gives root access to all nodes named pi*.

Host pi?
  User root
  Hostname %h.local

Now our micro-calculators (you can tell how old I am!) are ready, and we need to prepare them for Ansible. This is easily done with the 001-prepare-ansible.sh script, which connects via SSH to each server listed in the pi-hosts file, configures chrony for NTP on each of them, and installs a Python interpreter.
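
In essence, the bootstrap boils down to something like the loop below; this is a simplified sketch that assumes the nodes run Alpine and are reachable as root through the SSH config above, not the literal contents of 001-prepare-ansible.sh.

# Simplified sketch of the Ansible bootstrap (assumes Alpine and root SSH access)
for host in pi0 pi1 pi2 pi3 pi4 pi5; do
  ssh "$host" 'apk add --no-cache python3 chrony \
    && rc-update add chronyd default \
    && rc-service chronyd restart'
done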

Important: you may need to open the rpi.yaml file and change the vars section to suit your preference. I did just that.

After this step, run the command ansible-playbook rpi.yaml -f 10, which will perform the following actions (an illustrative sketch of one of these tasks follows the lists):

GENERAL:

  • Installs the required packages.

  • Partitions and formats the RPi memory card.

  • Configures the “large” partition as the system drive.

  • Adds entries to the fstab file.

  • Confirms the changes.

  • Restarts the Pi so it boots from the “permanent” partition.

KUBEMASTER:

  • Configures the master node using kubeadm.

  • Saves tokens locally (in static/token_file).

  • Sets up the root user on the Pi with access to kubectl.

  • Saves Kubernetes settings locally (in the static/kubectl.conf file).

KUBEWORKER:

  • Copies tokens to worker nodes.

  • Joins the worker nodes to the master node using the token file.

  • Copies kubectl.conf to the root user on each worker node.

BASIC:

  • Untaints the master node so it can also accept workloads.

  • Installs py3-pip, PyYaml and Helm on the nodes.
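
To give a flavor of what the playbook's tasks look like, here is a hedged sketch of the worker-join step; the variable names are illustrative, and the real tasks live in rpi.yaml and its roles.

# Illustrative sketch of a worker-join task (variable names are examples)
- name: Join worker node to the cluster using the saved token
  command: >
    kubeadm join {{ master_ip }}:6443
    --token {{ kube_token }}
    --discovery-token-unsafe-skip-ca-verification
  args:
    creates: /etc/kubernetes/kubelet.conf   # makes re-runs of the playbook idempotent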

If you’ve made it to this point, congratulations! You have just created a basic Kubernetes cluster that doesn’t know how to do anything yet, but is ready to learn if you give it a little attention. The whole thing turned out to be quite simple: you just run a few scripts and wait for them to finish. I think this is definitely better than doing everything by hand.

Important: scripts can be run as many times as you like. You do not need to reformat the memory cards after each time.

Result: provisioning six nodes with a basic Kubernetes installation takes a couple of minutes, depending on your internet connection speed.

RPi cluster. Third step

After successfully completing the previous two steps, the Pi cluster is ready for its first deployments. The setup is done in a few steps, and of course these steps can also be automated: Terraform takes care of that.

Let’s take a look at the configuration first.

# Variables used for barebone kubernetes setup
network_subnet    = "192.168.50"

net_hosts = {
  adguard = "240"
  adguard_catchall = "249"
  traefik = "234"
  torrent_rpc = "245"
}

nfs_storage = {
  general = "/media/nfs"
  torrent = "/mnt/drobo-storage/docker-volumes/torrent"
  adguard = "/mnt/drobo-storage/docker-volumes/adguard"
}

# ENV variable: TRAEFIK_API_KEY sets traefik_api_key
# ENV variable: GH_USER, GH_PAT for authentication with private containers

The cluster lives on the 192.168.50.0/24 network, and by default MetalLB will use the “end” of the address pool, addresses 200-250. Since I run a home torrent server and Adguard DNS, I need to pin specific addresses for them. I also need the Traefik load balancer that serves the dashboards and other tools.
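
For the MetalLB part, the rendered configuration amounts to roughly the following (the ConfigMap format used by older MetalLB releases; newer releases use CRDs instead):

# Rough equivalent of the MetalLB layer-2 address pool for this network
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.168.50.200-192.168.50.250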

Important notes:

The nfs_*_path values must be consistent with the settings specified in the second step.

Make sure to add the contents of the static/kubernetes.conf file to your Kubernetes configuration file ~/.kube/config. I use home-k8s as the context name.
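
One way to do the merge, assuming the kubeadm default context name kubernetes-admin@kubernetes (adjust if your generated file uses a different one):

# merge the cluster config into ~/.kube/config and rename the context
KUBECONFIG=~/.kube/config:static/kubernetes.conf kubectl config view --flatten > /tmp/config-merged
mv /tmp/config-merged ~/.kube/config
kubectl config rename-context kubernetes-admin@kubernetes home-k8s
kubectl config use-context home-k8s
kubectl get nodes    # quick sanity check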

What does Terraform do?

Installs Flannel and patches its configuration for host-gw; installs MetalLB and configures it to use addresses 200-250 of var.network_subnet.

Installs the Traefik proxy and exposes it on the home network through the MetalLB load balancer. The Traefik dashboard itself is reachable at traefik.local.

The Traefik dashboard runs on a Pi cluster.

Installs the Adguard DNS service with persistent volume claims backed by NFS; exposes the dashboard (adguard.local) through Traefik and the service itself on an IP address allocated on the home network.

Adguard Home runs on a Pi cluster.
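
The NFS-backed storage behind it boils down to a PersistentVolume along these lines; the capacity and server address are illustrative placeholders, while the path comes from the nfs_storage variable above.

# Sketch of the NFS-backed volume for Adguard (capacity and server are examples)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: adguard-data
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteMany"]
  nfs:
    server: 192.168.50.2    # example address of the Intel NUC exporting the DROBO storage
    path: /mnt/drobo-storage/docker-volumes/adguard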

Installs and deploys the Prometheus and Grafana monitoring stack on all nodes. It patches the Prometheus DaemonSet to remove the need to mount volumes. Grafana is exposed through Traefik as grafana.local. The default Grafana username and password are admin:admin. Grafana comes with the devopsprodigy-kubegraf-app plugin pre-installed; I find it the best option for monitoring clusters.

The Grafana dashboard is launched on the Pi cluster.

Installs the Kubernetes dashboard and defines it as k8s.local via Traefik.

The Kubernetes dashboard runs on a Pi cluster.

Installs and deploys a torrent server (rTorrent) with the Flood web interface. The dashboard is exposed as torrent.local. It uses a number of mount points to store data (including configuration). The reason the replica count has to be 1 is simple: rTorrent has problems with lock files, and since it uses a shared directory, it simply won’t start if a lock file is found. I have rTorrent configured to listen on port 23340.

Since the Raspberry Pi boots from a memory card, that card can wear out over time from constant read/write operations. So I decided to back up etcd regularly to NFS. The backup job runs once a day with parameters set by Terraform. Each backup “weighs” about 32 megabytes.
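
Under the hood, a backup like this is essentially an etcdctl snapshot; the sketch below shows the idea using the default kubeadm certificate paths, while the actual job and its parameters are defined in Terraform.

# Roughly what the daily backup runs (paths and endpoint are the kubeadm defaults)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /media/nfs/etcd-backup-$(date +%F).db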

Launching Terraform

To make things a little easier, I’ve created a Makefile that you might find useful for customization. You may need to set the following environment variables:

TRAEFIK_API_KEY // Traefik API key
GH_USER // Github user
GH_PAT // Github Personal Access Token

Important: the GitHub credentials are not used at the moment, but I plan to add authentication for fetching private images from GHCR in the near future.
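
When that lands, it will most likely boil down to the standard image-pull secret; the snippet below is only a sketch of the planned feature, with a secret name I made up.

# The usual way to let the cluster pull private images from GHCR (sketch only)
kubectl create secret docker-registry ghcr-credentials \
  --docker-server=ghcr.io \
  --docker-username="$GH_USER" \
  --docker-password="$GH_PAT"
# ...then reference it from the pod spec via imagePullSecrets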

ADDITIONAL_ARGS=-var 'traefik_api_key=$(TRAEFIK_API_KEY)' -var "github_user=$(GH_USER)" -var "github_pat=$(GH_PAT)"

apply:
	cd infrastructure; terraform apply $(ADDITIONAL_ARGS) -auto-approve -var-file ../variables.tfvars

plan:
	cd infrastructure; terraform plan $(ADDITIONAL_ARGS) -var-file ../variables.tfvars

destroy:
	cd infrastructure; terraform destroy $(ADDITIONAL_ARGS) -var-file ../variables.tfvars

destroy-target:
	cd infrastructure; terraform destroy $(ADDITIONAL_ARGS) -var-file ../variables.tfvars -target $(TARGET)

refresh:
	cd infrastructure; terraform refresh $(ADDITIONAL_ARGS) -var-file ../variables.tfvars

init:
	cd infrastructure; rm -fr .terraform; terraform init

import:
	cd infrastructure; terraform import $(ADDITIONAL_ARGS) -var-file ../variables.tfvars $(ARGS)

lint:
	terraform fmt -recursive infrastructure/
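
Typical usage then looks something like this (the values are placeholders):

export TRAEFIK_API_KEY=changeme
export GH_USER=my-github-user
export GH_PAT=ghp_xxxxxxxxxxxx
make init      # initialize Terraform in the infrastructure directory
make plan      # review the changes
make apply     # deploy everything to the cluster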

Concluding remarks

The complete code is available on GitHub. Feel free to use and change it as you like (as usual, your comments and remarks are warmly welcomed). I have also published reworked multi-arch Docker images (rTorrent and Flood) that support ARM64 processors. I quite often wipe the entire cluster and rebuild it from scratch using the mentioned repository, and as new features appear I will keep updating it.

Automation saves time and hassle and, I must admit, does its job well. That’s why DevOps engineers are so highly valued in the market.
