Protecting containers with Seccomp filters

Many companies use containers as a fundamental technology for managing and running their applications. If you’re already experienced with containers, you’ll understand the appeal: containers bring whole new levels of portability and scalability. However, like any other technology, containers also open up new ways to exploit applications.

With certain container configurations, an application exploit can eventually compromise the host running the container. There are other issues to consider as well, such as secrets stored in containers as environment variables and what resources containers have access to. If you want to learn more about Docker container security best practices, check out this useful cheat sheet.

The established software development life cycle already includes security processes such as vulnerability scanning and software composition analysis, but more is needed. Most existing application security tools focus on preventing vulnerabilities, but few can limit the damage once an application has been successfully exploited. I found a new way to protect containerized applications after an exploit has already happened. In this post, I’ll explain what it is and how to integrate it seamlessly into existing software development processes. For this extra layer of protection I used Seccomp-BPF, so before going into details, I need to say a little about it.

▍ Introduction

Programs running on a computer rely heavily on the operating system. In modern programming languages, tasks such as opening files and creating new processes are abstracted away, but under the hood they are carried out through requests to the kernel called system calls (syscalls). How important are syscalls to a program? The Linux kernel has roughly four hundred of them, and even a simple “Hello, World!” program written in C relies on at least two: write to print the text and exit to terminate.

Code running in so-called “user space” cannot touch files, the network, or other processes without asking the kernel to do it. The Linux kernel developers realized they could use this to build a powerful security feature: Linux 3.5, released in July 2012, added support for Seccomp-BPF.

Seccomp-BPF is a Linux kernel feature that lets you restrict which system calls a process can make by attaching a special filter to it.

In theory, you can create a Seccomp-BPF filter that allows a process to make only the syscalls it needs to do its job, and nothing else. This is useful if the application has a vulnerability that lets an attacker, for example, spawn additional processes. If Seccomp does not let the process make syscalls outside that set, there is a good chance the attacker will be stopped.
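To check whether a process is already running under a Seccomp filter, you can read the Seccomp field that the kernel reports in /proc/<pid>/status. The function below is a small sketch of that check. The name seccomp_mode is my own. The optional file argument exists only so the parsing can be tried on a saved copy of a status file.

```shell
#!/bin/sh
# Report the Seccomp mode of a process from the "Seccomp:" field of its
# /proc/<pid>/status file (Linux kernel built with CONFIG_SECCOMP).
# 0 = disabled, 1 = strict, 2 = filter (Seccomp-BPF).
# Usage: seccomp_mode [STATUS_FILE]   (defaults to /proc/self/status)
# Hypothetical helper name; not part of any tool mentioned in this post.
seccomp_mode() {
  local status_file="${1:-/proc/self/status}" value
  # The line looks like "Seccomp:<TAB>2"; print just the number.
  # With the default path, /proc/self is awk's own status, but filters are
  # inherited by child processes, so it reflects the calling shell's mode.
  value=$(awk '$1 == "Seccomp:" { print $2 }' "$status_file") || return 1
  case "$value" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown"; return 1 ;;
  esac
}
```

For a running container, you could pass the host-side status file of its main process, for example `seccomp_mode /proc/$(sudo podman inspect --format '{{.State.Pid}}' CONTAINER)/status`, where CONTAINER is a placeholder.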

Seccomp is powerful enough that it is built into container runtimes and management tools such as Docker and Kubernetes. So why isn’t Seccomp used everywhere? I think the reason is that there aren’t enough resources to bridge the gap between a low-level kernel feature like Seccomp and modern software development processes. Not every organization has a low-level developer with deep knowledge of syscalls. On top of that, someone has to spend time figuring out which system calls the program needs and update that list with every new feature added to the code.

While thinking about how to solve this problem, I had an idea: “What if we record the syscalls a program makes while it runs?” I pitched the idea to a colleague, and the next day he sent me a link to a tool on GitHub. It turned out that Red Hat developers had already created a tool called oci-seccomp-bpf-hook that does exactly that!

▍ Creating a Seccomp-BPF Filter

The oci-seccomp-bpf-hook tool was created to work with Linux containers. OCI stands for “Open Container Initiative”, a set of standards for container runtimes that defines which interfaces they must provide. OCI-compliant runtimes (such as Docker) support a mechanism called “hooks” that lets you run code before a container starts and after it stops. Rather than explaining how the Red Hat tool uses these hooks, it’s easier to show it with an example.

Red Hat developed oci-seccomp-bpf-hook for use with its podman container runtime. Podman’s command line is compatible with Docker’s in many ways, so if you’ve worked with Docker, the syntax in my examples will look familiar. Note that the OCI hook is currently available only from Red Hat’s associated DNF repositories, or it can be built from source. To keep the demo simple, I’ll just use a Fedora server (if you don’t have a Fedora environment, I recommend running a Fedora virtual machine in something like VirtualBox or VMware).

First, to use oci-seccomp-bpf-hook, make sure it is installed along with podman. You can do this with the following command:

sudo dnf install podman oci-seccomp-bpf-hook

Now that we have podman and the OCI hook, we can finally start generating a Seccomp-BPF filter. According to the README, the syntax looks like this:

sudo podman run --annotation io.containers.trace-syscall="if:[absolute path to the input file];of:[absolute path to the output file]" IMAGE COMMAND
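The annotation value packs two paths into one string, which makes it easy to get wrong by hand. The hypothetical helper trace_syscall_annotation below is a hedged sketch that assembles the value from the README syntax. It checks that the paths are absolute and treats the input file as optional, since the ls example that follows sets only an output file.

```shell
#!/bin/sh
# Hypothetical helper: build the io.containers.trace-syscall annotation
# following the README syntax "if:<input>;of:<output>".
# Usage: trace_syscall_annotation OUTPUT_FILE [INPUT_FILE]
trace_syscall_annotation() {
  local output="$1" input="${2:-}"
  # Both paths must be absolute, so reject anything not starting with "/".
  case "$output" in
    /*) ;;
    *) echo "output path must be absolute: $output" >&2; return 1 ;;
  esac
  if [ -n "$input" ]; then
    case "$input" in
      /*) ;;
      *) echo "input path must be absolute: $input" >&2; return 1 ;;
    esac
    printf 'io.containers.trace-syscall=if:%s;of:%s\n' "$input" "$output"
  else
    # No input file: only record to the output file.
    printf 'io.containers.trace-syscall=of:%s\n' "$output"
  fi
}

# Example (requires podman and the hook); quote the substitution so the
# value stays a single argument:
#   sudo podman run --annotation "$(trace_syscall_annotation /tmp/ls.json)" \
#     fedora:35 ls / > /dev/null
```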

Let’s run the ls command in a simple container and redirect its output to /dev/null. While doing so, we’ll record the system calls made by ls and save them to the file /tmp/ls.json.

sudo podman run --annotation io.containers.trace-syscall=of:/tmp/ls.json fedora:35 ls / > /dev/null

Since we redirect the output of ls to /dev/null, nothing should appear in the terminal. However, once the command has finished, we can look at the file where the system calls were saved. It shows that the command worked and its syscalls were recorded:

cat /tmp/ls.json
{"defaultAction":"SCMP_ACT_ERRNO","architectures":["SCMP_ARCH_X86_64"],"syscalls":[{"names":["access","arch_prctl","brk","capset","chdir","close","close_range","dup2","execve","exit_group","fchdir","fchown","fstatfs","getdents64","getegid","geteuid","getgid","getrandom","getuid","ioctl","lseek","mmap","mount","mprotect","munmap","newfstatat","openat","openat2","pivot_root","prctl","pread64","prlimit64","pselect6","read","rt_sigaction","rt_sigprocmask","seccomp","set_robust_list","set_tid_address","sethostname","setresgid","setresuid","setsid","statfs","statx","umask","umount2","write"],"action":"SCMP_ACT_ALLOW","args":[],"comment":"","includes":{},"excludes":{}}]}
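A profile in this compact form is hard to read. The following is a minimal sketch for listing the allowed syscalls using only grep, sed, tr and sort. The function name profile_syscalls is hypothetical. The parsing relies on the single-line layout shown above rather than on real JSON parsing, and it assumes every "names" array is an allow rule, as in this profile.

```shell
#!/bin/sh
# List the syscalls a Seccomp profile allows, one per line, sorted.
# Sketch only: assumes the compact format oci-seccomp-bpf-hook writes,
# where allowed syscalls appear in "names":[...] arrays. Use a real JSON
# parser such as jq for hand-edited or pretty-printed profiles.
profile_syscalls() {
  # 1. Extract every "names":[...] array.
  # 2. Strip the "names":[ prefix and the closing bracket.
  # 3. Put one quoted name per line, then drop the quotes and empty lines.
  # 4. Sort with the C locale so the order is stable for tools like comm.
  grep -o '"names":\[[^]]*]' "$1" |
    sed -e 's/^"names":\[//' -e 's/]$//' |
    tr ',' '\n' | tr -d '"' | grep -v '^$' |
    LC_ALL=C sort -u
}

# Example: count the syscalls the ls profile allows.
#   profile_syscalls /tmp/ls.json | wc -l
```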

This file is our Seccomp filter, and we can now use it with any runtime that supports it. Let’s apply the filter to the same containerized ls command we just ran:

sudo podman run --security-opt seccomp=/tmp/ls.json fedora ls / > /dev/null

There is no output and no errors, which means the command ran successfully with the Seccomp filter applied. Now the fun begins. Let’s make the container do something it didn’t do when we recorded the system calls for the filter. Adding the -l flag to ls is enough:

sudo podman run --security-opt seccomp=/tmp/ls.json fedora ls -l / > /dev/null
ls: /: Operation not permitted
ls: /proc: Operation not permitted
ls: /root: Operation not permitted
…

This time we get a series of “Operation not permitted” errors: the command is trying to do things the filter does not allow. Adding the -l flag to ls made the process use several new syscalls that are not on the Seccomp filter’s allowlist. If we generate a new Seccomp filter while running ls -l, we’ll see that the new filter works, because it now contains all the required syscalls.

sudo podman run --annotation io.containers.trace-syscall=of:/tmp/lsl.json fedora ls -l / > /dev/null
sudo podman run --security-opt seccomp=/tmp/lsl.json fedora ls -l / > /dev/null
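To see exactly which syscalls ls -l added, you can compare the two recorded profiles. The sketch below reuses the same hypothetical grep-based parser and lets comm print the names that appear only in the newer profile. added_syscalls is an illustrative name, not part of oci-seccomp-bpf-hook.

```shell
#!/bin/sh
# Show which syscalls a new Seccomp profile allows that an older one does not.
# Same compact-format assumption as before: every "names":[...] array in
# the profile is an allow rule, all on the line the hook writes.

profile_syscalls() {
  grep -o '"names":\[[^]]*]' "$1" |
    sed -e 's/^"names":\[//' -e 's/]$//' |
    tr ',' '\n' | tr -d '"' | grep -v '^$' |
    LC_ALL=C sort -u
}

# added_syscalls OLD NEW: print syscalls present in NEW but not in OLD.
added_syscalls() {
  local old_list new_list
  old_list=$(mktemp) || return 1
  new_list=$(mktemp) || { rm -f "$old_list"; return 1; }
  profile_syscalls "$1" > "$old_list"
  profile_syscalls "$2" > "$new_list"
  # comm -13 suppresses lines unique to OLD and lines common to both,
  # leaving only lines unique to NEW. Both files are C-locale sorted.
  LC_ALL=C comm -13 "$old_list" "$new_list"
  rm -f "$old_list" "$new_list"
}

# Example, using the two profiles recorded above:
#   added_syscalls /tmp/ls.json /tmp/lsl.json
```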

As you can see, applying Seccomp filters to containers strictly limits what they can do. If an attacker manages to exploit your application, the filter can keep them from doing damage, or even stop the exploit from working at all.

Thanks to Red Hat’s OCI hook, you no longer need to have deep knowledge of Linux kernel system calls to create a Seccomp filter. You can easily create an application-specific filter that prevents the container from doing anything beyond what it should be doing. This is a major step towards bridging the gap between kernel capabilities and high-level software development.

▍ In conclusion

As nice as oci-seccomp-bpf-hook is, on its own it did not fully realize my goal of integrating Seccomp into the established software development process. Running the tool still takes extra effort, and developers won’t be eager to spend time updating the Seccomp filter by hand for every application release. To close this gap for good and make Seccomp as easy as possible to use in enterprise applications, we need a way to automate the generation of Seccomp-BPF filters. Fortunately, the modern software development process already has the perfect place for this automation: Continuous Integration (CI).

CI has already become an integral part of the established software development life cycle. If you are unfamiliar with CI, it lets you run tasks such as automated unit tests and code security scans on every commit to a Git repository. There are many CI tools, and CI is an ideal stage at which to automate the generation of a Seccomp filter for a containerized application.
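My actual CI setup is the subject of the next post, so the following is only a hypothetical sketch of one possible gate. It assumes a CI job has already traced the application’s test run with the hook. The image name, test command and file names are placeholders. The gate fails the build if the traced profile needs syscalls that the profile committed to the repository does not allow.

```shell
#!/bin/sh
# Hypothetical CI gate: compare a freshly traced Seccomp profile with the
# committed profile and fail if new syscalls are needed.
# Assumes the compact oci-seccomp-bpf-hook output format (one line,
# every "names":[...] array is an allow rule).

# Print allowed syscall names from a profile, one per line, sorted.
profile_syscalls() {
  grep -o '"names":\[[^]]*]' "$1" |
    sed -e 's/^"names":\[//' -e 's/]$//' |
    tr ',' '\n' | tr -d '"' | grep -v '^$' |
    LC_ALL=C sort -u
}

# check_profile_drift COMMITTED TRACED
# Returns 0 if TRACED needs nothing beyond COMMITTED, 1 if it does,
# and 2 if either file cannot be read.
check_profile_drift() {
  if [ ! -r "$1" ] || [ ! -r "$2" ]; then
    echo "cannot read profiles: $1 $2" >&2
    return 2
  fi
  local committed missing
  committed=$(profile_syscalls "$1")
  # Keep every traced syscall that is not an exact line in the committed list.
  missing=$(profile_syscalls "$2" | while IFS= read -r name; do
    printf '%s\n' "$committed" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done)
  if [ -n "$missing" ]; then
    echo "Seccomp profile is missing syscalls:" >&2
    printf '%s\n' "$missing" | sed 's/^/  /' >&2
    return 1
  fi
  echo "Seccomp profile is up to date"
}

# Example CI step (placeholders: myapp:latest, ./run-tests.sh, file names):
#   sudo podman run --annotation "io.containers.trace-syscall=of:$PWD/traced.json" \
#     myapp:latest ./run-tests.sh
#   check_profile_drift seccomp.json traced.json
```

Failing the build, rather than silently overwriting the committed profile, means every new syscall gets a human review. Keep in mind that tracing only records syscalls from code paths that actually run, as the ls -l example showed. The traced run should therefore exercise the application thoroughly.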

I’ll write another post shortly demonstrating how to create a CI process that generates a Seccomp filter every time the code is updated. With this, you can finally take advantage of Seccomp’s syscall restrictions and secure your applications!
