Testing the performance of Docker clients for Mac

I recently published an article, OrbStack: Why I forgot about Docker Desktop, which sparked a lively discussion in the comments. Most of the questions concerned the performance of the various Docker-like solutions, and my arguments, based mostly on personal experience, turned out to be not convincing enough.

To get an objective picture and provide the community with real data, I decided to develop a comprehensive benchmark comparing the different solutions. While developing the tests, commenters suggested several interesting ideas that helped expand the list of engines being tested. As a result, the following engines took part in the testing:

  • Docker Desktop

  • Podman Desktop

  • Rancher Desktop

  • OrbStack

  • Colima

Host Configuration:

  • OS: macOS 15.0.1

  • Hardware: MacBook Pro 16

  • Processor: M1 Max (10 cores (8 performance and 2 efficiency))

  • RAM: 64 GB

We will measure:

  • Docker startup time (i.e., from clicking the application icon until it is up and ready to run containers)

  • Heavy build time – 2 different images

  • CPU and memory usage

  • Energy consumption (how I measure it and why is described below)

All the testing described below was carried out around the same time, with the same set of programs running.

> If you find an error in the scripts or the logic, please contact me in any convenient way. I will fix things and update the article where possible, or just the scripts if the article can no longer be edited.

Preparation

At first I wrote this part in as much detail as possible, but when the text exceeded 30 thousand characters I realized that nobody would read it, even with all the code hidden under a spoiler.

So if you want to study the scripts, find errors, or suggest improvements, head over to GitHub. It also guarantees that the scripts stay up to date, since they will definitely keep evolving.

Before we move on to the testing itself, I would like to talk about a few points.

First, I tried to write the scripts so that you can both run everything at once and, for example, test only the runtime for Docker Desktop. Each script has a --help flag that lists its options, so everything is as flexible as possible.

Second, I would like to explain how "energy consumption" is calculated and why it is needed. I wanted to measure how much energy each application spends, inspired by the OrbStack benchmark, which says the following:

> After waiting for CPU usage to settle, we measured the long-term average power usage over 10 minutes by sampling the process group's estimated energy usage (in nJ) from the macOS kernel before and after the test, and converting it to power usage (in mW) using the elapsed time.

Put simply: they wait until the CPU load stabilizes, sample the estimated energy usage of the process group before and after the period, and convert the difference into average power. Simply checking how much battery is left after a run seemed to me a completely inaccurate metric.
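
For context, here is a minimal sketch of that conversion in shell form (the variable names and sample values are mine, not taken from the OrbStack scripts): the energy delta in nanojoules divided by the elapsed time gives nanowatts, and dividing by 1,000,000 turns that into milliwatts.

# Hedged sketch: convert an energy delta (nJ) over an interval into average power (mW).
# energy_start_nj, energy_end_nj and elapsed_s are hypothetical example values.
energy_start_nj=120000000000   # sampled before the test
energy_end_nj=900000000000     # sampled after the test
elapsed_s=600                  # 10-minute measurement window

# nJ / s = nW; divide by 1,000,000 to get mW
avg_power_mw=$(echo "($energy_end_nj - $energy_start_nj) / ($elapsed_s * 1000000)" | bc -l)
echo "Average power: ${avg_power_mw} mW"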

The closest analogue to this is the Energy Impact column in macOS Activity Monitor:

System monitoring

Most of the scripts came together quickly: detecting startup time, measuring CPU/RAM, and timing builds is generally easy. Power consumption, however, caused so many difficulties that instead of the estimated 4 hours I spent several days on it. After a huge number of iterations and a lot of googling, I found an article.

And with this command:

top -stats pid,command,power -o power -l 0 | grep 'Docker Desktop'

I even started getting results:

21930  Docker Desktop   0.4
43853  Docker Desktop H 0.0
21955  Docker Desktop H 0.0
21953  Docker Desktop H 0.0
43853  Docker Desktop H 1.1
21930  Docker Desktop   0.3
21953  Docker Desktop H 0.1

But these values looked very unrealistic: with just the databases running, half a day of use drains the battery completely, yet here the numbers were tiny fractions. It also wasn't clear whether I should sum them or average them (the average being zero, of course). I struggled a bit more and decided to go back to parsing the output of:

powermetrics -i 1000 --poweravg 1 | grep 'Average cumulatively decayed power score' -A 20

I rewrote the script from bash to Python to make parsing easier. To be honest, I'm still not sure I trust this data. The first pancake always comes out lumpy, but at least it works, and I got the following numbers:

Initial power: 1035.0mW
Final power: 1295.0mW
Average power: 1307.7mW

Then I added processing to estimate how much our processes actually consume: I take the total CPU load and the CPU usage of our processes, and from that compute their share of the total load (and hence of the total power, as sketched below). In parallel monitoring I saw that this CPU share was 16-20%, so after the modification the result looked realistic:

Initial power: 7.2mW
Final power: 706.5mW
Average power: 149.1mW
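
My understanding of that attribution, as a rough sketch (the values and variable names here are illustrative, not taken from the real script): scale the average total power reported by powermetrics by the share of the total CPU load that the engine's processes account for.

# Hedged sketch: attribute a share of total power to the engine's processes.
total_power_mw=1307.7     # average total power reported by powermetrics
total_cpu_pct=38.0        # overall CPU load during the window (illustrative)
engine_cpu_pct=6.8        # CPU usage of the engine's processes (illustrative)

# engine share of the total load, then its share of the total power
engine_share=$(echo "$engine_cpu_pct / $total_cpu_pct" | bc -l)
engine_power_mw=$(echo "$total_power_mw * $engine_share" | bc -l)
echo "Estimated engine power: ${engine_power_mw} mW"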

Then, happy, I launched the overall testing script, but I rejoiced too early. Even though I had tested each script individually, the combined run produced a pile of errors, and it took a lot of extra work to iron them out. But in the end everything worked.

Dockerfiles

I wanted to run the builds on really heavy images, but without going overboard: 10-15 minutes per build is fine, but I definitely didn't want to wait an hour, especially while debugging the script.

So, first of all, I made a simple service for debugging the bash script.

# test-builds/simple/Dockerfile
FROM node:18-alpine

WORKDIR /app

COPY package*.json ./
RUN echo '{"name":"test","version":"1.0.0","dependencies":{"express":"^4.18.2"}}' > package.json

RUN npm install

RUN mkdir src
COPY . .

RUN npm install -g typescript && \
    echo '{"compilerOptions":{"target":"es6","outDir":"dist"}}' > tsconfig.json && \
    mkdir -p src && \
    echo 'const greeting: string = "Hello World"; console.log(greeting);' > src/main.ts && \
    tsc

CMD ["node", "dist/main.js"]

Coming up with or finding a heavy image turned out to be harder. OrbStack uses PostHog and Open edX in its benchmarks, but I couldn't get them to build properly at all, so I decided to make my own. The second service is a Java Spring application with many dependencies:

# test-builds/java/Dockerfile
# Multi-stage build for Spring Boot application
FROM maven:3.8.4-openjdk-17 AS builder

WORKDIR /app

RUN mkdir -p src/main/java/com/example/demo

# Copy configuration files
COPY pom.xml .
COPY DemoApplication.java src/main/java/com/example/demo/

# Download dependencies and build
RUN mvn dependency:go-offline
RUN mvn package -DskipTests

# Final stage
FROM openjdk:17-slim
COPY --from=builder /app/target/*.jar app.jar
ENTRYPOINT ["java","-jar","/app.jar"]

You need to create two more files. The first is pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.1.0</version>
        <relativePath/>
    </parent>
    <groupId>com.example</groupId>
    <artifactId>demo</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>demo</name>
    <description>Demo Spring Boot Application</description>
    <properties>
        <java.version>17</java.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-security</artifactId>
        </dependency>
        <dependency>
            <groupId>org.postgresql</groupId>
            <artifactId>postgresql</artifactId>
            <scope>runtime</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>

The second file is DemoApplication.java:

package com.example.demo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}

The third service is a Python ML application with TensorFlow:

# test-builds/ml/Dockerfile
FROM python:3.9 as builder

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    software-properties-common \
    git \
    && rm -rf /var/lib/apt/lists/*

# Create requirements.txt
RUN echo 'tensorflow==2.13.0' > requirements.txt && \
    echo 'torch==2.0.1' >> requirements.txt && \
    echo 'transformers==4.31.0' >> requirements.txt && \
    echo 'scipy==1.11.2' >> requirements.txt && \
    echo 'scikit-learn==1.3.0' >> requirements.txt && \
    echo 'pandas==2.0.3' >> requirements.txt && \
    echo 'numpy==1.24.3' >> requirements.txt && \
    echo 'matplotlib==3.7.2' >> requirements.txt && \
    echo 'seaborn==0.12.2' >> requirements.txt

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Create sample ML application
COPY . .
RUN echo 'import tensorflow as tf' > app.py && \
    echo 'import torch' >> app.py && \
    echo 'import transformers' >> app.py && \
    echo 'from sklearn.ensemble import RandomForestClassifier' >> app.py && \
    echo 'import numpy as np' >> app.py && \
    echo 'import pandas as pd' >> app.py && \
    echo 'print("TensorFlow version:", tf.__version__)' >> app.py && \
    echo 'print("PyTorch version:", torch.__version__)' >> app.py && \
    echo 'print("Transformers version:", transformers.__version__)' >> app.py

FROM python:3.9-slim

WORKDIR /app

# Copy only necessary files from builder
COPY --from=builder /app/app.py .
COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages

CMD ["python", "app.py"]

Now nothing prevents us from running the tests.

Testing

To start testing, just run the script:

./engine-benchmark.sh -v all              # Run testing for all engines

# Or run only the engine you need
./engine-benchmark.sh -v podman-desktop

Results are collected for each test, and the summary is saved under the results directory. For example, docker-desktop_idle_resources.json:

{
    "engine": "docker-desktop",
    "timestamp": "2024-10-28T21:49:35.419319Z",
    "repeat_count": 3,
    "results": {
        "average": 10.817972977956137,
        "min": 3.1761507987976074,
        "max": 25.995931148529053,
        "all_times": [
            25.995931148529053,
            3.281836986541748,
            3.1761507987976074
        ]
    }
}

While the script is running, you periodically have to enter a password: for installing and uninstalling, and for the last test, since powermetrics only works with sudo rights. You also need to react to dialog windows from time to time, because some engines (Podman Desktop, for example) ask for permission to run. I could not fully automate the process so that it runs for two hours without any intervention, but overall it's not bad and can be improved later.

Results

To work with the results conveniently, I wrote a simple viewer: just go to ./graphics and open index.html.

To make sure one engine does not affect another, each engine is installed, the entire set of tests is run, the engine is removed, and only then does the script move on to the next one, as sketched below.
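
A rough sketch of that isolation loop (the helper function names are hypothetical; the real logic lives in engine-benchmark.sh):

# Hedged sketch of the per-engine isolation cycle
for engine in docker-desktop podman-desktop rancher-desktop orbstack colima; do
    install_engine "$engine"      # hypothetical helper: install the engine
    run_all_tests "$engine"       # startup, builds, idle resources, power
    uninstall_engine "$engine"    # remove it before moving on to the next one
done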

All test results and logs are uploaded to GitHub; you can open the graphics folder there and run index.html yourself.

Startup time

Each engine is checked 4 times. The very first launch has quirks everywhere. For example, Docker Desktop has to be launched manually the first time after installation, and it's unclear why: the script launches it, but the application does not start. Other applications also take noticeably longer to open the first time. Therefore, I decided to average only the last three launches.

Also on startup I run three commands to check that everything is working correctly:

if docker info >/dev/null 2>&1 && \
   docker ps >/dev/null 2>&1 && \
   docker run --rm hello-world >/dev/null 2>&1; then
    return 0
fi

At first I wanted to remove the line that runs a container, but then I realized that since the first launch is not counted anyway, it has no effect on the results, so I left it.
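
For clarity, here is a minimal sketch of how the startup timing can be wrapped around that readiness check (the timer and the check_engine_ready helper are my own names, not the actual script):

# Hedged sketch: measure the time from launching the app until the engine responds
start_ts=$(date +%s)
open -a "Docker"                    # launch the GUI application (name varies per engine)
while ! check_engine_ready; do      # hypothetical wrapper around the three checks above
    sleep 1
done
echo "Startup time: $(( $(date +%s) - start_ts )) seconds"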

Rancher Desktop is a separate story that I could not fix: after a failure, the first launch has to be done manually. I don't know why, but all subsequent runs start and stop correctly.

[2024-10-29 22:40:45] Testing startup time of rancher-desktop...
Testing rancher-desktop...
Attempt 1 of 4
Starting rancher-desktop...

On top of that, it only lets me launch it twice in a row: on the third attempt it reports 0 every time (I restarted it several times but could not fix this). Fortunately, its interface shows how long the virtual machine takes to start, typically 30-40 seconds. So I decided not to add anything manually here; the numbers are realistic either way.

Launch testing. Combined

The same data, displayed differently:

Launch testing. Divided

Here's something interesting: during testing, launching Podman took 0 seconds. After the first launch it really is that fast. The first run took 5 seconds (also the lowest of all), and the next three took 0. I ran the test for it separately and got broadly similar results:

Testing podman-desktop...
Attempt 1 of 4
Starting podman-desktop...
First launch skipped: 5.0 seconds
Stopping podman-desktop...
Attempt 2 of 4
Starting podman-desktop...
Startup time: 0 seconds
Stopping podman-desktop...
Attempt 3 of 4
Starting podman-desktop...
Startup time: 1.0 seconds
Stopping podman-desktop...
Attempt 4 of 4
Starting podman-desktop...
Startup time: 1.0 seconds
Stopping podman-desktop...
Results saved to results/startup/podman-desktop_startup.json
Testing completed. Results in directory results/startup

Build

The problematic one here is Colima. It complained about docker-compose, about the lack of buildx (it can build without it, but warns that this path is deprecated), and about docker itself. In other words, on top of Colima you need to install these tools. I decided not to install buildx, because it would require a lot of extra logic in the tests, and it's obvious that it would be faster. After all the modifications Colima simply refused to work, for no clear reason; I woke up in the morning and everything worked. A very strange story, but after that everything functioned normally. Then it started complaining about docker compose again, and I had to add a condition for Colima: run via docker-compose rather than docker compose (see the sketch below). That helped.
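
The workaround boils down to choosing the compose invocation per engine, roughly like this (a sketch, not the exact code from the repo):

# Hedged sketch: Colima's Docker CLI lacked the compose plugin, so fall back to the standalone binary
if [ "$ENGINE" = "colima" ]; then
    COMPOSE_CMD="docker-compose"
else
    COMPOSE_CMD="docker compose"
fi
$COMPOSE_CMD up -d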

Result:

Build results

Performance

At first I wanted to run these tests for 10 minutes each, but with two tests per engine the overall script already takes about 2 hours. After many runs during debugging I realized that a longer window most likely would not change the picture much; the ordering stays the same. So here are the results of one-minute runs; perhaps in the future I will retest with a longer duration.

CPU

RAM

Energy consumption

I was pleasantly surprised by Docker Desktop in terms of energy consumption. Either I did the test incorrectly, or they have made a lot of progress since last year, when OrbStack ran its benchmark.

Conclusions

The technical side gave me a lot of trouble: almost everything is fragile and there are always nuances, even though, on paper, every engine except Podman is fully command-compatible with Docker. But I definitely enjoyed the process and got to know all the products very closely along the way.

The end result is a working implementation, though it occasionally needs help: entering a password, opening an application on its first launch. In the long run it can be polished to be more convenient and stable (I hope the community will help), and the test set can be expanded, for example with I/O benchmarks.

Most importantly, the tool produces data that, in my opinion, is quite realistic, and this benchmark lets you draw your own conclusions. I have decided to stay on OrbStack, while keeping a closer eye on Podman Desktop. What you choose is entirely up to you.

P.S. When reproducing the tests, keep the difference in environments in mind. I expect the numbers will differ, but the ordering should stay the same.

P.P.S. For those who scrolled straight to the end: link to GitHub.
