Experience porting FreeBSD to Firecracker

A lot of fantastically cool open source software comes out of a developer’s urge to try something out. That is exactly what happened with Firecracker. In 2014, Amazon launched AWS Lambda, which was positioned as a “serverless” computing platform. With AWS Lambda, a user can define a function (say, ten lines of Python code) and Lambda takes care of building all the infrastructure needed to make the chain work: an HTTP request arrives, the function is invoked, the request is processed, and finally a response is generated.

For this service to work safely and efficiently, Amazon needed a mechanism for launching virtual machines with minimal overhead. This is how Firecracker was born. It is a virtual machine monitor (VMM) that works in conjunction with Linux KVM to create and manage “microVMs”.

Why run FreeBSD on Firecracker?

In June 2022, I started porting FreeBSD to run on Firecracker. Several things made the project interesting to me.

First, I had been working hard on speeding up the FreeBSD boot process and wanted to see what is achievable with a minimal hypervisor.

Second, porting FreeBSD to a new platform always helps to uncover bugs, both in FreeBSD and in the platform.

Third, AWS Lambda currently supports only Linux. I have always wanted to expand the availability of FreeBSD on AWS (I can’t influence what happens on the Lambda side, but Firecracker support is a necessary precondition in any case).

But above all, I wanted to try it simply because Firecracker exists. It’s an interesting platform, and I wondered whether I could get it to work.

Starting the FreeBSD Kernel

Although Firecracker was designed around Lambda’s need to run Linux kernels, patches had been available since 2020 that added support for the PVH boot mode in addition to “linuxboot”. FreeBSD supports the PVH boot mode under Xen, so I decided to try that option first.

This is where I ran into the first problem: Firecracker could load the FreeBSD kernel into memory, but it could not find the address (the “kernel entry point”) at which to start executing the kernel code. According to the PVH boot protocol, this value is specified in an “ELF note”, a special metadata record stored in ELF (Executable and Linkable Format) files. It turns out that there are two kinds of ELF notes, PT_NOTE and SHT_NOTE, and FreeBSD did not provide the one Firecracker was looking for. A small change to the FreeBSD kernel linker script was enough to fix this, after which Firecracker was able to start running the FreeBSD kernel.

Everything went smoothly for about 1 microsecond.

Early debugging

FreeBSD has some great debugging functionality built in. But if the kernel crashes before the debugger is initialized, or before the serial console is ready, that functionality is of little use. In this case, the Firecracker process exited, from which I concluded that the FreeBSD guest had hit a triple fault, but that was all I could find out.

It turned out, however, that with a little ingenuity this information was enough to work with. If the FreeBSD kernel executes a hlt instruction, the Firecracker process keeps running but uses 0% of the host’s CPU (because it is virtualizing a halted CPU). So by inserting a hlt instruction I could distinguish between “FreeBSD crashed before this point” and “FreeBSD had not crashed by this point”: if the Firecracker process exited, the kernel had crashed before reaching the instruction. This gave me a kind of “kernel bisection”. There is a well-known technique in which a list of commits is searched by binary search to find the commit that introduced a bug (as with the git bisect command). The same binary search can be applied to the kernel startup code to find the piece of code that crashes FreeBSD.
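The search procedure can be sketched roughly as follows. This is a toy model, not the author's tooling: the function name and the survives_until callback are hypothetical, and each call to the callback stands for one manual experiment (insert hlt before a given startup step, boot, and observe whether the Firecracker process keeps running).

```python
def bisect_first_crash(num_steps, survives_until):
    """Return the index of the startup step that crashes the kernel.

    survives_until(k) models one experiment: put a hlt just before step k
    and boot.  It returns True if the VMM keeps running (the kernel reached
    the hlt) and False if the VMM exits (the kernel crashed in a step < k).
    Assumes exactly one of the steps 0..num_steps-1 crashes.
    """
    lo, hi = 0, num_steps          # kernel reaches step lo, never reaches hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if survives_until(mid):
            lo = mid               # crash is at step mid or later
        else:
            hi = mid               # crash is in a step before mid
    return lo
```

With around a thousand candidate locations, this needs about ten boots rather than a thousand, which is what makes the approach practical when each experiment is done by hand.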

Xen hypercalls

This approach first led me to Xen hypercalls. The PVH boot mode originated as the Xen/PVH boot mode, and FreeBSD’s PVH entry point was in fact written specifically for booting under Xen. The code therefore quite naturally assumed that it was running under Xen and could make Xen hypercalls. KVM (which provides the kernel functionality used by Firecracker) is of course not Xen, so it does not have these hypercalls, and any attempt to use one crashed the virtual machine. At first I tried a simple workaround: I commented out all the Xen hypercalls. Later, I added code that checks CPUID for the Xen signature before making such calls, for example before printing debug output to the Xen debug console.

One of the Xen hypercalls, however, provided a key piece of functionality: it retrieved the physical memory map. (Inside a hypervisor, of course, “physical” memory is only virtually physical. It’s turtles all the way down.) What saved me here is that Xen/PVH was retroactively declared to be version 0 of the PVH boot mode. In version 1 and above, a pointer to the memory map is passed in the PVH start_info page (and a pointer to that page is provided in a register when the vCPU starts running). I had to write code that uses the PVH version 1 memory map instead of fetching the map via a Xen hypercall, but that was easy enough.

Another, related problem stemmed from how Xen and Firecracker lay out structures in memory. Xen loads the kernel first and then places the start_info page after it, whereas Firecracker places the start_info page at a fixed low address and loads the kernel after it. That should work fine, but FreeBSD’s PVH code, which was written for Xen, assumed that the memory immediately after the start_info page was free and could be used as scratch space. Under Firecracker, this quickly led to overwriting the initial kernel stack, which is far from optimal! I fixed this with a small change to FreeBSD’s PVH code: it now allocates its scratch space after all of the memory regions initialized by the hypervisor.
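To make the version 1 mechanism concrete, here is a small sketch that reads a memory map out of a start_info page. The field layout and magic value follow my reading of Xen's public start_info.h header and should be treated as an assumption to verify against that header; read_memmap and the bytearray standing in for guest physical memory are purely illustrative.

```python
import struct

HVM_START_MAGIC = 0x336EC578      # assumed value of the start_info magic
# magic, version, flags, nr_modules, modlist_paddr, cmdline_paddr,
# rsdp_paddr, memmap_paddr, memmap_entries, reserved
START_INFO_FMT = "<IIIIQQQQII"
# addr, size, type, reserved
MEMMAP_ENTRY_FMT = "<QQII"


def read_memmap(mem, start_info_paddr):
    """Return [(addr, size, type), ...] from a PVH start_info page.

    mem is a bytes-like object modelling guest physical memory.
    """
    fields = struct.unpack_from(START_INFO_FMT, mem, start_info_paddr)
    magic, version = fields[0], fields[1]
    memmap_paddr, memmap_entries = fields[7], fields[8]
    if magic != HVM_START_MAGIC:
        raise ValueError("not a PVH start_info page")
    if version < 1:
        raise ValueError("memory map is only present in version 1 and above")
    entry_size = struct.calcsize(MEMMAP_ENTRY_FMT)
    entries = []
    for i in range(memmap_entries):
        entry = struct.unpack_from(MEMMAP_ENTRY_FMT, mem,
                                   memmap_paddr + i * entry_size)
        entries.append(entry[:3])  # drop the reserved field
    return entries
```

The version check mirrors the point made above: a version 0 (Xen-only) start_info page carries no memory map, so the caller would have to fall back to the hypercall.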

With ACPI – or without it!

On x86 platforms, a FreeBSD system typically uses ACPI (Advanced Configuration and Power Interface) to learn what hardware it is running on (and sometimes to control that hardware). Besides discovering things commonly thought of as “devices” (disks, network adapters, and so on), FreeBSD also learns through ACPI about such fundamental components as CPUs and interrupt controllers.

Firecracker is purposefully minimalistic and therefore does not implement ACPI. But FreeBSD is uncomfortable when it cannot determine how many processor cores it has at its disposal, or where the interrupt controllers associated with them are located.

Fortunately, FreeBSD supports the historical Intel MultiProcessor Specification, which conveys this critical information through the “MPTable” data structure. Support for it is not included in the GENERIC kernel configuration, but Firecracker needed a stripped-down kernel configuration anyway, so it was easy to add device mptable and use the information Firecracker provides.

Except… it didn’t work. FreeBSD was still unable to find the required information. It turned out that Linux has bugs in how it locates and parses the MPTable structure. Firecracker, being designed to boot Linux, provides the MPTable in a form that is compatible with Linux but does not comply with the standard. FreeBSD has its own implementation, which follows the standard strictly. As a result, it not only failed to find the incorrectly located MPTable, but also failed to parse the MPTable once found, because it contained errors.

So FreeBSD now has a new kernel option: you can add options MPTABLE_LINUX_BUG_COMPAT to the kernel configuration if you need bug-for-bug compatibility with Linux’s MPTable handling. With these changes, FreeBSD got a little further in booting under Firecracker.

Serial console

The serial port is one of the few emulated devices provided by Firecracker; it is emulated rather than virtualized, unlike the Virtio block and network devices. In a typical configuration, the standard input and output of the Firecracker process become the input and output of the virtual machine’s serial port. The guest OS then looks like just another process running under your shell (which, in a sense, it is). At least, that’s how it should work.

By this point in my attempts to bring FreeBSD up inside Firecracker, I could boot the FreeBSD kernel with a root disk compiled into the kernel image. I did not yet have a working virtualized disk driver, but I could already read the kernel’s output on the console. After all the kernel output had appeared, FreeBSD entered the userland portion of the boot process; I then received 16 more characters of console output, and everything stopped.

Amusingly, I had seen exactly the same symptoms more than ten years earlier, when I first tried to get FreeBSD working on EC2 instances. Back then the cause was a bug in QEMU: the UART did not raise an interrupt when the transmit FIFO became empty. FreeBSD would write 16 bytes to the UART and then write nothing more, waiting for an interrupt that never came. Modern EC2 instances run on Amazon’s Nitro platform, but in the early days EC2 used Xen, with QEMU code emulating the devices. At some point in the intervening ten years the bug was fixed in QEMU, but exactly the same bug made its way into Firecracker’s implementation. I was lucky, though: the workaround I had once added to the FreeBSD kernel, hw.broken_txfifo="1", was still available. Setting this loader tunable solved the console output problem (since Firecracker loads the kernel directly, bypassing the boot loader, the value had to be compiled into the kernel as an environment variable).

Then I discovered that console input was also broken: FreeBSD did not respond to anything I typed. In fact, a trace of the Firecracker process showed that Firecracker itself was not even reading from the console, because it believed that the receive FIFO in the emulated UART was full. This was another Firecracker bug: during initialization, FreeBSD’s UART driver fills the receive FIFO with garbage (to measure its size) and then resets the FIFO by writing to the FIFO control register. Firecracker does not implement the FIFO control register, so the emulated UART was left with a full FIFO, and naturally it did not try to read any more characters to add to it. Here I added another workaround to FreeBSD: if LSR_RXRDY is still asserted after we have tried to reset the FIFO through the control register (meaning that the FIFO was not emptied despite the command), we read and discard characters one at a time until the FIFO is empty. With this change, Firecracker recognized that FreeBSD was ready to accept more input from the serial port, and we had a two-way serial console.
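The receive-side workaround can be illustrated with a toy model. Nothing below is FreeBSD or Firecracker code: FakeUart and flush_rx_fifo are hypothetical names, and the register bit values are the conventional 16550 ones (LSR bit 0 for "data ready", FCR bit 1 for "clear receive FIFO"), stated here as an assumption.

```python
LSR_RXRDY = 0x01        # line status register: received data is available
FCR_RCV_RESET = 0x02    # FIFO control register: clear the receive FIFO


class FakeUart:
    """Toy UART whose FIFO control register is silently ignored."""

    def __init__(self, pending):
        self.rx_fifo = list(pending)

    def read_lsr(self):
        return LSR_RXRDY if self.rx_fifo else 0

    def read_data(self):
        return self.rx_fifo.pop(0)

    def write_fcr(self, value):
        pass  # unimplemented register: the reset request has no effect


def flush_rx_fifo(uart, limit=256):
    """Reset the receive FIFO, falling back to draining it by hand.

    Returns the number of characters that had to be read and discarded.
    """
    uart.write_fcr(FCR_RCV_RESET)
    drained = 0
    # If data is still reported as ready, the reset did not take effect.
    while uart.read_lsr() & LSR_RXRDY and drained < limit:
        uart.read_data()
        drained += 1
    return drained
```

The limit parameter is a defensive bound added for the sketch so that a device which never clears the ready bit cannot hang the loop.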

Virtio Devices

A system without disks or a network can still be useful, but if we want to do anything significant with FreeBSD, we cannot do without these devices. Firecracker supports Virtio block and network devices and exposes them to virtual machines as mmio (memory-mapped I/O) devices. The first step in making them work under FreeBSD was to add device virtio_mmio to the Firecracker kernel configuration.

Next, we need to tell FreeBSD how to find the virtualized devices. FreeBSD expects mmio devices to be discovered via FDT (Flattened Device Tree), a mechanism often used in embedded systems. Firecracker, however, passes device parameters on the kernel command line using directives of the form virtio_mmio.device=4K@0x1001e000:5. So the second step was to write code that parses such directives and creates virtio_mmio device nodes. (Once a device node has been created, the standard FreeBSD device probing process kicks in; the kernel automatically detects the Virtio device type and attaches the appropriate driver.)

With several devices (say, a disk and a network adapter), another problem arises. Firecracker passes directives the Linux way, that is, as a sequence of key/value pairs on the kernel command line. FreeBSD, in turn, parses the kernel command line into environment variables, so if two virtio_mmio.device= directives are passed, only one of them is preserved. To fix this, I rewrote the old parsing code in the kernel so that duplicate variables are handled by attaching a numeric suffix to them. As a result, we get virtio_mmio.device= for the first device and virtio_mmio.device_1= for the second.

After all this, I finally got FreeBSD to boot successfully and discover all of my devices. But there was another problem with disk devices: if the virtual machine was not shut down cleanly, then on the next boot fsck ran on the file system and the kernel panicked. It turns out that fsck is one of the few things in FreeBSD that issues unaligned disk I/O, and FreeBSD’s Virtio block driver panicked when trying to pass such unaligned I/O to Firecracker.

When an I/O operation crosses a boundary between pages of memory, which is exactly what happens with I/O that is not page-aligned, the physical I/O segments are usually not contiguous. Most Virtio implementations accept I/O requests that specify a sequence of segments to be accessed. Firecracker, being extremely minimalistic, is an exception: it accepts only a single data buffer, so a buffer that crosses a page boundary cannot simply be split into segments as it would be for other Virtio implementations. Fortunately, FreeBSD provides a system, busdma, designed for handling exactly this kind of device complication.
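As an illustration of what parsing such a directive involves, here is a minimal sketch. It assumes the value has the form size@base:irq with an optional K, M or G size suffix, as in the example above; parse_virtio_mmio is a hypothetical helper, not the function in the FreeBSD kernel.

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
_DIRECTIVE = re.compile(r"^(\d+)([KMG]?)@(0x[0-9a-fA-F]+|\d+):(\d+)$")


def parse_virtio_mmio(value):
    """Parse '4K@0x1001e000:5' into (size_in_bytes, base_address, irq)."""
    match = _DIRECTIVE.match(value)
    if match is None:
        raise ValueError(f"malformed virtio_mmio.device value: {value!r}")
    size = int(match.group(1)) * _UNITS[match.group(2)]
    base = int(match.group(3), 0)   # base 0 accepts both hex and decimal
    irq = int(match.group(4))
    return size, base, irq
```

For the directive quoted above this yields a 4096-byte register window at 0x1001e000 using interrupt 5, which is the information a device node needs before probing can start.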
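The suffixing scheme can be sketched as follows. This is an illustrative model only: cmdline_to_env is a hypothetical name, and the second device address in the usage example is made up for the demonstration.

```python
def cmdline_to_env(cmdline):
    """Turn a kernel command line into environment variables.

    The first occurrence of a key keeps its name; later occurrences get
    the suffixes _1, _2, ... so that no value is overwritten.
    """
    env = {}
    seen = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")   # bare flags get value ""
        count = seen.get(key, 0)
        seen[key] = count + 1
        name = key if count == 0 else f"{key}_{count}"
        env[name] = value
    return env
```

A consumer that wants every device then looks up virtio_mmio.device, virtio_mmio.device_1, and so on until a name is missing:

```python
env = cmdline_to_env(
    "virtio_mmio.device=4K@0x1001e000:5 virtio_mmio.device=4K@0x1001f000:6"
)
print(sorted(env))
```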

This was perhaps the hardest part of getting FreeBSD to work with Firecracker, but after several attempts I finally got it working: I modified FreeBSD’s Virtio block driver to use busdma. Unaligned requests are now “bounced” (that is, copied through a temporary buffer), which makes it possible to cope with the limitations of Firecracker’s Virtio implementation.
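The condition that forces a request through a temporary buffer can be shown with a simplified check. This sketch works on plain addresses and a 4 KB page size; the real busdma logic operates on physical segments and device constraints, so treat crosses_page as a hypothetical stand-in for illustration.

```python
PAGE_SIZE = 4096


def crosses_page(addr, length):
    """Return True if the buffer [addr, addr + length) spans two or more pages."""
    if length <= 0:
        return False
    first_page = addr // PAGE_SIZE
    last_page = (addr + length - 1) // PAGE_SIZE
    return first_page != last_page
```

A page-aligned 4 KB request stays within one page and can be handed to the device as a single buffer, while the same request starting in the middle of a page spans two pages, which may not be physically adjacent, and therefore has to be copied.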

Optimization opportunities

Once I had FreeBSD working under Firecracker, it immediately became clear that more work was needed. Almost at once I noticed that, although the virtual machine I was testing had 128 MB of RAM, the system could barely function, and processes were being killed every now and then because the system had run completely out of memory. top(1) showed that almost half of the system’s memory was in the wired state, which looked strange. After digging a little further, I found that busdma was reserving 32 MB of memory for bounce pages (the temporary buffers through which unaligned requests are copied). That is far more than necessary: given Firecracker’s limitations and the fact that bounce pages are generally not allocated contiguously, each disk I/O operation should use no more than one bounce page, that is, 4 KB. With a patch to busdma I reduced this memory usage to 512 KB by limiting the bounce page reservation for devices that support only a small number of I/O segments.

The random number generator in the FreeBSD kernel usually gets its entropy from physical devices, but in a virtual machine these are not a very reliable source. On x86 systems, the RDRAND instruction, which obtains random values from the CPU, is used as a backup entropy source. However, each request yielded very little entropy, and entropy was requested only once every 100 ms. When I changed the entropy collection system so that each request could gather enough entropy to fully seed the Fortuna random number generator, boot time dropped by another 2.3 seconds.

  • On first boot, FreeBSD records the ID of the host it is running on. Normally this is taken from the hardware via the smbios.system.uuid environment variable, which the boot loader sets based on information from the BIOS or UEFI. But there is no boot loader under Firecracker and, accordingly, no ID is provided. There was a fallback that generates a random ID in software on systems without a valid hardware ID, but it also printed a warning and waited 2 seconds for the user to read it. I changed this code so that the warning and the two-second wait apply only if the hardware provides an invalid ID; if the hardware provides no ID at all, boot proceeds silently and without delay.

  • The IPv6 protocol requires a system to wait for Duplicate Address Detection (DAD) to complete before using an IPv6 address. In rc.d/netif, after bringing up the interfaces, we waited for IPv6 DAD only if IPv6 was enabled on any of our network interfaces. There was just one problem: IPv6 is always enabled on the loopback interface! I changed the logic to wait for DAD only if IPv6 is enabled on an interface other than the loopback. This sped up boot by another 2 seconds. (If another system is trying to use the same IPv6 address as our lo0, we have more serious problems than a simple address collision!)

  • When FreeBSD rebooted, it printed a message (“Rebooting…”) and then waited 1 second “until printf completes and the result is read.” There seemed to be little benefit in this, since it is usually immediately obvious that the system is rebooting. There is now a kern.reboot_wait_time sysctl, which is zero by default.

  • When shutting down or rebooting, FreeBSD’s BSP (CPU #0) waits for the other cores to signal that they have stopped… and then waited another 1 second to make sure they had a chance to stop. I got rid of that extra second of waiting.
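The change to the DAD wait condition described in the list above amounts to a one-line predicate. The sketch below models it with a hypothetical data structure (a mapping from interface name to two boolean flags); the actual logic lives in the rc.d/netif shell script.

```python
def should_wait_for_dad(interfaces):
    """Decide whether boot should pause for IPv6 Duplicate Address Detection.

    interfaces maps an interface name to a dict with boolean 'ipv6' and
    'loopback' flags.  Loopback interfaces are ignored, since a duplicate
    address cannot meaningfully be detected there.
    """
    return any(
        flags["ipv6"] and not flags["loopback"]
        for flags in interfaces.values()
    )
```

With only lo0 carrying IPv6, the predicate is false and the wait is skipped; as soon as any other interface has IPv6 enabled, the wait is kept.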

Once the easy optimizations were exhausted, I turned to TSLOG and began examining flame charts of the boot process. Firecracker is great for this kind of work for two reasons. First, a minimalistic environment eliminates a lot of noise, so the signal is clearly visible. Second, because virtual machines start extremely quickly under Firecracker, changes to the FreeBSD kernel can be tested quickly too: often I could build a new kernel, boot it, and generate a new flame chart in under 30 seconds.

This TSLOG-based investigation led me to a number of optimizations:

  • In lapic_init there was a loop of 100,000 iterations used to calibrate how long lapic_read_icr_lo takes to execute. By reducing this loop to 1000 iterations, I gained another 10 ms.

  • ns8250_drain called DELAY after reading each character. I changed it to check LSR_RXRDY and delay only if there is nothing to read at the moment. This sped up boot by another 27 ms.

  • FreeBSD uses a CPUID leaf that most hypervisors use to advertise the clock frequencies of the TSC and the local APIC. Firecracker, unlike VMware, QEMU and EC2, did not implement it. By adding support for this CPUID leaf to Firecracker, I cut another 20 ms from FreeBSD’s boot time.

  • FreeBSD set kern.nswbuf (which controls the number of buffers allocated for various temporary purposes) to 256, regardless of system size. By setting it to 32 * mp_ncpus instead, I reduced boot time by 5 ms when running on a small virtual machine (1 core).

  • FreeBSD’s mi_startup function, which launches the machine-independent system initialization routines, used bubble sort to determine the order in which the routines are called. That was reasonable in the 90s (when there were relatively few routines to order), but today there are more than 1000 of them, and bubble sort is too slow. I changed it to quicksort and saved another 2 ms.

  • FreeBSD’s vm_mem initialization routine prepares vm_page structures for all available physical memory. Even on a relatively small virtual machine with 128 MB of RAM, this meant initializing 32768 such structures, which took several milliseconds. I changed the code to initialize vm_page structures lazily, as memory is allocated. This saved another 2 ms.

  • Firecracker allocates memory for the guest VM using an anonymous mmap, but Linux does not set up paging structures for the guest’s entire address space. As a result, the first time each page was read, a fault occurred that took approximately 20,000 CPU cycles to handle (while Linux mapped the page into memory). I added the MAP_POPULATE flag to the mmap call in Firecracker, saving another 2 ms.
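The lazy-initialization idea from the vm_page item above can be illustrated with a toy model. It is not how the FreeBSD VM system is structured: LazyPageArray and its fields are hypothetical, and the sketch only shows the principle of deferring per-page setup until a page is first used.

```python
PAGE_SIZE = 4096


class LazyPageArray:
    """Per-page metadata that is initialized on first access, not at boot."""

    def __init__(self, mem_bytes):
        self.num_pages = mem_bytes // PAGE_SIZE
        self._pages = {}              # holds only pages touched so far

    def get(self, index):
        if not 0 <= index < self.num_pages:
            raise IndexError(index)
        page = self._pages.get(index)
        if page is None:
            # Initialization cost is paid here, once per page actually used.
            page = {"index": index, "wire_count": 0}
            self._pages[index] = page
        return page

    def initialized(self):
        return len(self._pages)
```

For 128 MB of RAM and 4 KB pages this model covers 32768 pages, matching the figure in the list, but a boot that touches only a few hundred of them pays for only those few hundred.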

The state of affairs

FreeBSD boots under Firecracker, and does so very quickly. With patches that have not yet been committed (to both FreeBSD and Firecracker), the FreeBSD kernel boots in less than 20 ms in a virtual machine with 1 CPU and 128 MB of RAM. Below is a flame chart of the boot process.

Fig. 1: Flame chart of the FreeBSD 14 boot process under Firecracker.

Much remains to be done. Not only do we need to get all of the above patches committed and PVH boot mode support merged into mainline Firecracker, but there is also a lot to clean up. Because the PVH boot mode historically comes from Xen, the code used for PVH boot is still mixed in with the Xen support code; separating the two would simplify things considerably. Likewise, it is currently not possible to build a FreeBSD amd64 kernel without PCI or ACPI support. Finding and removing the unnecessary dependencies would give a more compact FreeBSD/Firecracker kernel (and, incidentally, shave a few more microseconds off the boot time: it takes 25 ns to check whether memory needs to be reserved for an Intel GPU).

A more ambitious idea is to see whether Firecracker can be ported to run on FreeBSD. At some point, a virtual machine is just a virtual machine, and while Firecracker was designed to work with Linux KVM, there is no fundamental reason why it couldn’t work with the kernel portion of FreeBSD’s bhyve hypervisor.

Anyone who wants to experiment with FreeBSD under Firecracker can build a FreeBSD 14.0 amd64 kernel with the FIRECRACKER configuration and check out the feature/pvh branch of the Firecracker project. If that branch no longer exists, it has been merged into mainline Firecracker.
