KVM host in a few lines of code

Hello!

Today we are publishing an article on how to write a KVM host. We came across it on Serge Zaitsev's blog, translated it, and supplemented it with our own Python examples for those who don't work with the C language.

KVM (Kernel-based Virtual Machine) is a virtualization technology that ships with the Linux kernel. In other words, KVM allows you to run multiple virtual machines (VMs) on a single Linux host. The virtual machines in this case are called guests. If you’ve ever used QEMU or VirtualBox on Linux, you know what KVM is capable of.

But how does it work under the hood?

IOCTL

KVM exposes its API via a special device file, /dev/kvm. When you open the device, you get access to the KVM subsystem and can make ioctl system calls to allocate resources and launch virtual machines. Some ioctl calls return file descriptors, which can in turn be manipulated with further ioctls. And so on ad infinitum? Not really. There are only a few API levels in KVM:

  • the /dev/kvm level, used to manage the entire KVM subsystem and to create new virtual machines,
  • the VM level, used to manage an individual virtual machine,
  • the VCPU level, used to control the operation of a single virtual processor (one virtual machine can run on several virtual processors, or VCPUs).

In addition, there are APIs for I/O devices.

Let’s see how it looks in practice.

// KVM layer
int kvm_fd = open("/dev/kvm", O_RDWR);
int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
printf("KVM version: %dn", version);

// Create VM
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

// Create VM Memory
#define RAM_SIZE 0x10000
void *mem = mmap(NULL, RAM_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
struct kvm_userspace_memory_region region = {
	.slot = 0,
	.guest_phys_addr = 0,
	.memory_size = RAM_SIZE,
	.userspace_addr = (uintptr_t) mem,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

// Create VCPU
int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);

Python example:

    # Assumes: from fcntl import ioctl; from mmap import mmap, MAP_PRIVATE, MAP_ANONYMOUS, PROT_READ, PROT_WRITE.
    # The KVM_* constants and the ctypes structures (UserspaceMemoryRegion, Sregs, Regs, Run)
    # mirror the definitions in linux/kvm.h and are defined in the full gists linked at the end.
    with open('/dev/kvm', 'wb+') as kvm_fd:
        # KVM layer
        version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0)
        if version != 12:
            print(f'Unsupported version: {version}')
            sys.exit(1)

        # Create VM
        vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0)

        # Create VM Memory
        mem = mmap(-1, RAM_SIZE, MAP_PRIVATE | MAP_ANONYMOUS, PROT_READ | PROT_WRITE)
        pmem = ctypes.c_uint.from_buffer(mem)
        mem_region = UserspaceMemoryRegion(slot=0, flags=0,
                                           guest_phys_addr=0, memory_size=RAM_SIZE,
                                           userspace_addr=ctypes.addressof(pmem))
        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, mem_region)

        # Create VCPU
        vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0)

In this step, we have created a new virtual machine, allocated memory for it, and assigned one vCPU. For our virtual machine to actually run something, we need to load the virtual machine image and properly configure the processor registers.

Loading the virtual machine

It’s easy enough! Just read the file and copy its contents into the virtual machine memory. Of course, mmap is not a bad option either.

int bin_fd = open("guest.bin", O_RDONLY);
if (bin_fd < 0) {
	fprintf(stderr, "can not open binary file: %dn", errno);
	return 1;
}
char *p = (char *)ram_start; /* ram_start is the guest RAM mapped earlier (mem in the snippet above) */
for (;;) {
	int r = read(bin_fd, p, 4096);
	if (r <= 0) {
		break;
	}
	p += r;
}
close(bin_fd);

Python example:

        # Read guest.bin into the start of guest memory
        guest_bin = load_guestbin('guest.bin')  # load_guestbin is a helper from the full gist
        mem[:len(guest_bin)] = guest_bin

It is assumed that guest.bin contains valid bytecode for the current CPU architecture, because KVM does not interpret CPU instructions one by one the way old-style virtual machines did. KVM hands the computation to the real CPU and only intercepts I/O. This is why modern virtual machines run at high performance, close to bare metal, unless you are doing I/O-heavy operations.

Here’s the tiny guest VM kernel we’ll try to run first:

#
# Build it:
#
# as -32 guest.S -o guest.o
# ld -m elf_i386 --oformat binary -N -e _start -Ttext 0x10000 -o guest guest.o
#
.globl _start
.code16
_start:
xorw %ax, %ax
loop:
out %ax, $0x10
inc %ax
jmp loop

If you are not familiar with assembly, the example above is a tiny 16-bit executable that increments a register in a loop and outputs its value to port 0x10.

We deliberately compiled it as an archaic 16-bit application, because a freshly launched KVM virtual processor can operate in several modes, like a real x86 processor. The simplest mode is “real” mode, which has been used to run 16-bit code since the last century. Real mode is distinctive in its memory addressing: it is direct, with no descriptor tables, which makes it easier to initialize our registers for real mode:

struct kvm_sregs sregs;
ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);
// Initialize selector and base with zeros
sregs.cs.selector = sregs.cs.base = sregs.ss.selector = sregs.ss.base = sregs.ds.selector = sregs.ds.base = sregs.es.selector = sregs.es.base = sregs.fs.selector = sregs.fs.base = sregs.gs.selector = sregs.gs.base = 0;
// Save special registers
ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);

// Initialize and save normal registers
struct kvm_regs regs = { 0 };
regs.rflags = 2; // bit 1 must always be set to 1 in EFLAGS and RFLAGS
regs.rip = 0; // our code runs from address 0
ioctl(vcpu_fd, KVM_SET_REGS, &regs);

Python example:

        sregs = Sregs()
        ioctl(vcpu_fd, KVM_GET_SREGS, sregs)
        # Initialize selector and base with zeros
        sregs.cs.selector = sregs.cs.base = sregs.ss.selector = sregs.ss.base = sregs.ds.selector = sregs.ds.base = sregs.es.selector = sregs.es.base = sregs.fs.selector = sregs.fs.base = sregs.gs.selector = sregs.gs.base = 0
        # Save special registers
        ioctl(vcpu_fd, KVM_SET_SREGS, sregs)

        # Initialize and save normal registers
        regs = Regs()
        regs.rflags = 2  # bit 1 must always be set to 1 in EFLAGS and RFLAGS
        regs.rip = 0  # our code runs from address 0
        ioctl(vcpu_fd, KVM_SET_REGS, regs)

Running

The code is loaded and the registers are ready. Shall we start? To run the virtual machine, we need to get a pointer to the “run state” of each VCPU and then enter a loop in which the virtual machine runs until it is interrupted by I/O or other operations that transfer control back to the host.

int runsz = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
struct kvm_run *run = (struct kvm_run *) mmap(NULL, runsz, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu_fd, 0);

for (;;) {
	ioctl(vcpu_fd, KVM_RUN, 0);
	switch (run->exit_reason) {
	case KVM_EXIT_IO:
		printf("IO port: %x, data: %xn", run->io.port, *(int *)((char *)(run) + run->io.data_offset));
		break;
	case KVM_EXIT_SHUTDOWN:
		return;
	}
}

Python example:

        runsz = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0)
        run_buf = mmap(vcpu_fd, runsz, MAP_SHARED, PROT_READ | PROT_WRITE)
        run = Run.from_buffer(run_buf)

        try:
            while True:
                ret = ioctl(vcpu_fd, KVM_RUN, 0)
                if ret < 0:
                    print('KVM_RUN failed')
                    return
                if run.exit_reason == KVM_EXIT_IO:
                    print(f'IO port: {run.io.port}, data: {run_buf[run.io.data_offset]}')
                elif run.exit_reason == KVM_EXIT_SHUTDOWN:
                    return
                time.sleep(1)
        except KeyboardInterrupt:
            pass

Now if we run the application, we will see:

IO port: 10, data: 0
IO port: 10, data: 1
IO port: 10, data: 2
IO port: 10, data: 3
IO port: 10, data: 4
...

It works! The full source code is available at the link (if you spot a mistake, comments are welcome!).

You call that a kernel?

Most likely, none of this is very impressive. How about running the Linux kernel instead?

The beginning is the same: open /dev/kvm, create a virtual machine, and so on. However, we need a few more ioctls at the virtual-machine level to add a periodic interval timer, initialize the TSS (required for Intel chips), and add an interrupt controller:

ioctl(vm_fd, KVM_SET_TSS_ADDR, 0xffffd000);
uint64_t map_addr = 0xffffc000;
ioctl(vm_fd, KVM_SET_IDENTITY_MAP_ADDR, &map_addr);
ioctl(vm_fd, KVM_CREATE_IRQCHIP, 0);
struct kvm_pit_config pit = { .flags = 0 };
ioctl(vm_fd, KVM_CREATE_PIT2, &pit);

We will also need to change the way the registers are initialized. The Linux kernel needs protected mode, so we enable it in CR0 and initialize the base, selector, and granularity for each of the segment registers:

sregs.cs.base = 0;
sregs.cs.limit = ~0;
sregs.cs.g = 1;

sregs.ds.base = 0;
sregs.ds.limit = ~0;
sregs.ds.g = 1;

sregs.fs.base = 0;
sregs.fs.limit = ~0;
sregs.fs.g = 1;

sregs.gs.base = 0;
sregs.gs.limit = ~0;
sregs.gs.g = 1;

sregs.es.base = 0;
sregs.es.limit = ~0;
sregs.es.g = 1;

sregs.ss.base = 0;
sregs.ss.limit = ~0;
sregs.ss.g = 1;

sregs.cs.db = 1;
sregs.ss.db = 1;
sregs.cr0 |= 1; // enable protected mode

regs.rflags = 2;
regs.rip = 0x100000; // This is where our kernel code starts
regs.rsi = 0x10000; // This is where our boot parameters start

What are the boot parameters and why can’t you just boot the kernel at address zero? It’s time to learn more about the bzImage format.

The kernel image follows a special “boot protocol”: a fixed header with boot parameters, followed by the actual kernel bytecode. The format of the boot header is described here.

Loading a kernel image

In order to properly load the kernel image into the virtual machine, we first need to read the entire bzImage file. Then we look at offset 0x1f1, read the number of setup sectors from there, and skip them to find where the kernel code starts. In addition, we copy the boot parameters from the beginning of bzImage into the virtual machine’s memory area for boot parameters (0x10000). A minimal sketch of this is shown below.
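The article doesn’t show the loader at this point, so here is a hedged sketch of what it could look like, assuming ram_start is the guest memory mapped earlier; the offsets follow the x86 Linux boot protocol:

int bin_fd = open("bzImage", O_RDONLY);
struct stat st;
fstat(bin_fd, &st);
// Map the whole image so we can inspect the header.
uint8_t *img = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, bin_fd, 0);
// Byte 0x1f1 holds the number of 512-byte setup sectors (0 means 4).
uint8_t setup_sects = img[0x1f1];
if (setup_sects == 0) setup_sects = 4;
size_t setup_size = (setup_sects + 1) * 512; // +1 for the boot sector itself
// Copy the boot parameters from the start of bzImage to guest 0x10000...
memcpy((char *)ram_start + 0x10000, img, setup_size);
// ...and the protected-mode kernel code to guest 0x100000.
memcpy((char *)ram_start + 0x100000, img + setup_size, st.st_size - setup_size);
close(bin_fd);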

But even that won’t be enough. We will need to patch the boot parameters for our VM: force VGA mode and initialize the command line pointer.

Our kernel needs to write its logs to ttyS0 so that we can intercept the I/O and have our virtual machine print it to stdout. For this we need to add “console=ttyS0” to the kernel command line.
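Neither tweak appears as code in the article, so again a hedged sketch: the field offsets come from the boot protocol document mentioned above, and 0x20000 as the command-line address is an arbitrary free guest address chosen for this example.

// Patch the boot parameters already copied to guest 0x10000.
uint8_t *boot = (uint8_t *)ram_start + 0x10000;
// vid_mode (offset 0x1fa): 0xffff selects "normal" VGA text mode.
*(uint16_t *)(boot + 0x1fa) = 0xffff;
// type_of_loader (offset 0x210): 0xff means "undefined" bootloader.
boot[0x210] = 0xff;
// cmd_line_ptr (offset 0x228): guest-physical address of the command line.
*(uint32_t *)(boot + 0x228) = 0x20000;
// Place the command line itself at that address.
strcpy((char *)ram_start + 0x20000, "console=ttyS0");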

But even after that, we still won’t get any result. I had to set a fake CPUID for our kernel (https://www.kernel.org/doc/Documentation/virtual/kvm/cpuid.txt). Most likely, the kernel I built relied on this information to determine whether it was running inside a hypervisor or on bare metal.
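One way such a fake CPUID can be installed, sketched here with the KVM_GET_SUPPORTED_CPUID and KVM_SET_CPUID2 ioctls and the “KVMKVMKVM” signature from the document linked above (100 entries is just an arbitrary upper bound for this example):

// Fetch the CPUID entries that KVM supports, inject the KVM signature
// into the paravirtualization leaf, then install the table on the VCPU.
struct kvm_cpuid2 *cpuid = calloc(1, sizeof(*cpuid) + 100 * sizeof(struct kvm_cpuid_entry2));
cpuid->nent = 100;
ioctl(kvm_fd, KVM_GET_SUPPORTED_CPUID, cpuid);
for (unsigned i = 0; i < cpuid->nent; i++) {
	struct kvm_cpuid_entry2 *e = &cpuid->entries[i];
	if (e->function == 0x40000000) { // KVM_CPUID_SIGNATURE
		e->eax = 0x40000001;     // KVM_CPUID_FEATURES
		e->ebx = 0x4b4d564b;     // "KVMK"
		e->ecx = 0x564b4d56;     // "VMKV"
		e->edx = 0x4d;           // "M"
	}
}
ioctl(vcpu_fd, KVM_SET_CPUID2, cpuid);
free(cpuid);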

I used a kernel compiled with the “tiny” configuration, with a few configuration flags added to support a terminal and virtio (the I/O virtualization framework for Linux).

The full code of the modified KVM host and a test kernel image are available here.

If this image does not boot, you can use another image available at the given link.

If we compile it and run it, we get the following output:

Linux version 5.4.39 (serge@melete) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~16.04~ppa1)) #12 Fri May 8 16:04:00 CEST 2020
Command line: console=ttyS0
Intel Spectre v2 broken microcode detected; disabling Speculation Control
Disabled fast string operations
x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
BIOS-provided physical RAM map:
BIOS-88: [mem 0x0000000000000000-0x000000000009efff] usable
BIOS-88: [mem 0x0000000000100000-0x00000000030fffff] usable
NX (Execute Disable) protection: active
tsc: Fast TSC calibration using PIT
tsc: Detected 2594.055 MHz processor
last_pfn = 0x3100 max_arch_pfn = 0x400000000
x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC
Using GB pages for direct mapping
Zone ranges:
  DMA32    [mem 0x0000000000001000-0x00000000030fffff]
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000001000-0x000000000009efff]
  node   0: [mem 0x0000000000100000-0x00000000030fffff]
Zeroed struct page in unavailable ranges: 20322 pages
Initmem setup node 0 [mem 0x0000000000001000-0x00000000030fffff]
[mem 0x03100000-0xffffffff] available for PCI devices
clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
Built 1 zonelists, mobility grouping on.  Total pages: 12253
Kernel command line: console=ttyS0
Dentry cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
Inode-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
mem auto-init: stack:off, heap alloc:off, heap free:off
Memory: 37216K/49784K available (4097K kernel code, 292K rwdata, 244K rodata, 832K init, 916K bss, 12568K reserved, 0K cma-reserved)
Kernel/User page tables isolation: enabled
NR_IRQS: 4352, nr_irqs: 24, preallocated irqs: 16
Console: colour VGA+ 142x228
printk: console [ttyS0] enabled
APIC: ACPI MADT or MP tables are not detected
APIC: Switch to virtual wire mode setup with no configuration
Not enabling interrupt remapping due to skipped IO-APIC setup
clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x25644bd94a2, max_idle_ns: 440795207645 ns
Calibrating delay loop (skipped), value calculated using timer frequency.. 5188.11 BogoMIPS (lpj=10376220)
pid_max: default: 4096 minimum: 301
Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
Disabled fast string operations
Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
CPU: Intel 06/3d (family: 0x6, model: 0x3d, stepping: 0x4)
Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available!
Speculative Store Bypass: Vulnerable
TAA: Mitigation: Clear CPU buffers
MDS: Mitigation: Clear CPU buffers
Performance Events: Broadwell events, 16-deep LBR, Intel PMU driver.
...

Obviously, this is still a rather useless result: there is no initrd or root partition, and no real applications that could run in this kernel, but it still proves that KVM is not so scary and is quite a powerful tool.

Conclusion

To run a full-fledged Linux, the virtual machine host needs to be much more advanced: we would need to emulate several I/O devices for disks, keyboard, and graphics. But the general approach remains the same; for example, the command line parameters for initrd are configured in a similar way. For disks, we would need to intercept I/O and respond appropriately.

However, no one is forcing you to use KVM directly. There is libvirt, a nice, friendly library for low-level virtualization technologies like KVM or bhyve.

If you are interested in learning more about KVM, I suggest reading the kvmtool sources. They are much easier to read than QEMU’s, and the whole project is much smaller and simpler.

Hope you enjoyed the article.

You can follow the news on GitHub, on Twitter, or subscribe via RSS.

Links to the GitHub gists with the Python examples from the Timeweb expert: (1) and (2).
