Execute Docker Containers as QEMU MicroVMs

Summary

This blog post compares the advantages and disadvantages of docker containers and VMs and describes why and how we execute docker images inside a QEMU microVM. The described approach makes it possible to combine the security of VMs with the existing docker ecosystem (e.g. images and tools). Afterwards, we look at how features like bind mounts can be mapped to solutions supported by QEMU and demonstrate this by running the NGINX docker image in QEMU.

Docker vs VM

Running software in a sandboxed environment is crucial when handling untrusted input, especially when multiple tenants are involved. We faced this challenge during the development of the scanner feature for MergeBoard. Scanners are external tools that analyze the source code and generate a report that is parsed by MergeBoard and displayed as review comments. A simple example would be a spell checker. Since these external tools can get very complex and often include parsers, they provide an attack surface not to be underestimated. A search for CVEs related to parsing already yields more than 100 results for the first half of 2021. It also became clear that some of these tools will not return useful results unless the customer can prepare the environment, e.g. install dependencies. This leads to the question: How can we run untrusted code without compromising the security of our system?

There are two common solutions to this challenge: virtual machines and containers. Both of them come with their own advantages and disadvantages. Virtual machines (VMs) are considered to provide the best isolation, but they are usually quite heavy in terms of resource usage and slow to start. Containers, on the other hand, can easily share resources with the host system and are quick to start, but their isolation level is not necessarily on par with VMs. A natural choice for our scanner environment would have been docker containers, as they are widely used by developers in CIs and there are a lot of images available that customers can build upon. Are docker containers secure enough though? This is difficult to answer: docker's documentation states that "Docker containers are, by default, quite secure", but other companies don't seem to fully trust them. Google, for example, developed gVisor to add another layer of security, and Amazon created Firecracker, which is used by Fly.io to run docker containers in a VM. The last idea piqued our interest: Can we somehow combine the advantages of the docker ecosystem with VMs? The approach was worth a try, but Firecracker itself just didn't fit into our architecture. We already had our own task system in place and therefore didn't need another daemon, and Firecracker's focus on containers providing network services didn't fully overlap with our requirements. We therefore decided to create a similar system using the more versatile QEMU hypervisor.

Running docker containers in a QEMU microVM

Before we can create our first proof of concept, it is important to understand what microVMs and docker images are from a technical point of view. This will lead us to the steps necessary to convert docker containers into VMs. Afterwards we will try to boot into an image and measure the boot duration of the kernel to get an idea of the speed difference compared to normal docker containers.

What are microVMs and how do they differ from normal VMs?

A "normal" VM tries to mimic real hardware, and the operating system running as guest might not even be aware of the virtualization. The disadvantage of this approach is that the emulation of hardware is not necessarily efficient. On a real hardware system, most of your peripheral components (e.g. the GPU) can run independently of the CPU. The CPU is only involved if new data or events are available and need to be processed. With VMs the story is a little different: the hypervisor needs to emulate all the peripheral devices as well. This is especially bad if there is a lot of communication involved, since every time the guest OS interacts with the fake hardware, control is handed over to the hypervisor, which needs to handle the request. To improve performance, paravirtualized devices were created that provide the same functionality as hard drives or network cards without requiring as many context switches as their emulated counterparts. These special devices can provide a significant performance boost, but the guest operating system needs appropriate drivers to support them and is aware of the fact that it is running in a VM.

microVMs don't try to emulate real hardware and instead use almost exclusively paravirtualized devices. The set of provided hardware components is restricted to the bare minimum; microVMs, for example, don't have a PCI bus but use memory-mapped I/O instead. This limits the devices available for emulation quite a bit (e.g. no input devices or GPU), but leads to much better boot times.

What are docker containers and how do they differ from a normal operating system?

Docker containers are basically just tar archives containing the file system of the container plus some additional metadata. This metadata describes, for example, which process should be started in the docker container. The main difference to a normal operating system is that there is neither a kernel nor an init system installed. There is nothing that would set up a network card or the directories /proc and /sys; the environment is prepared by the docker runtime before anything is executed in the container. If we want to run docker containers as VMs, we need to take care of this ourselves.
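You can see this for yourself by exporting an image with docker save and listing the archive contents. A quick sketch (the image name is just an example, layer IDs will differ on your machine):

# Export an image and inspect its contents: layer tarballs plus metadata,
# but no kernel and no init system.
docker image save alpine:3.13 -o alpine-image.tar
tar -tf alpine-image.tar    # shows manifest.json, a config .json and one layer.tar per layer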

Kernel

Let's get started! Unlike most normal VMs, we skip the BIOS / UEFI initialization and boot directly into the kernel. This is a nice feature of QEMU, but it raises the question of which kernel we should boot, since docker containers usually do not include one. The answer is simple: we build our own kernel that only contains the drivers we actually need. To make your life easier, you can grab my kernel config from here and use the commands below to compile it:

wget https://mergeboard.com/files/blog/qemu-microvm/defconfig
wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.12.10.tar.xz
tar -xf linux-5.12.10.tar.xz
cd linux-5.12.10/
cp ../defconfig .config
make olddefconfig
make -j8

Since we now have a compiled kernel image at arch/x86_64/boot/bzImage, we can try the first step of booting the kernel in a QEMU microVM. The system is going to panic since we don't provide any filesystem, but it gives us a baseline for how long the VM needs to start.

$ qemu-system-x86_64 -M microvm,x-option-roms=off,isa-serial=off,rtc=off -no-acpi -enable-kvm -cpu host -nodefaults -no-user-config -nographic -no-reboot -device virtio-serial-device -chardev stdio,id=virtiocon0 -device virtconsole,chardev=virtiocon0 -kernel kernel/bzImage -append "console=hvc0 acpi=off reboot=t panic=-1"

[...]
[    0.043774] Loaded X.509 cert 'Build time autogenerated kernel key: a55de9768f536d965f27c2fe6fc963974d95b367'
[    0.044079] Key type ._fscrypt registered
[    0.044211] Key type .fscrypt registered
[    0.044294] Key type fscrypt-provisioning registered
[    0.044559] Key type encrypted registered
[    0.044888] VFS: Cannot open root device "(null)" or unknown-block(0,0): error -6
[    0.045056] Please append a correct "root=" boot option; here are the available partitions:
[    0.045224] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[    0.045386] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.12.10 #1
[    0.045479] Call Trace:
[...]

As expected, the kernel panics with VFS: Cannot open root device "(null)" or unknown-block(0,0): error -6. The more interesting detail is that the kernel took less than 50 ms to initialize! If you add the startup time of QEMU, you end up at around 180 ms. You could try to optimize this further by not compressing the kernel, but this is fast enough for our use case.

Let’s take a closer look at the command line options used to start QEMU as these are quite a lot:

Option Explanation
-M microvm,x-option-roms=off,isa-serial=off,rtc=off Switches to microVM mode and disables all unnecessary devices (BIOS option rom, isa serial device and real time clock)
-no-acpi Disables ACPI support, usually used to control the power states of the host (i.e. standby modes)
-enable-kvm Use the KVM hypervisor API for hardware acceleration
-cpu host Provide all CPU features of the host to the VM
-nodefaults Prevent QEMU from adding any default devices
-no-user-config Do not load any user config files
-nographic Disable graphic output
-no-reboot Exit when the VM tries to reboot
-device virtio-serial-device Add a paravirtual device for serial ports
-chardev stdio,id=virtiocon0 Add a serial device that is connected to stdin/stdout on the host with the name virtiocon0
-device virtconsole,chardev=virtiocon0 Connect the serial device virtiocon0 to a virtual terminal device on the guest side
-kernel kernel/bzImage The kernel we want to boot
-append “console=hvc0 acpi=off reboot=t panic=-1” The kernel command line. Use hvc0 (=virtiocon0) for the console, disable ACPI as it is not available anyways, reboot using a triple CPU fault (normal reboot would require ACPI) and reboot immediately when a panic occurs

Since the kernel seems to work, we can tick it off from our todo list and continue with the init system.

Init System

After mounting the file system specified by the kernel command line, Linux will try to execute /sbin/init and panic if it terminates or does not exist at all. We could now try to use the entry point of the docker image as init, but this will most probably fail as our Linux environment is not fully initialized yet. We need to inject an init system that takes care of mounting various directories and applying other system settings. It would be possible to use something like systemd or BusyBox's init, but this would be overkill in most cases. To start with, we use our own small, self-written init system:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>

char *const default_environment[] = {
    "PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin",
    NULL,
};

void mount_check(
    const char *source, const char *target,
    const char *filesystemtype, unsigned long mountflags,
    const void *data
) {
    struct stat info;

    if (stat(target, &info) == -1 && errno == ENOENT) {
        printf("Creating %s\n", target);
        if (mkdir(target, 0755) < 0) {
            perror("Creating directory failed");
            exit(1);
        }
    }

    printf("Mounting %s\n", target);
    if (mount(source, target, filesystemtype, mountflags, data) < 0) {
        perror("Mount failed");
        exit(1);
    }
}

int main(int argc, char *argv[]) {
    mount_check("none", "/proc", "proc", 0, "");
    mount_check("none", "/dev/pts", "devpts", 0, "");
    mount_check("none", "/dev/mqueue", "mqueue", 0, "");
    mount_check("none", "/dev/shm", "tmpfs", 0, "");
    mount_check("none", "/sys", "sysfs", 0, "");
    mount_check("none", "/sys/fs/cgroup", "cgroup", 0, "");

    sethostname("microvm", sizeof("microvm"));

    execle("/bin/sh", "/bin/sh", NULL, default_environment);
    perror("Exec failed");
    return 1;
}

This small C application does basically the same thing docker does when starting a container. To make testing easier, we hardcoded /bin/sh as the entry point. In the long run you will want to replace this with some other application or startup script. Please be aware that we replace our init application with the target process using execle, so that the target process becomes the new init process. If that application does not wait for its child processes (call waitpid() on SIGCHLD), you might end up with a lot of zombie processes, as all orphaned processes are reparented to the init process. If this turns out to be a problem, the small C program could be extended, or you could run everything through bash, which also takes care of this.

Let’s compile our init script:

gcc -Wall -o init -static init.c

Before we can test our init application, we first need a file system into which we can inject it. The filesystem also needs to provide /bin/sh. Let's figure out how to convert a docker image into a QEMU disk image!

Preparing the docker image

We have now compiled our kernel and created a simple, minimalistic init system. The last missing piece before we can test our approach for the first time is to convert an existing docker image into a virtual machine disk image. During this process we will inject our own init binary so that we don't need a separate initramfs. We split our approach into two steps: first we generate a tar archive containing the base filesystem, and then we convert it into a disk image.

There are various ways to convert a docker image into a tar archive. The easiest method is to use docker build. It gives us a simple way to inject our init system and also lets us customize the image. Another method would be to use docker container create and docker container export to export the base filesystem (see the sketch below), or to manually combine the layers exported by docker save. For this tutorial we will use docker build, but you could also fetch the different layers from a docker registry using a script and not use docker at all.
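For completeness, a rough sketch of the docker container create / export route (the container name export-tmp is just an example); note that the init binary would still have to be added to the archive separately, which is why we stick to docker build:

# Create a stopped container from the image and export its flattened filesystem.
docker container create --name export-tmp alpine:3.13
docker container export export-tmp -o alpine-base.tar
docker container rm export-tmp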

For our first test we will use Alpine Linux and tell docker build to inject our init system and extract the file system as tar archive:

FROM alpine:3.13

RUN rm -f /sbin/init
COPY init /sbin/init
We save this Dockerfile as alpine.docker and build it with:

DOCKER_BUILDKIT=1 docker build -f alpine.docker --output "type=tar,dest=alpine.tar" .

We are almost done and just need to convert the tar archive into a qcow2 image:

virt-make-fs --format=qcow2 --size=+200M alpine.tar alpine-large.qcow2
qemu-img convert alpine-large.qcow2 -O qcow2 alpine.qcow2
rm alpine-large.qcow2

The first command converts the tar archive into a qcow2 image. The qcow2 image format is not as efficient as a raw image, but it has several advantages that we can benefit from. One issue of VMs, compared to docker containers, is that you cannot easily share resources between guest and host: VMs cannot simply use all the available disk space but are limited to the size of the disk image. From a security point of view it makes sense to limit the maximum disk space available to a container, but sometimes it is hard to anticipate the required amount of disk space in advance, and you might want to allocate a bit more just to be sure (see the size parameter). With raw images you would end up with huge image files containing almost nothing, and you would need to rely on your file system treating the file as sparse if you don't want to waste a lot of disk space.

The qcow2 format already supports sparse images itself, at least in theory. For some reason virt-make-fs doesn't create sparse images, and the generated alpine-large.qcow2 has a size of 207 MB. To fix this, we call qemu-img convert to re-encode the image file, which shrinks it to 9.9 MB. Another advantage of qcow2 is that it supports differential images, as we will see later.
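If you want to check the effect yourself, qemu-img info reports both the virtual size and the actual space used on disk:

# The virtual size stays at roughly 200 MB while the re-encoded file
# only occupies the space that is actually used.
qemu-img info alpine.qcow2
du -h alpine.qcow2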

First try!

We have now gathered all our dependencies and can run the VM for the first time. We just need to adjust our qemu command line a bit to attach the disk image.

qemu-system-x86_64 -M microvm,x-option-roms=off,isa-serial=off,rtc=off -no-acpi -enable-kvm -cpu host -nodefaults -no-user-config -nographic -no-reboot -device virtio-serial-device -chardev stdio,id=virtiocon0 -device virtconsole,chardev=virtiocon0 -drive id=root,file=alpine.qcow2,format=qcow2,if=none -device virtio-blk-device,drive=root -kernel kernel/bzImage -append "console=hvc0 root=/dev/vda rw acpi=off reboot=t panic=-1"
Option Explanation
-drive id=root,file=alpine.qcow2,format=qcow2,if=none Registers a drive using the alpine.qcow2 image file with the id root
-device virtio-blk-device,drive=root Exposes the root drive as virtio block device to the guest
-append “console=hvc0 root=/dev/vda rw acpi=off reboot=t panic=-1” We added root=/dev/vda rw to mount the virtio device as writable root file system

Let’s see this in action:

Our proof of concept works! We can execute applications inside the alpine docker container running in QEMU. This was a very simple example though. In order to do something useful with it, we need to extend our approach a bit.

Next steps

We have demonstrated that the basic idea of converting docker images into VMs works, but there are various features missing in order to really make use of this approach. So far we can only communicate with the VM using stdio and the image needs to be recreated each time a new instance is started. Let’s take this approach a bit further and learn how to mimic various features from docker.

Better hardware configuration

If you are familiar with QEMU and look closely at the command line, you will notice that we defined all the required devices but never specified the amount of memory or the number of CPU cores. By default the VM will only have around 100 MB of memory and a single CPU core. This is most probably not enough in practice, so you will want to extend the command line:

Option Explanation
-m 512 Amount of assigned memory in MB
-smp 2 Number of virtual CPU cores

Assigning more CPU cores to a VM than necessary is not really an issue, as the VM runs like any other process on your system. If the guest does not fully utilize all its assigned cores, the host can still use the physical cores for other tasks or VMs. You can therefore overprovision the number of virtual CPU cores if you don't know how many you will actually need.

Another issue you might notice sooner or later is that the Linux kernel takes forever to initialize the random number generator. There is simply no real source of entropy in our microVM, and applications trying to access the RNG will freeze. You can work around this by exposing the host's RNG to the guest:

Option Explanation
-device virtio-rng-device Add a virtual random number generator device forwarding random data from the host

Using differential images

We have to recreate the disk image each time we start a VM if we want to make sure the image was not modified by a previous run, or if we want to execute multiple instances in parallel. This slows down the startup process and also consumes more disk space than necessary. One option would be to mark the disk image as read-only, but this would probably break most containers unless we apply extra tricks like a tmpfs overlay. A better approach is to use QEMU's differential / incremental disk images. This allows us to add another layer on top of the original image file: all writes only modify the top layer, while the original image file is treated as read-only. This approach is very similar to docker's OverlayFS storage driver.

Before we start our VM, we create our differential disk image using:

qemu-img create -f qcow2 -b alpine.qcow2 -F qcow2 alpine-diff.qcow2

Just replace the image filename in the QEMU command line and you are done. It is now possible to run multiple VMs with the same base image in parallel. After a VM has terminated, you can remove its differential image again. You might want to wrap these steps into a small script so that you don't have to execute them manually, as sketched below.
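A minimal sketch of such a wrapper, assuming the alpine.qcow2 base image and the kernel built earlier (the temporary overlay name is arbitrary):

#!/bin/sh
# Create a throwaway differential image, boot the VM from it and
# delete the overlay again afterwards.
set -eu

DIFF="alpine-diff-$$.qcow2"
qemu-img create -f qcow2 -b alpine.qcow2 -F qcow2 "$DIFF"

qemu-system-x86_64 \
    -M microvm,x-option-roms=off,isa-serial=off,rtc=off \
    -no-acpi -enable-kvm -cpu host \
    -nodefaults -no-user-config -nographic -no-reboot \
    -device virtio-serial-device \
    -chardev stdio,id=virtiocon0 -device virtconsole,chardev=virtiocon0 \
    -drive id=root,file="$DIFF",format=qcow2,if=none \
    -device virtio-blk-device,drive=root \
    -kernel kernel/bzImage \
    -append "console=hvc0 root=/dev/vda rw acpi=off reboot=t panic=-1"

rm -f "$DIFF"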

If you want to keep the changes made by one of the VMs, it is possible to merge them back into the base image using:

qemu-img commit alpine-diff.qcow2

Getting files in and out

The proof of concept only used stdio to communicate with the outside world, and there is no other way of getting data in and out. This is obviously not sufficient if you want to transfer a large number of files between host and guest. There are, however, two simple approaches we can use to transfer files.

Using the disk image: If you only want to upload/download files before the VM starts and after it terminates, you can simply modify the files in the differential disk image. Tools like qemu-nbd or guestmount make it possible to mount the image on the host system so that you can read and write files as you like. Such an approach makes sense for (reproducible) builds: the source code is uploaded before the VM starts, and the generated packages are extracted after the VM shuts down.
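As a rough sketch (the mount point /mnt/inspect is just an example):

# Option 1: guestmount from libguestfs, no root privileges required.
# virt-make-fs puts the filesystem directly on the disk, hence /dev/sda.
guestmount -a alpine-diff.qcow2 -m /dev/sda /mnt/inspect
# ... copy files in or out of /mnt/inspect ...
guestunmount /mnt/inspect

# Option 2: qemu-nbd, requires root and the nbd kernel module.
modprobe nbd
qemu-nbd --connect=/dev/nbd0 alpine-diff.qcow2
mount /dev/nbd0 /mnt/inspect
# ... copy files in or out of /mnt/inspect ...
umount /mnt/inspect
qemu-nbd --disconnect /dev/nbd0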

While the VM is running: If you want to share files with the VM while it is running, you can use the 9pfs virtual file system device. This device allows you to mount a directory from the host system, similar to bind mounts in docker. To expose a directory to a client, we need to add two options to the QEMU command line:

Option Explanation
-fsdev local,path=myfiles,security_model=none,id=files Register myfiles directory with the id files. You can add readonly as option if desired.
-device virtio-9p-device,fsdev=files,mount_tag=hostfiles Expose the files fsdev as a virtio 9p device to the guest under the mount tag hostfiles

In order to access the files within the guest, you need to mount the filesystem using the name specified by mount_tag. If you are using the kernel configuration provided earlier, support for the 9P network file system over virtio is already compiled in. You can mount the filesystem from the command line:

mount -t 9p hostfiles /target -otrans=virtio,version=9p2000.L,msize=52428800

or extend the init script:

mount_check("hostfiles", "/target", "9p", 0, "trans=virtio,version=9p2000.L,msize=52428800");

You might need to adjust the msize parameter as described in the QEMU wiki to obtain the optimal performance.

Communication

QEMU provides various virtual devices that can be used to communicate between guest and host. The most obvious solution would be to add a network device. This might be necessary in many cases, but it also has the disadvantage that you need to create either a bridge or a virtual ethernet device on the host. This step requires special permissions and is also one of the reasons why the docker daemon runs as root by default. If you don't require actual network access, you might want to try one of the other options.

Serial Device: You can add one or multiple serial devices to the guest and access them from the host in various ways (pipes, sockets, files, etc.). This makes it possible to stream data between host and guest. The virtio devices are also much faster than the emulated hardware serial devices. I have used this approach in the past to live-stream build logs to the website of a build system.

If you want to log all traffic sent to a serial port into a file, you could use the following QEMU options:

Option Explanation
-chardev file,id=serial1,path=log.txt Register a character device that logs to log.txt with the id serial1.
-device virtserialport,chardev=serial1,name=program_stdout Expose serial1 as virtio serial port with the name program_stdout to the guest.
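On the guest side, the named port shows up as a character device. A hedged sketch (my_build_step is a placeholder, and the exact device node depends on whether devtmpfs/udev is available in your image):

# With udev the named port is linked under /dev/virtio-ports/,
# otherwise it appears as /dev/vportNpM.
./my_build_step > /dev/virtio-ports/program_stdout 2>&1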

Virtual Socket: QEMU supports virtio-vsock devices that make it possible to establish Unix-socket-like connections between host and guest. This approach allows multiple parallel connections and opens up some interesting use cases. The only disadvantage is that it uses the special socket type AF_VSOCK and is therefore not a drop-in replacement. You might be able to work around this limitation by using socat, though. If you want to give it a try, you might have to switch away from using a microVM: it seems like the only device implementation provided by QEMU is based on a virtual PCI device, which is not available in a microVM (no PCI bus).

Network: There are various network backends supported by QEMU. You can connect your virtual network card to a bridge on the host system, use a veth device, user-mode emulation, etc. I don't want to go into too much detail here, as you can find a lot of tutorials using your favorite search engine. If you just want to expose a port to the host, you might get away with the user-mode emulation:

Option Explanation
-netdev user,id=mynet0,hostfwd=tcp:127.0.0.1:8080-:80 Add a user-mode network with the id mynet0 that forwards requests to 127.0.0.1:8080 on the host to port 80 in the guest
-device virtio-net-device,netdev=mynet0 Expose mynet0 as virtio network device to the guest

Please note that user mode network emulation can be quite slow and that you might be better off using a bridge or veth device in high traffic scenarios.

Entry point

The example init executable starts /bin/sh by default, which is a great way to try things out in the VM, but not very useful in practice. You typically want to execute a certain program when you start your "container", called ENTRYPOINT in docker. There are various ways to achieve this, but the most generic approach is probably to upload a script into the base image that takes care of the necessary initialization (e.g. mounting any 9p file systems) and then calls your target application. If you want to make it more dynamic, you can generate the script on the fly and upload it into the differential image or host it directly via 9pfs.
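Such a script could look like the following sketch. The path /sbin/boot.sh, the mount tag hostfiles (from the 9p example above) and the target program /usr/local/bin/scanner are assumptions; init would then exec /sbin/boot.sh instead of /bin/sh:

#!/bin/sh
# Hypothetical entry point baked into the image as /sbin/boot.sh.
set -e

# Make the host directory available via 9p.
mkdir -p /data
mount -t 9p hostfiles /data -otrans=virtio,version=9p2000.L,msize=52428800

# Hand over to the actual workload.
exec /usr/local/bin/scanner /data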

More complex example

Now that we have demonstrated our proof of concept, let's try a more complex example: hosting a website using NGINX. The idea is to serve a directory on our host via an NGINX instance running inside a VM. In practice you would be better off using a docker container here (unless security is really important), because all the I/O has to go through the virtualization layer, while the CPU, the one resource that is available with almost zero performance impact, is barely used. Since this example demonstrates several of the ideas described above, it is still worth taking a look at.

We start off with a new Dockerfile:

FROM nginx:1.21.0

RUN apt update
RUN apt install -y isc-dhcp-client

RUN rm -f /sbin/init
COPY init /sbin/init

and turn it into a differential qcow2 image:

DOCKER_BUILDKIT=1 docker build -f nginx.docker --output "type=tar,dest=nginx.tar" .
virt-make-fs --format=qcow2 --size=+200M nginx.tar nginx-large.qcow2
qemu-img convert nginx-large.qcow2 -O qcow2 nginx.qcow2
rm nginx-large.qcow2
qemu-img create -f qcow2 -b nginx.qcow2 -F qcow2 nginx-diff.qcow2

Now we are ready to go:

qemu-system-x86_64 -M microvm,x-option-roms=off,isa-serial=off,rtc=off -m 512 -no-acpi -enable-kvm -cpu host -nodefaults -no-user-config -nographic -no-reboot -device virtio-serial-device -chardev stdio,id=virtiocon0 -device virtconsole,chardev=virtiocon0 -kernel ../kernel/bzImage -append "console=hvc0 root=/dev/vda rw acpi=off reboot=t panic=-1 quiet" -drive id=root,file=nginx-diff.qcow2,format=qcow2,if=none -device virtio-blk-device,drive=root -netdev user,id=mynet0,hostfwd=tcp:127.0.0.1:8080-10.0.2.15:80 -device virtio-net-device,netdev=mynet0 -fsdev local,path=www,security_model=none,id=www,readonly -device virtio-9p-device,fsdev=www,mount_tag=www -device virtio-rng-device

As you can see in the replay, it works as expected. We can update files on the host system and the change is visible when requesting the file again using wget.
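For reference, the commands run inside the guest roughly correspond to the following sketch (the interface name eth0 and the web root of the official nginx image are assumptions):

# Inside the guest: obtain 10.0.2.15 from QEMU's built-in DHCP server,
# mount the shared www directory over NGINX's web root and start NGINX.
dhclient eth0
mount -t 9p www /usr/share/nginx/html -otrans=virtio,version=9p2000.L,msize=52428800
nginx

# On the host: fetch a file through the forwarded port.
wget -qO- http://127.0.0.1:8080/index.html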

Conclusion

We have demonstrated that it is possible to run a docker image in a microVM with minor modifications, mostly injecting an init executable. The additional time required to boot the Linux kernel (~200 ms) should be negligible. Most of the typical features provided by docker (e.g. bind mounts, forwarding ports) can be translated into QEMU options. Whether it makes sense to run a typical docker container in QEMU depends on your requirements. From a security point of view, QEMU provides great isolation, but not all of the emulated features (e.g. bind mounts) provide the same performance as docker. For our use case the performance impact is not really relevant, as our containers don't need a lot of communication with the outside world. We do not intend to run typical docker containers that host services, but are rather interested in running code analyzers in a secure way. Using docker-based images gives our customers the possibility to make use of the docker ecosystem (tools, images, etc.) without worrying about the complexity of virtual machines. Nonetheless, it would be possible to write a tool that converts most docker images, including their metadata, into ready-to-run QEMU images and command line options out of the box.
