Introduction
------------

Virtio-vhost-user[1] is a promising virtualisation technology that
allows virtual devices that are exposed to VMs to themselves be
implemented in other VMs.  Let's break down its name a bit to
understand how it works:

* Virtio[2] is a standard driver interface for virtualisation.
  Interfaces are available for all sorts of virtual devices,
  e.g. virtio-net, virtio-blk, and virtio-scsi.  Typically, virtio
  devices are implemented by a virtual machine monitor (VMM).

* Vhost[3] is a kernelspace implementation of virtio virtual devices,
  created for performance reasons.  Instead of implementing the
  virtual devices itself, the VMM talks to the kernel implementation
  of them using a special ioctl protocol.

* Vhost-user[4] allows another process to implement the vhost
  protocol, instead of the kernel, by using a UNIX socket instead of
  ioctls on a special character device.  This doesn't provide the raw
  performance of vhost, but it serves a different purpose -- it
  allows virtual devices to be implemented by external programs, in a
  standardised way so they're portable between VMMs.

Virtio-vhost-user allows the program implementing the virtual device
to run in a VM of its own, by having the VMM for that VM create the
vhost-user socket and transfer messages over it to its guest using
virtio.

This is exciting for Spectrum, because it would mean that the host
system doesn't have to interact with physical hardware directly
beyond the PCI level.  Instead, it can pass the hardware through to a
VM, which implements a virtual device backed by that physical
hardware, and that virtual device can then be exposed to other VMs.

Last year I spent a while looking into virtio-vhost-user[5][6][7].
It's a long way from being ready to use, and it seems to be maturing
very slowly.  It might be useful to us eventually for driver
isolation, or something else might come along.  My conclusion from
that research was that we should decide later, once the ecosystem has
had a chance to develop.  But I wanted something to come out of the
research I did anyway, and so I've prepared a demonstration.

[1]: https://wiki.qemu.org/Features/VirtioVhostUser
[2]: https://docs.oasis-open.org/virtio/virtio/v1.1/virtio-v1.1.html
[3]: https://blog.vmsplice.net/2011/09/qemu-internals-vhost-architecture.html
[4]: https://qemu.readthedocs.io/en/latest/interop/vhost-user.html
[5]: https://spectrum-os.org/lists/archives/spectrum-devel/87pn8rezqn.fsf@alyssa.is/
[6]: https://spectrum-os.org/lists/archives/spectrum-devel/87blk1pwph.fsf@alyssa.is/
[7]: https://spectrum-os.org/lists/archives/spectrum-devel/87wo2glkg0.fsf@alyssa.is/


What the demo does
------------------

Using any sort of non-host-based virtual device implementation is
going to have to start with taking the virtual device implementation
out of the VMM running an application VM, and vhost-user is the clear
solution to this.

Vhost-user isn't supported by crosvm -- its focus is on doing all the
virtualisation required by Chromium OS, so there's no need for it to
allow other programs to provide virtual device implementations.  So
another part of the research I did was to try to port the vhost-user
implementation from cloud-hypervisor to crosvm, which I was able to
do successfully[8].  This has implications beyond vhost-user, and
even beyond crosvm, because it demonstrates that it's practical to
port features between rust-vmm[9] VMMs, which means we don't have to
worry about finding one that provides every feature we need (which is
just as well, because there isn't one).
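
To make the vhost-user plumbing concrete, here's roughly what
attaching a conventional vhost-user network device to a stock QEMU
looks like.  (This is just an illustrative sketch, not part of the
demo -- the socket path and ids are made up.  The one non-obvious
requirement is that guest memory has to be shareable with the backend
process, hence the file-backed memory object.)

    $ qemu-system-x86_64 -m 1G \
        -object memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on \
        -numa node,memdev=mem0 \
        -chardev socket,id=char0,path=/tmp/vhost-user0.sock \
        -netdev type=vhost-user,id=net0,chardev=char0 \
        -device virtio-net-pci,netdev=net0 \
        ...

Whatever is listening on the other end of that socket (DPDK, some
other userspace networking stack, ...) then implements the virtio-net
device for the guest.
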
So here I demonstrate not just a "standard" virtio-vhost-user setup
(to the extent that such a thing can exist at this early stage), but
also that my patched crosvm with vhost-user support is capable of
interoperating with the experimental virtio-vhost-user implementation
for QEMU, because both are speaking the standardised vhost-user
protocol.

The demo sets up two VMs.  One is run with my patched crosvm, and
expects a virtual ethernet device to be provided by a vhost-user
socket.  When it boots, it brings up its network interface, tries to
run a DHCP client, and then exits.

The other VM is run with Nikos Dragazis and Stefan Hajnoczi's
experimental virtio-vhost-user implementation in QEMU[10].  It gets a
standard virtual ethernet device (backed by a TAP device on the
host), and the virtio-vhost-user device hooked up to the socket that
crosvm will be connecting to.  Inside the VM, a userspace networking
stack (DPDK, again modified to support virtio-vhost-user by Nikos
Dragazis[11]) implements the device side of virtio-vhost-user, and
forwards packets sent by crosvm's guest to the virtual ethernet
device backed by the host TAP.

+-----------------------------------------------------------------------------+
|                                                                             |
|              +------------------------+      +------------------------+     |
|              |                        |      |                        |     |
|              |   +----------------+   |      |   +----------------+   |     |
|              |   |                |   |      |   |                |   |     |
| +-----+      |   |   +--------+   |   |      |   |                |   |     |
| | TAP +------+---+---+  DPDK  +---+---+------+---+                |   |     |
| +-----+      |   |   +--------+   |   |      |   |                |   |     |
|              |   |                |   |      |   |                |   |     |
|              |   |     Linux      |   |      |   |     Linux      |   |     |
|              |   +----------------+   |      |   +----------------+   |     |
|              |                        |      |                        |     |
|              |          QEMU          |      |         crosvm         |     |
|              +------------------------+      +------------------------+     |
|                                                                             |
|                                    Linux                                    |
+-----------------------------------------------------------------------------+

A complicating factor is that the virtio-vhost-user implementation
for DPDK only supports outgoing traffic[12].  So packets coming from
crosvm will be relayed to the TAP, but not the other way around.
This means that we can't just use ping inside the crosvm VM to verify
that the connection is working.  Instead, we have to run tcpdump on
the host and verify that the packets the DHCP client inside the
crosvm VM is sending are arriving on the TAP.

For this to be useful for our intended purpose of isolating drivers
for physical devices, we'd pass through the device here rather than
using a TAP.  It would otherwise work exactly the same, but it would
be more difficult to test that it's working correctly.  (I have
tested it, though -- for the first version of this that I got working
last year, I verified it worked by checking the logs of my local
network's DHCP server.)

[8]: https://spectrum-os.org/lists/archives/spectrum-devel/20210512170812.192540-1-hi@alyssa.is/
[9]: https://github.com/rust-vmm
[10]: https://github.com/ndragazis/qemu/tree/virtio-vhost-user
[11]: https://github.com/ndragazis/dpdk-next-virtio/tree/virtio-vhost-user
[12]: https://github.com/ndragazis/dpdk-next-virtio/blob/2d60e63/drivers/virtio_vhost_user/trans_virtio_vhost_user.c#L379


Running the demo
----------------

First, create a TAP device for QEMU to use:

    # ip tuntap add qemutap mode tap
    # ip link set qemutap up

Start tcpdump, so we can see if packets arrive on the TAP:

    # tcpdump -i qemutap

Start the QEMU VM:

    $ $(nix-build -A qemuVm /path/to/demo.nix)

When you see "Press enter to exit", DPDK is ready to receive a
virtio-vhost-user connection.

Start the crosvm VM:

    $ $(nix-build -A crosvmVm /path/to/demo.nix)

Once that VM boots, you should see some "BOOTP/DHCP" lines in the
tcpdump output.
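
(If other traffic on the TAP makes the tcpdump output noisy, it can
be narrowed down to just DHCP.  This filter is my own suggestion
rather than part of the demo:

    # tcpdump -ni qemutap port 67 or port 68

)
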
Seeing those lines demonstrates that traffic from the crosvm guest
has been relayed over virtio-vhost-user to DPDK, and then to the TAP
on the host over virtio-net.

You'll want to press enter to shut down the QEMU VM now, because DPDK
pegs a CPU core (for reasons[*] unrelated to virtio-vhost-user that
are out of scope here).  Then you can remove the TAP device:

    # ip link delete qemutap


Nix expression for the demo
---------------------------

# SPDX-License-Identifier: MIT OR Apache-2.0
# SPDX-FileCopyrightText: 2021 Alyssa Ross

let
  pinned = builtins.fetchTarball {
    url = "https://github.com/NixOS/nixpkgs/tarball/b14062b75c4e8ef4dd4110282f7105be87f681d7";
    sha256 = "1hzs0w6pcwwbzl2gkqyk46yrzizzm03mph4kggws02a6vlwphsib";
  };
in

{ pkgs ? import pinned {} }:

with pkgs;

rec {
  linux = pkgs.linux.override {
    structuredExtraConfig = with lib.kernel; {
      "9P_FS" = yes;
      NET_9P = yes;
      NET_9P_VIRTIO = yes;
      PACKET = yes;
      VFIO = yes;
      VFIO_NOIOMMU = yes;
      VFIO_PCI = yes;
      VIRTIO_NET = yes;
      VIRTIO_PCI = yes;
    };
  };

  dpdk = stdenv.mkDerivation {
    name = "dpdk-virtio-vhost-user";

    src = fetchFromGitHub {
      owner = "ndragazis";
      repo = "dpdk-next-virtio";
      rev = "0a46582dc1d02c0dc5069347ffff1a64239385f2";
      sha256 = "169cxdps9k764jj420q44262x3291h2jcqsbrh7038hqjczjkgif";
    };

    buildInputs = [ numactl ];

    configurePhase = ''
      runHook preConfigure
      make $makeFlags defconfig
      runHook postConfigure
    '';

    enableParallelBuilding = true;

    RTE_KERNELDIR = "${linux.dev}/lib/modules/${linux.modDirVersion}/build";

    NIX_CFLAGS_COMPILE = [
      "-Wno-error=implicit-fallthrough"
      "-Wno-error=incompatible-pointer-types"
    ];

    makeFlags = [
      "RTE_OUTPUT=$(out)/lib"
      "kerneldir=$(out)/lib/modules/${linux.modDirVersion}/build"
      "prefix=$(out)"
    ];

    inherit (pkgs.dpdk) meta;
  };

  # DPDK is huge!  We just need one program from it.
  testpmd = runCommandNoCC "testpmd" {} ''
    mkdir -p $out/bin
    cp ${dpdk}/bin/testpmd $out/bin
  '';

  # qemu has changed build system since the virtio-vhost-user branch
  # was last updated, so it's simpler to just make a new derivation
  # and inherit the bits that are the same than to override the
  # existing one.
  qemu = stdenv.mkDerivation {
    name = "qemu-virtio-vhost-user";

    src = fetchFromGitHub {
      owner = "ndragazis";
      repo = "qemu";
      rev = "f9ab08c0c8cfc58036ed95b895f9780397448071";
      sha256 = "0p6v4i7gj70d6x7s28x3i3x9z8vlswcbbqdwfbhlx87bbnxjrn3b";
      fetchSubmodules = true;
    };

    enableParallelBuilding = true;

    nativeBuildInputs =
      lib.subtractLists [ ninja meson ] qemu_kvm.nativeBuildInputs;

    postPatch = ''
      sed -i '/$(INSTALL_DIR) "$(DESTDIR)$(qemu_localstatedir)/d' Makefile

      # The virtio-vhost-user implementation tries to allocate a huge
      # PCI BAR that's bigger than some CPUs can support!  If you see
      # a kernel panic in vp_reset(), lower this further.
      substituteInPlace hw/virtio/virtio-vhost-user-pci.c \
        --replace '1ULL << 36' '1ULL << 34'
    '';

    inherit (qemu_kvm) buildInputs configureFlags meta;
  };

  qemuInitramfs = makeInitrd {
    contents = [ {
      symlink = "/init";
      object = writeScript "init" ''
        #!${busybox}/bin/sh -eux
        export PATH=${busybox}/bin

        mkdir -p /nix/store /run /var
        mount -t sysfs none /sys
        mount -t proc none /proc
        mount -t tmpfs none /run
        mount -t devtmpfs none /dev
        mkdir /dev/hugepages
        mount -t hugetlbfs none /dev/hugepages
        ln -s /run /var

        # Unbind the virtio-net (host TAP) and virtio-vhost-user devices
        # from their default drivers, since we'll be passing them
        # through to DPDK.
        echo 0000:00:04.0 > /sys/bus/pci/devices/0000:00:04.0/driver/unbind
        echo 0000:00:05.0 > /sys/bus/pci/devices/0000:00:05.0/driver/unbind

        # Tell the vfio-pci driver it can support virtio-net and
        # virtio-vhost-user devices.  Since our devices are not
        # bound to any driver at the moment, doing this will bind
        # them to vfio-pci automatically.
        echo 1af4 1000 > /sys/bus/pci/drivers/vfio-pci/new_id
        echo 1af4 1017 > /sys/bus/pci/drivers/vfio-pci/new_id

        echo 256 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

        ${testpmd}/bin/testpmd \
            -l 0-1 \
            -w 0000:00:05.0 \
            --vdev net_vhost0,iface=0000:00:05.0,virtio-transport=1 \
            -w 0000:00:04.0

        poweroff -f
      '';
    } ];
  };

  qemuVm = writeShellScript "qemu-vm" ''
    exec ${qemu}/bin/qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 1G \
      -M q35,kernel-irqchip=split \
      -initrd ${qemuInitramfs}/initrd \
      -netdev tap,id=net0,ifname=qemutap,script=no,downscript=no \
      -device virtio-net-pci,netdev=net0,addr=04 \
      -chardev socket,id=chardev0,path="$XDG_RUNTIME_DIR/vhost-user0.sock",server,nowait \
      -device virtio-vhost-user-pci,addr=05,chardev=chardev0 \
      -kernel ${linux}/${stdenv.hostPlatform.linux-kernel.target} \
      -append "console=ttyS0 vfio.enable_unsafe_noiommu_mode=1" \
      -nographic
  '';

  # Can't use overrideAttrs because of cargoSha256.
  crosvm = rustPlatform.buildRustPackage rec {
    name = "crosvm-virtio-vhost-user";

    src = fetchFromGitiles {
      url = "https://chromium.googlesource.com/chromiumos/platform/crosvm";
      rev = "8a7e4e902a4950b060ea23b40c0dfce7bfa1b2cb";
      sha256 = "1lm6psp0xakb66nhgmmh94valc4wzbb967chk80msk8bcvsfpdn4";
    };

    unpackPhase =
      let origSrc = pkgs.crosvm.passthru.src;
      in builtins.replaceStrings
        [ "${origSrc}" origSrc.name ] [ "$src" src.name ]
        pkgs.crosvm.unpackPhase;

    cargoPatches = [
      (fetchpatch {
        url = "https://spectrum-os.org/lists/archives/spectrum-devel/20210512170812.192540-2-hi@alyssa.is/raw";
        sha256 = "0yzqrpgq35s9wxvbf9s3dgs5cpyxgdc5hr14hsdjr0gd18a6camg";
      })
    ];

    patches = pkgs.crosvm.patches ++ [
      (fetchpatch {
        url = "https://spectrum-os.org/lists/archives/spectrum-devel/20210512170812.192540-3-hi@alyssa.is/raw";
        sha256 = "0g2rvqqa4lvq7bjq0s1ynsjx7lmrxql7lsdv8wyzb7d2z9j6mj13";
      })
      (fetchpatch {
        url = "https://spectrum-os.org/lists/archives/spectrum-devel/20210512170812.192540-4-hi@alyssa.is/raw";
        sha256 = "051sz87i8kzc5sbygk2bpiqp4g32y9fxswg2yax1nd3lg4rxh43r";
      })
      (fetchpatch {
        url = "https://spectrum-os.org/lists/archives/spectrum-devel/20210512170812.192540-5-hi@alyssa.is/raw";
        sha256 = "1jpas65masn2xg9jxha16vi0y7scarzhl221y9wxh4chi4aa4m3f";
      })
    ];

    cargoSha256 = "07yizbhs64jrb05fq5g7sx812xbz2989bsficacq5l19ziax5164";

    passthru = pkgs.crosvm.passthru // { inherit src; };

    inherit (pkgs.crosvm)
      sourceRoot postPatch nativeBuildInputs buildInputs preBuild
      postInstall CROSVM_CARGO_TEST_KERNEL_BINARY meta;
  };

  crosvmInitramfs = makeInitrd {
    contents = [ {
      symlink = "/init";
      object = writeScript "init" ''
        #!${busybox}/bin/sh -eux
        export PATH=${busybox}/bin

        mount -t sysfs none /sys
        mount -t proc none /proc

        ip link set eth0 up
        udhcpc -n || :

        reboot -f
      '';
    } ];
  };

  crosvmVm = writeShellScript "crosvm-vm" ''
    # In our patched crosvm, supplying --mac without --host_ip or
    # --netmask will put it into vhost-user mode.
    exec ${crosvm}/bin/crosvm run \
      --mac 0A:B3:EC:FF:FF:FF \
      -i ${crosvmInitramfs}/initrd \
      ${linux}/${stdenv.hostPlatform.linux-kernel.target}
  '';
}
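
For the real driver isolation use case mentioned above, the
TAP-backed virtio-net device in qemuVm would be replaced by a
physical NIC passed through with VFIO.  I haven't included that here,
but the rough shape would be something like the following.  (The PCI
address 0000:01:00.0 is a placeholder for the NIC's real address, and
this assumes the IOMMU is enabled and the NIC is alone in its IOMMU
group.)  On the host:

    # modprobe vfio-pci
    # echo vfio-pci > /sys/bus/pci/devices/0000:01:00.0/driver_override
    # echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
    # echo 0000:01:00.0 > /sys/bus/pci/drivers_probe

Then, in qemuVm, the -netdev and -device virtio-net-pci options would
be replaced with something like:

    -device vfio-pci,host=01:00.0,addr=04

The guest init script would also need the NIC's vendor and device IDs
in its new_id lines, instead of the virtio-net ones, so that DPDK can
claim it.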