general high-level discussion about spectrum
From: Alyssa Ross <>
Subject: [DEMO] virtio-vhost-user between QEMU and crosvm
Date: Thu, 13 May 2021 12:41:03 +0000
Message-ID: <>

Virtio-vhost-user[1] is a promising virtualisation technology that allows
virtual devices that are exposed to VMs to themselves be implemented
in VMs.

Let's break down its name a bit to understand how it works:

 * Virtio[2] is a standard driver interface for virtualisation.
   Interfaces are available for many types of virtual device,
   e.g. virtio-net, virtio-blk, and virtio-scsi.  Typically, virtio
   devices are implemented by a virtual machine monitor (VMM).

 * Vhost[3] is a kernelspace implementation of virtio virtual devices,
   created for their performance benefit.  Instead of implementing the
   virtual devices itself, the VMM talks to the kernel implementation
   of them using a special ioctl protocol.

 * Vhost-user[4] allows another process to implement the vhost
   protocol, instead of the kernel, by using a UNIX socket instead of
   ioctls on a special character device.  This doesn't provide the raw
   performance of vhost, but it serves a different purpose -- it
   allows virtual devices to be implemented by external programs, in a
   standardised way so they're portable between VMMs.
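
To make the vhost-user case a little more concrete: every vhost-user
message sent over the socket starts with a 12-byte header made up of
three 32-bit fields (request, flags, size), optionally followed by a
payload.  The sketch below is only an illustration and not part of the
demo.  It prints the header of a GET_FEATURES request as it would
appear on a little-endian machine such as x86.  The function name is
made up for the example.

```shell
#!/bin/sh
# Illustration only: emit the 12-byte header of a vhost-user
# GET_FEATURES request, laid out for a little-endian machine.
# vhost_user_get_features_header is a hypothetical name.
vhost_user_get_features_header() {
    printf '\001\000\000\000'   # u32 request: VHOST_USER_GET_FEATURES (1)
    printf '\001\000\000\000'   # u32 flags: protocol version 1, no reply bits
    printf '\000\000\000\000'   # u32 size: no payload follows the header
}

# Show the bytes in hex, one byte per column.
vhost_user_get_features_header | od -An -tx1
```

With plain vhost-user these bytes travel over the UNIX socket between
two processes on the same host, which is why no fixed byte order has
to be agreed on.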

Virtio-vhost-user allows the program implementing the virtual device
to run in a VM of its own.  The VMM for that VM creates the vhost-user
socket, and relays the messages sent over it to its guest using
virtio.  This is exciting for Spectrum, because it would mean that the
host system doesn't have to interact with physical hardware directly
beyond the PCI level.  Instead, it can pass the hardware through to a
VM, which is responsible for implementing a virtual device backed by
that hardware, and that virtual device can then be exposed to other
VMs.

Last year I spent a while looking into virtio-vhost-user[5][6][7].
It's a long way from being ready to use, and it seems to be maturing
very slowly.  It might be useful to us eventually for driver
isolation, or something else might come along.  My conclusion from my
research was that we should decide later, once the ecosystem has had a
chance to develop.  But I wanted something to come out of the research
I did anyway, and so I've prepared a demonstration.


What the demo does

Any non-host-based virtual device implementation has to start with
taking the virtual device implementation out of the VMM running an
application VM, and vhost-user is the clear solution to this.
Vhost-user isn't supported by crosvm -- its focus
is on doing all the virtualisation required by Chromium OS, so there's
no need for it to allow other programs to provide virtual device
implementations.  So another part of the research I did was to try to
port the vhost-user implementation from cloud-hypervisor to crosvm,
which I was able to do successfully[8].  This has implications beyond
vhost-user, and even beyond crosvm, because it demonstrates that it's
practical to port features between rust-vmm[9] VMMs, which means we
don't have to worry about finding one that provides every feature we
need (which is just as well, because there isn't one).

So here I demonstrate not just a "standard" virtio-vhost-user setup
(to the extent that such a thing can exist at this early stage), but
also that my patched crosvm with vhost-user support is capable of
interoperating with the experimental virtio-vhost-user implementation
for QEMU, because both are speaking the standardised vhost-user
protocol.

The demo sets up two VMs.  One is run with my patched crosvm, and
expects a virtual ethernet device to be provided by a vhost-user
socket.  When it boots, it brings up its network interface, tries to
run a DHCP client, and then exits.  The other VM is run with Nikos
Dragazis and Stefan Hajnoczi's experimental virtio-vhost-user
implementation in QEMU[10].  It gets a standard virtual ethernet
device (backed by a TAP device on the host), and the virtio-vhost-user
device hooked up to the socket that crosvm will be connecting to.
Inside the VM, a userspace networking stack (DPDK, again modified to
support virtio-vhost-user by Nikos Dragazis[11]) implements the device
side of virtio-vhost-user, and forwards packets sent by crosvm's guest
to the virtual ethernet device backed by the host TAP.

+-----------------------------------------------------------------------------+
|                                                                             |
|                +------------------------+      +------------------------+   |
|                |                        |      |                        |   |
|                |   +----------------+   |      |   +----------------+   |   |
|                |   |                |   |      |   |                |   |   |
|   +-----+      |   |   +--------+   |   |      |   |                |   |   |
|   | TAP +------+---+---+  DPDK  +---+---+------+---+                |   |   |
|   +-----+      |   |   +--------+   |   |      |   |                |   |   |
|                |   |                |   |      |   |                |   |   |
|                |   |     Linux      |   |      |   |     Linux      |   |   |
|                |   +----------------+   |      |   +----------------+   |   |
|                |                        |      |                        |   |
|                |          QEMU          |      |         crosvm         |   |
|                +------------------------+      +------------------------+   |
|                                                                             |
|                                Linux                                        |
+-----------------------------------------------------------------------------+

A complicating factor is that the virtio-vhost-user implementation for
DPDK only supports outgoing traffic[12].  So packets coming from
crosvm will be relayed to the TAP, but not the other way around.  This
means that we can't just use ping inside the crosvm VM to verify that
the connection is working.  Instead, we have to tcpdump on the host
and verify that the packets the DHCP client inside the crosvm VM is
sending are arriving on the TAP.

For this to be useful for our intended purpose of isolating drivers
for physical devices, we'd pass through the device here rather than
using a TAP.  It would otherwise work exactly the same, but it's more
difficult to test it's working correctly.  (I have tested it though --
for the first version of this I got working last year, I verified it
worked by checking the logs of my local network's DHCP server.)


Running the demo

First, create a TAP device for QEMU to use:

	# ip tuntap add qemutap mode tap
	# ip link set qemutap up

Start tcpdump, so we can see if packets arrive on the TAP:

	# tcpdump -i qemutap

Start the QEMU VM:

	$ $(nix-build -A qemuVm /path/to/demo.nix)

When you see "Press enter to exit", DPDK is ready to receive a
virtio-vhost-user connection.
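
If you're scripting this, a small sanity check before starting crosvm
is to confirm that QEMU has created its vhost-user socket.  The sketch
below assumes the socket path from the QEMU command line in the Nix
expression ($XDG_RUNTIME_DIR/vhost-user0.sock); wait_for_socket is a
made-up helper name.  Note that the socket existing only shows that
QEMU is listening.  The "Press enter to exit" message is still the
signal that DPDK is ready.

```shell
#!/bin/sh
# Hypothetical helper: poll once per second, up to $2 times, for a UNIX
# socket to exist at path $1.  Succeeds as soon as the socket is found.
wait_for_socket() {
    path=$1
    tries=$2
    while [ "$tries" -gt 0 ]; do
        [ -S "$path" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    # Final check, so that a try count of 0 means "check once, don't wait".
    [ -S "$path" ]
}
```

For example, wait_for_socket "$XDG_RUNTIME_DIR/vhost-user0.sock" 30
would give QEMU up to about thirty seconds to create the socket.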

Start the crosvm VM:

	$ $(nix-build -A crosvmVm /path/to/demo.nix)

Once that VM boots, you should see some "BOOTP/DHCP" lines in the
tcpdump output.  This demonstrates that traffic from the crosvm guest
has been relayed over virtio-vhost-user to DPDK, and then to the TAP
on the host over virtio-net.
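
If you'd rather not watch the output by eye, the check can be
automated by looking for that same "BOOTP/DHCP" label in tcpdump's
output.  This is a minimal sketch; saw_dhcp is a made-up helper name,
and it relies only on tcpdump labelling DHCP packets the way described
above.

```shell
#!/bin/sh
# Hypothetical helper: read tcpdump output on stdin and succeed if at
# least one line is labelled as BOOTP/DHCP traffic.
saw_dhcp() {
    grep -q 'BOOTP/DHCP'
}
```

It could be used as "tcpdump -l -n -i qemutap | saw_dhcp && echo ok",
run as root like the tcpdump command above.  The -l flag makes tcpdump
line-buffer its output, so the match is seen as soon as the packet
arrives.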

You'll want to press enter to shut down the QEMU VM now, because DPDK
pegs a CPU core (for reasons[*] unrelated to virtio-vhost-user that
are out of scope here).

Then you can remove the TAP device:

	# ip link delete qemutap


Nix expression for the demo

# SPDX-License-Identifier: MIT OR Apache-2.0
# SPDX-FileCopyrightText: 2021 Alyssa Ross <>

let
  pinned = builtins.fetchTarball {
    url = "";
    sha256 = "1hzs0w6pcwwbzl2gkqyk46yrzizzm03mph4kggws02a6vlwphsib";
  };
in

{ pkgs ? import pinned {} }: with pkgs;

rec {
  linux = pkgs.linux.override {
    structuredExtraConfig = with lib.kernel; {
      "9P_FS" = yes;
      NET_9P = yes;
      NET_9P_VIRTIO = yes;
      PACKET = yes;
      VFIO = yes;
      VFIO_NOIOMMU = yes;
      VFIO_PCI = yes;
      VIRTIO_NET = yes;
      VIRTIO_PCI = yes;
    };
  };

  dpdk = stdenv.mkDerivation {
    name = "dpdk-virtio-vhost-user";

    src = fetchFromGitHub {
      owner = "ndragazis";
      repo = "dpdk-next-virtio";
      rev = "0a46582dc1d02c0dc5069347ffff1a64239385f2";
      sha256 = "169cxdps9k764jj420q44262x3291h2jcqsbrh7038hqjczjkgif";
    };

    buildInputs = [ numactl ];

    configurePhase = ''
      runHook preConfigure
      make $makeFlags defconfig
      runHook postConfigure
    '';

    enableParallelBuilding = true;

    RTE_KERNELDIR = "${}/lib/modules/${linux.modDirVersion}/build";


    makeFlags = [

    inherit (pkgs.dpdk) meta;
  };

  # DPDK is huge!  We just need one program from it.
  testpmd = runCommandNoCC "testpmd" {} ''
    mkdir -p $out/bin
    cp ${dpdk}/bin/testpmd $out/bin
  '';

  # qemu has changed build system since the virtio-vhost-user branch
  # was last updated, so it's simpler to just make a new derivation
  # and inherit the bits that are the same than to override the
  # existing one.
  qemu = stdenv.mkDerivation {
    name = "qemu-virtio-vhost-user";

    src = fetchFromGitHub {
      owner = "ndragazis";
      repo = "qemu";
      rev = "f9ab08c0c8cfc58036ed95b895f9780397448071";
      sha256 = "0p6v4i7gj70d6x7s28x3i3x9z8vlswcbbqdwfbhlx87bbnxjrn3b";
      fetchSubmodules = true;
    };

    enableParallelBuilding = true;

    nativeBuildInputs =
      lib.subtractLists [ ninja meson ] qemu_kvm.nativeBuildInputs;

    postPatch = ''
      sed -i '/$(INSTALL_DIR) "$(DESTDIR)$(qemu_localstatedir)/d' Makefile

      # The virtio-vhost-user implementation tries to allocate a huge
      # PCI bar, that's bigger than some CPUs can support!  If you see
      # a kernel panic in vp_reset(), lower this further.
      substituteInPlace hw/virtio/virtio-vhost-user-pci.c \
          --replace '1ULL << 36' '1ULL << 34'
    '';

    inherit (qemu_kvm) buildInputs configureFlags meta;
  };

  qemuInitramfs = makeInitrd {
    contents = [
      {
        symlink = "/init";
        object = writeScript "init" ''
          #!${busybox}/bin/sh -eux
          export PATH=${busybox}/bin

          mkdir -p /nix/store /run /var

          mount -t sysfs none /sys
          mount -t proc none /proc
          mount -t tmpfs none /run
          mount -t devtmpfs none /dev

          mkdir /dev/hugepages
          mount -t hugetlbfs none /dev/hugepages

          ln -s /run /var

          # Unbind the virtio-net (host TAP) and virtio-vhost-user devices
          # from their default drivers, since we'll be passing them
          # through to DPDK.
          echo 0000:00:04.0 > /sys/bus/pci/devices/0000:00:04.0/driver/unbind
          echo 0000:00:05.0 > /sys/bus/pci/devices/0000:00:05.0/driver/unbind

          # Tell the vfio-pci driver it can support virtio-net and
          # virtio-vhost-user devices.  Since our devices are not
          # bound to any driver at the moment, doing this will bind
          # them to vfio-pci automatically.
          echo 1af4 1000 > /sys/bus/pci/drivers/vfio-pci/new_id
          echo 1af4 1017 > /sys/bus/pci/drivers/vfio-pci/new_id

          echo 256 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

          ${testpmd}/bin/testpmd \
              -l 0-1 \
              -w 0000:00:05.0 \
              --vdev net_vhost0,iface=0000:00:05.0,virtio-transport=1 \
              -w 0000:00:04.0

          poweroff -f
        '';
      }
    ];
  };

  qemuVm = writeShellScript "qemu-vm" ''
    exec ${qemu}/bin/qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 1G \
        -M q35,kernel-irqchip=split \
        -initrd ${qemuInitramfs}/initrd \
        -netdev tap,id=net0,ifname=qemutap,script=no,downscript=no \
        -device virtio-net-pci,netdev=net0,addr=04 \
        -chardev socket,id=chardev0,path="$XDG_RUNTIME_DIR/vhost-user0.sock",server,nowait \
        -device virtio-vhost-user-pci,addr=05,chardev=chardev0 \
        -kernel ${linux}/${} \
        -append "console=ttyS0 vfio.enable_unsafe_noiommu_mode=1" \

  # Can't use overrideAttrs because of cargoSha256.
  crosvm = rustPlatform.buildRustPackage rec {
    name = "crosvm-virtio-vhost-user";

    src = fetchFromGitiles {
      url = "";
      rev = "8a7e4e902a4950b060ea23b40c0dfce7bfa1b2cb";
      sha256 = "1lm6psp0xakb66nhgmmh94valc4wzbb967chk80msk8bcvsfpdn4";
    };

    unpackPhase =
      let origSrc = pkgs.crosvm.passthru.src; in
      builtins.replaceStrings [ "${origSrc}" ] [ "$src" ]

    cargoPatches = [
      (fetchpatch {
        url = "";
        sha256 = "0yzqrpgq35s9wxvbf9s3dgs5cpyxgdc5hr14hsdjr0gd18a6camg";
      })
    ];

    patches = pkgs.crosvm.patches ++ [
      (fetchpatch {
        url = "";
        sha256 = "0g2rvqqa4lvq7bjq0s1ynsjx7lmrxql7lsdv8wyzb7d2z9j6mj13";
      })
      (fetchpatch {
        url = "";
        sha256 = "051sz87i8kzc5sbygk2bpiqp4g32y9fxswg2yax1nd3lg4rxh43r";
      })
      (fetchpatch {
        url = "";
        sha256 = "1jpas65masn2xg9jxha16vi0y7scarzhl221y9wxh4chi4aa4m3f";
      })
    ];

    cargoSha256 = "07yizbhs64jrb05fq5g7sx812xbz2989bsficacq5l19ziax5164";

    passthru = pkgs.crosvm.passthru // { inherit src; };

    inherit (pkgs.crosvm) sourceRoot postPatch nativeBuildInputs buildInputs
      preBuild postInstall CROSVM_CARGO_TEST_KERNEL_BINARY meta;
  };

  crosvmInitramfs = makeInitrd {
    contents = [
      {
        symlink = "/init";
        object = writeScript "init" ''
          #!${busybox}/bin/sh -eux
          export PATH=${busybox}/bin

          mount -t sysfs none /sys
          mount -t proc none /proc

          ip link set eth0 up

          udhcpc -n || :

          reboot -f
        '';
      }
    ];
  };

  crosvmVm = writeShellScript "crosvm-vm" ''
    # In our patched crosvm, supplying --mac without --host_ip or
    # --netmask will put it into vhost-user mode.
    exec ${crosvm}/bin/crosvm run \
        --mac 0A:B3:EC:FF:FF:FF \
        -i ${crosvmInitramfs}/initrd \
