This Week in Spectrum, 2020-W30

27 Jul 2020

      Feels like there's a pretty clear path forward.  Nice feeling. :)

QEMU
----

Last week, as I wrote TWiS, I had just discovered virtio-vhost-user,
which looked like a very promising mechanism for getting a VM to take
care of networking for other VMs.  This week, I've been researching it
further, and trying to test and evaluate it.

The first thing I tried to do, naturally, was to build the patched QEMU
tree and boot a VM with a virtio-vhost-user device attached.  This was
not as easy as I'd hoped, because adding the virtio-vhost-user device to
my QEMU command line made the VM kernel panic at boot, with an error
message about an invalid memory access.  I spent most of the week trying
to figure this out -- I wasn't doing anything different to the
example[1] on the QEMU wiki, so it should have worked, and it felt like
if I could just get past whatever was going wrong here, it would be
worth it, because virtio-vhost-user otherwise seems so suited for what
we need here.  I emailed the patch author[2], but he didn't know what
was up either.

An early breakthrough came when I got frustrated with kernel builds
taking hours on my 8-year-old laptop, and so decided to work on a more
powerful computer instead.  Once I got everything set up on that
computer, I started up the VM, and it worked.  Perhaps in setting it up
over here I'd done something different?  I copied over the exact VM
disk/kernel/initrd/command line that I was running on my laptop, and the
other computer booted it just fine.  I had -cpu host in the QEMU command
line, so I thought maybe the different kind of virtual CPU was causing
it.  Tried setting it to a specific value on both machines, and still
the laptop VM panicked and the other didn't.  So it sounded like whether
it worked or not depended on the host hardware.

I put together a Nix derivation that would automatically build the
custom QEMU and output a script that would run a VM, and then asked
people in #spectrum to test it out on various computers.  After getting
some further data, a pattern started to emerge, where Intel processors
Ivy Bridge and older would fail, and Skylake and newer would succeed (I
didn't encounter any AMD processors that failed, nor did I have data at
the time for generations between Ivy Bridge and Skylake).  This theory
had a convenient explanation for why nobody else had seen this problem
-- I doubt people at Red Hat are working on 7-year-old hardware.

This was a good clue, but still didn't put me much closer to having a
working system.  I do have a more recent laptop around, but for reasons
that are out of scope here it would be very inconvenient to decide to
just move over to it.  I could see that the kernel was panicking the
first time it tried to access the PCI BARs of the virtio-vhost-user,
which led me to believe that the problem was probably in how that memory
was being set up.  I found the function that did that[3], and stared at
it for a long time.  I tried to read the rest of the QEMU code, but it
became clear that my domain knowledge here isn't good enough to be able
to keep track of what's meant to be happening.  I added some debug
prints, which were vaguely helpful in making that understanding a little
better.

I was hoping to find the guest address each PCI BAR was mapped to so
that I could check the kernel was trying to write to the right location,
but didn't manage to do that.  While attempting to, though, I did add a
debug print that printed the size of each PCI BAR as it was allocated.
I noticed that most were small -- 16 MiB at most, but one was huge, at
64 GiB!  The code that allocated this BAR was part of the function I'd
been staring at.  As far as I could tell, the choice of size was pretty
arbitrary -- this big memory region was used as backing memory for all
sorts of small objects on the fly.  On a whim, I tried changing the BAR
size from 1ULL << 36 to 1ULL << 26, and recompiled QEMU.  The VM booted.

The comment above the bar_size definition that I'd been looking at for
so long said:

/* TODO If the BAR is too large the guest won't have address space to map
 * it!
 */

I don't know if that's exactly what went wrong here, though.  I suspect
it's more like the host architecture doesn't have enough address space?
The affected machines all reported 36 bit physical address size, and 48
bit virtual address size.  So maybe what's happening is that the
processor interprets PCI addresses in the hardware-assisted VM as
physical addresses, and therefore runs out of space because all of it is
taken up by this one PCI BAR?  I'm not really sure.  Lowering the BAR
size to 2^35 or 2^34 (has to be a power of two) depending on the QEMU
version made the problem go away, and that's good enough for now.
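
To make the numbers concrete: a BAR of 1ULL << 36 bytes is 64 GiB, which
is exactly the size of a 36-bit physical address space, whereas 2^35
leaves half of it free.  Here's a rough sketch of that arithmetic as
shell functions.  The function names are made up for illustration, the
"BAR must be strictly smaller than the physical address space" rule is
only the guess described above rather than anything QEMU checks, and the
cpuinfo parsing assumes the x86 Linux "address sizes" line.

```shell
# Size in bytes of a region 2^bits bytes long, e.g. a BAR declared as
# 1ULL << bits.
pow2_bytes() {
    echo $((1 << $1))
}

# Hypothetical rule of thumb (an assumption, not something QEMU
# enforces): a BAR of 2^bar_bits bytes can only be mapped if it is
# strictly smaller than a physical address space of 2^phys_bits bytes.
#   $1: bar_bits, $2: phys_bits
bar_fits() {
    [ "$1" -lt "$2" ]
}

# Physical address width from a cpuinfo-style file.  Assumes the x86
# Linux format: "address sizes : 36 bits physical, 48 bits virtual".
#   $1: file to read (defaults to /proc/cpuinfo)
host_phys_bits() {
    sed -n 's/^address sizes[[:space:]]*: \([0-9]*\) bits physical.*/\1/p' \
        "${1:-/proc/cpuinfo}" | head -n 1
}
```

With these, bar_fits 36 "$(host_phys_bits)" would fail on a machine
reporting 36 bits physical, and bar_fits 35 36 succeeds.  This is only a
lower bound on what can go wrong: with some QEMU versions even 2^35 was
too large on the affected machines.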

I'm not very enthusiastic about this up-front allocation of a huge
amount of memory that might not even fit in the available address space.
I don't know if there's a better way of doing it in this case, but I
certainly hope so.  In general I think this perhaps demonstrates why
this code is not considered suitable for "production" yet.  The bet I'm
taking here is that by the time Spectrum is further along, things will
have moved on for virtio-vhost-user too.  As I said, at some point we
will want to implement it in crosvm to avoid having QEMU in the TCB, but
it would be a bad idea to do that now while virtio-vhost-user is still
going through the back-and-forth of making its way into the Virtio spec.

[1]: https://wiki.qemu.org/Features/VirtioVhostUser
[2]: https://lore.kernel.org/qemu-devel/87h7u1s5k1.fsf@alyssa.is/T/#u
[3]: https://github.com/ndragazis/qemu/blob/f9ab08c0c8/hw/virtio/virtio-vhost-use...

DPDK
----

Once I was able to boot a VM with the virtio-vhost-user device, I tried
to connect another QEMU VM to it through vhost-user -- I'll want to have
this working first as a reference before I start porting Cloud
Hypervisor's vhost-user implementation to crosvm.  But the "frontend"
(vhost-user) QEMU process hung waiting for a reply on the vhost-user
socket from the backend one.  Not really knowing what to do about this,
I decided that I'd probably been a bit too ambitious in going straight
for vhost-user <-> virtio-vhost-user when I'd never actually used
vhost-user before, and that I should try a more conventional vhost-user
setup first.

As far as I can tell, vhost-user is usually used for connecting a VM to
a userspace networking stack.  And usually, this networking stack is
DPDK, the "Data Plane Development Kit"[4].  DPDK was also used in the
virtio-vhost-user examples, so I figured my next step would be to try
it there as well, and that it was therefore worth taking the time to
learn how to do a very basic setup with it.

Quick-start-style documentation for this was pretty lacking, but I did
eventually manage to make this work.  Here's what I did, for my own
future reference as much as anything else:

(1) Make some hugepages available.  1 GiB for DPDK and 1 GiB for QEMU,
    as 1024 pages of 2 MiB each:

    echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

(2) Take my ethernet interface offline so it could be used with DPDK:

    nmcli d disconnect enp0s25

(3) Load the vfio-pci module, which allows PCI devices to be exported to
    userspace rather than managed by the kernel:

    modprobe vfio-pci

(4) Export the ethernet interface:

    usertools/dpdk-devbind.py -b vfio-pci enp0s25

(5) Run testpmd, a program that comes with DPDK.  It seems to be mostly
    used for debugging and tracing, but with no special arguments it
    acts as a simple packet forwarder.  Here I create a vhost-user
    socket, and forward traffic between vhost-user and my ethernet
    interface:

    build/app/dpdk-testpmd -l 0,1 -w 00:19.0 \
        --vdev net_vhost0,iface=/run/vhost-user0.sock

    The -w value is the PCI address of the ethernet interface.  Note how
    "00:19" corresponds to "p0s25".  (19 in hex is 25 in decimal.)

(6) Start a VM.  The relevant QEMU flags appear to be:

    -chardev socket,id=char0,path=/run/vhost-user0.sock \
    -netdev type=vhost-user,id=net0,chardev=char0,vhostforce \
    -device virtio-net-pci,netdev=net0 \
    -object memory-backend-file,id=mem0,size=1024M,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem0 \
    -mem-prealloc

I figured this all out mostly from a guide for a DPDK benchmark[5].  I
have not experimented with variations on the QEMU flags yet.  I'm
not sure if all the memory flags are required -- -mem-prealloc might
just be there because it was important for a benchmark, for example.
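
Two bits of arithmetic are hiding in the steps above: how many hugepages
to reserve, and how an interface name like enp0s25 maps to the PCI
address passed to -w.  A small sketch of both follows.  The function
names are invented here, and the name parsing assumes systemd-style
"enp<bus>s<slot>[f<function>]" names with decimal numbers.  On a real
system, reading the /sys/class/net/<iface>/device symlink is probably
the more robust way to find the address.

```shell
# Number of hugepages needed for a given amount of memory.
#   $1: memory in GiB, $2: hugepage size in kB (2048 for 2 MiB pages)
hugepages_needed() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

# Convert an interface name of the assumed form
# enp<bus>s<slot>[f<function>] (decimal numbers) into a PCI
# bus:slot.function address (hexadecimal).  Returns non-zero if the
# name does not have that form.
iface_to_pci() {
    bus=$(printf '%s\n' "$1" | sed -n 's/^enp\([0-9]*\)s[0-9]*.*/\1/p')
    slot=$(printf '%s\n' "$1" | sed -n 's/^enp[0-9]*s\([0-9]*\).*/\1/p')
    func=$(printf '%s\n' "$1" | sed -n 's/^enp[0-9]*s[0-9]*f\([0-9]*\).*/\1/p')
    [ -n "$bus" ] && [ -n "$slot" ] || return 1
    # A missing f<function> suffix means function 0.
    printf '%02x:%02x.%x\n' "$bus" "$slot" "${func:-0}"
}
```

Here hugepages_needed 2 2048 prints 1024, the value written to
nr_hugepages in step 1, and iface_to_pci enp0s25 prints 00:19.0, the
value given to -w in step 5.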

So this is the point I'm at with this exploration.  Next up, I'll be
trying with DPDK inside a VM with a virtio-vhost-user device.  I think
that maybe, despite the virtio-vhost-user device showing up as an
ethernet device inside the VM, it needs some special support which is
available for DPDK as a patchset, but that has not been written for the
kernel yet.  I was a bit worried about this, because unlike the kernel,
DPDK isn't going to have things like Wi-Fi drivers for all sorts of
different hardware, and so using DPDK instead of the kernel network
stack would be a problem.  But then I learned that DPDK has a component
called the Kernel NIC Interface (KNI) which allows it to use network
interfaces from the kernel, so a hybrid approach would be possible, and
is what I think we'll end up using for now.  Then, once
virtio-vhost-user is a bit more mature, a kernel driver will probably
show up, and we can use that instead and drop DPDK.

[4]: https://dpdk.org/
[5]: https://doc.dpdk.org/guides/howto/pvp_reference_benchmark.html?highlight=pvp

Website
-------

I was having a conversation about Spectrum yesterday, and I found myself
sending over a bunch of links to articles and papers that I often find
myself referring to when talking to somebody about Spectrum.  This made
me think that maybe there should be some place where we keep all these
relevant articles.  So I mined the IRC logs, the TWiS archive, and my
blog, and added whatever I could pull from my brain, and wrote a
Spectrum bibliography, containing 27 links to interesting articles and
papers that are particularly relevant to Spectrum.

This isn't on the website quite yet, but I did send this as a patch[6]
to the mailing list, if you want an early look.

I also posted a patch to fix a minor issue where I'd mistakenly used
".." instead of "."  as href values, to no user-visible effect[7].

[6]: https://spectrum-os.org/lists/archives/spectrum-devel/20200726045701.32259-1...
[7]: https://spectrum-os.org/lists/archives/spectrum-devel/20200726055410.20641-1...

Documentation
-------------

On Monday, I had a call with the Free Software Foundation Europe.
They're a part of NGI Zero (where my funding comes from), and they are
promoting their new "REUSE" specification[8] for license information in
free software projects to NGI Zero projects.  It basically covers
standardised per-file license and copyright annotations, and a standard
way of including license texts.

I think this is really cool!  It's something I've been unsure of how to
handle because it's all vague conventions that are different in
different circles, and it's nice to see something formalised about it.
They also have an automated tool[9] for checking compliance and
semi-automatically adding license information, which is great!
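
For a rough idea of what the per-file annotations look like, here's a
sketch that writes a file carrying the two SPDX tags REUSE expects and
checks for them with grep.  The copyright holder, license choice, and
both function names are placeholders made up for illustration, and the
check is far cruder than the real "reuse lint" command, which also
understands things like separate .license files.

```shell
# Crude check (illustration only): does the file carry both a
# copyright tag and a license tag in SPDX form?
has_reuse_tags() {
    grep -q 'SPDX-FileCopyrightText:' "$1" &&
        grep -q 'SPDX-License-Identifier:' "$1"
}

# Write an example script to the path in $1, with a placeholder
# copyright holder and license in its header comment.
write_example() {
    cat > "$1" <<'EOF'
# SPDX-FileCopyrightText: 2020 Jane Doe <jane@example.org>
# SPDX-License-Identifier: GPL-3.0-or-later

echo "hello"
EOF
}
```

Calling write_example on a path and then has_reuse_tags on the same path
succeeds, while a file with no header comment fails the check.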

So I'm enthusiastically adopting the REUSE specification.  I decided
that our smaller, first-party repositories (the documentation, the
website, etc.) would be a good place to get started, and so I posted a
patch[10] that makes the documentation repository REUSE-compliant.

[8]: https://reuse.software/
[9]: https://git.fsfe.org/reuse/tool
[10]: https://spectrum-os.org/lists/archives/spectrum-devel/20200726105527.27432-1...

mktuntap
--------

I posted a patch[11] to make mktuntap REUSE-compliant.

[11]: https://spectrum-os.org/lists/archives/spectrum-devel/20200726110123.30159-1...

The thing that's most on my mind this week is the extent to which I'm
learning about and working on software like QEMU and DPDK that I don't
see having a place in Spectrum in the long run.  It's counterintuitive,
but this is definitely worth it.  There's no point writing a kernel
driver for virtio-vhost-user (should such a thing be required) right
now.  If I use DPDK for now instead, then at some point either
virtio-vhost-user will turn out not to be the thing that gets adopted
by the ecosystem, and we'll have to move to something else, or (more
likely) it will get widely adopted and somebody else will write a
kernel driver.  Similarly, using QEMU for network VMs is the smart
choice even though I don't want it to end up in the TCB.  I'm probably
going to end up implementing virtio-vhost-user in crosvm later, but
swapping out QEMU at that point is going to be easy enough that it would
be a very bad idea to do that work now, in case virtio-vhost-user
doesn't take off.  But it still /feels/ weird to be using QEMU for this
stuff, you know?

Alyssa Ross