Feels like there's a pretty clear path forward. Nice feeling. :)
Last week, as I wrote TWiS, I had just discovered virtio-vhost-user, which looked like a very promising mechanism for getting a VM to take care of networking for other VMs. This week, I've been researching it further, and trying to test and evaluate it.
The first thing I tried to do, naturally, was to build the patched QEMU tree and boot a VM with a virtio-vhost-user device attached. This was not as easy as I'd hoped, because adding the virtio-vhost-user device to my QEMU command line made the VM kernel panic at boot, with an error message about an invalid memory access. I spent most of the week trying to figure this out -- I wasn't doing anything different to the example on the QEMU wiki, so it should have worked, and it felt like if I could just get past whatever was going wrong here, it would be worth it, because virtio-vhost-user otherwise seems so suited for what we need here. I emailed the patch author, but he didn't know what was up either.
An early breakthrough came when I got frustrated with kernel builds taking hours on my 8-year-old laptop, and so decided to work on a more powerful computer instead. Once I got everything set up on that computer, I started up the VM, and it worked. Perhaps in setting it up over here I'd done something different? I copied over the exact VM disk/kernel/initrd/command line that I was running on my laptop, and the other computer booted it just fine. I had -cpu host in the QEMU command line, so I thought maybe the different kind of virtual CPU was causing it. Tried setting it to a specific value on both machines, and still the laptop VM panicked and the other didn't. So it sounded like whether it worked or not depended on the host hardware.
I put together a Nix derivation that would automatically build the custom QEMU and output a script that would run a VM, and then asked people in #spectrum to test it out on various computers. After getting some further data, a pattern started to emerge, where Intel processors Ivy Bridge and older would fail, and Skylake and newer would succeed (I didn't encounter any AMD processors that failed, nor did I have data at the time for generations between Ivy Bridge and Skylake). This theory had a convenient explanation for why nobody else had seen this problem -- I doubt people at Red Hat are working on 7-year-old hardware.
This was a good clue, but still didn't put me much closer to having a working system. I do have a more recent laptop around, but for reasons that are out of scope here it would be very inconvenient to decide to just move over to it. I could see that the kernel was panicking the first time it tried to access the PCI BARs of the virtio-vhost-user, which led me to believe that the problem was probably in how that memory was being set up. I found the function that did that, and stared at it for a long time. I tried to read the rest of the QEMU code, but it became clear that my domain knowledge here isn't good enough to be able to keep track of what's meant to be happening. I added some debug prints, which were vaguely helpful in making that understanding a little better.
I was hoping to find the guest address each PCI BAR was mapped to so that I could check the kernel was trying to write to the right location, but didn't manage to do that. While attempting to, though, I did add a debug print that printed the size of each PCI BAR as it was allocated. I noticed that most were small -- 16 MiB at most -- but one was huge, at 64 GiB! The code that allocated this BAR was part of the function I'd been staring at. As far as I could tell, the choice of size was pretty arbitrary -- this big memory region was used as backing memory for all sorts of small objects on the fly. On a whim, I tried changing the BAR size from 1ULL << 36 to 1ULL << 26, and recompiled QEMU. The VM booted.
The comment above the bar_size definition that I'd been looking at for so long said:
    /* TODO If the BAR is too large the guest won't have address space to map
     * it! */
I don't know if that's exactly what went wrong here, though. I suspect it's more like the host architecture doesn't have enough address space? The affected machines all reported a 36-bit physical address size and a 48-bit virtual address size. So maybe what's happening is that the processor interprets PCI addresses in the hardware-assisted VM as physical addresses, and therefore runs out of space because all of it is taken up by this one PCI BAR? I'm not really sure. Lowering the BAR size to 2^35 or 2^34 (it has to be a power of two), depending on the QEMU version, made the problem go away, and that's good enough for now.
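To make the arithmetic behind that guess concrete, here is a small Python sketch. It only checks the numbers, not the actual cause: the SAMPLE line is a made-up stand-in for the "address sizes" line in /proc/cpuinfo on one of the affected machines, and both function names are hypothetical helpers, not anything from QEMU.

```python
import re

def parse_address_sizes(cpuinfo_text):
    """Return (physical_bits, virtual_bits) from /proc/cpuinfo-style text."""
    match = re.search(
        r"address sizes\s*:\s*(\d+) bits physical, (\d+) bits virtual",
        cpuinfo_text,
    )
    if match is None:
        raise ValueError("no 'address sizes' line found")
    return int(match.group(1)), int(match.group(2))

def bar_share_of_address_space(bar_bits, phys_bits):
    """Fraction of a 2**phys_bits-byte address space covered by a 2**bar_bits-byte BAR."""
    return (1 << bar_bits) / (1 << phys_bits)

# Assumed sample, matching the sizes the affected machines reported.
SAMPLE = "address sizes\t: 36 bits physical, 48 bits virtual\n"

phys_bits, virt_bits = parse_address_sizes(SAMPLE)
for bar_bits in (36, 35, 34):
    share = bar_share_of_address_space(bar_bits, phys_bits)
    print(f"2^{bar_bits}-byte BAR covers {share:.0%} of a {phys_bits}-bit space")
```

A 2^36-byte BAR is exactly as large as a 36-bit physical address space, which would leave no room for anything else. That is consistent with the guess above, but doesn't prove it.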
I'm not very enthusiastic about this up-front allocation of a huge amount of memory that might not even fit in the available address space. I don't know if there's a better way of doing it in this case, but I certainly hope so. In general I think this perhaps demonstrates why this code is not considered suitable for "production" yet. The bet I'm taking here is that by the time Spectrum is further along, things will have moved on for virtio-vhost-user too. As I said, at some point we will want to implement it in crosvm to avoid having QEMU in the TCB, but it would be a bad idea to do that now while virtio-vhost-user is still going through the back-and-forth of making its way into the Virtio spec.
https://wiki.qemu.org/Features/VirtioVhostUser
https://firstname.lastname@example.org/T/#u
https://github.com/ndragazis/qemu/blob/f9ab08c0c8/hw/virtio/virtio-vhost-use...
Once I was able to boot a VM with the virtio-vhost-user device, I tried to connect another QEMU VM to it through vhost-user -- I'll want to have this working first as a reference before I start porting Cloud Hypervisor's vhost-user implementation to crosvm. But the "frontend" (vhost-user) QEMU process hung waiting for a reply on the vhost-user socket from the backend one. Not really knowing what to do about this, I decided that maybe I'd been a bit too ambitious in going straight for vhost-user <-> virtio-vhost-user when I'd never actually used vhost-user before, so maybe I should try a more conventional vhost-user setup first.
As far as I can tell, vhost-user is usually used for connecting a VM to a userspace networking stack. And usually, this networking stack is DPDK, the "Data Plane Development Kit". DPDK was also used in the virtio-vhost-user examples, so I figured my next step would be to try it there as well, and therefore it was worth taking the time to learn how to do a very basic setup with it.
Quick-start-style documentation for this was pretty lacking, but I did eventually manage to make this work. Here's what I did, for my own future reference as much as anything else:
(1) Make some hugepages available. 1GiB for DPDK and 1GiB for QEMU:
echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
(2) Take my ethernet interface offline so it could be used with DPDK:
nmcli d disconnect enp0s25
(3) Load the vfio-pci module, which allows PCI devices to be exported to userspace rather than managed by the kernel:
modprobe vfio-pci
(4) Export the ethernet interface:
usertools/dpdk-devbind.py -b vfio-pci enp0s25
(5) Run testpmd, a program that comes with DPDK. It seems to be mostly used for debugging and tracing, but with no special arguments it acts as a simple packet forwarder. Here I create a vhost-user socket, and forward traffic between vhost-user and my ethernet interface:
build/app/dpdk-testpmd -l 0,1 -w 00:19.0 \
    --vdev net_vhost0,iface=/run/vhost-user0.sock
The -w value is the PCI address of the ethernet interface. Note how "00:19" corresponds to "p0s25". (19 in hex is 25 in decimal.)
(6) Start a VM. The relevant QEMU flags appear to be:
-chardev socket,id=char0,path=/run/vhost-user0.sock \
-netdev type=vhost-user,id=net0,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=net0 \
-object memory-backend-file,id=mem0,size=1024M,mem-path=/dev/hugepages,share=on \
-numa node,memdev=mem0 \
-mem-prealloc
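The relationship between the PCI address passed to -w and the interface name can be written down as a small Python sketch. This assumes the simple p<bus>s<slot>[f<function>] naming pattern with PCI domain 0, which is what "enp0s25" looks like. Other naming schemes aren't handled, and pci_to_ifname is a hypothetical helper, not part of DPDK.

```python
def pci_to_ifname(pci_address, prefix="en"):
    """Guess the interface name for a PCI address like '00:19.0'.

    Assumes the p<bus>s<slot>[f<function>] pattern and PCI domain 0.
    """
    fields = pci_address.split(":")
    if len(fields) == 3:
        domain, bus, rest = fields
        if int(domain, 16) != 0:
            raise ValueError("non-zero PCI domains are not handled in this sketch")
    elif len(fields) == 2:
        bus, rest = fields
    else:
        raise ValueError(f"unrecognised PCI address: {pci_address!r}")

    slot, function = rest.split(".")
    # PCI addresses are written in hex; the interface name uses decimal.
    name = f"{prefix}p{int(bus, 16)}s{int(slot, 16)}"
    if int(function, 16) != 0:
        name += f"f{int(function, 16)}"
    return name

print(pci_to_ifname("00:19.0"))  # prints enp0s25
```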
I figured this all out mostly from a guide for a DPDK benchmark. I have not yet experimented with variations on the QEMU flags. I'm not sure if all the memory flags are required -- -mem-prealloc might just be there because it was important for a benchmark, for example.
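One thing that does have to line up is the hugepage count from step (1) and the size=1024M passed to QEMU. A quick Python sketch of that arithmetic, assuming 2 MiB hugepages (the hugepages-2048kB directory) and the 1 GiB each for DPDK and QEMU mentioned above. hugepages_needed is a hypothetical helper name.

```python
MIB = 1 << 20
GIB = 1 << 30

def hugepages_needed(total_bytes, page_bytes=2 * MIB):
    """Number of hugepages needed to cover total_bytes, rounded up."""
    return -(-total_bytes // page_bytes)  # ceiling division

dpdk_bytes = 1 * GIB      # what I set aside for DPDK
qemu_bytes = 1024 * MIB   # size=1024M in the QEMU flags

print(hugepages_needed(dpdk_bytes + qemu_bytes))  # prints 1024
```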
So this is the point I'm at with this exploration. Next up, I'll be trying with DPDK inside a VM with a virtio-vhost-user device. I think that maybe, despite the virtio-vhost-user device showing up as an ethernet device inside the VM, it needs some special support which is available for DPDK as a patchset, but that has not been written for the kernel yet. I was a bit worried about this, because unlike the kernel, DPDK isn't going to have things like Wi-Fi drivers for all sorts of different hardware, and so using DPDK instead of the kernel network stack would be a problem. But then I learned that DPDK has a component called the Kernel NIC Interface (KNI) which allows it to use network interfaces from the kernel, so a hybrid approach would be possible, and is what I think we'll end up using for now. Then, once virtio-vhost-user is a bit more mature, a kernel driver will probably show up, and we can use that instead and drop DPDK.
I was having a conversation about Spectrum yesterday, and I found myself sending over a bunch of links to articles and papers that I often find myself referring to when talking to somebody about Spectrum. This made me think that maybe there should be some place where we keep all these relevant articles. So I mined the IRC logs, the TWiS archive, and my blog, and added whatever I could pull from my brain, and wrote a Spectrum bibliography, containing 27 links to interesting articles and papers that are particularly relevant to Spectrum.
This isn't on the website quite yet, but I did send this as a patch to the mailing list, if you want an early look.
I also posted a patch to fix a minor issue where I'd mistakenly used ".." instead of "." as href values, to no user-visible effect.
On Monday, I had a call with the Free Software Foundation Europe. They're a part of NGI Zero (where my funding comes from), and they are promoting their new "REUSE" specification for license information in free software projects to NGI Zero projects. It basically covers standardised per-file license and copyright annotations, and a standard way of including license texts.
I think this is really cool! It's something I've been unsure of how to handle because it's all vague conventions that are different in different circles, and it's nice to see something formalised about it. They also have an automated tool for checking compliance and semi-automatically adding license information, which is great!
So I'm enthusiastically adopting the REUSE specification. I decided that our smaller, first-party repositories (the documentation, the website, etc.) would be a good place to get started, and so I posted a patch that makes the documentation repository REUSE-compliant.
I posted a patch to make mktuntap REUSE-compliant.
The thing that's most on my mind this week is the extent to which I'm learning about and working on software like QEMU and DPDK that I don't see having a place in Spectrum in the long run. It's counterintuitive, but this is definitely worth it. There's no point writing a kernel driver for virtio-vhost-user (should such a thing be required) right now, because if I use DPDK for now instead, at some point either virtio-vhost-user will end up not being the thing that gets adopted by the ecosystem and we'll have to move to something else, or (more likely) it gets widely adopted and somebody else writes a kernel driver. Similarly, using QEMU for network VMs is the smart choice even though I don't want it to end up in the TCB, because even though I'm probably going to end up implementing virtio-vhost-user in crosvm later, swapping out QEMU is going to be so easy later that it would be a very bad idea to implement that now in case virtio-vhost-user doesn't take off. But it still /feels/ weird to be using QEMU for this stuff, you know?