Last week, I'd just finished getting the cloud-hypervisor vhost-user-net frontend code to build as part of crosvm, and the next step was testing it.
I wrote some hacky code that replaced the virtio-net device creation in crosvm with an instance of the ported vhost-user-net code. When I booted crosvm, there were a few simple oversights of mine to fix, as expected, but even with those addressed it still didn't quite work. The VM boots, sees a network interface, and even communicates with the vhost-user-net backend -- but the vhost-user-net code is never notified that it has traffic waiting, so nothing ever gets processed. Unsure what to do about this, I decided to turn to cloud-hypervisor and look at how the code behaved there.
I wanted to try running the cloud-hypervisor v-u-n backend I was using for testing (it's much simpler than DPDK -- it just sends traffic to a TAP device) with QEMU as the frontend. QEMU is a VMM I'm much more familiar with than cloud-hypervisor, and I thought it would be useful to have a known-working frontend/backend combination to compare against.
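For reference, the QEMU side of such a setup looks roughly like this sketch (the socket path, disk image, and memory size are made up; the memfd memory backend is needed because vhost-user requires guest RAM to live in shared memory the backend can map):

```shell
# Hypothetical invocation: point QEMU's vhost-user-net frontend at the
# backend's UNIX socket. The backend must already be listening on the socket.
qemu-system-x86_64 \
    -machine q35,accel=kvm \
    -m 1G \
    -object memory-backend-memfd,id=mem0,size=1G,share=on \
    -numa node,memdev=mem0 \
    -chardev socket,id=char0,path=/tmp/vhost-user-net.sock \
    -netdev vhost-user,id=net0,chardev=char0 \
    -device virtio-net-pci,netdev=net0 \
    -drive file=guest.img,format=raw
```

Without the shared (`share=on`) memory backend, QEMU will refuse to start the vhost-user netdev, since the backend process has to access the guest's virtqueues directly.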
I had some problems, though, because apparently nobody had ever wanted to use QEMU with the cloud-hypervisor vhost-user-net backend before -- or if they had, they hadn't wanted to enough to make it work. The cloud-hypervisor backend didn't implement the vhost-user spec correctly in a few subtle ways that made it incompatible with QEMU. I won't explain every subtle issue, but I ended up writing a few patches for cloud-hypervisor and the "vhost" crate it depends on (that is in the process of being moved under the rust-vmm umbrella).
One interesting issue I will go into a little detail on: the wording in the spec was a little unclear, and QEMU interpreted it one way and cloud-hypervisor the other. I ended up sending an email to the author of the spec asking for clarification. He answered my question, and we discussed how the wording could be improved. He liked my second attempt at improving the wording and asked me to send a patch -- though preferably not right now, because QEMU is currently gearing up for a release, scheduled for next week if everything goes well.
Since I wrote these cloud-hypervisor patches, and had to test them, I ended up having to learn how to use cloud-hypervisor anyway to make sure I hadn't broken it in fixing the backend up to work with QEMU. Oh well.
Once this was done, I could use both QEMU and cloud-hypervisor with the backend, but still not crosvm. It was a little more complex than that, though. When I ported the v-u-n code to crosvm, I ported the first version that was added to the cloud-hypervisor tree, rather than the latest. The theory was that the earlier version would be closer to crosvm, since cloud-hypervisor would have had less time to diverge, and that once I had it working I could apply the later changes gradually. What I didn't account for is that the initial version of the v-u-n frontend in cloud-hypervisor didn't really work properly, and needed some time to bake before it did. Having now had this experience, I think it would be better to port the latest version and accept that the porting itself might be harder, because the end result is more likely to work.
While bisecting cloud-hypervisor to see if I could figure out when the v-u-n frontend started working properly, I encountered a large range of commits that I couldn't build any more, because Cargo couldn't resolve a git dependency. The dependency was locked to a commit that was no longer on the branch it had been on when the cloud-hypervisor commit was made. Despite knowing the exact commit it needed, Cargo fetched the branch the commit used to be on, which no longer contained it. This is because it is generally not possible to fetch arbitrary commits with git: by default, a server only lets you fetch commits that are branch or tag tips. Some servers, like GitHub, do allow it, though, and I wondered why Cargo wouldn't at least fall back to trying that.
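To illustrate with a throwaway pair of local repositories (all names here are made up): a plain `git fetch <hash>` is refused unless the server opts in via `uploadpack.allowAnySHA1InWant`, which is the knob GitHub effectively enables. Protocol v2 later relaxed this restriction, so the demo pins the old protocol, which is also roughly what Cargo's git library speaks:

```shell
set -e
# Build a "server" repo with two commits, and note the older, non-tip commit.
git init -q server
git -C server -c user.name=t -c user.email=t@t commit -q --allow-empty -m first
git -C server -c user.name=t -c user.email=t@t commit -q --allow-empty -m second
OLD=$(git -C server rev-parse HEAD~1)   # a commit that is not a branch tip

git init -q client
# Fetching the bare commit hash is refused with the default server config...
git -C client -c protocol.version=0 fetch ../server "$OLD" 2>/dev/null \
    && echo "fetched" || echo "refused"
# ...but succeeds once the server opts in to serving arbitrary commits.
git -C server config uploadpack.allowAnySHA1InWant true
git -C client -c protocol.version=0 fetch ../server "$OLD" 2>/dev/null \
    && echo "fetched by hash"
```

The first fetch prints "refused" and the second "fetched by hash"; afterwards the client's object database contains the commit even though it never fetched a branch.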
As it turns out, it actually couldn't do that, though! Cargo uses libgit2, and libgit2 doesn't support fetching arbitrary commits. So I wrote a quick patch to libgit2 to support this. It's only a partial implementation, though, because I don't find libgit2 to be a particularly easy codebase to work in (although it's better than git!). So I'm hoping somebody who knows more about it than me will help me figure out how to finish it.
Next week, I'm hoping that I'll be able to get vhost-user-net in crosvm working. I think this will probably mean porting the code again, using the latest version. Which is a bit of a shame, but at least I have an idea of what to do next.
I am, overall, feeling pretty optimistic, though. I'm pretty confident that we can get some sort of decent, if imperfect, network hardware isolation even if virtio-vhost-user isn't ready yet, which was something I was worried about before. I don't want to go into much detail now, because this is already a long email (and a day late, because I was tired yesterday). But essentially: we could forward the network device to a VM that runs the driver, and forward traffic back to the host over virtio-net. The host could handle this either in kernelspace or in userspace with DPDK, but the important thing is that the only network driver it would need to support is virtio-net. No talking to hundreds of different Wi-Fi cards and hoping that none of the drivers have a vulnerability. So, not perfect compared to proper guest<->guest networking, but a step in the right direction, and one that should be as simple as possible to upgrade to virtio-vhost-user once that becomes possible.