Last week, I'd just finished getting the cloud-hypervisor vhost-user-net
frontend code to build as part of crosvm, and the next step was testing it.
I wrote some hacky code that replaced the virtio-net device creation in
crosvm with an instance of the ported vhost-user-net code. When I
booted crosvm, there were, as expected, some simple oversights of mine
that needed to be addressed, but once those were taken care of, it still
didn't quite work. The VM boots, sees a network interface, and even
communicates with the vhost-user-net backend! But the vhost-user-net
code never realises (or gets told) that it has traffic, and so the
traffic is never processed. Unsure of what to do about this, I decided
to turn to cloud-hypervisor and look at how the code ran there.
I wanted to try running the cloud-hypervisor v-u-n backend I was using
for testing (it's much simpler than DPDK -- it just sends traffic to a
TAP device) with QEMU as the frontend. QEMU is a VMM I'm much more
familiar with than cloud-hypervisor, and I thought it would be useful to
have a working frontend/backend combination to compare against.
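For reference, connecting QEMU to a vhost-user-net backend looks roughly
like this (the socket path, memory size, and IDs below are placeholders,
not the exact values I used; vhost-user needs the guest memory in a
shareable memory backend so the backend process can map it):

```shell
qemu-system-x86_64 \
    -enable-kvm -m 1G \
    -object memory-backend-file,id=mem,size=1G,mem-path=/dev/shm,share=on \
    -numa node,memdev=mem \
    -chardev socket,id=chr0,path=/tmp/vhost-user-net.sock \
    -netdev vhost-user,id=net0,chardev=chr0 \
    -device virtio-net-pci,netdev=net0
```

The chardev socket is the vhost-user control channel; the backend has to
be listening on it before QEMU starts.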
I had some problems, though, because apparently nobody had ever wanted
to use QEMU with the cloud-hypervisor vhost-user-net backend before --
or if they had, they hadn't wanted to enough to make it work. The
cloud-hypervisor backend didn't implement the vhost-user spec correctly
in a few subtle ways that made it incompatible with QEMU. I won't
explain every subtle issue, but I ended up writing a few patches
for cloud-hypervisor and the "vhost" crate it depends on (that is in the
process of being moved under the rust-vmm umbrella).
One interesting issue I will go into a little detail of was that the
wording in the spec was a little unclear, and QEMU interpreted it one
way, and cloud-hypervisor the other. I ended up sending an email to
the author of the spec asking for clarification. He answered my
question, and we discussed how the wording could be improved. He liked
my second attempt at improving the wording, and asked me to send a patch,
but preferably not right now, because QEMU is currently gearing up for a
release, scheduled for next week if everything goes well.
Since I wrote these cloud-hypervisor patches, and had to test them, I
ended up having to learn how to use cloud-hypervisor anyway to make sure
I hadn't broken it in fixing the backend up to work with QEMU. Oh well.
Once this was done, I could use both QEMU and cloud-hypervisor with the
backend, but still not crosvm. The crosvm situation turned out to be a
little more complex, though.
When I ported the v-u-n code to crosvm, I ported the first version of it
that was added to the cloud-hypervisor tree, rather than the latest
version. The theory here was that the earlier version would be closer
to crosvm, because cloud-hypervisor would have had less time to diverge.
Then, once I had that working, I could add on the later changes
gradually. What I didn't account for here is that the initial version
of the v-u-n frontend in cloud-hypervisor didn't really work properly,
and needed some time to bake before it did. So, having had this
experience, I now think it might be better to port the latest version,
and accept that the porting might be a bit harder, because the end
result is more likely to work.
While bisecting cloud-hypervisor to see if I could figure out when the
v-u-n frontend started working properly, I encountered a large section
of commits that I couldn't build any more, because Cargo couldn't
resolve a git dependency. The dependency was locked to a commit that
was no longer in the branch it had been in when the cloud-hypervisor
commit was from. Despite knowing the exact commit it needed, Cargo
fetched the branch the commit used to be on. This is because it is
generally not possible to fetch arbitrary commits with git. Some
servers, like GitHub, do, however, allow this, and I wondered why Cargo
wouldn't at least fall back to trying that.
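To illustrate what fetching an arbitrary commit looks like (this is a
sketch of the git mechanism, not of what Cargo does): the server has to
opt in with the uploadpack.allowAnySHA1InWant option, which GitHub
enables. Using a local toy repository standing in for the server:

```shell
set -e
tmp=$(mktemp -d)

# A toy "server" repo with two commits; after the second commit, the
# first one is no longer the tip of any branch -- like the commit the
# Cargo.lock file pointed at.
git init -q "$tmp/server"
git -C "$tmp/server" -c user.name=x -c user.email=x@x \
    commit -q --allow-empty -m first
old=$(git -C "$tmp/server" rev-parse HEAD)
git -C "$tmp/server" -c user.name=x -c user.email=x@x \
    commit -q --allow-empty -m second

# The server must opt in to serving arbitrary commits by SHA.
git -C "$tmp/server" config uploadpack.allowAnySHA1InWant true

# A fresh client can now fetch the old commit directly -- no branch
# needs to contain it.
git init -q "$tmp/client"
git -C "$tmp/client" fetch -q "$tmp/server" "$old"
git -C "$tmp/client" rev-parse FETCH_HEAD
```

Without that server-side option, the fetch of a non-tip SHA is refused,
which is why git can't do this against arbitrary servers.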
As it turns out, it actually couldn't do that, though! Cargo uses
libgit2, and libgit2 doesn't support fetching arbitrary commits. So I
wrote a quick patch to libgit2 to support this. It's only a partial
implementation, though, because I don't find libgit2 to be a
particularly easy codebase to work in (although it's better than git!).
So I'm hoping somebody who knows more about it than me will help me
figure out how to finish it.
Next week, I'm hoping that I'll be able to get vhost-user-net in crosvm
working. I think this will probably mean porting the code again,
using the latest version. Which is a bit of a shame, but at least I
have an idea of what to do next.
I am, overall, feeling pretty optimistic, though. I'm pretty confident
that we can get some sort of decent but imperfect network hardware
isolation even though virtio-vhost-user might not be ready yet, which
was something I was worried about before. I don't really want to go into
detail on that now, though, because this is already a long email, and
it's already a day late because I was tired yesterday. But essentially,
we could forward the network device to a VM that would run the driver,
and forward traffic back to the host over virtio-net. The host could
handle this either in kernelspace or userspace with DPDK, but the
important thing is that the only network driver it would need to support
would be virtio-net. No talking to hundreds of different Wi-Fi cards
and hoping that none of the drivers have a vulnerability. So, not
perfect compared to proper guest<->guest networking, but a step in the
right direction, and one that should be as simple as possible to upgrade
to virtio-vhost-user once that becomes possible.