Last week I wasn't feeling well, so there was no This Week in Spectrum.
crosvm
------
Where we left off, I had been attempting to port vhost-user-net support
from cloud-hypervisor to crosvm. I'd been trying to port the first
incarnation of the code in cloud-hypervisor to the contemporary version
of crosvm from when it was added, thinking that would be easier because
the two codebases would have been closer together then. But I ran into
the problem that this earliest incarnation of the vhost-user-net code from
cloud-hypervisor didn't actually work (at least with the backend I was
attempting to test it with). I'd been attempting to figure out exactly
which changes were required to make it work, but hadn't been successful
with that yet, and I thought I'd probably need to start the port over,
from the latest cloud-hypervisor and crosvm code.
The next day, I decided to give my previous strategy one more try,
though, and an hour or two later, I found the required cloud-hypervisor
change, applied it to crosvm, and it worked! So I now have a crosvm
tree capable of vhost-user-net[1].
This means that it's looking good for my plans for inter-guest
networking, and network hardware isolation. With that in place, I
decided to start thinking about other kinds of hardware isolation and
inter-VM communication, and that's what I did for most of the last two
weeks. Let's go through them:
Files will be shared between VMs using virtio-fs. This has the
unique feature of (soon) being able to bypass guest page caches, and
have only a single shared cache between VMs. This brings a performance
improvement, but as I understand it, should also reduce memory
consumption because each VM won't have to maintain its own copy of a
disk-backed page. Of course, this feature (DAX) is also a big side
channel, so it won't be appropriate for all use cases. But I think for
some things people want to do with Spectrum, this will be very
important.
The problem with this is that, because it uses the page cache of the
host kernel, the host has to know about the filesystem that's being
shared -- there's no running virtiofsd in a VM if we want DAX. But I'd
really like it if a (non-boot) block device could be used as a
filesystem without the host having to actually talk to the device. I
was stuck here, but edef pointed out to me that we could use the
kernel's 9P support to attach the block device to a VM, and then
mount the filesystem in the host over 9P, either over a network
connection or (ideally) vsock. It looks like the kernel should be able
to handle 9P over vsock, but I haven't tested it yet. We can use existing
virtiofsd and 9P software (there are promising Rust implementations of
each), and harden them against potential vulnerabilities like directory
traversals using kernel features like RESOLVE_BENEATH and
RESOLVE_NO_XDEV. For the boot device, maybe there's no reason not to
just mount it using the host kernel, or maybe there's something to be
gained by just reading a small bootstrap payload into memory from the
start of the disk once, and then making all future communication go via
a VM. I'm not really sure yet. But the important thing is we'll have
mechanisms for all this in place. Maybe we'll decide that non-boot
devices should just go over inter-VM 9P, but in any case, we'll still
need all these pieces.
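To make the directory traversal concern above a bit more concrete, here
is a minimal sketch of the policy that RESOLVE_BENEATH enforces, written
as a purely lexical check in userspace. The function name
resolve_beneath and the example paths are made up for illustration. This
is not how the kernel does it, and it is no substitute for the kernel
flags: it knows nothing about symlinks or mount points, which is
precisely why doing the check in the kernel is attractive.

```rust
use std::path::{Component, Path, PathBuf};

/// Lexically resolve `untrusted` below `root`, returning `None` for any
/// path that would leave `root`. Hypothetical helper for illustration;
/// it does not follow symlinks or look at mount points, which is exactly
/// the part the kernel flags take care of.
fn resolve_beneath(root: &Path, untrusted: &Path) -> Option<PathBuf> {
    let mut resolved = root.to_path_buf();
    let mut depth = 0usize; // number of components currently below `root`

    for component in untrusted.components() {
        match component {
            // Absolute paths are rejected outright.
            Component::RootDir | Component::Prefix(_) => return None,
            // "." changes nothing.
            Component::CurDir => {}
            // ".." is fine as long as it stays below `root`.
            Component::ParentDir => {
                if depth == 0 {
                    return None;
                }
                resolved.pop();
                depth -= 1;
            }
            // An ordinary name descends one level.
            Component::Normal(name) => {
                resolved.push(name);
                depth += 1;
            }
        }
    }
    Some(resolved)
}
```

With an assumed root of /srv/share, the path docs/../notes.txt resolves
to /srv/share/notes.txt, while ../etc/passwd and /etc/passwd are both
refused.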
GPU isolation should be possible by forwarding the GPU to a VM, but
there are a few problems here. The first is that it would mean rendered
surfaces have to be copied via shared memory to the VM with the GPU,
before being sent to the GPU. Additionally, sharing the GPU between VMs
for rendering at all would require significantly more work. The result
of this is that graphics performance using an isolated GPU will probably
be poor, at least for now. The final problem is that passthrough of
integrated GPUs seems to be very difficult to get right. I will
probably need to acquire some hardware that I've seen a report of this
working on, so I can figure out what I've been doing wrong on the two
computers I've tried it on so far. I suspect that I will get GPU
isolation working, but I'm not sure how reliable or performant it will
be.
For generic USB devices, I expect to be able to take an approach similar
to Qubes[2], having a VM to handle interactions with the hardware USB
controller, and exposing individual USB devices over USB/IP to other
VMs. It would be nice if I could use vsock for this too.
[1]: https://spectrum-os.org/git/crosvm/?h=vhost-user-net
[2]: https://www.qubes-os.org/doc/usb-devices/
spectrum-os.org
---------------
Philipp registered a Matrix room and bridged it to the #spectrum IRC
channel. I'm told that this should make it easier for Matrix users to
join the room, since some bug in Matrix's IRC bridge prevents people
from joining from Matrix the usual way. Philipp also sent a patch[3] to
improve the instructions for Matrix users joining the channel on the
website. Thanks Philipp!
[3]: https://spectrum-os.org/lists/archives/spectrum-devel/87wo247zu7.fsf@alyssa…
QEMU
----
I sent the previously requested patch[4] to resolve ambiguities in the
vhost-user spec. No response yet, though. I'll probably resend it some
time soon.
[4]: https://lore.kernel.org/qemu-devel/20200813094847.4288-1-hi@alyssa.is/
I'm finding it hard to keep going at the moment. The stuff I'm doing
now is probably the hardest part of implementing Spectrum, and it's
frustrating to realise that not everything I want to do is going to be
possible. So much of the KVM ecosystem assumes that things will be
host<->guest, and there's not always an easy solution. But, whatever we
end up with, it's going to be a lot better than what I'm using today,
and what lots of other people are using today. I think I'm going to be
able to deliver a good experience with a fairly high degree of
protection against malicious hardware. But it's not going to be
perfect.
I'm pushing quite hard to make it over the line with my hardware
isolation funding milestone. I'm so close, and I'm about to need the
money. But once I've hit that, I think I'm going to need a break. This
stuff is gruelling.
Last week, I'd just finished getting the cloud-hypervisor vhost-user-net
frontend code to build as part of crosvm, and the next step was testing
it.
crosvm
------
I wrote some hacky code that replaced the virtio-net device creation in
crosvm with an instance of the ported vhost-user-net code. When I
booted crosvm, there were some of the expected simple oversights of mine
that needed to be addressed, but once those were taken care of, it still
didn't quite work. The VM boots, sees a network interface, and even
communicates with the vhost-user-net backend! But the vhost-user-net
code never realises (or never gets told) that it has traffic waiting,
and so the traffic is never processed. Unsure of what to do about this,
I decided to turn to cloud-hypervisor and look at how the code ran
there.
cloud-hypervisor
----------------
I wanted to try running the cloud-hypervisor v-u-n backend I was using
for testing (because it's much simpler than DPDK -- it just sends
traffic to a TAP device) with QEMU as the frontend, because QEMU is a
VMM I'm much more familiar with than cloud-hypervisor, and I thought it
would be useful to have a working frontend/backend combination to
compare to.
I had some problems, though, because apparently nobody had ever wanted
to use QEMU with the cloud-hypervisor vhost-user-net backend before --
or if they had, they hadn't wanted to enough to make it work. The
cloud-hypervisor backend didn't implement the vhost-user spec correctly
in a few subtle ways that made it incompatible with QEMU. I won't
explain every subtle issue, but I ended up writing a few patches[1][2]
for cloud-hypervisor and the "vhost" crate it depends on (that is in the
process of being moved under the rust-vmm umbrella).
One interesting issue I will go into a little detail on was that the
wording in the spec was a little unclear, and QEMU interpreted it one
way, and cloud-hypervisor the other. I ended up sending an email[3] to
the author of the spec asking for clarification. He answered my
question, and we discussed how the wording could be improved. He liked
my second attempt at improving the wording, and asked me to send a patch,
but preferably not right now, because QEMU is currently gearing up for a
release, scheduled for next week if everything goes well.
Since I wrote these cloud-hypervisor patches, and had to test them, I
ended up having to learn how to use cloud-hypervisor anyway to make sure
I hadn't broken it in fixing the backend up to work with QEMU. Oh well.
Once this was done, I could use both QEMU and cloud-hypervisor with the
backend, but not crosvm. But it was a little more complex than that.
When I ported the v-u-n code to crosvm, I ported the first version of it
that was added to the cloud-hypervisor tree, rather than the latest
version. The theory here was that the earlier version would be closer
to crosvm, because cloud-hypervisor would have had less time to diverge.
Then, once I had that working, I could add on the later changes
gradually. What I didn't account for here is that the initial version
of the v-u-n frontend in cloud-hypervisor didn't really work properly,
and needed some time to bake before it did. So having now had this
experience I think it might be better to try to port the latest version,
and accept that porting might be a bit harder, but the end result is
more likely to work.
[1]: https://github.com/cloud-hypervisor/vhost/pull/22
[2]: https://github.com/cloud-hypervisor/cloud-hypervisor/pull/1565
[3]: https://lore.kernel.org/qemu-devel/87sgd1ktx9.fsf@alyssa.is/
libgit2
-------
While bisecting cloud-hypervisor to see if I could figure out when the
v-u-n frontend started working properly, I encountered a large section
of commits that I couldn't build any more, because Cargo couldn't
resolve a git dependency. The dependency was locked to a commit that
was no longer in the branch it had been in when the cloud-hypervisor
commit was made. Despite knowing the exact commit it needed, Cargo
fetched the branch the commit used to be on. This is because it is
generally not possible to fetch arbitrary commits with git. Some
servers, like GitHub, do however allow this, and I wondered why Cargo
wouldn't at least fall back to trying that.
As it turns out, it actually couldn't do that, though! Cargo uses
libgit2, and libgit2 doesn't support fetching arbitrary commits. So I
wrote a quick patch to libgit2 to support this[4]. It's only a partial
implementation, though, because I don't find libgit2 to be a
particularly easy codebase to work in (although it's better than git!).
So I'm hoping somebody who knows more about it than me will help me
figure out how to finish it.
[4]: https://github.com/libgit2/libgit2/pull/5603
Next week, I'm hoping that I'll be able to get vhost-user-net in
crosvm working. I think this will probably mean porting the code again,
using the latest version. Which is a bit of a shame, but at least I
have an idea of what to do next.
I am, overall, feeling pretty optimistic, though. I'm pretty confident
that we can get some sort of decent but imperfect network hardware
isolation even though virtio-vhost-user might not be ready yet, which
was something I was worried about before. I don't really want to go
into detail on that now, because this is already a long email and
it's already a day late because I was tired yesterday, but essentially,
we could forward the network device to a VM that would run the driver,
and forward traffic back to the host over virtio-net. The host could
handle this either in kernelspace or userspace with DPDK, but the
important thing is that the only network driver it would need to support
would be virtio-net. No talking to hundreds of different Wi-Fi cards
and hoping that none of the drivers have a vulnerability. So, not
perfect compared to proper guest<->guest networking, but a step in the
right direction, and one that should be as simple as possible to upgrade
to virtio-vhost-user once that becomes possible.
DPDK
----
Last week, I'd just figured out how to do a normal vhost-user setup with
a QEMU VM connected to DPDK. This week, I wanted to try to move DPDK
into another VM using the experimental virtio-vhost-user driver, taking
the host system out of the networking equation altogether.
In theory this should have been a very simple change, but I couldn't get
it to work. DPDK claimed to be forwarding packets to the ethernet
device I'd attached to the backend VM (the one running DPDK), but
networking in the frontend VM (what you might think of as the
application VM) didn't work at all. It tried and failed to do DHCP, and
so couldn't progress beyond that.
A breakthrough came when I thought to look at the logs of my local DHCP
server. I saw that it was actually receiving requests from the VM, and
assigning it an IP address. Once I realised this, I hypothesised that
outgoing traffic was working, but not incoming.
Finally having something to look for, I had a look through the DPDK
virtio-vhost-user driver[1], and my suspicion was confirmed in an
unexpected way. It looks like incoming traffic (from the perspective of
the virtio-vhost-user frontend) is not actually implemented at all!
But with outbound traffic working, I'm confident that I understand
virtio-vhost-user well enough to be able to leave this here for now.
From Spectrum's side, I can now be pretty sure that everything
should be workable, so we can just wait a bit for virtio-vhost-user to
get a bit further along and then revisit it. And since the frontend
has no idea it's talking to virtio-vhost-user instead of normal
vhost-user, we can use normal (host-based) vhost-user for now, and drop
virtio-vhost-user in down the line.
A couple of outstanding questions I still don't know the answer to about
DPDK are:
- How will routing work if I have multiple frontend VMs with multiple
virtio-vhost-user connections all wanting to use the same network
device? Will I want to use something like Open vSwitch[2] for that?
- DPDK by default uses a busyloop to check for data to process, for
efficiency. This is obviously not appropriate for a
workstation-focused operating system. There is an interrupt-based
mode, but I don't know how to use it yet.
Since I consider the concept proven, though, I'm going to punt on these
for now. The longer I leave these questions, the more likely it is that
a kernel driver for virtio-vhost-user will emerge and we can use that
instead. That's not to say I want to leave inter-guest networking
hanging forever, but I have other inter-guest networking bits I can
switch focus to for now, and once those are done I can revisit the
virtio-vhost-user backend situation.
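As a rough analogy for the busyloop question above, here is a small
sketch contrasting the two ways of waiting for data. It uses a plain
Rust channel as a stand-in for a receive queue and involves no DPDK code
at all, so the function names (recv_busy_poll, recv_blocking,
spawn_producer) are invented for illustration. The busy-polling variant
keeps a core spinning while nothing arrives, whereas the blocking
variant lets the thread sleep until it is woken, which is roughly the
trade-off between a poll-based and an interrupt-based mode.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::thread;
use std::time::Duration;

/// Busy-poll: keep asking the queue whether anything has arrived.
/// Returns the packet together with the number of empty polls, which is
/// a crude measure of the CPU time burned while waiting.
fn recv_busy_poll(rx: &Receiver<u32>) -> Option<(u32, u64)> {
    let mut empty_polls = 0u64;
    loop {
        match rx.try_recv() {
            Ok(packet) => return Some((packet, empty_polls)),
            Err(TryRecvError::Empty) => {
                empty_polls += 1;
                std::hint::spin_loop();
            }
            Err(TryRecvError::Disconnected) => return None,
        }
    }
}

/// Blocking: the thread sleeps until the sender wakes it up, so an idle
/// queue costs no CPU time.
fn recv_blocking(rx: &Receiver<u32>) -> Option<u32> {
    rx.recv().ok()
}

/// Hypothetical producer standing in for a device that delivers a single
/// packet after `delay`.
fn spawn_producer(packet: u32, delay: Duration) -> Receiver<u32> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(delay);
        // The receiver may already be gone; that is fine for a sketch.
        let _ = tx.send(packet);
    });
    rx
}
```

Both functions return the same packet. The difference is only in what
the waiting thread does in the meantime, which matters a lot more on a
laptop than on a dedicated packet-processing box.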
[1]: https://github.com/ndragazis/dpdk-next-virtio/blob/2d60e63/drivers/virtio_v…
[2]: https://www.openvswitch.org/
crosvm
------
I started integrating the vhost-user-net code from Cloud Hypervisor into
crosvm. I'm at the point where I can get all the copied Cloud
Hypervisor code to compile in crosvm, which is pretty good! I have not
yet written the code to actually start one of these devices, though,
so I haven't been able to test it.
It's been interesting to look at Cloud Hypervisor because it's a
codebase that is heavily based on crosvm (even more so than Firecracker
is), but that has also evolved and diverged from it. It's especially
interesting to see stuff where parallel evolution occurred between the
crosvm and Cloud Hypervisor codebases, or when Cloud Hypervisor changed
how some crosvm code worked, and then later changed it back again.
The codebases were still similar enough that I could have the
cloud-hypervisor device integrated into the crosvm codebase in a day,
although there's lots of code duplication that will have to be dealt
with -- I copied over a bunch of supporting code rather than trying to
integrate it into the crosvm equivalents to get the code running for the
first time in an environment as similar as possible to the one it was
designed for. I expect that when I test the device in crosvm it'll
probably work fairly quickly if not first try. The more complicated
part will be a bit of a change to how crosvm does guest memory that
isn't strictly necessary but is important for security.
crosvm allocates all guest memory in a single memfd. This means that,
to share guest memory with another process, like when using vhost-user,
the only option is to share all of guest memory. This would sort of
defeat the purpose of hardware isolation in Spectrum! But from what I
could tell -- I'm not 100% on this -- the guest memory abstraction in
cloud-hypervisor is more advanced, and I think it might support multiple
memfds backing guest memory for this sort of thing. I'll have to adapt
crosvm to that model to be able to use vhost-user securely.
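The memfd problem above can be sketched as a toy model. The Region type,
the exposed_bytes function, and the two layouts are all hypothetical and
are not taken from crosvm or cloud-hypervisor. The model also assumes
that the buffers a backend needs can be confined to a small dedicated
window of guest memory, which is a big assumption in its own right. The
point is only that handing over a memfd exposes everything that memfd
backs, so the number of memfds determines how finely memory can be
shared.

```rust
/// A range of guest physical memory and the memfd that backs it.
/// `memfd_id` is a stand-in label, not a real file descriptor.
struct Region {
    guest_addr: u64,
    len: u64,
    memfd_id: u32,
}

/// How many bytes of guest memory another process could map if it were
/// given every memfd that backs part of `[addr, addr + len)`.
fn exposed_bytes(layout: &[Region], addr: u64, len: u64) -> u64 {
    let end = addr + len;
    // memfds backing at least one byte of the requested range
    let shared: Vec<u32> = layout
        .iter()
        .filter(|r| r.guest_addr < end && addr < r.guest_addr + r.len)
        .map(|r| r.memfd_id)
        .collect();
    // Handing over a memfd exposes everything that memfd backs.
    layout
        .iter()
        .filter(|r| shared.contains(&r.memfd_id))
        .map(|r| r.len)
        .sum()
}

const MIB: u64 = 1 << 20;
const GIB: u64 = 1 << 30;

/// All of a 1 GiB guest in one memfd.
fn single_memfd_layout() -> Vec<Region> {
    vec![Region { guest_addr: 0, len: GIB, memfd_id: 1 }]
}

/// Hypothetical split layout: the last 2 MiB of the same guest live in
/// their own memfd, intended as the only window shared with a backend.
fn split_layout() -> Vec<Region> {
    vec![
        Region { guest_addr: 0, len: GIB - 2 * MIB, memfd_id: 1 },
        Region { guest_addr: GIB - 2 * MIB, len: 2 * MIB, memfd_id: 2 },
    ]
}
```

Sharing a single 4 KiB page from the window exposes the whole 1 GiB in
the single-memfd layout, but only the 2 MiB window in the split layout.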
website
-------
The new "Bibliography" page is up[3]! Lots of links to relevant resources
about concepts important to Spectrum. :)
[3]: https://spectrum-os.org/bibliography.html
It's a bit of a relief to have returned from the uncertain world of DPDK
to the familiar territory of crosvm. I'm confident that the next bit of
work here (vhost-user in crosvm) won't be that much of a big deal.
Hopefully, we'll have at least interim networking to a reasonable degree
fairly soon. After that, I plan to look at file sharing, possibly with
vhost-user-fs (virtio-fs over vhost-user), which I noticed
cloud-hypervisor implements today. That should be pretty similar to the
networking stuff, although I don't think any virtio-fs virtio-vhost-user
code exists at the moment.