August 2020 - Spectrum Discuss

This (and Last) Week in Spectrum, 2020-W34 & 2020-W35
by Alyssa Ross 26 Aug '20


Last week I wasn't feeling well, so there was no This Week in Spectrum.

crosvm
------

Where we left off, I had been attempting to port vhost-user-net support from cloud-hypervisor to crosvm. I'd been trying to port the first incarnation of the code in cloud-hypervisor to the contemporary version of crosvm from when it was added, thinking that that would be easier because the two codebases would be closer together. But I ran into the problem that this earliest incarnation of the vhost-user-net code from cloud-hypervisor didn't actually work (at least with the backend I was attempting to test it with). I'd been attempting to figure out exactly which changes were required to make it work, but hadn't been successful with that yet, and I thought I'd probably need to start the port over, from the latest cloud-hypervisor and crosvm code.

The next day, though, I decided to give my previous strategy one more try, and an hour or two later, I found the required cloud-hypervisor change, applied it to crosvm, and it worked! So I now have a crosvm tree capable of vhost-user-net[1]. This means that it's looking good for my plans for inter-guest networking and network hardware isolation.

With that in place, I decided to start thinking about other kinds of hardware isolation and inter-VM communication, and that's what I did for most of the last two weeks. Let's go through them:

Files will be shared between VMs using virtio-fs. This has the unique feature of (soon) being able to bypass guest page caches, and have only a single shared cache between VMs. This brings a performance improvement, but as I understand it, should also reduce memory consumption because each VM won't have to maintain its own copy of a disk-backed page. Of course, this feature (DAX) is also a big side channel, so it won't be appropriate for all use cases. But I think for some things people want to do with Spectrum, this will be very important.
The problem with this is that, because it uses the page cache of the host kernel, the host has to know about the filesystem that's being shared -- there's no running virtiofsd in a VM if we want DAX. But I'd really like it if a (non-boot) block device could be used as a filesystem without the host having to actually talk to the device.

I was stuck here, but edef pointed out to me that we could use the kernel's 9P support to attach the block device to a VM, and then mount the filesystem in the host over 9P, either over a network connection or (ideally) vsock. It looks like the kernel should be able to handle 9P over vsock, but I haven't tested that yet.

We can use existing virtiofsd and 9P software (there are promising Rust implementations of each), and harden them against potential vulnerabilities like directory traversals using kernel features like RESOLVE_BENEATH and RESOLVE_NO_XDEV.

For the boot device, maybe there's no reason not to just mount it using the host kernel, or maybe there's something to be gained by just reading a small bootstrap payload into memory from the start of the disk once, and then making all future communication go via a VM. I'm not really sure yet. But the important thing is we'll have mechanisms for all this in place. Maybe we'll decide that non-boot devices should just go over inter-VM 9P, but in any case, we'll still need all these pieces.

GPU isolation should be possible by forwarding the GPU to a VM, but there are a few problems here. The first is that it would mean rendered surfaces have to be copied via shared memory to the VM with the GPU, before being sent to the GPU. Additionally, sharing the GPU between VMs for rendering at all would require significantly more work. The result of this is that graphics performance using an isolated GPU will probably be poor, at least for now. The final problem is that passthrough of integrated GPUs seems to be very difficult to get right.
I will probably need to acquire some hardware that I've seen a report of this working on, so I can figure out what I've been doing wrong on the two computers I've tried it on so far. I suspect that I will get GPU isolation working, but I'm not sure how reliable or performant it will be.

For generic USB devices, I expect to be able to take an approach similar to Qubes[2], having a VM to handle interactions with the hardware USB controller, and exposing individual USB devices over USB/IP to other VMs. It would be nice if I could use vsock for this too.

[1]: https://spectrum-os.org/git/crosvm/?h=vhost-user-net
[2]: https://www.qubes-os.org/doc/usb-devices/

spectrum-os.org
---------------

Philipp registered a Matrix room and bridged it to the #spectrum IRC channel. I'm told that this should make it easier for Matrix users to join the room, since some bug in Matrix's IRC bridge prevents people from joining from Matrix the usual way.

Philipp also sent a patch[3] to improve the instructions on the website for Matrix users joining the channel. Thanks Philipp!

[3]: https://spectrum-os.org/lists/archives/spectrum-devel/87wo247zu7.fsf@alyssa…

QEMU
----

I sent the previously requested patch[4] to resolve ambiguities in the vhost-user spec. No response yet, though. I'll probably resend it some time soon.

[4]: https://lore.kernel.org/qemu-devel/20200813094847.4288-1-hi@alyssa.is/

I'm finding it hard to keep going at the moment. The stuff I'm doing now is probably the hardest part of implementing Spectrum, and it's frustrating to realise that not everything I want to do is going to be possible. So much of the KVM ecosystem assumes that things will be host<->guest, and there's not always an easy solution. But, whatever we end up with, it's going to be a lot better than what I'm using today, and what lots of other people are using today. I think I'm going to be able to deliver a good experience with a fairly high degree of protection against malicious hardware.
But it's not going to be perfect. I'm pushing quite hard to make it over the line with my hardware isolation funding milestone. I'm so close, and I'm about to need the money. But once I've hit that, I think I'm going to need a break. This stuff is gruelling.
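
Editorial sketch for the file sharing section above: RESOLVE_BENEATH and RESOLVE_NO_XDEV are flags for the Linux openat2(2) system call (kernel 5.6 and later). The Python below shows roughly how a file server could confine lookups to its shared directory. It is only an illustration under stated assumptions: `open_beneath` is a hypothetical helper, not code from virtiofsd or any 9P server; Python has no openat2 wrapper, so the raw system call is made through ctypes; and the syscall number 437 is assumed to be the one used on x86_64 and arm64.

```python
import ctypes
import os

SYS_OPENAT2 = 437        # assumption: value on x86_64 and arm64 Linux
RESOLVE_NO_XDEV = 0x01   # refuse to cross mount points during resolution
RESOLVE_BENEATH = 0x08   # refuse any path that would escape dirfd


class OpenHow(ctypes.Structure):
    """Mirror of the kernel's struct open_how."""
    _fields_ = [
        ("flags", ctypes.c_uint64),
        ("mode", ctypes.c_uint64),
        ("resolve", ctypes.c_uint64),
    ]


_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def open_beneath(dirfd, path, flags=os.O_RDONLY):
    """Open `path` relative to `dirfd`, failing if resolution leaves it.

    Hypothetical helper. Raises OSError (typically EXDEV) when the path
    uses "..", an absolute path, or a symlink to get out of the directory,
    or when it crosses a mount point.
    """
    how = OpenHow(flags | os.O_CLOEXEC, 0, RESOLVE_BENEATH | RESOLVE_NO_XDEV)
    fd = _libc.syscall(
        ctypes.c_long(SYS_OPENAT2),
        ctypes.c_int(dirfd),
        ctypes.c_char_p(os.fsencode(path)),
        ctypes.byref(how),
        ctypes.c_size_t(ctypes.sizeof(how)),
    )
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    return fd
```

A server would open its export root once with `os.open(root, os.O_RDONLY | os.O_DIRECTORY)` and pass that descriptor as `dirfd` for every guest-supplied path, so traversal attempts are rejected by the kernel rather than by string checks in the server. On kernels or sandboxes without openat2 the call fails with ENOSYS (or EPERM under some seccomp filters).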


This (Last) Week in Spectrum, 2020-W32
by Alyssa Ross 11 Aug '20


Last week, I'd just finished getting the cloud-hypervisor vhost-user-net frontend code to build as part of crosvm, and the next step was testing it.

crosvm
------

I wrote some hacky code that replaced the virtio-net device creation in crosvm with an instance of the ported vhost-user-net code. When I booted crosvm, there were some of the expected simple oversights of mine that needed to be addressed, but once those were taken care of, it still didn't quite work. The VM boots, sees a network interface, and even communicates with the vhost-user-net backend! But the vhost-user-net code never realises/gets told that it has traffic, and so the traffic is never processed. Unsure of what to do about this, I decided to turn to cloud-hypervisor and look at how the code ran there.

cloud-hypervisor
----------------

I wanted to try running the cloud-hypervisor v-u-n backend I was using for testing (because it's much simpler than DPDK -- it just sends traffic to a TAP device) with QEMU as the frontend, because QEMU is a VMM I'm familiar with (much more so than cloud-hypervisor as the frontend), and I thought it would be useful to have a working frontend/backend combination to compare to.

I had some problems, though, because apparently nobody had ever wanted to use QEMU with the cloud-hypervisor vhost-user-net backend before -- or if they had, they hadn't wanted to enough to make it work. The cloud-hypervisor backend didn't implement the vhost-user spec correctly in a few subtle ways that made it incompatible with QEMU. I won't explain every subtle issue, but I ended up writing a few patches[1][2] for cloud-hypervisor and the "vhost" crate it depends on (that is in the process of being moved under the rust-vmm umbrella).

One interesting issue I will go into a little detail on was that the wording in the spec was a little unclear, and QEMU interpreted it one way, and cloud-hypervisor the other.
I ended up sending an email[3] to the author of the spec asking for clarification. He answered my question, and we discussed how the wording could be improved. He liked my second attempt at improving the wording, and asked me to send a patch, but preferably not right now, because QEMU is currently gearing up for a release, scheduled for next week if everything goes well.

Since I wrote these cloud-hypervisor patches, and had to test them, I ended up having to learn how to use cloud-hypervisor anyway to make sure I hadn't broken it in fixing the backend up to work with QEMU. Oh well.

Once this was done, I could use both QEMU and cloud-hypervisor with the backend, but not crosvm. But it was a little more complex than that. When I ported the v-u-n code to crosvm, I ported the first version of it that was added to the cloud-hypervisor tree, rather than the latest version. The theory here was that the earlier version would be closer to crosvm, because cloud-hypervisor would have had less time to diverge. Then, once I had that working, I could add on the later changes gradually. What I didn't account for here is that the initial version of the v-u-n frontend in cloud-hypervisor didn't really work properly, and needed some time to bake before it did. So having now had this experience I think it might be better to try to port the latest version, and accept that porting might be a bit harder, but the end result is more likely to work.

[1]: https://github.com/cloud-hypervisor/vhost/pull/22
[2]: https://github.com/cloud-hypervisor/cloud-hypervisor/pull/1565
[3]: https://lore.kernel.org/qemu-devel/87sgd1ktx9.fsf@alyssa.is/

libgit2
-------

While bisecting cloud-hypervisor to see if I could figure out when the v-u-n frontend started working properly, I encountered a large section of commits that I couldn't build any more, because Cargo couldn't resolve a git dependency.
The dependency was locked to a commit that was no longer in the branch it had been in at the time of the cloud-hypervisor commit. Despite knowing the exact commit it needed, Cargo fetched the branch the commit used to be on. This is because it is generally not possible to fetch arbitrary commits with git. Some servers, like GitHub, do however allow this, and I wondered why Cargo wouldn't at least fall back to trying that.

As it turns out, it actually couldn't do that, though! Cargo uses libgit2, and libgit2 doesn't support fetching arbitrary commits. So I wrote a quick patch to libgit2 to support this[4]. It's only a partial implementation, though, because I don't find libgit2 to be a particularly easy codebase to work in (although it's better than git!). So I'm hoping somebody who knows more about it than me will help me figure out how to finish it.

[4]: https://github.com/libgit2/libgit2/pull/5603

Next week, I'm hoping that I'll be able to get vhost-user-net in crosvm working. I think this will probably mean porting the code again, using the latest version. Which is a bit of a shame, but at least I have an idea of what to do next.

I am, overall, feeling pretty optimistic, though. I'm pretty confident that we can get some sort of decent but imperfect network hardware isolation even though virtio-vhost-user might not be ready yet, which was something I was worried about before. I don't really want to go into detail on that now, because this is already a long email and it's already a day late because I was tired yesterday, but essentially, we could forward the network device to a VM that would run the driver, and forward traffic back to the host over virtio-net. The host could handle this either in kernelspace or userspace with DPDK, but the important thing is that the only network driver it would need to support would be virtio-net. No talking to hundreds of different Wi-Fi cards and hoping that none of the drivers have a vulnerability.
So, not perfect compared to proper guest<->guest networking, but a step in the right direction, and one that should be as simple as possible to upgrade to virtio-vhost-user once that becomes possible.


This Week in Spectrum, 2020-W31
by Alyssa Ross 02 Aug '20


DPDK
----

Last week, I'd just figured out how to do a normal vhost-user setup with a QEMU VM connected to DPDK. This week, I wanted to try to move DPDK into another VM using the experimental virtio-vhost-user driver, taking the host system out of the networking equation altogether.

In theory this should have been a very simple change, but I couldn't get it to work. DPDK claimed to be forwarding packets to the ethernet device I'd attached to the backend VM (the one running DPDK), but networking in the frontend VM (what you might think of as the application VM) didn't work at all. It tried and failed to do DHCP, and so couldn't progress beyond that.

A breakthrough came when I thought to look at the logs of my local DHCP server. I saw that it was actually receiving requests from the VM, and assigning it an IP address. Once I realised this, I hypothesised that outgoing traffic was working, but not incoming. Finally having something to look for, I had a look through the DPDK virtio-vhost-user driver[1], and my suspicion was confirmed in an unexpected way. It looks like incoming traffic (from the perspective of the virtio-vhost-user frontend) is not actually implemented at all!

But with outbound traffic working, I'm confident I understand virtio-vhost-user well enough to be able to leave this here for now. From Spectrum's side, I can now be pretty sure that everything should be workable, so we can just wait a bit for virtio-vhost-user to get a bit further along and then revisit it. And since the frontend has no idea it's talking to virtio-vhost-user instead of normal vhost-user, we can use normal (host-based) vhost-user for now, and drop virtio-vhost-user in down the line.

A couple of outstanding questions about DPDK that I still don't know the answer to are:

- How will routing work if I have multiple frontend VMs with multiple virtio-vhost-user connections all wanting to use the same network device?
  Will I want to use something like Open vSwitch[2] for that?
- DPDK by default uses a busyloop to check for data to process, for efficiency. This is obviously not appropriate for a workstation-focused operating system. There is an interrupt-based mode, but I don't know how to use it yet.

Since I consider the concept proven, though, I'm going to punt on these for now. The longer I leave these questions, the more likely it is that a kernel driver for virtio-vhost-user will emerge and we can use that instead. That's not to say I want to leave inter-guest networking hanging forever, but I have other inter-guest networking bits I can switch focus to for now, and once those are done I can revisit the virtio-vhost-user backend situation.

[1]: https://github.com/ndragazis/dpdk-next-virtio/blob/2d60e63/drivers/virtio_v…
[2]: https://www.openvswitch.org/

crosvm
------

I started integrating the vhost-user-net code from Cloud Hypervisor into crosvm. I'm at the point where I can get all the copied Cloud Hypervisor code to compile in crosvm, which is pretty good! I have not yet written the code to actually start one of these devices, though, so I haven't been able to test it.

It's been interesting to look at Cloud Hypervisor because it's a codebase that is heavily based on crosvm (even more so than Firecracker is), but that has also evolved and diverged from it. It's especially interesting to see stuff where parallel evolution occurred between the crosvm and Cloud Hypervisor codebases, or when Cloud Hypervisor changed how some crosvm code worked, and then later changed it back again.
The codebases were still similar enough that I could have the cloud-hypervisor device integrated into the crosvm codebase in a day, although there's lots of code duplication that will have to be dealt with -- I copied over a bunch of supporting code rather than trying to integrate it into the crosvm equivalents, to get the code running for the first time in an environment as similar as possible to the one it was designed for.

I expect that when I test the device in crosvm it'll probably work fairly quickly if not first try. The more complicated part will be a bit of a change to how crosvm does guest memory that isn't strictly necessary but is important for security. crosvm allocates all guest memory in a single memfd. This means that, to share guest memory with another process, like when using vhost-user, the only option is to share all of guest memory. This would sort of defeat the purpose of hardware isolation in Spectrum! But from what I could tell -- I'm not 100% on this -- the guest memory abstraction in cloud-hypervisor is more advanced, and I think it might support multiple memfds backing guest memory for this sort of thing. I'll have to adapt crosvm to that model to be able to use vhost-user securely.

website
-------

The new "Bibliography" page is up[3]! Lots of links to relevant resources about concepts important to Spectrum. :)

[3]: https://spectrum-os.org/bibliography.html

It's a bit of a relief to have returned from the uncertain world of DPDK to the familiar territory of crosvm. I'm confident that the next bit of work here (vhost-user in crosvm) won't be that much of a big deal. Hopefully, we'll have interim networking to a reasonable degree fairly soon. After that, I plan to look at file sharing, possibly with vhost-user-fs (virtio-fs over vhost-user), which I noticed cloud-hypervisor implements today. That should be pretty similar to the networking stuff, although I don't think any virtio-fs virtio-vhost-user code exists at the moment.
