Hi Thomas!
Thanks for keeping us updated. It's really great to have all this
written up to read, even though I'm only getting to it a month and a
half later.
On Wed, Jan 27, 2021 at 05:31:08PM +0000, Thomas Leonard wrote:
> I've made a bit of progress this week:
>
> It turns out that weston-terminal crashes sommelier if started when
> the clipboard is empty, due to trying to dereference NULL. I've
> patched it to fix that, and now I can run it directly under sommelier,
> without wayfire. I made a few other changes to sommelier too:
>
> - I switched to the latest version, which provides meson instead of
> common-mk for building. Also, they removed the demos and got rid of
> some bogus dependencies. That simplified the build a lot!
> - They switched to the stable XDG protocols, but then reverted it
> again. I unreverted it to get things going again. Not sure if I did it
> right (they migrated from C to C++ so the patch didn't apply
> directly).
This is great to know -- it sounds like maybe they're trying to make
Sommelier more widely usable? Will probably be a while before I get to
updating but this is very exciting.
> - I added xwayland to the VM and sommelier command, allowing X
> applications to run in the VM.
> - By default sommelier runs the program with an already-open socket,
> which doesn't work if the program (or its children) want to open
> multiple connections.
> I was able to fix that by using `--parent` mode, and getting rid of
> PEER_CMD_PREFIX (which just adds some chromium paths preventing it
> from working).
> - Note: in `--parent` mode it waits for the process to exit before
> processing events on the socket, so if you just run an application
> directly it will hang. I used `bash -c 'firefox &'` as the command as
> a work-around.
> - Some programs (e.g. firefox) refused to start because the protocol
> versions offered by sommelier were too old. I increased the version
> numbers and that's working now. It needs doing properly, though. e.g.
> I implemented the new "sl_host_surface_damage_buffer" by simply
> calling the old damage function, which is obviously not correct but is
> working for me so far!
> - Annoyingly, using `--parent` disables xwayland support. Maybe we
> should run xwayland manually, or use a second sommelier instance?
>
> In general, sommelier seems quite buggy and annoying. I guess it will
> need updating constantly to proxy every new wayland protocol. Yet it
> can't add any useful security because it runs inside the VM, and is
> therefore untrusted.
Yeah...
> Some other changes that I found useful:
>
> - I added the generated kernel modules directory to rootfs, which
> allows using all the normal features of Linux (e.g. ext4) in the VM.
Ah, yes, that would remove a lot of gotchas. I have avoided that so far
because I'm hoping to eventually build custom kernels that don't need
many modules, to reduce code size in each VM. But it would probably
make sense to do for now.
> - I switched from `bash` to `bashInteractive` as the VM shell, which
> gets cursor keys working.
Good catch! I'll make that change in Spectrum as well.
> - I wrote a Nix package to generate one script for each of my old
> qubes. So e.g. I can now run `qvm-start-shopping` to start my crosvm
> shopping instance, with its own /home LVM partition and IP address. It
> passes the network configuration using some new kernel parameters
> (alongside spectrumcmd).
> - I put each VM on its own point-to-point virtual network. These
> networks are set up by /etc/nixos/configuration.nix. That works well
> for my qubes-like VMs, though I guess spectrum will need something
> more dynamic.
> - I enabled the shared filesystem (VIRTIO_FS), which works nicely. I
> use it to provide a (separate) shared directory to each VM that I can
> access from the host.
> One problem is that the crosvm driver runs in a minijail with a
> uidmap that makes every file appear to be owned by root, so only root
> can write things in the VM.
> Possibly a newer kernel would help; later versions of the kernel
> docs say you can include any normal FUSE flags here, so mounting with
> `uid=1000` might work.
I've only looked into virtio-fs a little bit -- I remember having to
make a change to crosvm to make the sandboxing work. Glad to hear it's
working well. I'll find out if a later kernel works when I get to
updating Nixpkgs (or somebody else does -- someone on IRC was actually
offering to try doing this the other day).
> - Finally, I added a `vm-halt` command that just calls `reboot`, as I
> don't want to develop the habit of typing `reboot` without thinking
> ;-)
I don't want to think about how many times I've made this mistake, lol.
> If any of this sounds useful for spectrum let me know. I can try and
> tidy it up; it's all a huge mess at the moment!
I think it might well be -- the stuff you have going on with networking
and filesystems sounds great, in particular -- but I'll have to get a
bit more back into the project again to know exactly what.
Right now I'm focusing on slowly bringing myself back up to speed and
remembering what state things are in.
> Once this is working more smoothly, I guess the next issues will be
> setting up some kind of secure window manager on the host (e.g.
> labelling windows with the VM they come from, not allowing
> screenshots, etc). Would also be good to get sound forwarding working
> somehow (Qubes routes pulseaudio to all the VMs and gives you a mixer
> to control the levels for each, but I don't know how that worked). It
> also needs some kind of VM manager to keep track of which VMs are
> running. And some kind of IPC system like qrexec would be useful. Do
> you have thoughts or plans about how to do any of this?
The window manager is a part of this whole thing that makes me very
nervous. A secure window manager is very important for Wayland, and I'm
not sure how much I trust any of the existing ones to get it right. But
with Wayfire I'm hoping it'll at least be easy enough to implement stuff
like tagged/coloured windows for the proof of concept (since the
plugin API and stuff is Wayfire's niche), and I'm hoping at some point
somebody comes up with a security-focused Wayland window manager we can
switch to -- I'd love a Rust one, and there's work going on in that
area[1].
Not sure about IPC yet, but I recently read an article about PipeWire[2],
and that's been making me think a bit about audio. With PipeWire, they
seem to have cared about security from the start:
> To avoid the PulseAudio sandboxing limitations, security was
> baked-in: a per-client permissions bitfield is attached to every
> PipeWire node — where one or more SPA nodes are wrapped. This
> security-aware design allowed easy and safe integration with Flatpak
> portals; the sandboxed-application permissions interface now promoted
> to a freedesktop XDG standard.
And it gets better! In particular, this sounds very promising:
> a native fully asynchronous protocol that was inspired by Wayland —
> without the XML serialization part — was implemented over Unix-domain
> sockets. Taymans wanted a protocol that is simple and hard-realtime
> safe.
It goes on to say they use this for sending file descriptors and stuff.
The similarity to Wayland is very exciting, because it means we might
just be able to run PipeWire over the existing virtio_wl infrastructure
very efficiently.
It'll be a while before I get to look at audio in depth, but this all
sounds very good -- maybe most of the work will have been done for us!
In general I'm feeling very optimistic about a lot of the stuff going on
in the ecosystem to try to make Flatpak and co secure -- I don't trust
Flatpak itself to provide meaningful security, but it means we're
getting standard mechanisms for permissions for standard applications
(xdg-desktop-portal is another that comes to mind), and if this goes
well it means that all we have to do is provide implementations of those
standard interfaces that cross VM boundaries, and applications designed
to work in Flatpak etc. should already understand how to interact with
them.
[1]: https://smithay.github.io/pages/about.html
[2]: https://lwn.net/SubscriberLink/847412/f5595d3e8875ce5d/
I've made a bit of progress this week:
It turns out that weston-terminal crashes sommelier if started when
the clipboard is empty, due to trying to dereference NULL. I've
patched it to fix that, and now I can run it directly under sommelier,
without wayfire. I made a few other changes to sommelier too:
- I switched to the latest version, which provides meson instead of
common-mk for building. Also, they removed the demos and got rid of
some bogus dependencies. That simplified the build a lot!
- They switched to the stable XDG protocols, but then reverted it
again. I unreverted it to get things going again. Not sure if I did it
right (they migrated from C to C++ so the patch didn't apply
directly).
- I added xwayland to the VM and sommelier command, allowing X
applications to run in the VM.
- By default sommelier runs the program with an already-open socket,
which doesn't work if the program (or its children) want to open
multiple connections.
I was able to fix that by using `--parent` mode, and getting rid of
PEER_CMD_PREFIX (which just adds some chromium paths preventing it
from working).
- Note: in `--parent` mode it waits for the process to exit before
processing events on the socket, so if you just run an application
directly it will hang. I used `bash -c 'firefox &'` as the command as
a work-around.
- Some programs (e.g. firefox) refused to start because the protocol
versions offered by sommelier were too old. I increased the version
numbers and that's working now. It needs doing properly, though. e.g.
I implemented the new "sl_host_surface_damage_buffer" by simply
calling the old damage function, which is obviously not correct but is
working for me so far!
- Annoyingly, using `--parent` disables xwayland support. Maybe we
should run xwayland manually, or use a second sommelier instance?
In general, sommelier seems quite buggy and annoying. I guess it will
need updating constantly to proxy every new wayland protocol. Yet it
can't add any useful security because it runs inside the VM, and is
therefore untrusted.
Some other changes that I found useful:
- I added the generated kernel modules directory to rootfs, which
allows using all the normal features of Linux (e.g. ext4) in the VM.
- I switched from `bash` to `bashInteractive` as the VM shell, which
gets cursor keys working.
- I wrote a Nix package to generate one script for each of my old
qubes. So e.g. I can now run `qvm-start-shopping` to start my crosvm
shopping instance, with its own /home LVM partition and IP address. It
passes the network configuration using some new kernel parameters
(alongside spectrumcmd).
- I put each VM on its own point-to-point virtual network. These
networks are set up by /etc/nixos/configuration.nix. That works well
for my qubes-like VMs, though I guess spectrum will need something
more dynamic.
- I enabled the shared filesystem (VIRTIO_FS), which works nicely. I
use it to provide a (separate) shared directory to each VM that I can
access from the host.
One problem is that the crosvm driver runs in a minijail with a
uidmap that makes every file appear to be owned by root, so only root
can write things in the VM.
Possibly a newer kernel would help; later versions of the kernel
docs say you can include any normal FUSE flags here, so mounting with
`uid=1000` might work.
- Finally, I added a `vm-halt` command that just calls `reboot`, as I
don't want to develop the habit of typing `reboot` without thinking
;-)
If any of this sounds useful for spectrum let me know. I can try and
tidy it up; it's all a huge mess at the moment!
Once this is working more smoothly, I guess the next issues will be
setting up some kind of secure window manager on the host (e.g.
labelling windows with the VM they come from, not allowing
screenshots, etc). Would also be good to get sound forwarding working
somehow (Qubes routes pulseaudio to all the VMs and gives you a mixer
to control the levels for each, but I don't know how that worked). It
also needs some kind of VM manager to keep track of which VMs are
running. And some kind of IPC system like qrexec would be useful. Do
you have thoughts or plans about how to do any of this?
On Wed, 20 Jan 2021 at 13:04, Thomas Leonard <talex5(a)gmail.com> wrote:
>
> On Thu, 14 Jan 2021 at 12:51, Alyssa Ross <hi(a)alyssa.is> wrote:
> [...]
> > Oh, whoops, I missed your reply about having worked this out already!
>
> Yeah, disk and networking is OK now.
>
> I also managed to fix the fonts, by using `export FONTCONFIG_FILE
> /etc/fonts/fonts.conf`. By default, it didn't have a monospace font
> available, which was pretty hard to read in the terminal.
>
> I want to get wayland forwarding working next. For now, I'm using `ssh
> -Y` to my VM to forward X. It works, but it's a little slow.
>
> I set `export WAYLAND_DEBUG 1`, and tried weston-terminal again. That produced:
>
> [...]
> [446067.157] -> wl_region@21.destroy()
> [446067.481] -> wl_surface@16.set_input_region(wl_region@22)
> [446068.036] -> wl_region@22.destroy()
> [446068.412] -> wl_surface@16.attach(wl_buffer@24, 0, 0)
> [446069.190] -> wl_surface@16.damage(0, 0, 806, 539)
> [446070.141] -> wl_surface@16.commit()
> [446070.531] wl_keyboard@20.keymap(1, fd 8, 48869)
> [ 1.796076] sommelier[88]: segfault at 30 ip 00007fa5376062c0 sp
> 00007ffe128592c8 error 4 in
> libwayland-client.so.0.3.0[7fa537604000+6000]
> [ 1.798026] Code: ff ff ff 5d 41 5c c3 0f 1f 00 48 8d b7 d0 00 00
> 00 e9 e4 df ff ff 0f 1f 40 00 48 89 77 30 c3 66 66 2e 0f 1f 84 00 00
> 00 00 00 <48> 8b 47 30 c3 66 66 2e 0f 1f 84 00 00 00 00 00 8b 47 40 c3
> 66 66
--
talex5 (GitHub/Twitter) http://roscidus.com/blog/
GPG: 5DD5 8D70 899C 454A 966D 6A51 7513 3C8F 94F6 E0CC
>to handle 9P over vsock, but I haven't tested yet. We can use existing
>virtiofsd and 9P software (there are promising Rust implementations of
>each), and harden them against potential vulnerabilities like directory
>traversals using kernel features like RESOLVE_BENEATH and
>RESOLVE_NO_XDEV. For the boot device, maybe there's no reason not to
Also, if the server is in a namespace seeing only a bind mount to the
necessary part of the FS, in a VM that only sees that one FS, the cheap
attacks just become moot. You can probably talk it into traversal, but
it doesn't see more than allowed anyway; talking it into attacking the
VM kernel is hopefully harder (and still has limited impact).
>just mount it using the host kernel, or maybe there's something to be
>gained by just reading a small bootstrap payload into memory from the
>start of the disk once, and then making all future communication go via
>a VM. I'm not really sure yet. But the important thing is we'll have
>mechanisms for all this in place. Maybe we'll decide that non-boot
>devices should just go over inter-VM 9P, but in any case, we'll still
>need all these pieces.
Can virtiofs eventually be backed by a VM-wrapped vhost-user?
Although we probably do want host-side page cache, as a VM's requests to
the host are way more transparent for the scheduler than inter-VM requests.
>computers I've tried it on so far. I suspect that I will get GPU
>isolation working, but I'm not sure how reliable or performant it will
>be.
Hmm. It's also a good question what the timeslice for inter-VM
communication is. Does it make sense to have two VMs alternate for
slices of ten milliseconds? That's just about what would be needed for
25fps video playback...
>I'm pushing quite hard to make it over the line with my hardware
>isolation funding milestone. I'm so close, and I'm about to need the
>money. But once I've hit that, I think I'm going to need a break. This
>stuff is gruelling.
I wish you strength for this push!
Last week I wasn't feeling well, so there was no This Week in Spectrum.
crosvm
------
Where we left off, I had been attempting to port vhost-user-net support
from cloud-hypervisor to crosvm. I'd been trying to port the first
incarnation of the code in cloud-hypervisor to the contemporary version
of crosvm from when it was added, thinking that that would be easier
because the two codebases would be closer together. But I ran into the
problem that
this earliest incarnation of the vhost-user-net code from
cloud-hypervisor didn't actually work (at least with the backend I was
attempting to test it with). I'd been attempting to figure out exactly
which changes were required to make it work, but hadn't been successful
with that yet, and I thought I'd probably need to start the port over,
from the latest cloud-hypervisor and crosvm code.
The next day, I decided to give my previous strategy one more try,
though, and an hour or two later, I found the required cloud-hypervisor
change, applied it to crosvm, and it worked! So I now have a crosvm
tree capable of vhost-user-net[1].
This means that it's looking good for my plans for inter-guest
networking, and network hardware isolation. With that in place, I
decided to start thinking about other kinds of hardware isolation and
inter-VM communication, and that's what I did for most of the last two
weeks. Let's go through them:
Files will be shared between VMs using virtio-fs. This has the
unique feature of (soon) being able to bypass guest page caches, and
have only a single shared cache between VMs. This brings a performance
improvement, but as I understand it, should also reduce memory
consumption because each VM won't have to maintain its own copy of a
disk-backed page. Of course, this feature (DAX) is also a big side
channel, so it won't be appropriate for all use cases. But I think for
some things people want to do with Spectrum, this will be very
important.
The problem with this is that, because it uses the page cache of the
host kernel, the host has to know about the filesystem that's being
shared -- there's no running virtiofsd in a VM if we want DAX. But I'd
really like it if a (non-boot) block device could be used as a
filesystem without the host having to actually talk to the device. I
was stuck here, but edef pointed out to me that we could use the
kernel's 9P support to attach the block device to a VM, and then
mount the filesystem on the host over 9P, either over a network
connection or (ideally) vsock. It looks like the kernel should be able
to handle 9P over vsock, but I haven't tested yet. We can use existing
virtiofsd and 9P software (there are promising Rust implementations of
each), and harden them against potential vulnerabilities like directory
traversals using kernel features like RESOLVE_BENEATH and
RESOLVE_NO_XDEV. For the boot device, maybe there's no reason not to
just mount it using the host kernel, or maybe there's something to be
gained by just reading a small bootstrap payload into memory from the
start of the disk once, and then making all future communication go via
a VM. I'm not really sure yet. But the important thing is we'll have
mechanisms for all this in place. Maybe we'll decide that non-boot
devices should just go over inter-VM 9P, but in any case, we'll still
need all these pieces.
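As an illustration of the kind of hardening I mean, here's a minimal
sketch of opening a path with openat2(2) so that lookups can't escape
the shared directory or cross onto another filesystem. It assumes the
Rust libc crate; the struct layout and RESOLVE_* values are copied from
linux/openat2.h, and none of this is code from virtiofsd or any 9P
server:
    // Sketch: open `path` relative to `dirfd`, refusing ".." escapes,
    // absolute symlinks, and mount-point crossings.  The struct and the
    // RESOLVE_* values mirror <linux/openat2.h>; error handling is minimal.
    use std::ffi::CString;
    use std::io;
    use std::os::unix::io::RawFd;

    #[repr(C)]
    struct OpenHow {
        flags: u64,
        mode: u64,
        resolve: u64,
    }

    const RESOLVE_NO_XDEV: u64 = 0x01; // don't cross mount boundaries
    const RESOLVE_BENEATH: u64 = 0x08; // don't escape the dirfd subtree

    fn open_beneath(dirfd: RawFd, path: &str) -> io::Result<RawFd> {
        let path = CString::new(path)?;
        let how = OpenHow {
            flags: libc::O_RDONLY as u64,
            mode: 0,
            resolve: RESOLVE_BENEATH | RESOLVE_NO_XDEV,
        };
        let ret = unsafe {
            libc::syscall(
                libc::SYS_openat2,
                dirfd,
                path.as_ptr(),
                &how as *const OpenHow,
                std::mem::size_of::<OpenHow>(),
            )
        };
        if ret < 0 {
            Err(io::Error::last_os_error())
        } else {
            Ok(ret as RawFd)
        }
    }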
GPU isolation should be possible by forwarding the GPU to a VM, but
there are a few problems here. The first is that it would mean rendered
surfaces have to be copied via shared memory to the VM with the GPU,
before being sent to the GPU. Additionally, sharing the GPU between VMs
for rendering at all would require significantly more work. The result
of this is that graphics performance using an isolated GPU will probably
be poor, at least for now. The final problem is that passthrough of
integrated GPUs seems to be very difficult to get right. I will
probably need to acquire some hardware that I've seen a report of this
working on, so I can figure out what I've been doing wrong on the two
computers I've tried it on so far. I suspect that I will get GPU
isolation working, but I'm not sure how reliable or performant it will
be.
For generic USB devices, I expect to be able to take an approach similar
to Qubes[2], having a VM to handle interactions with the hardware USB
controller, and exposing individual USB devices over USB/IP to other
VMs. It would be nice if I could use vsock for this too.
[1]: https://spectrum-os.org/git/crosvm/?h=vhost-user-net
[2]: https://www.qubes-os.org/doc/usb-devices/
spectrum-os.org
---------------
Philipp registered a Matrix room and bridged it to the #spectrum IRC
channel. I'm told that this should make it easier for Matrix users to
join the room, since some bug in Matrix's IRC bridge prevents people
from joining from Matrix the usual way. Philipp also sent a patch[3] to
improve the instructions for Matrix users joining the channel on the
website. Thanks Philipp!
[3]: https://spectrum-os.org/lists/archives/spectrum-devel/87wo247zu7.fsf@alyssa…
QEMU
----
I sent the previously requested patch[4] to resolve ambiguities in the
vhost-user spec. No response yet, though. I'll probably resend it some
time soon.
[4]: https://lore.kernel.org/qemu-devel/20200813094847.4288-1-hi@alyssa.is/
I'm finding it hard to keep going at the moment. The stuff I'm doing
now is probably the hardest part of implementing Spectrum, and it's
frustrating to realise that not everything I want to do is going to be
possible. So much of the KVM ecosystem assumes that things will be
host<->guest, and there's not always an easy solution. But, whatever we
end up with, it's going to be a lot better than what I'm using today,
and what lots of other people are using today. I think I'm going to be
able to deliver a good experience with a fairly high degree of
protection against malicious hardware. But it's not going to be
perfect.
I'm pushing quite hard to make it over the line with my hardware
isolation funding milestone. I'm so close, and I'm about to need the
money. But once I've hit that, I think I'm going to need a break. This
stuff is gruelling.
Last week, I'd just finished getting the cloud-hypervisor vhost-user-net
frontend code to build as part of crosvm, and the next step was testing
it.
crosvm
------
I wrote some hacky code that replaced the virtio-net device creation in
crosvm with an instance of the ported vhost-user-net code. When I
booted crosvm, there were some of the expected simple oversights of mine
that needed to be addressed, but once those were taken care of, it still
didn't quite work. The VM boots, sees a network interface, and even
communicates with the vhost-user-net backend! But the vhost-user-net
code never realises/gets told that it has traffic, and so the traffic is
never processed. Unsure of what to do about this,
I decided to turn to cloud-hypervisor and look at how the code ran
there.
cloud-hypervisor
----------------
I wanted to try running the cloud-hypervisor v-u-n backend I was using
for testing (because it's much simpler than DPDK -- it just sends
traffic to a TAP device) with QEMU as the frontend, because QEMU is a
VMM I'm much more familiar with than cloud-hypervisor, and I thought it
would be useful to have a working frontend/backend combination to
compare to.
I had some problems, though, because apparently nobody had ever wanted
to use QEMU with the cloud-hypervisor vhost-user-net backend before --
or if they had, they hadn't wanted to enough to make it work. The
cloud-hypervisor backend didn't implement the vhost-user spec correctly
in a few subtle ways that made it incompatible with QEMU. I won't
explain every subtle issue, but I ended up writing a few patches[1][2]
for cloud-hypervisor and the "vhost" crate it depends on (that is in the
process of being moved under the rust-vmm umbrella).
One interesting issue I will go into a little detail of was that the
wording in the spec was a little unclear, and QEMU interpreted it one
way, and cloud-hypervisor the other. I ended up sending an email[3] to
the author of the spec asking for clarification. He answered my
question, and we discussed how the wording could be improved. He liked
my second attempt at improving the wording, and asked me to send a patch,
but preferably not right now, because QEMU is currently gearing up for a
release, scheduled for next week if everything goes well.
Since I wrote these cloud-hypervisor patches, and had to test them, I
ended up having to learn how to use cloud-hypervisor anyway to make sure
I hadn't broken it in fixing the backend up to work with QEMU. Oh well.
Once this was done, I could use both QEMU and cloud-hypervisor with the
backend, but not crosvm. But it was a little more complex than that.
When I ported the v-u-n code to crosvm, I ported the first version of it
that was added to the cloud-hypervisor tree, rather than the latest
version. The theory here was that the earlier version would be closer
to crosvm, because cloud-hypervisor would have had less time to diverge.
Then, once I had that working, I could add on the later changes
gradually. What I didn't account for here is that the initial version
of the v-u-n frontend in cloud-hypervisor didn't really work properly,
and needed some time to bake before it did. So having now had this
experience I think it might be better to try to port the latest version,
and accept that porting might be a bit harder, but the end result is
more likely to work.
[1]: https://github.com/cloud-hypervisor/vhost/pull/22
[2]: https://github.com/cloud-hypervisor/cloud-hypervisor/pull/1565
[3]: https://lore.kernel.org/qemu-devel/87sgd1ktx9.fsf@alyssa.is/
libgit2
-------
While bisecting cloud-hypervisor to see if I could figure out when the
v-u-n frontend started working properly, I encountered a large section
of commits that I couldn't build any more, because Cargo couldn't
resolve a git dependency. The dependency was locked to a commit that
was no longer in the branch it had been in when the cloud-hypervisor
commit was from. Despite knowing the exact commit it needed, Cargo
fetched the branch the commit used to be on. This is because it is
generally not possible to fetch arbitrary commits with git. Some
servers, like GitHub, do however allow this, and I wondered why Cargo
wouldn't at least fall back to trying that.
As it turns out, it actually couldn't do that, though! Cargo uses
libgit2, and libgit2 doesn't support fetching arbitrary commits. So I
wrote a quick patch to libgit2 to support this[4]. It's only a partial
implementation, though, because I don't find libgit2 to be a
particularly easy codebase to work in (although it's better than git!).
So I'm hoping somebody who knows more about it than me will help me
figure out how to finish it.
[4]: https://github.com/libgit2/libgit2/pull/5603
Next week, I'm hoping that I'll be able to get vhost-user-net in
crosvm working. I think this will probably mean porting the code again,
using the latest version. Which is a bit of a shame, but at least I
have an idea of what to do next.
I am, overall, feeling pretty optimistic, though. I'm pretty confident
that we can get some sort of decent but imperfect network hardware
isolation even though virtio-vhost-user might not be ready yet, which
was something I was worried about before. I don't want to really go
into detail in that now though because this is already a long email and
it's already a day late because I was tired yesterday, but essentially,
we could forward the network device to a VM that would run the driver,
and forward traffic back to the host over virtio-net. The host could
handle this either in kernelspace or userspace with DPDK, but the
important thing is that the only network driver it would need to support
would be virtio-net. No talking to hundreds of different Wi-Fi cards
and hoping that none of the drivers have a vulnerability. So, not
perfect compared to proper guest<->guest networking, but a step in the
right direction, and one that should be as simple as possible to upgrade
to virtio-vhost-user once that becomes possible.
DPDK
----
Last week, I'd just figured out how to do a normal vhost-user setup with
a QEMU VM connected to DPDK. This week, I wanted to try to move DPDK
into another VM using the experimental virtio-vhost-user driver, taking
the host system out of the networking equation altogether.
In theory this should have been a very simple change, but I couldn't get
it to work. DPDK claimed to be forwarding packets to the ethernet
device I'd attached to the backend VM (the one running DPDK), but
networking in the frontend VM (what you might think of as the
application VM) didn't work at all. It tried and failed to do DHCP, and
so couldn't progress beyond that.
A breakthrough came when I thought to look at the logs of my local DHCP
server. I saw that it was actually receiving requests from the VM, and
assigning it an IP address. Once I realised this, I hypothesised that
outgoing traffic was working, but not incoming.
Finally having something to look for, I had a look through the DPDK
virtio-vhost-user driver[1], and my suspicion was confirmed in an
unexpected way. It looks like incoming traffic (from the perspective of
the virtio-vhost-user frontend) is not actually implemented at all!
But with outbound traffic working, I'm confident enough that I
understand virtio-vhost-user to be able to leave this here for now.
From Spectrum's side, I can now be pretty sure that everything
should be workable, so we can just wait a bit for virtio-vhost-user to
get a bit further along and then revisit it. And since the frontend
has no idea it's talking to virtio-vhost-user instead of normal
vhost-user, we can use normal (host-based) vhost-user for now, and drop
virtio-vhost-user in down the line.
A couple of outstanding questions I still don't know the answer to about
DPDK are:
- How will routing work if I have multiple frontend VMs with multiple
virtio-vhost-user connections all wanting to use the same network
device? Will I want to use something like Open vSwitch[2] for that?
- DPDK by default uses a busyloop to check for data to process, for
efficiency. This is obviously not appropriate for a
workstation-focused operating system. There is an interrupt-based
mode, though; I don't know how to use it yet.
Since I consider the concept proven, though, I'm going to punt on these
for now. The longer I leave these questions, the more likely it is that
a kernel driver for virtio-vhost-user will emerge and we can use that
instead. That's not to say I want to leave inter-guest networking
hanging forever, but I have other inter-guest networking bits I can
switch focus to for now, and once those are done I can revisit the
virtio-vhost-user backend situation.
[1]: https://github.com/ndragazis/dpdk-next-virtio/blob/2d60e63/drivers/virtio_v…
[2]: https://www.openvswitch.org/
crosvm
------
I started integrating the vhost-user-net code from Cloud Hypervisor into
crosvm. I'm at the point where I can get all the copied Cloud
Hypervisor code to compile in crosvm, which is pretty good! I have not
yet written the code to actually start one of these devices, though, so
I haven't been able to test it.
It's been interesting to look at Cloud Hypervisor because it's a
codebase that is heavily based on crosvm (even more so than Firecracker
is), but that has also evolved and diverged from it. It's especially
interesting to see stuff where parallel evolution occurred between the
crosvm and Cloud Hypervisor codebases, or when Cloud Hypervisor changed
how some crosvm code worked, and then later changed it back again.
The codebases were still similar enough that I could have the
cloud-hypervisor device integrated into the crosvm codebase in a day,
although there's lots of code duplication that will have to be dealt
with -- I copied over a bunch of supporting code rather than trying to
integrate it into the crosvm equivalents to get the code running for the
first time in an environment as similar as possible to the one it was
designed for. I expect that when I test the device in crosvm it'll
probably work fairly quickly if not first try. The more complicated
part will be a bit of a change to how crosvm does guest memory that
isn't strictly necessary but is important for security.
crosvm allocates all guest memory in a single memfd. This means that,
to share guest memory with another process, like when using vhost-user,
the only option is to share all of guest memory. This would sort of
defeat the purpose of hardware isolation in Spectrum! But from what I
could tell -- I'm not 100% on this -- the guest memory abstraction in
cloud-hypervisor is more advanced, and I think it might support multiple
memfds backing guest memory for this sort of thing. I'll have to adapt
crosvm to that model to be able to use vhost-user securely.
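To make the idea concrete, here's a rough sketch (not crosvm's actual
guest memory code, and assuming the Rust libc crate) of backing each
guest memory region with its own memfd, so that a vhost-user device
process could be handed only the fd for the region it actually needs:
    // Sketch: one memfd per guest memory region.  Only the fd for the
    // region a vhost-user backend needs would be sent over the vhost-user
    // socket (with VHOST_USER_SET_MEM_TABLE); the rest stay private.
    // Error handling is simplified.
    use std::ffi::CString;
    use std::io;
    use std::os::unix::io::RawFd;

    struct Region {
        fd: RawFd,
        addr: *mut libc::c_void,
        size: usize,
    }

    fn create_region(name: &str, size: usize) -> io::Result<Region> {
        let name = CString::new(name)?;
        unsafe {
            let fd = libc::memfd_create(name.as_ptr(), libc::MFD_CLOEXEC);
            if fd < 0 {
                return Err(io::Error::last_os_error());
            }
            if libc::ftruncate(fd, size as libc::off_t) < 0 {
                return Err(io::Error::last_os_error());
            }
            let addr = libc::mmap(
                std::ptr::null_mut(),
                size,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_SHARED,
                fd,
                0,
            );
            if addr == libc::MAP_FAILED {
                return Err(io::Error::last_os_error());
            }
            Ok(Region { fd, addr, size })
        }
    }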
website
-------
The new "Bibliography" page is up[3]! Lots of links to relevant resources
about concepts important to Spectrum. :)
[3]: https://spectrum-os.org/bibliography.html
It's a bit of a relief to have returned from the uncertain world of DPDK
to the familiar territory of crosvm. I'm confident that the next bit of
work here (vhost-user in crosvm) won't be that much of a big deal.
Hopefully, we'll have interim networking working to a reasonable degree
fairly soon. After that, I plan to look at file sharing, possibly with
vhost-user-fs (virtio-fs over vhost-user), which I noticed
cloud-hypervisor implements today. That should be pretty similar to the
networking stuff, although I don't think any virtio-fs virtio-vhost-user
code exists at the moment.
Feels like there's a pretty clear path forward. Nice feeling. :)
QEMU
----
Last week, as I wrote TWiS, I had just discovered virtio-vhost-user,
which looked like a very promising mechanism for getting a VM to take
care of networking for other VMs. This week, I've been researching it
further, and trying to test and evaluate it.
The first thing I tried to do, naturally, was to build the patched QEMU
tree and boot a VM with a virtio-vhost-user device attached. This was
not as easy as I'd hoped, because adding the virtio-vhost-user device to
my QEMU command line made the VM kernel panic at boot, with an error
message about an invalid memory access. I spent most of the week trying
to figure this out -- I wasn't doing anything different to the
example[1] on the QEMU wiki, so it should have worked, and it felt like
if I could just get past whatever was going wrong here, it would be
worth it, because virtio-vhost-user otherwise seems so suited for what
we need here. I emailed the patch author[2], but he didn't know what
was up either.
An early breakthrough came when I got frustrated with kernel builds
taking hours on my 8-year-old laptop, and so decided to work on a more
powerful computer instead. Once I got everything set up on that
computer, I started up the VM, and it worked. Perhaps in setting it up
over here I'd done something different? I copied over the exact VM
disk/kernel/initrd/command line that I was running on my laptop, and the
other computer booted it just fine. I had -cpu host in the QEMU command
line, so I thought maybe the different kind of virtual CPU was causing
it. Tried setting it to a specific value on both machines, and still
the laptop VM panicked and the other didn't. So it sounded like whether
it worked or not depended on the host hardware.
I put together a Nix derivation that would automatically build the
custom QEMU and output a script that would run a VM, and then asked
people in #spectrum to test it out on various computers. After getting
some further data, a pattern started to emerge, where Intel processors
Ivy Bridge and older would fail, and Skylake and newer would succeed (I
didn't encounter any AMD processors that failed, nor did I have data at
the time for generations between Ivy Bridge and Skylake). This theory
had a convenient explanation for why nobody else had seen this problem
-- I doubt people at Red Hat are working on 7-year-old hardware.
This was a good clue, but still didn't put me much closer to having a
working system. I do have a more recent laptop around, but for reasons
that are out of scope here it would be very inconvenient to decide to
just move over to it. I could see that the kernel was panicking the
first time it tried to access the PCI BARs of the virtio-vhost-user device,
which led me to believe that the problem was probably in how that memory
was being set up. I found the function that did that[3], and stared at
it for a long time. I tried to read the rest of the QEMU code, but it
became clear that my domain knowledge here isn't good enough to be able
to keep track of what's meant to be happening. I added some debug
prints, which were vaguely helpful in making that understanding a little
better.
I was hoping to find the guest address each PCI BAR was mapped to so
that I could check the kernel was trying to write to the right location,
but didn't manage to do that. While attempting to, though, I did add a
debug print that printed the size of each PCI BAR as it was allocated.
I noticed that most were small -- 16 MiB at most, but one was huge, at
64 GiB! The code that allocated this BAR was part of the function I'd
been staring at. As far as I could tell, the choice of size was pretty
arbitrary -- this big memory region was used as backing memory for all
sorts of small objects on the fly. On a whim, I tried changing the BAR
size from 1ULL << 36 to 1ULL << 26, and recompiled QEMU. The VM booted.
The comment above the bar_size definition that I'd been looking at for
so long said:
/* TODO If the BAR is too large the guest won't have address space to map
* it!
*/
I don't know if that's exactly what went wrong here, though. I suspect
it's more like the host architecture doesn't have enough address space?
The affected machines all reported 36 bit physical address size, and 48
bit virtual address size -- and 1ULL << 36 is 64 GiB, which is exactly
the whole of a 36-bit physical address space. So maybe what's happening
is that the processor interprets PCI addresses in the hardware-assisted
VM as physical addresses, and therefore runs out of space because all of
it is taken up by this one PCI BAR? I'm not really sure. Lowering the
BAR size to 2^35 or 2^34 (has to be a power of two) depending on the
QEMU version made the problem go away, and that's good enough for now.
I'm not very enthusiastic about this up-front allocation of a huge
amount of memory that might not even fit in the available address space.
I don't know if there's a better way of doing it in this case, but I
certainly hope so. In general I think this perhaps demonstrates why
this code is not considered suitable for "production" yet. The bet I'm
taking here is that by the time Spectrum is further along, things will
have moved on for virtio-vhost-user too. As I said, at some point we
will want to implement it in crosvm to avoid having QEMU in the TCB, but
it would be a bad idea to do that now while virtio-vhost-user is still
going through the back-and-forth of making its way into the Virtio spec.
[1]: https://wiki.qemu.org/Features/VirtioVhostUser
[2]: https://lore.kernel.org/qemu-devel/87h7u1s5k1.fsf@alyssa.is/T/#u
[3]: https://github.com/ndragazis/qemu/blob/f9ab08c0c8/hw/virtio/virtio-vhost-us…
DPDK
----
Once I was able to boot a VM with the virtio-vhost-user device, I tried
to connect another QEMU VM to it through vhost-user -- I'll want to have
this working first as a reference before I start porting Cloud
Hypervisor's vhost-user implementation to crosvm. But the "frontend"
(vhost-user) QEMU process hung waiting for a reply on the vhost-user
socket from the backend one. Not really knowing what to do about this,
I decided that maybe I'd been a bit too ambitious in going straight for
vhost-user <-> virtio-vhost-user when I'd never actually used vhost-user
before, so maybe I should try a more conventional vhost-user setup
first.
As far as I can tell, vhost-user is usually used for connecting a VM to
a userspace networking stack. And usually, this networking stack is
DPDK, the "Data Plane Development Kit"[4]. DPDK was also used in the
virtio-vhost-user examples, so I figured my next step would be to try
it there as well, and therefore it was worth the time to learn how to
do a very basic setup with it.
Quick-start-style documentation for this was pretty lacking, but I did
eventually manage to make this work. Here's what I did, for my own
future reference as much as anything else:
(1) Make some hugepages available. 1GiB for DPDK and 1GiB for QEMU:
echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
(2) Take my ethernet interface offline so it could be used with DPDK:
nmcli d disconnect enp0s25
(3) Load the vfio-pci module, which allows PCI devices to be exported to
userspace rather than managed by the kernel:
modprobe vfio-pci
(4) Export the ethernet interface:
usertools/dpdk-devbind.py -b vfio-pci enp0s25
(5) Run testpmd, a program that comes with DPDK mostly used for
debugging and tracing it seems, but that with no special arguments
acts as a simple packet forwarder. Here I create a vhost-user
socket, and forward traffic between vhost-user and my ethernet
interface:
build/app/dpdk-testpmd -l 0,1 -w 00:19.0 \
--vdev net_vhost0,iface=/run/vhost-user0.sock
The -w value is the PCI address of the ethernet interface. Note how
"00:19" corresponds to "p0s25". (19 in hex is 25 is decimal.)
(6) Start a VM. The relevant QEMU flags appear to be:
-chardev socket,id=char0,path=/run/vhost-user0.sock \
-netdev type=vhost-user,id=net0,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=net0 \
-object memory-backend-file,id=mem0,size=1024M,mem-path=/dev/hugepages,share=on \
-numa node,memdev=mem0 \
-mem-prealloc
I figured this all out mostly from a guide for a DPDK benchmark[5]. I
have not yet experimented with variations on the QEMU flags. I'm
not sure if all the memory flags are required -- -mem-prealloc might
just be there because it was important for a benchmark, for example.
So this is the point I'm at with this exploration. Next up, I'll be
trying with DPDK inside a VM with a virtio-vhost-user device. I think
that maybe, despite the virtio-vhost-user device showing up as an
ethernet device inside the VM, it needs some special support which is
available for DPDK as a patchset, but that has not been written for the
kernel yet. I was a bit worried about this, because unlike the kernel,
DPDK isn't going to have drivers for all sorts of different Wi-Fi
hardware, and so using DPDK instead of the kernel network stack would be
a problem. But then I learned that DPDK has a component called the
Kernel NIC Interface (KNI) which allows it to use network
interfaces from the kernel, so a hybrid approach would be possible, and
is what I think we'll end up using for now. Then, once
virtio-vhost-user is a bit more mature, a kernel driver will probably
show up, and we can use that instead and drop DPDK.
[4]: https://dpdk.org/
[5]: https://doc.dpdk.org/guides/howto/pvp_reference_benchmark.html?highlight=pvp
Website
-------
I was having a conversation about Spectrum yesterday, and I found myself
sending over a bunch of links to articles and papers that I often find
myself referring to when talking to somebody about Spectrum. This made
me think that maybe there should be some place where we keep all these
relevant articles. So I mined the IRC logs, the TWiS archive, and my
blog, and added whatever I could pull from my brain, and wrote a
Spectrum bibliography, containing 27 links to interesting articles and
papers that are particularly relevant to Spectrum.
This isn't on the website quite yet, but I did send this as a patch[6]
to the mailing list, if you want an early look.
I also posted a patch to fix a minor issue where I'd mistakenly used
".." instead of "." as href values, to no user-visible effect[7].
[6]: https://spectrum-os.org/lists/archives/spectrum-devel/20200726045701.32259-…
[7]: https://spectrum-os.org/lists/archives/spectrum-devel/20200726055410.20641-…
Documentation
-------------
On Monday, I had a call with the Free Software Foundation Europe.
They're a part of NGI Zero (where my funding comes from), and they are
promoting their new "REUSE" specification[8] for license information in
free software projects to NGI Zero projects. It basically covers
standardised per-file license and copyright annotations, and a standard
way of including license texts.
I think this is really cool! It's something I've been unsure of how to
handle because it's all vague conventions that are different in
different circles, and it's nice to see something formalised about it.
They also have an automated tool[9] for checking compliance and
semi-automatically adding license information, which is great!
So I'm enthusiastically adopting the REUSE specification. I decided
that our smaller, first-party repositories (the documentation, the
website, etc.) would be a good place to get started, and so I posted a
patch[10] that makes the documentation repository REUSE-compliant.
[8]: https://reuse.software/
[9]: https://git.fsfe.org/reuse/tool
[10]: https://spectrum-os.org/lists/archives/spectrum-devel/20200726105527.27432-…
mktuntap
--------
I posted a patch[11] to make mktuntap REUSE-compliant.
[11]: https://spectrum-os.org/lists/archives/spectrum-devel/20200726110123.30159-…
The thing that's most on my mind this week is the extent to which I'm
learning about and working on software like QEMU and DPDK that I don't
see having a place in Spectrum in the long run. It's counterintuitive,
but this is definitely worth it. There's no point writing a kernel
driver for virtio-vhost-user (should such a thing be required) right
now, because if I use DPDK for now instead, at some point either
virtio-vhost-user will end up not being the thing that gets adopted by
the ecosystem and we'll have to move to something else, or (more likely)
it gets widely adopted and somebody else writes a kernel driver.
Similarly, using QEMU for network VMs is the smart choice even though
I don't want it to end up in the TCB, because even though I'm probably
going to end up implementing virtio-vhost-user in crosvm later, swapping
out QEMU is going to be so easy later that it would be a very bad idea
to implement that now in case virtio-vhost-user doesn't take off. But
it still /feels/ weird to be using QEMU for this stuff, you know?
This has been a week of thinking I wanted to do one thing, not being
sure how to do it, and finding out that there was a better way. I'll
write it up in the order it happened.
crosvm
------
Last week, I described that I wanted to implement a virtio proxy to be
able to allow a kernel in an application VM to use a virtual device in
another VM. I was wondering how to manage virtio buffers, and thought
that I probably wanted an allocator to be able to manage throwing
buffers of different sizes around.
This turned out to be a case of the XY problem[1]. I couldn't find a
good solution, but it turned out that an allocator wasn't what I wanted
anyway. edef pointed out that I could just make the shared memory I
allocated as big as necessary to hold buffers of the maximum size I
wanted to support. The kernel will only actually allocate pages as they
are written to, and I could use fallocate[2] with FALLOC_FL_PUNCH_HOLE
to tell the kernel it can drop pages when I'm done with them. This
would mean that an unusually large buffer would only take up lots of
memory while it was in use, and as soon as it was done with, the kernel
would be able to take back the memory. So exactly what I wanted from an
allocator, but with no need for an allocator at all!
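For illustration, here's a minimal sketch of the hole-punching part
(assuming the Rust libc crate; in practice the offset and length would
be page-aligned):
    // Sketch: give the pages backing a finished buffer slot back to the
    // kernel, without shrinking the file or disturbing the mapping.
    use std::io;
    use std::os::unix::io::RawFd;

    fn release_buffer(memfd: RawFd, offset: u64, len: u64) -> io::Result<()> {
        let ret = unsafe {
            libc::fallocate(
                memfd,
                libc::FALLOC_FL_PUNCH_HOLE | libc::FALLOC_FL_KEEP_SIZE,
                offset as libc::off_t,
                len as libc::off_t,
            )
        };
        if ret < 0 {
            Err(io::Error::last_os_error())
        } else {
            Ok(())
        }
    }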
This made the implementation much simpler, and by Friday I was able to
get the proxy into a state where it could pass unit tests that
transported messages in both directions through it.
And then it was suggested to me that maybe a virtio proxy is not what I
want after all.
The main disadvantage to a virtio proxy is that it requires context
switching to the host to send data between VMs. This is a trade-off I
was aware of, but a virtio proxy is pretty straightforward to write as
inter-VM communication systems go, and I was not aware of anything else
that would be up to the job. As it turns out, there is something.
vhost-user is a mechanism for connecting, say, a virtio device to a
userspace network stack in a performant way. I was aware of this, but
what I was not aware of was virtio-vhost-user[3]. virtio-vhost-user is
a proposed mechanism to allow a VMM to forward a vhost-user backend to a
VM. This means that two VMs could directly share virtqueues, with no
host copy step. This would mean there would be no opportunity for the
host to mediate communication between two guests, but that wasn't really
on the cards anyway -- if it's ever required, a virtio proxy would
probably be the way to go. For all the other cases, virtio-vhost-user
would be a faster, cleaner way of sharing network devices between VMs.
The main problem with virtio-vhost-user is that it's still in its
infancy. There's a patchset[4] implementing it for QEMU that's a couple
of years old, but that has not been accepted upstream. The main blocker
for this seems to be first standardising it in the Virtio spec[5][6]. The
good news here is that the standardisation process seems to be
progressing actively at the moment. It's being discussed on the
virtio-dev mailing list basically right now, with the most recent emails
dated Friday (unfortunately, I don't know of a good web archive with
virtio-dev, but you can find the thread on Gmane if you're interested
but not subscribed to the list).
The good news is that virtio-vhost-user mostly works by composing things
that already exist. There's no kernel work required, because devices
are just exposed by the VMM as regular virtio devices. The frontend VM
(i.e. the one that uses the virtual device, as opposed to the one that
provides it) doesn't need any special virtio-vhost-user support, because
it just needs to speak normal vhost-user. Only the backend VM needs
support for virtio-vhost-user, because its VMM needs to expose the
vhost-user backend from the host to that VM.
This means that provisionally using virtio-vhost-user in Spectrum
actually looks very feasible, with a couple of compromises. For
evaluation purposes, it's not worth writing a virtio-vhost-user device
for crosvm. But, the VMs that need that device are the ones that are
very specialised -- VMs that manage networking or block devices or
similar. So for these VMs, for now, we could use QEMU, with the
virtio-vhost-user patch. I investigated what it would take to port it
to the most recent QEMU version, and the answer appears to be "not much
at all". Obviously having two VMMs in the Trusted Computing Base (TCB)
isn't something we'd want in the long term, but it would be fine for,
say, reaching the next funding milestone. If we decide that
virtio-vhost-user is the way to go after all, support in crosvm can be
added then -- in general, adding a new virtio device to crosvm isn't a
huge undertaking.
Earlier, I said that the application side of the communication doesn't
need anything special, because to that it's just regular vhost-user.
This is true, but I glossed over there that crosvm doesn't actually
implement vhost-user. Implementing vhost-user in crosvm would probably
be a big deal at this stage, and not something I feel would be a good
use of my time. BUT! Remember, crosvm has two children: Amazon's
Firecracker[7], which aims at so-called "serverless" computing; and
Intel's Cloud Hypervisor[8], which aims at traditional, full system
server virtualisation. Both of these children inherited the crosvm
device model from their parent, and Cloud Hypervisor implements
vhost-user[9].
So I _think_ it should be possible to pretty much lift the vhost-user
implementation from Cloud Hypervisor, and use it in crosvm. Pretty
neat!
So, the setup I'd like to evaluate is QEMU with the virtio-vhost-user
patch on one side, and crosvm with Cloud Hypervisor's vhost-user
implementation on the other.
It might well be that there are complications here. If there are, I'll
probably just finish the proxy and move on for now, because I want to
keep up the pace. I do think that virtio-vhost-user is probably the
way to do interguest networking in the long-term, though.
Another thing that I've realised is that I don't need to worry about
pulling bits out of crosvm to run in other VMs. I focused a lot on that
towards the beginning of the year, mostly motivated by Wayland, because
the virtio wayland implementation in crosvm is the only one there is.
Now that that works in a different way, though, there's no need to
continue down this path, because things like networking can be done in
more normal ways through virtio and the device VM kernel.
[1]: https://en.wikipedia.org/wiki/XY_problem
[2]: https://man7.org/linux/man-pages/man2/fallocate.2.html
[3]: https://wiki.qemu.org/Features/VirtioVhostUser
[4]: https://github.com/stefanha/qemu/compare/master...virtio-vhost-user
[5]: https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg03082.html
[6]: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.…
[7]: https://firecracker-microvm.github.io/
[8]: https://github.com/cloud-hypervisor/cloud-hypervisor
[9]: https://github.com/cloud-hypervisor/cloud-hypervisor/blob/b4d04bdff6a7e2c3d…
Overall, it's been frustrating for me to try things, and discover
they're not going to work, or not going to work as well as some other
thing, and make a call on whether to keep going on something I know is
the worse option or switch to the better thing. I have to keep
reminding myself that Spectrum is a research project, and there are
always going to be false starts like this. Lots of what we're doing is
either very unusual (virtio-vhost-user) or brand new (interguest
Wayland), after all.
After I got an isolated Wayland compositor working last week, I wasn't
really sure what to do next -- this was a big piece of work that I'd
been very focused on for a while. The funding milestone I'm closest to
is to do with implementing hardware isolation, which the Wayland work
was a part of, so I decided to keep going with that, and explore other
types of isolation. More on that in a bit.
Wayland
-------
Posted my patch for virtio_wl display socket support in
libwayland-server[1]. This is what allows it to run in a VM, and
receive connections from clients in other VMs. The patch description is
very extensive, so I recommend reading it for more detail if you're
interested.
It introduces a libvirtio_wl, which should also be useful for porting
other programs that we might want to communicate with across a VM
boundary, if they are written with normal Unix sockets in mind
(including transferring file descriptors). This is the evolution of
code I previously had put in wlroots, moved to Wayland for convenience.
If it ever acquires another user (or maybe even if it doesn't) it might
make sense to make it its own package, since virtio_wl is useful even if
Wayland isn't involved.
[1]: https://spectrum-os.org/lists/archives/spectrum-devel/SJ0PR03MB5581479F3388…
crosvm
------
I pushed all my crosvm changes to get the isolated compositor working to
the work-in-progress "interguest" branch[2]. Remember, I only got it
working last week right before I needed to start writing the TWiS email,
so I hadn't even done that yet! I also posted some patches[3] to the list
to fix a bug in my previous crosvm deadlock fix, and to improve some
related documentation. As usual, these were kindly reviewed by Cole.
Next, I turned my attention to other forms of hardware isolation.
Wayland was a bit special, because despite crosvm including a virtual
"Wayland device", it's not really hardware, and so it required an
approach to isolation that will be quite different to other crosvm
virtual devices. My hope is that other virtual devices should all be
substantially similar to each other.
The basic idea for actual hardware isolation is that rather than having
drivers in the host kernel for USB, network devices, etc. those will be
exposed to dedicated VMs as virtual PCI devices. This should
substantially reduce host kernel attack surface. crosvm virtual devices
will be run in these device VMs, and communicate over virtio with
application VMs as normal. This will require implementing in crosvm a
virtio proxy device, that allows the crosvm running an application
VM to forward virtio communication to the virtual device running in
userspace in the driver VM.
(The reason devices aren't attached to application VMs directly but run
in separate device VMs is that hardware is probably not going to be very
happy if multiple kernels are trying to talk to it at the same time.
Additionally, this indirection means that application VMs only have to
use the one virtio driver for that device category, rather than any of
the hundreds of drivers for different hardware in that category. If one
of those drivers had a vulnerability, this should help to contain it to
the device VM.)
So I started writing this virtio proxy. The basic idea is to copy
virtio buffers from application VM guest memory into memory that can be
shared with the userspace virtual device in the device VM. I can't find
any prior art on this (which is not unusual -- not many systems isolate
drivers in this way), so this has required a lot of looking back at the
virtio paper[4] and spec[5] to make sure I understand what to do here.
As I write this, the next problem to solve is integrating some sort of
memory allocator that can manage buffer allocations in the shared memory
that the virtual device looks at. This is a new area for me that I'd
appreciate advice on if anybody can give it -- think of it like this: I
have a memfd, mmapped into my process, and I would like to dynamically
allocate and release buffers of varying sizes within that region. I'm
sure there's a library I'll be able to plug in for this.
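To make the question concrete, the setup looks roughly like the sketch
below. It's Linux-specific, all the names are mine, and the bump
allocator is only a placeholder for whatever real allocator ends up
managing the region -- it never reuses freed space.

    /* Rough sketch of the situation: a shared memfd, mapped into the
     * process, out of which copies of virtio buffers need to be
     * allocated and released.  The bump allocator is a placeholder. */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION_SIZE (16 * 1024 * 1024)

    static uint8_t *region;
    static size_t next_free;

    static void *buffer_alloc(size_t len)
    {
        len = (len + 63) & ~(size_t)63;   /* keep buffers 64-byte aligned */
        if (next_free + len > REGION_SIZE)
            return NULL;   /* a real allocator would reuse freed space */
        void *p = region + next_free;
        next_free += len;
        return p;
    }

    int main(void)
    {
        int fd = memfd_create("virtio-proxy-buffers", MFD_CLOEXEC);
        if (fd < 0 || ftruncate(fd, REGION_SIZE) < 0)
            return 1;

        /* The same fd would also be shared with (and mapped by) the
         * crosvm running the device VM. */
        region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
        if (region == MAP_FAILED)
            return 1;

        void *buf = buffer_alloc(4096);   /* room for one copied buffer */
        printf("buffer at offset %zu\n", (size_t)((uint8_t *)buf - region));
        return 0;
    }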
[2]: https://spectrum-os.org/git/crosvm/?h=interguest
[3]: https://spectrum-os.org/lists/archives/spectrum-devel/SJ0PR03MB55819DE7E13B…
[4]: https://www.ozlabs.org/~rusty/virtio-spec/virtio-paper.pdf
[5]: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.…
As usual, big thank you to Cole for reviewing patches, and for finding
room for improvement even in languages/areas he isn't familiar with.
It feels nice to have done some thinking about the project at a slightly
higher level than I have been recently, and to know where I am on the
way to the next milestone. Having taken a lot of time away from the
milestone list this year to work on fundamentals, it's good to feel like
I'm getting back on track.
I really didn't want this to be another week where I posted about how I
was still trying to patch Wayland to do virtio_wl, and I am delighted to
have just discovered it's not going to be!
crosvm
------
I realised that emulating accept(2) for the Wayland compositor socket in
the way I'd planned would require some crosvm rework. I want to have a
host proxy program that accepts the connection, then passes the
connection socket to crosvm. I had made it possible to dynamically add
sockets to the crosvm Wl device through the control socket, but this
turned out not to be enough, because crosvm would store virtio_wl sockets
in a BTreeMap<String, PathBuf>, and then use connect(2) to connect to
the socket when asked to by the guest kernel. This works fine for
e.g. connecting to a host Wayland compositor, which is what crosvm was
designed for, but it wouldn't work for opening a connection socket from
accept(2), because you can only connect to a listening socket.
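(To make that constraint concrete: connect(2) against a Unix socket
that isn't listening fails with ECONNREFUSED. Here's a tiny standalone
illustration, with a made-up path:)

    /* connect(2) only succeeds against a socket that is listen(2)ing;
     * a socket that is merely bound (or already accepted) refuses the
     * connection.  The path here is made up. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        strcpy(addr.sun_path, "/tmp/not-listening.sock");
        unlink(addr.sun_path);

        int server = socket(AF_UNIX, SOCK_STREAM, 0);
        if (bind(server, (struct sockaddr *)&addr, sizeof addr) < 0)
            return 1;
        /* Note: no listen(2) call here. */

        int client = socket(AF_UNIX, SOCK_STREAM, 0);
        if (connect(client, (struct sockaddr *)&addr, sizeof addr) < 0)
            perror("connect");   /* ECONNREFUSED */
        return 0;
    }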
So instead, I modified the `crosvm wl add' command to take a file
descriptor pointing to the connection socket. I made crosvm store
sockets as an enum that looks like this:
    enum WaylandSocket {
        // A path to connect(2) to on demand; can be reused repeatedly.
        Listening(PathBuf),
        // An already-connected socket passed in by fd; consumed by one use.
        NonListening(UnixStream),
    }
This way, when it gets asked by the VM to connect to a socket, it can
either connect to a listening socket at its path using connect(2), or
just use the existing file descriptor if it's a non-listening socket. A
NonListening socket will be consumed by a connection, so when the VM
close(2)s it, it'll go away, and on the host side the connection will
finish as expected. Listening sockets can be connected to repeatedly,
as before.
I also added support to `crosvm wl add' for dynamic socket names, so
it's possible to do `crosvm wl add wl-conn-%d', and connections will be
added with names like `wl-conn-0', `wl-conn-1', etc., making it easy to
get unique names for connection sockets. The chosen name is printed by
the command, so the caller knows what name to tell the VM to connect to.
I also found and fixed a bug with the previous crosvm deadlock fix[1].
I had assumed that device_sock.recv(&mut []) would drop a message from
the (SOCK_SEQPACKET) socket, without having to read any of it. But
UnixSeqpacket::recv calls libc::read, and read(2) tells us that:
> In the absence of any errors, or if read() does not check for errors,
> a read() with a count of 0 returns zero and has no other effects.
So this was in fact doing nothing at all. I don't know why crosvm's
UnixSeqpacket::recv calls read() instead of recv(), but it's always been
like that and I'm guessing this sort of thing (from recv(2)) might have
something to do with it:
> The only difference between recv() and read(2) is the presence of
> flags. With a zero flags argument, recv() is generally equivalent to
> read(2) (but see NOTES).
So probably read() just looked like a nicer way to recv() when no flags
were needed.
But, unfortunately, zero-byte reads are when the aforementioned NOTES
section becomes relevant:
> If a zero-length datagram is pending, read(2) and recv() with a flags
> argument of zero provide different behavior. In this circumstance,
> read(2) has no effect (the datagram remains pending), while recv()
> consumes the pending datagram.
So, my assumption that UnixSeqpacket::recv(&mut []) would consume a
message turned out to be quite reasonable -- the surprising thing was
that a method called `recv' would call read() rather than recv(). I
think the best fix here will be to just make it call recv() instead,
rather than modifying my code to do UnixSeqpacket::recv(&mut [0]) or
something, to prevent further nasty surprises with this in future.
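If you want to see the difference for yourself, here's a standalone
illustration (not crosvm code) of the two zero-length calls on a
SOCK_SEQPACKET pair:

    /* A zero-length read() leaves the pending datagram queued, while a
     * zero-length recv() dequeues it, discarding its contents. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        char buf[16];

        if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, fds) < 0)
            return 1;

        /* Queue two one-byte datagrams. */
        send(fds[0], "a", 1, 0);
        send(fds[0], "b", 1, 0);

        read(fds[1], buf, 0);      /* no effect: "a" is still pending */
        recv(fds[1], buf, 0, 0);   /* dequeues "a", discarding its byte */

        ssize_t n = recv(fds[1], buf, sizeof buf, 0);
        printf("next datagram: %.*s\n", (int)n, buf);   /* prints "b" */
        return 0;
    }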
[1]: https://spectrum-os.org/lists/archives/spectrum-devel/20200614114344.22642-…
Wayland
-------
I created API-compatible implementations of the libc sendmsg(2) and
recvmsg(2) functions for virtio_wl sockets. This was quite an
achievement, because the API (which allows you to send and receive data
and file descriptors, as well as other things I don't intend to support)
is rather arcane (see the example in cmsg(3) if you're not familiar with
them). I wrote unit tests for them, and it took a long time before they
worked reliably. Once I had these, though, I could find the places
where Wayland called sendmsg() and recvmsg() and fall back to the
virtio_wl-based implementations if the standard functions failed with
ENOTSOCK. I stubbed out some stuff that isn't going to work over
virtio_wl, like looking up the pid of the Wayland client through
getsockopt(2).
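For anyone who hasn't run into this API before, here's roughly the
shape of the thing the replacements have to reproduce -- sending a
single file descriptor alongside a byte of data, following the recipe
in cmsg(3). The function name is mine:

    /* Send one file descriptor (plus one byte of ordinary data, since
     * some systems won't deliver ancillary data on its own) over a Unix
     * socket, per the pattern in cmsg(3). */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    ssize_t send_fd(int sock, int fd_to_send)
    {
        char data = 0;
        struct iovec iov = { .iov_base = &data, .iov_len = 1 };

        union {   /* buffer correctly aligned for a cmsghdr */
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;

        struct msghdr msg = {0};
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = u.buf;
        msg.msg_controllen = sizeof u.buf;

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;   /* "this message carries fds" */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

        return sendmsg(sock, &msg, 0);
    }

The receiving side has to do the mirror-image dance with recvmsg() and
CMSG_FIRSTHDR() to pull the fd back out.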
I also had to resort to a few hacks, like faking support for
MSG_DONTWAIT by using fcntl(2) to set O_NONBLOCK on the socket,
recv()ing from it, and then removing O_NONBLOCK again, or faking
mremap(2) by munmap()-ing and mmap()-ing. We will want to clean these
up later by implementing the required missing functionality in the
virtio_wl kernel module. In the first case, at least, this should be
pretty straightforward, because the module already supports non-blocking
operations when the socket is O_NONBLOCK -- it just needs to accept a
MSG_DONTWAIT flag as well. The VIRTWL_IOCTL_{SEND,RECV} ioctls don't
currently take a flags argument, so one will need to be added.
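The MSG_DONTWAIT workaround looks something like this -- a sketch, not
the actual patch, and virtio_wl_recvmsg() is just a stand-in name for
the virtio_wl-based recvmsg() replacement:

    /* Fake MSG_DONTWAIT by toggling O_NONBLOCK around the receive.
     * virtio_wl_recvmsg() is a hypothetical stand-in for the
     * virtio_wl-based recvmsg() replacement described above. */
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    ssize_t virtio_wl_recvmsg(int fd, struct msghdr *msg, int flags);

    ssize_t recvmsg_dontwait(int fd, struct msghdr *msg)
    {
        int flags = fcntl(fd, F_GETFL);
        if (flags < 0)
            return -1;
        if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;

        ssize_t n = virtio_wl_recvmsg(fd, msg, 0);   /* may fail with EAGAIN */
        int saved_errno = errno;

        fcntl(fd, F_SETFL, flags);   /* put the original flags back */
        errno = saved_errno;
        return n;
    }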
I implemented this bit by bit, at every step trying to run Alacritty on
my host system, connected to the virtio_wl Wayland server socket through
the accept() proxy, and using strace and some printf()-debugging to see
where the Wayland compositor in the VM would get stuck. About an hour
ago, it finally worked! For the first time, a Wayland compositor
running in a VM can display an application running outside of it.
(Obviously we'll want the application to be running in another VM rather
than on the host, but that's similar enough that it probably works
already -- I just haven't tested it yet.) This feels like a huge
achievement. I've been working towards it for so long.
Next week, I'll be cleaning up this code and posting patches for all of
it. Then I'll probably move on to other sorts of device virtualization,
like running a virtual network device in a VM. I'm feeling so much more
positive about the direction of the project than I was before. It's
been difficult to make myself keep going while making so little
progress over the last couple of weeks, so it's great to have picked
things up again. I hope that the level of detail in this email is enough
to make up for the brevity of last week's! I'm sending late again, too,
but only by a couple of minutes -- I didn't expect this email to take
over an hour to write, but there we go.
Thanks for reading! I hope you're looking forward to seeing where
things go from here as much as I am.