Feels like there's a pretty clear path forward. Nice feeling. :)
QEMU
----
Last week, as I wrote TWiS, I had just discovered virtio-vhost-user,
which looked like a very promising mechanism for getting a VM to take
care of networking for other VMs. This week, I've been researching it
further, and trying to test and evaluate it.
The first thing I tried to do, naturally, was to build the patched QEMU
tree and boot a VM with a virtio-vhost-user device attached. This was
not as easy as I'd hoped, because adding the virtio-vhost-user device to
my QEMU command line made the VM kernel panic at boot, with an error
message about an invalid memory access. I spent most of the week trying
to figure this out -- I wasn't doing anything different to the
example[1] on the QEMU wiki, so it should have worked, and it felt like
if I could just get past whatever was going wrong here, it would be
worth it, because virtio-vhost-user otherwise seems so suited for what
we need here. I emailed the patch author[2], but he didn't know what
was up either.
An early breakthrough came when I got frustrated with kernel builds
taking hours on my 8-year-old laptop, and so decided to work on a more
powerful computer instead. Once I got everything set up on that
computer, I started up the VM, and it worked. Perhaps in setting it up
over here I'd done something different? I copied over the exact VM
disk/kernel/initrd/command line that I was running on my laptop, and the
other computer booted it just fine. I had -cpu host in the QEMU command
line, so I thought maybe the different kind of virtual CPU was causing
it. Tried setting it to a specific value on both machines, and still
the laptop VM panicked and the other didn't. So it sounded like whether
it worked or not depended on the host hardware.
I put together a Nix derivation that would automatically build the
custom QEMU and output a script that would run a VM, and then asked
people in #spectrum to test it out on various computers. After getting
some further data, a pattern started to emerge, where Intel processors
Ivy Bridge and older would fail, and Skylake and newer would succeed (I
didn't encounter any AMD processors that failed, nor did I have data at
the time for generations between Ivy Bridge and Skylake). This theory
had a convenient explanation for why nobody else had seen this problem
-- I doubt people at Red Hat are working on 7-year-old hardware.
This was a good clue, but still didn't put me much closer to having a
working system. I do have a more recent laptop around, but for reasons
that are out of scope here it would be very inconvenient to decide to
just move over to it. I could see that the kernel was panicking the
first time it tried to access the PCI BARs of the virtio-vhost-user device,
which led me to believe that the problem was probably in how that memory
was being set up. I found the function that did that[3], and stared at
it for a long time. I tried to read the rest of the QEMU code, but it
became clear that my domain knowledge here isn't good enough to be able
to keep track of what's meant to be happening. I added some debug
prints, which were vaguely helpful in making that understanding a little
better.
I was hoping to find the guest address each PCI BAR was mapped to so
that I could check the kernel was trying to write to the right location,
but didn't manage to do that. While attempting to, though, I did add a
debug print that printed the size of each PCI BAR as it was allocated.
I noticed that most were small -- 16 MiB at most, but one was huge, at
64 GiB! The code that allocated this BAR was part of the function I'd
been staring at. As far as I could tell, the choice of size was pretty
arbitrary -- this big memory region was used as backing memory for all
sorts of small objects on the fly. On a whim, I tried changing the BAR
size from 1ULL << 36 to 1ULL << 26, and recompiled QEMU. The VM booted.
The comment above the bar_size definition that I'd been looking at for
so long said:
/* TODO If the BAR is too large the guest won't have address space to map
 * it!
 */
I don't know if that's exactly what went wrong here, though. I suspect
it's more like the host architecture doesn't have enough address space?
The affected machines all reported 36 bit physical address size, and 48
bit virtual address size. So maybe what's happening is that the
processor interprets PCI addresses in the hardware-assisted VM as
physical addresses, and therefore runs out of space because all of it is
taken up by this one PCI BAR? I'm not really sure. Lowering the BAR
size to 2^35 or 2^34 (has to be a power of two) depending on the QEMU
version made the problem go away, and that's good enough for now.
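To make the arithmetic concrete, here's a tiny Python sketch (my own
illustration, not QEMU code) of why a 2^36-byte BAR is a problem on a
machine reporting a 36-bit physical address size:

```python
GiB = 2**30

def bar_size_fits(bar_bits, phys_addr_bits):
    """Crude model: does a single BAR of 2**bar_bits bytes leave any
    address space free on a machine whose physical addresses are
    phys_addr_bits wide?"""
    return (1 << bar_bits) < (1 << phys_addr_bits)

# The original patch allocated a 1ULL << 36 BAR: that's 64 GiB, the
# *entire* physical address space of a CPU with 36 address bits.
assert (1 << 36) == 64 * GiB
assert not bar_size_fits(36, 36)

# Lowering it to 1ULL << 26 (64 MiB), or to 2^35/2^34 as described
# above, leaves room for everything else that needs mapping.
assert bar_size_fits(26, 36)
assert bar_size_fits(35, 36)
```

This is consistent with the symptom: hosts with 39-bit or larger
physical address sizes never hit the panic.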
I'm not very enthusiastic about this up-front allocation of a huge
amount of memory that might not even fit in the available address space.
I don't know if there's a better way of doing it in this case, but I
certainly hope so. In general I think this perhaps demonstrates why
this code is not considered suitable for "production" yet. The bet I'm
taking here is that by the time Spectrum is further along, things will
have moved on for virtio-vhost-user too. As I said, at some point we
will want to implement it in crosvm to avoid having QEMU in the TCB, but
it would be a bad idea to do that now while virtio-vhost-user is still
going through the back-and-forth of making its way into the Virtio spec.
[1]: https://wiki.qemu.org/Features/VirtioVhostUser
[2]: https://lore.kernel.org/qemu-devel/87h7u1s5k1.fsf@alyssa.is/T/#u
[3]: https://github.com/ndragazis/qemu/blob/f9ab08c0c8/hw/virtio/virtio-vhost-us…
DPDK
----
Once I was able to boot a VM with the virtio-vhost-user device, I tried
to connect another QEMU VM to it through vhost-user -- I'll want to have
this working first as a reference before I start porting Cloud
Hypervisor's vhost-user implementation to crosvm. But the "frontend"
(vhost-user) QEMU process hung waiting for a reply on the vhost-user
socket from the backend one. Not really knowing what to do about this,
I decided that maybe I'd been a bit too ambitious in going straight for
vhost-user <-> virtio-vhost-user when I'd never actually used vhost-user
before, so maybe I should try a more conventional vhost-user setup
first.
As far as I can tell, vhost-user is usually used for connecting a VM to
a userspace networking stack. And usually, this networking stack is
DPDK, the "Data Plane Development Kit"[4]. DPDK was also used in the
virtio-vhost-user examples, so I figured my next step would be to try
it there as well, and that it was therefore worth taking the time to
learn how to do a very basic setup with it.
Quick-start-style documentation for this was pretty lacking, but I did
eventually manage to make this work. Here's what I did, for my own
future reference as much as anything else:
(1) Make some hugepages available. 1GiB for DPDK and 1GiB for QEMU:
echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
(2) Take my ethernet interface offline so it could be used with DPDK:
nmcli d disconnect enp0s25
(3) Load the vfio-pci module, which allows PCI devices to be exported to
userspace rather than managed by the kernel:
modprobe vfio-pci
(4) Export the ethernet interface:
usertools/dpdk-devbind.py -b vfio-pci enp0s25
(5) Run testpmd, a program that comes with DPDK, used mostly for
debugging and tracing it seems, but which with no special arguments
acts as a simple packet forwarder. Here I create a vhost-user
socket, and forward traffic between vhost-user and my ethernet
interface:
build/app/dpdk-testpmd -l 0,1 -w 00:19.0 \
--vdev net_vhost0,iface=/run/vhost-user0.sock
The -w value is the PCI address of the ethernet interface. Note how
"00:19" corresponds to "p0s25". (19 in hex is 25 is decimal.)
(6) Start a VM. The relevant QEMU flags appear to be:
-chardev socket,id=char0,path=/run/vhost-user0.sock \
-netdev type=vhost-user,id=net0,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=net0 \
-object memory-backend-file,id=mem0,size=1024M,mem-path=/dev/hugepages,share=on \
-numa node,memdev=mem0 \
-mem-prealloc
I figured this all out mostly from a guide for a DPDK benchmark[5]. I
have not yet experimented with variations on the QEMU flags. I'm
not sure if all the memory flags are required -- -mem-prealloc might
just be there because it was important for a benchmark, for example.
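The two bits of arithmetic above can be sketched in a few lines of
Python. The pci_from_name helper is my own illustration of the naming
correspondence (it assumes function 0, and is not how dpdk-devbind.py
actually works):

```python
import re

# Step (1): 1024 hugepages of 2 MiB each is 2 GiB -- 1 GiB for DPDK
# plus 1 GiB for QEMU.
def hugepage_count(total_bytes, page_bytes=2048 * 1024):
    return total_bytes // page_bytes

# Step (5): recover a PCI bus:slot.function address from a predictable
# interface name like enp0s25 (bus 0, slot 25 decimal -> 00:19.0).
def pci_from_name(name):
    m = re.fullmatch(r"enp(\d+)s(\d+)", name)
    bus, slot = int(m.group(1)), int(m.group(2))
    return f"{bus:02x}:{slot:02x}.0"

assert hugepage_count(2 * 2**30) == 1024      # matches the echo above
assert pci_from_name("enp0s25") == "00:19.0"  # 25 decimal is 0x19
```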
So this is the point I'm at with this exploration. Next up, I'll be
trying with DPDK inside a VM with a virtio-vhost-user device. I think
that maybe, despite the virtio-vhost-user device showing up as an
ethernet device inside the VM, it needs some special support which is
available for DPDK as a patchset, but that has not been written for the
kernel yet. I was a bit worried about this, because unlike the kernel,
DPDK isn't going to have things like Wi-Fi drivers for all sorts of
different hardware, and so using DPDK instead of the kernel network
stack would be a problem. But then I learned that DPDK has a component
called the Kernel NIC Interface (KNI), which allows it to use network
interfaces from the kernel, so a hybrid approach would be possible, and
is what I think we'll end up using for now. Then, once
virtio-vhost-user is a bit more mature, a kernel driver will probably
show up, and we can use that instead and drop DPDK.
[4]: https://dpdk.org/
[5]: https://doc.dpdk.org/guides/howto/pvp_reference_benchmark.html?highlight=pvp
Website
-------
I was having a conversation about Spectrum yesterday, and I found myself
sending over a bunch of links to articles and papers that I often find
myself referring to when talking to somebody about Spectrum. This made
me think that maybe there should be some place where we keep all these
relevant articles. So I mined the IRC logs, the TWiS archive, and my
blog, and added whatever I could pull from my brain, and wrote a
Spectrum bibliography, containing 27 links to interesting articles and
papers that are particularly relevant to Spectrum.
This isn't on the website quite yet, but I did send this as a patch[6]
to the mailing list, if you want an early look.
I also posted a patch to fix a minor issue where I'd mistakenly used
".." instead of "." as href values, to no user-visible effect[7].
[6]: https://spectrum-os.org/lists/archives/spectrum-devel/20200726045701.32259-…
[7]: https://spectrum-os.org/lists/archives/spectrum-devel/20200726055410.20641-…
Documentation
-------------
On Monday, I had a call with the Free Software Foundation Europe.
They're a part of NGI Zero (where my funding comes from), and they are
promoting their new "REUSE" specification[8] for license information in
free software projects to NGI Zero projects. It basically covers
standardised per-file license and copyright annotations, and a standard
way of including license texts.
I think this is really cool! It's something I've been unsure of how to
handle because it's all vague conventions that are different in
different circles, and it's nice to see something formalised about it.
They also have an automated tool[9] for checking compliance and
semi-automatically adding license information, which is great!
So I'm enthusiastically adopting the REUSE specification. I decided
that our smaller, first-party repositories (the documentation, the
website, etc.) would be a good place to get started, and so I posted a
patch[10] that makes the documentation repository REUSE-compliant.
[8]: https://reuse.software/
[9]: https://git.fsfe.org/reuse/tool
[10]: https://spectrum-os.org/lists/archives/spectrum-devel/20200726105527.27432-…
mktuntap
--------
I posted a patch[11] to make mktuntap REUSE-compliant.
[11]: https://spectrum-os.org/lists/archives/spectrum-devel/20200726110123.30159-…
The thing that's most on my mind this week is the extent to which I'm
learning about and working on software like QEMU and DPDK that I don't
see having a place in Spectrum in the long run. It's counterintuitive,
but this is definitely worth it. There's no point writing a kernel
driver for virtio-vhost-user (should such a thing be required) right
now, because if I use DPDK for now instead, at some point either
virtio-vhost-user will end up not being the thing that gets adopted by
the ecosystem and we'll have to move to something else, or (more likely)
it gets widely adopted and somebody else writes a kernel driver.
Similarly, using QEMU for network VMs is the smart choice even though
I don't want it to end up in the TCB: I'm probably going to end up
implementing virtio-vhost-user in crosvm later, but swapping QEMU out
at that point will be easy enough that it would be a very bad idea to
do that work now, in case virtio-vhost-user doesn't take off. But
it still /feels/ weird to be using QEMU for this stuff, you know?
This has been a week of thinking I wanted to do one thing, not being
sure how to do it, and finding out that there was a better way. I'll
write it up in the order it happened.
crosvm
------
Last week, I described that I wanted to implement a virtio proxy to
allow a kernel in an application VM to use a virtual device in
another VM. I was wondering how to manage virtio buffers, and thought
that I probably wanted an allocator to be able to manage throwing
buffers of different sizes around.
This turned out to be a case of the XY problem[1]. I couldn't find a
good solution, but it turned out that an allocator wasn't what I wanted
anyway. edef pointed out that I could just make the shared memory I
allocated as big as necessary to hold buffers of the maximum size I
wanted to support. The kernel will only actually allocate pages as they
are written to, and I could use fallocate[2] with FALLOC_FL_PUNCH_HOLE
to tell the kernel it can drop pages when I'm done with them. This
would mean that an unusually large buffer would only take up lots of
memory while it was in use, and as soon as it was done with, the kernel
would be able to take back the memory. So exactly what I wanted from an
allocator, but with no need for an allocator at all!
This made the implementation much simpler, and by Friday I was able to
get the proxy into a state where it could pass unit tests that
transported messages in both directions through it.
And then it was suggested to me that maybe a virtio proxy is not what I
want after all.
The main disadvantage to a virtio proxy is that it requires context
switching to the host to send data between VMs. This is a trade-off I
was aware of, but a virtio proxy is pretty straightforward to write as
inter-VM communication systems go, and I was not aware of anything else
that would be up to the job. As it turns out, there is something.
vhost-user is a mechanism for connecting, say, a virtio device to a
userspace network stack in a performant way. I was aware of this, but
what I was not aware of was virtio-vhost-user[3]. virtio-vhost-user is
a proposed mechanism to allow a VMM to forward a vhost-user backend to a
VM. This means that two VMs could directly share virtqueues, with no
host copy step. This would mean there would be no opportunity for the
host to mediate communication between two guests, but that wasn't really
on the cards anyway -- if it's ever required, a virtio proxy would
probably be the way to go. For all the other cases, virtio-vhost-user
would be a faster, cleaner way of sharing network devices between VMs.
The main problem with virtio-vhost-user is that it's still in its
infancy. There's a patchset[4] implementing it for QEMU that's a couple
of years old, but that has not been accepted upstream. The main blocker
for this seems to be first standardising it in the Virtio spec[5][6]. The
good news here is that the standardisation process seems to be
progressing actively at the moment. It's being discussed on the
virtio-dev mailing list basically right now, with the most recent emails
dated Friday (unfortunately, I don't know of a good web archive of
virtio-dev, but you can find the thread on Gmane if you're interested
but not subscribed to the list).
The good news is that virtio-vhost-user mostly works by composing things
that already exist. There's no kernel work required, because devices
are just exposed by the VMM as regular virtio devices. The frontend VM
(i.e. the one that uses the virtual device, as opposed to the one that
provides it) doesn't need any special virtio-vhost-user support, because
it just needs to speak normal vhost-user. Only the backend VM needs
support for virtio-vhost-user, because its VMM needs to expose the
vhost-user backend from the host to that VM.
This means that provisionally using virtio-vhost-user in Spectrum
actually looks very feasible, with a couple of compromises. For
evaluation purposes, it's not worth writing a virtio-vhost-user device
for crosvm. But, the VMs that need that device are the ones that are
very specialised -- VMs that manage networking or block devices or
similar. So for these VMs, for now, we could use QEMU, with the
virtio-vhost-user patch. I investigated what it would take to port it
to the most recent QEMU version, and the answer appears to be "not much
at all". Obviously having two VMMs in the Trusted Computing Base (TCB)
isn't something we'd want in the long term, but it would be fine for,
say, reaching the next funding milestone. If we decide that
virtio-vhost-user is the way to go after all, support in crosvm can be
added then -- in general, adding a new virtio device to crosvm isn't a
huge undertaking.
Earlier, I said that the application side of the communication doesn't
need anything special, because to it this is just regular vhost-user.
This is true, but I glossed over there that crosvm doesn't actually
implement vhost-user. Implementing vhost-user in crosvm would probably
be a big deal at this stage, and not something I feel would be a good
use of my time. BUT! Remember, crosvm has two children: Amazon's
Firecracker[7], which aims at so-called "serverless" computing; and
Intel's Cloud Hypervisor[8], which aims at traditional, full system
server virtualisation. Both of these children inherited the crosvm
device model from their parent, and Cloud Hypervisor implements
vhost-user[9].
So I _think_ it should be possible to pretty much lift the vhost-user
implementation from Cloud Hypervisor, and use it in crosvm. Pretty
neat!
So, the setup I'd like to evaluate is QEMU with the virtio-vhost-user
patch on one side, and crosvm with Cloud Hypervisor's vhost-user
implementation on the other.
It might well be that there are complications here. If there are, I'll
probably just finish the proxy and move on for now, because I want to
keep up the pace. I do think that virtio-vhost-user is probably the
way to do interguest networking in the long-term, though.
Another thing that I've realised is that I don't need to worry about
pulling bits out of crosvm to run in other VMs. I focused a lot on that
towards the beginning of the year, mostly motivated by Wayland, because
the virtio wayland implementation in crosvm is the only one there is.
Now that that works in a different way, though, there's no need to
continue down this path, because things like networking can be done in
more normal ways through virtio and the device VM kernel.
[1]: https://en.wikipedia.org/wiki/XY_problem
[2]: https://man7.org/linux/man-pages/man2/fallocate.2.html
[3]: https://wiki.qemu.org/Features/VirtioVhostUser
[4]: https://github.com/stefanha/qemu/compare/master...virtio-vhost-user
[5]: https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg03082.html
[6]: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.…
[7]: https://firecracker-microvm.github.io/
[8]: https://github.com/cloud-hypervisor/cloud-hypervisor
[9]: https://github.com/cloud-hypervisor/cloud-hypervisor/blob/b4d04bdff6a7e2c3d…
Overall, it's been frustrating for me to try things, and discover
they're not going to work, or not going to work as well as some other
thing, and make a call on whether to keep going on something I know is
the worse option or switch to the better thing. I have to keep
reminding myself that Spectrum is a research project, and there are
always going to be false starts like this. Lots of what we're doing is
either very unusual (virtio-vhost-user) or brand new (interguest
Wayland), after all.
After I got an isolated Wayland compositor working last week, I wasn't
really sure what to do next -- this was a big piece of work that I'd
been very focused on for a while. The funding milestone I'm closest to
is to do with implementing hardware isolation, which the Wayland work
was a part of, so I decided to keep going with that, and explore other
types of isolation. More on that in a bit.
Wayland
-------
Posted my patch for virtio_wl display socket support in
libwayland-server[1]. This is what allows it to run in a VM, and
receive connections from clients in other VMs. The patch description is
very extensive, so I recommend reading it for more detail if you're
interested.
It introduces a libvirtio_wl, which should also be useful for porting
other programs that we might want to communicate with across a VM
boundary, if they are written with normal Unix sockets in mind
(including transferring file descriptors). This is the evolution of
code I previously had put in wlroots, moved to Wayland for convenience.
If it ever acquires another user (or maybe even if it doesn't) it might
make sense to make it its own package, since virtio_wl is useful even if
Wayland isn't involved.
[1]: https://spectrum-os.org/lists/archives/spectrum-devel/SJ0PR03MB5581479F3388…
crosvm
------
I pushed all my crosvm changes to get the isolated compositor working to
the work-in-progress "interguest" branch[2]. Remember, I only got it
working last week right before I needed to start writing the TWiS email,
so I hadn't even done that yet! I also posted some patches[3] to the list
to fix a bug in my previous crosvm deadlock fix, and to improve some
related documentation. As usual, these were kindly reviewed by Cole.
Next, I turned my attention to other forms of hardware isolation.
Wayland was a bit special, because despite crosvm including a virtual
"Wayland device", it's not really hardware, and so it required an
approach to isolation quite different from that for other crosvm
virtual devices. My hope is that other virtual devices should all be
substantially similar to each other.
The basic idea for actual hardware isolation is that rather than having
drivers in the host kernel for USB, network devices, etc., those devices
will be exposed to dedicated VMs as virtual PCI devices. This should
substantially reduce host kernel attack surface. crosvm virtual devices
will be run in these device VMs, and communicate over virtio with
application VMs as normal. This will require implementing in crosvm a
virtio proxy device that allows the crosvm running an application VM
to forward virtio communication to the virtual device running in
userspace in the driver VM.
(The reason devices aren't attached to application VMs directly but run
in separate device VMs is that hardware is probably not going to be very
happy if multiple kernels are trying to talk to it at the same time.
Additionally, this indirection means that application VMs only have to
use the one virtio driver for that device category, rather than any of
the hundreds of drivers for different hardware in that category. If one
of those drivers had a vulnerability, this should help to contain it to
the device VM.)
So I started writing this virtio proxy. The basic idea is to copy
virtio buffers from application VM guest memory into memory that can be
shared with the userspace virtual device in the device VM. I can't find
any prior art on this (which is not unusual -- not many systems isolate
drivers in this way), so this has required a lot of looking back at the
virtio paper[4] and spec[5] to make sure I understand what to do here.
As I write this, the next problem to solve is integrating some sort of
memory allocator that can manage buffer allocations in the shared memory
that the virtual device looks at. This is a new area for me that I'd
appreciate advice on if anybody can give it -- think of it like, I have a
memfd, mmaped into my process, and I would like to dynamically allocate
and release memory buffers of dynamic sizes in that region. I'm sure
there's a library I'll be able to plug in for this.
[2]: https://spectrum-os.org/git/crosvm/?h=interguest
[3]: https://spectrum-os.org/lists/archives/spectrum-devel/SJ0PR03MB55819DE7E13B…
[4]: https://www.ozlabs.org/~rusty/virtio-spec/virtio-paper.pdf
[5]: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.…
As usual, big thank you to Cole for reviewing patches, and for finding
room for improvement even in languages/areas he isn't familiar with.
It feels nice to have done some thinking about the project at a slightly
higher level than I have been recently, and to know where I am on the
way to the next milestone. Having taken a lot of time away from the
milestone list this year to work on fundamentals, it's good to feel like
I'm getting back on track.
I really didn't want this to be another week where I posted about how I
was still trying to patch Wayland to do virtio_wl, and I am delighted to
have just discovered it's not going to be!
crosvm
------
I realised that emulating accept(2) for the Wayland compositor socket in
the way I'd planned would require some crosvm rework. I want to have a
host proxy program that accepts the connection, then connects the
connection socket to crosvm. I had made it possible to dynamically add
sockets to the crosvm Wl device through the control socket, but this
turned out not to be enough, because crosvm would store virtio_wl sockets
in a BTreeMap<String, PathBuf>, and then use connect(2) to connect to
the socket when asked to by the guest kernel. This works fine for
e.g. connecting to a host Wayland compositor, which is what crosvm was
designed for, but it wouldn't work for opening a connection socket from
accept(2), because you can only connect to a listening socket.
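That constraint is easy to demonstrate; a minimal sketch in Python
rather than crosvm's Rust, using Unix stream sockets:

```python
import os, socket, tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "wl.sock")

    # Bound but not listening -- like a connection socket returned by
    # accept(2), this is something you cannot connect(2) to.
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)

    try:
        c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        c.connect(path)
        refused = False
    except ConnectionRefusedError:
        refused = True
    assert refused                   # ECONNREFUSED: nobody listening

    # Once the socket is listening, connecting works as usual.
    server.listen(1)
    c2 = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c2.connect(path)
```

So a connection socket obtained from accept(2) has to be handed over
as a file descriptor; there is no path crosvm could connect(2) to.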
So instead, I modified the `crosvm wl add' command to take a file
descriptor pointing to the connection socket. I made crosvm store
sockets as an enum that looks like this:
enum WaylandSocket {
    Listening(PathBuf),
    NonListening(UnixStream),
}
This way, when it gets asked by the VM to connect to a socket, it can
either connect to a listening socket at its path using connect(2), or
just use the existing file descriptor if it's a non-listening socket. A
NonListening socket will be consumed by a connection, so when the VM
close(2)s it, it'll go away, and on the host side the connection will
finish as expected. Listening sockets can be connected to repeatedly,
as before.
I also added support to `crosvm wl add' for dynamic socket names. So
it's possible to do `crosvm wl add wl-conn-%d', and connections will be
added with names like `wl-conn-0', `wl-conn-1', etc. So it's easy to
get unique names for connection sockets. The chosen name is printed by
the command, so the caller knows what name to tell the VM to connect to.
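The naming scheme behaves like this sketch (an illustration of the
behaviour described above, not crosvm's actual code):

```python
import itertools

def socket_names(template):
    """Substitute an incrementing counter for %d, yielding unique
    socket names: wl-conn-%d -> wl-conn-0, wl-conn-1, ..."""
    for n in itertools.count():
        yield template.replace("%d", str(n), 1)

names = socket_names("wl-conn-%d")
assert next(names) == "wl-conn-0"
assert next(names) == "wl-conn-1"
```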
I also found and fixed a bug with the previous crosvm deadlock fix[1].
I had assumed that device_sock.recv(&mut []) would drop a message from
the (SOCK_SEQPACKET) socket, without having to read any of it. But
UnixSeqpacket::recv calls libc::read, and read(2) tells us that:
> In the absence of any errors, or if read() does not check for errors,
> a read() with a count of 0 returns zero and has no other effects.
So this was in fact doing nothing at all. I don't know why crosvm's
UnixSeqpacket::recv calls read() instead of recv(), but it's always been
like that and I'm guessing this sort of thing (from recv(2)) might have
something to do with it:
> The only difference between recv() and read(2) is the presence of
> flags. With a zero flags argument, recv() is generally equivalent to
> read(2) (but see NOTES).
So probably read() just looked like a nicer way to recv() when no flags
were needed.
But, unfortunately, zero-byte reads are when the aforementioned NOTES
section becomes relevant:
> If a zero-length datagram is pending, read(2) and recv() with a flags
> argument of zero provide different behavior. In this circumstance,
> read(2) has no effect (the datagram remains pending), while recv()
> consumes the pending datagram.
So, my assumption that UnixSeqpacket::recv(&mut []) would consume a
message turned out to be quite reasonable -- the surprising thing was
that a method called `recv' would call read() rather than recv(). I
think the best fix here will be to just make it call recv() instead,
rather than modifying my code to do UnixSeqpacket::recv(&mut [0]) or
something, to prevent further nasty surprises with this in future.
[1]: https://spectrum-os.org/lists/archives/spectrum-devel/20200614114344.22642-…
Wayland
-------
I created API-compatible implementations of the libc sendmsg(2) and
recvmsg(2) functions for virtio_wl sockets. This was quite an
achievement, because the API (which allows you to send and receive data
and file descriptors, as well as other things I don't intend to support)
is rather arcane (see the example in cmsg(3) if you're not familiar with
them). I wrote unit tests for them, and it took a long time before they
worked reliably. Once I had these, though, I could find the places
where Wayland called sendmsg() and recvmsg() and fall back to the
virtio_wl-based implementations if the standard functions failed with
ENOTSOCK. I stubbed out some stuff that isn't going to work over
virtio_wl, like looking up the pid of the Wayland client through
getsockopt(2).
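For a flavour of what those functions have to handle, here's the file
descriptor passing half of the API in Python (Python's socket module
wraps the same sendmsg(2)/recvmsg(2)/SCM_RIGHTS machinery that the
virtio_wl replacements have to emulate):

```python
import array, os, socket

def send_fd(sock, fd):
    """Send one file descriptor in an SCM_RIGHTS control message."""
    sock.sendmsg([b"\0"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                            array.array("i", [fd]))])

def recv_fd(sock):
    """Receive one file descriptor from an SCM_RIGHTS message."""
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(
        1, socket.CMSG_LEN(fds.itemsize))
    level, ctype, data = ancdata[0]
    assert (level, ctype) == (socket.SOL_SOCKET, socket.SCM_RIGHTS)
    fds.frombytes(data)
    return fds[0]

a, b = socket.socketpair()
r, w = os.pipe()
send_fd(a, w)
w2 = recv_fd(b)              # a fresh descriptor for the same pipe
os.write(w2, b"hi")
assert os.read(r, 2) == b"hi"
```

In C this is all manual cmsg(3) buffer arithmetic, which is where the
arcane part comes in.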
I also had to resort to a few hacks, like faking support for
MSG_DONTWAIT by using fcntl(2) to set O_NONBLOCK on the socket,
recv()ing from it, and then removing O_NONBLOCK again, or faking
mremap(2) by munmap()-ing and mmap()-ing. We will want to clean these
up later by implementing the required missing functionality in the
virtio_wl kernel module. In the first case, at least, this should be
pretty straightforward, because it supports non-blocking operations if
the socket is O_NONBLOCK -- it just needs to accept a MSG_DONTWAIT
option as well. The VIRTWL_IOCTL_{SEND,RECV} ioctls don't currently
have a flags argument, so that'll need to be added.
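The MSG_DONTWAIT fake looks roughly like this (a Python sketch of the
hack described above, not the actual libwayland patch):

```python
import fcntl, os, socket

def recv_dontwait(sock, bufsize):
    """Fake MSG_DONTWAIT: set O_NONBLOCK with fcntl(2), recv, then
    restore the old flags.  Returns None instead of blocking when
    nothing is pending."""
    fd = sock.fileno()
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
    try:
        return sock.recv(bufsize)
    except BlockingIOError:
        return None
    finally:
        fcntl.fcntl(fd, fcntl.F_SETFL, flags)

a, b = socket.socketpair()
assert recv_dontwait(b, 4) is None   # empty socket: no block, no data
a.send(b"ping")
assert recv_dontwait(b, 4) == b"ping"
```

The obvious wart is that the flag flipping isn't atomic with respect
to other threads using the same socket, which is one more reason to do
it properly in the kernel module later.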
I implemented this bit by bit, at every step trying to run Alacritty on
my host system, connected to the virtio_wl Wayland server socket through
the accept() proxy, and using strace and some printf()-debugging to see
where the Wayland compositor in the VM would get stuck, and about an
hour ago, it finally worked! For the first time, a Wayland compositor
running in a VM can display an application running outside of it.
(Obviously we'll want the application to be running in another VM rather
than on the host, but that's similar enough that it probably works
already -- I just haven't tested it yet.) This feels like a huge
achievement. I've been working towards it for so long.
Next week, I'll be cleaning up this code and posting patches for all of
it. Then I'll probably move on to other sorts of device virtualization,
like running a virtual network device in a VM. I'm feeling so much more
positive about the direction of the project than I was before. It's
been difficult to make myself keep going while making little progress for the
last couple of weeks, and it's great that I've managed to pick things up
again so much. I hope that the level of detail in this email is enough
to make up for the brevity of last week's! I'm sending late again, too,
but only by a couple of minutes -- I didn't expect this email to take
over an hour to write, but there we go.
Thanks for reading! I hope you're looking forward to seeing where
things go from here as much as I am.