Fun little thing I have been working on: teach systemd to boot directly into a disk image downloaded via HTTP within the initrd.
In v257 systemd learnt the ability to download disk images at boot via systemd-import-generator, both DDIs and tarballs, and place them in /var/lib/machines/, /var/lib/portables/, /var/lib/confexts/, /var/lib/extensions/. The goal was to provide a way to provision any of these resources automatically at boot. But now that we have this, we can take it a step further:
download the root disk image itself with this. There were a bunch of missing bits to make this nice though:
First of all, for raw disk images we need to attach them to a loopback block device, to make them mountable. Easy-peasy, systemd-dissect --attach already delivers that.
Then, for tar disk images we need to bind mount the downloaded and unpacked image to /sysroot/ (which is where the rootfs goes before we transition into it).
Then, to make this nicer, it makes sense to allow deriving the rootfs image URL directly from the UEFI HTTP boot URL. Or in other words: if you point your UEFI to boot a UKI from some URL (i.e. http://example.com/somedir/myimage.efi), then that UKI's initrd is smart enough to derive from that same URL a different URL for the rootfs (by replacing the final component, so that it becomes http://example.com/somedir/myimage.raw.xz).
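A minimal sketch of that "replace the final component" step, in Python purely for illustration. This is not systemd's actual code; the function name `derive_rootfs_url` is made up for the example, and the URLs are the ones from the post.

```python
from urllib.parse import urljoin


def derive_rootfs_url(boot_url: str, image_name: str) -> str:
    """Swap the last path component of the boot URL for image_name.

    urljoin() resolves a relative reference against a base URL, which for a
    plain file name means "same scheme, host and directory, different file".
    """
    return urljoin(boot_url, image_name)


# The UKI was loaded from .../myimage.efi, so the rootfs is looked up next to it.
print(derive_rootfs_url("http://example.com/somedir/myimage.efi", "myimage.raw.xz"))
# prints: http://example.com/somedir/myimage.raw.xz
```

The point is only that the initrd needs no second, independently configured URL: the firmware's boot URL plus a relative file name is enough.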
Net result of this: I can now point my UEFI to a single URL where it will load the UKI from. A few seconds later the initrd will pick up the rootfs from the same source, and boot it up. Magic!
Why all this though?
It's mostly to close my test cycle a bit, for physical devices. So here's what this entails:
1. You build your image with mkosi on your development machine, and ask it to serve your image via HTTP. In other words: `mkosi -f serve`.
2. You boot into the target machine once, and register an EFI variable that enables HTTP boot from your development machine. Simply do `kernel-bootcfg --add-uri=http://192.168.47.11:8081/image.efi --title=testloop --boot-order=0`, using @kraxel's wonderful tool.
3. You simply reboot that target machine. It will now fetch the UKI kernel, which then fetches the root disk image. And every time you reboot this happens again. The target machine's local disks are unaffected.
4. …
5. Profit!!
Sounds simple? That's because it is.
(Well of course, you wonder where the magic sauce is. It's here: you need to build your UKIs a certain way: i.e. add to the kernel cmdline: `rd.systemd.pull=verify=no,machine,blockdev,bootorigin,raw:root:image.raw rootflags=x-systemd.device-timeout=infinity ip=any`)
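To make the shape of that `rd.systemd.pull=` value easier to see, here is a small Python sketch that takes it apart. It assumes the value reads as "comma-separated options, then a local name, then the source", separated by colons, which is how the example above looks. It is an illustration only, not systemd's parser, and `PullSpec` and `parse_pull_spec` are hypothetical names.

```python
from typing import Dict, NamedTuple, Optional


class PullSpec(NamedTuple):
    options: Dict[str, Optional[str]]  # flag -> value, or None for bare flags
    local_name: str                    # name the downloaded image gets locally
    source: str                        # URL, or a file name relative to the boot origin


def parse_pull_spec(value: str) -> PullSpec:
    # Split at most twice, so that a full URL (which contains ':') stays intact.
    opts_part, local_name, source = value.split(":", 2)
    options: Dict[str, Optional[str]] = {}
    for item in opts_part.split(","):
        key, sep, val = item.partition("=")
        options[key] = val if sep else None
    return PullSpec(options, local_name, source)


spec = parse_pull_spec("verify=no,machine,blockdev,bootorigin,raw:root:image.raw")
print(spec.options["verify"], spec.local_name, spec.source)
# prints: no root image.raw
```

Read this way, the example asks for a raw image named `image.raw`, fetched relative to the boot origin, stored locally under the name `root`, attached as a block device, with download verification turned off.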
So, two take-aways here:
1. Really nice test loop now for testing immutable, modern OSes, with onboard tooling
2. Yeah, you can frickin' boot into a damn tarball now, with just a UKI.
WIP PR for all of this is here:
@highvoltage well, you probably need it once to create that HTTP boot URL BootXXX efi variable so that the target system just goes to your development device asking for the UKI.
(you could of course also use DHCP/pxe stuff instead, but uh, that's pain, you'd have to use a separate network for that, and run your own DHCP server, much more painful)
@highvoltage that said, some fancy bioses allow you to enter the URL also interactively in firmware setup. I think tianocore does, but never tried it that way.
@pid_eins what would be needed for a verification (verify=yes)?
Overall this sounds really cool and a somewhat interesting replacement scenario for PXE in some cases 🤔
@dvzrv right now verify=yes means gpg (specifically: SHA256SUMS signed with some key whose public key is baked into the initrd). We really want to get away from gpg though, hence I hope to add pkcs7 or so eventually, and maybe other stuff.
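For readers unfamiliar with the SHA256SUMS scheme, a rough Python sketch of the checksum half of it follows. It assumes the usual `sha256sum` manifest layout (hex digest, whitespace, file name) and deliberately leaves out the signature check on the manifest itself, which is the part that actually provides authenticity. The function names are made up for the example.

```python
import hashlib
from typing import Dict


def parse_sha256sums(text: str) -> Dict[str, str]:
    """Map file name -> hex digest from sha256sum-style manifest text."""
    sums: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, _, name = line.partition(" ")
        # The name is preceded by a space (text mode) or '*' (binary mode).
        sums[name.lstrip(" *")] = digest.lower()
    return sums


def matches_manifest(manifest_text: str, name: str, data: bytes) -> bool:
    """True if data hashes to the digest the manifest lists for name."""
    expected = parse_sha256sums(manifest_text).get(name)
    return expected is not None and hashlib.sha256(data).hexdigest() == expected


# Self-contained demo: build a one-line manifest for some dummy payload.
payload = b"pretend this is a disk image"
manifest = hashlib.sha256(payload).hexdigest() + "  image.raw\n"
print(matches_manifest(manifest, "image.raw", payload))      # prints: True
print(matches_manifest(manifest, "image.raw", b"tampered"))  # prints: False
```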
@pid_eins from a first read this sounds a lot like PXE boot - but probably more focused on quick turnaround from a low privilege machine (no DHCP options, tftp/nfs required? Although I’m not 100% sure that was required for PXE boot either)
@thunfisch for pxe you need to set up a dhcp server, and that probably means headaches and that you cannot just use your regular home LAN. But via the kernel-bootcfg tool we can avoid all that mess, and configure the boot source without any such pain, the machine will directly download from our "mkosi serve".
@dvzrv and of course: just use DDIs, i.e. signed verity enabled disk images. way better security, and you simply don't have to bother about download-time verification, because you have something much better: continuous use-time verification.
@pid_eins @highvoltage It's true that DHCP/PXE has always been a pain, which is why I built a provisioner in #mgmtconfig to make it trivially easy: https://purpleidea.com/blog/2024/03/27/a-new-provisioning-tool/
Of course it could be modified to also host this file for this systemd style provisioning too.
oh, and one more comment: this will only work on systems that are relatively high on the systemd adoption scale: you definitely need a systemd-based initrd for this. For deriving the rootfs URL from the UEFI network boot URL you need a systemd-stub based UKI.
@pid_eins cool stuff. This relies on the UEFI firmware downloading the image via HTTP, starting the initrd (kernel and systemd?) and then systemd mounts the same image as an ephemeral rootfs? So no persistence, unless additional work is done. Would be interesting to see how the image could be verified. Probably relies on UEFI support for that to be meaningful verification then?
@pid_eins ah wait no - this is a two step process with a separate UEFI executable and then the image I guess. So you can do the usual secure boot things?
@thunfisch yes, it's two step:
1. firmware downloads UKI, authenticates via SecureBoot, measures as usual.
2. initrd downloads rootfs, authenticates either via gpg signatures, or at runtime via signed verity.
or you just yolo it, and turn off SB. For a test loop that should be OK in many cases.
@highvoltage @pid_eins end of USB business 😎
and even one more comment:
next steps: instead of downloading root fs via http, access it via nvme-over-tcp.
Benefit: better performance (no ahead of time download, but download as needed), and even better: persistency!
@pid_eins I have only come across UEFI impl. that must get hold of the HTTP boot URL from an option (`bootfile-url` code 59) in the DHCP server (which they must use). No way to set this URL in the UEFI menus. I think this was also what the UEFI spec 2.8 (?) said. -- Do you use some hw/UEFI firmware where the HTTP boot URL can be specified in its menus? Which?
@quitelost tianocore has it, i.e. your regular efi vm. Also see comments in other parts of this thread. Apparently there are quite a bunch, and with @kraxel's kernel-bootcfg you can add one on any firmware.
@pid_eins Slowly but surely, systemd is turning into a container engine and I'm here for it!
Out of curiosity, did you ever take a look at boot2container (https://gitlab.freedesktop.org/gfx-ci/boot2container)? It is my podman- and u-root based initrd that boots any container(s) without any installation, based on the kernel cmdline.
That's IMO the next level of flexibility, but I must admit I have not worked on its security at all... but this is mostly meant for CI purposes (DUTs or gateway) so the needs are different.
@mupuf OCI/podman is really not my world, sorry. I didn't drink that Kool-Aid.
You can now boot the latest Debian daily ISOs via UEFI HTTP boot if your firmware supports that and you don't want to deal with silly USB sticks. The needed pmem modules were missing from the installer initrd but not anymore. @highvoltage @pid_eins
@pid_eins aside from OCI or DDIs, are there any plans for a more practical or efficient image format?
It currently feels somewhat cumbersome to try to generate and distribute raw ddi's + extensions for things like portable services or nspawn. It also feels a bit wasteful when you're basing multiple containers on the same image.
I'd love to see something git- or ostree-like…
@pid_eins partially related to this, there's ongoing work in U-Boot to utilise the "pmem" feature on ARM boards at least.
This will enable the bootloader to download one disk image with everything on it and make it accessible to the initrd via /dev/pmemX just like any other block device. So you can directly HTTPS boot distro images or installers with little/no modifications
@cas uefi ramdisk support works the same way: they insert a fake pmem entry in the memory table and linux can directly consume it then.
but i am not too keen to rely on that tbh. i much prefer to download a smallish UKI as first step, and then the big root from linux userspace. simply because firmware code quality sucks ass, and linux is quite OK...
@risen uh, i am happy with ddis.
To say this politely I am not a believer in the security model ostree folks and OCI folks subscribe to. I subscribe to the idea that we should do W^X also for file systems: i.e. a file system is either writable, or it may contain executable files, but never both, as part of guaranteeing that attackers cannot gain persistency, no matter what.
DDIs fit perfectly into the model, but ostree (regardless of whether with or without composefs glue) does conceptually not…
@risen … come close, and well, OCI is just terrible by any standard.
@pid_eins I agree with your idea of non-writable images, but I'd see using ostree (or something similar) more as a way of efficiently deploying updated images. Just like you would drop (write) a new ddi in /var/lib/machines/example.raw.v/
@pid_eins @risen I love using disk images for my system drive, but I really do not want to reserve space for X images during install.
I used to just drop disk images as a file into a simple file system and had a mount unit mount that before mounting the system image as a loopback file.
The downside is obviously that someone could corrupt the filesystem holding the images and I have no way to detect that:-( But on the upside: As many images as I want (and have space for).
@hunger did you see what android did there? they basically did a poor man's LVM based on dm-linear. It's called "dynamic partitions". see:
https://source.android.com/docs/core/ota/dynamic_partitions/implement
We should be able to do something similar. Maybe something as simple as this: if some special bit is set in the GPT flags of a partition we want to use, look for "extension" partitions whose identifying uuid is hashed from the original in counter mode. Pick up all such extensions partitions, then merge them via dm-linear.
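One possible reading of "hashed from the original in counter mode", sketched in Python. The post does not say which hash or encoding would be used, so everything specific here is an assumption: the sketch uses name-based UUIDs (`uuid.uuid5`, i.e. SHA-1 over a namespace plus a name), treats the original partition's UUID as the namespace, and uses the hypothetical label `extension-N` as the counter input.

```python
import uuid
from typing import List, Set


def extension_uuid(base: uuid.UUID, counter: int) -> uuid.UUID:
    """Derive the UUID the counter-th extension partition would carry."""
    return uuid.uuid5(base, f"extension-{counter}")


def find_extensions(base: uuid.UUID, present: Set[uuid.UUID]) -> List[uuid.UUID]:
    """Walk the counter upwards until the derived UUID is not on the disk."""
    found: List[uuid.UUID] = []
    counter = 1
    while True:
        candidate = extension_uuid(base, counter)
        if candidate not in present:
            return found
        found.append(candidate)
        counter += 1


# Demo with a made-up base UUID and a disk that carries two extensions.
base = uuid.UUID("4f68bce3-e8cd-4db1-96e7-fbcaf984b709")
on_disk = {extension_uuid(base, 1), extension_uuid(base, 2), uuid.uuid4()}
print(len(find_extensions(base, on_disk)))  # prints: 2
```

The useful property is that discovery needs no extra metadata: given only the original partition's UUID, the set and order of its extensions can be recomputed at boot.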
@hunger the android folks did measurements which suggest this basically has no IO performance cost.
I think doing this kind of setup at boot would be superduper easy within the systemd framework. Other side of the story would be then to teach repart to optionally fulfill grow requests with such extension partitions if needed, and for sysupdate to know how to write them.
finally, it might make sense to have some separate tool we can call on some mounted fs to make space available.
@hunger an OS upgrade with sysupdate would become a bit more complex: instead of just calling sysupdate we would call the rootfs shrinker, then repart, then sysupdate, then reboot.
@pid_eins so they have a partition and put a GPT into that. Then they manage the embedded GPT dynamically? Should be super easy to support: if the disk GPT has some special UUID, then loop-back mount that partition and continue discovery on the contained GPT...
Sorry, I need to read up on dm-linear :-)
@hunger so they do a 2nd level of gpt partitions, i am not sure that's necessary, we should be able to just use the first level
@hunger i mean gpt by default allows 128 partitions iirc, which should be a lot. it's not that we are going to put bazillions of images there
@hunger no need to read up on dm-linear: it just glues together a bunch of block devices. some people might call that raid0.
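To show what that gluing looks like in practice, here is a small Python sketch that builds a device-mapper table which concatenates several block devices back to back using the `linear` target. Each table line has the form `<start> <length> linear <device> <offset>`, all in 512-byte sectors. The device paths and sizes are invented for the example, and `linear_table` is a hypothetical helper.

```python
from typing import Iterable, Tuple


def linear_table(segments: Iterable[Tuple[str, int]]) -> str:
    """Build a dm table mapping each (device, sectors) segment after the previous one."""
    lines = []
    start = 0
    for device, sectors in segments:
        # Map logical sectors [start, start + sectors) onto device, from its sector 0.
        lines.append(f"{start} {sectors} linear {device} 0")
        start += sectors
    return "\n".join(lines)


print(linear_table([("/dev/sda3", 2097152), ("/dev/sda7", 1048576)]))
# prints:
# 0 2097152 linear /dev/sda3 0
# 2097152 1048576 linear /dev/sda7 0
```

A table like this could be passed to `dmsetup create <name>` to get a single merged block device covering the original partition plus its extensions.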
@pid_eins I used a custom image based system for years and routinely kept about 10 images around. At that point the EFI partition used to overflow as I had UKIs for each image -- each booting that one image only:-)
I kept the initial install, one per customer, and the images going a few days back.
Especially the per customer images proved useful: Getting back to a customer, I always tried the newest image first, having the last one I know worked before as a fallback.