[pull] main from systemd:main #501
Merged
pull[bot] merged 33 commits into adamlaska:main on Feb 25, 2026
No functional change, just refactoring.
Before:
```
$ run0 systemd-nspawn -xD / --private-network --bind=/sys/firmware/efi/efivars --bind=/boot -- build/bootctl --variables=yes --no-pager
(snip)
Boot Loader Entry Locations:
ESP: /boot ()
config: /boot//loader/loader.conf
XBOOTLDR: /boot ($BOOT)
token: fedora
(snip)
```
After:
```
$ run0 systemd-nspawn -xD / --private-network --bind=/sys/firmware/efi/efivars --bind=/boot -- build/bootctl --variables=yes --no-pager
(snip)
Boot Loader Entry Locations:
ESP: /boot
config: /boot//loader/loader.conf
XBOOTLDR: /boot ($BOOT)
token: fedora
(snip)
```
This also fixes the misplaced newline in each output.
When running in a container, EFI and XBOOTLDR partition check is relaxed, hence /boot may be recognized as both EFI and XBOOTLDR partition.
Before:
```
$ run0 systemd-nspawn -xD / --private-network --bind=/sys/firmware/efi/efivars --bind=/boot -E SYSTEMD_LOG_LEVEL=debug -- build/bootctl --variables=yes --no-pager -x
Failed to check file system type of "/efi": No such file or directory
Using EFI System Partition at /boot.
Using XBOOTLDR partition at /boot as $BOOT.
/boot
```
After:
```
$ run0 systemd-nspawn -xD / --private-network --bind=/sys/firmware/efi/efivars --bind=/boot -E SYSTEMD_LOG_LEVEL=debug -- build/bootctl --variables=yes -x
Failed to check file system type of "/efi": No such file or directory
Using EFI System Partition at /boot.
Didn't find an XBOOTLDR partition, using the ESP as $BOOT.
/boot
```
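The new behaviour can be modelled roughly as below. This is a minimal Python sketch, not bootctl's C code: the function name `pick_boot` and the plain path comparison are assumptions made for illustration (the real check may compare the underlying devices rather than path strings).

```python
import os


def pick_boot(esp_path, xbootldr_path):
    """Return (boot_path, log_message) for a detected ESP and an optional
    XBOOTLDR candidate.

    Hypothetical model: an XBOOTLDR candidate that resolves to the same
    location as the ESP is treated as "not found".
    """
    if esp_path is not None and xbootldr_path is not None:
        # Assumption: comparing normalized paths is enough for this sketch.
        if os.path.normpath(esp_path) == os.path.normpath(xbootldr_path):
            xbootldr_path = None

    if xbootldr_path is not None:
        return xbootldr_path, f"Using XBOOTLDR partition at {xbootldr_path} as $BOOT."
    return esp_path, "Didn't find an XBOOTLDR partition, using the ESP as $BOOT."
```

With `/boot` passed for both arguments, the sketch falls back to the ESP, matching the "After" log above.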
The purpose of the userns-restrict BPF-LSM program is to prevent the transient ranges leaking to disk, so let's allow operations outside the transient UID ranges, even if the mount is not allowlisted. This is preparation for the next commits where we'll add support for mapping the current user and the foreign UID range into the user namespaces provisioned by nsresourced. Operations creating files/directories as these UIDs/GIDs should not need the corresponding mount to be allowlisted with nsresourced.
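The relaxed policy can be summarised with a small decision function. This is only a Python sketch of the rule described above, not the BPF program itself. The range bounds are an assumption taken from systemd's documented container UID range; the range nsresourced actually hands out may differ, and `write_allowed` is a hypothetical name.

```python
# Assumed bounds (systemd's documented container UID range); illustrative only.
TRANSIENT_MIN = 0x00080000
TRANSIENT_MAX = 0x6FFFFFFF


def is_transient(id_):
    """True if the UID/GID falls into the assumed transient range."""
    return TRANSIENT_MIN <= id_ <= TRANSIENT_MAX


def write_allowed(uid, gid, mount_allowlisted):
    """Model of the relaxed check: on an allowlisted mount everything is
    permitted; elsewhere only IDs outside the transient range are permitted,
    so transient IDs cannot leak to disk."""
    if mount_allowlisted:
        return True
    return not (is_transient(uid) or is_transient(gid))
```

Under this model a regular user such as UID 1000 can create files on a mount that is not allowlisted, while a transient UID still cannot.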
We want to support the scenario where we bind mount the nsresourced varlink socket into a container to allow nested containers, where the outer container runs in its own transient range from nsresourced but can still allocate transient ranges for its own nested containers. To support this use case, let's add support for delegation. Delegated ranges are allocated when allocating the primary range and are propagated 1:1 to the user namespace. We track delegated ranges in ".delegate" files in the userns registry so that they can't be used for other range allocations. We make one exception for delegated ranges, though: if we get a request from a user namespace that is a child of the user namespace that owns the delegated ranges, we allow allocating from the delegated range. The parent userns already has full ownership over the child userns, so it doesn't matter that the parent userns and the child userns share the same range. This allows making use of delegated ranges without having to run another copy of nsresourced inside the parent userns to hand out from the delegated range. To support recursive delegations, we keep track of the previous owners of the delegated range and restore ownership to the last previous owner when the current owner is freed.
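The ownership tracking for recursive delegation behaves like a stack. The following Python sketch illustrates that idea only; the class and method names are hypothetical, user namespaces are represented by plain strings, and "child" is simplified to "direct child". It does not reflect the on-disk ".delegate" format.

```python
class DelegatedRange:
    """Toy model of one delegated UID range and its chain of owners."""

    def __init__(self, start, size, owner):
        self.start = start
        self.size = size
        self.owner = owner          # userns currently owning the range
        self.previous_owners = []   # stack of earlier owners, oldest first

    def hand_to_child(self, child, child_parent):
        """Give the range to a userns whose parent is the current owner."""
        if child_parent != self.owner:
            raise PermissionError("requester is not a child of the owning userns")
        self.previous_owners.append(self.owner)
        self.owner = child

    def release(self):
        """The current owner is freed. Ownership goes back to the most recent
        previous owner, if any. Returns True once the range is unowned."""
        if self.previous_owners:
            self.owner = self.previous_owners.pop()
            return False
        self.owner = None
        return True
```

For example, a range delegated to an outer container and handed on to a nested one returns to the outer container when the nested one exits, and only becomes free once the outer container is gone as well.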
mkosi does all of its environment setup in an unprivileged user namespace with an identity mapping. When it invokes nspawn and nspawn tries to get a transient userns from nsresourced, this fails as no transient ranges are mapped into mkosi's unprivileged userns (as doing so would require privileges). To fix this problem, let's allow allocating unprivileged self user namespaces in nsresourced, similar to what the kernel allows, except that we also support delegations for these. This means that mkosi can get its unprivileged userns as before from nsresourced, but it can also request a delegated 64K range inside that userns as well, which nsresourced can then allocate to nspawn later when it asks for one. Similar to the kernel, we disallow setgroups for self mappings. However, instead of doing this via /proc/self/setgroups, which applies to the current user namespace and all its child user namespaces, we use the BPF LSM to deny setgroups instead, so that it can still be allowed for child user namespaces. We need this because as soon as a container launches in a child user namespace using one of the delegated mappings, it has to be able to do setgroups() to be able to function properly. To allow mapping the root user, we need to add the CAP_SETFCAP capability to nsresourced.
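The difference between the two ways of denying setgroups can be shown with a toy model. This is a Python sketch under simplifying assumptions: user namespaces are strings, `parents` maps each userns to its parent, and both function names are hypothetical. It models the semantics described above, not the kernel or the BPF program.

```python
def allowed_with_proc_setgroups(userns, parents, denied):
    """/proc/self/setgroups style: a "deny" applies to the userns it was set
    on and to all of its descendants."""
    while userns is not None:
        if userns in denied:
            return False
        userns = parents.get(userns)
    return True


def allowed_with_bpf_lsm(userns, denied):
    """BPF LSM style as described above: only the self-mapped userns itself
    is denied, so child user namespaces may still call setgroups()."""
    return userns not in denied
```

With `parents = {"mkosi": None, "nspawn": "mkosi"}` and `denied = {"mkosi"}`, the first model also blocks the nspawn container, while the second one blocks only mkosi's own userns.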
Whenever delegating UID ranges to a user namespace, it can also be useful to map the foreign UID range, so that the container running in the user namespace with delegated UID ranges can download container images and unpack them to the foreign UID range. Let's add an option mapForeign to make this possible. Note that this option gives unprivileged users full access to any directory owned by the foreign UID range that they can access. Hence it is recommended (and already was recommended) to store directories owned by the foreign UID range in a 0700 directory owned by the owner of the tree, to avoid access and modifications by other users. This is already the case for the main users of the foreign UID range, namely /var/lib/machines, /var/lib/portables and /home/<user>, which all use 0700 as their mode. Users will also be able to create inodes owned by the foreign UID range in any directory their own user can write to (on most systems this means /tmp, /var/tmp and /home/<user>).
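A quick way to check the 0700 recommendation for a given parent directory is sketched below. This is an illustrative Python helper, not part of systemd; the name `is_private_dir` is hypothetical.

```python
import os
import stat


def is_private_dir(path, owner_uid=None):
    """Return True if path is a directory with no group/other permission
    bits set (for example mode 0700) and, if owner_uid is given, owned by
    that UID."""
    st = os.stat(path)
    if not stat.S_ISDIR(st.st_mode):
        return False
    if owner_uid is not None and st.st_uid != owner_uid:
        return False
    # Any bit in 0o077 means group or others have some access.
    return stat.S_IMODE(st.st_mode) & 0o077 == 0
```

Calling `is_private_dir("/var/lib/machines", owner_uid=0)` on a system that follows the recommendation above would be expected to return True.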
- Allow foreign UID range
- Allow delegated UID ranges
Both of these can now be mapped by nsresourced into user namespaces and hence should be accepted by mountfsd.
ENOENT means /dev/loop-control isn't there, which means we're in a container and should go via mountfsd. At the same time, reverse the check for fatal actions, as almost all actions can be done via mountfsd; only --attach needs the loop device.
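The fallback logic reads roughly as follows. This is a Python sketch of the decision only, not the actual C implementation: the function names and the action strings are hypothetical, and the sketch models just the ENOENT case described above (other errors are re-raised).

```python
import errno
import os


def need_mountfsd(loop_control="/dev/loop-control"):
    """Return True if the loop control device is missing (ENOENT), which is
    taken to mean we are in a container and should go via mountfsd."""
    try:
        fd = os.open(loop_control, os.O_RDWR | os.O_CLOEXEC)
    except OSError as e:
        if e.errno == errno.ENOENT:
            return True
        raise  # other errors are outside the scope of this sketch
    os.close(fd)
    return False


def choose_backend(action, loop_control="/dev/loop-control"):
    """Pick "loop" or "mountfsd"; only the attach action is fatal without a
    loop device, every other action can fall back to mountfsd."""
    if not need_mountfsd(loop_control):
        return "loop"
    if action == "attach":
        raise RuntimeError("attaching requires a loop device")
    return "mountfsd"
```

Treating only the one action that strictly needs the loop device as fatal keeps the list of special cases short, which is the point of reversing the check.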
ninja -C build update-hwdb
ninja -C build update-hwdb-autosuspend
Follow-up for eb581ff
ninja -C build update-man-rules
ninja -C build systemd-pot
ninja -C build systemd-update-po
We expose this via --private-users-delegate=, which takes the number of ranges to delegate. On top of delegating the ranges, we also mount in the nsresourced socket and the mountfsd socket so that nested containers can use nsresourced to allocate from the delegated ranges and mountfsd to mount images. Finally, we also create /run/systemd/dissect-root with systemd-tmpfiles to make sure it is always available, as unprivileged users won't be able to create it themselves.
These syscalls are part of a newer kernel API to replace interaction with /proc/self/attr, with the goal of allowing LSM stacking. These are being used now by e.g. libapparmor, so should be more easily available to services using seccomp filtering.
Addresses Zbigniew's comments left on the previous MR after merging: #40400 (review)
TEST-74-AUX-UTILS.sh[3789]: + groupadd haldo
TEST-74-AUX-UTILS.sh[3875]: ==3875==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
Follow-up for 1012c6c
Updated by "Update PO files to match POT (msgmerge)" hook in Weblate.
Co-authored-by: Hosted Weblate <hosted@weblate.org>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/
Translation: systemd/main
…oreign UID range in mkosi (#40415)
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)