[pull] main from systemd:main #501

Merged
pull[bot] merged 33 commits into adamlaska:main from systemd:main
Feb 25, 2026
@pull pull bot commented Feb 25, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

yuwata and others added 30 commits February 13, 2026 16:08
No functional change, just refactoring.
Before:
```
$ run0 systemd-nspawn -xD / --private-network --bind=/sys/firmware/efi/efivars --bind=/boot -- build/bootctl --variables=yes --no-pager
(snip)
Boot Loader Entry Locations:
          ESP: /boot ()
       config: /boot//loader/loader.conf
     XBOOTLDR: /boot ($BOOT)
        token: fedora
(snip)
```

After:
```
$ run0 systemd-nspawn -xD / --private-network --bind=/sys/firmware/efi/efivars --bind=/boot -- build/bootctl --variables=yes --no-pager
(snip)
Boot Loader Entry Locations:
          ESP: /boot
       config: /boot//loader/loader.conf
     XBOOTLDR: /boot ($BOOT)
        token: fedora
(snip)
```

This also moves the misplaced newline in each output to the correct position.
When running in a container, the EFI and XBOOTLDR partition checks are
relaxed, hence /boot may be recognized as both the EFI and the XBOOTLDR
partition.

Before:
```
$ run0 systemd-nspawn -xD / --private-network --bind=/sys/firmware/efi/efivars --bind=/boot -E SYSTEMD_LOG_LEVEL=debug -- build/bootctl --variables=yes --no-pager -x
Failed to check file system type of "/efi": No such file or directory
Using EFI System Partition at /boot.
Using XBOOTLDR partition at /boot as $BOOT.
/boot
```

After:
```
$ run0 systemd-nspawn -xD / --private-network --bind=/sys/firmware/efi/efivars --bind=/boot -E SYSTEMD_LOG_LEVEL=debug -- build/bootctl --variables=yes -x
Failed to check file system type of "/efi": No such file or directory
Using EFI System Partition at /boot.
Didn't find an XBOOTLDR partition, using the ESP as $BOOT.
/boot
```
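The deduplication shown in the Before/After output can be sketched as follows. This is a minimal illustration in Python, not bootctl's actual C code: the helper name `pick_boot_paths` and the use of `os.path.realpath` to decide that two candidates are the same location are assumptions made for the example.

```python
import os


def pick_boot_paths(esp, xbootldr):
    """Return (esp, xbootldr, boot) with a duplicate XBOOTLDR dropped.

    Hypothetical helper: if the XBOOTLDR candidate resolves to the same
    location as the ESP, treat XBOOTLDR as not found and use the ESP as $BOOT.
    """
    if esp is not None and xbootldr is not None:
        if os.path.realpath(esp) == os.path.realpath(xbootldr):
            xbootldr = None
    # $BOOT is the XBOOTLDR partition if one remains, otherwise the ESP.
    boot = xbootldr if xbootldr is not None else esp
    return esp, xbootldr, boot
```

With both candidates pointing at the same path, the sketch reports no XBOOTLDR partition and uses the ESP as `$BOOT`, which mirrors the "After" log line.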
The purpose of the userns-restrict BPF-LSM program is to prevent
transient ranges from leaking to disk, so let's allow operations outside the
transient UID ranges, even if the mount is not allowlisted.

This is preparation for the next commits where we'll add support for mapping
the current user and the foreign UID range into the user namespaces provisioned
by nsresourced. Operations creating files/directories as these UIDs/GIDs should
not need the corresponding mount to be allowlisted with nsresourced.
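The relaxed policy can be modelled roughly like this. The sketch is plain Python rather than BPF, the function name is hypothetical, and the range bounds are assumed values for the transient (container) UID range that the commit message does not state; whether the real program checks UIDs, GIDs, or both is also an assumption here.

```python
# Assumed bounds of the transient UID/GID range; illustrative only,
# these numbers do not come from the commit message.
TRANSIENT_MIN = 524288
TRANSIENT_MAX = 1879048191


def may_create_inode(uid, gid, mount_allowlisted):
    """Model of the relaxed check: only transient IDs need an allowlisted mount."""

    def transient(i):
        return TRANSIENT_MIN <= i <= TRANSIENT_MAX

    if not (transient(uid) or transient(gid)):
        # Outside the transient ranges nothing transient can leak to disk.
        return True
    return mount_allowlisted
```

Under this model a regular user ID or a foreign-range ID passes regardless of the mount, while a transient ID still requires the mount to be allowlisted.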
We want to support the scenario where we bind mount the nsresourced
varlink socket into a container to allow nested containers where the
outer container runs in its own transient range from nsresourced but
can still allocate transient ranges for its own nested containers.

To support this use case let's add support for delegation. Delegated
ranges are allocated when allocating the primary range and are propagated
1:1 to the user namespace. We track delegated ranges in ".delegate" files
in the userns registry so that they can't be used for other range allocations.

We make one exception for delegated ranges, though: if we get a request from
a user namespace that is a child of the user namespace that owns the delegated
ranges, we allow allocating from the delegated range. The parent userns already
has full ownership over the child userns, so it doesn't matter that the parent
userns and the child userns share the same range. This allows making use of
delegated ranges without having to run another copy of nsresourced inside the
parent userns to hand out from the delegated range.

To support recursive delegations, we keep track of the previous owners of the
delegated range and restore ownership to the last previous owner when the current
owner is freed.
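The ownership tracking for recursive delegation can be pictured as a stack of previous owners. The following is a small in-memory sketch only: the class and method names are hypothetical, user namespaces are represented by plain strings, and it assumes "child" means a direct child. The real implementation tracks this state in ".delegate" files in the userns registry.

```python
class DelegatedRange:
    """A delegated UID range and the stack of user namespaces that owned it."""

    def __init__(self, start, owner):
        self.start = start
        self.owner = owner          # current owner (a userns identifier)
        self.previous_owners = []   # most recent previous owner is last

    def allocate_to_child(self, child, parent_of):
        """Hand the range to `child` only if it is a child of the current owner."""
        if parent_of.get(child) != self.owner:
            raise PermissionError("requester is not a child of the owning userns")
        self.previous_owners.append(self.owner)
        self.owner = child

    def free_current_owner(self):
        """Restore the last previous owner; return False if none is left."""
        if not self.previous_owners:
            self.owner = None
            return False
        self.owner = self.previous_owners.pop()
        return True
```

Freeing the innermost owner hands the range back one level at a time, so an outer container regains its delegated range once the nested container that borrowed it goes away.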
mkosi does all of its environment setup in an unprivileged user
namespace with an identity mapping. When it invokes nspawn and nspawn
tries to get a transient userns from nsresourced, this fails as no
transient ranges are mapped into mkosi's unprivileged userns (as doing
so would require privileges).

To fix this problem, let's allow allocating unprivileged self user
namespaces in nsresourced, similar to what the kernel allows, except that
we also support delegations for these. This means that mkosi can get its
unprivileged userns as before from nsresourced, but it can also request a
delegated 64K range inside that userns as well, which nsresourced can then
allocate to nspawn later when it asks for one.

Similar to the kernel, we disallow setgroups for self mappings. However,
instead of doing this via /proc/self/setgroups, which applies to the current
user namespace and all its child user namespaces, we use the BPF LSM to deny
setgroups, so that it can still be allowed for child user namespaces.
We need this because as soon as a container launches in a child user namespace
using one of the delegated mappings, it has to be able to do setgroups() to be
able to function properly.

To allow mapping the root user, we need to add the CAP_SETFCAP capability to
nsresourced.
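To make the shape of such a mapping concrete, here is a sketch that builds ID map lines in the kernel's `uid_map` format (`ID-inside-ns ID-outside-ns length`) for one self mapping plus delegated 64K ranges propagated 1:1. The function name and the example numbers are hypothetical, and the sketch only produces strings; it does not write to `/proc` or talk to nsresourced.

```python
RANGE_SIZE = 65536  # the "64K range" mentioned in the text


def build_id_map(inside_id, outside_id, delegated_starts):
    """Return uid_map-style lines: one self mapping plus 1:1 delegated ranges."""
    # Self mapping: a single ID of the caller mapped into the namespace.
    lines = [f"{inside_id} {outside_id} 1"]
    for start in delegated_starts:
        # Delegated ranges are propagated 1:1, so inside == outside.
        lines.append(f"{start} {start} {RANGE_SIZE}")
    return lines
```

For example, `build_id_map(1000, 1000, [1048576])` yields an identity mapping for UID 1000 followed by one delegated range starting at the illustrative base 1048576.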
Whenever delegating UID ranges to a user namespace, it can also be
useful to map the foreign UID range, so that the container running in
the user namespace with delegated UID ranges can download container
images and unpack them to the foreign UID range.

Let's add an option mapForeign to make this possible. Note that this option
gives unprivileged users full access to any foreign UID range owned directory
that they can access. Hence it is recommended (and already was recommended) to
store foreign UID range owned directories in a 0700 directory owned by the
owner of the tree to avoid access and modifications by other users.

This is already the case for the main users of the foreign UID range,
namely /var/lib/machines, /var/lib/portables and /home/<user> which all
use 0700 as their mode.

Users will also be able to create foreign UID range owned inodes in any
directories their own user can write to (on most systems this means /tmp,
/var/tmp and /home/<user>).
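The recommendation to keep foreign UID range owned trees inside a 0700 directory can be checked with a few lines of Python. This is only an illustrative helper with a hypothetical name; it checks against the calling user and is not something systemd ships.

```python
import os
import stat


def is_private_to_owner(path):
    """True if `path` is a directory with mode 0700 owned by the calling user."""
    st = os.stat(path)
    return (
        stat.S_ISDIR(st.st_mode)
        and stat.S_IMODE(st.st_mode) == 0o700
        and st.st_uid == os.getuid()
    )
```

A tool that unpacks images into the foreign UID range could run such a check on the parent directory before writing, and refuse or warn when other users would be able to reach the tree.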
- Allow foreign UID range
- Allow delegated UID ranges

Both of these can now be mapped by nsresourced into user namespaces
and hence should be accepted by mountfsd.
ENOENT means /dev/loop-control isn't there which means we're in a
container and should go via mountfsd.

At the same time, reverse the check for fatal actions as almost all
actions can be done via mountfsd, only --attach needs the loop device.
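The fallback decision can be sketched as below. The function name and the `"direct"`/`"mountfsd"` labels are hypothetical, and re-raising every error other than ENOENT is an assumption of this sketch, since the commit message only discusses the ENOENT case.

```python
import errno
import os


def pick_backend(loop_control="/dev/loop-control"):
    """Return "direct" if the loop control node opens, "mountfsd" on ENOENT."""
    try:
        fd = os.open(loop_control, os.O_RDWR | os.O_CLOEXEC)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # No loop control device: most likely a container, so go via mountfsd.
            return "mountfsd"
        raise  # other errors are not modelled here
    os.close(fd)
    return "direct"
```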
ninja -C build update-hwdb
ninja -C build update-hwdb-autosuspend
ninja -C build update-man-rules
ninja -C build systemd-pot
ninja -C build systemd-update-po
We expose this via --private-users-delegate= which takes the number of
ranges to delegate. On top of delegating the ranges, we also mount in
the nsresourced socket and the mountfsd socket so that nested containers
can use nsresourced to allocate from the delegated ranges and mountfsd to
mount images.

Finally, we also create /run/systemd/dissect-root with systemd-tmpfiles to
make sure it is always available as unprivileged users won't be able to create it
themselves.
These syscalls are part of a newer kernel API to replace interaction
with /proc/self/attr, with the goal of allowing LSM stacking. These are
being used now by e.g. libapparmor, so should be more easily available
to services using seccomp filtering.
Addresses Zbigniew's comments left on the previous MR after merging:
#40400 (review)
TEST-74-AUX-UTILS.sh[3789]: + groupadd haldo
TEST-74-AUX-UTILS.sh[3875]: ==3875==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.

Follow-up for 1012c6c
Updated by "Update PO files to match POT (msgmerge)" hook in Weblate.

Co-authored-by: Hosted Weblate <hosted@weblate.org>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/
Translation: systemd/main
@pull pull bot locked and limited conversation to collaborators Feb 25, 2026
@pull pull bot added the ⤵️ pull label Feb 25, 2026
@pull pull bot merged commit 04897fb into adamlaska:main Feb 25, 2026