https://github.com/mk-fg/systemd-cgroup-nftables-policy-manager
Tool to add/update nftables cgroupv2 rules for systemd-managed unit cgroups (slices, services, scopes)
cgroups firewall network nftables nim systemd
- Host: GitHub
- URL: https://github.com/mk-fg/systemd-cgroup-nftables-policy-manager
- Owner: mk-fg
- License: wtfpl
- Created: 2021-08-23T08:24:57.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2025-01-22T06:11:37.000Z (over 1 year ago)
- Last Synced: 2025-03-23T18:37:28.366Z (about 1 year ago)
- Topics: cgroups, firewall, network, nftables, nim, systemd
- Language: Nim
- Homepage:
- Size: 65.4 KB
- Stars: 13
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: COPYING
systemd cgroup (v2) nftables policy manager
===========================================
.. contents::
:backlinks: none
Repository URLs:
- https://github.com/mk-fg/systemd-cgroup-nftables-policy-manager
- https://codeberg.org/mk-fg/systemd-cgroup-nftables-policy-manager
- https://fraggod.net/code/git/systemd-cgroup-nftables-policy-manager
Description
-----------
Small tool that adds and updates nftables_ cgroupv2 filtering rules for
systemd_-managed per-unit cgroups (slices, services, scopes).
"cgroupv2" is also often referred to as "unified cgroup hierarchy" (considered
stable in linux since 2015). It works differently from the old cgroup
implementation, and is the only one supported here.
A similar capability has also been added to systemd versions 255+ (2023-12-06 and
later) via the NFTSet= option in unit files (see `"man systemd.resource-control"`_),
but its use is limited to system units (it can't be used in ``~/.config/systemd/user``
session units).
This tool is somewhat redundant with that functionality, but can still be useful
for user session units, or if NFTSet= doesn't work for some purpose/reason.
.. _nftables: https://nftables.org/
.. _systemd: https://systemd.io/
.. _"man systemd.resource-control":
https://man.archlinux.org/man/systemd.resource-control.5
Problem that it addressess
~~~~~~~~~~~~~~~~~~~~~~~~~~
nftables supports "socket cgroupv2" matching in rules (since linux-5.13),
similar to iptables' "-m cgroup --path ...", which can be used to add rules
like this::
add rule inet filter output socket cgroupv2 level 5 \
"user.slice/user-1000.slice/user@1000.service/app.slice/myapp.service" accept
(or in iptables: ``iptables -A OUTPUT -m cgroup --path ... -j ACCEPT``)
But putting this into /etc/nftables.conf will make it fail to load on boot
(same as with similar iptables rules), as that "myapp.service" cgroup with a long
path does not exist yet at that point.
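A side note on the ``level 5`` part of the rule above: in the examples here, the
level always equals the depth of the given path below the cgroup2 root, i.e. the
number of its components. A minimal Python sketch of that arithmetic, assuming
this relation holds in general (``cgroup_level`` is a made-up helper name, not
something provided by this tool or by nftables):

```python
def cgroup_level(path: str) -> int:
    """Count components of a cgroup path relative to the cgroup2 root."""
    # Hypothetical helper: assumes "level N" equals the path depth, which
    # matches the examples in this README but is not checked against nft.
    return len([part for part in path.strip("/").split("/") if part])


print(cgroup_level("system.slice/postfix.service"))  # 2
print(cgroup_level(
    "user.slice/user-1000.slice/user@1000.service/app.slice/myapp.service"))  # 5
```

Something like this can help to avoid miscounting when writing rules for deeply
nested user-session units by hand.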
The kernel-side matchers behind such rules (nft_socket for nftables, xt_cgroup
for iptables) do not compare path strings when looking at a packet: the path is
resolved to a specific cgroup (numeric cgroup ID) when the rule is loaded, and
that is not updated in any way when cgroups are created or removed.
This means that:
- Firewall rules can't be added for not-yet-existing cgroups.
Causes "Error: cgroupv2 path fails: No such file or directory" from "nft"
command and "xt_cgroup: invalid path, errno=-2" error in dmesg for iptables.
- If a cgroup gets removed and re-created, none of the existing rules will apply to it.
This is because the new cgroup gets a new unique ID, which can't be present in any
pre-existing netfilter tables, so none of the rules will match it.
So basically such rules in a system-wide policy-config only work for cgroups
that are created early on boot and never removed after that.
This is not what happens with most systemd services and slices, restarting which
will also re-create cgroups, and which are usually started way after system
firewalls are initialized (and often can't be started on boot - e.g. user units).
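To observe the ID change described above, the inode number of the cgroup
directory can be compared before and after a unit restart. This is a rough
Python sketch under two assumptions that are not stated in this README: that the
inode number of a cgroup2 directory corresponds to its cgroup ID, and that
cgroup2 is mounted at the usual ``/sys/fs/cgroup`` path. ``cgroup_id`` is a
hypothetical helper, not part of this tool:

```python
import os
from typing import Optional

CGROUP_ROOT = "/sys/fs/cgroup"  # assumption: default cgroup2 mount point


def cgroup_id(rel_path: str, root: str = CGROUP_ROOT) -> Optional[int]:
    """Return inode number of a cgroup directory, or None if it does not exist."""
    try:
        return os.stat(os.path.join(root, rel_path)).st_ino
    except FileNotFoundError:
        # Same situation in which loading a rule with that path would fail
        return None


if __name__ == "__main__":
    # Run before and after "systemctl restart postfix.service" to compare values
    print(cgroup_id("system.slice/postfix.service"))
```

If the two printed values differ, any rule loaded before the restart is still
tied to the old value.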
Solution:
~~~~~~~~~
Since this tool was written, the ``NFTSet=`` directive was added to systemd,
which mostly addresses this for system units already - use that if possible,
and see the caveats section below for some of the potential shortcomings there.
Monitor cgroup (or systemd unit) creation/removal events and (re-)apply any
relevant rules to these dynamically.
This is `how "socket cgroupv2" matcher in nftables is intended to work`_::
Following the decoupled approach: If the cgroup is gone, the filtering
policy would not match anymore. You only have to subscribe to events
and perform an incremental updates to tear down the side of the
filtering policy that you don't need anymore. If a new cgroup is
created, you load the filtering policy for the new cgroup and then add
processes to that cgroup. You only have to follow the right sequence
to avoid problems.
So that's pretty much what this simple tool does, subscribing to systemd unit
start/stop events via journal (using libsystemd) and updating any relevant rules
on events from there (using libnftables).
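A rough Python illustration of the event-matching step, under several
assumptions not confirmed by this README: that systemd's own journal messages
about a unit carry its name in the ``UNIT=`` (system manager) or ``USER_UNIT=``
(``systemd --user``) field, and that entries are read as JSON objects, e.g. from
``journalctl -f -o json``. The actual tool uses libsystemd from Nim and may match
different fields; ``unit_for_entry`` and ``rules_for_entry`` are made-up names:

```python
import json
from typing import Dict, List, Optional


def unit_for_entry(entry: Dict[str, object]) -> Optional[str]:
    """Pick a unit name from a journal entry, system or user-session one."""
    for field in ("UNIT", "USER_UNIT"):  # assumed field names
        value = entry.get(field)
        if isinstance(value, str):
            return value
    return None


def rules_for_entry(line: str, rules: Dict[str, List[str]]) -> List[str]:
    """Return rules that should be re-applied for one JSON journal line."""
    unit = unit_for_entry(json.loads(line))
    return rules.get(unit, []) if unit else []


# Unit name -> nft rules, same rule as used in examples in this README
rules = {"postfix.service": [
    'add rule inet filter vpn.whitelist socket cgroupv2 level 2 '
    '"system.slice/postfix.service" tcp dport 25 accept']}
```

Entries for units without any configured rules return an empty list, so most of
the journal traffic is simply ignored.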
.. _how "socket cgroupv2" matcher in nftables is intended to work:
https://patchwork.ozlabs.org/project/netfilter-devel/patch/1479114761-19534-1-git-send-email-pablo@netfilter.org/#1511797
Intended use-case:
~~~~~~~~~~~~~~~~~~
Defining system-wide policy to whitelist connections to/from specific systemd
units (can be services/apps, slices of those, or ad-hoc scopes) in an easy and
relatively foolproof way.
E.g. if a desktop system is connected to some kind of "intranet" VPN, there's
no reason for random complex and leaky apps like web browsers or games to be able
to connect to anything there (think fetch() JS call from any site you visit),
and that is trivial to block with a single firewall rule.
This tool is intended to manage, on top of that, a whitelist of rules for the
systemd units that should have access there, and hence are allowed to bypass
such a rule.
Again, systemd has the aforementioned NFTSet= option, as well as network filtering
via eBPFs attached to cgroups (IPAddressAllow/Deny=, BPFProgram=, IPEgressFilterPath=
and such), which can be used as an alternative to this tool.
Build / Install
---------------
This is a small Nim_ command-line app, which can be built with any modern
`Nim compiler`_, e.g. using the included Makefile::
% make
% ./scnpm --help
Usage: ./scnpm [opts] [nft-configs ...]
...
(or run ``nim c -d:release -d:strip -d:lto_incremental --opt:size scnpm.nim`` without make)
That should produce a ~150K binary, linked against libsystemd (for journal access)
and libnftables (to re-apply cgroupv2 nftables rules), which can then be installed
and copied between systems normally.
The Nim compiler is only needed to build the tool, not to run it.
The scnpm.service_ systemd unit file can be used to auto-start it on boot.
The journal is used as an event source instead of more conventional dbus signals
in order to monitor state changes of units under all "systemd --user" instances
as well as system ones. With dbus, those signals are sent through multiple
transient brokers, and so are much more difficult to track reliably there.
.. _Nim: https://nim-lang.org/
.. _Nim compiler: https://nim-lang.org/install_unix.html
.. _scnpm.service: scnpm.service
Usage
-----
The tool is designed to parse special commented-out rules intended for it from
the same nftables.conf that holds the rest of the ruleset, for consistency
(though of course they can be stored in any other file(s) as well)::
## Allow connections to smtp over vpn for system postfix.service
# postfix.service :: add rule inet filter vpn.whitelist \
# socket cgroupv2 level 2 "system.slice/postfix.service" tcp dport 25 accept
## Allow connections to intranet mail for a scope unit running under "systemd --user"
## "systemd-run" can be used to easily start apps in custom scopes or slices
# app-mail.scope :: add rule inet filter vpn.whitelist socket cgroupv2 level 5 \
# "user.slice/user-1000.slice/user@1000.service/app.slice/app-mail.scope" \
# ip daddr mail.intranet.local tcp dport {25, 143} accept
## Only allow whitelisted apps to connect over "my-vpn" iface
add rule inet filter output oifname my-vpn jump vpn.whitelist
add rule inet filter output oifname my-vpn drop
If left uncommented, those "add rule" lines would make this config fail to apply
on boot, as the service/scope/slice cgroups won't exist yet at that point in time.
The tool will parse those " :: " comments instead, and try to apply rules from
them on start and whenever any kind of state-change happens to a unit
with the name specified there.
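The comment format above can be parsed roughly like this. It is a simplified
Python sketch rather than the tool's actual (Nim) parser, and it assumes that
only single-``#`` comment lines are considered, with a trailing backslash
continuing the rule on the next comment line; ``parse_unit_rules`` is a
hypothetical name:

```python
import re
from collections import defaultdict
from typing import Dict, List


def parse_unit_rules(conf_text: str) -> Dict[str, List[str]]:
    """Collect "# unit :: rule" comments into a {unit: [rule, ...]} mapping."""
    rules = defaultdict(list)
    unit, parts = None, []
    for line in conf_text.splitlines():
        # Single "#" comments only; "##" lines are treated as plain comments
        match = re.match(r"\s*#(?!#)\s?(.*)$", line)
        if not match:
            unit, parts = None, []  # drop any unfinished continuation
            continue
        body = match.group(1).strip()
        if unit is None:
            if " :: " not in body:
                continue
            unit, body = (s.strip() for s in body.split(" :: ", 1))
        continued = body.endswith("\\")
        parts.append(body.rstrip("\\").strip())
        if not continued:
            rules[unit].append(" ".join(p for p in parts if p))
            unit, parts = None, []
    return dict(rules)
```

Regular (uncommented) nftables lines are skipped, so the same file can be passed
both to "nft -f" and to a parser like this one.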
For example, when postfix.service is stopped/restarted with the config above,
corresponding vpn.whitelist rule will be removed and re-added, allowing access
to a new cgroup which systemd will create for it after restart.
To start it in verbose mode: ``./scnpm --flush --debug /etc/nftables.conf``
``-f/--flush`` option will purge (flush) all chains mentioned in the
monitored/applied rules when the tool starts, so that leftover rules from any
previous runs are removed. It can be replaced with more fine-grained manual
removal if these are not dedicated chains used only for such dynamic rules.
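Which chains such a flush would touch can be worked out from the rules
themselves. A minimal Python sketch of that idea, assuming every rule has the
"add rule <family> <table> <chain> ..." form used in the example config above;
how the tool itself does this via libnftables is not described here, and
``flush_commands`` is a made-up name:

```python
from typing import Iterable, List


def flush_commands(rules: Iterable[str]) -> List[str]:
    """Build one "flush chain" command per distinct chain used in rules."""
    commands: List[str] = []
    for rule in rules:
        words = rule.split()
        # Assumes the "add rule <family> <table> <chain> ..." form
        if len(words) < 5 or words[:2] != ["add", "rule"]:
            continue
        command = "flush chain " + " ".join(words[2:5])
        if command not in commands:
            commands.append(command)
    return commands
```

For the example config above, this would yield a single
``flush chain inet filter vpn.whitelist`` command, which is why using a dedicated
chain for dynamic rules keeps the flush from affecting anything else.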
Running without ``-d/--debug`` should not normally produce any output, unless
there are some (non-critical) warnings like unexpected mismatch or nftables error,
code bugs or fatal errors.
Starting the tool on boot should be scheduled after nftables.service,
so that the ``--flush`` option is able to find all required chains,
as the tool will exit with an error otherwise.
Multiple nftables rules linked to same systemd unit(s) are allowed.
Changes in parsed config files are not auto-detected, and are only applied when
the tool receives SIGHUP or gets restarted. That can be done manually after
changes, configured in nftables.service (e.g. via PropagatesReloadTo= and/or
BindsTo=), or triggered by a systemd.path unit monitoring the state of source
configuration file(s).
Alternatively - without any signal - ``-u/--reload-with-unit`` or
``-a/--reapply-with-unit`` opts can be used, since this tool monitors systemd
unit states anyway, and can spot when things restart there on its own.
Syntax errors in nftables rules should produce warnings when these are applied on
tool start or changes, so should be hard to miss, but it is still a good idea to
check "nft list chain" or debug output when rules are supposed to be enabled
after conf updates.
To modify nftables rulesets, CAP_NET_ADMIN capability is required, which can be
passed via AmbientCapabilities= in systemd service (or similar option in capsh)
in addition to SupplementaryGroups=systemd-journal and netlink access to avoid
running this as full root.
Caveats and limitations
-----------------------
- Due to the "best-effort" nature of trying to apply rules when unit startup is
detected, and an inherent race condition between systemd creating
service/cgroup and the rule being applied, I'd strongly recommend always using
allow-listing rules with this tool, which fail on the safe side.
- I think "cgroupv2" in an nftables rule must be the one where the network socket
was created, and not the one where systemd might move the process using it.
So for incoming ssh connections for example, "sshd-session" process might
end up in session-N.scope under user.slice, but nftables will only match it
as belonging to sshd.service cgroup, so some rules might need to have a different
cgroup string in the rule than the name that triggers the rule to the left of it.
Not 100% sure that's how it works or is supposed to work, but have observed it earlier.
- Use HUP signal, ``-u/--reload-with-unit`` (same as SIGHUP) or ``-a/--reapply-with-unit``
option to restore transient cgroup-specific rules after nftables restart
or other firewall resets that'd remove those.
Links
-----
- `systemd.resource-control(5)`_ manpage that describes implementation of
similar functionality there - look up ``NFTSet=`` option.
- `Systemd firewall integration suggestions (issue #7327)`_ - more comprehensive
netfilter integration than NFTSet= option above, still at a proposal/suggestion
stage at the moment (2025-04-10), neither accepted nor rejected.
- `helsinki-systems/nft_cgroupv2`_ - alternative third-party implementation of
such matching in nftables.
AFAICT it doesn't rely on cgroup IDs and instead resolves these from cgroup
path for every packet, which is probably not great wrt performance, but might
be ok for most use-cases where conntrack filters out traffic before these rules.
Might conflict with current upstream nftables implementation due to "cgroupv2"
keyword used there as well.
- Upstreamed "netfilter: nft_socket: add support for cgroupsv2" patch
for "cgroupv2" matching support in nftables (0.99+) on the linux kernel side (linux-5.13+).
- "netfilter: implement xt_cgroup cgroup2 path match" patch from linux-4.5.
- Earlier version of this tool was written in OCaml_, and can be last found in
commit 048a8128.
.. _systemd.resource-control(5): https://man.archlinux.org/man/systemd.resource-control.5
.. _Systemd firewall integration suggestions (issue #7327):
https://github.com/systemd/systemd/issues/7327
.. _helsinki-systems/nft_cgroupv2: https://github.com/helsinki-systems/nft_cgroupv2/
.. _OCaml: https://ocaml.org/