* [PATCHSET v6] Inherited restrictions and BPF filtering for io_uring
@ 2026-01-19 23:54 Jens Axboe
2026-01-19 23:54 ` [PATCH 1/7] io_uring: add support for BPF filtering for opcode restrictions Jens Axboe
` (7 more replies)
0 siblings, 8 replies; 16+ messages in thread
From: Jens Axboe @ 2026-01-19 23:54 UTC (permalink / raw)
To: io-uring; +Cc: brauner, jannh, kees, linux-kernel
Hi,
Followup to v5 here:
https://lore.kernel.org/io-uring/20260118172328.1067592-1-axboe@kernel.dk/
Mostly just addressing a bit of feedback; feature-wise this is all the
same as before. For details on the patches, see the v5 posting linked
above. For details on the changes, see the changes section below.
Kernel branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/log/?h=io_uring-bpf-restrictions.3
and a liburing branch with support helpers, man page, and a fairly
substantial test case can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/liburing.git/log/?h=bpf-restrictions
Feedback welcome!
Changes since v5:
- Require no_new_privs for setting or appending filters, unless the
user is also CAP_SYS_ADMIN (Aleksa)
- Add support for filtering of IORING_OP_OPENAT/OPENAT2, in terms of
being able to deny certain resolve or creation flags.
- Change layout of io_uring_bpf_ctx slightly, for easier/faster clearing
of unused members.
- Expand liburing test cases to cover both the no_new_privs situation,
and testing the OPENAT/OPENAT2 filters.
include/linux/io_uring.h | 14 +-
include/linux/io_uring_types.h | 13 +
include/linux/sched.h | 1 +
include/uapi/linux/io_uring.h | 10 +
include/uapi/linux/io_uring/bpf_filter.h | 62 ++++
io_uring/Kconfig | 5 +
io_uring/Makefile | 1 +
io_uring/bpf_filter.c | 436 +++++++++++++++++++++++
io_uring/bpf_filter.h | 48 +++
io_uring/io_uring.c | 48 +++
io_uring/io_uring.h | 1 +
io_uring/net.c | 9 +
io_uring/net.h | 6 +
io_uring/openclose.c | 9 +
io_uring/openclose.h | 3 +
io_uring/register.c | 91 +++++
io_uring/tctx.c | 42 ++-
kernel/fork.c | 5 +
18 files changed, 794 insertions(+), 10 deletions(-)
--
Jens Axboe
* [PATCH 1/7] io_uring: add support for BPF filtering for opcode restrictions
2026-01-19 23:54 [PATCHSET v6] Inherited restrictions and BPF filtering for io_uring Jens Axboe
@ 2026-01-19 23:54 ` Jens Axboe
2026-01-27 10:06 ` Christian Brauner
2026-01-19 23:54 ` [PATCH 2/7] io_uring/net: allow filtering on IORING_OP_SOCKET data Jens Axboe
` (6 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2026-01-19 23:54 UTC (permalink / raw)
To: io-uring; +Cc: brauner, jannh, kees, linux-kernel, Jens Axboe
Add support for loading classic BPF programs with io_uring to provide
fine-grained filtering of SQE operations. Unlike
IORING_REGISTER_RESTRICTIONS which only allows bitmap-based allow/deny
of opcodes, BPF filters can inspect request attributes and make dynamic
decisions.
The filter is registered via IORING_REGISTER_BPF_FILTER with a struct
io_uring_bpf:
struct io_uring_bpf_filter {
__u32 opcode; /* io_uring opcode to filter */
__u32 flags;
__u32 filter_len; /* number of BPF instructions */
__u32 resv;
__u64 filter_ptr; /* pointer to BPF filter */
__u64 resv2[5];
};
enum {
IO_URING_BPF_CMD_FILTER = 1,
};
struct io_uring_bpf {
__u16 cmd_type; /* IO_URING_BPF_* values */
__u16 cmd_flags; /* none so far */
__u32 resv;
union {
struct io_uring_bpf_filter filter;
};
};
and the filters get supplied a struct io_uring_bpf_ctx:
struct io_uring_bpf_ctx {
__u64 user_data;
__u8 opcode;
__u8 sqe_flags;
__u8 pad[6];
__u64 resv[6];
};
where it's possible to filter on opcode and sqe_flags, with the resv[]
space being set aside for finer-grained filtering specific to an opcode.
An example of that for sockets is in one of the following patches.
Anything the opcode supports can end up in this struct, populated by
the opcode itself, and hence can be filtered on.
Filters have the following semantics:
- Return 1 to allow the request
- Return 0 to deny the request with -EACCES
- Multiple filters can be stacked per opcode. All filters must
return 1 for the opcode to be allowed.
- Filters are evaluated in reverse registration order (most recent first)
The implementation uses classic BPF (cBPF) rather than eBPF, as that's
required for containers, and since filters can be used by any user in
the system.
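As a rough userspace sketch (not part of this patch), a filter that
denies any IORING_OP_READV carrying IOSQE_ASYNC could be built and
registered as below. It assumes a little-endian machine, where opcode
sits in the low byte of the 32-bit word at offset 8 of struct
io_uring_bpf_ctx and sqe_flags in the byte above it; declarations and
error handling are elided:

	/* deny any SQE with IOSQE_ASYNC set, allow everything else */
	struct sock_filter insns[] = {
		/* A = opcode | sqe_flags << 8 (little-endian load) */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 8),
		/* A = sqe_flags (pad bytes are cleared by the kernel) */
		BPF_STMT(BPF_ALU | BPF_RSH | BPF_K, 8),
		BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, IOSQE_ASYNC, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, 0),	/* deny */
		BPF_STMT(BPF_RET | BPF_K, 1),	/* allow */
	};
	struct io_uring_bpf reg = {
		.cmd_type = IO_URING_BPF_CMD_FILTER,
		.filter = {
			.opcode		= IORING_OP_READV,
			.filter_len	= sizeof(insns) / sizeof(insns[0]),
			.filter_ptr	= (unsigned long) insns,
		},
	};

	ret = syscall(__NR_io_uring_register, ring_fd,
		      IORING_REGISTER_BPF_FILTER, &reg, 1);

BPF_STMT/BPF_JUMP come from linux/filter.h, IOSQE_ASYNC and the
register opcode from linux/io_uring.h, and the uapi structs from the
new bpf_filter.h header.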
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/io_uring_types.h | 9 +
include/uapi/linux/io_uring.h | 3 +
include/uapi/linux/io_uring/bpf_filter.h | 50 ++++
io_uring/Kconfig | 5 +
io_uring/Makefile | 1 +
io_uring/bpf_filter.c | 329 +++++++++++++++++++++++
io_uring/bpf_filter.h | 42 +++
io_uring/io_uring.c | 8 +
io_uring/register.c | 8 +
9 files changed, 455 insertions(+)
create mode 100644 include/uapi/linux/io_uring/bpf_filter.h
create mode 100644 io_uring/bpf_filter.c
create mode 100644 io_uring/bpf_filter.h
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 211686ad89fd..37f0a5f7b2f4 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -219,9 +219,18 @@ struct io_rings {
struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp;
};
+struct io_bpf_filter;
+struct io_bpf_filters {
+ refcount_t refs; /* ref for ->bpf_filters */
+ spinlock_t lock; /* protects ->bpf_filters modifications */
+ struct io_bpf_filter __rcu **filters;
+ struct rcu_head rcu_head;
+};
+
struct io_restriction {
DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
+ struct io_bpf_filters *bpf_filters;
u8 sqe_flags_allowed;
u8 sqe_flags_required;
/* IORING_OP_* restrictions exist */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index b5b23c0d5283..94669b77fee8 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -700,6 +700,9 @@ enum io_uring_register_op {
/* auxiliary zcrx configuration, see enum zcrx_ctrl_op */
IORING_REGISTER_ZCRX_CTRL = 36,
+ /* register bpf filtering programs */
+ IORING_REGISTER_BPF_FILTER = 37,
+
/* this goes last */
IORING_REGISTER_LAST,
diff --git a/include/uapi/linux/io_uring/bpf_filter.h b/include/uapi/linux/io_uring/bpf_filter.h
new file mode 100644
index 000000000000..8334a40e0f06
--- /dev/null
+++ b/include/uapi/linux/io_uring/bpf_filter.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
+/*
+ * Header file for the io_uring BPF filters.
+ */
+#ifndef LINUX_IO_URING_BPF_FILTER_H
+#define LINUX_IO_URING_BPF_FILTER_H
+
+#include <linux/types.h>
+
+/*
+ * Struct passed to filters.
+ */
+struct io_uring_bpf_ctx {
+ __u64 user_data;
+ __u8 opcode;
+ __u8 sqe_flags;
+ __u8 pad[6];
+ __u64 resv[6];
+};
+
+enum {
+ /*
+ * If set, any currently unset opcode will have a deny filter attached
+ */
+ IO_URING_BPF_FILTER_DENY_REST = 1,
+};
+
+struct io_uring_bpf_filter {
+ __u32 opcode; /* io_uring opcode to filter */
+ __u32 flags;
+ __u32 filter_len; /* number of BPF instructions */
+ __u32 resv;
+ __u64 filter_ptr; /* pointer to BPF filter */
+ __u64 resv2[5];
+};
+
+enum {
+ IO_URING_BPF_CMD_FILTER = 1,
+};
+
+struct io_uring_bpf {
+ __u16 cmd_type; /* IO_URING_BPF_* values */
+ __u16 cmd_flags; /* none so far */
+ __u32 resv;
+ union {
+ struct io_uring_bpf_filter filter;
+ };
+};
+
+#endif
diff --git a/io_uring/Kconfig b/io_uring/Kconfig
index 4b949c42c0bf..a7ae23cf1035 100644
--- a/io_uring/Kconfig
+++ b/io_uring/Kconfig
@@ -9,3 +9,8 @@ config IO_URING_ZCRX
depends on PAGE_POOL
depends on INET
depends on NET_RX_BUSY_POLL
+
+config IO_URING_BPF
+ def_bool y
+ depends on BPF
+ depends on NET
diff --git a/io_uring/Makefile b/io_uring/Makefile
index bc4e4a3fa0a5..f3c505caa91e 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -22,3 +22,4 @@ obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
obj-$(CONFIG_NET) += net.o cmd_net.o
obj-$(CONFIG_PROC_FS) += fdinfo.o
obj-$(CONFIG_IO_URING_MOCK_FILE) += mock_file.o
+obj-$(CONFIG_IO_URING_BPF) += bpf_filter.o
diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
new file mode 100644
index 000000000000..08ca30545228
--- /dev/null
+++ b/io_uring/bpf_filter.c
@@ -0,0 +1,329 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF filter support for io_uring. Supports SQE opcodes for now.
+ */
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/io_uring.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+#include <uapi/linux/io_uring.h>
+
+#include "io_uring.h"
+#include "bpf_filter.h"
+#include "net.h"
+
+struct io_bpf_filter {
+ struct bpf_prog *prog;
+ struct io_bpf_filter *next;
+};
+
+/* Deny if this is set as the filter */
+static const struct io_bpf_filter dummy_filter;
+
+static void io_uring_populate_bpf_ctx(struct io_uring_bpf_ctx *bctx,
+ struct io_kiocb *req)
+{
+ bctx->opcode = req->opcode;
+ bctx->sqe_flags = (__force int) req->flags & SQE_VALID_FLAGS;
+ bctx->user_data = req->cqe.user_data;
+ /* clear residual */
+ memset(bctx->pad, 0, sizeof(bctx->pad) + sizeof(bctx->resv));
+}
+
+/*
+ * Run registered filters for a given opcode. For filters, a return of 0 denies
+ * execution of the request, a return of 1 allows it. If any filter for an
+ * opcode returns 0, filter processing is stopped, and the request is denied.
+ *
+ * __io_uring_run_bpf_filters() returns 0 on success, allowing the request
+ * to run, and -EACCES when a request is denied.
+ */
+int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
+{
+ struct io_bpf_filter *filter;
+ struct io_uring_bpf_ctx bpf_ctx;
+ int ret;
+
+ /* Fast check for existence of filters outside of RCU */
+ if (!rcu_access_pointer(res->bpf_filters->filters[req->opcode]))
+ return 0;
+
+ /*
+ * req->opcode has already been validated to be within the range
+ * of what we expect, io_init_req() does this.
+ */
+ rcu_read_lock();
+ filter = rcu_dereference(res->bpf_filters->filters[req->opcode]);
+ if (!filter) {
+ ret = 1;
+ goto out;
+ } else if (filter == &dummy_filter) {
+ ret = 0;
+ goto out;
+ }
+
+ io_uring_populate_bpf_ctx(&bpf_ctx, req);
+
+ /*
+ * Iterate registered filters. The opcode is allowed IFF all filters
+ * return 1. If any filter returns 0, the opcode will be denied.
+ */
+ do {
+ if (filter == &dummy_filter)
+ ret = 0;
+ else
+ ret = bpf_prog_run(filter->prog, &bpf_ctx);
+ if (!ret)
+ break;
+ filter = filter->next;
+ } while (filter);
+out:
+ rcu_read_unlock();
+ return ret ? 0 : -EACCES;
+}
+
+static void io_free_bpf_filters(struct rcu_head *head)
+{
+ struct io_bpf_filter __rcu **filter;
+ struct io_bpf_filters *filters;
+ int i;
+
+ filters = container_of(head, struct io_bpf_filters, rcu_head);
+ spin_lock(&filters->lock);
+ filter = filters->filters;
+ if (!filter) {
+ spin_unlock(&filters->lock);
+ return;
+ }
+ spin_unlock(&filters->lock);
+
+ for (i = 0; i < IORING_OP_LAST; i++) {
+ struct io_bpf_filter *f;
+
+ rcu_read_lock();
+ f = rcu_dereference(filter[i]);
+ while (f) {
+ struct io_bpf_filter *next = f->next;
+
+ /*
+ * Even if stacked, dummy filter will always be last
+ * as it can only get installed into an empty spot.
+ */
+ if (f == &dummy_filter)
+ break;
+ bpf_prog_destroy(f->prog);
+ kfree(f);
+ f = next;
+ }
+ rcu_read_unlock();
+ }
+ kfree(filters->filters);
+ kfree(filters);
+}
+
+static void __io_put_bpf_filters(struct io_bpf_filters *filters)
+{
+ if (refcount_dec_and_test(&filters->refs))
+ call_rcu(&filters->rcu_head, io_free_bpf_filters);
+}
+
+void io_put_bpf_filters(struct io_restriction *res)
+{
+ if (res->bpf_filters)
+ __io_put_bpf_filters(res->bpf_filters);
+}
+
+static struct io_bpf_filters *io_new_bpf_filters(void)
+{
+ struct io_bpf_filters *filters;
+
+ filters = kzalloc(sizeof(*filters), GFP_KERNEL_ACCOUNT);
+ if (!filters)
+ return ERR_PTR(-ENOMEM);
+
+ filters->filters = kcalloc(IORING_OP_LAST,
+ sizeof(struct io_bpf_filter *),
+ GFP_KERNEL_ACCOUNT);
+ if (!filters->filters) {
+ kfree(filters);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ refcount_set(&filters->refs, 1);
+ spin_lock_init(&filters->lock);
+ return filters;
+}
+
+/*
+ * Validate classic BPF filter instructions. Only allow a safe subset of
+ * operations - no packet data access, just context field loads and basic
+ * ALU/jump operations.
+ */
+static int io_uring_check_cbpf_filter(struct sock_filter *filter,
+ unsigned int flen)
+{
+ int pc;
+
+ for (pc = 0; pc < flen; pc++) {
+ struct sock_filter *ftest = &filter[pc];
+ u16 code = ftest->code;
+ u32 k = ftest->k;
+
+ switch (code) {
+ case BPF_LD | BPF_W | BPF_ABS:
+ ftest->code = BPF_LDX | BPF_W | BPF_ABS;
+ /* 32-bit aligned and not out of bounds. */
+ if (k >= sizeof(struct io_uring_bpf_ctx) || k & 3)
+ return -EINVAL;
+ continue;
+ case BPF_LD | BPF_W | BPF_LEN:
+ ftest->code = BPF_LD | BPF_IMM;
+ ftest->k = sizeof(struct io_uring_bpf_ctx);
+ continue;
+ case BPF_LDX | BPF_W | BPF_LEN:
+ ftest->code = BPF_LDX | BPF_IMM;
+ ftest->k = sizeof(struct io_uring_bpf_ctx);
+ continue;
+ /* Explicitly include allowed calls. */
+ case BPF_RET | BPF_K:
+ case BPF_RET | BPF_A:
+ case BPF_ALU | BPF_ADD | BPF_K:
+ case BPF_ALU | BPF_ADD | BPF_X:
+ case BPF_ALU | BPF_SUB | BPF_K:
+ case BPF_ALU | BPF_SUB | BPF_X:
+ case BPF_ALU | BPF_MUL | BPF_K:
+ case BPF_ALU | BPF_MUL | BPF_X:
+ case BPF_ALU | BPF_DIV | BPF_K:
+ case BPF_ALU | BPF_DIV | BPF_X:
+ case BPF_ALU | BPF_AND | BPF_K:
+ case BPF_ALU | BPF_AND | BPF_X:
+ case BPF_ALU | BPF_OR | BPF_K:
+ case BPF_ALU | BPF_OR | BPF_X:
+ case BPF_ALU | BPF_XOR | BPF_K:
+ case BPF_ALU | BPF_XOR | BPF_X:
+ case BPF_ALU | BPF_LSH | BPF_K:
+ case BPF_ALU | BPF_LSH | BPF_X:
+ case BPF_ALU | BPF_RSH | BPF_K:
+ case BPF_ALU | BPF_RSH | BPF_X:
+ case BPF_ALU | BPF_NEG:
+ case BPF_LD | BPF_IMM:
+ case BPF_LDX | BPF_IMM:
+ case BPF_MISC | BPF_TAX:
+ case BPF_MISC | BPF_TXA:
+ case BPF_LD | BPF_MEM:
+ case BPF_LDX | BPF_MEM:
+ case BPF_ST:
+ case BPF_STX:
+ case BPF_JMP | BPF_JA:
+ case BPF_JMP | BPF_JEQ | BPF_K:
+ case BPF_JMP | BPF_JEQ | BPF_X:
+ case BPF_JMP | BPF_JGE | BPF_K:
+ case BPF_JMP | BPF_JGE | BPF_X:
+ case BPF_JMP | BPF_JGT | BPF_K:
+ case BPF_JMP | BPF_JGT | BPF_X:
+ case BPF_JMP | BPF_JSET | BPF_K:
+ case BPF_JMP | BPF_JSET | BPF_X:
+ continue;
+ default:
+ return -EINVAL;
+ }
+ }
+ return 0;
+}
+
+#define IO_URING_BPF_FILTER_FLAGS IO_URING_BPF_FILTER_DENY_REST
+
+int io_register_bpf_filter(struct io_restriction *res,
+ struct io_uring_bpf __user *arg)
+{
+ struct io_bpf_filter *filter, *old_filter;
+ struct io_bpf_filters *filters;
+ struct io_uring_bpf reg;
+ struct bpf_prog *prog;
+ struct sock_fprog fprog;
+ int ret;
+
+ if (copy_from_user(&reg, arg, sizeof(reg)))
+ return -EFAULT;
+ if (reg.cmd_type != IO_URING_BPF_CMD_FILTER)
+ return -EINVAL;
+ if (reg.cmd_flags || reg.resv)
+ return -EINVAL;
+
+ if (reg.filter.opcode >= IORING_OP_LAST)
+ return -EINVAL;
+ if (reg.filter.flags & ~IO_URING_BPF_FILTER_FLAGS)
+ return -EINVAL;
+ if (reg.filter.resv)
+ return -EINVAL;
+ if (!mem_is_zero(reg.filter.resv2, sizeof(reg.filter.resv2)))
+ return -EINVAL;
+ if (!reg.filter.filter_len || reg.filter.filter_len > BPF_MAXINSNS)
+ return -EINVAL;
+
+ fprog.len = reg.filter.filter_len;
+ fprog.filter = u64_to_user_ptr(reg.filter.filter_ptr);
+
+ ret = bpf_prog_create_from_user(&prog, &fprog,
+ io_uring_check_cbpf_filter, false);
+ if (ret)
+ return ret;
+
+ /*
+ * No existing filters, allocate set.
+ */
+ filters = res->bpf_filters;
+ if (!filters) {
+ filters = io_new_bpf_filters();
+ if (IS_ERR(filters)) {
+ ret = PTR_ERR(filters);
+ goto err_prog;
+ }
+ }
+
+ filter = kzalloc(sizeof(*filter), GFP_KERNEL_ACCOUNT);
+ if (!filter) {
+ ret = -ENOMEM;
+ goto err;
+ }
+ filter->prog = prog;
+ res->bpf_filters = filters;
+
+ /*
+ * Insert filter - if the current opcode already has a filter
+ * attached, add to the set.
+ */
+ rcu_read_lock();
+ spin_lock_bh(&filters->lock);
+ old_filter = rcu_dereference(filters->filters[reg.filter.opcode]);
+ if (old_filter)
+ filter->next = old_filter;
+ rcu_assign_pointer(filters->filters[reg.filter.opcode], filter);
+
+ /*
+ * If IO_URING_BPF_FILTER_DENY_REST is set, fill any unregistered
+ * opcode with the dummy filter. That will cause them to be denied.
+ */
+ if (reg.filter.flags & IO_URING_BPF_FILTER_DENY_REST) {
+ for (int i = 0; i < IORING_OP_LAST; i++) {
+ if (i == reg.filter.opcode)
+ continue;
+ old_filter = rcu_dereference(filters->filters[i]);
+ if (old_filter)
+ continue;
+ rcu_assign_pointer(filters->filters[i], &dummy_filter);
+ }
+ }
+
+ spin_unlock_bh(&filters->lock);
+ rcu_read_unlock();
+ return 0;
+err:
+ if (filters != res->bpf_filters)
+ __io_put_bpf_filters(filters);
+err_prog:
+ bpf_prog_destroy(prog);
+ return ret;
+}
diff --git a/io_uring/bpf_filter.h b/io_uring/bpf_filter.h
new file mode 100644
index 000000000000..27eae9705473
--- /dev/null
+++ b/io_uring/bpf_filter.h
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IO_URING_BPF_FILTER_H
+#define IO_URING_BPF_FILTER_H
+
+#include <uapi/linux/io_uring/bpf_filter.h>
+
+#ifdef CONFIG_IO_URING_BPF
+
+int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req);
+
+int io_register_bpf_filter(struct io_restriction *res,
+ struct io_uring_bpf __user *arg);
+
+void io_put_bpf_filters(struct io_restriction *res);
+
+static inline int io_uring_run_bpf_filters(struct io_restriction *res,
+ struct io_kiocb *req)
+{
+ if (res->bpf_filters)
+ return __io_uring_run_bpf_filters(res, req);
+
+ return 0;
+}
+
+#else
+
+static inline int io_register_bpf_filter(struct io_restriction *res,
+ struct io_uring_bpf __user *arg)
+{
+ return -EINVAL;
+}
+static inline int io_uring_run_bpf_filters(struct io_restriction *res,
+ struct io_kiocb *req)
+{
+ return 0;
+}
+static inline void io_put_bpf_filters(struct io_restriction *res)
+{
+}
+#endif /* CONFIG_IO_URING_BPF */
+
+#endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 2cde22af78a3..67533e494836 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -93,6 +93,7 @@
#include "rw.h"
#include "alloc_cache.h"
#include "eventfd.h"
+#include "bpf_filter.h"
#define SQE_COMMON_FLAGS (IOSQE_FIXED_FILE | IOSQE_IO_LINK | \
IOSQE_IO_HARDLINK | IOSQE_ASYNC)
@@ -2261,6 +2262,12 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
if (unlikely(ret))
return io_submit_fail_init(sqe, req, ret);
+ if (unlikely(ctx->restrictions.bpf_filters)) {
+ ret = io_uring_run_bpf_filters(&ctx->restrictions, req);
+ if (ret)
+ return io_submit_fail_init(sqe, req, ret);
+ }
+
trace_io_uring_submit_req(req);
/*
@@ -2850,6 +2857,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
percpu_ref_exit(&ctx->refs);
free_uid(ctx->user);
io_req_caches_free(ctx);
+ io_put_bpf_filters(&ctx->restrictions);
WARN_ON_ONCE(ctx->nr_req_allocated);
diff --git a/io_uring/register.c b/io_uring/register.c
index 8551f13920dc..30957c2cb5eb 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -33,6 +33,7 @@
#include "memmap.h"
#include "zcrx.h"
#include "query.h"
+#include "bpf_filter.h"
#define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
IORING_REGISTER_LAST + IORING_OP_LAST)
@@ -830,6 +831,13 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
case IORING_REGISTER_ZCRX_CTRL:
ret = io_zcrx_ctrl(ctx, arg, nr_args);
break;
+ case IORING_REGISTER_BPF_FILTER:
+ ret = -EINVAL;
+
+ if (nr_args != 1)
+ break;
+ ret = io_register_bpf_filter(&ctx->restrictions, arg);
+ break;
default:
ret = -EINVAL;
break;
--
2.51.0
* [PATCH 2/7] io_uring/net: allow filtering on IORING_OP_SOCKET data
2026-01-19 23:54 [PATCHSET v6] Inherited restrictions and BPF filtering for io_uring Jens Axboe
2026-01-19 23:54 ` [PATCH 1/7] io_uring: add support for BPF filtering for opcode restrictions Jens Axboe
@ 2026-01-19 23:54 ` Jens Axboe
2026-01-19 23:54 ` [PATCH 3/7] io_uring/bpf_filter: allow filtering on contents of struct open_how Jens Axboe
` (5 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2026-01-19 23:54 UTC (permalink / raw)
To: io-uring; +Cc: brauner, jannh, kees, linux-kernel, Jens Axboe
Example population method for the BPF-based opcode filtering. This
exposes the socket family, type, and protocol to a registered BPF
filter. This in turn enables the filter to make decisions based on
what was passed in to the IORING_OP_SOCKET request type.
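As an illustrative sketch (not part of the patch), a cBPF program
restricting IORING_OP_SOCKET to AF_UNIX would load the family field,
which with the layout below lands 32-bit aligned at offset 16 of
struct io_uring_bpf_ctx:

	/* allow only AF_UNIX sockets, deny every other family */
	struct sock_filter insns[] = {
		/* A = io_uring_bpf_ctx.socket.family (offset 16) */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 16),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, 1),	/* allow */
		BPF_STMT(BPF_RET | BPF_K, 0),	/* deny */
	};

registered with .opcode = IORING_OP_SOCKET as in the previous patch.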
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/uapi/linux/io_uring/bpf_filter.h | 9 ++++++++-
io_uring/bpf_filter.c | 10 ++++++++++
io_uring/net.c | 9 +++++++++
io_uring/net.h | 6 ++++++
4 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/io_uring/bpf_filter.h b/include/uapi/linux/io_uring/bpf_filter.h
index 8334a40e0f06..ad6961be5efa 100644
--- a/include/uapi/linux/io_uring/bpf_filter.h
+++ b/include/uapi/linux/io_uring/bpf_filter.h
@@ -15,7 +15,14 @@ struct io_uring_bpf_ctx {
__u8 opcode;
__u8 sqe_flags;
__u8 pad[6];
- __u64 resv[6];
+ union {
+ __u64 resv[6];
+ struct {
+ __u32 family;
+ __u32 type;
+ __u32 protocol;
+ } socket;
+ };
};
enum {
diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
index 08ca30545228..8934c0586842 100644
--- a/io_uring/bpf_filter.c
+++ b/io_uring/bpf_filter.c
@@ -29,6 +29,16 @@ static void io_uring_populate_bpf_ctx(struct io_uring_bpf_ctx *bctx,
bctx->user_data = req->cqe.user_data;
/* clear residual */
memset(bctx->pad, 0, sizeof(bctx->pad) + sizeof(bctx->resv));
+
+ /*
+ * Opcodes can provide a handler for populating more data into bctx,
+ * for filters to use.
+ */
+ switch (req->opcode) {
+ case IORING_OP_SOCKET:
+ io_socket_bpf_populate(bctx, req);
+ break;
+ }
}
/*
diff --git a/io_uring/net.c b/io_uring/net.c
index 519ea055b761..4fcba36bd0bb 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1699,6 +1699,15 @@ int io_accept(struct io_kiocb *req, unsigned int issue_flags)
return IOU_COMPLETE;
}
+void io_socket_bpf_populate(struct io_uring_bpf_ctx *bctx, struct io_kiocb *req)
+{
+ struct io_socket *sock = io_kiocb_to_cmd(req, struct io_socket);
+
+ bctx->socket.family = sock->domain;
+ bctx->socket.type = sock->type;
+ bctx->socket.protocol = sock->protocol;
+}
+
int io_socket_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
struct io_socket *sock = io_kiocb_to_cmd(req, struct io_socket);
diff --git a/io_uring/net.h b/io_uring/net.h
index 43e5ce5416b7..a862960a3bb9 100644
--- a/io_uring/net.h
+++ b/io_uring/net.h
@@ -3,6 +3,7 @@
#include <linux/net.h>
#include <linux/uio.h>
#include <linux/io_uring_types.h>
+#include <uapi/linux/io_uring/bpf_filter.h>
struct io_async_msghdr {
#if defined(CONFIG_NET)
@@ -44,6 +45,7 @@ int io_accept(struct io_kiocb *req, unsigned int issue_flags);
int io_socket_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_socket(struct io_kiocb *req, unsigned int issue_flags);
+void io_socket_bpf_populate(struct io_uring_bpf_ctx *bctx, struct io_kiocb *req);
int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_connect(struct io_kiocb *req, unsigned int issue_flags);
@@ -64,4 +66,8 @@ void io_netmsg_cache_free(const void *entry);
static inline void io_netmsg_cache_free(const void *entry)
{
}
+static inline void io_socket_bpf_populate(struct io_uring_bpf_ctx *bctx,
+ struct io_kiocb *req)
+{
+}
#endif
--
2.51.0
* [PATCH 3/7] io_uring/bpf_filter: allow filtering on contents of struct open_how
2026-01-19 23:54 [PATCHSET v6] Inherited restrictions and BPF filtering for io_uring Jens Axboe
2026-01-19 23:54 ` [PATCH 1/7] io_uring: add support for BPF filtering for opcode restrictions Jens Axboe
2026-01-19 23:54 ` [PATCH 2/7] io_uring/net: allow filtering on IORING_OP_SOCKET data Jens Axboe
@ 2026-01-19 23:54 ` Jens Axboe
2026-01-27 9:33 ` Christian Brauner
2026-01-19 23:54 ` [PATCH 4/7] io_uring/bpf_filter: cache lookup table in ctx->bpf_filters Jens Axboe
` (4 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2026-01-19 23:54 UTC (permalink / raw)
To: io-uring; +Cc: brauner, jannh, kees, linux-kernel, Jens Axboe
This adds custom filtering for IORING_OP_OPENAT and IORING_OP_OPENAT2,
where the open_how flags, mode, and resolve can be checked by filters.
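As a hedged sketch (not part of the patch), a filter denying file
creation through these opcodes could test O_CREAT in the low 32 bits
of open.flags, which sit at offset 16 of the context on a little-endian
machine. Since filters are per-opcode, the same program would be
registered once for IORING_OP_OPENAT and once for IORING_OP_OPENAT2:

	/* deny opens that may create a file, allow the rest */
	struct sock_filter insns[] = {
		/* A = low 32 bits of open.flags (little-endian) */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 16),
		BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, O_CREAT, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, 0),	/* deny */
		BPF_STMT(BPF_RET | BPF_K, 1),	/* allow */
	};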
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/uapi/linux/io_uring/bpf_filter.h | 5 +++++
io_uring/bpf_filter.c | 5 +++++
io_uring/openclose.c | 9 +++++++++
io_uring/openclose.h | 3 +++
4 files changed, 22 insertions(+)
diff --git a/include/uapi/linux/io_uring/bpf_filter.h b/include/uapi/linux/io_uring/bpf_filter.h
index ad6961be5efa..7f468628c491 100644
--- a/include/uapi/linux/io_uring/bpf_filter.h
+++ b/include/uapi/linux/io_uring/bpf_filter.h
@@ -22,6 +22,11 @@ struct io_uring_bpf_ctx {
__u32 type;
__u32 protocol;
} socket;
+ struct {
+ __u64 flags;
+ __u64 mode;
+ __u64 resolve;
+ } open;
};
};
diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
index 8934c0586842..3352f53fd2b9 100644
--- a/io_uring/bpf_filter.c
+++ b/io_uring/bpf_filter.c
@@ -12,6 +12,7 @@
#include "io_uring.h"
#include "bpf_filter.h"
#include "net.h"
+#include "openclose.h"
struct io_bpf_filter {
struct bpf_prog *prog;
@@ -38,6 +39,10 @@ static void io_uring_populate_bpf_ctx(struct io_uring_bpf_ctx *bctx,
case IORING_OP_SOCKET:
io_socket_bpf_populate(bctx, req);
break;
+ case IORING_OP_OPENAT:
+ case IORING_OP_OPENAT2:
+ io_openat_bpf_populate(bctx, req);
+ break;
}
}
diff --git a/io_uring/openclose.c b/io_uring/openclose.c
index 15dde9bd6ff6..31c687adf873 100644
--- a/io_uring/openclose.c
+++ b/io_uring/openclose.c
@@ -85,6 +85,15 @@ static int __io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
return 0;
}
+void io_openat_bpf_populate(struct io_uring_bpf_ctx *bctx, struct io_kiocb *req)
+{
+ struct io_open *open = io_kiocb_to_cmd(req, struct io_open);
+
+ bctx->open.flags = open->how.flags;
+ bctx->open.mode = open->how.mode;
+ bctx->open.resolve = open->how.resolve;
+}
+
int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
struct io_open *open = io_kiocb_to_cmd(req, struct io_open);
diff --git a/io_uring/openclose.h b/io_uring/openclose.h
index 4ca2a9935abc..566739920658 100644
--- a/io_uring/openclose.h
+++ b/io_uring/openclose.h
@@ -1,11 +1,14 @@
// SPDX-License-Identifier: GPL-2.0
+#include "bpf_filter.h"
+
int __io_close_fixed(struct io_ring_ctx *ctx, unsigned int issue_flags,
unsigned int offset);
int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_openat(struct io_kiocb *req, unsigned int issue_flags);
void io_open_cleanup(struct io_kiocb *req);
+void io_openat_bpf_populate(struct io_uring_bpf_ctx *bctx, struct io_kiocb *req);
int io_openat2_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_openat2(struct io_kiocb *req, unsigned int issue_flags);
--
2.51.0
* [PATCH 4/7] io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
2026-01-19 23:54 [PATCHSET v6] Inherited restrictions and BPF filtering for io_uring Jens Axboe
` (2 preceding siblings ...)
2026-01-19 23:54 ` [PATCH 3/7] io_uring/bpf_filter: allow filtering on contents of struct open_how Jens Axboe
@ 2026-01-19 23:54 ` Jens Axboe
2026-01-27 9:33 ` Christian Brauner
2026-01-19 23:54 ` [PATCH 5/7] io_uring/bpf_filter: add ref counts to struct io_bpf_filter Jens Axboe
` (3 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2026-01-19 23:54 UTC (permalink / raw)
To: io-uring; +Cc: brauner, jannh, kees, linux-kernel, Jens Axboe
Currently a few pointer dereferences need to be made to both check if
BPF filters are installed, and then also to retrieve the actual filter
for the opcode. Cache the table in ctx->bpf_filters to avoid that.
Add a bit of debug info on ring exit to show if we ever got this wrong.
Small risk of that given that the table is currently only updated in one
spot, but once task forking is enabled, that will add one more spot.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/io_uring_types.h | 2 ++
io_uring/bpf_filter.c | 7 ++++---
io_uring/bpf_filter.h | 10 +++++-----
io_uring/io_uring.c | 11 +++++++++--
io_uring/register.c | 3 +++
5 files changed, 23 insertions(+), 10 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 37f0a5f7b2f4..366927635277 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -287,6 +287,8 @@ struct io_ring_ctx {
struct task_struct *submitter_task;
struct io_rings *rings;
+ /* cache of ->restrictions.bpf_filters->filters */
+ struct io_bpf_filter __rcu **bpf_filters;
struct percpu_ref refs;
clockid_t clockid;
diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
index 3352f53fd2b9..06fad04c4b54 100644
--- a/io_uring/bpf_filter.c
+++ b/io_uring/bpf_filter.c
@@ -55,14 +55,15 @@ static void io_uring_populate_bpf_ctx(struct io_uring_bpf_ctx *bctx,
* __io_uring_run_bpf_filters() returns 0 on success, allow running the
* request, and -EACCES when a request is denied.
*/
-int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
+int __io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
+ struct io_kiocb *req)
{
struct io_bpf_filter *filter;
struct io_uring_bpf_ctx bpf_ctx;
int ret;
/* Fast check for existence of filters outside of RCU */
- if (!rcu_access_pointer(res->bpf_filters->filters[req->opcode]))
+ if (!rcu_access_pointer(filters[req->opcode]))
return 0;
/*
@@ -70,7 +71,7 @@ int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
* of what we expect, io_init_req() does this.
*/
rcu_read_lock();
- filter = rcu_dereference(res->bpf_filters->filters[req->opcode]);
+ filter = rcu_dereference(filters[req->opcode]);
if (!filter) {
ret = 1;
goto out;
diff --git a/io_uring/bpf_filter.h b/io_uring/bpf_filter.h
index 27eae9705473..9f3cdb92eb16 100644
--- a/io_uring/bpf_filter.h
+++ b/io_uring/bpf_filter.h
@@ -6,18 +6,18 @@
#ifdef CONFIG_IO_URING_BPF
-int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req);
+int __io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters, struct io_kiocb *req);
int io_register_bpf_filter(struct io_restriction *res,
struct io_uring_bpf __user *arg);
void io_put_bpf_filters(struct io_restriction *res);
-static inline int io_uring_run_bpf_filters(struct io_restriction *res,
+static inline int io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
struct io_kiocb *req)
{
- if (res->bpf_filters)
- return __io_uring_run_bpf_filters(res, req);
+ if (filters)
+ return __io_uring_run_bpf_filters(filters, req);
return 0;
}
@@ -29,7 +29,7 @@ static inline int io_register_bpf_filter(struct io_restriction *res,
{
return -EINVAL;
}
-static inline int io_uring_run_bpf_filters(struct io_restriction *res,
+static inline int io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
struct io_kiocb *req)
{
return 0;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 67533e494836..62aeaf0fad74 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2262,8 +2262,8 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
if (unlikely(ret))
return io_submit_fail_init(sqe, req, ret);
- if (unlikely(ctx->restrictions.bpf_filters)) {
- ret = io_uring_run_bpf_filters(&ctx->restrictions, req);
+ if (unlikely(ctx->bpf_filters)) {
+ ret = io_uring_run_bpf_filters(ctx->bpf_filters, req);
if (ret)
return io_submit_fail_init(sqe, req, ret);
}
@@ -2857,6 +2857,13 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
percpu_ref_exit(&ctx->refs);
free_uid(ctx->user);
io_req_caches_free(ctx);
+
+ if (ctx->restrictions.bpf_filters) {
+ WARN_ON_ONCE(ctx->bpf_filters !=
+ ctx->restrictions.bpf_filters->filters);
+ } else {
+ WARN_ON_ONCE(ctx->bpf_filters);
+ }
io_put_bpf_filters(&ctx->restrictions);
WARN_ON_ONCE(ctx->nr_req_allocated);
diff --git a/io_uring/register.c b/io_uring/register.c
index 30957c2cb5eb..40de9b8924b9 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -837,6 +837,9 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
if (nr_args != 1)
break;
ret = io_register_bpf_filter(&ctx->restrictions, arg);
+ if (!ret)
+ WRITE_ONCE(ctx->bpf_filters,
+ ctx->restrictions.bpf_filters->filters);
break;
default:
ret = -EINVAL;
--
2.51.0
* [PATCH 5/7] io_uring/bpf_filter: add ref counts to struct io_bpf_filter
2026-01-19 23:54 [PATCHSET v6] Inherited restrictions and BPF filtering for io_uring Jens Axboe
` (3 preceding siblings ...)
2026-01-19 23:54 ` [PATCH 4/7] io_uring/bpf_filter: cache lookup table in ctx->bpf_filters Jens Axboe
@ 2026-01-19 23:54 ` Jens Axboe
2026-01-27 9:34 ` Christian Brauner
2026-01-19 23:54 ` [PATCH 6/7] io_uring: add task fork hook Jens Axboe
` (2 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2026-01-19 23:54 UTC (permalink / raw)
To: io-uring; +Cc: brauner, jannh, kees, linux-kernel, Jens Axboe
In preparation for allowing inheritance of BPF filters and filter
tables, add a reference count to the filter. This allows multiple tables
to safely include the same filter.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
io_uring/bpf_filter.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
index 06fad04c4b54..fc9eaf29fcbf 100644
--- a/io_uring/bpf_filter.c
+++ b/io_uring/bpf_filter.c
@@ -15,6 +15,7 @@
#include "openclose.h"
struct io_bpf_filter {
+ refcount_t refs;
struct bpf_prog *prog;
struct io_bpf_filter *next;
};
@@ -129,6 +130,11 @@ static void io_free_bpf_filters(struct rcu_head *head)
*/
if (f == &dummy_filter)
break;
+
+ /* Someone still holds a ref, stop iterating. */
+ if (!refcount_dec_and_test(&f->refs))
+ break;
+
bpf_prog_destroy(f->prog);
kfree(f);
f = next;
@@ -304,6 +310,7 @@ int io_register_bpf_filter(struct io_restriction *res,
ret = -ENOMEM;
goto err;
}
+ refcount_set(&filter->refs, 1);
filter->prog = prog;
res->bpf_filters = filters;
--
2.51.0
* [PATCH 6/7] io_uring: add task fork hook
2026-01-19 23:54 [PATCHSET v6] Inherited restrictions and BPF filtering for io_uring Jens Axboe
` (4 preceding siblings ...)
2026-01-19 23:54 ` [PATCH 5/7] io_uring/bpf_filter: add ref counts to struct io_bpf_filter Jens Axboe
@ 2026-01-19 23:54 ` Jens Axboe
2026-01-27 10:07 ` Christian Brauner
2026-01-19 23:54 ` [PATCH 7/7] io_uring: allow registration of per-task restrictions Jens Axboe
2026-01-22 3:37 ` [PATCHSET v6] Inherited restrictions and BPF filtering for io_uring Jens Axboe
7 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2026-01-19 23:54 UTC (permalink / raw)
To: io-uring; +Cc: brauner, jannh, kees, linux-kernel, Jens Axboe
Add a hook that is invoked when copy_process() copies state to a new
child. Right now this is just a stub, but it will be used shortly to
properly handle fork'ing of task-based io_uring restrictions.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/io_uring.h | 14 +++++++++++++-
include/linux/sched.h | 1 +
io_uring/tctx.c | 25 ++++++++++++++++---------
kernel/fork.c | 5 +++++
4 files changed, 35 insertions(+), 10 deletions(-)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 85fe4e6b275c..d1aa4edfc2a5 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -12,6 +12,7 @@ void __io_uring_free(struct task_struct *tsk);
void io_uring_unreg_ringfd(void);
const char *io_uring_get_opcode(u8 opcode);
bool io_is_uring_fops(struct file *file);
+int __io_uring_fork(struct task_struct *tsk);
static inline void io_uring_files_cancel(void)
{
@@ -25,9 +26,16 @@ static inline void io_uring_task_cancel(void)
}
static inline void io_uring_free(struct task_struct *tsk)
{
- if (tsk->io_uring)
+ if (tsk->io_uring || tsk->io_uring_restrict)
__io_uring_free(tsk);
}
+static inline int io_uring_fork(struct task_struct *tsk)
+{
+ if (tsk->io_uring_restrict)
+ return __io_uring_fork(tsk);
+
+ return 0;
+}
#else
static inline void io_uring_task_cancel(void)
{
@@ -46,6 +54,10 @@ static inline bool io_is_uring_fops(struct file *file)
{
return false;
}
+static inline int io_uring_fork(struct task_struct *tsk)
+{
+ return 0;
+}
#endif
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d395f2810fac..9abbd11bb87c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1190,6 +1190,7 @@ struct task_struct {
#ifdef CONFIG_IO_URING
struct io_uring_task *io_uring;
+ struct io_restriction *io_uring_restrict;
#endif
/* Namespaces: */
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 5b66755579c0..d4f7698805e4 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -54,16 +54,18 @@ void __io_uring_free(struct task_struct *tsk)
* node is stored in the xarray. Until that gets sorted out, attempt
* an iteration here and warn if any entries are found.
*/
- xa_for_each(&tctx->xa, index, node) {
- WARN_ON_ONCE(1);
- break;
- }
- WARN_ON_ONCE(tctx->io_wq);
- WARN_ON_ONCE(tctx->cached_refs);
+ if (tctx) {
+ xa_for_each(&tctx->xa, index, node) {
+ WARN_ON_ONCE(1);
+ break;
+ }
+ WARN_ON_ONCE(tctx->io_wq);
+ WARN_ON_ONCE(tctx->cached_refs);
- percpu_counter_destroy(&tctx->inflight);
- kfree(tctx);
- tsk->io_uring = NULL;
+ percpu_counter_destroy(&tctx->inflight);
+ kfree(tctx);
+ tsk->io_uring = NULL;
+ }
}
__cold int io_uring_alloc_task_context(struct task_struct *task,
@@ -351,3 +353,8 @@ int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg,
return i ? i : ret;
}
+
+int __io_uring_fork(struct task_struct *tsk)
+{
+ return 0;
+}
diff --git a/kernel/fork.c b/kernel/fork.c
index b1f3915d5f8e..08a2515380ec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -97,6 +97,7 @@
#include <linux/kasan.h>
#include <linux/scs.h>
#include <linux/io_uring.h>
+#include <linux/io_uring_types.h>
#include <linux/bpf.h>
#include <linux/stackprotector.h>
#include <linux/user_events.h>
@@ -2129,6 +2130,9 @@ __latent_entropy struct task_struct *copy_process(
#ifdef CONFIG_IO_URING
p->io_uring = NULL;
+ retval = io_uring_fork(p);
+ if (unlikely(retval))
+ goto bad_fork_cleanup_delayacct;
#endif
p->default_timer_slack_ns = current->timer_slack_ns;
@@ -2525,6 +2529,7 @@ __latent_entropy struct task_struct *copy_process(
mpol_put(p->mempolicy);
#endif
bad_fork_cleanup_delayacct:
+ io_uring_free(p);
delayacct_tsk_free(p);
bad_fork_cleanup_count:
dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
--
2.51.0
* [PATCH 7/7] io_uring: allow registration of per-task restrictions
2026-01-19 23:54 [PATCHSET v6] Inherited restrictions and BPF filtering for io_uring Jens Axboe
` (5 preceding siblings ...)
2026-01-19 23:54 ` [PATCH 6/7] io_uring: add task fork hook Jens Axboe
@ 2026-01-19 23:54 ` Jens Axboe
2026-01-22 3:37 ` [PATCHSET v6] Inherited restrictions and BPF filtering for io_uring Jens Axboe
7 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2026-01-19 23:54 UTC (permalink / raw)
To: io-uring; +Cc: brauner, jannh, kees, linux-kernel, Jens Axboe
Currently io_uring supports restricting operations on a per-ring basis.
To use those, the ring must be set up in a disabled state by setting
IORING_SETUP_R_DISABLED. Then restrictions can be set for the ring, and
the ring can then be enabled.
This commit adds support for IORING_REGISTER_RESTRICTIONS with ring_fd
== -1, like the other "blind" register opcodes which work on the task
rather than a specific ring. This allows registration of the same kind
of restrictions as can be done on a specific ring, but with the task
itself. Once done, any ring created will inherit these restrictions.
If a restriction filter is registered with a task, then it's inherited
on fork for its children. Children may only further restrict operations,
not extend them.
Inherited restrictions include both the classic
IORING_REGISTER_RESTRICTIONS based restrictions, as well as the BPF
filters that have been registered with the task via
IORING_REGISTER_BPF_FILTER.
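As a rough userspace sketch of the new blind registration (not part of
the patch; headers and error handling elided), a task could restrict
itself, and thereby all future rings and children, to IORING_OP_NOP
using the io_uring_task_restriction layout added below:

	size_t len = sizeof(struct io_uring_task_restriction) +
		     sizeof(struct io_uring_restriction);
	struct io_uring_task_restriction *tr = calloc(1, len);

	tr->nr_res = 1;
	tr->restrictions[0].opcode = IORING_RESTRICTION_SQE_OP;
	tr->restrictions[0].sqe_op = IORING_OP_NOP;

	/* as with seccomp, needs no_new_privs unless CAP_SYS_ADMIN */
	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

	/* ring_fd == -1 targets the task rather than a specific ring */
	ret = syscall(__NR_io_uring_register, -1,
		      IORING_REGISTER_RESTRICTIONS, tr, 1);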
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/io_uring_types.h | 2 +
include/uapi/linux/io_uring.h | 7 +++
io_uring/bpf_filter.c | 86 +++++++++++++++++++++++++++++++++-
io_uring/bpf_filter.h | 6 +++
io_uring/io_uring.c | 33 +++++++++++++
io_uring/io_uring.h | 1 +
io_uring/register.c | 80 +++++++++++++++++++++++++++++++
io_uring/tctx.c | 17 +++++++
8 files changed, 231 insertions(+), 1 deletion(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 366927635277..15ed7fa2bca3 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -231,6 +231,8 @@ struct io_restriction {
DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
struct io_bpf_filters *bpf_filters;
+ /* ->bpf_filters needs COW on modification */
+ bool bpf_filters_cow;
u8 sqe_flags_allowed;
u8 sqe_flags_required;
/* IORING_OP_* restrictions exist */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 94669b77fee8..aeeffcf27fee 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -808,6 +808,13 @@ struct io_uring_restriction {
__u32 resv2[3];
};
+struct io_uring_task_restriction {
+ __u16 flags;
+ __u16 nr_res;
+ __u32 resv[3];
+ __DECLARE_FLEX_ARRAY(struct io_uring_restriction, restrictions);
+};
+
struct io_uring_clock_register {
__u32 clockid;
__u32 __resv[3];
diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
index fc9eaf29fcbf..fb5394afc19f 100644
--- a/io_uring/bpf_filter.c
+++ b/io_uring/bpf_filter.c
@@ -255,13 +255,77 @@ static int io_uring_check_cbpf_filter(struct sock_filter *filter,
return 0;
}
+void io_bpf_filter_clone(struct io_restriction *dst, struct io_restriction *src)
+{
+ if (!src->bpf_filters)
+ return;
+
+ rcu_read_lock();
+ /*
+ * If the src filter is going away, just ignore it.
+ */
+ if (refcount_inc_not_zero(&src->bpf_filters->refs)) {
+ dst->bpf_filters = src->bpf_filters;
+ dst->bpf_filters_cow = true;
+ }
+ rcu_read_unlock();
+}
+
+/*
+ * Allocate a new struct io_bpf_filters. Used when a filter is cloned and
+ * modifications need to be made.
+ */
+static struct io_bpf_filters *io_bpf_filter_cow(struct io_restriction *src)
+{
+ struct io_bpf_filters *filters;
+ struct io_bpf_filter *srcf;
+ int i;
+
+ filters = io_new_bpf_filters();
+ if (IS_ERR(filters))
+ return filters;
+
+ /*
+ * Iterate filters from src and assign in destination. Grabbing
+ * a reference is enough, we don't need to duplicate the memory.
+ * This is safe because filters are only ever appended to the
+ * front of the list, hence the only memory ever touched inside
+ * a filter is the refcount.
+ */
+ rcu_read_lock();
+ for (i = 0; i < IORING_OP_LAST; i++) {
+ srcf = rcu_dereference(src->bpf_filters->filters[i]);
+ if (!srcf) {
+ continue;
+ } else if (srcf == &dummy_filter) {
+ rcu_assign_pointer(filters->filters[i], &dummy_filter);
+ continue;
+ }
+
+ /*
+ * Getting a ref on the first node is enough; when the filter is
+ * put, the freeing iteration stops at the first node whose ref
+ * doesn't drop to zero.
+ */
+ if (!refcount_inc_not_zero(&srcf->refs))
+ goto err;
+ rcu_assign_pointer(filters->filters[i], srcf);
+ }
+ rcu_read_unlock();
+ return filters;
+err:
+ rcu_read_unlock();
+ __io_put_bpf_filters(filters);
+ return ERR_PTR(-EBUSY);
+}
+
#define IO_URING_BPF_FILTER_FLAGS IO_URING_BPF_FILTER_DENY_REST
int io_register_bpf_filter(struct io_restriction *res,
struct io_uring_bpf __user *arg)
{
+ struct io_bpf_filters *filters, *old_filters = NULL;
struct io_bpf_filter *filter, *old_filter;
- struct io_bpf_filters *filters;
struct io_uring_bpf reg;
struct bpf_prog *prog;
struct sock_fprog fprog;
@@ -303,6 +367,17 @@ int io_register_bpf_filter(struct io_restriction *res,
ret = PTR_ERR(filters);
goto err_prog;
}
+ } else if (res->bpf_filters_cow) {
+ filters = io_bpf_filter_cow(res);
+ if (IS_ERR(filters)) {
+ ret = PTR_ERR(filters);
+ goto err_prog;
+ }
+ /*
+ * Stash old filters, we'll put them once we know we'll
+ * succeed. Until then, res->bpf_filters is left untouched.
+ */
+ old_filters = res->bpf_filters;
}
filter = kzalloc(sizeof(*filter), GFP_KERNEL_ACCOUNT);
@@ -312,6 +387,15 @@ int io_register_bpf_filter(struct io_restriction *res,
}
refcount_set(&filter->refs, 1);
filter->prog = prog;
+
+ /*
+ * Success - install the new filter set now. If we did COW, put
+ * the old filters as we're replacing them.
+ */
+ if (old_filters) {
+ __io_put_bpf_filters(old_filters);
+ res->bpf_filters_cow = false;
+ }
res->bpf_filters = filters;
/*
diff --git a/io_uring/bpf_filter.h b/io_uring/bpf_filter.h
index 9f3cdb92eb16..66a776cf25b4 100644
--- a/io_uring/bpf_filter.h
+++ b/io_uring/bpf_filter.h
@@ -13,6 +13,8 @@ int io_register_bpf_filter(struct io_restriction *res,
void io_put_bpf_filters(struct io_restriction *res);
+void io_bpf_filter_clone(struct io_restriction *dst, struct io_restriction *src);
+
static inline int io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
struct io_kiocb *req)
{
@@ -37,6 +39,10 @@ static inline int io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
static inline void io_put_bpf_filters(struct io_restriction *res)
{
}
+static inline void io_bpf_filter_clone(struct io_restriction *dst,
+ struct io_restriction *src)
+{
+}
#endif /* CONFIG_IO_URING_BPF */
#endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 62aeaf0fad74..e190827d2436 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3569,6 +3569,32 @@ int io_prepare_config(struct io_ctx_config *config)
return 0;
}
+void io_restriction_clone(struct io_restriction *dst, struct io_restriction *src)
+{
+ memcpy(&dst->register_op, &src->register_op, sizeof(dst->register_op));
+ memcpy(&dst->sqe_op, &src->sqe_op, sizeof(dst->sqe_op));
+ dst->sqe_flags_allowed = src->sqe_flags_allowed;
+ dst->sqe_flags_required = src->sqe_flags_required;
+ dst->op_registered = src->op_registered;
+ dst->reg_registered = src->reg_registered;
+
+ io_bpf_filter_clone(dst, src);
+}
+
+static void io_ctx_restriction_clone(struct io_ring_ctx *ctx,
+ struct io_restriction *src)
+{
+ struct io_restriction *dst = &ctx->restrictions;
+
+ io_restriction_clone(dst, src);
+ if (dst->bpf_filters)
+ WRITE_ONCE(ctx->bpf_filters, dst->bpf_filters->filters);
+ if (dst->op_registered)
+ ctx->op_restricted = 1;
+ if (dst->reg_registered)
+ ctx->reg_restricted = 1;
+}
+
static __cold int io_uring_create(struct io_ctx_config *config)
{
struct io_uring_params *p = &config->p;
@@ -3629,6 +3655,13 @@ static __cold int io_uring_create(struct io_ctx_config *config)
else
ctx->notify_method = TWA_SIGNAL;
+ /*
+ * If the current task has restrictions enabled, then copy them to
+ * our newly created ring and mark it as registered.
+ */
+ if (current->io_uring_restrict)
+ io_ctx_restriction_clone(ctx, current->io_uring_restrict);
+
/*
* This is just grabbed for accounting purposes. When a process exits,
* the mm is exited and dropped before the files, hence we need to hang
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index c5bbb43b5842..feb9f76761e9 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -195,6 +195,7 @@ void io_task_refs_refill(struct io_uring_task *tctx);
bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
void io_activate_pollwq(struct io_ring_ctx *ctx);
+void io_restriction_clone(struct io_restriction *dst, struct io_restriction *src);
static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
{
diff --git a/io_uring/register.c b/io_uring/register.c
index 40de9b8924b9..af4815bc11d6 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -190,6 +190,82 @@ static __cold int io_register_restrictions(struct io_ring_ctx *ctx,
return 0;
}
+static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
+{
+ struct io_uring_task_restriction __user *ures = arg;
+ struct io_uring_task_restriction tres;
+ struct io_restriction *res;
+ int ret;
+
+ /* Disallow if task already has registered restrictions */
+ if (current->io_uring_restrict)
+ return -EPERM;
+ /*
+ * Similar to seccomp, disallow setting restrictions unless
+ * task_no_new_privs is set or we have CAP_SYS_ADMIN.
+ */
+ if (!task_no_new_privs(current) &&
+ !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
+ return -EACCES;
+ if (nr_args != 1)
+ return -EINVAL;
+
+ if (copy_from_user(&tres, arg, sizeof(tres)))
+ return -EFAULT;
+
+ if (tres.flags)
+ return -EINVAL;
+ if (!mem_is_zero(tres.resv, sizeof(tres.resv)))
+ return -EINVAL;
+
+ res = kzalloc(sizeof(*res), GFP_KERNEL_ACCOUNT);
+ if (!res)
+ return -ENOMEM;
+
+ ret = io_parse_restrictions(ures->restrictions, tres.nr_res, res);
+ if (ret < 0) {
+ kfree(res);
+ return ret;
+ }
+ current->io_uring_restrict = res;
+ return 0;
+}
+
+static int io_register_bpf_filter_task(void __user *arg, unsigned int nr_args)
+{
+ struct io_restriction *res;
+ int ret;
+
+ /*
+ * Similar to seccomp, disallow setting a filter unless
+ * task_no_new_privs is set or we have CAP_SYS_ADMIN.
+ */
+ if (!task_no_new_privs(current) &&
+ !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
+ return -EACCES;
+
+ if (nr_args != 1)
+ return -EINVAL;
+
+ /* If no task restrictions exist, set up a new set */
+ res = current->io_uring_restrict;
+ if (!res) {
+ res = kzalloc(sizeof(*res), GFP_KERNEL_ACCOUNT);
+ if (!res)
+ return -ENOMEM;
+ }
+
+ ret = io_register_bpf_filter(res, arg);
+ if (ret) {
+ if (res != current->io_uring_restrict)
+ kfree(res);
+ return ret;
+ }
+ if (!current->io_uring_restrict)
+ current->io_uring_restrict = res;
+ return 0;
+}
+
static int io_register_enable_rings(struct io_ring_ctx *ctx)
{
if (!(ctx->flags & IORING_SETUP_R_DISABLED))
@@ -912,6 +988,10 @@ static int io_uring_register_blind(unsigned int opcode, void __user *arg,
return io_uring_register_send_msg_ring(arg, nr_args);
case IORING_REGISTER_QUERY:
return io_query(arg, nr_args);
+ case IORING_REGISTER_RESTRICTIONS:
+ return io_register_restrictions_task(arg, nr_args);
+ case IORING_REGISTER_BPF_FILTER:
+ return io_register_bpf_filter_task(arg, nr_args);
}
return -EINVAL;
}
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index d4f7698805e4..e3da31fdf16f 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -11,6 +11,7 @@
#include "io_uring.h"
#include "tctx.h"
+#include "bpf_filter.h"
static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
struct task_struct *task)
@@ -66,6 +67,11 @@ void __io_uring_free(struct task_struct *tsk)
kfree(tctx);
tsk->io_uring = NULL;
}
+ if (tsk->io_uring_restrict) {
+ io_put_bpf_filters(tsk->io_uring_restrict);
+ kfree(tsk->io_uring_restrict);
+ tsk->io_uring_restrict = NULL;
+ }
}
__cold int io_uring_alloc_task_context(struct task_struct *task,
@@ -356,5 +362,16 @@ int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg,
int __io_uring_fork(struct task_struct *tsk)
{
+ struct io_restriction *res, *src = tsk->io_uring_restrict;
+
+ /* Don't leave it dangling on error */
+ tsk->io_uring_restrict = NULL;
+
+ res = kzalloc(sizeof(*res), GFP_KERNEL_ACCOUNT);
+ if (!res)
+ return -ENOMEM;
+
+ tsk->io_uring_restrict = res;
+ io_restriction_clone(res, src);
return 0;
}
--
2.51.0
* Re: [PATCHSET v6] Inherited restrictions and BPF filtering for io_uring
2026-01-19 23:54 [PATCHSET v6] Inherited restrictions and BPF filtering for io_uring Jens Axboe
` (6 preceding siblings ...)
2026-01-19 23:54 ` [PATCH 7/7] io_uring: allow registration of per-task restrictions Jens Axboe
@ 2026-01-22 3:37 ` Jens Axboe
7 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2026-01-22 3:37 UTC (permalink / raw)
To: io-uring; +Cc: brauner, jannh, kees, linux-kernel
On 1/19/26 4:54 PM, Jens Axboe wrote:
> Hi,
>
> Followup to v5 here:
>
> https://lore.kernel.org/io-uring/20260118172328.1067592-1-axboe@kernel.dk/
>
> Mostly just addressing a bit of feedback; feature-wise this is all the
> same as before. For details on the patches, see the v5 posting linked
> above. For details on the changes, see the changes section below.
>
> Kernel branch can be found here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/log/?h=io_uring-bpf-restrictions.3
>
> and a liburing branch with support helpers, man page, and a fairly
> substantial test case can be found here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/liburing.git/log/?h=bpf-restrictions
>
> Feedback welcome!
>
> Changes since v5:
> - Require no_new_privs for setting or appending filters, unless the
> user is also CAP_SYS_ADMIN (Aleksa)
> - Add support for filtering of IORING_OP_OPENAT/OPENAT2, in terms of
> being able to deny certain resolve or creation flags.
> - Change layout of io_uring_bpf_ctx slightly, for easier/faster clearing
> of unused members.
> - Expand liburing test cases to cover both the no_new_privs situation,
> and testing the OPENAT/OPENAT2 filters.
>
> include/linux/io_uring.h | 14 +-
> include/linux/io_uring_types.h | 13 +
> include/linux/sched.h | 1 +
> include/uapi/linux/io_uring.h | 10 +
> include/uapi/linux/io_uring/bpf_filter.h | 62 ++++
> io_uring/Kconfig | 5 +
> io_uring/Makefile | 1 +
> io_uring/bpf_filter.c | 436 +++++++++++++++++++++++
> io_uring/bpf_filter.h | 48 +++
> io_uring/io_uring.c | 48 +++
> io_uring/io_uring.h | 1 +
> io_uring/net.c | 9 +
> io_uring/net.h | 6 +
> io_uring/openclose.c | 9 +
> io_uring/openclose.h | 3 +
> io_uring/register.c | 91 +++++
> io_uring/tctx.c | 42 ++-
> kernel/fork.c | 5 +
> 18 files changed, 794 insertions(+), 10 deletions(-)
Any comments on this one?
--
Jens Axboe
* Re: [PATCH 3/7] io_uring/bpf_filter: allow filtering on contents of struct open_how
2026-01-19 23:54 ` [PATCH 3/7] io_uring/bpf_filter: allow filtering on contents of struct open_how Jens Axboe
@ 2026-01-27 9:33 ` Christian Brauner
0 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2026-01-27 9:33 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, jannh, kees, linux-kernel
On Mon, Jan 19, 2026 at 04:54:26PM -0700, Jens Axboe wrote:
> This adds custom filtering for IORING_OP_OPENAT and IORING_OP_OPENAT2,
> where the open_how flags, mode, and resolve can be checked by filters.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> include/uapi/linux/io_uring/bpf_filter.h | 5 +++++
> io_uring/bpf_filter.c | 5 +++++
> io_uring/openclose.c | 9 +++++++++
> io_uring/openclose.h | 3 +++
> 4 files changed, 22 insertions(+)
>
> diff --git a/include/uapi/linux/io_uring/bpf_filter.h b/include/uapi/linux/io_uring/bpf_filter.h
> index ad6961be5efa..7f468628c491 100644
> --- a/include/uapi/linux/io_uring/bpf_filter.h
> +++ b/include/uapi/linux/io_uring/bpf_filter.h
> @@ -22,6 +22,11 @@ struct io_uring_bpf_ctx {
> __u32 type;
> __u32 protocol;
> } socket;
> + struct {
> + __u64 flags;
> + __u64 mode;
> + __u64 resolve;
> + } open;
So openat2()'s struct is extensible and there are plans to extend it to
include e.g., upgrade masks to restrict how a file descriptor can be
reopened. And in general there's the potential that this gets extended
with additional fields. So if it's workable I would add a size argument
in here to communicate to the bpf program what io_uring currently knows
about/is able to filter on. That should be fairly simple and doesn't
require you to change a whole lot?
* Re: [PATCH 4/7] io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
2026-01-19 23:54 ` [PATCH 4/7] io_uring/bpf_filter: cache lookup table in ctx->bpf_filters Jens Axboe
@ 2026-01-27 9:33 ` Christian Brauner
0 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2026-01-27 9:33 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, jannh, kees, linux-kernel
On Mon, Jan 19, 2026 at 04:54:27PM -0700, Jens Axboe wrote:
> Currently a few pointer dereferences need to be made to both check if
> BPF filters are installed, and then also to retrieve the actual filter
> for the opcode. Cache the table in ctx->bpf_filters to avoid that.
>
> Add a bit of debug info on ring exit to show if we ever got this wrong.
> Small risk of that given that the table is currently only updated in one
> spot, but once task forking is enabled, that will add one more spot.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
Reviewed-by: Christian Brauner <brauner@kernel.org>
* Re: [PATCH 5/7] io_uring/bpf_filter: add ref counts to struct io_bpf_filter
2026-01-19 23:54 ` [PATCH 5/7] io_uring/bpf_filter: add ref counts to struct io_bpf_filter Jens Axboe
@ 2026-01-27 9:34 ` Christian Brauner
0 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2026-01-27 9:34 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, jannh, kees, linux-kernel
On Mon, Jan 19, 2026 at 04:54:28PM -0700, Jens Axboe wrote:
> In preparation for allowing inheritance of BPF filters and filter
> tables, add a reference count to the filter. This allows multiple tables
> to safely include the same filter.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
Reviewed-by: Christian Brauner <brauner@kernel.org>
* Re: [PATCH 1/7] io_uring: add support for BPF filtering for opcode restrictions
2026-01-19 23:54 ` [PATCH 1/7] io_uring: add support for BPF filtering for opcode restrictions Jens Axboe
@ 2026-01-27 10:06 ` Christian Brauner
2026-01-27 16:41 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2026-01-27 10:06 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, jannh, kees, linux-kernel
On Mon, Jan 19, 2026 at 04:54:24PM -0700, Jens Axboe wrote:
> Add support for loading classic BPF programs with io_uring to provide
> fine-grained filtering of SQE operations. Unlike
> IORING_REGISTER_RESTRICTIONS which only allows bitmap-based allow/deny
> of opcodes, BPF filters can inspect request attributes and make dynamic
> decisions.
>
> The filter is registered via IORING_REGISTER_BPF_FILTER with a struct
> io_uring_bpf:
>
> struct io_uring_bpf_filter {
> __u32 opcode; /* io_uring opcode to filter */
> __u32 flags;
> __u32 filter_len; /* number of BPF instructions */
> __u32 resv;
> __u64 filter_ptr; /* pointer to BPF filter */
> __u64 resv2[5];
> };
>
> enum {
> IO_URING_BPF_CMD_FILTER = 1,
> };
>
> struct io_uring_bpf {
> __u16 cmd_type; /* IO_URING_BPF_* values */
> __u16 cmd_flags; /* none so far */
> __u32 resv;
> union {
> struct io_uring_bpf_filter filter;
> };
> };
>
> and the filters get supplied a struct io_uring_bpf_ctx:
>
> struct io_uring_bpf_ctx {
> __u64 user_data;
> __u8 opcode;
> __u8 sqe_flags;
> __u8 pad[6];
> __u64 resv[6];
> };
>
> where it's possible to filter on opcode and sqe_flags, with resv[6]
> being set aside for finer-grained filtering inside a specific opcode.
> An example of that for sockets is in one of the following patches.
> Anything the opcode supports can end up in this struct, populated by
> the opcode itself, and hence can be filtered on.
>
> Filters have the following semantics:
> - Return 1 to allow the request
> - Return 0 to deny the request with -EACCES
> - Multiple filters can be stacked per opcode. All filters must
> return 1 for the opcode to be allowed.
> - Filters are evaluated in registration order (most recent first)
>
> The implementation uses classic BPF (cBPF) rather than eBPF, as that's
> required for containers, and since these filters can be used by any
> user in the system.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> include/linux/io_uring_types.h | 9 +
> include/uapi/linux/io_uring.h | 3 +
> include/uapi/linux/io_uring/bpf_filter.h | 50 ++++
> io_uring/Kconfig | 5 +
> io_uring/Makefile | 1 +
> io_uring/bpf_filter.c | 329 +++++++++++++++++++++++
> io_uring/bpf_filter.h | 42 +++
> io_uring/io_uring.c | 8 +
> io_uring/register.c | 8 +
> 9 files changed, 455 insertions(+)
> create mode 100644 include/uapi/linux/io_uring/bpf_filter.h
> create mode 100644 io_uring/bpf_filter.c
> create mode 100644 io_uring/bpf_filter.h
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 211686ad89fd..37f0a5f7b2f4 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -219,9 +219,18 @@ struct io_rings {
> struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp;
> };
>
> +struct io_bpf_filter;
> +struct io_bpf_filters {
> + refcount_t refs; /* ref for ->bpf_filters */
> + spinlock_t lock; /* protects ->bpf_filters modifications */
> + struct io_bpf_filter __rcu **filters;
> + struct rcu_head rcu_head;
> +};
> +
> struct io_restriction {
> DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
> DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
> + struct io_bpf_filters *bpf_filters;
> u8 sqe_flags_allowed;
> u8 sqe_flags_required;
> /* IORING_OP_* restrictions exist */
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index b5b23c0d5283..94669b77fee8 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -700,6 +700,9 @@ enum io_uring_register_op {
> /* auxiliary zcrx configuration, see enum zcrx_ctrl_op */
> IORING_REGISTER_ZCRX_CTRL = 36,
>
> + /* register bpf filtering programs */
> + IORING_REGISTER_BPF_FILTER = 37,
> +
> /* this goes last */
> IORING_REGISTER_LAST,
>
> diff --git a/include/uapi/linux/io_uring/bpf_filter.h b/include/uapi/linux/io_uring/bpf_filter.h
> new file mode 100644
> index 000000000000..8334a40e0f06
> --- /dev/null
> +++ b/include/uapi/linux/io_uring/bpf_filter.h
> @@ -0,0 +1,50 @@
> +/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
> +/*
> + * Header file for the io_uring BPF filters.
> + */
> +#ifndef LINUX_IO_URING_BPF_FILTER_H
> +#define LINUX_IO_URING_BPF_FILTER_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * Struct passed to filters.
> + */
> +struct io_uring_bpf_ctx {
> + __u64 user_data;
> + __u8 opcode;
> + __u8 sqe_flags;
> + __u8 pad[6];
> + __u64 resv[6];
> +};
> +
> +enum {
> + /*
> + * If set, any currently unset opcode will have a deny filter attached
> + */
> + IO_URING_BPF_FILTER_DENY_REST = 1,
> +};
> +
> +struct io_uring_bpf_filter {
> + __u32 opcode; /* io_uring opcode to filter */
> + __u32 flags;
> + __u32 filter_len; /* number of BPF instructions */
> + __u32 resv;
> + __u64 filter_ptr; /* pointer to BPF filter */
> + __u64 resv2[5];
> +};
> +
> +enum {
> + IO_URING_BPF_CMD_FILTER = 1,
> +};
> +
> +struct io_uring_bpf {
> + __u16 cmd_type; /* IO_URING_BPF_* values */
> + __u16 cmd_flags; /* none so far */
> + __u32 resv;
> + union {
> + struct io_uring_bpf_filter filter;
> + };
> +};
> +
> +#endif
> diff --git a/io_uring/Kconfig b/io_uring/Kconfig
> index 4b949c42c0bf..a7ae23cf1035 100644
> --- a/io_uring/Kconfig
> +++ b/io_uring/Kconfig
> @@ -9,3 +9,8 @@ config IO_URING_ZCRX
> depends on PAGE_POOL
> depends on INET
> depends on NET_RX_BUSY_POLL
> +
> +config IO_URING_BPF
> + def_bool y
> + depends on BPF
> + depends on NET
> diff --git a/io_uring/Makefile b/io_uring/Makefile
> index bc4e4a3fa0a5..f3c505caa91e 100644
> --- a/io_uring/Makefile
> +++ b/io_uring/Makefile
> @@ -22,3 +22,4 @@ obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
> obj-$(CONFIG_NET) += net.o cmd_net.o
> obj-$(CONFIG_PROC_FS) += fdinfo.o
> obj-$(CONFIG_IO_URING_MOCK_FILE) += mock_file.o
> +obj-$(CONFIG_IO_URING_BPF) += bpf_filter.o
> diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
> new file mode 100644
> index 000000000000..08ca30545228
> --- /dev/null
> +++ b/io_uring/bpf_filter.c
> @@ -0,0 +1,329 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * BPF filter support for io_uring. Supports SQE opcodes for now.
> + */
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/io_uring.h>
> +#include <linux/filter.h>
> +#include <linux/bpf.h>
> +#include <uapi/linux/io_uring.h>
> +
> +#include "io_uring.h"
> +#include "bpf_filter.h"
> +#include "net.h"
> +
> +struct io_bpf_filter {
> + struct bpf_prog *prog;
> + struct io_bpf_filter *next;
> +};
> +
> +/* Deny if this is set as the filter */
> +static const struct io_bpf_filter dummy_filter;
> +
> +static void io_uring_populate_bpf_ctx(struct io_uring_bpf_ctx *bctx,
> + struct io_kiocb *req)
> +{
> + bctx->opcode = req->opcode;
> + bctx->sqe_flags = (__force int) req->flags & SQE_VALID_FLAGS;
> + bctx->user_data = req->cqe.user_data;
> + /* clear residual */
> + memset(bctx->pad, 0, sizeof(bctx->pad) + sizeof(bctx->resv));
> +}
> +
> +/*
> + * Run registered filters for a given opcode. A filter return of 0 denies
> + * execution of the request, a return of 1 allows it. If any filter for an
> + * opcode returns 0, filter processing stops and the request is denied.
> + *
> + * __io_uring_run_bpf_filters() returns 0 on success, allowing the request
> + * to run, and -EACCES when a request is denied.
> + */
> +int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
> +{
> + struct io_bpf_filter *filter;
> + struct io_uring_bpf_ctx bpf_ctx;
> + int ret;
> +
> + /* Fast check for existence of filters outside of RCU */
> + if (!rcu_access_pointer(res->bpf_filters->filters[req->opcode]))
> + return 0;
> +
> + /*
> + * req->opcode has already been validated to be within the range
> + * of what we expect, io_init_req() does this.
> + */
> + rcu_read_lock();
> + filter = rcu_dereference(res->bpf_filters->filters[req->opcode]);
> + if (!filter) {
> + ret = 1;
> + goto out;
> + } else if (filter == &dummy_filter) {
> + ret = 0;
> + goto out;
> + }
> +
> + io_uring_populate_bpf_ctx(&bpf_ctx, req);
> +
> + /*
> + * Iterate registered filters. The opcode is allowed IFF all filters
> + * return 1. If any filter returns 0, the opcode will be denied.
> + */
> + do {
> + if (filter == &dummy_filter)
> + ret = 0;
> + else
> + ret = bpf_prog_run(filter->prog, &bpf_ctx);
> + if (!ret)
> + break;
> + filter = filter->next;
> + } while (filter);
> +out:
> + rcu_read_unlock();
> + return ret ? 0 : -EACCES;
> +}
Maybe we can write this a little nicer?:
int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
{
struct io_bpf_filter *filter;
struct io_uring_bpf_ctx bpf_ctx;
/* Fast check for existence of filters outside of RCU */
if (!rcu_access_pointer(res->bpf_filters->filters[req->opcode]))
return 0;
/*
* req->opcode has already been validated to be within the range
* of what we expect, io_init_req() does this.
*/
guard(rcu)();
filter = rcu_dereference(res->bpf_filters->filters[req->opcode]);
if (!filter)
return 0;
if (filter == &dummy_filter)
return -EACCES;
io_uring_populate_bpf_ctx(&bpf_ctx, req);
/*
* Iterate registered filters. The opcode is allowed IFF all filters
* return 1. If any filter returns 0, the opcode will be denied.
*/
for (; filter; filter = filter->next) {
int ret;
if (filter == &dummy_filter)
return -EACCES;
ret = bpf_prog_run(filter->prog, &bpf_ctx);
if (!ret)
return -EACCES;
}
return 0;
}
> +
> +static void io_free_bpf_filters(struct rcu_head *head)
> +{
> + struct io_bpf_filter __rcu **filter;
> + struct io_bpf_filters *filters;
> + int i;
> +
> + filters = container_of(head, struct io_bpf_filters, rcu_head);
> + spin_lock(&filters->lock);
> + filter = filters->filters;
> + if (!filter) {
> + spin_unlock(&filters->lock);
> + return;
> + }
> + spin_unlock(&filters->lock);
This is minor but I prefer:
scoped_guard(spinlock)(&filters->lock) {
filters = container_of(head, struct io_bpf_filters, rcu_head);
filter = filters->filters;
if (!filter)
return;
}
> +
> +static void __io_put_bpf_filters(struct io_bpf_filters *filters)
> +{
> + if (refcount_dec_and_test(&filters->refs))
> + call_rcu(&filters->rcu_head, io_free_bpf_filters);
> +}
> +
> +void io_put_bpf_filters(struct io_restriction *res)
> +{
> + if (res->bpf_filters)
> + __io_put_bpf_filters(res->bpf_filters);
> +}
> +
> +static struct io_bpf_filters *io_new_bpf_filters(void)
> +{
> + struct io_bpf_filters *filters;
> +
> + filters = kzalloc(sizeof(*filters), GFP_KERNEL_ACCOUNT);
> + if (!filters)
> + return ERR_PTR(-ENOMEM);
> +
> + filters->filters = kcalloc(IORING_OP_LAST,
> + sizeof(struct io_bpf_filter *),
> + GFP_KERNEL_ACCOUNT);
> + if (!filters->filters) {
> + kfree(filters);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + refcount_set(&filters->refs, 1);
> + spin_lock_init(&filters->lock);
> + return filters;
> +}
static struct io_bpf_filters *io_new_bpf_filters(void)
{
struct io_bpf_filters *filters __free(kfree) = NULL;
filters = kzalloc(sizeof(*filters), GFP_KERNEL_ACCOUNT);
if (!filters)
return ERR_PTR(-ENOMEM);
filters->filters = kcalloc(IORING_OP_LAST,
sizeof(struct io_bpf_filter *),
GFP_KERNEL_ACCOUNT);
if (!filters->filters)
return ERR_PTR(-ENOMEM);
refcount_set(&filters->refs, 1);
spin_lock_init(&filters->lock);
return no_free_ptr(filters);
}
> +
> +/*
> + * Validate classic BPF filter instructions. Only allow a safe subset of
> + * operations - no packet data access, just context field loads and basic
> + * ALU/jump operations.
> + */
> +static int io_uring_check_cbpf_filter(struct sock_filter *filter,
> + unsigned int flen)
> +{
> + int pc;
Seems fine to me but I can't meaningfully review this.
> +int io_register_bpf_filter(struct io_restriction *res,
> + struct io_uring_bpf __user *arg)
> +{
> + struct io_bpf_filter *filter, *old_filter;
> + struct io_bpf_filters *filters;
> + struct io_uring_bpf reg;
> + struct bpf_prog *prog;
> + struct sock_fprog fprog;
> + int ret;
> +
> + if (copy_from_user(&reg, arg, sizeof(reg)))
> + return -EFAULT;
> + if (reg.cmd_type != IO_URING_BPF_CMD_FILTER)
> + return -EINVAL;
> + if (reg.cmd_flags || reg.resv)
> + return -EINVAL;
> +
> + if (reg.filter.opcode >= IORING_OP_LAST)
> + return -EINVAL;
So you only support per-op-code filtering with cbpf. I assume that you
would argue that people can use the existing io_uring restrictions. But
that's not inherited, right? So then this forces users to have a bpf
program for all opcodes that io_uring on their system supports.
I think that this is a bit unfortunate and wasteful for both userspace
and io_uring. Can't we do a combined thing where we also allow filters
to attach to all op-codes? Then userspace could start with an allow-list
or deny-list filter and then attach further per-op-code bpf programs to
the op-codes they want to manage specifically. Then you also get
inheritance of the restrictions per-task.
That would be nicer imho.
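To make that concrete (IORING_OP_ANY is a made-up wildcard value, the
rest is the uapi from this patch), the blanket deny would just be the
trivial one-instruction cBPF program:

	struct sock_filter deny_all[] = {
		/* unconditionally return 0 == deny */
		BPF_STMT(BPF_RET | BPF_K, 0),
	};
	struct io_uring_bpf reg = {
		.cmd_type = IO_URING_BPF_CMD_FILTER,
		.filter = {
			.opcode = IORING_OP_ANY, /* hypothetical wildcard */
			.filter_len = 1,
			.filter_ptr = (__u64)(uintptr_t)deny_all,
		},
	};

with more permissive per-op-code programs layered on top of that.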
* Re: [PATCH 6/7] io_uring: add task fork hook
2026-01-19 23:54 ` [PATCH 6/7] io_uring: add task fork hook Jens Axboe
@ 2026-01-27 10:07 ` Christian Brauner
0 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2026-01-27 10:07 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, jannh, kees, linux-kernel
On Mon, Jan 19, 2026 at 04:54:29PM -0700, Jens Axboe wrote:
> Called when copy_process() is called to copy state to a new child.
> Right now this is just a stub, but will be used shortly to properly
> handle fork'ing of task based io_uring restrictions.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d395f2810fac..9abbd11bb87c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1190,6 +1190,7 @@ struct task_struct {
>
> #ifdef CONFIG_IO_URING
> struct io_uring_task *io_uring;
> + struct io_restriction *io_uring_restrict;
> #endif
Someone should make a graph of how much struct task_struct grew in the
last 5 years. :D
* Re: [PATCH 1/7] io_uring: add support for BPF filtering for opcode restrictions
2026-01-27 10:06 ` Christian Brauner
@ 2026-01-27 16:41 ` Jens Axboe
0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2026-01-27 16:41 UTC (permalink / raw)
To: Christian Brauner; +Cc: io-uring, jannh, kees, linux-kernel
On 1/27/26 3:06 AM, Christian Brauner wrote:
>> +int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
>> +{
>> + struct io_bpf_filter *filter;
>> + struct io_uring_bpf_ctx bpf_ctx;
>> + int ret;
>> +
>> + /* Fast check for existence of filters outside of RCU */
>> + if (!rcu_access_pointer(res->bpf_filters->filters[req->opcode]))
>> + return 0;
>> +
>> + /*
>> + * req->opcode has already been validated to be within the range
>> + * of what we expect, io_init_req() does this.
>> + */
>> + rcu_read_lock();
>> + filter = rcu_dereference(res->bpf_filters->filters[req->opcode]);
>> + if (!filter) {
>> + ret = 1;
>> + goto out;
>> + } else if (filter == &dummy_filter) {
>> + ret = 0;
>> + goto out;
>> + }
>> +
>> + io_uring_populate_bpf_ctx(&bpf_ctx, req);
>> +
>> + /*
>> + * Iterate registered filters. The opcode is allowed IFF all filters
>> + * return 1. If any filter returns 0, the opcode will be denied.
>> + */
>> + do {
>> + if (filter == &dummy_filter)
>> + ret = 0;
>> + else
>> + ret = bpf_prog_run(filter->prog, &bpf_ctx);
>> + if (!ret)
>> + break;
>> + filter = filter->next;
>> + } while (filter);
>> +out:
>> + rcu_read_unlock();
>> + return ret ? 0 : -EACCES;
>> +}
>
> Maybe we can write this a little nicer?:
>
> int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
> {
> struct io_bpf_filter *filter;
> struct io_uring_bpf_ctx bpf_ctx;
>
> /* Fast check for existence of filters outside of RCU */
> if (!rcu_access_pointer(res->bpf_filters->filters[req->opcode]))
> return 0;
>
> /*
> * req->opcode has already been validated to be within the range
> * of what we expect, io_init_req() does this.
> */
> guard(rcu)();
> filter = rcu_dereference(res->bpf_filters->filters[req->opcode]);
> if (!filter)
> return 0;
>
> if (filter == &dummy_filter)
> return -EACCES;
>
> io_uring_populate_bpf_ctx(&bpf_ctx, req);
>
> /*
> * Iterate registered filters. The opcode is allowed IFF all filters
> * return 1. If any filter returns 0, the opcode will be denied.
> */
> for (; filter; filter = filter->next) {
> int ret;
>
> if (filter == &dummy_filter)
> return -EACCES;
>
> ret = bpf_prog_run(filter->prog, &bpf_ctx);
> if (!ret)
> return -EACCES;
> }
>
> return 0;
> }
Did a variant based on this; I agree it looks nicer with guard() for
this one.
>> +static void io_free_bpf_filters(struct rcu_head *head)
>> +{
>> + struct io_bpf_filter __rcu **filter;
>> + struct io_bpf_filters *filters;
>> + int i;
>> +
>> + filters = container_of(head, struct io_bpf_filters, rcu_head);
>> + spin_lock(&filters->lock);
>> + filter = filters->filters;
>> + if (!filter) {
>> + spin_unlock(&filters->lock);
>> + return;
>> + }
>> + spin_unlock(&filters->lock);
>
> This is minor but I prefer:
>
> scoped_guard(spinlock)(&filters->lock) {
> filters = container_of(head, struct io_bpf_filters, rcu_head);
> filter = filters->filters;
> if (!filter)
> return;
> }
Reason I tend to never do that is that I always have to grep for the
syntax... And case in point, you need to do:
scoped_guard(spinlock, &filters->lock) {
so I guess I'm not the only one :-). But it does read better that way,
made this change too.
>> +static struct io_bpf_filters *io_new_bpf_filters(void)
>> +{
>> + struct io_bpf_filters *filters;
>> +
>> + filters = kzalloc(sizeof(*filters), GFP_KERNEL_ACCOUNT);
>> + if (!filters)
>> + return ERR_PTR(-ENOMEM);
>> +
>> + filters->filters = kcalloc(IORING_OP_LAST,
>> + sizeof(struct io_bpf_filter *),
>> + GFP_KERNEL_ACCOUNT);
>> + if (!filters->filters) {
>> + kfree(filters);
>> + return ERR_PTR(-ENOMEM);
>> + }
>> +
>> + refcount_set(&filters->refs, 1);
>> + spin_lock_init(&filters->lock);
>> + return filters;
>> +}
>
> static struct io_bpf_filters *io_new_bpf_filters(void)
> {
> struct io_bpf_filters *filters __free(kfree) = NULL;
>
> filters = kzalloc(sizeof(*filters), GFP_KERNEL_ACCOUNT);
> if (!filters)
> return ERR_PTR(-ENOMEM);
>
> filters->filters = kcalloc(IORING_OP_LAST,
> sizeof(struct io_bpf_filter *),
> GFP_KERNEL_ACCOUNT);
> if (!filters->filters)
> return ERR_PTR(-ENOMEM);
>
> refcount_set(&filters->refs, 1);
> spin_lock_init(&filters->lock);
> return no_free_ptr(filters);
> }
Adopted as well, thanks.
>> +/*
>> + * Validate classic BPF filter instructions. Only allow a safe subset of
>> + * operations - no packet data access, just context field loads and basic
>> + * ALU/jump operations.
>> + */
>> +static int io_uring_check_cbpf_filter(struct sock_filter *filter,
>> + unsigned int flen)
>> +{
>> + int pc;
>
> Seems fine to me but I can't meaningfully review this.
Yeah seriously... It's just the seccomp filter modified for this use
case, so it should supposedly be vetted already.
>> +int io_register_bpf_filter(struct io_restriction *res,
>> + struct io_uring_bpf __user *arg)
>> +{
>> + struct io_bpf_filter *filter, *old_filter;
>> + struct io_bpf_filters *filters;
>> + struct io_uring_bpf reg;
>> + struct bpf_prog *prog;
>> + struct sock_fprog fprog;
>> + int ret;
>> +
>> + if (copy_from_user(&reg, arg, sizeof(reg)))
>> + return -EFAULT;
>> + if (reg.cmd_type != IO_URING_BPF_CMD_FILTER)
>> + return -EINVAL;
>> + if (reg.cmd_flags || reg.resv)
>> + return -EINVAL;
>> +
>> + if (reg.filter.opcode >= IORING_OP_LAST)
>> + return -EINVAL;
>
> So you only support per-op-code filtering with cbpf. I assume that you
> would argue that people can use the existing io_uring restrictions. But
> that's not inherited, right? So then this forces users to have a bpf
> program for all opcodes that io_uring on their system supports.
The existing restrictions ARE inherited now, and can be set on a
per-task basis as well. That's in the last patch. And since the classic
"deny this opcode" filtering is way cheaper than running a BPF prog, I
do think that's the right approach.
> I think that this is a bit unfortunate and wasteful for both userspace
> and io_uring. Can't we do a combined thing where we also allow filters
> to attach to all op-codes. Then userspace could start with an allow-list
> or deny-list filter and then attach further per-op-code bpf programs to
> the op-codes they want to manage specifically. Then you also get
> inheritance of the restrictions per-task.
>
> That would be nicer imho.
I'm considering this one moot given the above on using
IORING_REGISTER_RESTRICTIONS with fd == -1 to set per-task restrictions
using the classic non-bpf filtering, which are inherited as well. You
only need the cBPF filters if you want to deny an opcode based on aux
data for that opcode.
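And for the aux data case a filter stays small anyway. Rough sketch,
assuming the socket fields from patch 2/7 and offsets per the uapi
struct io_uring_bpf_ctx - deny raw sockets, allow any other
IORING_OP_SOCKET request:

	struct sock_filter deny_raw[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct io_uring_bpf_ctx, socket.type)),
		/* type == SOCK_RAW? fall through to deny, else skip to allow */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SOCK_RAW, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, 0),	/* deny */
		BPF_STMT(BPF_RET | BPF_K, 1),	/* allow */
	};

Anything purely opcode-level should go through the plain restrictions
instead.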
--
Jens Axboe
* [PATCH 7/7] io_uring: allow registration of per-task restrictions
2026-01-27 18:29 [PATCHSET v7] " Jens Axboe
@ 2026-01-27 18:30 ` Jens Axboe
0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2026-01-27 18:30 UTC (permalink / raw)
To: io-uring; +Cc: brauner, cyphar, jannh, kees, linux-kernel, Jens Axboe
Currently io_uring supports restricting operations on a per-ring basis.
To use them, the ring must be set up in a disabled state by setting
IORING_SETUP_R_DISABLED. Restrictions can then be set for the ring,
after which the ring can be enabled.
This commit adds support for IORING_REGISTER_RESTRICTIONS with ring_fd
== -1, like the other "blind" register opcodes which work on the task
rather than a specific ring. This allows registration of the same kind
of restrictions as can be registered on a specific ring, but for the
task itself. Once done, any ring created will inherit these restrictions.
If a restriction filter is registered with a task, then it's inherited
on fork for its children. Children may only further restrict operations,
not extend them.
Inherited restrictions include both the classic
IORING_REGISTER_RESTRICTIONS based restrictions and any BPF filters
that have been registered with the task via IORING_REGISTER_BPF_FILTER.
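As a rough usage sketch (raw syscall, no error handling; liburing will
presumably grow a helper for this), restricting the current task and
all future children to a small opcode allow-list looks something like:

	struct io_uring_task_restriction *tres;

	tres = calloc(1, sizeof(*tres) + 2 * sizeof(tres->restrictions[0]));
	tres->nr_res = 2;
	tres->restrictions[0].opcode = IORING_RESTRICTION_SQE_OP;
	tres->restrictions[0].sqe_op = IORING_OP_READ;
	tres->restrictions[1].opcode = IORING_RESTRICTION_SQE_OP;
	tres->restrictions[1].sqe_op = IORING_OP_WRITE;

	/* ring_fd == -1: apply to the task, not to a specific ring */
	syscall(__NR_io_uring_register, -1, IORING_REGISTER_RESTRICTIONS,
		tres, 1);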
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/io_uring_types.h | 2 +
include/uapi/linux/io_uring.h | 7 +++
io_uring/bpf_filter.c | 86 +++++++++++++++++++++++++++++++++-
io_uring/bpf_filter.h | 6 +++
io_uring/io_uring.c | 33 +++++++++++++
io_uring/io_uring.h | 1 +
io_uring/register.c | 80 +++++++++++++++++++++++++++++++
io_uring/tctx.c | 17 +++++++
8 files changed, 231 insertions(+), 1 deletion(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 7617df247238..510d801b9a55 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -231,6 +231,8 @@ struct io_restriction {
DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
struct io_bpf_filters *bpf_filters;
+ /* ->bpf_filters needs COW on modification */
+ bool bpf_filters_cow;
u8 sqe_flags_allowed;
u8 sqe_flags_required;
/* IORING_OP_* restrictions exist */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 94669b77fee8..aeeffcf27fee 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -808,6 +808,13 @@ struct io_uring_restriction {
__u32 resv2[3];
};
+struct io_uring_task_restriction {
+ __u16 flags;
+ __u16 nr_res;
+ __u32 resv[3];
+ __DECLARE_FLEX_ARRAY(struct io_uring_restriction, restrictions);
+};
+
struct io_uring_clock_register {
__u32 clockid;
__u32 __resv[3];
diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
index b94944ab8442..3816883a45ed 100644
--- a/io_uring/bpf_filter.c
+++ b/io_uring/bpf_filter.c
@@ -249,13 +249,77 @@ static int io_uring_check_cbpf_filter(struct sock_filter *filter,
return 0;
}
+void io_bpf_filter_clone(struct io_restriction *dst, struct io_restriction *src)
+{
+ if (!src->bpf_filters)
+ return;
+
+ rcu_read_lock();
+ /*
+ * If the src filter is going away, just ignore it.
+ */
+ if (refcount_inc_not_zero(&src->bpf_filters->refs)) {
+ dst->bpf_filters = src->bpf_filters;
+ dst->bpf_filters_cow = true;
+ }
+ rcu_read_unlock();
+}
+
+/*
+ * Allocate a new struct io_bpf_filters. Used when a filter is cloned and
+ * modifications need to be made.
+ */
+static struct io_bpf_filters *io_bpf_filter_cow(struct io_restriction *src)
+{
+ struct io_bpf_filters *filters;
+ struct io_bpf_filter *srcf;
+ int i;
+
+ filters = io_new_bpf_filters();
+ if (IS_ERR(filters))
+ return filters;
+
+ /*
+ * Iterate filters from src and assign in destination. Grabbing
+ * a reference is enough, we don't need to duplicate the memory.
+ * This is safe because filters are only ever appended to the
+ * front of the list, hence the only memory ever touched inside
+ * a filter is the refcount.
+ */
+ rcu_read_lock();
+ for (i = 0; i < IORING_OP_LAST; i++) {
+ srcf = rcu_dereference(src->bpf_filters->filters[i]);
+ if (!srcf) {
+ continue;
+ } else if (srcf == &dummy_filter) {
+ rcu_assign_pointer(filters->filters[i], &dummy_filter);
+ continue;
+ }
+
+ /*
+ * Getting a ref on the first node is enough, putting the
+ * filter and iterating nodes to free will stop on the first
+ * one that doesn't hit zero when dropping.
+ */
+ if (!refcount_inc_not_zero(&srcf->refs))
+ goto err;
+ rcu_assign_pointer(filters->filters[i], srcf);
+ }
+ rcu_read_unlock();
+ return filters;
+err:
+ rcu_read_unlock();
+ __io_put_bpf_filters(filters);
+ return ERR_PTR(-EBUSY);
+}
+
#define IO_URING_BPF_FILTER_FLAGS IO_URING_BPF_FILTER_DENY_REST
int io_register_bpf_filter(struct io_restriction *res,
struct io_uring_bpf __user *arg)
{
+ struct io_bpf_filters *filters, *old_filters = NULL;
struct io_bpf_filter *filter, *old_filter;
- struct io_bpf_filters *filters;
struct io_uring_bpf reg;
struct bpf_prog *prog;
struct sock_fprog fprog;
@@ -297,6 +361,17 @@ int io_register_bpf_filter(struct io_restriction *res,
ret = PTR_ERR(filters);
goto err_prog;
}
+ } else if (res->bpf_filters_cow) {
+ filters = io_bpf_filter_cow(res);
+ if (IS_ERR(filters)) {
+ ret = PTR_ERR(filters);
+ goto err_prog;
+ }
+ /*
+ * Stash old filters, we'll put them once we know we'll
+ * succeed. Until then, res->bpf_filters is left untouched.
+ */
+ old_filters = res->bpf_filters;
}
filter = kzalloc(sizeof(*filter), GFP_KERNEL_ACCOUNT);
@@ -306,6 +381,15 @@ int io_register_bpf_filter(struct io_restriction *res,
}
refcount_set(&filter->refs, 1);
filter->prog = prog;
+
+ /*
+ * Success - install the new filter set now. If we did COW, put
+ * the old filters as we're replacing them.
+ */
+ if (old_filters) {
+ __io_put_bpf_filters(old_filters);
+ res->bpf_filters_cow = false;
+ }
res->bpf_filters = filters;
/*
diff --git a/io_uring/bpf_filter.h b/io_uring/bpf_filter.h
index 9f3cdb92eb16..66a776cf25b4 100644
--- a/io_uring/bpf_filter.h
+++ b/io_uring/bpf_filter.h
@@ -13,6 +13,8 @@ int io_register_bpf_filter(struct io_restriction *res,
void io_put_bpf_filters(struct io_restriction *res);
+void io_bpf_filter_clone(struct io_restriction *dst, struct io_restriction *src);
+
static inline int io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
struct io_kiocb *req)
{
@@ -37,6 +39,10 @@ static inline int io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
static inline void io_put_bpf_filters(struct io_restriction *res)
{
}
+static inline void io_bpf_filter_clone(struct io_restriction *dst,
+ struct io_restriction *src)
+{
+}
#endif /* CONFIG_IO_URING_BPF */
#endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 049454278563..e43c5283b23a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2880,6 +2880,32 @@ int io_prepare_config(struct io_ctx_config *config)
return 0;
}
+void io_restriction_clone(struct io_restriction *dst, struct io_restriction *src)
+{
+ memcpy(&dst->register_op, &src->register_op, sizeof(dst->register_op));
+ memcpy(&dst->sqe_op, &src->sqe_op, sizeof(dst->sqe_op));
+ dst->sqe_flags_allowed = src->sqe_flags_allowed;
+ dst->sqe_flags_required = src->sqe_flags_required;
+ dst->op_registered = src->op_registered;
+ dst->reg_registered = src->reg_registered;
+
+ io_bpf_filter_clone(dst, src);
+}
+
+static void io_ctx_restriction_clone(struct io_ring_ctx *ctx,
+ struct io_restriction *src)
+{
+ struct io_restriction *dst = &ctx->restrictions;
+
+ io_restriction_clone(dst, src);
+ if (dst->bpf_filters)
+ WRITE_ONCE(ctx->bpf_filters, dst->bpf_filters->filters);
+ if (dst->op_registered)
+ ctx->op_restricted = 1;
+ if (dst->reg_registered)
+ ctx->reg_restricted = 1;
+}
+
static __cold int io_uring_create(struct io_ctx_config *config)
{
struct io_uring_params *p = &config->p;
@@ -2940,6 +2966,13 @@ static __cold int io_uring_create(struct io_ctx_config *config)
else
ctx->notify_method = TWA_SIGNAL;
+ /*
+ * If the current task has restrictions enabled, then copy them to
+ * our newly created ring and mark it as registered.
+ */
+ if (current->io_uring_restrict)
+ io_ctx_restriction_clone(ctx, current->io_uring_restrict);
+
/*
* This is just grabbed for accounting purposes. When a process exits,
* the mm is exited and dropped before the files, hence we need to hang
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 29b8f90fdabf..a08d78c716f8 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -197,6 +197,7 @@ void io_task_refs_refill(struct io_uring_task *tctx);
bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
void io_activate_pollwq(struct io_ring_ctx *ctx);
+void io_restriction_clone(struct io_restriction *dst, struct io_restriction *src);
static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
{
diff --git a/io_uring/register.c b/io_uring/register.c
index 40de9b8924b9..af4815bc11d6 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -190,6 +190,82 @@ static __cold int io_register_restrictions(struct io_ring_ctx *ctx,
return 0;
}
+static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
+{
+ struct io_uring_task_restriction __user *ures = arg;
+ struct io_uring_task_restriction tres;
+ struct io_restriction *res;
+ int ret;
+
+ /* Disallow if task already has registered restrictions */
+ if (current->io_uring_restrict)
+ return -EPERM;
+ /*
+ * Similar to seccomp, only allow setting a filter if the task has
+ * no_new_privs set or we're CAP_SYS_ADMIN.
+ */
+ if (!task_no_new_privs(current) &&
+ !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
+ return -EACCES;
+ if (nr_args != 1)
+ return -EINVAL;
+
+ if (copy_from_user(&tres, arg, sizeof(tres)))
+ return -EFAULT;
+
+ if (tres.flags)
+ return -EINVAL;
+ if (!mem_is_zero(tres.resv, sizeof(tres.resv)))
+ return -EINVAL;
+
+ res = kzalloc(sizeof(*res), GFP_KERNEL_ACCOUNT);
+ if (!res)
+ return -ENOMEM;
+
+ ret = io_parse_restrictions(ures->restrictions, tres.nr_res, res);
+ if (ret < 0) {
+ kfree(res);
+ return ret;
+ }
+ current->io_uring_restrict = res;
+ return 0;
+}
+
+static int io_register_bpf_filter_task(void __user *arg, unsigned int nr_args)
+{
+ struct io_restriction *res;
+ int ret;
+
+ /*
+ * Similar to seccomp, only allow setting a filter if the task has
+ * no_new_privs set or we're CAP_SYS_ADMIN.
+ */
+ if (!task_no_new_privs(current) &&
+ !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
+ return -EACCES;
+
+ if (nr_args != 1)
+ return -EINVAL;
+
+ /* If no task restrictions exist, setup a new set */
+ res = current->io_uring_restrict;
+ if (!res) {
+ res = kzalloc(sizeof(*res), GFP_KERNEL_ACCOUNT);
+ if (!res)
+ return -ENOMEM;
+ }
+
+ ret = io_register_bpf_filter(res, arg);
+ if (ret) {
+ if (res != current->io_uring_restrict)
+ kfree(res);
+ return ret;
+ }
+ if (!current->io_uring_restrict)
+ current->io_uring_restrict = res;
+ return 0;
+}
+
static int io_register_enable_rings(struct io_ring_ctx *ctx)
{
if (!(ctx->flags & IORING_SETUP_R_DISABLED))
@@ -912,6 +988,10 @@ static int io_uring_register_blind(unsigned int opcode, void __user *arg,
return io_uring_register_send_msg_ring(arg, nr_args);
case IORING_REGISTER_QUERY:
return io_query(arg, nr_args);
+ case IORING_REGISTER_RESTRICTIONS:
+ return io_register_restrictions_task(arg, nr_args);
+ case IORING_REGISTER_BPF_FILTER:
+ return io_register_bpf_filter_task(arg, nr_args);
}
return -EINVAL;
}
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index d4f7698805e4..e3da31fdf16f 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -11,6 +11,7 @@
#include "io_uring.h"
#include "tctx.h"
+#include "bpf_filter.h"
static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
struct task_struct *task)
@@ -66,6 +67,11 @@ void __io_uring_free(struct task_struct *tsk)
kfree(tctx);
tsk->io_uring = NULL;
}
+ if (tsk->io_uring_restrict) {
+ io_put_bpf_filters(tsk->io_uring_restrict);
+ kfree(tsk->io_uring_restrict);
+ tsk->io_uring_restrict = NULL;
+ }
}
__cold int io_uring_alloc_task_context(struct task_struct *task,
@@ -356,5 +362,16 @@ int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg,
int __io_uring_fork(struct task_struct *tsk)
{
+ struct io_restriction *res, *src = tsk->io_uring_restrict;
+
+ /* Don't leave it dangling on error */
+ tsk->io_uring_restrict = NULL;
+
+ res = kzalloc(sizeof(*res), GFP_KERNEL_ACCOUNT);
+ if (!res)
+ return -ENOMEM;
+
+ tsk->io_uring_restrict = res;
+ io_restriction_clone(res, src);
return 0;
}
--
2.51.0