* [PATCHSET v5] Inherited restrictions and BPF filtering
@ 2026-01-18 17:16 Jens Axboe
2026-01-18 17:16 ` [PATCH 1/6] io_uring: add support for BPF filtering for opcode restrictions Jens Axboe
` (5 more replies)
0 siblings, 6 replies; 12+ messages in thread
From: Jens Axboe @ 2026-01-18 17:16 UTC (permalink / raw)
To: io-uring; +Cc: brauner
Hi,
Followup to v4 here:
https://lore.kernel.org/io-uring/20260116224356.399361-1-axboe@kernel.dk/
Due to some feedback from Christian, I ended up redoing the filter side
of this to use cBPF rather than eBPF. This provides better support for
some of the intended use cases, like containers, where eBPF cannot be
used unprivileged. This obviously comes with a bit of pain on the
usability front, as you now need to write filters in cBPF bytecode.
I did keep the API such that eBPF filters can be added as well, but that
can be a separate patch. Since the BPF type is just a minor part of this
change, most of the code is exactly the same as before.
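For illustration only, here's roughly what such a bytecode filter looks
like. This one restricts IORING_OP_SOCKET to AF_UNIX, with offset 16
being the socket.family field of struct io_uring_bpf_ctx as laid out
later in the series:

#include <linux/filter.h>
#include <sys/socket.h>

static struct sock_filter sock_family_filter[] = {
        /* A = bctx->socket.family (byte offset 16, 32-bit load) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 16),
        /* if (A == AF_UNIX) allow, else deny this opcode */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, 1),   /* non-zero return == allow */
        BPF_STMT(BPF_RET | BPF_K, 0),   /* zero return == deny */
};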
As before, filters can be registered directly with a ring, or with
the calling task. Filters registered with a ring only affect that ring,
while filters registered with a task will affect any ring subsequently
created. Additionally, task filters are inherited across fork. For both
the original task and any of its children, once registered, only further
restrictions may be added. A forked child initially starts with a
reference to its parent's table. If the parent makes changes to that
table, they will also affect the child. The exception is when the child
registers further filters - in that case, the filter table is COW'ed
and the reference to the parent's table is dropped.
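A rough sketch of that flow, reusing the sock_family_filter bytecode
above and going through the raw io_uring_register(2) syscall (error
handling omitted, fd == -1 selects the task rather than a ring):

#include <unistd.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>
#include <linux/io_uring/bpf_filter.h>

static void restrict_task_then_fork(void)
{
        struct io_uring_bpf reg = {
                .cmd_type = IO_URING_BPF_CMD_FILTER,
                .filter = {
                        .opcode         = IORING_OP_SOCKET,
                        .filter_len     = 4,    /* instructions, not bytes */
                        .filter_ptr     = (uintptr_t) sock_family_filter,
                },
        };

        /* fd == -1: register with the calling task, not a specific ring */
        syscall(__NR_io_uring_register, -1, IORING_REGISTER_BPF_FILTER, &reg, 1);

        if (fork() == 0) {
                /*
                 * The child starts with a reference to the parent's filter
                 * table. Registering another filter here would COW the table
                 * and drop the reference to the parent's copy.
                 */
        }
}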
Kernel branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/log/?h=io_uring-bpf-restrictions.2
and a liburing branch with support helpers and a fairly substantial test
case can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/liburing.git/log/?h=bpf-restrictions
include/linux/io_uring.h | 14 +-
include/linux/io_uring_types.h | 13 +
include/linux/sched.h | 1 +
include/uapi/linux/io_uring.h | 10 +
include/uapi/linux/io_uring/bpf_filter.h | 54 +++
io_uring/Kconfig | 5 +
io_uring/Makefile | 1 +
io_uring/bpf_filter.c | 430 +++++++++++++++++++++++
io_uring/bpf_filter.h | 48 +++
io_uring/io_uring.c | 48 +++
io_uring/io_uring.h | 1 +
io_uring/net.c | 9 +
io_uring/net.h | 6 +
io_uring/register.c | 76 ++++
io_uring/tctx.c | 42 ++-
kernel/fork.c | 5 +
16 files changed, 753 insertions(+), 10 deletions(-)
Changes since v4
- Drop eBPF and switch to cBPF instead. This is a bit of a pain on the
userspace side obviously, as you now have to write bytecode. But it's
necessary for supporting some of the use cases we care about, like
containers.
- Add ctx->bpf_filters cache to reduce dereferences needed to get to
the filter table.
- Do fast "no filter exists for this opcode" check.
- Fix bug with the dummy filter when iterating and running filters.
- Fix bug with ring inheriting task filters for classic filters.
- Move uapi headers to io_uring/bpf_filter.h
- Add Kconfig CONFIG_IO_URING_BPF symbol
--
Jens Axboe
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/6] io_uring: add support for BPF filtering for opcode restrictions
2026-01-18 17:16 [PATCHSET v5] Inherited restrictions and BPF filtering Jens Axboe
@ 2026-01-18 17:16 ` Jens Axboe
2026-01-19 18:51 ` Aleksa Sarai
2026-01-18 17:16 ` [PATCH 2/6] io_uring/net: allow filtering on IORING_OP_SOCKET data Jens Axboe
` (4 subsequent siblings)
5 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-01-18 17:16 UTC (permalink / raw)
To: io-uring; +Cc: brauner, Jens Axboe
This adds support for loading BPF programs with io_uring, which can
restrict the opcodes executed. Unlike IORING_REGISTER_RESTRICTIONS,
using BPF programs allows fine-grained control over both the opcode in
question and other data associated with the request. This initial patch
just supports whatever is in the io_kiocb for filtering, but opcode
specific support will be added shortly.
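As a rough usage sketch against the uapi added below (error handling
omitted), registering an always-allow filter for IORING_OP_NOP with a
ring, while denying every other opcode via IO_URING_BPF_FILTER_DENY_REST,
could look like this:

#include <unistd.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/io_uring.h>
#include <linux/io_uring/bpf_filter.h>

/* one instruction: unconditionally allow the opcode this is attached to */
static struct sock_filter allow_all[] = {
        BPF_STMT(BPF_RET | BPF_K, 1),
};

static int allow_only_nop(int ring_fd)
{
        struct io_uring_bpf reg = {
                .cmd_type = IO_URING_BPF_CMD_FILTER,
                .filter = {
                        .opcode         = IORING_OP_NOP,
                        .flags          = IO_URING_BPF_FILTER_DENY_REST,
                        .filter_len     = 1,
                        .filter_ptr     = (uintptr_t) allow_all,
                },
        };

        return syscall(__NR_io_uring_register, ring_fd,
                       IORING_REGISTER_BPF_FILTER, &reg, 1);
}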
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/io_uring_types.h | 9 +
include/uapi/linux/io_uring.h | 3 +
include/uapi/linux/io_uring/bpf_filter.h | 47 ++++
io_uring/Kconfig | 5 +
io_uring/Makefile | 1 +
io_uring/bpf_filter.c | 328 +++++++++++++++++++++++
io_uring/bpf_filter.h | 42 +++
io_uring/io_uring.c | 8 +
io_uring/register.c | 8 +
9 files changed, 451 insertions(+)
create mode 100644 include/uapi/linux/io_uring/bpf_filter.h
create mode 100644 io_uring/bpf_filter.c
create mode 100644 io_uring/bpf_filter.h
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 211686ad89fd..37f0a5f7b2f4 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -219,9 +219,18 @@ struct io_rings {
struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp;
};
+struct io_bpf_filter;
+struct io_bpf_filters {
+ refcount_t refs; /* ref for ->bpf_filters */
+ spinlock_t lock; /* protects ->bpf_filters modifications */
+ struct io_bpf_filter __rcu **filters;
+ struct rcu_head rcu_head;
+};
+
struct io_restriction {
DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
+ struct io_bpf_filters *bpf_filters;
u8 sqe_flags_allowed;
u8 sqe_flags_required;
/* IORING_OP_* restrictions exist */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index b5b23c0d5283..94669b77fee8 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -700,6 +700,9 @@ enum io_uring_register_op {
/* auxiliary zcrx configuration, see enum zcrx_ctrl_op */
IORING_REGISTER_ZCRX_CTRL = 36,
+ /* register bpf filtering programs */
+ IORING_REGISTER_BPF_FILTER = 37,
+
/* this goes last */
IORING_REGISTER_LAST,
diff --git a/include/uapi/linux/io_uring/bpf_filter.h b/include/uapi/linux/io_uring/bpf_filter.h
new file mode 100644
index 000000000000..14bd5b7468a7
--- /dev/null
+++ b/include/uapi/linux/io_uring/bpf_filter.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
+/*
+ * Header file for the io_uring BPF filters.
+ */
+#ifndef LINUX_IO_URING_BPF_FILTER_H
+#define LINUX_IO_URING_BPF_FILTER_H
+
+#include <linux/types.h>
+
+struct io_uring_bpf_ctx {
+ __u8 opcode;
+ __u8 sqe_flags;
+ __u8 pad[6];
+ __u64 user_data;
+ __u64 resv[6];
+};
+
+enum {
+ /*
+ * If set, any currently unset opcode will have a deny filter attached
+ */
+ IO_URING_BPF_FILTER_DENY_REST = 1,
+};
+
+struct io_uring_bpf_filter {
+ __u32 opcode; /* io_uring opcode to filter */
+ __u32 flags;
+ __u32 filter_len; /* number of BPF instructions */
+ __u32 resv;
+ __u64 filter_ptr; /* pointer to BPF filter */
+ __u64 resv2[5];
+};
+
+enum {
+ IO_URING_BPF_CMD_FILTER = 1,
+};
+
+struct io_uring_bpf {
+ __u16 cmd_type; /* IO_URING_BPF_* values */
+ __u16 cmd_flags; /* none so far */
+ __u32 resv;
+ union {
+ struct io_uring_bpf_filter filter;
+ };
+};
+
+#endif
diff --git a/io_uring/Kconfig b/io_uring/Kconfig
index 4b949c42c0bf..a7ae23cf1035 100644
--- a/io_uring/Kconfig
+++ b/io_uring/Kconfig
@@ -9,3 +9,8 @@ config IO_URING_ZCRX
depends on PAGE_POOL
depends on INET
depends on NET_RX_BUSY_POLL
+
+config IO_URING_BPF
+ def_bool y
+ depends on BPF
+ depends on NET
diff --git a/io_uring/Makefile b/io_uring/Makefile
index bc4e4a3fa0a5..f3c505caa91e 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -22,3 +22,4 @@ obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
obj-$(CONFIG_NET) += net.o cmd_net.o
obj-$(CONFIG_PROC_FS) += fdinfo.o
obj-$(CONFIG_IO_URING_MOCK_FILE) += mock_file.o
+obj-$(CONFIG_IO_URING_BPF) += bpf_filter.o
diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
new file mode 100644
index 000000000000..48c7ea6f8d63
--- /dev/null
+++ b/io_uring/bpf_filter.c
@@ -0,0 +1,328 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF filter support for io_uring. Supports SQE opcodes for now.
+ */
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/io_uring.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+#include <uapi/linux/io_uring.h>
+
+#include "io_uring.h"
+#include "bpf_filter.h"
+#include "net.h"
+
+struct io_bpf_filter {
+ struct bpf_prog *prog;
+ struct io_bpf_filter *next;
+};
+
+/* Deny if this is set as the filter */
+static const struct io_bpf_filter dummy_filter;
+
+static void io_uring_populate_bpf_ctx(struct io_uring_bpf_ctx *bctx,
+ struct io_kiocb *req)
+{
+ memset(bctx, 0, sizeof(*bctx));
+ bctx->opcode = req->opcode;
+ bctx->sqe_flags = (__force int) req->flags & SQE_VALID_FLAGS;
+ bctx->user_data = req->cqe.user_data;
+}
+
+/*
+ * Run registered filters for a given opcode. For filters, a return of 0 denies
+ * execution of the request, a return of 1 allows it. If any filter for an
+ * opcode returns 0, filter processing is stopped, and the request is denied.
+ * This also stops the processing of filters.
+ *
+ * __io_uring_run_bpf_filters() returns 0 on success, allowing the
+ * request to run, and -EACCES when a request is denied.
+ */
+int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
+{
+ struct io_bpf_filter *filter;
+ struct io_uring_bpf_ctx bpf_ctx;
+ int ret;
+
+ /* Fast check for existence of filters outside of RCU */
+ if (!rcu_access_pointer(res->bpf_filters->filters[req->opcode]))
+ return 0;
+
+ /*
+ * req->opcode has already been validated to be within the range
+ * of what we expect, io_init_req() does this.
+ */
+ rcu_read_lock();
+ filter = rcu_dereference(res->bpf_filters->filters[req->opcode]);
+ if (!filter) {
+ ret = 1;
+ goto out;
+ } else if (filter == &dummy_filter) {
+ ret = 0;
+ goto out;
+ }
+
+ io_uring_populate_bpf_ctx(&bpf_ctx, req);
+
+ /*
+ * Iterate registered filters. The opcode is allowed IFF all filters
+ * return 1. If any filter returns 0, the opcode will be denied.
+ */
+ do {
+ if (filter == &dummy_filter)
+ ret = 0;
+ else
+ ret = bpf_prog_run(filter->prog, &bpf_ctx);
+ if (!ret)
+ break;
+ filter = filter->next;
+ } while (filter);
+out:
+ rcu_read_unlock();
+ return ret ? 0 : -EACCES;
+}
+
+static void io_free_bpf_filters(struct rcu_head *head)
+{
+ struct io_bpf_filter __rcu **filter;
+ struct io_bpf_filters *filters;
+ int i;
+
+ filters = container_of(head, struct io_bpf_filters, rcu_head);
+ spin_lock(&filters->lock);
+ filter = filters->filters;
+ if (!filter) {
+ spin_unlock(&filters->lock);
+ return;
+ }
+ spin_unlock(&filters->lock);
+
+ for (i = 0; i < IORING_OP_LAST; i++) {
+ struct io_bpf_filter *f;
+
+ rcu_read_lock();
+ f = rcu_dereference(filter[i]);
+ while (f) {
+ struct io_bpf_filter *next = f->next;
+
+ /*
+ * Even if stacked, dummy filter will always be last
+ * as it can only get installed into an empty spot.
+ */
+ if (f == &dummy_filter)
+ break;
+ bpf_prog_destroy(f->prog);
+ kfree(f);
+ f = next;
+ }
+ rcu_read_unlock();
+ }
+ kfree(filters->filters);
+ kfree(filters);
+}
+
+static void __io_put_bpf_filters(struct io_bpf_filters *filters)
+{
+ if (refcount_dec_and_test(&filters->refs))
+ call_rcu(&filters->rcu_head, io_free_bpf_filters);
+}
+
+void io_put_bpf_filters(struct io_restriction *res)
+{
+ if (res->bpf_filters)
+ __io_put_bpf_filters(res->bpf_filters);
+}
+
+static struct io_bpf_filters *io_new_bpf_filters(void)
+{
+ struct io_bpf_filters *filters;
+
+ filters = kzalloc(sizeof(*filters), GFP_KERNEL_ACCOUNT);
+ if (!filters)
+ return ERR_PTR(-ENOMEM);
+
+ filters->filters = kcalloc(IORING_OP_LAST,
+ sizeof(struct io_bpf_filter *),
+ GFP_KERNEL_ACCOUNT);
+ if (!filters->filters) {
+ kfree(filters);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ refcount_set(&filters->refs, 1);
+ spin_lock_init(&filters->lock);
+ return filters;
+}
+
+/*
+ * Validate classic BPF filter instructions. Only allow a safe subset of
+ * operations - no packet data access, just context field loads and basic
+ * ALU/jump operations.
+ */
+static int io_uring_check_cbpf_filter(struct sock_filter *filter,
+ unsigned int flen)
+{
+ int pc;
+
+ for (pc = 0; pc < flen; pc++) {
+ struct sock_filter *ftest = &filter[pc];
+ u16 code = ftest->code;
+ u32 k = ftest->k;
+
+ switch (code) {
+ case BPF_LD | BPF_W | BPF_ABS:
+ ftest->code = BPF_LDX | BPF_W | BPF_ABS;
+ /* 32-bit aligned and not out of bounds. */
+ if (k >= sizeof(struct io_uring_bpf_ctx) || k & 3)
+ return -EINVAL;
+ continue;
+ case BPF_LD | BPF_W | BPF_LEN:
+ ftest->code = BPF_LD | BPF_IMM;
+ ftest->k = sizeof(struct io_uring_bpf_ctx);
+ continue;
+ case BPF_LDX | BPF_W | BPF_LEN:
+ ftest->code = BPF_LDX | BPF_IMM;
+ ftest->k = sizeof(struct io_uring_bpf_ctx);
+ continue;
+ /* Explicitly include allowed calls. */
+ case BPF_RET | BPF_K:
+ case BPF_RET | BPF_A:
+ case BPF_ALU | BPF_ADD | BPF_K:
+ case BPF_ALU | BPF_ADD | BPF_X:
+ case BPF_ALU | BPF_SUB | BPF_K:
+ case BPF_ALU | BPF_SUB | BPF_X:
+ case BPF_ALU | BPF_MUL | BPF_K:
+ case BPF_ALU | BPF_MUL | BPF_X:
+ case BPF_ALU | BPF_DIV | BPF_K:
+ case BPF_ALU | BPF_DIV | BPF_X:
+ case BPF_ALU | BPF_AND | BPF_K:
+ case BPF_ALU | BPF_AND | BPF_X:
+ case BPF_ALU | BPF_OR | BPF_K:
+ case BPF_ALU | BPF_OR | BPF_X:
+ case BPF_ALU | BPF_XOR | BPF_K:
+ case BPF_ALU | BPF_XOR | BPF_X:
+ case BPF_ALU | BPF_LSH | BPF_K:
+ case BPF_ALU | BPF_LSH | BPF_X:
+ case BPF_ALU | BPF_RSH | BPF_K:
+ case BPF_ALU | BPF_RSH | BPF_X:
+ case BPF_ALU | BPF_NEG:
+ case BPF_LD | BPF_IMM:
+ case BPF_LDX | BPF_IMM:
+ case BPF_MISC | BPF_TAX:
+ case BPF_MISC | BPF_TXA:
+ case BPF_LD | BPF_MEM:
+ case BPF_LDX | BPF_MEM:
+ case BPF_ST:
+ case BPF_STX:
+ case BPF_JMP | BPF_JA:
+ case BPF_JMP | BPF_JEQ | BPF_K:
+ case BPF_JMP | BPF_JEQ | BPF_X:
+ case BPF_JMP | BPF_JGE | BPF_K:
+ case BPF_JMP | BPF_JGE | BPF_X:
+ case BPF_JMP | BPF_JGT | BPF_K:
+ case BPF_JMP | BPF_JGT | BPF_X:
+ case BPF_JMP | BPF_JSET | BPF_K:
+ case BPF_JMP | BPF_JSET | BPF_X:
+ continue;
+ default:
+ return -EINVAL;
+ }
+ }
+ return 0;
+}
+
+#define IO_URING_BPF_FILTER_FLAGS IO_URING_BPF_FILTER_DENY_REST
+
+int io_register_bpf_filter(struct io_restriction *res,
+ struct io_uring_bpf __user *arg)
+{
+ struct io_bpf_filter *filter, *old_filter;
+ struct io_bpf_filters *filters;
+ struct io_uring_bpf reg;
+ struct bpf_prog *prog;
+ struct sock_fprog fprog;
+ int ret;
+
+ if (copy_from_user(®, arg, sizeof(reg)))
+ return -EFAULT;
+ if (reg.cmd_type != IO_URING_BPF_CMD_FILTER)
+ return -EINVAL;
+ if (reg.cmd_flags || reg.resv)
+ return -EINVAL;
+
+ if (reg.filter.opcode >= IORING_OP_LAST)
+ return -EINVAL;
+ if (reg.filter.flags & ~IO_URING_BPF_FILTER_FLAGS)
+ return -EINVAL;
+ if (reg.filter.resv)
+ return -EINVAL;
+ if (!mem_is_zero(reg.filter.resv2, sizeof(reg.filter.resv2)))
+ return -EINVAL;
+ if (!reg.filter.filter_len || reg.filter.filter_len > BPF_MAXINSNS)
+ return -EINVAL;
+
+ fprog.len = reg.filter.filter_len;
+ fprog.filter = u64_to_user_ptr(reg.filter.filter_ptr);
+
+ ret = bpf_prog_create_from_user(&prog, &fprog,
+ io_uring_check_cbpf_filter, false);
+ if (ret)
+ return ret;
+
+ /*
+ * No existing filters, allocate set.
+ */
+ filters = res->bpf_filters;
+ if (!filters) {
+ filters = io_new_bpf_filters();
+ if (IS_ERR(filters)) {
+ ret = PTR_ERR(filters);
+ goto err_prog;
+ }
+ }
+
+ filter = kzalloc(sizeof(*filter), GFP_KERNEL_ACCOUNT);
+ if (!filter) {
+ ret = -ENOMEM;
+ goto err;
+ }
+ filter->prog = prog;
+ res->bpf_filters = filters;
+
+ /*
+ * Insert filter - if the current opcode already has a filter
+ * attached, add to the set.
+ */
+ rcu_read_lock();
+ spin_lock_bh(&filters->lock);
+ old_filter = rcu_dereference(filters->filters[reg.filter.opcode]);
+ if (old_filter)
+ filter->next = old_filter;
+ rcu_assign_pointer(filters->filters[reg.filter.opcode], filter);
+
+ /*
+ * If IO_URING_BPF_FILTER_DENY_REST is set, fill any unregistered
+ * opcode with the dummy filter. That will cause them to be denied.
+ */
+ if (reg.filter.flags & IO_URING_BPF_FILTER_DENY_REST) {
+ for (int i = 0; i < IORING_OP_LAST; i++) {
+ if (i == reg.filter.opcode)
+ continue;
+ old_filter = rcu_dereference(filters->filters[i]);
+ if (old_filter)
+ continue;
+ rcu_assign_pointer(filters->filters[i], &dummy_filter);
+ }
+ }
+
+ spin_unlock_bh(&filters->lock);
+ rcu_read_unlock();
+ return 0;
+err:
+ if (filters != res->bpf_filters)
+ __io_put_bpf_filters(filters);
+err_prog:
+ bpf_prog_destroy(prog);
+ return ret;
+}
diff --git a/io_uring/bpf_filter.h b/io_uring/bpf_filter.h
new file mode 100644
index 000000000000..27eae9705473
--- /dev/null
+++ b/io_uring/bpf_filter.h
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IO_URING_BPF_FILTER_H
+#define IO_URING_BPF_FILTER_H
+
+#include <uapi/linux/io_uring/bpf_filter.h>
+
+#ifdef CONFIG_IO_URING_BPF
+
+int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req);
+
+int io_register_bpf_filter(struct io_restriction *res,
+ struct io_uring_bpf __user *arg);
+
+void io_put_bpf_filters(struct io_restriction *res);
+
+static inline int io_uring_run_bpf_filters(struct io_restriction *res,
+ struct io_kiocb *req)
+{
+ if (res->bpf_filters)
+ return __io_uring_run_bpf_filters(res, req);
+
+ return 0;
+}
+
+#else
+
+static inline int io_register_bpf_filter(struct io_restriction *res,
+ struct io_uring_bpf __user *arg)
+{
+ return -EINVAL;
+}
+static inline int io_uring_run_bpf_filters(struct io_restriction *res,
+ struct io_kiocb *req)
+{
+ return 0;
+}
+static inline void io_put_bpf_filters(struct io_restriction *res)
+{
+}
+#endif /* CONFIG_IO_URING_BPF */
+
+#endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 2cde22af78a3..67533e494836 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -93,6 +93,7 @@
#include "rw.h"
#include "alloc_cache.h"
#include "eventfd.h"
+#include "bpf_filter.h"
#define SQE_COMMON_FLAGS (IOSQE_FIXED_FILE | IOSQE_IO_LINK | \
IOSQE_IO_HARDLINK | IOSQE_ASYNC)
@@ -2261,6 +2262,12 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
if (unlikely(ret))
return io_submit_fail_init(sqe, req, ret);
+ if (unlikely(ctx->restrictions.bpf_filters)) {
+ ret = io_uring_run_bpf_filters(&ctx->restrictions, req);
+ if (ret)
+ return io_submit_fail_init(sqe, req, ret);
+ }
+
trace_io_uring_submit_req(req);
/*
@@ -2850,6 +2857,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
percpu_ref_exit(&ctx->refs);
free_uid(ctx->user);
io_req_caches_free(ctx);
+ io_put_bpf_filters(&ctx->restrictions);
WARN_ON_ONCE(ctx->nr_req_allocated);
diff --git a/io_uring/register.c b/io_uring/register.c
index 8551f13920dc..30957c2cb5eb 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -33,6 +33,7 @@
#include "memmap.h"
#include "zcrx.h"
#include "query.h"
+#include "bpf_filter.h"
#define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
IORING_REGISTER_LAST + IORING_OP_LAST)
@@ -830,6 +831,13 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
case IORING_REGISTER_ZCRX_CTRL:
ret = io_zcrx_ctrl(ctx, arg, nr_args);
break;
+ case IORING_REGISTER_BPF_FILTER:
+ ret = -EINVAL;
+
+ if (nr_args != 1)
+ break;
+ ret = io_register_bpf_filter(&ctx->restrictions, arg);
+ break;
default:
ret = -EINVAL;
break;
--
2.51.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/6] io_uring/net: allow filtering on IORING_OP_SOCKET data
2026-01-18 17:16 [PATCHSET v5] Inherited restrictions and BPF filtering Jens Axboe
2026-01-18 17:16 ` [PATCH 1/6] io_uring: add support for BPF filtering for opcode restrictions Jens Axboe
@ 2026-01-18 17:16 ` Jens Axboe
2026-01-18 17:16 ` [PATCH 3/6] io_uring/bpf_filter: cache lookup table in ctx->bpf_filters Jens Axboe
` (3 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2026-01-18 17:16 UTC (permalink / raw)
To: io-uring; +Cc: brauner, Jens Axboe
Add an example population method for the BPF based opcode filtering.
This exposes the socket family, type, and protocol to a registered BPF
filter, which in turn enables the filter to make decisions based on
what was passed to the IORING_OP_SOCKET request.
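For example, a filter attached to IORING_OP_SOCKET that refuses raw
sockets but allows any other socket type might look like the below,
with type at byte offset 20 of struct io_uring_bpf_ctx per the layout
in this patch (the mask strips SOCK_CLOEXEC/SOCK_NONBLOCK, which ride
along with the type as they do for socket(2)):

#include <linux/filter.h>
#include <sys/socket.h>

static struct sock_filter no_raw_sockets[] = {
        /* A = bctx->socket.type (byte offset 20) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 20),
        /* keep only the low type bits */
        BPF_STMT(BPF_ALU | BPF_AND | BPF_K, 0xf),
        /* deny SOCK_RAW, allow everything else */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SOCK_RAW, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, 0),
        BPF_STMT(BPF_RET | BPF_K, 1),
};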
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/uapi/linux/io_uring/bpf_filter.h | 9 ++++++++-
io_uring/bpf_filter.c | 10 ++++++++++
io_uring/net.c | 9 +++++++++
io_uring/net.h | 6 ++++++
4 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/io_uring/bpf_filter.h b/include/uapi/linux/io_uring/bpf_filter.h
index 14bd5b7468a7..e7565458d4d8 100644
--- a/include/uapi/linux/io_uring/bpf_filter.h
+++ b/include/uapi/linux/io_uring/bpf_filter.h
@@ -12,7 +12,14 @@ struct io_uring_bpf_ctx {
__u8 sqe_flags;
__u8 pad[6];
__u64 user_data;
- __u64 resv[6];
+ union {
+ __u64 resv[6];
+ struct {
+ __u32 family;
+ __u32 type;
+ __u32 protocol;
+ } socket;
+ };
};
enum {
diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
index 48c7ea6f8d63..63996b350e60 100644
--- a/io_uring/bpf_filter.c
+++ b/io_uring/bpf_filter.c
@@ -28,6 +28,16 @@ static void io_uring_populate_bpf_ctx(struct io_uring_bpf_ctx *bctx,
bctx->opcode = req->opcode;
bctx->sqe_flags = (__force int) req->flags & SQE_VALID_FLAGS;
bctx->user_data = req->cqe.user_data;
+
+ /*
+ * Opcodes can provide a handler for populating more data into bctx,
+ * for filters to use.
+ */
+ switch (req->opcode) {
+ case IORING_OP_SOCKET:
+ io_socket_bpf_populate(bctx, req);
+ break;
+ }
}
/*
diff --git a/io_uring/net.c b/io_uring/net.c
index 519ea055b761..4fcba36bd0bb 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1699,6 +1699,15 @@ int io_accept(struct io_kiocb *req, unsigned int issue_flags)
return IOU_COMPLETE;
}
+void io_socket_bpf_populate(struct io_uring_bpf_ctx *bctx, struct io_kiocb *req)
+{
+ struct io_socket *sock = io_kiocb_to_cmd(req, struct io_socket);
+
+ bctx->socket.family = sock->domain;
+ bctx->socket.type = sock->type;
+ bctx->socket.protocol = sock->protocol;
+}
+
int io_socket_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
struct io_socket *sock = io_kiocb_to_cmd(req, struct io_socket);
diff --git a/io_uring/net.h b/io_uring/net.h
index 43e5ce5416b7..a862960a3bb9 100644
--- a/io_uring/net.h
+++ b/io_uring/net.h
@@ -3,6 +3,7 @@
#include <linux/net.h>
#include <linux/uio.h>
#include <linux/io_uring_types.h>
+#include <uapi/linux/io_uring/bpf_filter.h>
struct io_async_msghdr {
#if defined(CONFIG_NET)
@@ -44,6 +45,7 @@ int io_accept(struct io_kiocb *req, unsigned int issue_flags);
int io_socket_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_socket(struct io_kiocb *req, unsigned int issue_flags);
+void io_socket_bpf_populate(struct io_uring_bpf_ctx *bctx, struct io_kiocb *req);
int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_connect(struct io_kiocb *req, unsigned int issue_flags);
@@ -64,4 +66,8 @@ void io_netmsg_cache_free(const void *entry);
static inline void io_netmsg_cache_free(const void *entry)
{
}
+static inline void io_socket_bpf_populate(struct io_uring_bpf_ctx *bctx,
+ struct io_kiocb *req)
+{
+}
#endif
--
2.51.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 3/6] io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
2026-01-18 17:16 [PATCHSET v5] Inherited restrictions and BPF filtering Jens Axboe
2026-01-18 17:16 ` [PATCH 1/6] io_uring: add support for BPF filtering for opcode restrictions Jens Axboe
2026-01-18 17:16 ` [PATCH 2/6] io_uring/net: allow filtering on IORING_OP_SOCKET data Jens Axboe
@ 2026-01-18 17:16 ` Jens Axboe
2026-01-18 17:16 ` [PATCH 4/6] io_uring/bpf_filter: add ref counts to struct io_bpf_filter Jens Axboe
` (2 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2026-01-18 17:16 UTC (permalink / raw)
To: io-uring; +Cc: brauner, Jens Axboe
Currently a few pointer dereferences are needed to both check if BPF
filters are installed and to retrieve the actual filter for the opcode.
Cache the table in ctx->bpf_filters to avoid that.
Add a bit of debug info on ring exit to show if we ever got this wrong.
There's only a small risk of that, given that the table is currently
updated in just one spot, but once task forking is enabled, that will
add one more.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/io_uring_types.h | 2 ++
io_uring/bpf_filter.c | 7 ++++---
io_uring/bpf_filter.h | 10 +++++-----
io_uring/io_uring.c | 11 +++++++++--
io_uring/register.c | 3 +++
5 files changed, 23 insertions(+), 10 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 37f0a5f7b2f4..366927635277 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -287,6 +287,8 @@ struct io_ring_ctx {
struct task_struct *submitter_task;
struct io_rings *rings;
+ /* cache of ->restrictions.bpf_filters->filters */
+ struct io_bpf_filter __rcu **bpf_filters;
struct percpu_ref refs;
clockid_t clockid;
diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
index 63996b350e60..0e668852b3ea 100644
--- a/io_uring/bpf_filter.c
+++ b/io_uring/bpf_filter.c
@@ -49,14 +49,15 @@ static void io_uring_populate_bpf_ctx(struct io_uring_bpf_ctx *bctx,
* __io_uring_run_bpf_filters() returns 0 on success, allow running the
* request, and -EACCES when a request is denied.
*/
-int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
+int __io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
+ struct io_kiocb *req)
{
struct io_bpf_filter *filter;
struct io_uring_bpf_ctx bpf_ctx;
int ret;
/* Fast check for existence of filters outside of RCU */
- if (!rcu_access_pointer(res->bpf_filters->filters[req->opcode]))
+ if (!rcu_access_pointer(filters[req->opcode]))
return 0;
/*
@@ -64,7 +65,7 @@ int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
* of what we expect, io_init_req() does this.
*/
rcu_read_lock();
- filter = rcu_dereference(res->bpf_filters->filters[req->opcode]);
+ filter = rcu_dereference(filters[req->opcode]);
if (!filter) {
ret = 1;
goto out;
diff --git a/io_uring/bpf_filter.h b/io_uring/bpf_filter.h
index 27eae9705473..9f3cdb92eb16 100644
--- a/io_uring/bpf_filter.h
+++ b/io_uring/bpf_filter.h
@@ -6,18 +6,18 @@
#ifdef CONFIG_IO_URING_BPF
-int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req);
+int __io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters, struct io_kiocb *req);
int io_register_bpf_filter(struct io_restriction *res,
struct io_uring_bpf __user *arg);
void io_put_bpf_filters(struct io_restriction *res);
-static inline int io_uring_run_bpf_filters(struct io_restriction *res,
+static inline int io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
struct io_kiocb *req)
{
- if (res->bpf_filters)
- return __io_uring_run_bpf_filters(res, req);
+ if (filters)
+ return __io_uring_run_bpf_filters(filters, req);
return 0;
}
@@ -29,7 +29,7 @@ static inline int io_register_bpf_filter(struct io_restriction *res,
{
return -EINVAL;
}
-static inline int io_uring_run_bpf_filters(struct io_restriction *res,
+static inline int io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
struct io_kiocb *req)
{
return 0;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 67533e494836..62aeaf0fad74 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2262,8 +2262,8 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
if (unlikely(ret))
return io_submit_fail_init(sqe, req, ret);
- if (unlikely(ctx->restrictions.bpf_filters)) {
- ret = io_uring_run_bpf_filters(&ctx->restrictions, req);
+ if (unlikely(ctx->bpf_filters)) {
+ ret = io_uring_run_bpf_filters(ctx->bpf_filters, req);
if (ret)
return io_submit_fail_init(sqe, req, ret);
}
@@ -2857,6 +2857,13 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
percpu_ref_exit(&ctx->refs);
free_uid(ctx->user);
io_req_caches_free(ctx);
+
+ if (ctx->restrictions.bpf_filters) {
+ WARN_ON_ONCE(ctx->bpf_filters !=
+ ctx->restrictions.bpf_filters->filters);
+ } else {
+ WARN_ON_ONCE(ctx->bpf_filters);
+ }
io_put_bpf_filters(&ctx->restrictions);
WARN_ON_ONCE(ctx->nr_req_allocated);
diff --git a/io_uring/register.c b/io_uring/register.c
index 30957c2cb5eb..40de9b8924b9 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -837,6 +837,9 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
if (nr_args != 1)
break;
ret = io_register_bpf_filter(&ctx->restrictions, arg);
+ if (!ret)
+ WRITE_ONCE(ctx->bpf_filters,
+ ctx->restrictions.bpf_filters->filters);
break;
default:
ret = -EINVAL;
--
2.51.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 4/6] io_uring/bpf_filter: add ref counts to struct io_bpf_filter
2026-01-18 17:16 [PATCHSET v5] Inherited restrictions and BPF filtering Jens Axboe
` (2 preceding siblings ...)
2026-01-18 17:16 ` [PATCH 3/6] io_uring/bpf_filter: cache lookup table in ctx->bpf_filters Jens Axboe
@ 2026-01-18 17:16 ` Jens Axboe
2026-01-18 17:16 ` [PATCH 5/6] io_uring: add task fork hook Jens Axboe
2026-01-18 17:16 ` [PATCH 6/6] io_uring: allow registration of per-task restrictions Jens Axboe
5 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2026-01-18 17:16 UTC (permalink / raw)
To: io-uring; +Cc: brauner, Jens Axboe
In preparation for allowing inheritance of BPF filters and filter
tables, add a reference count to the filter. This allows multiple tables
to safely include the same filter.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
io_uring/bpf_filter.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
index 0e668852b3ea..545acd480ffd 100644
--- a/io_uring/bpf_filter.c
+++ b/io_uring/bpf_filter.c
@@ -14,6 +14,7 @@
#include "net.h"
struct io_bpf_filter {
+ refcount_t refs;
struct bpf_prog *prog;
struct io_bpf_filter *next;
};
@@ -123,6 +124,11 @@ static void io_free_bpf_filters(struct rcu_head *head)
*/
if (f == &dummy_filter)
break;
+
+ /* Someone still holds a ref, stop iterating. */
+ if (!refcount_dec_and_test(&f->refs))
+ break;
+
bpf_prog_destroy(f->prog);
kfree(f);
f = next;
@@ -298,6 +304,7 @@ int io_register_bpf_filter(struct io_restriction *res,
ret = -ENOMEM;
goto err;
}
+ refcount_set(&filter->refs, 1);
filter->prog = prog;
res->bpf_filters = filters;
--
2.51.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 5/6] io_uring: add task fork hook
2026-01-18 17:16 [PATCHSET v5] Inherited restrictions and BPF filtering Jens Axboe
` (3 preceding siblings ...)
2026-01-18 17:16 ` [PATCH 4/6] io_uring/bpf_filter: add ref counts to struct io_bpf_filter Jens Axboe
@ 2026-01-18 17:16 ` Jens Axboe
2026-01-18 17:16 ` [PATCH 6/6] io_uring: allow registration of per-task restrictions Jens Axboe
5 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2026-01-18 17:16 UTC (permalink / raw)
To: io-uring; +Cc: brauner, Jens Axboe
Add a hook that is called when copy_process() copies state to a new
child. Right now this is just a stub, but it will shortly be used to
properly handle forking of task based io_uring restrictions.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/io_uring.h | 14 +++++++++++++-
include/linux/sched.h | 1 +
io_uring/tctx.c | 25 ++++++++++++++++---------
kernel/fork.c | 5 +++++
4 files changed, 35 insertions(+), 10 deletions(-)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 85fe4e6b275c..d1aa4edfc2a5 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -12,6 +12,7 @@ void __io_uring_free(struct task_struct *tsk);
void io_uring_unreg_ringfd(void);
const char *io_uring_get_opcode(u8 opcode);
bool io_is_uring_fops(struct file *file);
+int __io_uring_fork(struct task_struct *tsk);
static inline void io_uring_files_cancel(void)
{
@@ -25,9 +26,16 @@ static inline void io_uring_task_cancel(void)
}
static inline void io_uring_free(struct task_struct *tsk)
{
- if (tsk->io_uring)
+ if (tsk->io_uring || tsk->io_uring_restrict)
__io_uring_free(tsk);
}
+static inline int io_uring_fork(struct task_struct *tsk)
+{
+ if (tsk->io_uring_restrict)
+ return __io_uring_fork(tsk);
+
+ return 0;
+}
#else
static inline void io_uring_task_cancel(void)
{
@@ -46,6 +54,10 @@ static inline bool io_is_uring_fops(struct file *file)
{
return false;
}
+static inline int io_uring_fork(struct task_struct *tsk)
+{
+ return 0;
+}
#endif
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d395f2810fac..9abbd11bb87c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1190,6 +1190,7 @@ struct task_struct {
#ifdef CONFIG_IO_URING
struct io_uring_task *io_uring;
+ struct io_restriction *io_uring_restrict;
#endif
/* Namespaces: */
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 5b66755579c0..d4f7698805e4 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -54,16 +54,18 @@ void __io_uring_free(struct task_struct *tsk)
* node is stored in the xarray. Until that gets sorted out, attempt
* an iteration here and warn if any entries are found.
*/
- xa_for_each(&tctx->xa, index, node) {
- WARN_ON_ONCE(1);
- break;
- }
- WARN_ON_ONCE(tctx->io_wq);
- WARN_ON_ONCE(tctx->cached_refs);
+ if (tctx) {
+ xa_for_each(&tctx->xa, index, node) {
+ WARN_ON_ONCE(1);
+ break;
+ }
+ WARN_ON_ONCE(tctx->io_wq);
+ WARN_ON_ONCE(tctx->cached_refs);
- percpu_counter_destroy(&tctx->inflight);
- kfree(tctx);
- tsk->io_uring = NULL;
+ percpu_counter_destroy(&tctx->inflight);
+ kfree(tctx);
+ tsk->io_uring = NULL;
+ }
}
__cold int io_uring_alloc_task_context(struct task_struct *task,
@@ -351,3 +353,8 @@ int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg,
return i ? i : ret;
}
+
+int __io_uring_fork(struct task_struct *tsk)
+{
+ return 0;
+}
diff --git a/kernel/fork.c b/kernel/fork.c
index b1f3915d5f8e..08a2515380ec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -97,6 +97,7 @@
#include <linux/kasan.h>
#include <linux/scs.h>
#include <linux/io_uring.h>
+#include <linux/io_uring_types.h>
#include <linux/bpf.h>
#include <linux/stackprotector.h>
#include <linux/user_events.h>
@@ -2129,6 +2130,9 @@ __latent_entropy struct task_struct *copy_process(
#ifdef CONFIG_IO_URING
p->io_uring = NULL;
+ retval = io_uring_fork(p);
+ if (unlikely(retval))
+ goto bad_fork_cleanup_delayacct;
#endif
p->default_timer_slack_ns = current->timer_slack_ns;
@@ -2525,6 +2529,7 @@ __latent_entropy struct task_struct *copy_process(
mpol_put(p->mempolicy);
#endif
bad_fork_cleanup_delayacct:
+ io_uring_free(p);
delayacct_tsk_free(p);
bad_fork_cleanup_count:
dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
--
2.51.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 6/6] io_uring: allow registration of per-task restrictions
2026-01-18 17:16 [PATCHSET v5] Inherited restrictions and BPF filtering Jens Axboe
` (4 preceding siblings ...)
2026-01-18 17:16 ` [PATCH 5/6] io_uring: add task fork hook Jens Axboe
@ 2026-01-18 17:16 ` Jens Axboe
2026-01-19 17:54 ` Aleksa Sarai
5 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-01-18 17:16 UTC (permalink / raw)
To: io-uring; +Cc: brauner, Jens Axboe
Currently io_uring supports restricting operations on a per-ring basis.
To use those, the ring must be setup in a disabled state by setting
IORING_SETUP_R_DISABLED. Then restrictions can be set for the ring, and
the ring can then be enabled.
This commit adds support for IORING_REGISTER_RESTRICTIONS with ring_fd
== -1, like the other "blind" register opcodes which work on the task
rather than a specific ring. This allows registration of the same kind
of restrictions as can be done on a specific ring, but with the task
itself. Once done, any ring created will inherit these restrictions.
If a restriction filter is registered with a task, then it's inherited
on fork for its children. Children may only further restrict operations,
not extend them.
Inherited restrictions include both the classic
IORING_REGISTER_RESTRICTIONS based restrictions and the BPF
filters that have been registered with the task via
IORING_REGISTER_BPF_FILTER.
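A rough sketch of registering the classic restrictions task-wide with
the uapi below (error handling kept minimal); any ring created
afterwards, and any child forked afterwards, inherits them:

#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

static int restrict_task_to_rw(void)
{
        struct io_uring_task_restriction *treg;
        size_t len;
        int ret;

        len = sizeof(*treg) + 2 * sizeof(struct io_uring_restriction);
        treg = calloc(1, len);
        if (!treg)
                return -ENOMEM;

        treg->nr_res = 2;
        treg->restrictions[0].opcode = IORING_RESTRICTION_SQE_OP;
        treg->restrictions[0].sqe_op = IORING_OP_READ;
        treg->restrictions[1].opcode = IORING_RESTRICTION_SQE_OP;
        treg->restrictions[1].sqe_op = IORING_OP_WRITE;

        /* fd == -1: "blind" registration, applies to the task itself */
        ret = syscall(__NR_io_uring_register, -1,
                      IORING_REGISTER_RESTRICTIONS, treg, 1);
        free(treg);
        return ret;
}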
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/io_uring_types.h | 2 +
include/uapi/linux/io_uring.h | 7 +++
io_uring/bpf_filter.c | 86 +++++++++++++++++++++++++++++++++-
io_uring/bpf_filter.h | 6 +++
io_uring/io_uring.c | 33 +++++++++++++
io_uring/io_uring.h | 1 +
io_uring/register.c | 65 +++++++++++++++++++++++++
io_uring/tctx.c | 17 +++++++
8 files changed, 216 insertions(+), 1 deletion(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 366927635277..15ed7fa2bca3 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -231,6 +231,8 @@ struct io_restriction {
DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
struct io_bpf_filters *bpf_filters;
+ /* ->bpf_filters needs COW on modification */
+ bool bpf_filters_cow;
u8 sqe_flags_allowed;
u8 sqe_flags_required;
/* IORING_OP_* restrictions exist */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 94669b77fee8..aeeffcf27fee 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -808,6 +808,13 @@ struct io_uring_restriction {
__u32 resv2[3];
};
+struct io_uring_task_restriction {
+ __u16 flags;
+ __u16 nr_res;
+ __u32 resv[3];
+ __DECLARE_FLEX_ARRAY(struct io_uring_restriction, restrictions);
+};
+
struct io_uring_clock_register {
__u32 clockid;
__u32 __resv[3];
diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
index 545acd480ffd..b3a66b4793b3 100644
--- a/io_uring/bpf_filter.c
+++ b/io_uring/bpf_filter.c
@@ -249,13 +249,77 @@ static int io_uring_check_cbpf_filter(struct sock_filter *filter,
return 0;
}
+void io_bpf_filter_clone(struct io_restriction *dst, struct io_restriction *src)
+{
+ if (!src->bpf_filters)
+ return;
+
+ rcu_read_lock();
+ /*
+ * If the src filter is going away, just ignore it.
+ */
+ if (refcount_inc_not_zero(&src->bpf_filters->refs)) {
+ dst->bpf_filters = src->bpf_filters;
+ dst->bpf_filters_cow = true;
+ }
+ rcu_read_unlock();
+}
+
+/*
+ * Allocate a new struct io_bpf_filters. Used when a filter is cloned and
+ * modifications need to be made.
+ */
+static struct io_bpf_filters *io_bpf_filter_cow(struct io_restriction *src)
+{
+ struct io_bpf_filters *filters;
+ struct io_bpf_filter *srcf;
+ int i;
+
+ filters = io_new_bpf_filters();
+ if (IS_ERR(filters))
+ return filters;
+
+ /*
+ * Iterate filters from src and assign in destination. Grabbing
+ * a reference is enough, we don't need to duplicate the memory.
+ * This is safe because filters are only ever appended to the
+ * front of the list, hence the only memory ever touched inside
+ * a filter is the refcount.
+ */
+ rcu_read_lock();
+ for (i = 0; i < IORING_OP_LAST; i++) {
+ srcf = rcu_dereference(src->bpf_filters->filters[i]);
+ if (!srcf) {
+ continue;
+ } else if (srcf == &dummy_filter) {
+ rcu_assign_pointer(filters->filters[i], &dummy_filter);
+ continue;
+ }
+
+ /*
+ * Getting a ref on the first node is enough, putting the
+ * filter and iterating nodes to free will stop on the first
+ * one that doesn't hit zero when dropping.
+ */
+ if (!refcount_inc_not_zero(&srcf->refs))
+ goto err;
+ rcu_assign_pointer(filters->filters[i], srcf);
+ }
+ rcu_read_unlock();
+ return filters;
+err:
+ rcu_read_unlock();
+ __io_put_bpf_filters(filters);
+ return ERR_PTR(-EBUSY);
+}
+
#define IO_URING_BPF_FILTER_FLAGS IO_URING_BPF_FILTER_DENY_REST
int io_register_bpf_filter(struct io_restriction *res,
struct io_uring_bpf __user *arg)
{
+ struct io_bpf_filters *filters, *old_filters = NULL;
struct io_bpf_filter *filter, *old_filter;
- struct io_bpf_filters *filters;
struct io_uring_bpf reg;
struct bpf_prog *prog;
struct sock_fprog fprog;
@@ -297,6 +361,17 @@ int io_register_bpf_filter(struct io_restriction *res,
ret = PTR_ERR(filters);
goto err_prog;
}
+ } else if (res->bpf_filters_cow) {
+ filters = io_bpf_filter_cow(res);
+ if (IS_ERR(filters)) {
+ ret = PTR_ERR(filters);
+ goto err_prog;
+ }
+ /*
+ * Stash old filters, we'll put them once we know we'll
+ * succeed. Until then, res->bpf_filters is left untouched.
+ */
+ old_filters = res->bpf_filters;
}
filter = kzalloc(sizeof(*filter), GFP_KERNEL_ACCOUNT);
@@ -306,6 +381,15 @@ int io_register_bpf_filter(struct io_restriction *res,
}
refcount_set(&filter->refs, 1);
filter->prog = prog;
+
+ /*
+ * Success - install the new filter set now. If we did COW, put
+ * the old filters as we're replacing them.
+ */
+ if (old_filters) {
+ __io_put_bpf_filters(old_filters);
+ res->bpf_filters_cow = false;
+ }
res->bpf_filters = filters;
/*
diff --git a/io_uring/bpf_filter.h b/io_uring/bpf_filter.h
index 9f3cdb92eb16..66a776cf25b4 100644
--- a/io_uring/bpf_filter.h
+++ b/io_uring/bpf_filter.h
@@ -13,6 +13,8 @@ int io_register_bpf_filter(struct io_restriction *res,
void io_put_bpf_filters(struct io_restriction *res);
+void io_bpf_filter_clone(struct io_restriction *dst, struct io_restriction *src);
+
static inline int io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
struct io_kiocb *req)
{
@@ -37,6 +39,10 @@ static inline int io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
static inline void io_put_bpf_filters(struct io_restriction *res)
{
}
+static inline void io_bpf_filter_clone(struct io_restriction *dst,
+ struct io_restriction *src)
+{
+}
#endif /* CONFIG_IO_URING_BPF */
#endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 62aeaf0fad74..e190827d2436 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3569,6 +3569,32 @@ int io_prepare_config(struct io_ctx_config *config)
return 0;
}
+void io_restriction_clone(struct io_restriction *dst, struct io_restriction *src)
+{
+ memcpy(&dst->register_op, &src->register_op, sizeof(dst->register_op));
+ memcpy(&dst->sqe_op, &src->sqe_op, sizeof(dst->sqe_op));
+ dst->sqe_flags_allowed = src->sqe_flags_allowed;
+ dst->sqe_flags_required = src->sqe_flags_required;
+ dst->op_registered = src->op_registered;
+ dst->reg_registered = src->reg_registered;
+
+ io_bpf_filter_clone(dst, src);
+}
+
+static void io_ctx_restriction_clone(struct io_ring_ctx *ctx,
+ struct io_restriction *src)
+{
+ struct io_restriction *dst = &ctx->restrictions;
+
+ io_restriction_clone(dst, src);
+ if (dst->bpf_filters)
+ WRITE_ONCE(ctx->bpf_filters, dst->bpf_filters->filters);
+ if (dst->op_registered)
+ ctx->op_restricted = 1;
+ if (dst->reg_registered)
+ ctx->reg_restricted = 1;
+}
+
static __cold int io_uring_create(struct io_ctx_config *config)
{
struct io_uring_params *p = &config->p;
@@ -3629,6 +3655,13 @@ static __cold int io_uring_create(struct io_ctx_config *config)
else
ctx->notify_method = TWA_SIGNAL;
+ /*
+ * If the current task has restrictions enabled, then copy them to
+ * our newly created ring and mark it as registered.
+ */
+ if (current->io_uring_restrict)
+ io_ctx_restriction_clone(ctx, current->io_uring_restrict);
+
/*
* This is just grabbed for accounting purposes. When a process exits,
* the mm is exited and dropped before the files, hence we need to hang
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index c5bbb43b5842..feb9f76761e9 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -195,6 +195,7 @@ void io_task_refs_refill(struct io_uring_task *tctx);
bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
void io_activate_pollwq(struct io_ring_ctx *ctx);
+void io_restriction_clone(struct io_restriction *dst, struct io_restriction *src);
static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
{
diff --git a/io_uring/register.c b/io_uring/register.c
index 40de9b8924b9..e8a68b04a6f4 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -190,6 +190,67 @@ static __cold int io_register_restrictions(struct io_ring_ctx *ctx,
return 0;
}
+static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
+{
+ struct io_uring_task_restriction __user *ures = arg;
+ struct io_uring_task_restriction tres;
+ struct io_restriction *res;
+ int ret;
+
+ /* Disallow if task already has registered restrictions */
+ if (current->io_uring_restrict)
+ return -EPERM;
+ if (nr_args != 1)
+ return -EINVAL;
+
+ if (copy_from_user(&tres, arg, sizeof(tres)))
+ return -EFAULT;
+
+ if (tres.flags)
+ return -EINVAL;
+ if (!mem_is_zero(tres.resv, sizeof(tres.resv)))
+ return -EINVAL;
+
+ res = kzalloc(sizeof(*res), GFP_KERNEL_ACCOUNT);
+ if (!res)
+ return -ENOMEM;
+
+ ret = io_parse_restrictions(ures->restrictions, tres.nr_res, res);
+ if (ret < 0) {
+ kfree(res);
+ return ret;
+ }
+ current->io_uring_restrict = res;
+ return 0;
+}
+
+static int io_register_bpf_filter_task(void __user *arg, unsigned int nr_args)
+{
+ struct io_restriction *res;
+ int ret;
+
+ if (nr_args != 1)
+ return -EINVAL;
+
+ /* If no task restrictions exist, setup a new set */
+ res = current->io_uring_restrict;
+ if (!res) {
+ res = kzalloc(sizeof(*res), GFP_KERNEL_ACCOUNT);
+ if (!res)
+ return -ENOMEM;
+ }
+
+ ret = io_register_bpf_filter(res, arg);
+ if (ret) {
+ if (res != current->io_uring_restrict)
+ kfree(res);
+ return ret;
+ }
+ if (!current->io_uring_restrict)
+ current->io_uring_restrict = res;
+ return 0;
+}
+
static int io_register_enable_rings(struct io_ring_ctx *ctx)
{
if (!(ctx->flags & IORING_SETUP_R_DISABLED))
@@ -912,6 +973,10 @@ static int io_uring_register_blind(unsigned int opcode, void __user *arg,
return io_uring_register_send_msg_ring(arg, nr_args);
case IORING_REGISTER_QUERY:
return io_query(arg, nr_args);
+ case IORING_REGISTER_RESTRICTIONS:
+ return io_register_restrictions_task(arg, nr_args);
+ case IORING_REGISTER_BPF_FILTER:
+ return io_register_bpf_filter_task(arg, nr_args);
}
return -EINVAL;
}
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index d4f7698805e4..e3da31fdf16f 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -11,6 +11,7 @@
#include "io_uring.h"
#include "tctx.h"
+#include "bpf_filter.h"
static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
struct task_struct *task)
@@ -66,6 +67,11 @@ void __io_uring_free(struct task_struct *tsk)
kfree(tctx);
tsk->io_uring = NULL;
}
+ if (tsk->io_uring_restrict) {
+ io_put_bpf_filters(tsk->io_uring_restrict);
+ kfree(tsk->io_uring_restrict);
+ tsk->io_uring_restrict = NULL;
+ }
}
__cold int io_uring_alloc_task_context(struct task_struct *task,
@@ -356,5 +362,16 @@ int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg,
int __io_uring_fork(struct task_struct *tsk)
{
+ struct io_restriction *res, *src = tsk->io_uring_restrict;
+
+ /* Don't leave it dangling on error */
+ tsk->io_uring_restrict = NULL;
+
+ res = kzalloc(sizeof(*res), GFP_KERNEL_ACCOUNT);
+ if (!res)
+ return -ENOMEM;
+
+ tsk->io_uring_restrict = res;
+ io_restriction_clone(res, src);
return 0;
}
--
2.51.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 6/6] io_uring: allow registration of per-task restrictions
2026-01-18 17:16 ` [PATCH 6/6] io_uring: allow registration of per-task restrictions Jens Axboe
@ 2026-01-19 17:54 ` Aleksa Sarai
2026-01-19 18:02 ` Jens Axboe
2026-01-19 20:29 ` Jens Axboe
0 siblings, 2 replies; 12+ messages in thread
From: Aleksa Sarai @ 2026-01-19 17:54 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, brauner, Jann Horn, Kees Cook
[-- Attachment #1: Type: text/plain, Size: 14757 bytes --]
On 2026-01-18, Jens Axboe <axboe@kernel.dk> wrote:
> Currently io_uring supports restricting operations on a per-ring basis.
> To use those, the ring must be setup in a disabled state by setting
> IORING_SETUP_R_DISABLED. Then restrictions can be set for the ring, and
> the ring can then be enabled.
>
> This commit adds support for IORING_REGISTER_RESTRICTIONS with ring_fd
> == -1, like the other "blind" register opcodes which work on the task
> rather than a specific ring. This allows registration of the same kind
> of restrictions as can be done on a specific ring, but with the task
> itself. Once done, any ring created will inherit these restrictions.
>
> If a restriction filter is registered with a task, then it's inherited
> on fork for its children. Children may only further restrict operations,
> not extend them.
>
> Inherited restrictions include both the classic
> IORING_REGISTER_RESTRICTIONS based restrictions and the BPF
> filters that have been registered with the task via
> IORING_REGISTER_BPF_FILTER.
Adding Kees and Jann to Cc, since this is pretty much the "seccomp but
for io_uring" stuff that has been discussed quite a few times. (Though I
guess they'll find this thread from LWN soon enough.)
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> include/linux/io_uring_types.h | 2 +
> include/uapi/linux/io_uring.h | 7 +++
> io_uring/bpf_filter.c | 86 +++++++++++++++++++++++++++++++++-
> io_uring/bpf_filter.h | 6 +++
> io_uring/io_uring.c | 33 +++++++++++++
> io_uring/io_uring.h | 1 +
> io_uring/register.c | 65 +++++++++++++++++++++++++
> io_uring/tctx.c | 17 +++++++
> 8 files changed, 216 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 366927635277..15ed7fa2bca3 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -231,6 +231,8 @@ struct io_restriction {
> DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
> DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
> struct io_bpf_filters *bpf_filters;
> + /* ->bpf_filters needs COW on modification */
> + bool bpf_filters_cow;
> u8 sqe_flags_allowed;
> u8 sqe_flags_required;
> /* IORING_OP_* restrictions exist */
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 94669b77fee8..aeeffcf27fee 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -808,6 +808,13 @@ struct io_uring_restriction {
> __u32 resv2[3];
> };
>
> +struct io_uring_task_restriction {
> + __u16 flags;
> + __u16 nr_res;
> + __u32 resv[3];
> + __DECLARE_FLEX_ARRAY(struct io_uring_restriction, restrictions);
> +};
> +
> struct io_uring_clock_register {
> __u32 clockid;
> __u32 __resv[3];
> diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
> index 545acd480ffd..b3a66b4793b3 100644
> --- a/io_uring/bpf_filter.c
> +++ b/io_uring/bpf_filter.c
> @@ -249,13 +249,77 @@ static int io_uring_check_cbpf_filter(struct sock_filter *filter,
> return 0;
> }
>
> +void io_bpf_filter_clone(struct io_restriction *dst, struct io_restriction *src)
> +{
> + if (!src->bpf_filters)
> + return;
> +
> + rcu_read_lock();
> + /*
> + * If the src filter is going away, just ignore it.
> + */
> + if (refcount_inc_not_zero(&src->bpf_filters->refs)) {
> + dst->bpf_filters = src->bpf_filters;
> + dst->bpf_filters_cow = true;
> + }
> + rcu_read_unlock();
> +}
> +
> +/*
> + * Allocate a new struct io_bpf_filters. Used when a filter is cloned and
> + * modifications need to be made.
> + */
> +static struct io_bpf_filters *io_bpf_filter_cow(struct io_restriction *src)
> +{
> + struct io_bpf_filters *filters;
> + struct io_bpf_filter *srcf;
> + int i;
> +
> + filters = io_new_bpf_filters();
> + if (IS_ERR(filters))
> + return filters;
> +
> + /*
> + * Iterate filters from src and assign in destination. Grabbing
> + * a reference is enough, we don't need to duplicate the memory.
> + * This is safe because filters are only ever appended to the
> + * front of the list, hence the only memory ever touched inside
> + * a filter is the refcount.
> + */
> + rcu_read_lock();
> + for (i = 0; i < IORING_OP_LAST; i++) {
> + srcf = rcu_dereference(src->bpf_filters->filters[i]);
> + if (!srcf) {
> + continue;
> + } else if (srcf == &dummy_filter) {
> + rcu_assign_pointer(filters->filters[i], &dummy_filter);
> + continue;
> + }
> +
> + /*
> + * Getting a ref on the first node is enough, putting the
> + * filter and iterating nodes to free will stop on the first
> + * one that doesn't hit zero when dropping.
> + */
> + if (!refcount_inc_not_zero(&srcf->refs))
> + goto err;
> + rcu_assign_pointer(filters->filters[i], srcf);
> + }
> + rcu_read_unlock();
> + return filters;
> +err:
> + rcu_read_unlock();
> + __io_put_bpf_filters(filters);
> + return ERR_PTR(-EBUSY);
> +}
> +
> #define IO_URING_BPF_FILTER_FLAGS IO_URING_BPF_FILTER_DENY_REST
>
> int io_register_bpf_filter(struct io_restriction *res,
> struct io_uring_bpf __user *arg)
> {
> + struct io_bpf_filters *filters, *old_filters = NULL;
> struct io_bpf_filter *filter, *old_filter;
> - struct io_bpf_filters *filters;
> struct io_uring_bpf reg;
> struct bpf_prog *prog;
> struct sock_fprog fprog;
> @@ -297,6 +361,17 @@ int io_register_bpf_filter(struct io_restriction *res,
> ret = PTR_ERR(filters);
> goto err_prog;
> }
> + } else if (res->bpf_filters_cow) {
> + filters = io_bpf_filter_cow(res);
> + if (IS_ERR(filters)) {
> + ret = PTR_ERR(filters);
> + goto err_prog;
> + }
> + /*
> + * Stash old filters, we'll put them once we know we'll
> + * succeed. Until then, res->bpf_filters is left untouched.
> + */
> + old_filters = res->bpf_filters;
> }
>
> filter = kzalloc(sizeof(*filter), GFP_KERNEL_ACCOUNT);
> @@ -306,6 +381,15 @@ int io_register_bpf_filter(struct io_restriction *res,
> }
> refcount_set(&filter->refs, 1);
> filter->prog = prog;
> +
> + /*
> + * Success - install the new filter set now. If we did COW, put
> + * the old filters as we're replacing them.
> + */
> + if (old_filters) {
> + __io_put_bpf_filters(old_filters);
> + res->bpf_filters_cow = false;
> + }
> res->bpf_filters = filters;
>
> /*
> diff --git a/io_uring/bpf_filter.h b/io_uring/bpf_filter.h
> index 9f3cdb92eb16..66a776cf25b4 100644
> --- a/io_uring/bpf_filter.h
> +++ b/io_uring/bpf_filter.h
> @@ -13,6 +13,8 @@ int io_register_bpf_filter(struct io_restriction *res,
>
> void io_put_bpf_filters(struct io_restriction *res);
>
> +void io_bpf_filter_clone(struct io_restriction *dst, struct io_restriction *src);
> +
> static inline int io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
> struct io_kiocb *req)
> {
> @@ -37,6 +39,10 @@ static inline int io_uring_run_bpf_filters(struct io_bpf_filter __rcu **filters,
> static inline void io_put_bpf_filters(struct io_restriction *res)
> {
> }
> +static inline void io_bpf_filter_clone(struct io_restriction *dst,
> + struct io_restriction *src)
> +{
> +}
> #endif /* CONFIG_IO_URING_BPF */
>
> #endif
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 62aeaf0fad74..e190827d2436 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3569,6 +3569,32 @@ int io_prepare_config(struct io_ctx_config *config)
> return 0;
> }
>
> +void io_restriction_clone(struct io_restriction *dst, struct io_restriction *src)
> +{
> + memcpy(&dst->register_op, &src->register_op, sizeof(dst->register_op));
> + memcpy(&dst->sqe_op, &src->sqe_op, sizeof(dst->sqe_op));
> + dst->sqe_flags_allowed = src->sqe_flags_allowed;
> + dst->sqe_flags_required = src->sqe_flags_required;
> + dst->op_registered = src->op_registered;
> + dst->reg_registered = src->reg_registered;
> +
> + io_bpf_filter_clone(dst, src);
> +}
> +
> +static void io_ctx_restriction_clone(struct io_ring_ctx *ctx,
> + struct io_restriction *src)
> +{
> + struct io_restriction *dst = &ctx->restrictions;
> +
> + io_restriction_clone(dst, src);
> + if (dst->bpf_filters)
> + WRITE_ONCE(ctx->bpf_filters, dst->bpf_filters->filters);
> + if (dst->op_registered)
> + ctx->op_restricted = 1;
> + if (dst->reg_registered)
> + ctx->reg_restricted = 1;
> +}
> +
> static __cold int io_uring_create(struct io_ctx_config *config)
> {
> struct io_uring_params *p = &config->p;
> @@ -3629,6 +3655,13 @@ static __cold int io_uring_create(struct io_ctx_config *config)
> else
> ctx->notify_method = TWA_SIGNAL;
>
> + /*
> + * If the current task has restrictions enabled, then copy them to
> + * our newly created ring and mark it as registered.
> + */
> + if (current->io_uring_restrict)
> + io_ctx_restriction_clone(ctx, current->io_uring_restrict);
> +
> /*
> * This is just grabbed for accounting purposes. When a process exits,
> * the mm is exited and dropped before the files, hence we need to hang
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index c5bbb43b5842..feb9f76761e9 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -195,6 +195,7 @@ void io_task_refs_refill(struct io_uring_task *tctx);
> bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
>
> void io_activate_pollwq(struct io_ring_ctx *ctx);
> +void io_restriction_clone(struct io_restriction *dst, struct io_restriction *src);
>
> static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
> {
> diff --git a/io_uring/register.c b/io_uring/register.c
> index 40de9b8924b9..e8a68b04a6f4 100644
> --- a/io_uring/register.c
> +++ b/io_uring/register.c
> @@ -190,6 +190,67 @@ static __cold int io_register_restrictions(struct io_ring_ctx *ctx,
> return 0;
> }
>
> +static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
> +{
> + struct io_uring_task_restriction __user *ures = arg;
> + struct io_uring_task_restriction tres;
> + struct io_restriction *res;
> + int ret;
You almost certainly want to copy the seccomp logic of disallowing the
setting of restrictions unless no_new_privs is set or the process has
CAP_SYS_ADMIN.
While seccomp is more dangerous in this respect (as it allows you to
modify the return value of a syscall), being able to alter the execution
of setuid binaries usually leads to security issues, so it's probably
best to just copy what seccomp does here.
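For reference, the gate in seccomp_prepare_user_filter() boils down to
something like the below -- the helpers are what mainline seccomp uses,
where exactly it would sit in the io_uring register path is up to you:

        /*
         * Sketch of the seccomp-style privilege gate, cf.
         * seccomp_prepare_user_filter(). Not part of the posted series.
         */
        static int io_restriction_may_register(void)
        {
                /* unprivileged callers must have no_new_privs set */
                if (task_no_new_privs(current))
                        return 0;
                /* otherwise require CAP_SYS_ADMIN in the current userns */
                if (ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
                        return 0;
                return -EACCES;
        }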
> + /* Disallow if task already has registered restrictions */
> + if (current->io_uring_restrict)
> + return -EPERM;
I guess specifying "stacked" restrictions (a-la seccomp) is intended as
future work?
This is kind of critical for both nesting use-cases and for making this
usable more widely (I imagine systemd will want to set system-wide
restrictions which would lock out programs from being able to set their
own process-wide restrictions -- nested containers are also a fairly
common use-case these days too).
(For containers we would probably only really use the cBPF stuff but it
would be nice for them to both be stackable -- if only for the reason
that you could set them in any order.)
> + if (nr_args != 1)
> + return -EINVAL;
> +
> + if (copy_from_user(&tres, arg, sizeof(tres)))
> + return -EFAULT;
> +
> + if (tres.flags)
> + return -EINVAL;
> + if (!mem_is_zero(tres.resv, sizeof(tres.resv)))
> + return -EINVAL;
I would suggest using copy_struct_from_user() to make extensions easier,
but I don't know if that is the kind of thing you feel necessary for
io_uring APIs.
> +
> + res = kzalloc(sizeof(*res), GFP_KERNEL_ACCOUNT);
> + if (!res)
> + return -ENOMEM;
> +
> + ret = io_parse_restrictions(ures->restrictions, tres.nr_res, res);
> + if (ret < 0) {
> + kfree(res);
> + return ret;
> + }
> + current->io_uring_restrict = res;
> + return 0;
> +}
> +
> +static int io_register_bpf_filter_task(void __user *arg, unsigned int nr_args)
> +{
> + struct io_restriction *res;
> + int ret;
> +
> + if (nr_args != 1)
> + return -EINVAL;
Same comment as above about no_new_privs / CAP_SYS_ADMIN.
> +
> + /* If no task restrictions exist, setup a new set */
> + res = current->io_uring_restrict;
> + if (!res) {
> + res = kzalloc(sizeof(*res), GFP_KERNEL_ACCOUNT);
> + if (!res)
> + return -ENOMEM;
> + }
> +
> + ret = io_register_bpf_filter(res, arg);
> + if (ret) {
> + if (res != current->io_uring_restrict)
> + kfree(res);
> + return ret;
> + }
> + if (!current->io_uring_restrict)
> + current->io_uring_restrict = res;
> + return 0;
> +}
> +
> static int io_register_enable_rings(struct io_ring_ctx *ctx)
> {
> if (!(ctx->flags & IORING_SETUP_R_DISABLED))
> @@ -912,6 +973,10 @@ static int io_uring_register_blind(unsigned int opcode, void __user *arg,
> return io_uring_register_send_msg_ring(arg, nr_args);
> case IORING_REGISTER_QUERY:
> return io_query(arg, nr_args);
> + case IORING_REGISTER_RESTRICTIONS:
> + return io_register_restrictions_task(arg, nr_args);
> + case IORING_REGISTER_BPF_FILTER:
> + return io_register_bpf_filter_task(arg, nr_args);
> }
> return -EINVAL;
> }
> diff --git a/io_uring/tctx.c b/io_uring/tctx.c
> index d4f7698805e4..e3da31fdf16f 100644
> --- a/io_uring/tctx.c
> +++ b/io_uring/tctx.c
> @@ -11,6 +11,7 @@
>
> #include "io_uring.h"
> #include "tctx.h"
> +#include "bpf_filter.h"
>
> static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
> struct task_struct *task)
> @@ -66,6 +67,11 @@ void __io_uring_free(struct task_struct *tsk)
> kfree(tctx);
> tsk->io_uring = NULL;
> }
> + if (tsk->io_uring_restrict) {
> + io_put_bpf_filters(tsk->io_uring_restrict);
> + kfree(tsk->io_uring_restrict);
> + tsk->io_uring_restrict = NULL;
> + }
> }
>
> __cold int io_uring_alloc_task_context(struct task_struct *task,
> @@ -356,5 +362,16 @@ int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg,
>
> int __io_uring_fork(struct task_struct *tsk)
> {
> + struct io_restriction *res, *src = tsk->io_uring_restrict;
> +
> + /* Don't leave it dangling on error */
> + tsk->io_uring_restrict = NULL;
> +
> + res = kzalloc(sizeof(*res), GFP_KERNEL_ACCOUNT);
> + if (!res)
> + return -ENOMEM;
> +
> + tsk->io_uring_restrict = res;
> + io_restriction_clone(res, src);
> return 0;
> }
> --
> 2.51.0
>
>
--
Aleksa Sarai
https://www.cyphar.com/
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 6/6] io_uring: allow registration of per-task restrictions
2026-01-19 17:54 ` Aleksa Sarai
@ 2026-01-19 18:02 ` Jens Axboe
2026-01-19 20:29 ` Jens Axboe
1 sibling, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2026-01-19 18:02 UTC (permalink / raw)
To: Aleksa Sarai; +Cc: io-uring, brauner, Jann Horn, Kees Cook
On 1/19/26 10:54 AM, Aleksa Sarai wrote:
> On 2026-01-18, Jens Axboe <axboe@kernel.dk> wrote:
>> Currently io_uring supports restricting operations on a per-ring basis.
>> To use those, the ring must be setup in a disabled state by setting
>> IORING_SETUP_R_DISABLED. Then restrictions can be set for the ring, and
>> the ring can then be enabled.
>>
>> This commit adds support for IORING_REGISTER_RESTRICTIONS with ring_fd
>> == -1, like the other "blind" register opcodes which work on the task
>> rather than a specific ring. This allows registration of the same kind
>> of restrictions as can been done on a specific ring, but with the task
>> itself. Once done, any ring created will inherit these restrictions.
>>
>> If a restriction filter is registered with a task, then it's inherited
>> on fork for its children. Children may only further restrict operations,
>> not extend them.
>>
>> Inherited restrictions include both the classic
>> IORING_REGISTER_RESTRICTIONS based restrictions, as well as the BPF
>> filters that have been registered with the task via
>> IORING_REGISTER_BPF_FILTER.
>
> Adding Kees and Jann to Cc, since this is pretty much the "seccomp but
> for io_uring" stuff that has been discussed quite a few times. (Though I
> guess they'll find this thread from LWN soon enough.)
Thanks indeed - my plan was to distribute this wider for a v6 posting,
which should be coming shortly.
--
Jens Axboe
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/6] io_uring: add support for BPF filtering for opcode restrictions
2026-01-18 17:16 ` [PATCH 1/6] io_uring: add support for BPF filtering for opcode restrictions Jens Axboe
@ 2026-01-19 18:51 ` Aleksa Sarai
2026-01-19 20:17 ` Jens Axboe
0 siblings, 1 reply; 12+ messages in thread
From: Aleksa Sarai @ 2026-01-19 18:51 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, brauner, Kees Cook, Jann Horn
On 2026-01-18, Jens Axboe <axboe@kernel.dk> wrote:
> This adds support for loading BPF programs with io_uring, which can
> restrict the opcodes executed. Unlike IORING_REGISTER_RESTRICTIONS,
> using BPF programs allows fine-grained control over both the opcode in
> question, as well as other data associated with the request. This
> initial patch just supports whatever is in the io_kiocb for filtering,
> but shortly opcode specific support will be added.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> include/linux/io_uring_types.h | 9 +
> include/uapi/linux/io_uring.h | 3 +
> include/uapi/linux/io_uring/bpf_filter.h | 47 ++++
> io_uring/Kconfig | 5 +
> io_uring/Makefile | 1 +
> io_uring/bpf_filter.c | 328 +++++++++++++++++++++++
> io_uring/bpf_filter.h | 42 +++
> io_uring/io_uring.c | 8 +
> io_uring/register.c | 8 +
> 9 files changed, 451 insertions(+)
> create mode 100644 include/uapi/linux/io_uring/bpf_filter.h
> create mode 100644 io_uring/bpf_filter.c
> create mode 100644 io_uring/bpf_filter.h
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 211686ad89fd..37f0a5f7b2f4 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -219,9 +219,18 @@ struct io_rings {
> struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp;
> };
>
> +struct io_bpf_filter;
> +struct io_bpf_filters {
> + refcount_t refs; /* ref for ->bpf_filters */
> + spinlock_t lock; /* protects ->bpf_filters modifications */
> + struct io_bpf_filter __rcu **filters;
> + struct rcu_head rcu_head;
> +};
> +
> struct io_restriction {
> DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
> DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
> + struct io_bpf_filters *bpf_filters;
> u8 sqe_flags_allowed;
> u8 sqe_flags_required;
> /* IORING_OP_* restrictions exist */
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index b5b23c0d5283..94669b77fee8 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -700,6 +700,9 @@ enum io_uring_register_op {
> /* auxiliary zcrx configuration, see enum zcrx_ctrl_op */
> IORING_REGISTER_ZCRX_CTRL = 36,
>
> + /* register bpf filtering programs */
> + IORING_REGISTER_BPF_FILTER = 37,
> +
> /* this goes last */
> IORING_REGISTER_LAST,
>
> diff --git a/include/uapi/linux/io_uring/bpf_filter.h b/include/uapi/linux/io_uring/bpf_filter.h
> new file mode 100644
> index 000000000000..14bd5b7468a7
> --- /dev/null
> +++ b/include/uapi/linux/io_uring/bpf_filter.h
> @@ -0,0 +1,47 @@
> +/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
> +/*
> + * Header file for the io_uring BPF filters.
> + */
> +#ifndef LINUX_IO_URING_BPF_FILTER_H
> +#define LINUX_IO_URING_BPF_FILTER_H
> +
> +#include <linux/types.h>
> +
> +struct io_uring_bpf_ctx {
> + __u8 opcode;
> + __u8 sqe_flags;
> + __u8 pad[6];
> + __u64 user_data;
> + __u64 resv[6];
> +};
I had more envisioned this as operating on SQEs directly, not on an
intermediate representation.
I get why this is much simpler to deal with (operating on the SQE
directly is going to be racy as a malicious process could change the
argument values after the filters have run -- this is kind of like the
classic ptrace-seccomp hole), but since SQEs are a fixed size it seems
like the most natural analogue to seccomp's model.
While it is a tad ugly, AFAICS (looking at io_socket_prep) this would
also let you filter socket(2) directly which would obviate the need for
the second patch in this series.
That being said, if you do really want to go with a custom
representation, I would suggest including a size and making it variable
size so that filtering pointers (especially of extensible struct
syscalls like openat2) is more trivial to accomplish in the future. [1]
is the model we came up with for seccomp, which suffers from having to
deal with arbitrary syscall bodies but if you have your own
representation you can end up with something reasonably extensible
without the baggage.
[1]: https://www.youtube.com/watch?v=CHpLLR0CwSw
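To make that concrete, a size-prefixed context could look roughly like
the below -- illustrative only, none of this is in the posted series:

        /* alternative, extensible shape for the filter context (sketch) */
        struct io_uring_bpf_ctx {
                __u32 size;             /* bytes of context the kernel filled in */
                __u8  opcode;
                __u8  sqe_flags;
                __u8  pad[2];
                __u64 user_data;
                /* opcode-specific, possibly variable-size payload follows */
                __u8  data[];
        };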
> +
> +enum {
> + /*
> + * If set, any currently unset opcode will have a deny filter attached
> + */
> + IO_URING_BPF_FILTER_DENY_REST = 1,
> +};
> +
> +struct io_uring_bpf_filter {
> + __u32 opcode; /* io_uring opcode to filter */
> + __u32 flags;
> + __u32 filter_len; /* number of BPF instructions */
> + __u32 resv;
> + __u64 filter_ptr; /* pointer to BPF filter */
> + __u64 resv2[5];
> +};
Since io_uring_bpf_ctx contains the opcode, it seems a little strange
that you would require userspace to set up a separate filter for every
opcode they wish to filter. seccomp lets you just have one filter for
all syscall numbers -- I can see the argument that this lets you build
optimised filters for each opcode but having to set up filters for 65
opcodes (at time of writing) seems less than optimal...
The optimisation seccomp uses for filters that simply blanket allow a
syscall is to pre-compute a cached bitmap of those syscalls to avoid
running the filter for those syscalls (see seccomp_cache_prepare). That
seems like a more practical solution which provides a similar (if not
better) optimisation for the allow-filter case.
Doing it this way would also remove the need for
IO_URING_BPF_FILTER_DENY_REST, because the filters could just implement
that themselves.
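To show what the single-filter model would look like against this
context (hypothetical -- the posted API attaches filters per opcode, and
because opcode/sqe_flags are packed bytes this sketch assumes a
little-endian machine):

        struct sock_filter insns[] = {
                /* A = first 32-bit word of the ctx; opcode is the low byte on LE */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 0),
                BPF_STMT(BPF_ALU | BPF_AND | BPF_K, 0xff),
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, IORING_OP_OPENAT, 0, 1),
                BPF_STMT(BPF_RET | BPF_K, 0),   /* deny openat */
                BPF_STMT(BPF_RET | BPF_K, 1),   /* allow everything else */
        };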
> +
> +enum {
> + IO_URING_BPF_CMD_FILTER = 1,
> +};
> +
> +struct io_uring_bpf {
> + __u16 cmd_type; /* IO_URING_BPF_* values */
> + __u16 cmd_flags; /* none so far */
> + __u32 resv;
> + union {
> + struct io_uring_bpf_filter filter;
> + };
> +};
> +
> +#endif
> diff --git a/io_uring/Kconfig b/io_uring/Kconfig
> index 4b949c42c0bf..a7ae23cf1035 100644
> --- a/io_uring/Kconfig
> +++ b/io_uring/Kconfig
> @@ -9,3 +9,8 @@ config IO_URING_ZCRX
> depends on PAGE_POOL
> depends on INET
> depends on NET_RX_BUSY_POLL
> +
> +config IO_URING_BPF
> + def_bool y
> + depends on BPF
> + depends on NET
> diff --git a/io_uring/Makefile b/io_uring/Makefile
> index bc4e4a3fa0a5..f3c505caa91e 100644
> --- a/io_uring/Makefile
> +++ b/io_uring/Makefile
> @@ -22,3 +22,4 @@ obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
> obj-$(CONFIG_NET) += net.o cmd_net.o
> obj-$(CONFIG_PROC_FS) += fdinfo.o
> obj-$(CONFIG_IO_URING_MOCK_FILE) += mock_file.o
> +obj-$(CONFIG_IO_URING_BPF) += bpf_filter.o
> diff --git a/io_uring/bpf_filter.c b/io_uring/bpf_filter.c
> new file mode 100644
> index 000000000000..48c7ea6f8d63
> --- /dev/null
> +++ b/io_uring/bpf_filter.c
> @@ -0,0 +1,328 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * BPF filter support for io_uring. Supports SQE opcodes for now.
> + */
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/io_uring.h>
> +#include <linux/filter.h>
> +#include <linux/bpf.h>
> +#include <uapi/linux/io_uring.h>
> +
> +#include "io_uring.h"
> +#include "bpf_filter.h"
> +#include "net.h"
> +
> +struct io_bpf_filter {
> + struct bpf_prog *prog;
> + struct io_bpf_filter *next;
> +};
> +
> +/* Deny if this is set as the filter */
> +static const struct io_bpf_filter dummy_filter;
> +
> +static void io_uring_populate_bpf_ctx(struct io_uring_bpf_ctx *bctx,
> + struct io_kiocb *req)
> +{
> + memset(bctx, 0, sizeof(*bctx));
> + bctx->opcode = req->opcode;
> + bctx->sqe_flags = (__force int) req->flags & SQE_VALID_FLAGS;
> + bctx->user_data = req->cqe.user_data;
> +}
> +
> +/*
> + * Run registered filters for a given opcode. For filters, a return of 0 denies
> + * execution of the request, a return of 1 allows it. If any filter for an
> + * opcode returns 0, filter processing is stopped, and the request is denied.
> + * This also stops the processing of filters.
> + *
> + * __io_uring_run_bpf_filters() returns 0 on success, allow running the
> + * request, and -EACCES when a request is denied.
> + */
> +int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
> +{
> + struct io_bpf_filter *filter;
> + struct io_uring_bpf_ctx bpf_ctx;
> + int ret;
> +
> + /* Fast check for existence of filters outside of RCU */
> + if (!rcu_access_pointer(res->bpf_filters->filters[req->opcode]))
> + return 0;
> +
> + /*
> + * req->opcode has already been validated to be within the range
> + * of what we expect, io_init_req() does this.
> + */
> + rcu_read_lock();
> + filter = rcu_dereference(res->bpf_filters->filters[req->opcode]);
> + if (!filter) {
> + ret = 1;
> + goto out;
> + } else if (filter == &dummy_filter) {
> + ret = 0;
> + goto out;
> + }
> +
> + io_uring_populate_bpf_ctx(&bpf_ctx, req);
> +
> + /*
> + * Iterate registered filters. The opcode is allowed IFF all filters
> + * return 1. If any filter returns denied, opcode will be denied.
> + */
> + do {
> + if (filter == &dummy_filter)
> + ret = 0;
> + else
> + ret = bpf_prog_run(filter->prog, &bpf_ctx);
> + if (!ret)
> + break;
> + filter = filter->next;
> + } while (filter);
I understand why you didn't want to replicate the messiness of seccomp's
arbitrary errno feature (it's almost certainly for the best), but maybe
it would be prudent to make the expected return values some special
(large) value so that you have some wiggle room for future expansion?
For instance, if you ever wanted to add support for logging (a-la
SECCOMP_RET_LOG) then it would need to be lower priority than blocking
the operation and you would need to have something like the logic in
seccomp_run_filters to return the highest priority filter return value.
(You could validate that the filter only returns IO_URING_BPF_RET_BLOCK
or 0 in the verifier.)
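Something along the lines of a sparse return-value space, e.g. (made-up
values, just to show the idea -- lower value wins, like seccomp):

        enum {
                IO_URING_BPF_RET_BLOCK  = 0x00000000,   /* highest priority */
                IO_URING_BPF_RET_LOG    = 0x7ffc0000,
                IO_URING_BPF_RET_ALLOW  = 0x7fff0000,   /* lowest priority */
        };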
> +out:
> + rcu_read_unlock();
> + return ret ? 0 : -EACCES;
> +}
> +
> +static void io_free_bpf_filters(struct rcu_head *head)
> +{
> + struct io_bpf_filter __rcu **filter;
> + struct io_bpf_filters *filters;
> + int i;
> +
> + filters = container_of(head, struct io_bpf_filters, rcu_head);
> + spin_lock(&filters->lock);
> + filter = filters->filters;
> + if (!filter) {
> + spin_unlock(&filters->lock);
> + return;
> + }
> + spin_unlock(&filters->lock);
> +
> + for (i = 0; i < IORING_OP_LAST; i++) {
> + struct io_bpf_filter *f;
> +
> + rcu_read_lock();
> + f = rcu_dereference(filter[i]);
> + while (f) {
> + struct io_bpf_filter *next = f->next;
> +
> + /*
> + * Even if stacked, dummy filter will always be last
> + * as it can only get installed into an empty spot.
> + */
> + if (f == &dummy_filter)
> + break;
> + bpf_prog_destroy(f->prog);
> + kfree(f);
> + f = next;
> + }
> + rcu_read_unlock();
> + }
> + kfree(filters->filters);
> + kfree(filters);
> +}
> +
> +static void __io_put_bpf_filters(struct io_bpf_filters *filters)
> +{
> + if (refcount_dec_and_test(&filters->refs))
> + call_rcu(&filters->rcu_head, io_free_bpf_filters);
> +}
> +
> +void io_put_bpf_filters(struct io_restriction *res)
> +{
> + if (res->bpf_filters)
> + __io_put_bpf_filters(res->bpf_filters);
> +}
> +
> +static struct io_bpf_filters *io_new_bpf_filters(void)
> +{
> + struct io_bpf_filters *filters;
> +
> + filters = kzalloc(sizeof(*filters), GFP_KERNEL_ACCOUNT);
> + if (!filters)
> + return ERR_PTR(-ENOMEM);
> +
> + filters->filters = kcalloc(IORING_OP_LAST,
> + sizeof(struct io_bpf_filter *),
> + GFP_KERNEL_ACCOUNT);
> + if (!filters->filters) {
> + kfree(filters);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + refcount_set(&filters->refs, 1);
> + spin_lock_init(&filters->lock);
> + return filters;
> +}
> +
> +/*
> + * Validate classic BPF filter instructions. Only allow a safe subset of
> + * operations - no packet data access, just context field loads and basic
> + * ALU/jump operations.
> + */
> +static int io_uring_check_cbpf_filter(struct sock_filter *filter,
> + unsigned int flen)
> +{
> + int pc;
> +
> + for (pc = 0; pc < flen; pc++) {
> + struct sock_filter *ftest = &filter[pc];
> + u16 code = ftest->code;
> + u32 k = ftest->k;
> +
> + switch (code) {
> + case BPF_LD | BPF_W | BPF_ABS:
> + ftest->code = BPF_LDX | BPF_W | BPF_ABS;
> + /* 32-bit aligned and not out of bounds. */
> + if (k >= sizeof(struct io_uring_bpf_ctx) || k & 3)
> + return -EINVAL;
> + continue;
> + case BPF_LD | BPF_W | BPF_LEN:
> + ftest->code = BPF_LD | BPF_IMM;
> + ftest->k = sizeof(struct io_uring_bpf_ctx);
> + continue;
> + case BPF_LDX | BPF_W | BPF_LEN:
> + ftest->code = BPF_LDX | BPF_IMM;
> + ftest->k = sizeof(struct io_uring_bpf_ctx);
> + continue;
> + /* Explicitly include allowed calls. */
> + case BPF_RET | BPF_K:
> + case BPF_RET | BPF_A:
> + case BPF_ALU | BPF_ADD | BPF_K:
> + case BPF_ALU | BPF_ADD | BPF_X:
> + case BPF_ALU | BPF_SUB | BPF_K:
> + case BPF_ALU | BPF_SUB | BPF_X:
> + case BPF_ALU | BPF_MUL | BPF_K:
> + case BPF_ALU | BPF_MUL | BPF_X:
> + case BPF_ALU | BPF_DIV | BPF_K:
> + case BPF_ALU | BPF_DIV | BPF_X:
> + case BPF_ALU | BPF_AND | BPF_K:
> + case BPF_ALU | BPF_AND | BPF_X:
> + case BPF_ALU | BPF_OR | BPF_K:
> + case BPF_ALU | BPF_OR | BPF_X:
> + case BPF_ALU | BPF_XOR | BPF_K:
> + case BPF_ALU | BPF_XOR | BPF_X:
> + case BPF_ALU | BPF_LSH | BPF_K:
> + case BPF_ALU | BPF_LSH | BPF_X:
> + case BPF_ALU | BPF_RSH | BPF_K:
> + case BPF_ALU | BPF_RSH | BPF_X:
> + case BPF_ALU | BPF_NEG:
> + case BPF_LD | BPF_IMM:
> + case BPF_LDX | BPF_IMM:
> + case BPF_MISC | BPF_TAX:
> + case BPF_MISC | BPF_TXA:
> + case BPF_LD | BPF_MEM:
> + case BPF_LDX | BPF_MEM:
> + case BPF_ST:
> + case BPF_STX:
> + case BPF_JMP | BPF_JA:
> + case BPF_JMP | BPF_JEQ | BPF_K:
> + case BPF_JMP | BPF_JEQ | BPF_X:
> + case BPF_JMP | BPF_JGE | BPF_K:
> + case BPF_JMP | BPF_JGE | BPF_X:
> + case BPF_JMP | BPF_JGT | BPF_K:
> + case BPF_JMP | BPF_JGT | BPF_X:
> + case BPF_JMP | BPF_JSET | BPF_K:
> + case BPF_JMP | BPF_JSET | BPF_X:
> + continue;
> + default:
> + return -EINVAL;
> + }
> + }
> + return 0;
> +}
> +
> +#define IO_URING_BPF_FILTER_FLAGS IO_URING_BPF_FILTER_DENY_REST
> +
> +int io_register_bpf_filter(struct io_restriction *res,
> + struct io_uring_bpf __user *arg)
> +{
> + struct io_bpf_filter *filter, *old_filter;
> + struct io_bpf_filters *filters;
> + struct io_uring_bpf reg;
> + struct bpf_prog *prog;
> + struct sock_fprog fprog;
> + int ret;
> +
> + if (copy_from_user(&reg, arg, sizeof(reg)))
> + return -EFAULT;
> + if (reg.cmd_type != IO_URING_BPF_CMD_FILTER)
> + return -EINVAL;
> + if (reg.cmd_flags || reg.resv)
> + return -EINVAL;
> +
> + if (reg.filter.opcode >= IORING_OP_LAST)
> + return -EINVAL;
> + if (reg.filter.flags & ~IO_URING_BPF_FILTER_FLAGS)
> + return -EINVAL;
> + if (reg.filter.resv)
> + return -EINVAL;
> + if (!mem_is_zero(reg.filter.resv2, sizeof(reg.filter.resv2)))
> + return -EINVAL;
> + if (!reg.filter.filter_len || reg.filter.filter_len > BPF_MAXINSNS)
> + return -EINVAL;
Similar question to my other mail about copy_struct_from_user().
> +
> + fprog.len = reg.filter.filter_len;
> + fprog.filter = u64_to_user_ptr(reg.filter.filter_ptr);
> +
> + ret = bpf_prog_create_from_user(&prog, &fprog,
> + io_uring_check_cbpf_filter, false);
> + if (ret)
> + return ret;
> +
> + /*
> + * No existing filters, allocate set.
> + */
> + filters = res->bpf_filters;
> + if (!filters) {
> + filters = io_new_bpf_filters();
> + if (IS_ERR(filters)) {
> + ret = PTR_ERR(filters);
> + goto err_prog;
> + }
> + }
> +
> + filter = kzalloc(sizeof(*filter), GFP_KERNEL_ACCOUNT);
> + if (!filter) {
> + ret = -ENOMEM;
> + goto err;
> + }
> + filter->prog = prog;
> + res->bpf_filters = filters;
> +
> + /*
> + * Insert filter - if the current opcode already has a filter
> + * attached, add to the set.
> + */
> + rcu_read_lock();
> + spin_lock_bh(&filters->lock);
> + old_filter = rcu_dereference(filters->filters[reg.filter.opcode]);
> + if (old_filter)
> + filter->next = old_filter;
> + rcu_assign_pointer(filters->filters[reg.filter.opcode], filter);
> +
> + /*
> + * If IO_URING_BPF_FILTER_DENY_REST is set, fill any unregistered
> + * opcode with the dummy filter. That will cause them to be denied.
> + */
> + if (reg.filter.flags & IO_URING_BPF_FILTER_DENY_REST) {
> + for (int i = 0; i < IORING_OP_LAST; i++) {
> + if (i == reg.filter.opcode)
> + continue;
> + old_filter = rcu_dereference(filters->filters[i]);
> + if (old_filter)
> + continue;
> + rcu_assign_pointer(filters->filters[i], &dummy_filter);
> + }
> + }
> +
> + spin_unlock_bh(&filters->lock);
> + rcu_read_unlock();
> + return 0;
> +err:
> + if (filters != res->bpf_filters)
> + __io_put_bpf_filters(filters);
> +err_prog:
> + bpf_prog_destroy(prog);
> + return ret;
> +}
> diff --git a/io_uring/bpf_filter.h b/io_uring/bpf_filter.h
> new file mode 100644
> index 000000000000..27eae9705473
> --- /dev/null
> +++ b/io_uring/bpf_filter.h
> @@ -0,0 +1,42 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#ifndef IO_URING_BPF_FILTER_H
> +#define IO_URING_BPF_FILTER_H
> +
> +#include <uapi/linux/io_uring/bpf_filter.h>
> +
> +#ifdef CONFIG_IO_URING_BPF
> +
> +int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req);
> +
> +int io_register_bpf_filter(struct io_restriction *res,
> + struct io_uring_bpf __user *arg);
> +
> +void io_put_bpf_filters(struct io_restriction *res);
> +
> +static inline int io_uring_run_bpf_filters(struct io_restriction *res,
> + struct io_kiocb *req)
> +{
> + if (res->bpf_filters)
> + return __io_uring_run_bpf_filters(res, req);
> +
> + return 0;
> +}
> +
> +#else
> +
> +static inline int io_register_bpf_filter(struct io_restriction *res,
> + struct io_uring_bpf __user *arg)
> +{
> + return -EINVAL;
> +}
> +static inline int io_uring_run_bpf_filters(struct io_restriction *res,
> + struct io_kiocb *req)
> +{
> + return 0;
> +}
> +static inline void io_put_bpf_filters(struct io_restriction *res)
> +{
> +}
> +#endif /* CONFIG_IO_URING_BPF */
> +
> +#endif
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 2cde22af78a3..67533e494836 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -93,6 +93,7 @@
> #include "rw.h"
> #include "alloc_cache.h"
> #include "eventfd.h"
> +#include "bpf_filter.h"
>
> #define SQE_COMMON_FLAGS (IOSQE_FIXED_FILE | IOSQE_IO_LINK | \
> IOSQE_IO_HARDLINK | IOSQE_ASYNC)
> @@ -2261,6 +2262,12 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> if (unlikely(ret))
> return io_submit_fail_init(sqe, req, ret);
>
> + if (unlikely(ctx->restrictions.bpf_filters)) {
> + ret = io_uring_run_bpf_filters(&ctx->restrictions, req);
> + if (ret)
> + return io_submit_fail_init(sqe, req, ret);
> + }
> +
> trace_io_uring_submit_req(req);
>
> /*
> @@ -2850,6 +2857,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
> percpu_ref_exit(&ctx->refs);
> free_uid(ctx->user);
> io_req_caches_free(ctx);
> + io_put_bpf_filters(&ctx->restrictions);
>
> WARN_ON_ONCE(ctx->nr_req_allocated);
>
> diff --git a/io_uring/register.c b/io_uring/register.c
> index 8551f13920dc..30957c2cb5eb 100644
> --- a/io_uring/register.c
> +++ b/io_uring/register.c
> @@ -33,6 +33,7 @@
> #include "memmap.h"
> #include "zcrx.h"
> #include "query.h"
> +#include "bpf_filter.h"
>
> #define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
> IORING_REGISTER_LAST + IORING_OP_LAST)
> @@ -830,6 +831,13 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
> case IORING_REGISTER_ZCRX_CTRL:
> ret = io_zcrx_ctrl(ctx, arg, nr_args);
> break;
> + case IORING_REGISTER_BPF_FILTER:
> + ret = -EINVAL;
> +
> + if (nr_args != 1)
> + break;
> + ret = io_register_bpf_filter(&ctx->restrictions, arg);
> + break;
> default:
> ret = -EINVAL;
> break;
> --
> 2.51.0
>
>
--
Aleksa Sarai
https://www.cyphar.com/
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/6] io_uring: add support for BPF filtering for opcode restrictions
2026-01-19 18:51 ` Aleksa Sarai
@ 2026-01-19 20:17 ` Jens Axboe
0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2026-01-19 20:17 UTC (permalink / raw)
To: Aleksa Sarai; +Cc: io-uring, brauner, Kees Cook, Jann Horn
On 1/19/26 11:51 AM, Aleksa Sarai wrote:
> On 2026-01-18, Jens Axboe <axboe@kernel.dk> wrote:
>> This adds support for loading BPF programs with io_uring, which can
>> restrict the opcodes executed. Unlike IORING_REGISTER_RESTRICTIONS,
>> using BPF programs allows fine-grained control over both the opcode in
>> question, as well as other data associated with the request. This
>> initial patch just supports whatever is in the io_kiocb for filtering,
>> but shortly opcode specific support will be added.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>> include/linux/io_uring_types.h | 9 +
>> include/uapi/linux/io_uring.h | 3 +
>> include/uapi/linux/io_uring/bpf_filter.h | 47 ++++
>> io_uring/Kconfig | 5 +
>> io_uring/Makefile | 1 +
>> io_uring/bpf_filter.c | 328 +++++++++++++++++++++++
>> io_uring/bpf_filter.h | 42 +++
>> io_uring/io_uring.c | 8 +
>> io_uring/register.c | 8 +
>> 9 files changed, 451 insertions(+)
>> create mode 100644 include/uapi/linux/io_uring/bpf_filter.h
>> create mode 100644 io_uring/bpf_filter.c
>> create mode 100644 io_uring/bpf_filter.h
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 211686ad89fd..37f0a5f7b2f4 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -219,9 +219,18 @@ struct io_rings {
>> struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp;
>> };
>>
>> +struct io_bpf_filter;
>> +struct io_bpf_filters {
>> + refcount_t refs; /* ref for ->bpf_filters */
>> + spinlock_t lock; /* protects ->bpf_filters modifications */
>> + struct io_bpf_filter __rcu **filters;
>> + struct rcu_head rcu_head;
>> +};
>> +
>> struct io_restriction {
>> DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
>> DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
>> + struct io_bpf_filters *bpf_filters;
>> u8 sqe_flags_allowed;
>> u8 sqe_flags_required;
>> /* IORING_OP_* restrictions exist */
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index b5b23c0d5283..94669b77fee8 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -700,6 +700,9 @@ enum io_uring_register_op {
>> /* auxiliary zcrx configuration, see enum zcrx_ctrl_op */
>> IORING_REGISTER_ZCRX_CTRL = 36,
>>
>> + /* register bpf filtering programs */
>> + IORING_REGISTER_BPF_FILTER = 37,
>> +
>> /* this goes last */
>> IORING_REGISTER_LAST,
>>
>> diff --git a/include/uapi/linux/io_uring/bpf_filter.h b/include/uapi/linux/io_uring/bpf_filter.h
>> new file mode 100644
>> index 000000000000..14bd5b7468a7
>> --- /dev/null
>> +++ b/include/uapi/linux/io_uring/bpf_filter.h
>> @@ -0,0 +1,47 @@
>> +/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
>> +/*
>> + * Header file for the io_uring BPF filters.
>> + */
>> +#ifndef LINUX_IO_URING_BPF_FILTER_H
>> +#define LINUX_IO_URING_BPF_FILTER_H
>> +
>> +#include <linux/types.h>
>> +
>> +struct io_uring_bpf_ctx {
>> + __u8 opcode;
>> + __u8 sqe_flags;
>> + __u8 pad[6];
>> + __u64 user_data;
>> + __u64 resv[6];
>> +};
>
> I had more envisioned this as operating on SQEs directly, not on an
> intermediate representation.
>
> I get why this is much simpler to deal with (operating on the SQE
> directly is going to be racy as a malicious process could change the
> argument values after the filters have run -- this is kind of like the
> classic ptrace-seccomp hole), but since SQEs are a fixed size it seems
> like the most natural analogue to seccomp's model.
>
> While it is a tad ugly, AFAICS (looking at io_socket_prep) this would
> also let you filter socket(2) directly which would obviate the need for
> the second patch in this series.
It fundamentally cannot operate on the sqe directly, exactly for the
reasons you outline. We need to move the data to stable storage first,
which is why I did it that way. It has to be done past the prep stage
for that reason.
> That being said, if you do really want to go with a custom
> representation, I would suggest including a size and making it variable
There's no other choice than going with a custom representation...
IMHO it also makes it much cleaner. You really don't want filters
dealing with the intricacies of SQE layout. For custom opcode filters,
they become much easier to reason about if they are working with a
sub-struct in the union.
> size so that filtering pointers (especially of extensible struct
> syscalls like openat2) is more trivial to accomplish in the future. [1]
> is the model we came up with for seccomp, which suffers from having to
> deal with arbitrary syscall bodies but if you have your own
> representation you can end up with something reasonably extensible
> without the baggage.
>
> [1]: https://www.youtube.com/watch?v=CHpLLR0CwSw
I'll take a look at that - fwiw, another reason why we need a custom
struct is precisely so we can filter on things that need to be brought
into the kernel first. Things like open_how for openat2 variants, for
example. Otherwise this would not be possible. We can do that in a clean
way, and never have to deal with any kind of pointers.
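As a rough sketch of where that ends up (the actual IORING_OP_SOCKET
layout is what patch 2 adds, so details may differ from this
illustration):

        struct io_uring_bpf_ctx {
                __u8  opcode;
                __u8  sqe_flags;
                __u8  pad[6];
                __u64 user_data;
                union {
                        /* IORING_OP_SOCKET */
                        struct {
                                __u32 domain;
                                __u32 type;
                                __u32 protocol;
                        } socket;
                        __u64 resv[6];
                };
        };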
>> +enum {
>> + /*
>> + * If set, any currently unset opcode will have a deny filter attached
>> + */
>> + IO_URING_BPF_FILTER_DENY_REST = 1,
>> +};
>> +
>> +struct io_uring_bpf_filter {
>> + __u32 opcode; /* io_uring opcode to filter */
>> + __u32 flags;
>> + __u32 filter_len; /* number of BPF instructions */
>> + __u32 resv;
>> + __u64 filter_ptr; /* pointer to BPF filter */
>> + __u64 resv2[5];
>> +};
>
> Since io_uring_bpf_ctx contains the opcode, it seems a little strange
> that you would require userspace to set up a separate filter for every
> opcode they wish to filter. seccomp lets you just have one filter for
> all syscall numbers -- I can see the argument that this lets you build
> optimised filters for each opcode but having to set up filters for 65
> opcodes (at time of writing) seems less than optimal...
I agree it's a bit cumbersome. But if you just want to filter a specific
opcode, that's easily doable with the current non-bpf restrictions. You
only really need the BPF filter if you want to allow an opcode, but
restrict certain parts of it.
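E.g. something like the below with the existing API (sketch, error
handling omitted; the ring has to be created with
IORING_SETUP_R_DISABLED and enabled after the restrictions go in):

        #include <linux/io_uring.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static int restrict_ring(int ring_fd)
        {
                /* allow-list just NOP and READ on this ring */
                struct io_uring_restriction res[] = {
                        { .opcode = IORING_RESTRICTION_SQE_OP, .sqe_op = IORING_OP_NOP },
                        { .opcode = IORING_RESTRICTION_SQE_OP, .sqe_op = IORING_OP_READ },
                };

                if (syscall(__NR_io_uring_register, ring_fd,
                            IORING_REGISTER_RESTRICTIONS, res, 2) < 0)
                        return -1;
                /* ring must have been created with IORING_SETUP_R_DISABLED */
                return syscall(__NR_io_uring_register, ring_fd,
                               IORING_REGISTER_ENABLE_RINGS, NULL, 0);
        }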
> The optimisation seccomp uses for filters that simply blanket allow a
> syscall is to pre-compute a cached bitmap of those syscalls to avoid
> running the filter for those syscalls (see seccomp_cache_prepare). That
> seems like a more practical solution which provides a similar (if not
> better) optimisation for the allow-filter case.
But I don't think that's useful in this case, as there's already an
existing mechanism to do that in io_uring. And with this patchset,
particularly the
last patch, those can also get set per task and inherited, so that any
new ring will abide by them.
>> + /*
>> + * Iterate registered filters. The opcode is allowed IFF all filters
>> + * return 1. If any filter returns denied, opcode will be denied.
>> + */
>> + do {
>> + if (filter == &dummy_filter)
>> + ret = 0;
>> + else
>> + ret = bpf_prog_run(filter->prog, &bpf_ctx);
>> + if (!ret)
>> + break;
>> + filter = filter->next;
>> + } while (filter);
>
> I understand why you didn't want to replicate the messiness of seccomp's
> arbitrary errno feature (it's almost certainly for the best), but maybe
> it would be prudent to make the expected return values some special
> (large) value so that you have some wiggle room for future expansion?
>
> For instance, if you ever wanted to add support for logging (a-la
> SECCOMP_RET_LOG) then it would need to be lower priority than blocking
> the operation and you would need to have something like the logic in
> seccomp_run_filters to return the highest priority filter return value.
>
> (You could validate that the filter only returns IO_URING_BPF_RET_BLOCK
> or 0 in the verifier.)
Can't you just do that with any value? 0 is currently DENY, 1 is ALLOW,
those are the only documented filter return values. You could just have
2 be DENY_LOG or whatever, and 3 be ALLOW_LOG and so forth. Don't really
see why you'd need to limit yourself to 0..1 as it currently stands.
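For illustration, a filter like the below only ever returns 0 or 1
today, which leaves the rest of the value space open (sketch; the
BPF_ABS offset is into struct io_uring_bpf_ctx and loads the low 32
bits of user_data on a little-endian machine):

        #include <linux/filter.h>
        #include <linux/io_uring.h>
        #include <linux/io_uring/bpf_filter.h>  /* header added by this series */
        #include <stddef.h>
        #include <stdint.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* allow only requests whose (low 32 bits of) user_data is below 0x1000 */
        static struct sock_filter insns[] = {
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                         offsetof(struct io_uring_bpf_ctx, user_data)),
                BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 0x1000, 0, 1),
                BPF_STMT(BPF_RET | BPF_K, 0),   /* deny */
                BPF_STMT(BPF_RET | BPF_K, 1),   /* allow */
        };

        static int register_filter(int ring_fd)
        {
                struct io_uring_bpf reg = {
                        .cmd_type = IO_URING_BPF_CMD_FILTER,
                        .filter = {
                                .opcode     = IORING_OP_READ,
                                .filter_len = sizeof(insns) / sizeof(insns[0]),
                                .filter_ptr = (uintptr_t)insns,
                        },
                };

                /* ring_fd == -1 would register with the task instead (patch 6) */
                return syscall(__NR_io_uring_register, ring_fd,
                               IORING_REGISTER_BPF_FILTER, &reg, 1);
        }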
>> +int io_register_bpf_filter(struct io_restriction *res,
>> + struct io_uring_bpf __user *arg)
>> +{
>> + struct io_bpf_filter *filter, *old_filter;
>> + struct io_bpf_filters *filters;
>> + struct io_uring_bpf reg;
>> + struct bpf_prog *prog;
>> + struct sock_fprog fprog;
>> + int ret;
>> +
>> + if (copy_from_user(&reg, arg, sizeof(reg)))
>> + return -EFAULT;
>> + if (reg.cmd_type != IO_URING_BPF_CMD_FILTER)
>> + return -EINVAL;
>> + if (reg.cmd_flags || reg.resv)
>> + return -EINVAL;
>> +
>> + if (reg.filter.opcode >= IORING_OP_LAST)
>> + return -EINVAL;
>> + if (reg.filter.flags & ~IO_URING_BPF_FILTER_FLAGS)
>> + return -EINVAL;
>> + if (reg.filter.resv)
>> + return -EINVAL;
>> + if (!mem_is_zero(reg.filter.resv2, sizeof(reg.filter.resv2)))
>> + return -EINVAL;
>> + if (!reg.filter.filter_len || reg.filter.filter_len > BPF_MAXINSNS)
>> + return -EINVAL;
>
> Similar question to my other mail about copy_struct_from_user().
Ah, missed that in the other email. It'd help a lot if you trim your
replies and only quote relevant parts - it's a lot more efficient, and
makes it much easier not to miss content. Sure we can use
copy_struct_from_user().
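I.e. something like this on the registration side, assuming a
user-supplied size gets plumbed through (sketch only):

        struct io_uring_bpf reg;
        int ret;

        /* zero-fills the tail, rejects unknown trailing bits with -E2BIG */
        ret = copy_struct_from_user(&reg, sizeof(reg), arg, usize);
        if (ret)
                return ret;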
--
Jens Axboe
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 6/6] io_uring: allow registration of per-task restrictions
2026-01-19 17:54 ` Aleksa Sarai
2026-01-19 18:02 ` Jens Axboe
@ 2026-01-19 20:29 ` Jens Axboe
1 sibling, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2026-01-19 20:29 UTC (permalink / raw)
To: Aleksa Sarai; +Cc: io-uring, brauner, Jann Horn, Kees Cook
>> +static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
>> +{
>> + struct io_uring_task_restriction __user *ures = arg;
>> + struct io_uring_task_restriction tres;
>> + struct io_restriction *res;
>> + int ret;
>
> You almost certainly want to copy the seccomp logic of disallowing the
> setting of restrictions unless no_new_privs is set or the process has
> CAP_SYS_ADMIN.
(this is why I missed it, it's 288 lines down of quoted email?)
Good point, yes can do.
> While seccomp is more dangerous in this respect (as it allows you to
> modify the return value of a syscall), being able to alter the execution
> of setuid binaries usually leads to security issues, so it's probably
> best to just copy what seccomp does here.
Agree, that was kind of my goal, just largely mimic that part to avoid
surprises.
>> + /* Disallow if task already has registered restrictions */
>> + if (current->io_uring_restrict)
>> + return -EPERM;
>
> I guess specifying "stacked" restrictions (a-la seccomp) is intended as
> future work?
You can stack already, you just stack within the current set.
> This is kind of critical for both nesting use-cases and for making this
> usable more widely (I imagine systemd will want to set system-wide
> restrictions which would lock out programs from being able to set their
> own process-wide restrictions -- nested containers are also a fairly
> common use-case these days too).
>
> (For containers we would probably only really use the cBPF stuff but it
> would be nice for them to both be stackable -- if only for the reason
> that you could set them in any order.)
Agree and this is why you can already do that.
>> + if (nr_args != 1)
>> + return -EINVAL;
>> +
>> + if (copy_from_user(&tres, arg, sizeof(tres)))
>> + return -EFAULT;
>> +
>> + if (tres.flags)
>> + return -EINVAL;
>> + if (!mem_is_zero(tres.resv, sizeof(tres.resv)))
>> + return -EINVAL;
>
> I would suggest using copy_struct_from_user() to make extensions easier,
> but I don't know if that is the kind of thing you feel necessary for
> io_uring APIs.
I don't disagree with that, but the io_uring uapi in this regard has
been extensible in the past with just reserved fields. Hence I'd rather
just stick with that approach, rather than make this case "special".
>> +static int io_register_bpf_filter_task(void __user *arg, unsigned int nr_args)
>> +{
>> + struct io_restriction *res;
>> + int ret;
>> +
>> + if (nr_args != 1)
>> + return -EINVAL;
>
> Same comment as above about no_new_privs / CAP_SYS_ADMIN.
Agree, will make that change.
--
Jens Axboe
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2026-01-19 20:29 UTC | newest]
Thread overview: 12+ messages
2026-01-18 17:16 [PATCHSET v5] Inherited restrictions and BPF filtering Jens Axboe
2026-01-18 17:16 ` [PATCH 1/6] io_uring: add support for BPF filtering for opcode restrictions Jens Axboe
2026-01-19 18:51 ` Aleksa Sarai
2026-01-19 20:17 ` Jens Axboe
2026-01-18 17:16 ` [PATCH 2/6] io_uring/net: allow filtering on IORING_OP_SOCKET data Jens Axboe
2026-01-18 17:16 ` [PATCH 3/6] io_uring/bpf_filter: cache lookup table in ctx->bpf_filters Jens Axboe
2026-01-18 17:16 ` [PATCH 4/6] io_uring/bpf_filter: add ref counts to struct io_bpf_filter Jens Axboe
2026-01-18 17:16 ` [PATCH 5/6] io_uring: add task fork hook Jens Axboe
2026-01-18 17:16 ` [PATCH 6/6] io_uring: allow registration of per-task restrictions Jens Axboe
2026-01-19 17:54 ` Aleksa Sarai
2026-01-19 18:02 ` Jens Axboe
2026-01-19 20:29 ` Jens Axboe