* [PATCHSET RFC 0/2] Per-task io_uring opcode restrictions
From: Jens Axboe @ 2026-01-08 20:17 UTC (permalink / raw)
To: io-uring
Hi,
One common complaint is that io_uring doesn't work with seccomp. That
is true, as seccomp is designed entirely around classic synchronous
syscalls: if you can filter what you need based on a syscall number and
its arguments, then it's fine. For anything else, it doesn't really
work. For solutions that rely on syscall filtering, e.g. Docker, there's
really not much you can do with seccomp short of disabling io_uring
entirely. That's not ideal.
As I do think that's a gap we have that needs closing, here's an RFC
attempt at that. Suggestions more than welcome! I want to arrive at
something that works for the various use cases.
io_uring already has a filtering mechanism for opcodes, however it needs
to be done after a ring has been created. The ring is created in a
disabled state, and then restrictions are applied, and finally the ring
is enabled so it can get used. This is cumbersome and doesn't
necessarily fit everybody's needs.
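For reference, the existing per-ring flow described above can be sketched
with raw syscalls roughly as follows. This is a minimal sketch, not code
from this series: setup_restricted_ring() is a hypothetical helper name,
the 8-entry ring size and the 8-entry cap on the list are arbitrary, and it
assumes uapi headers new enough to provide IORING_SETUP_R_DISABLED and
struct io_uring_restriction. liburing wraps the same three steps.

```c
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a ring that only accepts the SQE opcodes
 * listed in allowed_ops. Returns the ring fd, or a negative errno value.
 */
static int setup_restricted_ring(const __u8 *allowed_ops, unsigned int nr_ops)
{
	struct io_uring_restriction res[8];
	struct io_uring_params p;
	unsigned int i;
	int fd, err;

	if (nr_ops > 8)
		return -EINVAL;

	/* Step 1: create the ring in a disabled state. */
	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_R_DISABLED;
	fd = (int)syscall(__NR_io_uring_setup, 8, &p);
	if (fd < 0)
		return -errno;

	/* Step 2: register one allow-list entry per permitted opcode. */
	memset(res, 0, sizeof(res));
	for (i = 0; i < nr_ops; i++) {
		res[i].opcode = IORING_RESTRICTION_SQE_OP;
		res[i].sqe_op = allowed_ops[i];
	}
	if (syscall(__NR_io_uring_register, fd, IORING_REGISTER_RESTRICTIONS,
		    res, nr_ops) < 0)
		goto fail;

	/* Step 3: enable the ring so that it can be used. */
	if (syscall(__NR_io_uring_register, fd, IORING_REGISTER_ENABLE_RINGS,
		    NULL, 0) < 0)
		goto fail;
	return fd;
fail:
	err = errno;
	close(fd);
	return -err;
}
```

Every ring has to repeat all three steps, which is the part a sandboxing
runtime cannot enforce from the outside.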
This series adds support for extending that same filtering of opcodes
and register operations to something that can be applied to the task as
a whole. Once applied, any ring created under that task will have these
restrictions applied. Patch 1 adds the basic support for this, and patch
2 adds support for having the restrictions applied at fork or thread
create time too, so any task or thread created under the current task
will get the same restrictions.
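As a rough illustration of what a caller might look like, here is a
userspace sketch. Everything prefixed sketch_/SKETCH_ is a local stand-in:
the opcode value 37, the INHERIT bit and the struct layout are copied from
the patches below and are not in any released uapi header, so they may
change or collide with something else by the time this lands. Passing -1 as
the fd assumes the registration goes through the ring-less ("blind")
register path, as the register.c hunk in patch 1 suggests.

```c
#include <errno.h>
#include <linux/io_uring.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values taken from this RFC; not present in released uapi headers. */
#define SKETCH_REGISTER_RESTRICTIONS_TASK	37
#define SKETCH_REG_RESTRICTIONS_INHERIT		(1U << 0)

/* Local mirror of the proposed struct io_uring_task_restriction. */
struct sketch_task_restriction {
	__u16 flags;
	__u16 nr_res;
	__u32 resv[3];
	struct io_uring_restriction restrictions[];
};

/* Build an allow-list of SQE opcodes; the caller frees the result. */
struct sketch_task_restriction *
sketch_build(const __u8 *allowed_ops, __u16 nr_ops, __u16 flags)
{
	struct sketch_task_restriction *tres;
	__u16 i;

	/* calloc() zeroes resv[] and the reserved fields of each entry. */
	tres = calloc(1, sizeof(*tres) +
			 nr_ops * sizeof(tres->restrictions[0]));
	if (!tres)
		return NULL;
	tres->flags = flags;
	tres->nr_res = nr_ops;
	for (i = 0; i < nr_ops; i++) {
		tres->restrictions[i].opcode = IORING_RESTRICTION_SQE_OP;
		tres->restrictions[i].sqe_op = allowed_ops[i];
	}
	return tres;
}

/* Register against the calling task; fd is -1 as no ring is involved. */
int sketch_register(struct sketch_task_restriction *tres)
{
	if (syscall(__NR_io_uring_register, -1,
		    SKETCH_REGISTER_RESTRICTIONS_TASK, tres, 1) < 0)
		return -errno;
	return 0;
}
```

A sandboxing parent would call sketch_build() with the opcodes it wants to
permit, then sketch_register() before starting the workload. Do not run
sketch_register() on a kernel without these patches, as opcode 37 may mean
something else there.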
A few test cases can be found in liburing, in the task-restrictions
branch:
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/liburing.git/log/?h=task-restrictions
include/linux/io_uring.h | 2 +-
include/linux/io_uring_types.h | 2 ++
include/linux/sched.h | 1 +
include/uapi/linux/io_uring.h | 16 ++++++++++++++
io_uring/io_uring.c | 10 +++++++++
io_uring/register.c | 39 ++++++++++++++++++++++++++++++++++
io_uring/tctx.c | 23 ++++++++++++--------
kernel/fork.c | 6 ++++++
8 files changed, 89 insertions(+), 10 deletions(-)
--
Jens Axboe
* [PATCH 1/2] io_uring: allow registration of per-task restrictions
From: Jens Axboe @ 2026-01-08 20:17 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
Currently io_uring supports restricting operations on a per-ring basis.
To use those, the ring must be setup in a disabled state by setting
IORING_SETUP_R_DISABLED. Then restrictions can be set for the ring, and
the ring can then be enabled.
This commit adds support for IORING_REGISTER_RESTRICTIONS_TASK, which
allows registering the same kind of restrictions, but with the task
itself rather than with a specific ring. Once done, any ring created
by that task will inherit these restrictions.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/io_uring.h | 2 +-
include/linux/sched.h | 1 +
include/uapi/linux/io_uring.h | 9 +++++++++
io_uring/io_uring.c | 10 ++++++++++
io_uring/register.c | 36 +++++++++++++++++++++++++++++++++++
io_uring/tctx.c | 21 +++++++++++---------
kernel/fork.c | 1 +
7 files changed, 70 insertions(+), 10 deletions(-)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 85fe4e6b275c..cfd2f4c667ee 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -25,7 +25,7 @@ static inline void io_uring_task_cancel(void)
}
static inline void io_uring_free(struct task_struct *tsk)
{
- if (tsk->io_uring)
+ if (tsk->io_uring || tsk->io_uring_restrict)
__io_uring_free(tsk);
}
#else
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d395f2810fac..9abbd11bb87c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1190,6 +1190,7 @@ struct task_struct {
#ifdef CONFIG_IO_URING
struct io_uring_task *io_uring;
+ struct io_restriction *io_uring_restrict;
#endif
/* Namespaces: */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index b5b23c0d5283..3ecf9c1bfa2d 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -700,6 +700,8 @@ enum io_uring_register_op {
/* auxiliary zcrx configuration, see enum zcrx_ctrl_op */
IORING_REGISTER_ZCRX_CTRL = 36,
+ IORING_REGISTER_RESTRICTIONS_TASK = 37,
+
/* this goes last */
IORING_REGISTER_LAST,
@@ -805,6 +807,13 @@ struct io_uring_restriction {
__u32 resv2[3];
};
+struct io_uring_task_restriction {
+ __u16 flags;
+ __u16 nr_res;
+ __u32 resv[3];
+ __DECLARE_FLEX_ARRAY(struct io_uring_restriction, restrictions);
+};
+
struct io_uring_clock_register {
__u32 clockid;
__u32 __resv[3];
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1aebdba425e8..044da739ed0b 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3608,6 +3608,16 @@ static __cold int io_uring_create(struct io_ctx_config *config)
else
ctx->notify_method = TWA_SIGNAL;
+ /*
+ * If the current task has restrictions enabled, then copy them to
+ * our newly created ring and mark it as registered.
+ */
+ if (current->io_uring_restrict) {
+ memcpy(&ctx->restrictions, current->io_uring_restrict, sizeof(ctx->restrictions));
+ ctx->restrictions.registered = true;
+ ctx->restricted = true;
+ }
+
/*
* This is just grabbed for accounting purposes. When a process exits,
* the mm is exited and dropped before the files, hence we need to hang
diff --git a/io_uring/register.c b/io_uring/register.c
index 62d39b3ff317..eac7a6da32b4 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -175,6 +175,40 @@ static __cold int io_register_restrictions(struct io_ring_ctx *ctx,
return ret;
}
+static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
+{
+ struct io_uring_task_restriction __user *ures = arg;
+ struct io_uring_task_restriction tres;
+ struct io_restriction *res;
+ int ret;
+
+ /* Disallow if task already has registered restrictions */
+ if (current->io_uring_restrict)
+ return -EBUSY;
+ if (nr_args != 1)
+ return -EINVAL;
+
+ if (copy_from_user(&tres, arg, sizeof(tres)))
+ return -EFAULT;
+
+ if (tres.flags)
+ return -EINVAL;
+ if (!mem_is_zero(tres.resv, sizeof(tres.resv)))
+ return -EINVAL;
+
+ res = kzalloc(sizeof(*res), GFP_KERNEL);
+ if (!res)
+ return -ENOMEM;
+
+ ret = io_parse_restrictions(ures->restrictions, tres.nr_res, res);
+ if (ret) {
+ kfree(res);
+ return ret;
+ }
+ current->io_uring_restrict = res;
+ return 0;
+}
+
static int io_register_enable_rings(struct io_ring_ctx *ctx)
{
if (!(ctx->flags & IORING_SETUP_R_DISABLED))
@@ -889,6 +923,8 @@ static int io_uring_register_blind(unsigned int opcode, void __user *arg,
return io_uring_register_send_msg_ring(arg, nr_args);
case IORING_REGISTER_QUERY:
return io_query(arg, nr_args);
+ case IORING_REGISTER_RESTRICTIONS_TASK:
+ return io_register_restrictions_task(arg, nr_args);
}
return -EINVAL;
}
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 5b66755579c0..c8ad735936dc 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -54,16 +54,19 @@ void __io_uring_free(struct task_struct *tsk)
* node is stored in the xarray. Until that gets sorted out, attempt
* an iteration here and warn if any entries are found.
*/
- xa_for_each(&tctx->xa, index, node) {
- WARN_ON_ONCE(1);
- break;
- }
- WARN_ON_ONCE(tctx->io_wq);
- WARN_ON_ONCE(tctx->cached_refs);
+ if (tctx) {
+ xa_for_each(&tctx->xa, index, node) {
+ WARN_ON_ONCE(1);
+ break;
+ }
+ WARN_ON_ONCE(tctx->io_wq);
+ WARN_ON_ONCE(tctx->cached_refs);
- percpu_counter_destroy(&tctx->inflight);
- kfree(tctx);
- tsk->io_uring = NULL;
+ percpu_counter_destroy(&tctx->inflight);
+ kfree(tctx);
+ tsk->io_uring = NULL;
+ }
+ kfree(tsk->io_uring_restrict);
}
__cold int io_uring_alloc_task_context(struct task_struct *task,
diff --git a/kernel/fork.c b/kernel/fork.c
index b1f3915d5f8e..6081e1c93e21 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2129,6 +2129,7 @@ __latent_entropy struct task_struct *copy_process(
#ifdef CONFIG_IO_URING
p->io_uring = NULL;
+ p->io_uring_restrict = NULL;
#endif
p->default_timer_slack_ns = current->timer_slack_ns;
--
2.51.0
* [PATCH 2/2] io_uring/register: add support for inheriting task restrictions
From: Jens Axboe @ 2026-01-08 20:17 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
By default, the registered task restrictions only apply to the task they
were registered for. Any forked tasks or created threads will not
inherit them.
However, if IORING_REG_RESTRICTIONS_INHERIT is set when registering the
task restrictions, then they will be inherited across process fork or
thread creation.
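The lifetime rules this implies can be modelled in plain userspace C. The
following is only a sketch of the intended reference counting, not kernel
code: the model_* names are made up, a plain counter stands in for
refcount_t, and -1 stands in for the kernel's error codes.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Userspace stand-ins for struct io_restriction and task_struct. */
struct model_restriction {
	unsigned int refs;	/* plain counter standing in for refcount_t */
	bool inherited;
};

struct model_task {
	struct model_restriction *restrict_set;
};

/* Registration: the registering task holds the first reference. */
int model_register(struct model_task *t, bool inherit)
{
	struct model_restriction *res;

	if (t->restrict_set)
		return -1;	/* the patch returns -EBUSY here */
	res = calloc(1, sizeof(*res));
	if (!res)
		return -1;	/* -ENOMEM in the patch */
	res->inherited = inherit;
	res->refs = 1;
	t->restrict_set = res;
	return 0;
}

/* Fork: the child starts as a copy of the parent, as in copy_process(). */
void model_fork(const struct model_task *parent, struct model_task *child)
{
	*child = *parent;
	if (child->restrict_set && child->restrict_set->inherited)
		child->restrict_set->refs++;
	else
		child->restrict_set = NULL;
}

/* Exit: drop one reference. Returns true if this call freed the set. */
bool model_exit(struct model_task *t)
{
	struct model_restriction *res = t->restrict_set;

	t->restrict_set = NULL;
	if (res && --res->refs == 0) {
		free(res);
		return true;
	}
	return false;
}
```

With inheritance requested, parent and child share one set and whichever
exits last frees it. Without it, the child's pointer is cleared and only
the parent ever holds a reference.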
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/io_uring_types.h | 2 ++
include/uapi/linux/io_uring.h | 7 +++++++
io_uring/register.c | 5 ++++-
io_uring/tctx.c | 4 +++-
kernel/fork.c | 7 ++++++-
5 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 54fd30abf2b8..b63b927d8718 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -222,9 +222,11 @@ struct io_rings {
struct io_restriction {
DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
+ refcount_t refs;
u8 sqe_flags_allowed;
u8 sqe_flags_required;
bool registered;
+ bool inherited;
};
struct io_submit_link {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 3ecf9c1bfa2d..8d671b5e33e3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -807,6 +807,13 @@ struct io_uring_restriction {
__u32 resv2[3];
};
+enum {
+ /*
+ * Registered restrictions are inherited for a fork.
+ */
+ IORING_REG_RESTRICTIONS_INHERIT = (1U << 0),
+};
+
struct io_uring_task_restriction {
__u16 flags;
__u16 nr_res;
diff --git a/io_uring/register.c b/io_uring/register.c
index eac7a6da32b4..36573b362225 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -191,7 +191,7 @@ static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
if (copy_from_user(&tres, arg, sizeof(tres)))
return -EFAULT;
- if (tres.flags)
+ if (tres.flags & ~IORING_REG_RESTRICTIONS_INHERIT)
return -EINVAL;
if (!mem_is_zero(tres.resv, sizeof(tres.resv)))
return -EINVAL;
@@ -205,6 +205,9 @@ static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
kfree(res);
return ret;
}
+ if (tres.flags & IORING_REG_RESTRICTIONS_INHERIT)
+ res->inherited = true;
+ refcount_set(&res->refs, 1);
current->io_uring_restrict = res;
return 0;
}
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index c8ad735936dc..f9ad9cbee9be 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -66,7 +66,9 @@ void __io_uring_free(struct task_struct *tsk)
kfree(tctx);
tsk->io_uring = NULL;
}
- kfree(tsk->io_uring_restrict);
+ if (tsk->io_uring_restrict &&
+ refcount_dec_and_test(&tsk->io_uring_restrict->refs))
+ kfree(tsk->io_uring_restrict);
}
__cold int io_uring_alloc_task_context(struct task_struct *task,
diff --git a/kernel/fork.c b/kernel/fork.c
index 6081e1c93e21..505f9397a645 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -97,6 +97,7 @@
#include <linux/kasan.h>
#include <linux/scs.h>
#include <linux/io_uring.h>
+#include <linux/io_uring_types.h>
#include <linux/bpf.h>
#include <linux/stackprotector.h>
#include <linux/user_events.h>
@@ -2129,7 +2130,10 @@ __latent_entropy struct task_struct *copy_process(
#ifdef CONFIG_IO_URING
p->io_uring = NULL;
- p->io_uring_restrict = NULL;
+ if (p->io_uring_restrict && p->io_uring_restrict->inherited)
+ refcount_inc(&p->io_uring_restrict->refs);
+ else
+ p->io_uring_restrict = NULL;
#endif
p->default_timer_slack_ns = current->timer_slack_ns;
@@ -2526,6 +2530,7 @@ __latent_entropy struct task_struct *copy_process(
mpol_put(p->mempolicy);
#endif
bad_fork_cleanup_delayacct:
+ io_uring_free(p);
delayacct_tsk_free(p);
bad_fork_cleanup_count:
dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
--
2.51.0
* Re: [PATCHSET RFC 0/2] Per-task io_uring opcode restrictions
From: Gabriel Krisman Bertazi @ 2026-01-08 22:04 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring
Jens Axboe <axboe@kernel.dk> writes:
> Hi,
>
> One common complaint is that io_uring doesn't work with seccomp. That
> is true, as seccomp is designed entirely around classic synchronous
> syscalls: if you can filter what you need based on a syscall number and
> its arguments, then it's fine. For anything else, it doesn't really
> work. For solutions that rely on syscall filtering, e.g. Docker, there's
> really not much you can do with seccomp short of disabling io_uring
> entirely. That's not ideal.
>
> As I do think that's a gap we have that needs closing, here's an RFC
> attempt at that. Suggestions more than welcome! I want to arrive at
> something that works for the various use cases.
>
> io_uring already has a filtering mechanism for opcodes, however it needs
> to be done after a ring has been created. The ring is created in a
> disabled state, and then restrictions are applied, and finally the ring
> is enabled so it can get used. This is cumbersome and doesn't
> necessarily fit everybody's needs.
>
> This series adds support for extending that same filtering of opcodes
> and register operations to something that can be applied to the task as
> a whole. Once applied, any ring created under that task will have these
> restrictions applied. Patch 1 adds the basic support for this, and patch
> 2 adds support for having the restrictions applied at fork or thread
> create time too, so any task or thread created under the current task
> will get the same restrictions.
Hi Jens,
Considering this is, like seccomp, a security mechanism, I don't see a
use case for running without IORING_REG_RESTRICTIONS_INHERIT. Otherwise
there is a quick way around it by just execve'ing into itself. IIRC,
seccomp also doesn't support disabling filters for the same reason.
So, unless someone has a use case, I'd suggest dropping the flag
and just making IORING_REG_RESTRICTIONS_INHERIT the default behavior.
Beyond that, adding more restrictions to an already restricted
application would be a useful use case, so returning -EBUSY when
current->io_uring_restrict is already set might not be viable long term.
But that feature can be added later.
Finally, I suspect we will come quickly to the need of more complex
filtering of arguments, like seccomp. Again, something that can be
added later but could be considered now for the interface.
> A few test cases can be found in liburing, in the task-restrictions
> branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/liburing.git/log/?h=task-restrictions
>
> include/linux/io_uring.h | 2 +-
> include/linux/io_uring_types.h | 2 ++
> include/linux/sched.h | 1 +
> include/uapi/linux/io_uring.h | 16 ++++++++++++++
> io_uring/io_uring.c | 10 +++++++++
> io_uring/register.c | 39 ++++++++++++++++++++++++++++++++++
> io_uring/tctx.c | 23 ++++++++++++--------
> kernel/fork.c | 6 ++++++
> 8 files changed, 89 insertions(+), 10 deletions(-)
--
Gabriel Krisman Bertazi
* Re: [PATCHSET RFC 0/2] Per-task io_uring opcode restrictions
From: Jens Axboe @ 2026-01-08 23:54 UTC (permalink / raw)
To: Gabriel Krisman Bertazi; +Cc: io-uring
On 1/8/26 3:04 PM, Gabriel Krisman Bertazi wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> Hi,
>>
>> One common complaint is that io_uring doesn't work with seccomp. That
>> is true, as seccomp is designed entirely around classic synchronous
>> syscalls: if you can filter what you need based on a syscall number and
>> its arguments, then it's fine. For anything else, it doesn't really
>> work. For solutions that rely on syscall filtering, e.g. Docker, there's
>> really not much you can do with seccomp short of disabling io_uring
>> entirely. That's not ideal.
>>
>> As I do think that's a gap we have that needs closing, here's an RFC
>> attempt at that. Suggestions more than welcome! I want to arrive at
>> something that works for the various use cases.
>>
>> io_uring already has a filtering mechanism for opcodes, however it needs
>> to be done after a ring has been created. The ring is created in a
>> disabled state, and then restrictions are applied, and finally the ring
>> is enabled so it can get used. This is cumbersome and doesn't
>> necessarily fit everybody's needs.
>>
>> This series adds support for extending that same filtering of opcodes
>> and register operations to something that can be applied to the task as
>> a whole. Once applied, any ring created under that task will have these
>> restrictions applied. Patch 1 adds the basic support for this, and patch
>> 2 adds support for having the restrictions applied at fork or thread
>> create time too, so any task or thread created under the current task
>> will get the same restrictions.
>
> Hi Jens,
>
> Considering this is, like seccomp, a security mechanism, I don't see a
> use case for running without IORING_REG_RESTRICTIONS_INHERIT. Otherwise
> there is a quick way around it by just execve'ing into itself. IIRC,
> seccomp also doesn't support disabling filters for the same reason.
> So, unless someone has a use case, I'd suggest dropping the flag
> and just making IORING_REG_RESTRICTIONS_INHERIT the default behavior.
Yes good point, and then I can fold these two patches as well. I do
agree that having it be inherited on fork is probably the only way to
go. Not posted with this series, but I did add support for unregistering
a filter, IFF you were the original creator of it. You can either update
it with a new set of restrictions, or simply pass NULL and get the
current set removed.
> Beyond that, adding more restrictions to an already restricted
> application would be a useful use case, so returning -EBUSY when
> current->io_uring_restrict is already set might not be viable long term.
> But that feature can be added later.
We could certainly do something like that, where you can "OR" in more
restrictions, you can't just "AND" them. I'll add that.
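One possible reading of this, sketched in userspace C under the assumption
that the restriction bitmap marks allowed opcodes (which is how the
existing per-ring code tests it): stacking a second filter can only remove
permissions, so the allowed sets are intersected. The model_* names and the
64-opcode limit are hypothetical. If the bitmap instead tracked denied
opcodes, the same tightening would be a bitwise OR.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_OPS 64	/* hypothetical; the kernel uses IORING_OP_LAST */

/* One bit per SQE opcode; a set bit means the opcode is allowed. */
struct model_filter {
	uint64_t sqe_op;
};

void model_allow(struct model_filter *f, unsigned int op)
{
	if (op < MODEL_NR_OPS)
		f->sqe_op |= UINT64_C(1) << op;
}

bool model_allowed(const struct model_filter *f, unsigned int op)
{
	return op < MODEL_NR_OPS && ((f->sqe_op >> op) & 1);
}

/* Stack a second filter: an opcode survives only if both allow it. */
void model_tighten(struct model_filter *cur, const struct model_filter *extra)
{
	cur->sqe_op &= extra->sqe_op;
}
```

The useful property is that a later registration can never re-enable an
opcode that an earlier one dropped, which is what a sandboxed task must not
be able to do.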
> Finally, I suspect we will come quickly to the need of more complex
> filtering of arguments, like seccomp. Again, something that can be
> added later but could be considered now for the interface.
Quite possibly. As it's using the same mechanism we already have, it
currently just supports filtering opcodes, register opcodes, and flags
for either of those. We do have some vacant fields in io_uring_restriction
right now which could cover more cases, at least.
--
Jens Axboe