public inbox for io-uring@vger.kernel.org
* [PATCHSET RFC v2 0/3] Per-task io_uring opcode restrictions
@ 2026-01-09 18:48 Jens Axboe
  2026-01-09 18:48 ` [PATCH 1/3] io_uring: allow registration of per-task restrictions Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Jens Axboe @ 2026-01-09 18:48 UTC (permalink / raw)
  To: io-uring; +Cc: krisman

Hi,

For details on this patchset, see the v1 posting here:

https://lore.kernel.org/io-uring/20260108202944.288490-1-axboe@kernel.dk/

This code can also be found here:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/log/?h=io_uring-task-restrictions

and a corresponding liburing branch here:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/liburing.git/log/?h=task-restrictions

with basic support and test cases.

 include/linux/io_uring.h       |   2 +-
 include/linux/io_uring_types.h |   2 +
 include/linux/sched.h          |   1 +
 include/uapi/linux/io_uring.h  |  18 ++++++
 io_uring/io_uring.c            |  10 +++
 io_uring/register.c            | 110 ++++++++++++++++++++++++++++++---
 io_uring/tctx.c                |  25 +++++---
 kernel/fork.c                  |   4 ++
 8 files changed, 153 insertions(+), 19 deletions(-)

Changes since v1
- Remove IORING_REG_RESTRICTIONS_INHERIT flag, restrictions are
  inherited across fork by default now.
- Allow original creator of a restriction set to unregister it as well.
- Add IORING_REG_RESTRICTIONS_MASK flag, which allows anyone to further
  restrict the currently assigned restriction set.
- Ensure failed operations restore the old set.
- Add more test cases on the liburing side.

-- 
Jens Axboe



* [PATCH 1/3] io_uring: allow registration of per-task restrictions
  2026-01-09 18:48 [PATCHSET RFC v2 0/3] Per-task io_uring opcode restrictions Jens Axboe
@ 2026-01-09 18:48 ` Jens Axboe
  2026-01-09 18:48 ` [PATCH 2/3] io_uring/register: add MASK support for task filter set Jens Axboe
  2026-01-09 18:48 ` [PATCH 3/3] io_uring/register: allow original task restrictions owner to unregister Jens Axboe
  2 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-01-09 18:48 UTC (permalink / raw)
  To: io-uring; +Cc: krisman, Jens Axboe

Currently io_uring supports restricting operations on a per-ring basis.
To use those, the ring must be setup in a disabled state by setting
IORING_SETUP_R_DISABLED. Then restrictions can be set for the ring, and
the ring can then be enabled.

This commit adds support for IORING_REGISTER_RESTRICTIONS_TASK, which
allows registering the same kind of restrictions, but with the task
itself rather than with a specific ring. Once done, any ring created
will inherit these restrictions.

If a restriction filter is registered with a task, then it is inherited
across fork by the task's children.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring.h       |  2 +-
 include/linux/io_uring_types.h |  1 +
 include/linux/sched.h          |  1 +
 include/uapi/linux/io_uring.h  |  9 +++++++++
 io_uring/io_uring.c            | 10 +++++++++
 io_uring/register.c            | 37 ++++++++++++++++++++++++++++++++++
 io_uring/tctx.c                | 25 ++++++++++++++---------
 kernel/fork.c                  |  4 ++++
 8 files changed, 79 insertions(+), 10 deletions(-)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 85fe4e6b275c..cfd2f4c667ee 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -25,7 +25,7 @@ static inline void io_uring_task_cancel(void)
 }
 static inline void io_uring_free(struct task_struct *tsk)
 {
-	if (tsk->io_uring)
+	if (tsk->io_uring || tsk->io_uring_restrict)
 		__io_uring_free(tsk);
 }
 #else
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 54fd30abf2b8..196f41ec6d60 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -222,6 +222,7 @@ struct io_rings {
 struct io_restriction {
 	DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
 	DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
+	refcount_t refs;
 	u8 sqe_flags_allowed;
 	u8 sqe_flags_required;
 	bool registered;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d395f2810fac..9abbd11bb87c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1190,6 +1190,7 @@ struct task_struct {
 
 #ifdef CONFIG_IO_URING
 	struct io_uring_task		*io_uring;
+	struct io_restriction		*io_uring_restrict;
 #endif
 
 	/* Namespaces: */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index b5b23c0d5283..3ecf9c1bfa2d 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -700,6 +700,8 @@ enum io_uring_register_op {
 	/* auxiliary zcrx configuration, see enum zcrx_ctrl_op */
 	IORING_REGISTER_ZCRX_CTRL		= 36,
 
+	IORING_REGISTER_RESTRICTIONS_TASK	= 37,
+
 	/* this goes last */
 	IORING_REGISTER_LAST,
 
@@ -805,6 +807,13 @@ struct io_uring_restriction {
 	__u32 resv2[3];
 };
 
+struct io_uring_task_restriction {
+	__u16 flags;
+	__u16 nr_res;
+	__u32 resv[3];
+	__DECLARE_FLEX_ARRAY(struct io_uring_restriction, restrictions);
+};
+
 struct io_uring_clock_register {
 	__u32	clockid;
 	__u32	__resv[3];
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1aebdba425e8..044da739ed0b 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3608,6 +3608,16 @@ static __cold int io_uring_create(struct io_ctx_config *config)
 	else
 		ctx->notify_method = TWA_SIGNAL;
 
+	/*
+	 * If the current task has restrictions enabled, then copy them to
+	 * our newly created ring and mark it as registered.
+	 */
+	if (current->io_uring_restrict) {
+		memcpy(&ctx->restrictions, current->io_uring_restrict, sizeof(ctx->restrictions));
+		ctx->restrictions.registered = true;
+		ctx->restricted = true;
+	}
+
 	/*
 	 * This is just grabbed for accounting purposes. When a process exits,
 	 * the mm is exited and dropped before the files, hence we need to hang
diff --git a/io_uring/register.c b/io_uring/register.c
index 62d39b3ff317..89254d0fbe79 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -175,6 +175,41 @@ static __cold int io_register_restrictions(struct io_ring_ctx *ctx,
 	return ret;
 }
 
+static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
+{
+	struct io_uring_task_restriction __user *ures = arg;
+	struct io_uring_task_restriction tres;
+	struct io_restriction *res;
+	int ret;
+
+	/* Disallow if task already has registered restrictions */
+	if (current->io_uring_restrict)
+		return -EPERM;
+	if (nr_args != 1)
+		return -EINVAL;
+
+	if (copy_from_user(&tres, arg, sizeof(tres)))
+		return -EFAULT;
+
+	if (tres.flags)
+		return -EINVAL;
+	if (!mem_is_zero(tres.resv, sizeof(tres.resv)))
+		return -EINVAL;
+
+	res = kzalloc(sizeof(*res), GFP_KERNEL);
+	if (!res)
+		return -ENOMEM;
+
+	ret = io_parse_restrictions(ures->restrictions, tres.nr_res, res);
+	if (ret) {
+		kfree(res);
+		return ret;
+	}
+	refcount_set(&res->refs, 1);
+	current->io_uring_restrict = res;
+	return 0;
+}
+
 static int io_register_enable_rings(struct io_ring_ctx *ctx)
 {
 	if (!(ctx->flags & IORING_SETUP_R_DISABLED))
@@ -889,6 +924,8 @@ static int io_uring_register_blind(unsigned int opcode, void __user *arg,
 		return io_uring_register_send_msg_ring(arg, nr_args);
 	case IORING_REGISTER_QUERY:
 		return io_query(arg, nr_args);
+	case IORING_REGISTER_RESTRICTIONS_TASK:
+		return io_register_restrictions_task(arg, nr_args);
 	}
 	return -EINVAL;
 }
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 5b66755579c0..1ec71d5cf3f0 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -54,16 +54,23 @@ void __io_uring_free(struct task_struct *tsk)
 	 * node is stored in the xarray. Until that gets sorted out, attempt
 	 * an iteration here and warn if any entries are found.
 	 */
-	xa_for_each(&tctx->xa, index, node) {
-		WARN_ON_ONCE(1);
-		break;
-	}
-	WARN_ON_ONCE(tctx->io_wq);
-	WARN_ON_ONCE(tctx->cached_refs);
+	if (tctx) {
+		xa_for_each(&tctx->xa, index, node) {
+			WARN_ON_ONCE(1);
+			break;
+		}
+		WARN_ON_ONCE(tctx->io_wq);
+		WARN_ON_ONCE(tctx->cached_refs);
 
-	percpu_counter_destroy(&tctx->inflight);
-	kfree(tctx);
-	tsk->io_uring = NULL;
+		percpu_counter_destroy(&tctx->inflight);
+		kfree(tctx);
+		tsk->io_uring = NULL;
+	}
+	if (tsk->io_uring_restrict) {
+		if (refcount_dec_and_test(&tsk->io_uring_restrict->refs))
+			kfree(tsk->io_uring_restrict);
+		tsk->io_uring_restrict = NULL;
+	}
 }
 
 __cold int io_uring_alloc_task_context(struct task_struct *task,
diff --git a/kernel/fork.c b/kernel/fork.c
index b1f3915d5f8e..da8fd6fd384c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -97,6 +97,7 @@
 #include <linux/kasan.h>
 #include <linux/scs.h>
 #include <linux/io_uring.h>
+#include <linux/io_uring_types.h>
 #include <linux/bpf.h>
 #include <linux/stackprotector.h>
 #include <linux/user_events.h>
@@ -2129,6 +2130,8 @@ __latent_entropy struct task_struct *copy_process(
 
 #ifdef CONFIG_IO_URING
 	p->io_uring = NULL;
+	if (p->io_uring_restrict)
+		refcount_inc(&p->io_uring_restrict->refs);
 #endif
 
 	p->default_timer_slack_ns = current->timer_slack_ns;
@@ -2525,6 +2528,7 @@ __latent_entropy struct task_struct *copy_process(
 	mpol_put(p->mempolicy);
 #endif
 bad_fork_cleanup_delayacct:
+	io_uring_free(p);
 	delayacct_tsk_free(p);
 bad_fork_cleanup_count:
 	dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
-- 
2.51.0



* [PATCH 2/3] io_uring/register: add MASK support for task filter set
  2026-01-09 18:48 [PATCHSET RFC v2 0/3] Per-task io_uring opcode restrictions Jens Axboe
  2026-01-09 18:48 ` [PATCH 1/3] io_uring: allow registration of per-task restrictions Jens Axboe
@ 2026-01-09 18:48 ` Jens Axboe
  2026-01-09 18:48 ` [PATCH 3/3] io_uring/register: allow original task restrictions owner to unregister Jens Axboe
  2 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-01-09 18:48 UTC (permalink / raw)
  To: io-uring; +Cc: krisman, Jens Axboe

If IORING_REG_RESTRICTIONS_MASK is set in the flags for an
IORING_REGISTER_RESTRICTIONS_TASK operation, then further restrictions
can be added to the current set. No restrictions may be relaxed this
way. If a current set exists, the passed-in set is added to the current
set. If no current set exists, the new set is applied as the current
task's io_uring restriction filter.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/uapi/linux/io_uring.h |  9 ++++++
 io_uring/register.c           | 54 ++++++++++++++++++++++++++---------
 2 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 3ecf9c1bfa2d..e39da481f14c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -807,6 +807,15 @@ struct io_uring_restriction {
 	__u32 resv2[3];
 };
 
+enum {
+	/*
+	 * MASK operation to further restrict a filter set. Can clear opcodes
+	 * allowed for SQEs or register operations, clear allowed SQE flags,
+	 * and set further required SQE flags.
+	 */
+	IORING_REG_RESTRICTIONS_MASK	= (1U << 0),
+};
+
 struct io_uring_task_restriction {
 	__u16 flags;
 	__u16 nr_res;
diff --git a/io_uring/register.c b/io_uring/register.c
index 89254d0fbe79..552b22f6b2dc 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -104,7 +104,8 @@ static int io_register_personality(struct io_ring_ctx *ctx)
 }
 
 static __cold int io_parse_restrictions(void __user *arg, unsigned int nr_args,
-					struct io_restriction *restrictions)
+					struct io_restriction *restrictions,
+					bool mask_it)
 {
 	struct io_uring_restriction *res;
 	size_t size;
@@ -122,32 +123,41 @@ static __cold int io_parse_restrictions(void __user *arg, unsigned int nr_args,
 		return PTR_ERR(res);
 
 	ret = -EINVAL;
-
 	for (i = 0; i < nr_args; i++) {
 		switch (res[i].opcode) {
 		case IORING_RESTRICTION_REGISTER_OP:
 			if (res[i].register_op >= IORING_REGISTER_LAST)
 				goto err;
-			__set_bit(res[i].register_op, restrictions->register_op);
+			if (mask_it)
+				__clear_bit(res[i].register_op, restrictions->register_op);
+			else
+				__set_bit(res[i].register_op, restrictions->register_op);
 			break;
 		case IORING_RESTRICTION_SQE_OP:
 			if (res[i].sqe_op >= IORING_OP_LAST)
 				goto err;
-			__set_bit(res[i].sqe_op, restrictions->sqe_op);
+			if (mask_it)
+				__clear_bit(res[i].sqe_op, restrictions->sqe_op);
+			else
+				__set_bit(res[i].sqe_op, restrictions->sqe_op);
 			break;
 		case IORING_RESTRICTION_SQE_FLAGS_ALLOWED:
-			restrictions->sqe_flags_allowed = res[i].sqe_flags;
+			if (mask_it)
+				restrictions->sqe_flags_allowed &= res[i].sqe_flags;
+			else
+				restrictions->sqe_flags_allowed = res[i].sqe_flags;
 			break;
 		case IORING_RESTRICTION_SQE_FLAGS_REQUIRED:
-			restrictions->sqe_flags_required = res[i].sqe_flags;
+			if (mask_it)
+				restrictions->sqe_flags_required |= res[i].sqe_flags;
+			else
+				restrictions->sqe_flags_required = res[i].sqe_flags;
 			break;
 		default:
 			goto err;
 		}
 	}
-
 	ret = 0;
-
 err:
 	kfree(res);
 	return ret;
@@ -166,7 +176,7 @@ static __cold int io_register_restrictions(struct io_ring_ctx *ctx,
 	if (ctx->restrictions.registered)
 		return -EBUSY;
 
-	ret = io_parse_restrictions(arg, nr_args, &ctx->restrictions);
+	ret = io_parse_restrictions(arg, nr_args, &ctx->restrictions, false);
 	/* Reset all restrictions if an error happened */
 	if (ret != 0)
 		memset(&ctx->restrictions, 0, sizeof(ctx->restrictions));
@@ -182,29 +192,45 @@ static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
 	struct io_restriction *res;
 	int ret;
 
-	/* Disallow if task already has registered restrictions */
-	if (current->io_uring_restrict)
-		return -EPERM;
 	if (nr_args != 1)
 		return -EINVAL;
 
 	if (copy_from_user(&tres, arg, sizeof(tres)))
 		return -EFAULT;
 
-	if (tres.flags)
+	if (tres.flags & ~IORING_REG_RESTRICTIONS_MASK)
 		return -EINVAL;
 	if (!mem_is_zero(tres.resv, sizeof(tres.resv)))
 		return -EINVAL;
 
+	/*
+	 * Disallow if task already has registered restrictions, and we're
+	 * not passing in further restrictions to add to an existing set.
+	 */
+	if (current->io_uring_restrict &&
+	    !(tres.flags & IORING_REG_RESTRICTIONS_MASK))
+		return -EPERM;
+
 	res = kzalloc(sizeof(*res), GFP_KERNEL);
 	if (!res)
 		return -ENOMEM;
 
-	ret = io_parse_restrictions(ures->restrictions, tres.nr_res, res);
+	/*
+	 * Can only be set if we're MASK'ing in more restrictions. If so,
+	 * copy existing filters.
+	 */
+	if (current->io_uring_restrict)
+		memcpy(res, current->io_uring_restrict, sizeof(*res));
+
+	ret = io_parse_restrictions(ures->restrictions, tres.nr_res, res,
+				    tres.flags & IORING_REG_RESTRICTIONS_MASK);
 	if (ret) {
 		kfree(res);
 		return ret;
 	}
+	if (current->io_uring_restrict &&
+	    refcount_dec_and_test(&current->io_uring_restrict->refs))
+		kfree(current->io_uring_restrict);
 	refcount_set(&res->refs, 1);
 	current->io_uring_restrict = res;
 	return 0;
-- 
2.51.0



* [PATCH 3/3] io_uring/register: allow original task restrictions owner to unregister
  2026-01-09 18:48 [PATCHSET RFC v2 0/3] Per-task io_uring opcode restrictions Jens Axboe
  2026-01-09 18:48 ` [PATCH 1/3] io_uring: allow registration of per-task restrictions Jens Axboe
  2026-01-09 18:48 ` [PATCH 2/3] io_uring/register: add MASK support for task filter set Jens Axboe
@ 2026-01-09 18:48 ` Jens Axboe
  2026-01-13  0:10   ` Gabriel Krisman Bertazi
  2 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2026-01-09 18:48 UTC (permalink / raw)
  To: io-uring; +Cc: krisman, Jens Axboe

Currently, any attempt to register a set of task restrictions will
fail with -EPERM if an existing set exists. But it is feasible to let
the original creator/owner perform this operation, either to remove
restrictions entirely, or to replace them with a new set.

If an existing set exists and NULL is passed for the new set, the
current set is unregistered. If an existing set exists and a new set is
supplied, the old set is dropped and replaced with the new one.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |  1 +
 io_uring/register.c            | 45 ++++++++++++++++++++++++++++------
 2 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 196f41ec6d60..1ff7817b3535 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -222,6 +222,7 @@ struct io_rings {
 struct io_restriction {
 	DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
 	DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
+	pid_t pid;
 	refcount_t refs;
 	u8 sqe_flags_allowed;
 	u8 sqe_flags_required;
diff --git a/io_uring/register.c b/io_uring/register.c
index 552b22f6b2dc..c8b8a9edbc65 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -189,12 +189,19 @@ static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
 {
 	struct io_uring_task_restriction __user *ures = arg;
 	struct io_uring_task_restriction tres;
-	struct io_restriction *res;
+	struct io_restriction *old_res, *res;
 	int ret;
 
 	if (nr_args != 1)
 		return -EINVAL;
 
+	res = current->io_uring_restrict;
+	if (!ures) {
+		if (!res)
+			return -EFAULT;
+		goto drop_set;
+	}
+
 	if (copy_from_user(&tres, arg, sizeof(tres)))
 		return -EFAULT;
 
@@ -207,13 +214,27 @@ static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
 	 * Disallow if task already has registered restrictions, and we're
 	 * not passing in further restrictions to add to an existing set.
 	 */
-	if (current->io_uring_restrict &&
-	    !(tres.flags & IORING_REG_RESTRICTIONS_MASK))
-		return -EPERM;
+	old_res = NULL;
+	if (res && !(tres.flags & IORING_REG_RESTRICTIONS_MASK)) {
+		/* Not owner, may only append further restrictions */
+drop_set:
+		if (res->pid != current->pid)
+			return -EPERM;
+		/* Old set to be put later if we succeed */
+		old_res = res;
+		/* No new mask supplied, we're done */
+		if (!ures) {
+			ret = 0;
+			current->io_uring_restrict = NULL;
+			goto out;
+		}
+	}
 
 	res = kzalloc(sizeof(*res), GFP_KERNEL);
-	if (!res)
-		return -ENOMEM;
+	if (!res) {
+		ret = -ENOMEM;
+		goto out;
+	}
 
 	/*
 	 * Can only be set if we're MASK'ing in more restrictions. If so,
@@ -226,14 +247,22 @@ static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
 				    tres.flags & IORING_REG_RESTRICTIONS_MASK);
 	if (ret) {
 		kfree(res);
-		return ret;
+		goto out;
 	}
 	if (current->io_uring_restrict &&
 	    refcount_dec_and_test(&current->io_uring_restrict->refs))
 		kfree(current->io_uring_restrict);
+	res->pid = current->pid;
 	refcount_set(&res->refs, 1);
 	current->io_uring_restrict = res;
-	return 0;
+	ret = 0;
+out:
+	if (ret) {
+		if (old_res)
+			current->io_uring_restrict = old_res;
+	} else if (old_res && refcount_dec_and_test(&old_res->refs))
+		kfree(old_res);
+	return ret;
 }
 
 static int io_register_enable_rings(struct io_ring_ctx *ctx)
-- 
2.51.0



* Re: [PATCH 3/3] io_uring/register: allow original task restrictions owner to unregister
  2026-01-09 18:48 ` [PATCH 3/3] io_uring/register: allow original task restrictions owner to unregister Jens Axboe
@ 2026-01-13  0:10   ` Gabriel Krisman Bertazi
  2026-01-13 18:25     ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Gabriel Krisman Bertazi @ 2026-01-13  0:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring


Jens Axboe <axboe@kernel.dk> writes:

> Currently, any attempt to register a set of task restrictions will
> fail with -EPERM if an existing set exists. But it is feasible to let
> the original creator/owner perform this operation, either to remove
> restrictions entirely, or to replace them with a new set.
>
> If an existing set exists and NULL is passed for the new set, the
> current set is unregistered. If an existing set exists and a new set is
> supplied, the old set is dropped and replaced with the new one.

Feature-wise, I think this covers what I mentioned in the previous
iteration.  Even though this is an RFC, I think I found two bugs that
allow the child to escape the restrictions:

> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/io_uring_types.h |  1 +
>  io_uring/register.c            | 45 ++++++++++++++++++++++++++++------
>  2 files changed, 38 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 196f41ec6d60..1ff7817b3535 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -222,6 +222,7 @@ struct io_rings {
>  struct io_restriction {
>  	DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
>  	DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
> +	pid_t pid;
>  	refcount_t refs;
>  	u8 sqe_flags_allowed;
>  	u8 sqe_flags_required;
> diff --git a/io_uring/register.c b/io_uring/register.c
> index 552b22f6b2dc..c8b8a9edbc65 100644
> --- a/io_uring/register.c
> +++ b/io_uring/register.c
> @@ -189,12 +189,19 @@ static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
>  {
>  	struct io_uring_task_restriction __user *ures = arg;
>  	struct io_uring_task_restriction tres;
> -	struct io_restriction *res;
> +	struct io_restriction *old_res, *res;
>  	int ret;
>  
>  	if (nr_args != 1)
>  		return -EINVAL;
>  
> +	res = current->io_uring_restrict;
> +	if (!ures) {
> +		if (!res)
> +			return -EFAULT;
> +		goto drop_set;
> +	}
> +
>  	if (copy_from_user(&tres, arg, sizeof(tres)))
>  		return -EFAULT;
>  
> @@ -207,13 +214,27 @@ static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
>  	 * Disallow if task already has registered restrictions, and we're
>  	 * not passing in further restrictions to add to an existing set.
>  	 */
> -	if (current->io_uring_restrict &&
> -	    !(tres.flags & IORING_REG_RESTRICTIONS_MASK))
> -		return -EPERM;
> +	old_res = NULL;
> +	if (res && !(tres.flags & IORING_REG_RESTRICTIONS_MASK)) {
> +		/* Not owner, may only append further restrictions */
> +drop_set:
> +		if (res->pid != current->pid)
> +			return -EPERM;

This might be hard to exploit, but if the parent terminates, the pid can
get reused.  Then, if the child forks until it gets the same pid, it can
unregister the filter.  I suppose the fix would require holding a
reference to the task, similar to what pidfd does, but perhaps we should
just abandon the unregistering semantics?  I'm not sure it is that useful...

> @@ -226,14 +247,22 @@ static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
>  				    tres.flags & IORING_REG_RESTRICTIONS_MASK);
>  	if (ret) {
>  		kfree(res);
> -		return ret;
> +		goto out;
>  	}
>  	if (current->io_uring_restrict &&
>  	    refcount_dec_and_test(&current->io_uring_restrict->refs))
>  		kfree(current->io_uring_restrict);
> +	res->pid = current->pid;

res->pid must always point to the first task that added a
restriction. So:

if (!current->io_uring_restrict)
       res->pid = current->pid;

Otherwise, the child will become the owner after adding another
restriction, and can then break out with a further unregister.  Based on
your testcase, this escapes the filter:


[-- Attachment #2: poc.patch --]

diff --git a/test/task-restrictions.c b/test/task-restrictions.c
index 5a9170b4..4d4b457c 100644
--- a/test/task-restrictions.c
+++ b/test/task-restrictions.c
@@ -92,6 +92,12 @@ static int test_restrictions(int should_work)
 static void *thread_fn(void *unused)
 {
 	int ret;
+	struct io_uring_task_restriction  *res =
+		calloc(1, sizeof(*res) + 1 * sizeof(struct io_uring_restriction));
+	res->restrictions[0].opcode = IORING_RESTRICTION_SQE_OP;
+	res->restrictions[0].sqe_op = IORING_OP_FUTEX_WAIT;
+	res->nr_res = 1;
+	res->flags = IORING_REG_RESTRICTIONS_MASK;
 
 	ret = test_restrictions(0);
 	if (ret) {
@@ -99,6 +105,7 @@ static void *thread_fn(void *unused)
 		return (void *) (uintptr_t) ret;
 	}
 
+	ret = io_uring_register_task_restrictions(res);
 	ret = io_uring_register_task_restrictions(NULL);
 	if (!ret) {
 		fprintf(stderr, "thread restrictions unregister worked?!\n");



-- 
Gabriel Krisman Bertazi


* Re: [PATCH 3/3] io_uring/register: allow original task restrictions owner to unregister
  2026-01-13  0:10   ` Gabriel Krisman Bertazi
@ 2026-01-13 18:25     ` Jens Axboe
  0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-01-13 18:25 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: io-uring

On 1/12/26 5:10 PM, Gabriel Krisman Bertazi wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> Currently, any attempt to register a set of task restrictions will
>> fail with -EPERM if an existing set exists. But it is feasible to let
>> the original creator/owner perform this operation, either to remove
>> restrictions entirely, or to replace them with a new set.
>>
>> If an existing set exists and NULL is passed for the new set, the
>> current set is unregistered. If an existing set exists and a new set is
>> supplied, the old set is dropped and replaced with the new one.
> 
> Feature-wise, I think this covers what I mentioned in the previous
> iteration.  Even though this is an RFC, I think I found two bugs that
> allow the child to escape the restrictions:
> 
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  include/linux/io_uring_types.h |  1 +
>>  io_uring/register.c            | 45 ++++++++++++++++++++++++++++------
>>  2 files changed, 38 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 196f41ec6d60..1ff7817b3535 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -222,6 +222,7 @@ struct io_rings {
>>  struct io_restriction {
>>  	DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
>>  	DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
>> +	pid_t pid;
>>  	refcount_t refs;
>>  	u8 sqe_flags_allowed;
>>  	u8 sqe_flags_required;
>> diff --git a/io_uring/register.c b/io_uring/register.c
>> index 552b22f6b2dc..c8b8a9edbc65 100644
>> --- a/io_uring/register.c
>> +++ b/io_uring/register.c
>> @@ -189,12 +189,19 @@ static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
>>  {
>>  	struct io_uring_task_restriction __user *ures = arg;
>>  	struct io_uring_task_restriction tres;
>> -	struct io_restriction *res;
>> +	struct io_restriction *old_res, *res;
>>  	int ret;
>>  
>>  	if (nr_args != 1)
>>  		return -EINVAL;
>>  
>> +	res = current->io_uring_restrict;
>> +	if (!ures) {
>> +		if (!res)
>> +			return -EFAULT;
>> +		goto drop_set;
>> +	}
>> +
>>  	if (copy_from_user(&tres, arg, sizeof(tres)))
>>  		return -EFAULT;
>>  
>> @@ -207,13 +214,27 @@ static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
>>  	 * Disallow if task already has registered restrictions, and we're
>>  	 * not passing in further restrictions to add to an existing set.
>>  	 */
>> -	if (current->io_uring_restrict &&
>> -	    !(tres.flags & IORING_REG_RESTRICTIONS_MASK))
>> -		return -EPERM;
>> +	old_res = NULL;
>> +	if (res && !(tres.flags & IORING_REG_RESTRICTIONS_MASK)) {
>> +		/* Not owner, may only append further restrictions */
>> +drop_set:
>> +		if (res->pid != current->pid)
>> +			return -EPERM;
> 
> This might be hard to exploit, but if the parent terminates, the pid
> can get reused.  Then, if the child forks until it gets the same pid,
> it can unregister the filter.  I suppose the fix would require holding
> a reference to the task, similar to what pidfd does, but perhaps we
> should just abandon the unregistering semantics?  I'm not sure it is
> that useful...

I did ponder pid reuse and considered it not an issue due to the size of
the space. But from other feedback, it seems like unregistering is not a
good idea anyway; restrictions should always be cumulative. There's a
valid use case where the task is forked up front, then restrictions are
registered, and then exec happens. We can't allow unregistering for that
case.

So I think I'll just drop this particular patch for now. It's also why I
kept it separate...

>> @@ -226,14 +247,22 @@ static int io_register_restrictions_task(void __user *arg, unsigned int nr_args)
>>  				    tres.flags & IORING_REG_RESTRICTIONS_MASK);
>>  	if (ret) {
>>  		kfree(res);
>> -		return ret;
>> +		goto out;
>>  	}
>>  	if (current->io_uring_restrict &&
>>  	    refcount_dec_and_test(&current->io_uring_restrict->refs))
>>  		kfree(current->io_uring_restrict);
>> +	res->pid = current->pid;
> 
> res->pid must always point to the first task that added a
> restriction. So:
> 
> if (!current->io_uring_restrict)
>        res->pid = current->pid;
> 
> Otherwise, the child will become the owner after adding another
> restriction, and can then break out with a further unregister.  Based on
> your testcase, this escapes the filter:

Thanks for looking at that too, but I guess moot with it getting
dropped. But yes I do think you're right!

-- 
Jens Axboe

