public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] sched/task: always defer 'struct task_struct' destruction via RCU
@ 2026-05-08 14:02 Alice Ryhl
  2026-05-08 20:01 ` Sebastian Andrzej Siewior
  2026-05-08 21:38 ` Andrea Righi
  0 siblings, 2 replies; 6+ messages in thread
From: Alice Ryhl @ 2026-05-08 14:02 UTC (permalink / raw)
  To: Paul E. McKenney, Andrea Righi, Boqun Feng, Changwoo Min,
	Clark Williams, David Vernet, Frederic Weisbecker, Ingo Molnar,
	Jens Axboe, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Neeraj Upadhyay, Peter Zijlstra,
	Sebastian Andrzej Siewior, Steven Rostedt, Tejun Heo,
	Uladzislau Rezki, Zqiang
  Cc: io-uring, rcu, sched-ext, linux-kernel, linux-rt-devel,
	Alice Ryhl

The sched/task.h header file currently exposes a tryget_task_struct()
function, but it is very risky to use: if the last refcount of the task
is dropped using put_task_struct_many(), then the task is freed right
away without an RCU grace period.

This means that if the kernel contains a code path anywhere that may
drop the last refcount of a task with put_task_struct_many(), and it
also contains a code path anywhere that stashes a task pointer under
RCU and calls tryget_task_struct() on it, then whenever the two execute
on the same 'struct task_struct', the result is a use-after-free.

The above applies even if the RCU user drops its own task reference with
put_task_struct(), because if that is not the last reference, then it's
possible for another thread to invoke put_task_struct_many() and free
the task less than a grace period after the RCU user called
put_task_struct().
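
To make the hazard concrete, the counter semantics above can be modeled
in a few lines of userspace C. All names here (mock_task,
mock_put_many, mock_tryget) are invented stand-ins, and
tryget_task_struct() is assumed to use refcount_inc_not_zero()
semantics; this only illustrates the race, it is not kernel code:

```c
/*
 * Userspace model of the race described above. The kernel types and
 * helpers are mocked; this only illustrates the counter semantics.
 */
#include <stdatomic.h>
#include <stdbool.h>

struct mock_task {
	atomic_int usage;	/* stands in for task_struct::usage */
	bool freed;		/* set when the "task" would be freed */
};

/* Model of refcount_sub_and_test(): returns true on the final drop. */
static bool mock_put_many(struct mock_task *t, int nr)
{
	if (atomic_fetch_sub(&t->usage, nr) == nr) {
		t->freed = true;	/* freed immediately, no grace period */
		return true;
	}
	return false;
}

/* Model of tryget_task_struct(): refcount_inc_not_zero() semantics. */
static bool mock_tryget(struct mock_task *t)
{
	int old = atomic_load(&t->usage);

	/* Reading t->usage here is already a UAF once t is freed. */
	while (old != 0) {
		if (atomic_compare_exchange_weak(&t->usage, &old, old + 1))
			return true;
	}
	return false;
}
```

With 'usage' positive, mock_tryget() succeeds; once mock_put_many()
performs the final drop, 'freed' is set with no delay, so a subsequent
mock_tryget() is reading what would already be freed memory, even
though it returns false.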

There does not appear to be an actual problem in the kernel tree right
now because there are no in-tree users of put_task_struct_many() where
refcount_sub_and_test() might return 'true'. io-uring invokes the
function from task work while the task is still running, so it will not
decrement the refcount all the way to zero. (Note that if I'm wrong
about this, then it is probably possible to trigger a UAF by combining
this code path in io-uring with the tryget_task_struct() call in
sched-ext.)

However, the current situation is fragile and error-prone.
- If you look at put_task_struct_many() in isolation, it looks like it
  would be okay to call it in a situation where refcount_sub_and_test()
  might return 'true'.
- Similarly, if you look at tryget_task_struct(), you would assume that
  you are allowed to call this method for a grace period after 'users'
  has hit zero. (If not, why does it exist?)
But if two different kernel developers anywhere in the kernel make these
conflicting assumptions at any point in the future, then the combination
of their code may lead to a use-after-free if there is any way for them
to interact via the same 'struct task_struct'.

Thus, as a defensive measure, we should either make
put_task_struct_many() use call_rcu(), or we should delete
tryget_task_struct(). This patch suggests the former because it does not
change anything for any callers that exist today. (As argued previously,
the body of the 'if' statement is dead code in the kernel today.)

The comment in put_task_struct() is also updated so that nobody changes
its implementation to only use call_rcu() under PREEMPT_RT in the
future. The current comment suggests that would be a legal change, but
it is similarly incompatible with anyone using tryget_task_struct().

Signed-off-by: Alice Ryhl <aliceryhl@google.com>
---
Including sched-ext and io-uring in the cc list as they are the only
users of tryget_task_struct() and put_task_struct_many() respectively.
---
 include/linux/sched/task.h | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 41ed884cffc9..da2fbd17b676 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -131,19 +131,25 @@ static inline void put_task_struct(struct task_struct *t)
 		return;
 
 	/*
-	 * Under PREEMPT_RT, we can't call __put_task_struct
-	 * in atomic context because it will indirectly
-	 * acquire sleeping locks. The same is true if the
-	 * current process has a mutex enqueued (blocked on
-	 * a PI chain).
+	 * Delay __put_task_struct() for one grace period so
+	 * that tryget_task_struct() may be used for one
+	 * grace period after any call to put_task_struct().
 	 *
-	 * In !RT, it is always safe to call __put_task_struct().
-	 * Though, in order to simplify the code, resort to the
-	 * deferred call too.
+	 * This also has the benefit of making it legal to
+	 * call put_task_struct() in atomic context. We
+	 * can't do that under PREEMPT_RT because it will
+	 * indirectly acquire sleeping locks. The same is
+	 * true if the current process has a mutex enqueued
+	 * (blocked on a PI chain).
 	 *
 	 * call_rcu() will schedule __put_task_struct_rcu_cb()
 	 * to be called in process context.
 	 *
+	 * In !RT, it is safe to call __put_task_struct()
+	 * from atomic context, but we still need to delay
+	 * cleanup for a grace period to accommodate
+	 * tryget_task_struct() callers.
+	 *
 	 * __put_task_struct() is called when
 	 * refcount_dec_and_test(&t->usage) succeeds.
 	 *
@@ -164,7 +170,7 @@ DEFINE_FREE(put_task, struct task_struct *, if (_T) put_task_struct(_T))
 static inline void put_task_struct_many(struct task_struct *t, int nr)
 {
 	if (refcount_sub_and_test(nr, &t->usage))
-		__put_task_struct(t);
+		call_rcu(&t->rcu, __put_task_struct_rcu_cb);
 }
 
 void put_task_struct_rcu_user(struct task_struct *task);

---
base-commit: 7fd2df204f342fc17d1a0bfcd474b24232fb0f32
change-id: 20260508-put-task-struct-many-5b5b2f4ae174

Best regards,
-- 
Alice Ryhl <aliceryhl@google.com>


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched/task: always defer 'struct task_struct' destruction via RCU
  2026-05-08 14:02 [PATCH] sched/task: always defer 'struct task_struct' destruction via RCU Alice Ryhl
@ 2026-05-08 20:01 ` Sebastian Andrzej Siewior
  2026-05-09  7:18   ` Alice Ryhl
  2026-05-08 21:38 ` Andrea Righi
  1 sibling, 1 reply; 6+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-05-08 20:01 UTC (permalink / raw)
  To: Alice Ryhl
  Cc: Paul E. McKenney, Andrea Righi, Boqun Feng, Changwoo Min,
	Clark Williams, David Vernet, Frederic Weisbecker, Ingo Molnar,
	Jens Axboe, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Neeraj Upadhyay, Peter Zijlstra,
	Steven Rostedt, Tejun Heo, Uladzislau Rezki, Zqiang, io-uring,
	rcu, sched-ext, linux-kernel, linux-rt-devel

On 2026-05-08 14:02:45 [+0000], Alice Ryhl wrote:
> The sched/task.h header file currently exposes a tryget_task_struct()
> function, but it is very risky to use it: If the last refcount of the
> task is dropped using put_task_struct_many(), then the task is freed
> right away without an RCU grace period.
> 
> This means that if the kernel contains a code path anywhere such that
> the last refcount of a task may be dropped with put_task_struct_many(),
> and it also contains a code path anywhere that tries to stash a task
> pointer under rcu and use tryget_task_struct() on it, then if they ever
> execute on the same 'struct task_struct', it results in a
> use-after-free.

If the counter has dropped to 0 then tryget_task_struct() won't
increment it. There is also task_struct::rcu_users, which holds one
`usage' reference, and that is the RCU grace period we care about.

The only reason there is an RCU free here at all is RT, and it was
limited to RT only at first. Then a PI case came up (on RT again) and I
asked repeatedly to have it unconditional on RT and !RT, which then did
happen.

I don't think I would mind aligning the two code paths, but not as a
"this might be UAF if" fix; rather so that both do the same thing. The
important RCU grace period happens via put_task_struct_rcu_user().
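
The rcu_users scheme referred to here can be sketched as a
single-threaded userspace model. All names below are mocks invented for
illustration (the real helpers are put_task_struct_rcu_user() and a
call_rcu() callback), and the grace period is modeled as an explicit
step:

```c
/*
 * Userspace model of the rcu_users scheme: 'rcu_users' pins one
 * reference on 'usage', and the final rcu_users drop releases that
 * pinned reference only after a grace period. Not kernel code.
 */
#include <stdbool.h>

struct mock_task {
	int usage;		/* models task_struct::usage */
	int rcu_users;		/* models task_struct::rcu_users */
	bool freed;
	bool cb_pending;	/* a queued call_rcu() callback */
};

static void mock_put_usage(struct mock_task *t)
{
	if (--t->usage == 0)
		t->freed = true;
}

/* Model of put_task_struct_rcu_user(): the last rcu_users drop frees
 * nothing directly, it only queues a grace-period callback. */
static void mock_put_rcu_user(struct mock_task *t)
{
	if (--t->rcu_users == 0)
		t->cb_pending = true;
}

/* Model of the grace period elapsing and the callback running. */
static void mock_grace_period(struct mock_task *t)
{
	if (t->cb_pending) {
		t->cb_pending = false;
		mock_put_usage(t);	/* drop the 'usage' ref rcu_users held */
	}
}
```

In this model the task can only be freed after the grace-period step,
which is the guarantee that RCU readers of the task pointer rely on.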

Sebastian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched/task: always defer 'struct task_struct' destruction via RCU
  2026-05-08 14:02 [PATCH] sched/task: always defer 'struct task_struct' destruction via RCU Alice Ryhl
  2026-05-08 20:01 ` Sebastian Andrzej Siewior
@ 2026-05-08 21:38 ` Andrea Righi
  2026-05-10 13:41   ` Alice Ryhl
  1 sibling, 1 reply; 6+ messages in thread
From: Andrea Righi @ 2026-05-08 21:38 UTC (permalink / raw)
  To: Alice Ryhl
  Cc: Paul E. McKenney, Boqun Feng, Changwoo Min, Clark Williams,
	David Vernet, Frederic Weisbecker, Ingo Molnar, Jens Axboe,
	Joel Fernandes, Josh Triplett, Lai Jiangshan, Mathieu Desnoyers,
	Neeraj Upadhyay, Peter Zijlstra, Sebastian Andrzej Siewior,
	Steven Rostedt, Tejun Heo, Uladzislau Rezki, Zqiang, io-uring,
	rcu, sched-ext, linux-kernel, linux-rt-devel

Hi Alice,

On Fri, May 08, 2026 at 02:02:45PM +0000, Alice Ryhl wrote:
> The sched/task.h header file currently exposes a tryget_task_struct()
> function, but it is very risky to use it: If the last refcount of the
> task is dropped using put_task_struct_many(), then the task is freed
> right away without an RCU grace period.
> 
> This means that if the kernel contains a code path anywhere such that
> the last refcount of a task may be dropped with put_task_struct_many(),
> and it also contains a code path anywhere that tries to stash a task
> pointer under rcu and use tryget_task_struct() on it, then if they ever
> execute on the same 'struct task_struct', it results in a
> use-after-free.
> 
> The above applies even if the RCU user drops its own task reference with
> put_task_struct(), because if that is not the last reference, then it's
> possible for another thread to invoke put_task_struct_many() and free
> the task less than a grace period after the RCU user called
> put_task_struct().
> 
> There does not appear to be an actual problem in the kernel tree right
> now because there are no in-tree users of put_task_struct_many() where
> refcount_sub_and_test() might return 'true'. Io-uring invokes the
> function from task work while the task is still running, so it will not
> decrement it all the way to zero. (Note that if I'm wrong about this,
> then it's probably possible to trigger UAF by combining this codepath in
> io-uring with the tryget_task_struct() call in sched-ext.)
> 
> However, the current situation is fragile and error-prone.
> - If you look at put_task_struct_many() in isolation, it looks like it
>   would be okay to call it in a situation where refcount_sub_and_test()
>   might return 'true'.
> - Similarly, if you look at tryget_task_struct(), you would assume that
>   you are allowed to call this method for a grace period after 'users'
>   hitting zero. (If not, why does it exist?)
> But if two different kernel developers anywhere in the kernel make these
> conflicting assumptions at any point in the future, then the combination
> of their code may lead to a use-after-free if there is any way for them
> to interact via the same 'struct task_struct'.
> 
> Thus, as a defensive measure, we should either make
> put_task_struct_many() use call_rcu(), or we should delete
> tryget_task_struct(). This patch suggests the former because it does not
> change anything for any callers that exist today. (As argued previously,
> the body of the 'if' statement is dead code in the kernel today.)
> 
> The comment in put_task_struct() is also updated so that nobody changes
> its implementation to only use call_rcu() under PREEMPT_RT in the
> future. The current comment suggests that would be a legal change, but
> it is similarly incompatible with anyone using tryget_task_struct().
> 
> Signed-off-by: Alice Ryhl <aliceryhl@google.com>
> ---
> Including sched-ext and io-uring in the cc list as they are the only
> users of tryget_task_struct() and put_task_struct_many() respectively.

For sched_ext I think we should be already protected by scx_tasks_lock.

From kernel/sched/core.c:

  finish_task_switch():
      if (prev_state == TASK_DEAD) {
          prev->sched_class->task_dead(prev);
          sched_ext_dead(prev);
          cgroup_task_dead(prev);
          put_task_stack(prev);
          ...
          put_task_struct_rcu_user(prev);
      }

 And sched_ext_dead() in kernel/sched/ext.c:

  scoped_guard(raw_spinlock_irqsave, &scx_tasks_lock) {
      list_del_init(&p->scx.tasks_node);
      ...
  }

Now on the sched_ext iter side:

  scx_task_iter_start();                     /* takes scx_tasks_lock */
  while ((p = scx_task_iter_next_locked()))
      if (!tryget_task_struct(p))            /* still under scx_tasks_lock */
         ...

So, the locking gives us the invariant: while the iter holds scx_tasks_lock and
observes p on the list, sched_ext_dead(p) cannot have completed.

And the css_task_iter paths have the analogous ordering.

That said, I think this patch still makes sense to provide consistent
semantics between put_task_struct() and put_task_struct_many(), as
mentioned by Sebastian. So, maybe reword the message around consistency
rather than UAF?

Thanks,
-Andrea

> ---
>  include/linux/sched/task.h | 24 +++++++++++++++---------
>  1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 41ed884cffc9..da2fbd17b676 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -131,19 +131,25 @@ static inline void put_task_struct(struct task_struct *t)
>  		return;
>  
>  	/*
> -	 * Under PREEMPT_RT, we can't call __put_task_struct
> -	 * in atomic context because it will indirectly
> -	 * acquire sleeping locks. The same is true if the
> -	 * current process has a mutex enqueued (blocked on
> -	 * a PI chain).
> +	 * Delay __put_task_struct() for one grace period so
> +	 * that tryget_task_struct() may be used for one
> +	 * grace period after any call to put_task_struct().
>  	 *
> -	 * In !RT, it is always safe to call __put_task_struct().
> -	 * Though, in order to simplify the code, resort to the
> -	 * deferred call too.
> +	 * This also has the benefit of making it legal to
> +	 * call put_task_struct() in atomic context. We
> +	 * can't do that under PREEMPT_RT because it will
> +	 * indirectly acquire sleeping locks. The same is
> +	 * true if the current process has a mutex enqueued
> +	 * (blocked on a PI chain).
>  	 *
>  	 * call_rcu() will schedule __put_task_struct_rcu_cb()
>  	 * to be called in process context.
>  	 *
> +	 * In !RT, it is safe to call __put_task_struct()
> +	 * from atomic context, but we still need to delay
> +	 * cleanup for a grace period to accommodate
> +	 * tryget_task_struct() callers.
> +	 *
>  	 * __put_task_struct() is called when
>  	 * refcount_dec_and_test(&t->usage) succeeds.
>  	 *
> @@ -164,7 +170,7 @@ DEFINE_FREE(put_task, struct task_struct *, if (_T) put_task_struct(_T))
>  static inline void put_task_struct_many(struct task_struct *t, int nr)
>  {
>  	if (refcount_sub_and_test(nr, &t->usage))
> -		__put_task_struct(t);
> +		call_rcu(&t->rcu, __put_task_struct_rcu_cb);
>  }
>  
>  void put_task_struct_rcu_user(struct task_struct *task);
> 
> ---
> base-commit: 7fd2df204f342fc17d1a0bfcd474b24232fb0f32
> change-id: 20260508-put-task-struct-many-5b5b2f4ae174
> 
> Best regards,
> -- 
> Alice Ryhl <aliceryhl@google.com>
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched/task: always defer 'struct task_struct' destruction via RCU
  2026-05-08 20:01 ` Sebastian Andrzej Siewior
@ 2026-05-09  7:18   ` Alice Ryhl
  0 siblings, 0 replies; 6+ messages in thread
From: Alice Ryhl @ 2026-05-09  7:18 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Paul E. McKenney, Andrea Righi, Boqun Feng, Changwoo Min,
	Clark Williams, David Vernet, Frederic Weisbecker, Ingo Molnar,
	Jens Axboe, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Neeraj Upadhyay, Peter Zijlstra,
	Steven Rostedt, Tejun Heo, Uladzislau Rezki, Zqiang, io-uring,
	rcu, sched-ext, linux-kernel, linux-rt-devel

On Fri, May 08, 2026 at 10:01:57PM +0200, Sebastian Andrzej Siewior wrote:
> On 2026-05-08 14:02:45 [+0000], Alice Ryhl wrote:
> > The sched/task.h header file currently exposes a tryget_task_struct()
> > function, but it is very risky to use it: If the last refcount of the
> > task is dropped using put_task_struct_many(), then the task is freed
> > right away without an RCU grace period.
> > 
> > This means that if the kernel contains a code path anywhere such that
> > the last refcount of a task may be dropped with put_task_struct_many(),
> > and it also contains a code path anywhere that tries to stash a task
> > pointer under rcu and use tryget_task_struct() on it, then if they ever
> > execute on the same 'struct task_struct', it results in a
> > use-after-free.
> 
> If the counter dropped to 0 then tryget_task_struct() won't increment
> it.

Yes, provided the 'struct task_struct' hasn't been freed yet. What is
the scenario where it might be zero, but you are certain it is not yet
freed? If not RCU, then I guess this applies only to those cases where
__put_task_struct() itself removes the task from the relevant collection
when 'users' hits zero.

If tryget_task_struct() can only safely be used in that scenario, then I
think that's worth at least a comment in the header file, because at
first glance it's a surprising limitation.

> There is also task_struct::rcu_users which holds one `usage' on it
> and this RCU grace period we care about.

Sure, but I guess my question is: why does tryget_task_struct() exist?
The 'rcu_users' field is not the reason, because 'usage' can't be zero
while that field is in use.

Alice

> The only reason why there is a RCU free here is because of RT and it was
> limited to RT only. Then a PI case came up (on RT again) I asked
> repeatedly to have it unconditional on RT and !RT. Which then did
> happen.
> 
> I don't think I would mind to align the two code paths but not as a
> "this might be UAF if" but to do the same "thing". The important RCU
> grace period happens via put_task_struct_rcu_user().
> 
> Sebastian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched/task: always defer 'struct task_struct' destruction via RCU
  2026-05-08 21:38 ` Andrea Righi
@ 2026-05-10 13:41   ` Alice Ryhl
  2026-05-10 18:36     ` Andrea Righi
  0 siblings, 1 reply; 6+ messages in thread
From: Alice Ryhl @ 2026-05-10 13:41 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Paul E. McKenney, Boqun Feng, Changwoo Min, Clark Williams,
	David Vernet, Frederic Weisbecker, Ingo Molnar, Jens Axboe,
	Joel Fernandes, Josh Triplett, Lai Jiangshan, Mathieu Desnoyers,
	Neeraj Upadhyay, Peter Zijlstra, Sebastian Andrzej Siewior,
	Steven Rostedt, Tejun Heo, Uladzislau Rezki, Zqiang, io-uring,
	rcu, sched-ext, linux-kernel, linux-rt-devel

On Fri, May 08, 2026 at 11:38:18PM +0200, Andrea Righi wrote:
> Hi Alice,
> 
> On Fri, May 08, 2026 at 02:02:45PM +0000, Alice Ryhl wrote:
> > The sched/task.h header file currently exposes a tryget_task_struct()
> > function, but it is very risky to use it: If the last refcount of the
> > task is dropped using put_task_struct_many(), then the task is freed
> > right away without an RCU grace period.
> > 
> > This means that if the kernel contains a code path anywhere such that
> > the last refcount of a task may be dropped with put_task_struct_many(),
> > and it also contains a code path anywhere that tries to stash a task
> > pointer under rcu and use tryget_task_struct() on it, then if they ever
> > execute on the same 'struct task_struct', it results in a
> > use-after-free.
> > 
> > The above applies even if the RCU user drops its own task reference with
> > put_task_struct(), because if that is not the last reference, then it's
> > possible for another thread to invoke put_task_struct_many() and free
> > the task less than a grace period after the RCU user called
> > put_task_struct().
> > 
> > There does not appear to be an actual problem in the kernel tree right
> > now because there are no in-tree users of put_task_struct_many() where
> > refcount_sub_and_test() might return 'true'. Io-uring invokes the
> > function from task work while the task is still running, so it will not
> > decrement it all the way to zero. (Note that if I'm wrong about this,
> > then it's probably possible to trigger UAF by combining this codepath in
> > io-uring with the tryget_task_struct() call in sched-ext.)
> > 
> > However, the current situation is fragile and error-prone.
> > - If you look at put_task_struct_many() in isolation, it looks like it
> >   would be okay to call it in a situation where refcount_sub_and_test()
> >   might return 'true'.
> > - Similarly, if you look at tryget_task_struct(), you would assume that
> >   you are allowed to call this method for a grace period after 'users'
> >   hitting zero. (If not, why does it exist?)
> > But if two different kernel developers anywhere in the kernel make these
> > conflicting assumptions at any point in the future, then the combination
> > of their code may lead to a use-after-free if there is any way for them
> > to interact via the same 'struct task_struct'.
> > 
> > Thus, as a defensive measure, we should either make
> > put_task_struct_many() use call_rcu(), or we should delete
> > tryget_task_struct(). This patch suggests the former because it does not
> > change anything for any callers that exist today. (As argued previously,
> > the body of the 'if' statement is dead code in the kernel today.)
> > 
> > The comment in put_task_struct() is also updated so that nobody changes
> > its implementation to only use call_rcu() under PREEMPT_RT in the
> > future. The current comment suggests that would be a legal change, but
> > it is similarly incompatible with anyone using tryget_task_struct().
> > 
> > Signed-off-by: Alice Ryhl <aliceryhl@google.com>
> > ---
> > Including sched-ext and io-uring in the cc list as they are the only
> > users of tryget_task_struct() and put_task_struct_many() respectively.
> 
> For sched_ext I think we should be already protected by scx_tasks_lock.
> 
> From kernel/sched/core.c:
> 
>   finish_task_switch():
>       if (prev_state == TASK_DEAD) {
>           prev->sched_class->task_dead(prev);
>           sched_ext_dead(prev);
>           cgroup_task_dead(prev);
>           put_task_stack(prev);
>           ...
>           put_task_struct_rcu_user(prev);
>       }
> 
>  And sched_ext_dead() in kernel/sched/ext.c:
> 
>   scoped_guard(raw_spinlock_irqsave, &scx_tasks_lock) {
>       list_del_init(&p->scx.tasks_node);
>       ...
>   }
> 
> Now on the sched_ext iter side:
> 
>   scx_task_iter_start();                     /* takes scx_tasks_lock */
>   while ((p = scx_task_iter_next_locked()))
>       if (!tryget_task_struct(p))            /* still under scx_tasks_lock */
>          ...
> 
> So, the locking gives us the invariant: while the iter holds scx_tasks_lock and
> observes p on the list, sched_ext_dead(p) cannot have completed.

Correct me if I'm wrong, but this sounds like you don't need the tryget
variant. The 'users' counter is guaranteed to be non-zero for one grace
period after put_task_struct_rcu_user(prev).
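
A minimal sketch of that window, assuming the rcu_users scheme
described earlier in the thread (all names below are invented for the
model, not real kernel code):

```c
/*
 * Single-threaded userspace model of the window above: between
 * put_task_struct_rcu_user() and the end of the grace period, 'usage'
 * is still >= 1, so an unconditional get suffices and no tryget is
 * needed.
 */
#include <assert.h>
#include <stdbool.h>

struct tsk_model {
	int usage;
	int rcu_users;
	bool freed;
	bool cb_pending;
};

static void model_put_rcu_user(struct tsk_model *t)
{
	if (--t->rcu_users == 0)
		t->cb_pending = true;	/* defer the final usage drop */
}

static void model_grace_period(struct tsk_model *t)
{
	if (t->cb_pending) {
		t->cb_pending = false;
		if (--t->usage == 0)
			t->freed = true;
	}
}

/* Models plain get_task_struct(): an unconditional increment, valid
 * while 'usage' is known non-zero, e.g. inside that window. */
static void model_get(struct tsk_model *t)
{
	assert(t->usage > 0);
	t->usage++;
}
```

A reader that takes its reference inside the window keeps the task
alive past the grace period without ever needing the tryget variant.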

Alice

> And the css_task_iter paths have the analogous ordering.
> 
> That said, I think this patch still makes sense to provide a consistent
> semantics between put_task_struct() and put_task_struct_many(), as mentioned by
> Sebastian. So, maybe reword the message around consistency rather than UAF?
> 
> Thanks,
> -Andrea
> 
> > ---
> >  include/linux/sched/task.h | 24 +++++++++++++++---------
> >  1 file changed, 15 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> > index 41ed884cffc9..da2fbd17b676 100644
> > --- a/include/linux/sched/task.h
> > +++ b/include/linux/sched/task.h
> > @@ -131,19 +131,25 @@ static inline void put_task_struct(struct task_struct *t)
> >  		return;
> >  
> >  	/*
> > -	 * Under PREEMPT_RT, we can't call __put_task_struct
> > -	 * in atomic context because it will indirectly
> > -	 * acquire sleeping locks. The same is true if the
> > -	 * current process has a mutex enqueued (blocked on
> > -	 * a PI chain).
> > +	 * Delay __put_task_struct() for one grace period so
> > +	 * that tryget_task_struct() may be used for one
> > +	 * grace period after any call to put_task_struct().
> >  	 *
> > -	 * In !RT, it is always safe to call __put_task_struct().
> > -	 * Though, in order to simplify the code, resort to the
> > -	 * deferred call too.
> > +	 * This also has the benefit of making it legal to
> > +	 * call put_task_struct() in atomic context. We
> > +	 * can't do that under PREEMPT_RT because it will
> > +	 * indirectly acquire sleeping locks. The same is
> > +	 * true if the current process has a mutex enqueued
> > +	 * (blocked on a PI chain).
> >  	 *
> >  	 * call_rcu() will schedule __put_task_struct_rcu_cb()
> >  	 * to be called in process context.
> >  	 *
> > +	 * In !RT, it is safe to call __put_task_struct()
> > +	 * from atomic context, but we still need to delay
> > +	 * cleanup for a grace period to accommodate
> > +	 * tryget_task_struct() callers.
> > +	 *
> >  	 * __put_task_struct() is called when
> >  	 * refcount_dec_and_test(&t->usage) succeeds.
> >  	 *
> > @@ -164,7 +170,7 @@ DEFINE_FREE(put_task, struct task_struct *, if (_T) put_task_struct(_T))
> >  static inline void put_task_struct_many(struct task_struct *t, int nr)
> >  {
> >  	if (refcount_sub_and_test(nr, &t->usage))
> > -		__put_task_struct(t);
> > +		call_rcu(&t->rcu, __put_task_struct_rcu_cb);
> >  }
> >  
> >  void put_task_struct_rcu_user(struct task_struct *task);
> > 
> > ---
> > base-commit: 7fd2df204f342fc17d1a0bfcd474b24232fb0f32
> > change-id: 20260508-put-task-struct-many-5b5b2f4ae174
> > 
> > Best regards,
> > -- 
> > Alice Ryhl <aliceryhl@google.com>
> > 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched/task: always defer 'struct task_struct' destruction via RCU
  2026-05-10 13:41   ` Alice Ryhl
@ 2026-05-10 18:36     ` Andrea Righi
  0 siblings, 0 replies; 6+ messages in thread
From: Andrea Righi @ 2026-05-10 18:36 UTC (permalink / raw)
  To: Alice Ryhl
  Cc: Paul E. McKenney, Boqun Feng, Changwoo Min, Clark Williams,
	David Vernet, Frederic Weisbecker, Ingo Molnar, Jens Axboe,
	Joel Fernandes, Josh Triplett, Lai Jiangshan, Mathieu Desnoyers,
	Neeraj Upadhyay, Peter Zijlstra, Sebastian Andrzej Siewior,
	Steven Rostedt, Tejun Heo, Uladzislau Rezki, Zqiang, io-uring,
	rcu, sched-ext, linux-kernel, linux-rt-devel

On Sun, May 10, 2026 at 01:41:27PM +0000, Alice Ryhl wrote:
> On Fri, May 08, 2026 at 11:38:18PM +0200, Andrea Righi wrote:
> > Hi Alice,
> > 
> > On Fri, May 08, 2026 at 02:02:45PM +0000, Alice Ryhl wrote:
> > > The sched/task.h header file currently exposes a tryget_task_struct()
> > > function, but it is very risky to use it: If the last refcount of the
> > > task is dropped using put_task_struct_many(), then the task is freed
> > > right away without an RCU grace period.
> > > 
> > > This means that if the kernel contains a code path anywhere such that
> > > the last refcount of a task may be dropped with put_task_struct_many(),
> > > and it also contains a code path anywhere that tries to stash a task
> > > pointer under rcu and use tryget_task_struct() on it, then if they ever
> > > execute on the same 'struct task_struct', it results in a
> > > use-after-free.
> > > 
> > > The above applies even if the RCU user drops its own task reference with
> > > put_task_struct(), because if that is not the last reference, then it's
> > > possible for another thread to invoke put_task_struct_many() and free
> > > the task less than a grace period after the RCU user called
> > > put_task_struct().
> > > 
> > > There does not appear to be an actual problem in the kernel tree right
> > > now because there are no in-tree users of put_task_struct_many() where
> > > refcount_sub_and_test() might return 'true'. Io-uring invokes the
> > > function from task work while the task is still running, so it will not
> > > decrement it all the way to zero. (Note that if I'm wrong about this,
> > > then it's probably possible to trigger UAF by combining this codepath in
> > > io-uring with the tryget_task_struct() call in sched-ext.)
> > > 
> > > However, the current situation is fragile and error-prone.
> > > - If you look at put_task_struct_many() in isolation, it looks like it
> > >   would be okay to call it in a situation where refcount_sub_and_test()
> > >   might return 'true'.
> > > - Similarly, if you look at tryget_task_struct(), you would assume that
> > >   you are allowed to call this method for a grace period after 'users'
> > >   hitting zero. (If not, why does it exist?)
> > > But if two different kernel developers anywhere in the kernel make these
> > > conflicting assumptions at any point in the future, then the combination
> > > of their code may lead to a use-after-free if there is any way for them
> > > to interact via the same 'struct task_struct'.
> > > 
> > > Thus, as a defensive measure, we should either make
> > > put_task_struct_many() use call_rcu(), or we should delete
> > > tryget_task_struct(). This patch suggests the former because it does not
> > > change anything for any callers that exist today. (As argued previously,
> > > the body of the 'if' statement is dead code in the kernel today.)
> > > 
> > > The comment in put_task_struct() is also updated so that nobody changes
> > > its implementation to only use call_rcu() under PREEMPT_RT in the
> > > future. The current comment suggests that would be a legal change, but
> > > it is similarly incompatible with anyone using tryget_task_struct().
> > > 
> > > Signed-off-by: Alice Ryhl <aliceryhl@google.com>
> > > ---
> > > Including sched-ext and io-uring in the cc list as they are the only
> > > users of tryget_task_struct() and put_task_struct_many() respectively.
> > 
> > For sched_ext I think we should be already protected by scx_tasks_lock.
> > 
> > From kernel/sched/core.c:
> > 
> >   finish_task_switch():
> >       if (prev_state == TASK_DEAD) {
> >           prev->sched_class->task_dead(prev);
> >           sched_ext_dead(prev);
> >           cgroup_task_dead(prev);
> >           put_task_stack(prev);
> >           ...
> >           put_task_struct_rcu_user(prev);
> >       }
> > 
> >  And sched_ext_dead() in kernel/sched/ext.c:
> > 
> >   scoped_guard(raw_spinlock_irqsave, &scx_tasks_lock) {
> >       list_del_init(&p->scx.tasks_node);
> >       ...
> >   }
> > 
> > Now on the sched_ext iter side:
> > 
> >   scx_task_iter_start();                     /* takes scx_tasks_lock */
> >   while ((p = scx_task_iter_next_locked()))
> >       if (!tryget_task_struct(p))            /* still under scx_tasks_lock */
> >          ...
> > 
> > So, the locking gives us the invariant: while the iter holds scx_tasks_lock and
> > observes p on the list, sched_ext_dead(p) cannot have completed.
> 
> Correct me if I'm wrong, but this sounds like you don't need the tryget
> variant. The 'users' counter is guaranteed to be non-zero for one grace
> period after put_task_struct_rcu_user(prev).

Correct, I think we can just get rid of tryget and use get_task_struct().
I'll run some stress tests with this change.

-Andrea

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-05-10 18:36 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2026-05-08 14:02 [PATCH] sched/task: always defer 'struct task_struct' destruction via RCU Alice Ryhl
2026-05-08 20:01 ` Sebastian Andrzej Siewior
2026-05-09  7:18   ` Alice Ryhl
2026-05-08 21:38 ` Andrea Righi
2026-05-10 13:41   ` Alice Ryhl
2026-05-10 18:36     ` Andrea Righi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox