public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCH 0/2] fix task_work interation with freezing
@ 2024-07-07 16:32 Pavel Begunkov
  2024-07-07 16:32 ` [PATCH 1/2] io_uring/io-wq: limit retrying worker initialisation Pavel Begunkov
  2024-07-07 16:32 ` [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() Pavel Begunkov
  0 siblings, 2 replies; 20+ messages in thread
From: Pavel Begunkov @ 2024-07-07 16:32 UTC (permalink / raw)
  To: io-uring
  Cc: Jens Axboe, asml.silence, Oleg Nesterov, Andrew Morton,
	Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel

It's reported [1] that a task_work queued at a wrong time can prevent
freezing and make the tasks to spin in get_signal() taking 100%
of CPU. Patch 1 is a preparation. Patch 2 addresses the issue.

[1] https://github.com/systemd/systemd/issues/33626

Pavel Begunkov (2):
  io_uring/io-wq: limit retrying worker initialisation
  signal: rerun task_work while freezing

 io_uring/io-wq.c | 10 +++++++---
 kernel/signal.c  |  4 ++++
 2 files changed, 11 insertions(+), 3 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] io_uring/io-wq: limit retrying worker initialisation
  2024-07-07 16:32 [PATCH 0/2] fix task_work interation with freezing Pavel Begunkov
@ 2024-07-07 16:32 ` Pavel Begunkov
  2024-07-07 16:32 ` [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() Pavel Begunkov
  1 sibling, 0 replies; 20+ messages in thread
From: Pavel Begunkov @ 2024-07-07 16:32 UTC (permalink / raw)
  To: io-uring
  Cc: Jens Axboe, asml.silence, Oleg Nesterov, Andrew Morton,
	Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel,
	Julian Orth

If io-wq worker creation fails, we retry it by queueing up a task_work.
tasK_work is needed because it should be done from the user process
context. The problem is that retries are not limited, and if queueing a
task_work is the reason for the failure, we might get into an infinite
loop.

It doesn't seem to happen now but it will with the following patch
executing task_work in the freezing loop. For now, arbitrarily limit the
number of attempts to create a worker.

Cc: [email protected]
Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals")
Reported-by: Julian Orth <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io-wq.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 913c92249522..f1e7c670add8 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -23,6 +23,7 @@
 #include "io_uring.h"
 
 #define WORKER_IDLE_TIMEOUT	(5 * HZ)
+#define WORKER_INIT_LIMIT	3
 
 enum {
 	IO_WORKER_F_UP		= 0,	/* up and active */
@@ -58,6 +59,7 @@ struct io_worker {
 
 	unsigned long create_state;
 	struct callback_head create_work;
+	int init_retries;
 
 	union {
 		struct rcu_head rcu;
@@ -745,7 +747,7 @@ static bool io_wq_work_match_all(struct io_wq_work *work, void *data)
 	return true;
 }
 
-static inline bool io_should_retry_thread(long err)
+static inline bool io_should_retry_thread(struct io_worker *worker, long err)
 {
 	/*
 	 * Prevent perpetual task_work retry, if the task (or its group) is
@@ -753,6 +755,8 @@ static inline bool io_should_retry_thread(long err)
 	 */
 	if (fatal_signal_pending(current))
 		return false;
+	if (worker->init_retries++ >= WORKER_INIT_LIMIT)
+		return false;
 
 	switch (err) {
 	case -EAGAIN:
@@ -779,7 +783,7 @@ static void create_worker_cont(struct callback_head *cb)
 		io_init_new_worker(wq, worker, tsk);
 		io_worker_release(worker);
 		return;
-	} else if (!io_should_retry_thread(PTR_ERR(tsk))) {
+	} else if (!io_should_retry_thread(worker, PTR_ERR(tsk))) {
 		struct io_wq_acct *acct = io_wq_get_acct(worker);
 
 		atomic_dec(&acct->nr_running);
@@ -846,7 +850,7 @@ static bool create_io_worker(struct io_wq *wq, int index)
 	tsk = create_io_thread(io_wq_worker, worker, NUMA_NO_NODE);
 	if (!IS_ERR(tsk)) {
 		io_init_new_worker(wq, worker, tsk);
-	} else if (!io_should_retry_thread(PTR_ERR(tsk))) {
+	} else if (!io_should_retry_thread(worker, PTR_ERR(tsk))) {
 		kfree(worker);
 		goto fail;
 	} else {
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-07 16:32 [PATCH 0/2] fix task_work interation with freezing Pavel Begunkov
  2024-07-07 16:32 ` [PATCH 1/2] io_uring/io-wq: limit retrying worker initialisation Pavel Begunkov
@ 2024-07-07 16:32 ` Pavel Begunkov
  2024-07-08 10:42   ` Oleg Nesterov
  1 sibling, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2024-07-07 16:32 UTC (permalink / raw)
  To: io-uring
  Cc: Jens Axboe, asml.silence, Oleg Nesterov, Andrew Morton,
	Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel,
	Julian Orth

io_uring can asynchronously add a task_work while the task is getting
freezed. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in
do_freezer_trap(), and since the get_signal()'s relock loop doesn't
retry task_work, the task will spin there not being able to sleep
until the freezing is cancelled / the task is killed / etc.

Cc: [email protected]
Link: https://github.com/systemd/systemd/issues/33626
Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals")
Reported-by: Julian Orth <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
---
 kernel/signal.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/signal.c b/kernel/signal.c
index 1f9dd41c04be..790d60fcfff0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2694,6 +2694,10 @@ bool get_signal(struct ksignal *ksig)
 	try_to_freeze();
 
 relock:
+	clear_notify_signal();
+	if (unlikely(task_work_pending(current)))
+		task_work_run();
+
 	spin_lock_irq(&sighand->siglock);
 
 	/*
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-07 16:32 ` [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() Pavel Begunkov
@ 2024-07-08 10:42   ` Oleg Nesterov
  2024-07-08 15:40     ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2024-07-08 10:42 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: io-uring, Jens Axboe, Andrew Morton, Christian Brauner,
	Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth

On 07/07, Pavel Begunkov wrote:
>
> io_uring can asynchronously add a task_work while the task is getting
> freezed. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in
> do_freezer_trap(), and since the get_signal()'s relock loop doesn't
> retry task_work, the task will spin there not being able to sleep
> until the freezing is cancelled / the task is killed / etc.
> 
> Cc: [email protected]
> Link: https://github.com/systemd/systemd/issues/33626
> Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals")

I don't think we should blame io_uring even if so far it is the only user
of TWA_SIGNAL.

Perhaps we should change do_freezer_trap() somehow, not sure... It assumes
that TIF_SIGPENDING is the only reason to not sleep in TASK_INTERRUPTIBLE,
today this is not true.

> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2694,6 +2694,10 @@ bool get_signal(struct ksignal *ksig)
>  	try_to_freeze();
>  
>  relock:
> +	clear_notify_signal();
> +	if (unlikely(task_work_pending(current)))
> +		task_work_run();
> +
>  	spin_lock_irq(&sighand->siglock);

Well, but can't we kill the same code at the start of get_signal() then?
Of course, in this case get_signal() should check signal_pending(), not
task_sigpending().

Or perhaps something like the patch below makes more sense? I dunno...

Oleg.

diff --git a/kernel/signal.c b/kernel/signal.c
index 1f9dd41c04be..e2ae85293fbb 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2676,6 +2676,7 @@ bool get_signal(struct ksignal *ksig)
 	struct signal_struct *signal = current->signal;
 	int signr;
 
+start:
 	clear_notify_signal();
 	if (unlikely(task_work_pending(current)))
 		task_work_run();
@@ -2760,10 +2761,11 @@ bool get_signal(struct ksignal *ksig)
 			if (current->jobctl & JOBCTL_TRAP_MASK) {
 				do_jobctl_trap();
 				spin_unlock_irq(&sighand->siglock);
+				goto relock;
 			} else if (current->jobctl & JOBCTL_TRAP_FREEZE)
 				do_freezer_trap();
-
-			goto relock;
+				goto start;
+			}
 		}
 
 		/*


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-08 10:42   ` Oleg Nesterov
@ 2024-07-08 15:40     ` Pavel Begunkov
  2024-07-08 18:48       ` Peter Zijlstra
  2024-07-09 10:36       ` Oleg Nesterov
  0 siblings, 2 replies; 20+ messages in thread
From: Pavel Begunkov @ 2024-07-08 15:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: io-uring, Jens Axboe, Andrew Morton, Christian Brauner,
	Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth,
	Tejun Heo, Peter Zijlstra

On 7/8/24 11:42, Oleg Nesterov wrote:
> On 07/07, Pavel Begunkov wrote:
>>
>> io_uring can asynchronously add a task_work while the task is getting
>> freezed. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in
>> do_freezer_trap(), and since the get_signal()'s relock loop doesn't
>> retry task_work, the task will spin there not being able to sleep
>> until the freezing is cancelled / the task is killed / etc.
>>
>> Cc: [email protected]
>> Link: https://github.com/systemd/systemd/issues/33626
>> Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals")
> 
> I don't think we should blame io_uring even if so far it is the only user
> of TWA_SIGNAL.

And it's not entirely correct even for backporting purposes,
I'll pin it to when freezing was introduced then.

> Perhaps we should change do_freezer_trap() somehow, not sure... It assumes
> that TIF_SIGPENDING is the only reason to not sleep in TASK_INTERRUPTIBLE,
> today this is not true.

Let's CC Peter Zijlstra and Tejun in case they might have
some input on that.

Link to this patch for convenience:
https://lore.kernel.org/all/1d935e9d87fd8672ef3e8a9a0db340d355ea08b4.1720368770.git.asml.silence@gmail.com/

>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2694,6 +2694,10 @@ bool get_signal(struct ksignal *ksig)
>>   	try_to_freeze();
>>   
>>   relock:
>> +	clear_notify_signal();
>> +	if (unlikely(task_work_pending(current)))
>> +		task_work_run();
>> +
>>   	spin_lock_irq(&sighand->siglock);
> 
> Well, but can't we kill the same code at the start of get_signal() then?
> Of course, in this case get_signal() should check signal_pending(), not
> task_sigpending().

Should be fine, but I didn't want to change the
try_to_freeze() -> __refrigerator() path, which also reschedules.

> Or perhaps something like the patch below makes more sense? I dunno...

It needs a far backporting, I'd really prefer to keep it
lean and without more side effects if possible, unless
there is a strong opinion on that.

> Oleg.
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 1f9dd41c04be..e2ae85293fbb 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2676,6 +2676,7 @@ bool get_signal(struct ksignal *ksig)
>   	struct signal_struct *signal = current->signal;
>   	int signr;
>   
> +start:
>   	clear_notify_signal();
>   	if (unlikely(task_work_pending(current)))
>   		task_work_run();
> @@ -2760,10 +2761,11 @@ bool get_signal(struct ksignal *ksig)
>   			if (current->jobctl & JOBCTL_TRAP_MASK) {
>   				do_jobctl_trap();
>   				spin_unlock_irq(&sighand->siglock);
> +				goto relock;
>   			} else if (current->jobctl & JOBCTL_TRAP_FREEZE)
>   				do_freezer_trap();
> -
> -			goto relock;
> +				goto start;
> +			}
>   		}
>   
>   		/*
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-08 15:40     ` Pavel Begunkov
@ 2024-07-08 18:48       ` Peter Zijlstra
  2024-07-09 10:36       ` Oleg Nesterov
  1 sibling, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2024-07-08 18:48 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Oleg Nesterov, io-uring, Jens Axboe, Andrew Morton,
	Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel,
	Julian Orth, Tejun Heo

On Mon, Jul 08, 2024 at 04:40:07PM +0100, Pavel Begunkov wrote:

> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -2694,6 +2694,10 @@ bool get_signal(struct ksignal *ksig)
> > >   	try_to_freeze();
> > >   relock:
> > > +	clear_notify_signal();
> > > +	if (unlikely(task_work_pending(current)))
> > > +		task_work_run();
> > > +
> > >   	spin_lock_irq(&sighand->siglock);
> > 
> > Well, but can't we kill the same code at the start of get_signal() then?
> > Of course, in this case get_signal() should check signal_pending(), not
> > task_sigpending().
> 
> Should be fine, but I didn't want to change the
> try_to_freeze() -> __refrigerator() path, which also reschedules.
> 
> > Or perhaps something like the patch below makes more sense? I dunno...
> 
> It needs a far backporting, I'd really prefer to keep it
> lean and without more side effects if possible, unless
> there is a strong opinion on that.

It's been a minute since I dug my way through the signal code, but I
think I slightly favour Oleg's version for not duplicating that
task_work_run().


> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 1f9dd41c04be..e2ae85293fbb 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -2676,6 +2676,7 @@ bool get_signal(struct ksignal *ksig)
> >   	struct signal_struct *signal = current->signal;
> >   	int signr;
> > +start:
> >   	clear_notify_signal();
> >   	if (unlikely(task_work_pending(current)))
> >   		task_work_run();
> > @@ -2760,10 +2761,11 @@ bool get_signal(struct ksignal *ksig)
> >   			if (current->jobctl & JOBCTL_TRAP_MASK) {
> >   				do_jobctl_trap();
> >   				spin_unlock_irq(&sighand->siglock);
> > +				goto relock;
> >   			} else if (current->jobctl & JOBCTL_TRAP_FREEZE)
> >   				do_freezer_trap();
> > -
> > -			goto relock;
> > +				goto start;
> > +			}
> >   		}
> >   		/*
> > 
> 
> -- 
> Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-08 15:40     ` Pavel Begunkov
  2024-07-08 18:48       ` Peter Zijlstra
@ 2024-07-09 10:36       ` Oleg Nesterov
  2024-07-09 14:05         ` Pavel Begunkov
  1 sibling, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2024-07-09 10:36 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: io-uring, Jens Axboe, Andrew Morton, Christian Brauner,
	Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth,
	Tejun Heo, Peter Zijlstra

On 07/08, Pavel Begunkov wrote:
>
> On 7/8/24 11:42, Oleg Nesterov wrote:
> >I don't think we should blame io_uring even if so far it is the only user
> >of TWA_SIGNAL.
>
> And it's not entirely correct even for backporting purposes,
> I'll pin it to when freezing was introduced then.

This is another problem introduced by 12db8b690010 ("entry: Add support for
TIF_NOTIFY_SIGNAL")

We need much more changes. Say, zap_threads() does the same and assumes
that only SIGKILL or freezeing can make dump_interrupted() true.

There are more similar problems. I'll try to think, so far I do not see
a simple solution...

As for this particular problem, I agree it needs a simple/backportable fix.

> >>  relock:
> >>+	clear_notify_signal();
> >>+	if (unlikely(task_work_pending(current)))
> >>+		task_work_run();
> >>+
> >>  	spin_lock_irq(&sighand->siglock);
> >
> >Well, but can't we kill the same code at the start of get_signal() then?
> >Of course, in this case get_signal() should check signal_pending(), not
> >task_sigpending().
>
> Should be fine,

Well, not really at least performance-wise... get_signal() should return
asap if TIF_NOTIFY_SIGNAL was the only reason to call get_signal().

> but I didn't want to change the
> try_to_freeze() -> __refrigerator() path, which also reschedules.

Could you spell please?

> >Or perhaps something like the patch below makes more sense? I dunno...
>
> It needs a far backporting, I'd really prefer to keep it
> lean and without more side effects if possible, unless
> there is a strong opinion on that.

Well, I don't think my patch is really worse in this sense. Just it
is buggy ;) it needs another recalc_sigpending() before goto start,
so lets forget it.

So I am starting to agree with your change as a workaround until we
find a clean solution (if ever ;).

But can I ask you to add this additional clear_notify_signal() +
task_work_run() to the end of do_freezer_trap() ? get_signal() is
already a mess...


-----------------------------------------------------------------------
Either way I have no idea whether a cgroup_task_frozen() task should
react to task_work_add(TWA_SIGNAL) or not.

Documentation/admin-guide/cgroup-v2.rst says

	Writing "1" to the file causes freezing of the cgroup and all
	descendant cgroups. This means that all belonging processes will
	be stopped and will not run until the cgroup will be explicitly
	unfrozen.

AFAICS this is not accurate, they can run but can't return to user-mode.
So I guess task_work_run() is fine.

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-09 10:36       ` Oleg Nesterov
@ 2024-07-09 14:05         ` Pavel Begunkov
  2024-07-09 16:39           ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2024-07-09 14:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: io-uring, Jens Axboe, Andrew Morton, Christian Brauner,
	Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth,
	Tejun Heo, Peter Zijlstra

On 7/9/24 11:36, Oleg Nesterov wrote:
> On 07/08, Pavel Begunkov wrote:
>>
>> On 7/8/24 11:42, Oleg Nesterov wrote:
>>> I don't think we should blame io_uring even if so far it is the only user
>>> of TWA_SIGNAL.
>>
>> And it's not entirely correct even for backporting purposes,
>> I'll pin it to when freezing was introduced then.
> 
> This is another problem introduced by 12db8b690010 ("entry: Add support for
> TIF_NOTIFY_SIGNAL")

Ah, yes, I forgot NOTIFY_SIGNAL was split out of SIGPENDING

> We need much more changes. Say, zap_threads() does the same and assumes
> that only SIGKILL or freezeing can make dump_interrupted() true.
> 
> There are more similar problems. I'll try to think, so far I do not see
> a simple solution...

Thanks. And there was some patching done before against dumping
being interrupted by task_work, indeed a reoccurring issue.


> As for this particular problem, I agree it needs a simple/backportable fix.
> 
>>>>   relock:
>>>> +	clear_notify_signal();
>>>> +	if (unlikely(task_work_pending(current)))
>>>> +		task_work_run();
>>>> +
>>>>   	spin_lock_irq(&sighand->siglock);
>>>
>>> Well, but can't we kill the same code at the start of get_signal() then?
>>> Of course, in this case get_signal() should check signal_pending(), not
>>> task_sigpending().
>>
>> Should be fine,
> 
> Well, not really at least performance-wise... get_signal() should return
> asap if TIF_NOTIFY_SIGNAL was the only reason to call get_signal().
> 
>> but I didn't want to change the
>> try_to_freeze() -> __refrigerator() path, which also reschedules.
> 
> Could you spell please?

Let's say it calls get_signal() for freezing with a task_work pending.
Currently, it executes task_work and calls try_to_freeze(), which
puts the task to sleep. If we remove that task_work_run() before
try_to_freeze(), it would not be able to sleep. Sounds like it should
be fine, it races anyway, but I'm trying to avoid side effect for fixes.

>>> Or perhaps something like the patch below makes more sense? I dunno...
>>
>> It needs a far backporting, I'd really prefer to keep it
>> lean and without more side effects if possible, unless
>> there is a strong opinion on that.
> 
> Well, I don't think my patch is really worse in this sense. Just it
> is buggy ;) it needs another recalc_sigpending() before goto start,
> so lets forget it.
> 
> So I am starting to agree with your change as a workaround until we
> find a clean solution (if ever ;).
> 
> But can I ask you to add this additional clear_notify_signal() +
> task_work_run() to the end of do_freezer_trap() ? get_signal() is
> already a mess...

Will change

> -----------------------------------------------------------------------
> Either way I have no idea whether a cgroup_task_frozen() task should
> react to task_work_add(TWA_SIGNAL) or not.
> 
> Documentation/admin-guide/cgroup-v2.rst says
> 
> 	Writing "1" to the file causes freezing of the cgroup and all
> 	descendant cgroups. This means that all belonging processes will
> 	be stopped and will not run until the cgroup will be explicitly
> 	unfrozen.
> 
> AFAICS this is not accurate, they can run but can't return to user-mode.
> So I guess task_work_run() is fine.

IIUC it's a user facing doc, so maybe it's accurate enough from that
perspective. But I do agree that the semantics around task_work is
not exactly clear.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-09 14:05         ` Pavel Begunkov
@ 2024-07-09 16:39           ` Tejun Heo
  2024-07-09 19:07             ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2024-07-09 16:39 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Oleg Nesterov, io-uring, Jens Axboe, Andrew Morton,
	Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel,
	Julian Orth, Peter Zijlstra

Hello,

On Tue, Jul 09, 2024 at 03:05:21PM +0100, Pavel Begunkov wrote:
> > -----------------------------------------------------------------------
> > Either way I have no idea whether a cgroup_task_frozen() task should
> > react to task_work_add(TWA_SIGNAL) or not.
> > 
> > Documentation/admin-guide/cgroup-v2.rst says
> > 
> > 	Writing "1" to the file causes freezing of the cgroup and all
> > 	descendant cgroups. This means that all belonging processes will
> > 	be stopped and will not run until the cgroup will be explicitly
> > 	unfrozen.
> > 
> > AFAICS this is not accurate, they can run but can't return to user-mode.
> > So I guess task_work_run() is fine.
> 
> IIUC it's a user facing doc, so maybe it's accurate enough from that
> perspective. But I do agree that the semantics around task_work is
> not exactly clear.

A good correctness test for cgroup freezer is whether it'd be safe to
snapshot and restore the tasks in the cgroup while frozen. If that works,
it's most likely fine.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-09 16:39           ` Tejun Heo
@ 2024-07-09 19:07             ` Oleg Nesterov
  2024-07-09 19:26               ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2024-07-09 19:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Begunkov, io-uring, Jens Axboe, Andrew Morton,
	Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel,
	Julian Orth, Peter Zijlstra

Hi Tejun,

Thanks for looking at this, can you review this V2 patch from Pavel?
To me it makes sense even without 1/2 which I didn't even bother to
read. At least as a simple workaround for now.

On 07/09, Tejun Heo wrote:
>
> Hello,
>
> On Tue, Jul 09, 2024 at 03:05:21PM +0100, Pavel Begunkov wrote:
> > > -----------------------------------------------------------------------
> > > Either way I have no idea whether a cgroup_task_frozen() task should
> > > react to task_work_add(TWA_SIGNAL) or not.
> > >
> > > Documentation/admin-guide/cgroup-v2.rst says
> > >
> > > 	Writing "1" to the file causes freezing of the cgroup and all
> > > 	descendant cgroups. This means that all belonging processes will
> > > 	be stopped and will not run until the cgroup will be explicitly
> > > 	unfrozen.
> > >
> > > AFAICS this is not accurate, they can run but can't return to user-mode.
> > > So I guess task_work_run() is fine.
> >
> > IIUC it's a user facing doc, so maybe it's accurate enough from that
> > perspective. But I do agree that the semantics around task_work is
> > not exactly clear.
>
> A good correctness test for cgroup freezer is whether it'd be safe to
> snapshot and restore the tasks in the cgroup while frozen.

Well, I don't really understand what can snapshot/restore actually mean...

I forgot everything about cgroup freezer and I am already sleeping, but even
if we forget about task_work_add/TIF_NOTIFY_SIGNAL/etc, afaics ptrace can
change the state of cgroup_task_frozen() task between snapshot and restore ?

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-09 19:07             ` Oleg Nesterov
@ 2024-07-09 19:26               ` Pavel Begunkov
  2024-07-09 19:38                 ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2024-07-09 19:26 UTC (permalink / raw)
  To: Oleg Nesterov, Tejun Heo
  Cc: io-uring, Jens Axboe, Andrew Morton, Christian Brauner,
	Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth,
	Peter Zijlstra

On 7/9/24 20:07, Oleg Nesterov wrote:
> Hi Tejun,
> 
> Thanks for looking at this, can you review this V2 patch from Pavel?
> To me it makes sense even without 1/2 which I didn't even bother to
> read. At least as a simple workaround for now.

They are kind of separate but without 1/2 this patch creates
another infinite loop, even though it's harder to hit and
is io_uring specific.

  
> On 07/09, Tejun Heo wrote:
>>
>> Hello,
>>
>> On Tue, Jul 09, 2024 at 03:05:21PM +0100, Pavel Begunkov wrote:
>>>> -----------------------------------------------------------------------
>>>> Either way I have no idea whether a cgroup_task_frozen() task should
>>>> react to task_work_add(TWA_SIGNAL) or not.
>>>>
>>>> Documentation/admin-guide/cgroup-v2.rst says
>>>>
>>>> 	Writing "1" to the file causes freezing of the cgroup and all
>>>> 	descendant cgroups. This means that all belonging processes will
>>>> 	be stopped and will not run until the cgroup will be explicitly
>>>> 	unfrozen.
>>>>
>>>> AFAICS this is not accurate, they can run but can't return to user-mode.
>>>> So I guess task_work_run() is fine.
>>>
>>> IIUC it's a user facing doc, so maybe it's accurate enough from that
>>> perspective. But I do agree that the semantics around task_work is
>>> not exactly clear.
>>
>> A good correctness test for cgroup freezer is whether it'd be safe to
>> snapshot and restore the tasks in the cgroup while frozen.
> 
> Well, I don't really understand what can snapshot/restore actually mean...

CRIU, I assume. I'll try it ...

> I forgot everything about cgroup freezer and I am already sleeping, but even
> if we forget about task_work_add/TIF_NOTIFY_SIGNAL/etc, afaics ptrace can
> change the state of cgroup_task_frozen() task between snapshot and restore ?

... but I'm inclined to think the patch makes sense regardless,
we're replacing an infinite loop with wait-wake-execute-wait.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-09 19:26               ` Pavel Begunkov
@ 2024-07-09 19:38                 ` Oleg Nesterov
  2024-07-09 19:55                   ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2024-07-09 19:38 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Tejun Heo, io-uring, Jens Axboe, Andrew Morton, Christian Brauner,
	Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth,
	Peter Zijlstra

On 07/09, Pavel Begunkov wrote:
>
> On 7/9/24 20:07, Oleg Nesterov wrote:
> >Hi Tejun,
> >
> >Thanks for looking at this, can you review this V2 patch from Pavel?

Just in case, I obviously meant our next (V2) patch

[PATCH v2 2/2] kernel: rerun task_work while freezing in get_signal()
https://lore.kernel.org/all/149ff5a762997c723880751e8a4019907a0b6457.1720534425.git.asml.silence@gmail.com/

> >Well, I don't really understand what can snapshot/restore actually mean...
>
> CRIU, I assume. I'll try it ...

Than I think we can forget about task_works and this patch. CRIU dumps
the tasks in TASK_TRACED state.

> ... but I'm inclined to think the patch makes sense regardless,
> we're replacing an infinite loop with wait-wake-execute-wait.

Agreed.

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-09 19:38                 ` Oleg Nesterov
@ 2024-07-09 19:55                   ` Pavel Begunkov
  2024-07-10  0:54                     ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2024-07-09 19:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tejun Heo, io-uring, Jens Axboe, Andrew Morton, Christian Brauner,
	Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth,
	Peter Zijlstra

On 7/9/24 20:38, Oleg Nesterov wrote:
> On 07/09, Pavel Begunkov wrote:
>>
>> On 7/9/24 20:07, Oleg Nesterov wrote:
>>> Hi Tejun,
>>>
>>> Thanks for looking at this, can you review this V2 patch from Pavel?
> 
> Just in case, I obviously meant our next (V2) patch
> 
> [PATCH v2 2/2] kernel: rerun task_work while freezing in get_signal()
> https://lore.kernel.org/all/149ff5a762997c723880751e8a4019907a0b6457.1720534425.git.asml.silence@gmail.com/
> 
>>> Well, I don't really understand what can snapshot/restore actually mean...
>>
>> CRIU, I assume. I'll try it ...
> 
> Than I think we can forget about task_works and this patch. CRIU dumps
> the tasks in TASK_TRACED state.

And would be hard to test, io_uring (the main source of task_work)
is not supported

(00.466022) Error (criu/proc_parse.c:477): Unknown shit 600 (anon_inode:[io_uring])
...
(00.467642) Unfreezing tasks into 1
(00.467656)     Unseizing 15488 into 1
(00.468149) Error (criu/cr-dump.c:2111): Dumping FAILED.


>> ... but I'm inclined to think the patch makes sense regardless,
>> we're replacing an infinite loop with wait-wake-execute-wait.
> 
> Agreed.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-09 19:55                   ` Pavel Begunkov
@ 2024-07-10  0:54                     ` Tejun Heo
  2024-07-10 17:53                       ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2024-07-10  0:54 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Oleg Nesterov, io-uring, Jens Axboe, Andrew Morton,
	Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel,
	Julian Orth, Peter Zijlstra

Hello,

On Tue, Jul 09, 2024 at 08:55:43PM +0100, Pavel Begunkov wrote:
...
> > > CRIU, I assume. I'll try it ...
> > 
> > Than I think we can forget about task_works and this patch. CRIU dumps
> > the tasks in TASK_TRACED state.
> 
> And would be hard to test, io_uring (the main source of task_work)
> is not supported
> 
> (00.466022) Error (criu/proc_parse.c:477): Unknown shit 600 (anon_inode:[io_uring])
> ...
> (00.467642) Unfreezing tasks into 1
> (00.467656)     Unseizing 15488 into 1
> (00.468149) Error (criu/cr-dump.c:2111): Dumping FAILED.

Yeah, the question is: If CRIU is to use cgroup freezer to freeze the tasks
and then go around tracing each to make dump, would the freezer be enough in
avoiding interim state changes? Using CRIU implementation is a bit arbitrary
but I think checkpoint-restart is a useful bar to measure what should stay
stable while a cgroup is frozen.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-10  0:54                     ` Tejun Heo
@ 2024-07-10 17:53                       ` Pavel Begunkov
  2024-07-10 19:10                         ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2024-07-10 17:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, io-uring, Jens Axboe, Andrew Morton,
	Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel,
	Julian Orth, Peter Zijlstra

On 7/10/24 01:54, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 09, 2024 at 08:55:43PM +0100, Pavel Begunkov wrote:
> ...
>>>> CRIU, I assume. I'll try it ...
>>>
>>> Than I think we can forget about task_works and this patch. CRIU dumps
>>> the tasks in TASK_TRACED state.
>>
>> And would be hard to test, io_uring (the main source of task_work)
>> is not supported
>>
>> (00.466022) Error (criu/proc_parse.c:477): Unknown shit 600 (anon_inode:[io_uring])
>> ...
>> (00.467642) Unfreezing tasks into 1
>> (00.467656)     Unseizing 15488 into 1
>> (00.468149) Error (criu/cr-dump.c:2111): Dumping FAILED.
> 
> Yeah, the question is: If CRIU is to use cgroup freezer to freeze the tasks
> and then go around tracing each to make dump, would the freezer be enough in
> avoiding interim state changes? Using CRIU implementation is a bit arbitrary
> but I think checkpoint-restart is a useful bar to measure what should stay
> stable while a cgroup is frozen.

Sounds like in the long run we might want to ignore task_work while
it's frozen, but hard to say for all task_work users.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-10 17:53                       ` Pavel Begunkov
@ 2024-07-10 19:10                         ` Oleg Nesterov
  2024-07-10 19:20                           ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2024-07-10 19:10 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Tejun Heo, io-uring, Jens Axboe, Andrew Morton, Christian Brauner,
	Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth,
	Peter Zijlstra

On 07/10, Pavel Begunkov wrote:
>
> On 7/10/24 01:54, Tejun Heo wrote:
> >Yeah, the question is: If CRIU is to use cgroup freezer to freeze the tasks
> >and then go around tracing each to make dump, would the freezer be enough in
> >avoiding interim state changes? Using CRIU implementation is a bit arbitrary
> >but I think checkpoint-restart is a useful bar to measure what should stay
> >stable while a cgroup is frozen.
>
> Sounds like in the long run we might want to ignore task_work while
> it's frozen,

Just in case, this is what I have in mind right now, but I am still not sure
and can't make a "clean" patch.

If nothing else. CRIU needs to attach and make this task TASK_TRACED, right?
And once the target task is traced, it won't react to task_work_add(TWA_SIGNAL).

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-10 19:10                         ` Oleg Nesterov
@ 2024-07-10 19:20                           ` Tejun Heo
  2024-07-10 21:34                             ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2024-07-10 19:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Pavel Begunkov, io-uring, Jens Axboe, Andrew Morton,
	Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel,
	Julian Orth, Peter Zijlstra

Hello,

On Wed, Jul 10, 2024 at 09:10:16PM +0200, Oleg Nesterov wrote:
...
> If nothing else. CRIU needs to attach and make this task TASK_TRACED, right?

Yeah, AFAIK, that's the only way to implement check-pointing for now.

> And once the target task is traced, it won't react to task_work_add(TWA_SIGNAL).

I don't know how task_work is being used but the requirement would be that
if a cgroup is frozen, task_works shouldn't be making state changes which
can't safely be replayed (e.g. by restarting the frozen syscalls). If
anything task_works do can just be reproduced by restarting the currently
in-flight and frozen syscalls, it should be okay to leave them running.
Otherwise, it'd be better to freeze them together. As this thing is kinda
difficult to reason about, it'd probably be easier to just freeze them
together if we can.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-10 19:20                           ` Tejun Heo
@ 2024-07-10 21:34                             ` Oleg Nesterov
  2024-07-10 22:01                               ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2024-07-10 21:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Begunkov, io-uring, Jens Axboe, Andrew Morton,
	Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel,
	Julian Orth, Peter Zijlstra

On 07/10, Tejun Heo wrote:
>
> Hello,
>
> On Wed, Jul 10, 2024 at 09:10:16PM +0200, Oleg Nesterov wrote:
> ...
> > If nothing else. CRIU needs to attach and make this task TASK_TRACED, right?
>
> Yeah, AFAIK, that's the only way to implement check-pointing for now.

OK,

> > And once the target task is traced, it won't react to task_work_add(TWA_SIGNAL).
>
> I don't know how task_work is being used but the requirement would be that
> if a cgroup is frozen, task_works shouldn't be making state changes which
> can't safely be replayed (e.g. by restarting the frozen syscalls).

Well, in theory task_work can do "anything".

Of course, it can't, say, restart a frozen syscall, task_work_run() just
executes the callbacks in kernel mode and returns.

> it'd be better to freeze them together.

And I tend to agree. simply beacase do_freezer_trap() (and more users of
clear_thread_flag(TIF_SIGPENDING) + schedule(TASK_INTERRUPTIBLE) pattern)
do not take TIF_NOTIFY_SIGNAL into account.

But how do you think this patch can make the things worse wrt CRIU ?

And let's even forget this patch which fixes the real problem.
How do you think the fact that the task sleeping in do_freezer_trap()
can react to TIF_NOTIFY_SIGNAL, call task_work_run(), and then sleep
in do_freezer_trap() again can make any difference in this sense?

> As this thing is kinda difficult to reason about,

Agreed,

> it'd probably be easier to just freeze them together if we can.

Agreed, but this needs some "generic" changes while Pavel needs a
simple and backportable workaround to suppress a real problem.

In short, I don't like this patch either, I just don't see a better
solution for now ;)

Thanks,

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-10 21:34                             ` Oleg Nesterov
@ 2024-07-10 22:01                               ` Tejun Heo
  2024-07-10 22:17                                 ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2024-07-10 22:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Pavel Begunkov, io-uring, Jens Axboe, Andrew Morton,
	Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel,
	Julian Orth, Peter Zijlstra

Hello,

On Wed, Jul 10, 2024 at 11:34:19PM +0200, Oleg Nesterov wrote:
...
> > it'd be better to freeze them together.
> 
> And I tend to agree. simply beacase do_freezer_trap() (and more users of
> clear_thread_flag(TIF_SIGPENDING) + schedule(TASK_INTERRUPTIBLE) pattern)
> do not take TIF_NOTIFY_SIGNAL into account.
> 
> But how do you think this patch can make the things worse wrt CRIU ?

Oh, I'm not arguing against the patch at all. Just daydreaming about what
future cleanups should look like.

...
> In short, I don't like this patch either, I just don't see a better
> solution for now ;)

I think we're on the same page.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal()
  2024-07-10 22:01                               ` Tejun Heo
@ 2024-07-10 22:17                                 ` Oleg Nesterov
  0 siblings, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2024-07-10 22:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Begunkov, io-uring, Jens Axboe, Andrew Morton,
	Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel,
	Julian Orth, Peter Zijlstra

On 07/10, Tejun Heo wrote:
>
> > But how do you think this patch can make the things worse wrt CRIU ?
>
> Oh, I'm not arguing against the patch at all. Just daydreaming about what
> future cleanups should look like.

Ah, sorry, I misunderstood you!

> > In short, I don't like this patch either, I just don't see a better
> > solution for now ;)
>
> I think we're on the same page.

Yes, and I agree with your concerns and your thoughts about future cleanups.

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-07-10 22:19 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-07 16:32 [PATCH 0/2] fix task_work interation with freezing Pavel Begunkov
2024-07-07 16:32 ` [PATCH 1/2] io_uring/io-wq: limit retrying worker initialisation Pavel Begunkov
2024-07-07 16:32 ` [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() Pavel Begunkov
2024-07-08 10:42   ` Oleg Nesterov
2024-07-08 15:40     ` Pavel Begunkov
2024-07-08 18:48       ` Peter Zijlstra
2024-07-09 10:36       ` Oleg Nesterov
2024-07-09 14:05         ` Pavel Begunkov
2024-07-09 16:39           ` Tejun Heo
2024-07-09 19:07             ` Oleg Nesterov
2024-07-09 19:26               ` Pavel Begunkov
2024-07-09 19:38                 ` Oleg Nesterov
2024-07-09 19:55                   ` Pavel Begunkov
2024-07-10  0:54                     ` Tejun Heo
2024-07-10 17:53                       ` Pavel Begunkov
2024-07-10 19:10                         ` Oleg Nesterov
2024-07-10 19:20                           ` Tejun Heo
2024-07-10 21:34                             ` Oleg Nesterov
2024-07-10 22:01                               ` Tejun Heo
2024-07-10 22:17                                 ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox