* [PATCH 0/2] fix task_work interation with freezing @ 2024-07-07 16:32 Pavel Begunkov 2024-07-07 16:32 ` [PATCH 1/2] io_uring/io-wq: limit retrying worker initialisation Pavel Begunkov 2024-07-07 16:32 ` [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() Pavel Begunkov 0 siblings, 2 replies; 20+ messages in thread From: Pavel Begunkov @ 2024-07-07 16:32 UTC (permalink / raw) To: io-uring Cc: Jens Axboe, asml.silence, Oleg Nesterov, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel It's reported [1] that a task_work queued at a wrong time can prevent freezing and make the tasks to spin in get_signal() taking 100% of CPU. Patch 1 is a preparation. Patch 2 addresses the issue. [1] https://github.com/systemd/systemd/issues/33626 Pavel Begunkov (2): io_uring/io-wq: limit retrying worker initialisation signal: rerun task_work while freezing io_uring/io-wq.c | 10 +++++++--- kernel/signal.c | 4 ++++ 2 files changed, 11 insertions(+), 3 deletions(-) -- 2.44.0 ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 1/2] io_uring/io-wq: limit retrying worker initialisation 2024-07-07 16:32 [PATCH 0/2] fix task_work interation with freezing Pavel Begunkov @ 2024-07-07 16:32 ` Pavel Begunkov 2024-07-07 16:32 ` [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() Pavel Begunkov 1 sibling, 0 replies; 20+ messages in thread From: Pavel Begunkov @ 2024-07-07 16:32 UTC (permalink / raw) To: io-uring Cc: Jens Axboe, asml.silence, Oleg Nesterov, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth If io-wq worker creation fails, we retry it by queueing up a task_work. tasK_work is needed because it should be done from the user process context. The problem is that retries are not limited, and if queueing a task_work is the reason for the failure, we might get into an infinite loop. It doesn't seem to happen now but it will with the following patch executing task_work in the freezing loop. For now, arbitrarily limit the number of attempts to create a worker. Cc: [email protected] Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals") Reported-by: Julian Orth <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> --- io_uring/io-wq.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c index 913c92249522..f1e7c670add8 100644 --- a/io_uring/io-wq.c +++ b/io_uring/io-wq.c @@ -23,6 +23,7 @@ #include "io_uring.h" #define WORKER_IDLE_TIMEOUT (5 * HZ) +#define WORKER_INIT_LIMIT 3 enum { IO_WORKER_F_UP = 0, /* up and active */ @@ -58,6 +59,7 @@ struct io_worker { unsigned long create_state; struct callback_head create_work; + int init_retries; union { struct rcu_head rcu; @@ -745,7 +747,7 @@ static bool io_wq_work_match_all(struct io_wq_work *work, void *data) return true; } -static inline bool io_should_retry_thread(long err) +static inline bool io_should_retry_thread(struct io_worker *worker, long err) { /* * Prevent perpetual task_work retry, if the task (or its group) is @@ -753,6 +755,8 @@ static inline bool io_should_retry_thread(long err) */ if (fatal_signal_pending(current)) return false; + if (worker->init_retries++ >= WORKER_INIT_LIMIT) + return false; switch (err) { case -EAGAIN: @@ -779,7 +783,7 @@ static void create_worker_cont(struct callback_head *cb) io_init_new_worker(wq, worker, tsk); io_worker_release(worker); return; - } else if (!io_should_retry_thread(PTR_ERR(tsk))) { + } else if (!io_should_retry_thread(worker, PTR_ERR(tsk))) { struct io_wq_acct *acct = io_wq_get_acct(worker); atomic_dec(&acct->nr_running); @@ -846,7 +850,7 @@ static bool create_io_worker(struct io_wq *wq, int index) tsk = create_io_thread(io_wq_worker, worker, NUMA_NO_NODE); if (!IS_ERR(tsk)) { io_init_new_worker(wq, worker, tsk); - } else if (!io_should_retry_thread(PTR_ERR(tsk))) { + } else if (!io_should_retry_thread(worker, PTR_ERR(tsk))) { kfree(worker); goto fail; } else { -- 2.44.0 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-07 16:32 [PATCH 0/2] fix task_work interation with freezing Pavel Begunkov 2024-07-07 16:32 ` [PATCH 1/2] io_uring/io-wq: limit retrying worker initialisation Pavel Begunkov @ 2024-07-07 16:32 ` Pavel Begunkov 2024-07-08 10:42 ` Oleg Nesterov 1 sibling, 1 reply; 20+ messages in thread From: Pavel Begunkov @ 2024-07-07 16:32 UTC (permalink / raw) To: io-uring Cc: Jens Axboe, asml.silence, Oleg Nesterov, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth io_uring can asynchronously add a task_work while the task is getting freezed. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in do_freezer_trap(), and since the get_signal()'s relock loop doesn't retry task_work, the task will spin there not being able to sleep until the freezing is cancelled / the task is killed / etc. Cc: [email protected] Link: https://github.com/systemd/systemd/issues/33626 Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals") Reported-by: Julian Orth <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> --- kernel/signal.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/kernel/signal.c b/kernel/signal.c index 1f9dd41c04be..790d60fcfff0 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2694,6 +2694,10 @@ bool get_signal(struct ksignal *ksig) try_to_freeze(); relock: + clear_notify_signal(); + if (unlikely(task_work_pending(current))) + task_work_run(); + spin_lock_irq(&sighand->siglock); /* -- 2.44.0 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-07 16:32 ` [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() Pavel Begunkov @ 2024-07-08 10:42 ` Oleg Nesterov 2024-07-08 15:40 ` Pavel Begunkov 0 siblings, 1 reply; 20+ messages in thread From: Oleg Nesterov @ 2024-07-08 10:42 UTC (permalink / raw) To: Pavel Begunkov Cc: io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth On 07/07, Pavel Begunkov wrote: > > io_uring can asynchronously add a task_work while the task is getting > freezed. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in > do_freezer_trap(), and since the get_signal()'s relock loop doesn't > retry task_work, the task will spin there not being able to sleep > until the freezing is cancelled / the task is killed / etc. > > Cc: [email protected] > Link: https://github.com/systemd/systemd/issues/33626 > Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals") I don't think we should blame io_uring even if so far it is the only user of TWA_SIGNAL. Perhaps we should change do_freezer_trap() somehow, not sure... It assumes that TIF_SIGPENDING is the only reason to not sleep in TASK_INTERRUPTIBLE, today this is not true. > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -2694,6 +2694,10 @@ bool get_signal(struct ksignal *ksig) > try_to_freeze(); > > relock: > + clear_notify_signal(); > + if (unlikely(task_work_pending(current))) > + task_work_run(); > + > spin_lock_irq(&sighand->siglock); Well, but can't we kill the same code at the start of get_signal() then? Of course, in this case get_signal() should check signal_pending(), not task_sigpending(). Or perhaps something like the patch below makes more sense? I dunno... Oleg. diff --git a/kernel/signal.c b/kernel/signal.c index 1f9dd41c04be..e2ae85293fbb 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2676,6 +2676,7 @@ bool get_signal(struct ksignal *ksig) struct signal_struct *signal = current->signal; int signr; +start: clear_notify_signal(); if (unlikely(task_work_pending(current))) task_work_run(); @@ -2760,10 +2761,11 @@ bool get_signal(struct ksignal *ksig) if (current->jobctl & JOBCTL_TRAP_MASK) { do_jobctl_trap(); spin_unlock_irq(&sighand->siglock); + goto relock; } else if (current->jobctl & JOBCTL_TRAP_FREEZE) do_freezer_trap(); - - goto relock; + goto start; + } } /* ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-08 10:42 ` Oleg Nesterov @ 2024-07-08 15:40 ` Pavel Begunkov 2024-07-08 18:48 ` Peter Zijlstra 2024-07-09 10:36 ` Oleg Nesterov 0 siblings, 2 replies; 20+ messages in thread From: Pavel Begunkov @ 2024-07-08 15:40 UTC (permalink / raw) To: Oleg Nesterov Cc: io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Tejun Heo, Peter Zijlstra On 7/8/24 11:42, Oleg Nesterov wrote: > On 07/07, Pavel Begunkov wrote: >> >> io_uring can asynchronously add a task_work while the task is getting >> freezed. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in >> do_freezer_trap(), and since the get_signal()'s relock loop doesn't >> retry task_work, the task will spin there not being able to sleep >> until the freezing is cancelled / the task is killed / etc. >> >> Cc: [email protected] >> Link: https://github.com/systemd/systemd/issues/33626 >> Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals") > > I don't think we should blame io_uring even if so far it is the only user > of TWA_SIGNAL. And it's not entirely correct even for backporting purposes, I'll pin it to when freezing was introduced then. > Perhaps we should change do_freezer_trap() somehow, not sure... It assumes > that TIF_SIGPENDING is the only reason to not sleep in TASK_INTERRUPTIBLE, > today this is not true. Let's CC Peter Zijlstra and Tejun in case they might have some input on that. Link to this patch for convenience: https://lore.kernel.org/all/1d935e9d87fd8672ef3e8a9a0db340d355ea08b4.1720368770.git.asml.silence@gmail.com/ >> --- a/kernel/signal.c >> +++ b/kernel/signal.c >> @@ -2694,6 +2694,10 @@ bool get_signal(struct ksignal *ksig) >> try_to_freeze(); >> >> relock: >> + clear_notify_signal(); >> + if (unlikely(task_work_pending(current))) >> + task_work_run(); >> + >> spin_lock_irq(&sighand->siglock); > > Well, but can't we kill the same code at the start of get_signal() then? > Of course, in this case get_signal() should check signal_pending(), not > task_sigpending(). Should be fine, but I didn't want to change the try_to_freeze() -> __refrigerator() path, which also reschedules. > Or perhaps something like the patch below makes more sense? I dunno... It needs a far backporting, I'd really prefer to keep it lean and without more side effects if possible, unless there is a strong opinion on that. > Oleg. > > diff --git a/kernel/signal.c b/kernel/signal.c > index 1f9dd41c04be..e2ae85293fbb 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -2676,6 +2676,7 @@ bool get_signal(struct ksignal *ksig) > struct signal_struct *signal = current->signal; > int signr; > > +start: > clear_notify_signal(); > if (unlikely(task_work_pending(current))) > task_work_run(); > @@ -2760,10 +2761,11 @@ bool get_signal(struct ksignal *ksig) > if (current->jobctl & JOBCTL_TRAP_MASK) { > do_jobctl_trap(); > spin_unlock_irq(&sighand->siglock); > + goto relock; > } else if (current->jobctl & JOBCTL_TRAP_FREEZE) > do_freezer_trap(); > - > - goto relock; > + goto start; > + } > } > > /* > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-08 15:40 ` Pavel Begunkov @ 2024-07-08 18:48 ` Peter Zijlstra 2024-07-09 10:36 ` Oleg Nesterov 1 sibling, 0 replies; 20+ messages in thread From: Peter Zijlstra @ 2024-07-08 18:48 UTC (permalink / raw) To: Pavel Begunkov Cc: Oleg Nesterov, io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Tejun Heo On Mon, Jul 08, 2024 at 04:40:07PM +0100, Pavel Begunkov wrote: > > > --- a/kernel/signal.c > > > +++ b/kernel/signal.c > > > @@ -2694,6 +2694,10 @@ bool get_signal(struct ksignal *ksig) > > > try_to_freeze(); > > > relock: > > > + clear_notify_signal(); > > > + if (unlikely(task_work_pending(current))) > > > + task_work_run(); > > > + > > > spin_lock_irq(&sighand->siglock); > > > > Well, but can't we kill the same code at the start of get_signal() then? > > Of course, in this case get_signal() should check signal_pending(), not > > task_sigpending(). > > Should be fine, but I didn't want to change the > try_to_freeze() -> __refrigerator() path, which also reschedules. > > > Or perhaps something like the patch below makes more sense? I dunno... > > It needs a far backporting, I'd really prefer to keep it > lean and without more side effects if possible, unless > there is a strong opinion on that. It's been a minute since I dug my way through the signal code, but I think I slightly favour Oleg's version for not duplicating that task_work_run(). > > diff --git a/kernel/signal.c b/kernel/signal.c > > index 1f9dd41c04be..e2ae85293fbb 100644 > > --- a/kernel/signal.c > > +++ b/kernel/signal.c > > @@ -2676,6 +2676,7 @@ bool get_signal(struct ksignal *ksig) > > struct signal_struct *signal = current->signal; > > int signr; > > +start: > > clear_notify_signal(); > > if (unlikely(task_work_pending(current))) > > task_work_run(); > > @@ -2760,10 +2761,11 @@ bool get_signal(struct ksignal *ksig) > > if (current->jobctl & JOBCTL_TRAP_MASK) { > > do_jobctl_trap(); > > spin_unlock_irq(&sighand->siglock); > > + goto relock; > > } else if (current->jobctl & JOBCTL_TRAP_FREEZE) > > do_freezer_trap(); > > - > > - goto relock; > > + goto start; > > + } > > } > > /* > > > > -- > Pavel Begunkov ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-08 15:40 ` Pavel Begunkov 2024-07-08 18:48 ` Peter Zijlstra @ 2024-07-09 10:36 ` Oleg Nesterov 2024-07-09 14:05 ` Pavel Begunkov 1 sibling, 1 reply; 20+ messages in thread From: Oleg Nesterov @ 2024-07-09 10:36 UTC (permalink / raw) To: Pavel Begunkov Cc: io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Tejun Heo, Peter Zijlstra On 07/08, Pavel Begunkov wrote: > > On 7/8/24 11:42, Oleg Nesterov wrote: > >I don't think we should blame io_uring even if so far it is the only user > >of TWA_SIGNAL. > > And it's not entirely correct even for backporting purposes, > I'll pin it to when freezing was introduced then. This is another problem introduced by 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") We need much more changes. Say, zap_threads() does the same and assumes that only SIGKILL or freezeing can make dump_interrupted() true. There are more similar problems. I'll try to think, so far I do not see a simple solution... As for this particular problem, I agree it needs a simple/backportable fix. > >> relock: > >>+ clear_notify_signal(); > >>+ if (unlikely(task_work_pending(current))) > >>+ task_work_run(); > >>+ > >> spin_lock_irq(&sighand->siglock); > > > >Well, but can't we kill the same code at the start of get_signal() then? > >Of course, in this case get_signal() should check signal_pending(), not > >task_sigpending(). > > Should be fine, Well, not really at least performance-wise... get_signal() should return asap if TIF_NOTIFY_SIGNAL was the only reason to call get_signal(). > but I didn't want to change the > try_to_freeze() -> __refrigerator() path, which also reschedules. Could you spell please? > >Or perhaps something like the patch below makes more sense? I dunno... > > It needs a far backporting, I'd really prefer to keep it > lean and without more side effects if possible, unless > there is a strong opinion on that. Well, I don't think my patch is really worse in this sense. Just it is buggy ;) it needs another recalc_sigpending() before goto start, so lets forget it. So I am starting to agree with your change as a workaround until we find a clean solution (if ever ;). But can I ask you to add this additional clear_notify_signal() + task_work_run() to the end of do_freezer_trap() ? get_signal() is already a mess... ----------------------------------------------------------------------- Either way I have no idea whether a cgroup_task_frozen() task should react to task_work_add(TWA_SIGNAL) or not. Documentation/admin-guide/cgroup-v2.rst says Writing "1" to the file causes freezing of the cgroup and all descendant cgroups. This means that all belonging processes will be stopped and will not run until the cgroup will be explicitly unfrozen. AFAICS this is not accurate, they can run but can't return to user-mode. So I guess task_work_run() is fine. Oleg. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-09 10:36 ` Oleg Nesterov @ 2024-07-09 14:05 ` Pavel Begunkov 2024-07-09 16:39 ` Tejun Heo 0 siblings, 1 reply; 20+ messages in thread From: Pavel Begunkov @ 2024-07-09 14:05 UTC (permalink / raw) To: Oleg Nesterov Cc: io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Tejun Heo, Peter Zijlstra On 7/9/24 11:36, Oleg Nesterov wrote: > On 07/08, Pavel Begunkov wrote: >> >> On 7/8/24 11:42, Oleg Nesterov wrote: >>> I don't think we should blame io_uring even if so far it is the only user >>> of TWA_SIGNAL. >> >> And it's not entirely correct even for backporting purposes, >> I'll pin it to when freezing was introduced then. > > This is another problem introduced by 12db8b690010 ("entry: Add support for > TIF_NOTIFY_SIGNAL") Ah, yes, I forgot NOTIFY_SIGNAL was split out of SIGPENDING > We need much more changes. Say, zap_threads() does the same and assumes > that only SIGKILL or freezeing can make dump_interrupted() true. > > There are more similar problems. I'll try to think, so far I do not see > a simple solution... Thanks. And there was some patching done before against dumping being interrupted by task_work, indeed a reoccurring issue. > As for this particular problem, I agree it needs a simple/backportable fix. > >>>> relock: >>>> + clear_notify_signal(); >>>> + if (unlikely(task_work_pending(current))) >>>> + task_work_run(); >>>> + >>>> spin_lock_irq(&sighand->siglock); >>> >>> Well, but can't we kill the same code at the start of get_signal() then? >>> Of course, in this case get_signal() should check signal_pending(), not >>> task_sigpending(). >> >> Should be fine, > > Well, not really at least performance-wise... get_signal() should return > asap if TIF_NOTIFY_SIGNAL was the only reason to call get_signal(). > >> but I didn't want to change the >> try_to_freeze() -> __refrigerator() path, which also reschedules. > > Could you spell please? Let's say it calls get_signal() for freezing with a task_work pending. Currently, it executes task_work and calls try_to_freeze(), which puts the task to sleep. If we remove that task_work_run() before try_to_freeze(), it would not be able to sleep. Sounds like it should be fine, it races anyway, but I'm trying to avoid side effect for fixes. >>> Or perhaps something like the patch below makes more sense? I dunno... >> >> It needs a far backporting, I'd really prefer to keep it >> lean and without more side effects if possible, unless >> there is a strong opinion on that. > > Well, I don't think my patch is really worse in this sense. Just it > is buggy ;) it needs another recalc_sigpending() before goto start, > so lets forget it. > > So I am starting to agree with your change as a workaround until we > find a clean solution (if ever ;). > > But can I ask you to add this additional clear_notify_signal() + > task_work_run() to the end of do_freezer_trap() ? get_signal() is > already a mess... Will change > ----------------------------------------------------------------------- > Either way I have no idea whether a cgroup_task_frozen() task should > react to task_work_add(TWA_SIGNAL) or not. > > Documentation/admin-guide/cgroup-v2.rst says > > Writing "1" to the file causes freezing of the cgroup and all > descendant cgroups. This means that all belonging processes will > be stopped and will not run until the cgroup will be explicitly > unfrozen. > > AFAICS this is not accurate, they can run but can't return to user-mode. > So I guess task_work_run() is fine. IIUC it's a user facing doc, so maybe it's accurate enough from that perspective. But I do agree that the semantics around task_work is not exactly clear. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-09 14:05 ` Pavel Begunkov @ 2024-07-09 16:39 ` Tejun Heo 2024-07-09 19:07 ` Oleg Nesterov 0 siblings, 1 reply; 20+ messages in thread From: Tejun Heo @ 2024-07-09 16:39 UTC (permalink / raw) To: Pavel Begunkov Cc: Oleg Nesterov, io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Peter Zijlstra Hello, On Tue, Jul 09, 2024 at 03:05:21PM +0100, Pavel Begunkov wrote: > > ----------------------------------------------------------------------- > > Either way I have no idea whether a cgroup_task_frozen() task should > > react to task_work_add(TWA_SIGNAL) or not. > > > > Documentation/admin-guide/cgroup-v2.rst says > > > > Writing "1" to the file causes freezing of the cgroup and all > > descendant cgroups. This means that all belonging processes will > > be stopped and will not run until the cgroup will be explicitly > > unfrozen. > > > > AFAICS this is not accurate, they can run but can't return to user-mode. > > So I guess task_work_run() is fine. > > IIUC it's a user facing doc, so maybe it's accurate enough from that > perspective. But I do agree that the semantics around task_work is > not exactly clear. A good correctness test for cgroup freezer is whether it'd be safe to snapshot and restore the tasks in the cgroup while frozen. If that works, it's most likely fine. Thanks. -- tejun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-09 16:39 ` Tejun Heo @ 2024-07-09 19:07 ` Oleg Nesterov 2024-07-09 19:26 ` Pavel Begunkov 0 siblings, 1 reply; 20+ messages in thread From: Oleg Nesterov @ 2024-07-09 19:07 UTC (permalink / raw) To: Tejun Heo Cc: Pavel Begunkov, io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Peter Zijlstra Hi Tejun, Thanks for looking at this, can you review this V2 patch from Pavel? To me it makes sense even without 1/2 which I didn't even bother to read. At least as a simple workaround for now. On 07/09, Tejun Heo wrote: > > Hello, > > On Tue, Jul 09, 2024 at 03:05:21PM +0100, Pavel Begunkov wrote: > > > ----------------------------------------------------------------------- > > > Either way I have no idea whether a cgroup_task_frozen() task should > > > react to task_work_add(TWA_SIGNAL) or not. > > > > > > Documentation/admin-guide/cgroup-v2.rst says > > > > > > Writing "1" to the file causes freezing of the cgroup and all > > > descendant cgroups. This means that all belonging processes will > > > be stopped and will not run until the cgroup will be explicitly > > > unfrozen. > > > > > > AFAICS this is not accurate, they can run but can't return to user-mode. > > > So I guess task_work_run() is fine. > > > > IIUC it's a user facing doc, so maybe it's accurate enough from that > > perspective. But I do agree that the semantics around task_work is > > not exactly clear. > > A good correctness test for cgroup freezer is whether it'd be safe to > snapshot and restore the tasks in the cgroup while frozen. Well, I don't really understand what can snapshot/restore actually mean... I forgot everything about cgroup freezer and I am already sleeping, but even if we forget about task_work_add/TIF_NOTIFY_SIGNAL/etc, afaics ptrace can change the state of cgroup_task_frozen() task between snapshot and restore ? Oleg. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-09 19:07 ` Oleg Nesterov @ 2024-07-09 19:26 ` Pavel Begunkov 2024-07-09 19:38 ` Oleg Nesterov 0 siblings, 1 reply; 20+ messages in thread From: Pavel Begunkov @ 2024-07-09 19:26 UTC (permalink / raw) To: Oleg Nesterov, Tejun Heo Cc: io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Peter Zijlstra On 7/9/24 20:07, Oleg Nesterov wrote: > Hi Tejun, > > Thanks for looking at this, can you review this V2 patch from Pavel? > To me it makes sense even without 1/2 which I didn't even bother to > read. At least as a simple workaround for now. They are kind of separate but without 1/2 this patch creates another infinite loop, even though it's harder to hit and is io_uring specific. > On 07/09, Tejun Heo wrote: >> >> Hello, >> >> On Tue, Jul 09, 2024 at 03:05:21PM +0100, Pavel Begunkov wrote: >>>> ----------------------------------------------------------------------- >>>> Either way I have no idea whether a cgroup_task_frozen() task should >>>> react to task_work_add(TWA_SIGNAL) or not. >>>> >>>> Documentation/admin-guide/cgroup-v2.rst says >>>> >>>> Writing "1" to the file causes freezing of the cgroup and all >>>> descendant cgroups. This means that all belonging processes will >>>> be stopped and will not run until the cgroup will be explicitly >>>> unfrozen. >>>> >>>> AFAICS this is not accurate, they can run but can't return to user-mode. >>>> So I guess task_work_run() is fine. >>> >>> IIUC it's a user facing doc, so maybe it's accurate enough from that >>> perspective. But I do agree that the semantics around task_work is >>> not exactly clear. >> >> A good correctness test for cgroup freezer is whether it'd be safe to >> snapshot and restore the tasks in the cgroup while frozen. > > Well, I don't really understand what can snapshot/restore actually mean... CRIU, I assume. I'll try it ... > I forgot everything about cgroup freezer and I am already sleeping, but even > if we forget about task_work_add/TIF_NOTIFY_SIGNAL/etc, afaics ptrace can > change the state of cgroup_task_frozen() task between snapshot and restore ? ... but I'm inclined to think the patch makes sense regardless, we're replacing an infinite loop with wait-wake-execute-wait. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-09 19:26 ` Pavel Begunkov @ 2024-07-09 19:38 ` Oleg Nesterov 2024-07-09 19:55 ` Pavel Begunkov 0 siblings, 1 reply; 20+ messages in thread From: Oleg Nesterov @ 2024-07-09 19:38 UTC (permalink / raw) To: Pavel Begunkov Cc: Tejun Heo, io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Peter Zijlstra On 07/09, Pavel Begunkov wrote: > > On 7/9/24 20:07, Oleg Nesterov wrote: > >Hi Tejun, > > > >Thanks for looking at this, can you review this V2 patch from Pavel? Just in case, I obviously meant our next (V2) patch [PATCH v2 2/2] kernel: rerun task_work while freezing in get_signal() https://lore.kernel.org/all/149ff5a762997c723880751e8a4019907a0b6457.1720534425.git.asml.silence@gmail.com/ > >Well, I don't really understand what can snapshot/restore actually mean... > > CRIU, I assume. I'll try it ... Than I think we can forget about task_works and this patch. CRIU dumps the tasks in TASK_TRACED state. > ... but I'm inclined to think the patch makes sense regardless, > we're replacing an infinite loop with wait-wake-execute-wait. Agreed. Oleg. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-09 19:38 ` Oleg Nesterov @ 2024-07-09 19:55 ` Pavel Begunkov 2024-07-10 0:54 ` Tejun Heo 0 siblings, 1 reply; 20+ messages in thread From: Pavel Begunkov @ 2024-07-09 19:55 UTC (permalink / raw) To: Oleg Nesterov Cc: Tejun Heo, io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Peter Zijlstra On 7/9/24 20:38, Oleg Nesterov wrote: > On 07/09, Pavel Begunkov wrote: >> >> On 7/9/24 20:07, Oleg Nesterov wrote: >>> Hi Tejun, >>> >>> Thanks for looking at this, can you review this V2 patch from Pavel? > > Just in case, I obviously meant our next (V2) patch > > [PATCH v2 2/2] kernel: rerun task_work while freezing in get_signal() > https://lore.kernel.org/all/149ff5a762997c723880751e8a4019907a0b6457.1720534425.git.asml.silence@gmail.com/ > >>> Well, I don't really understand what can snapshot/restore actually mean... >> >> CRIU, I assume. I'll try it ... > > Than I think we can forget about task_works and this patch. CRIU dumps > the tasks in TASK_TRACED state. And would be hard to test, io_uring (the main source of task_work) is not supported (00.466022) Error (criu/proc_parse.c:477): Unknown shit 600 (anon_inode:[io_uring]) ... (00.467642) Unfreezing tasks into 1 (00.467656) Unseizing 15488 into 1 (00.468149) Error (criu/cr-dump.c:2111): Dumping FAILED. >> ... but I'm inclined to think the patch makes sense regardless, >> we're replacing an infinite loop with wait-wake-execute-wait. > > Agreed. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-09 19:55 ` Pavel Begunkov @ 2024-07-10 0:54 ` Tejun Heo 2024-07-10 17:53 ` Pavel Begunkov 0 siblings, 1 reply; 20+ messages in thread From: Tejun Heo @ 2024-07-10 0:54 UTC (permalink / raw) To: Pavel Begunkov Cc: Oleg Nesterov, io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Peter Zijlstra Hello, On Tue, Jul 09, 2024 at 08:55:43PM +0100, Pavel Begunkov wrote: ... > > > CRIU, I assume. I'll try it ... > > > > Than I think we can forget about task_works and this patch. CRIU dumps > > the tasks in TASK_TRACED state. > > And would be hard to test, io_uring (the main source of task_work) > is not supported > > (00.466022) Error (criu/proc_parse.c:477): Unknown shit 600 (anon_inode:[io_uring]) > ... > (00.467642) Unfreezing tasks into 1 > (00.467656) Unseizing 15488 into 1 > (00.468149) Error (criu/cr-dump.c:2111): Dumping FAILED. Yeah, the question is: If CRIU is to use cgroup freezer to freeze the tasks and then go around tracing each to make dump, would the freezer be enough in avoiding interim state changes? Using CRIU implementation is a bit arbitrary but I think checkpoint-restart is a useful bar to measure what should stay stable while a cgroup is frozen. Thanks. -- tejun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-10 0:54 ` Tejun Heo @ 2024-07-10 17:53 ` Pavel Begunkov 2024-07-10 19:10 ` Oleg Nesterov 0 siblings, 1 reply; 20+ messages in thread From: Pavel Begunkov @ 2024-07-10 17:53 UTC (permalink / raw) To: Tejun Heo Cc: Oleg Nesterov, io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Peter Zijlstra On 7/10/24 01:54, Tejun Heo wrote: > Hello, > > On Tue, Jul 09, 2024 at 08:55:43PM +0100, Pavel Begunkov wrote: > ... >>>> CRIU, I assume. I'll try it ... >>> >>> Than I think we can forget about task_works and this patch. CRIU dumps >>> the tasks in TASK_TRACED state. >> >> And would be hard to test, io_uring (the main source of task_work) >> is not supported >> >> (00.466022) Error (criu/proc_parse.c:477): Unknown shit 600 (anon_inode:[io_uring]) >> ... >> (00.467642) Unfreezing tasks into 1 >> (00.467656) Unseizing 15488 into 1 >> (00.468149) Error (criu/cr-dump.c:2111): Dumping FAILED. > > Yeah, the question is: If CRIU is to use cgroup freezer to freeze the tasks > and then go around tracing each to make dump, would the freezer be enough in > avoiding interim state changes? Using CRIU implementation is a bit arbitrary > but I think checkpoint-restart is a useful bar to measure what should stay > stable while a cgroup is frozen. Sounds like in the long run we might want to ignore task_work while it's frozen, but hard to say for all task_work users. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-10 17:53 ` Pavel Begunkov @ 2024-07-10 19:10 ` Oleg Nesterov 2024-07-10 19:20 ` Tejun Heo 0 siblings, 1 reply; 20+ messages in thread From: Oleg Nesterov @ 2024-07-10 19:10 UTC (permalink / raw) To: Pavel Begunkov Cc: Tejun Heo, io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Peter Zijlstra On 07/10, Pavel Begunkov wrote: > > On 7/10/24 01:54, Tejun Heo wrote: > >Yeah, the question is: If CRIU is to use cgroup freezer to freeze the tasks > >and then go around tracing each to make dump, would the freezer be enough in > >avoiding interim state changes? Using CRIU implementation is a bit arbitrary > >but I think checkpoint-restart is a useful bar to measure what should stay > >stable while a cgroup is frozen. > > Sounds like in the long run we might want to ignore task_work while > it's frozen, Just in case, this is what I have in mind right now, but I am still not sure and can't make a "clean" patch. If nothing else. CRIU needs to attach and make this task TASK_TRACED, right? And once the target task is traced, it won't react to task_work_add(TWA_SIGNAL). Oleg. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-10 19:10 ` Oleg Nesterov @ 2024-07-10 19:20 ` Tejun Heo 2024-07-10 21:34 ` Oleg Nesterov 0 siblings, 1 reply; 20+ messages in thread From: Tejun Heo @ 2024-07-10 19:20 UTC (permalink / raw) To: Oleg Nesterov Cc: Pavel Begunkov, io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Peter Zijlstra Hello, On Wed, Jul 10, 2024 at 09:10:16PM +0200, Oleg Nesterov wrote: ... > If nothing else. CRIU needs to attach and make this task TASK_TRACED, right? Yeah, AFAIK, that's the only way to implement check-pointing for now. > And once the target task is traced, it won't react to task_work_add(TWA_SIGNAL). I don't know how task_work is being used but the requirement would be that if a cgroup is frozen, task_works shouldn't be making state changes which can't safely be replayed (e.g. by restarting the frozen syscalls). If anything task_works do can just be reproduced by restarting the currently in-flight and frozen syscalls, it should be okay to leave them running. Otherwise, it'd be better to freeze them together. As this thing is kinda difficult to reason about, it'd probably be easier to just freeze them together if we can. Thanks. -- tejun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-10 19:20 ` Tejun Heo @ 2024-07-10 21:34 ` Oleg Nesterov 2024-07-10 22:01 ` Tejun Heo 0 siblings, 1 reply; 20+ messages in thread From: Oleg Nesterov @ 2024-07-10 21:34 UTC (permalink / raw) To: Tejun Heo Cc: Pavel Begunkov, io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Peter Zijlstra On 07/10, Tejun Heo wrote: > > Hello, > > On Wed, Jul 10, 2024 at 09:10:16PM +0200, Oleg Nesterov wrote: > ... > > If nothing else. CRIU needs to attach and make this task TASK_TRACED, right? > > Yeah, AFAIK, that's the only way to implement check-pointing for now. OK, > > And once the target task is traced, it won't react to task_work_add(TWA_SIGNAL). > > I don't know how task_work is being used but the requirement would be that > if a cgroup is frozen, task_works shouldn't be making state changes which > can't safely be replayed (e.g. by restarting the frozen syscalls). Well, in theory task_work can do "anything". Of course, it can't, say, restart a frozen syscall, task_work_run() just executes the callbacks in kernel mode and returns. > it'd be better to freeze them together. And I tend to agree. simply beacase do_freezer_trap() (and more users of clear_thread_flag(TIF_SIGPENDING) + schedule(TASK_INTERRUPTIBLE) pattern) do not take TIF_NOTIFY_SIGNAL into account. But how do you think this patch can make the things worse wrt CRIU ? And let's even forget this patch which fixes the real problem. How do you think the fact that the task sleeping in do_freezer_trap() can react to TIF_NOTIFY_SIGNAL, call task_work_run(), and then sleep in do_freezer_trap() again can make any difference in this sense? > As this thing is kinda difficult to reason about, Agreed, > it'd probably be easier to just freeze them together if we can. Agreed, but this needs some "generic" changes while Pavel needs a simple and backportable workaround to suppress a real problem. In short, I don't like this patch either, I just don't see a better solution for now ;) Thanks, Oleg. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-10 21:34 ` Oleg Nesterov @ 2024-07-10 22:01 ` Tejun Heo 2024-07-10 22:17 ` Oleg Nesterov 0 siblings, 1 reply; 20+ messages in thread From: Tejun Heo @ 2024-07-10 22:01 UTC (permalink / raw) To: Oleg Nesterov Cc: Pavel Begunkov, io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Peter Zijlstra Hello, On Wed, Jul 10, 2024 at 11:34:19PM +0200, Oleg Nesterov wrote: ... > > it'd be better to freeze them together. > > And I tend to agree. simply beacase do_freezer_trap() (and more users of > clear_thread_flag(TIF_SIGPENDING) + schedule(TASK_INTERRUPTIBLE) pattern) > do not take TIF_NOTIFY_SIGNAL into account. > > But how do you think this patch can make the things worse wrt CRIU ? Oh, I'm not arguing against the patch at all. Just daydreaming about what future cleanups should look like. ... > In short, I don't like this patch either, I just don't see a better > solution for now ;) I think we're on the same page. Thanks. -- tejun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() 2024-07-10 22:01 ` Tejun Heo @ 2024-07-10 22:17 ` Oleg Nesterov 0 siblings, 0 replies; 20+ messages in thread From: Oleg Nesterov @ 2024-07-10 22:17 UTC (permalink / raw) To: Tejun Heo Cc: Pavel Begunkov, io-uring, Jens Axboe, Andrew Morton, Christian Brauner, Tycho Andersen, Thomas Gleixner, linux-kernel, Julian Orth, Peter Zijlstra On 07/10, Tejun Heo wrote: > > > But how do you think this patch can make the things worse wrt CRIU ? > > Oh, I'm not arguing against the patch at all. Just daydreaming about what > future cleanups should look like. Ah, sorry, I misunderstood you! > > In short, I don't like this patch either, I just don't see a better > > solution for now ;) > > I think we're on the same page. Yes, and I agree with your concerns and your thoughts about future cleanups. Oleg. ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2024-07-10 22:19 UTC | newest] Thread overview: 20+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-07-07 16:32 [PATCH 0/2] fix task_work interation with freezing Pavel Begunkov 2024-07-07 16:32 ` [PATCH 1/2] io_uring/io-wq: limit retrying worker initialisation Pavel Begunkov 2024-07-07 16:32 ` [PATCH 2/2] kernel: rerun task_work while freezing in get_signal() Pavel Begunkov 2024-07-08 10:42 ` Oleg Nesterov 2024-07-08 15:40 ` Pavel Begunkov 2024-07-08 18:48 ` Peter Zijlstra 2024-07-09 10:36 ` Oleg Nesterov 2024-07-09 14:05 ` Pavel Begunkov 2024-07-09 16:39 ` Tejun Heo 2024-07-09 19:07 ` Oleg Nesterov 2024-07-09 19:26 ` Pavel Begunkov 2024-07-09 19:38 ` Oleg Nesterov 2024-07-09 19:55 ` Pavel Begunkov 2024-07-10 0:54 ` Tejun Heo 2024-07-10 17:53 ` Pavel Begunkov 2024-07-10 19:10 ` Oleg Nesterov 2024-07-10 19:20 ` Tejun Heo 2024-07-10 21:34 ` Oleg Nesterov 2024-07-10 22:01 ` Tejun Heo 2024-07-10 22:17 ` Oleg Nesterov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox