* [PATCHSET 0/4] io_uring and task_work interactions
From: Jens Axboe @ 2020-04-06 19:48 UTC
To: io-uring; +Cc: peterz

There's a case in io_uring where we want to run the task_work on ring
exit, but we can't easily do that as we don't know whether task_works
holds any work, or whether the exit work has already been queued.

Here are two prep patches that add a task_work_pending() helper and make
task_work_run() check it. Then we can remove those checks from the
callers, and io_uring can run any pending work on ring exit if it needs
to.

-- 
Jens Axboe
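The net effect on call sites (formalized later in patch 3/4) is sketched
below; these are before/after fragments, not complete functions:

        /* Before: every call site open-codes the pending check. */
        if (current->task_works)
                task_work_run();

        /*
         * After patch 1/4: task_work_run() is an inline wrapper that
         * performs the check itself, so callers can invoke it
         * unconditionally at no extra cost.
         */
        task_work_run();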
* [PATCH 1/4] task_work: add task_work_pending() helper
From: Jens Axboe @ 2020-04-06 19:48 UTC
To: io-uring; +Cc: peterz, Jens Axboe

Various callsites currently check current->task_works != NULL to know
when to run work. Add a helper that we use internally for that. This is
in preparation for also not running if the exit work has been queued.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/task_work.h | 13 ++++++++++++-
 kernel/task_work.c        |  2 +-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index bd9a6a91c097..54c911bbf754 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -15,7 +15,18 @@ init_task_work(struct callback_head *twork, task_work_func_t func)
 
 int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
-void task_work_run(void);
+void __task_work_run(void);
+
+static inline bool task_work_pending(void)
+{
+	return current->task_works;
+}
+
+static inline void task_work_run(void)
+{
+	if (task_work_pending())
+		__task_work_run();
+}
 
 static inline void exit_task_work(struct task_struct *task)
 {
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 825f28259a19..9620333423a3 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -87,7 +87,7 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
  * it exits. In the latter case task_work_add() can no longer add the
  * new work after task_work_run() returns.
  */
-void task_work_run(void)
+void __task_work_run(void)
 {
 	struct task_struct *task = current;
 	struct callback_head *work, *head, *next;
-- 
2.26.0
* Re: [PATCH 1/4] task_work: add task_work_pending() helper
From: Oleg Nesterov @ 2020-04-07 11:28 UTC
To: Jens Axboe; +Cc: io-uring, peterz

On 04/06, Jens Axboe wrote:
>
> +static inline bool task_work_pending(void)
> +{
> +	return current->task_works;
> +}
> +
> +static inline void task_work_run(void)
> +{
> +	if (task_work_pending())
> +		__task_work_run();
> +}

No, this is wrong. exit_task_work() must always call __task_work_run()
to install work_exited.

This helper (and 3/4) probably makes sense, but please change
exit_task_work() to use __task_work_run() then.

Oleg.
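Concretely, the fix being asked for amounts to something like this in
include/linux/task_work.h — a sketch only, reusing the __task_work_run()
split from patch 1/4:

        static inline void exit_task_work(struct task_struct *task)
        {
                /*
                 * Skip the task_work_pending() shortcut: even with an
                 * empty list, the exiting task must run __task_work_run()
                 * so it installs the work_exited sentinel and fences off
                 * further task_work_add().
                 */
                __task_work_run();
        }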
* Re: [PATCH 1/4] task_work: add task_work_pending() helper
From: Jens Axboe @ 2020-04-07 15:43 UTC
To: Oleg Nesterov; +Cc: io-uring, peterz

On 4/7/20 4:28 AM, Oleg Nesterov wrote:
> On 04/06, Jens Axboe wrote:
>>
>> +static inline bool task_work_pending(void)
>> +{
>> +	return current->task_works;
>> +}
>> +
>> +static inline void task_work_run(void)
>> +{
>> +	if (task_work_pending())
>> +		__task_work_run();
>> +}
>
> No, this is wrong. exit_task_work() must always call __task_work_run()
> to install work_exited.
>
> This helper (and 3/4) probably makes sense but please change
> exit_task_work() to use __task_work_run() then.

Good catch, thanks. I'll make the change.

-- 
Jens Axboe
* [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Jens Axboe @ 2020-04-06 19:48 UTC
To: io-uring; +Cc: peterz, Jens Axboe

If task_work has already been run on task exit, we don't always know
if it's safe to run again. Check for task_work_exited in the
task_work_pending() helper. This makes it less fragile when called
from the exit files path, for example.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/task_work.h | 4 +++-
 kernel/task_work.c        | 8 ++++----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 54c911bbf754..24f977a8fc35 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -7,6 +7,8 @@
 
 typedef void (*task_work_func_t)(struct callback_head *);
 
+extern struct callback_head task_work_exited;
+
 static inline void
 init_task_work(struct callback_head *twork, task_work_func_t func)
 {
@@ -19,7 +21,7 @@ void __task_work_run(void);
 
 static inline bool task_work_pending(void)
 {
-	return current->task_works;
+	return current->task_works && current->task_works != &task_work_exited;
 }
 
 static inline void task_work_run(void)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 9620333423a3..d6a8b4ab4858 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -3,7 +3,7 @@
 #include <linux/task_work.h>
 #include <linux/tracehook.h>
 
-static struct callback_head work_exited; /* all we need is ->next == NULL */
+struct callback_head task_work_exited = { };
 
 /**
  * task_work_add - ask the @task to execute @work->func()
@@ -31,7 +31,7 @@ task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 
 	do {
 		head = READ_ONCE(task->task_works);
-		if (unlikely(head == &work_exited))
+		if (unlikely(head == &task_work_exited))
 			return -ESRCH;
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);
@@ -95,14 +95,14 @@ void __task_work_run(void)
 	for (;;) {
 		/*
 		 * work->func() can do task_work_add(), do not set
-		 * work_exited unless the list is empty.
+		 * task_work_exited unless the list is empty.
 		 */
 		do {
 			head = NULL;
 			work = READ_ONCE(task->task_works);
 			if (!work) {
 				if (task->flags & PF_EXITING)
-					head = &work_exited;
+					head = &task_work_exited;
 				else
 					break;
 			}
-- 
2.26.0
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Oleg Nesterov @ 2020-04-07 11:39 UTC
To: Jens Axboe; +Cc: io-uring, peterz

On 04/06, Jens Axboe wrote:
>
> +extern struct callback_head task_work_exited;
> +
>  static inline void
>  init_task_work(struct callback_head *twork, task_work_func_t func)
>  {
> @@ -19,7 +21,7 @@ void __task_work_run(void);
>
>  static inline bool task_work_pending(void)
>  {
> -	return current->task_works;
> +	return current->task_works && current->task_works != &task_work_exited;
 	                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Well, this penalizes all the current users; they can't hit work_exited.

IIUC, this is needed for the next change, which adds task_work_run() to
io_ring_ctx_wait_and_kill(), right?

Could you explain how the exiting task can call io_ring_ctx_wait_and_kill()
after it has passed exit_task_work()?

Oleg.
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Oleg Nesterov @ 2020-04-07 11:54 UTC
To: Jens Axboe; +Cc: io-uring, peterz

On 04/07, Oleg Nesterov wrote:
>
> On 04/06, Jens Axboe wrote:
> >
> > +extern struct callback_head task_work_exited;
> > +
> >  static inline bool task_work_pending(void)
> >  {
> > -	return current->task_works;
> > +	return current->task_works && current->task_works != &task_work_exited;
>  	                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Well, this penalizes all the current users; they can't hit work_exited.
>
> IIUC, this is needed for the next change, which adds task_work_run() to
> io_ring_ctx_wait_and_kill(), right?
>
> Could you explain how the exiting task can call io_ring_ctx_wait_and_kill()
> after it has passed exit_task_work()?

Note also that currently we assume that task_work_run() must not be
called by the exiting process (except via exit_task_work). If io_uring
sets the precedent, we need to change the PF_EXITING logic in
task_work_run().

Oleg.
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Jens Axboe @ 2020-04-07 15:40 UTC
To: Oleg Nesterov; +Cc: io-uring, peterz

On 4/7/20 4:39 AM, Oleg Nesterov wrote:
> Well, this penalizes all the current users; they can't hit work_exited.
>
> IIUC, this is needed for the next change, which adds task_work_run() to
> io_ring_ctx_wait_and_kill(), right?

Right - so you'd rather I localize that check there instead? Can certainly
do that.

> Could you explain how the exiting task can call io_ring_ctx_wait_and_kill()
> after it has passed exit_task_work()?

Sure, here's a trace where it happens:

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 0 P4D 0
Oops: 0010 [#1] SMP
CPU: 51 PID: 7290 Comm: mc_worker Kdump: loaded Not tainted 5.2.9-03696-gf2db01aa1e97 #190
Hardware name: Quanta Leopard ORv2-DDR4/Leopard ORv2-DDR4, BIOS F06_3B17 03/16/2018
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffc9002721bc78 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff82d10ff0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffc9002721bc60 RDI: ffffffff82d10ff0
RBP: ffff889fd220e8f0 R08: 0000000000000000 R09: ffffffff812f1000
R10: ffff88bfa5fcb100 R11: 0000000000000000 R12: ffff889fd220e200
R13: ffff889fd220e92c R14: ffffffff82d10ff0 R15: 0000000000000000
FS:  00007f03161ff700(0000) GS:ffff88bfff9c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 0000000002409004 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __task_work_run+0x66/0xa0
 io_ring_ctx_wait_and_kill+0x14e/0x3c0
 io_uring_release+0x1c/0x20
 __fput+0xaa/0x200
 __task_work_run+0x66/0xa0
 do_exit+0x9cf/0xb40
 do_group_exit+0x3a/0xa0
 get_signal+0x152/0x800
 do_signal+0x36/0x640
 ? __audit_syscall_exit+0x23c/0x290
 exit_to_usermode_loop+0x65/0x100
 do_syscall_64+0xd4/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f77057fe8ce
Code: Bad RIP value.
RSP: 002b:00007f03161f8960 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
RAX: 000000000000002a RBX: 00007f03161f8a30 RCX: 00007f77057fe8ce
RDX: 0000000000004040 RSI: 00007f03161f8a30 RDI: 00000000000057a4
RBP: 00007f03161f8980 R08: 0000000000000000 R09: 00007f03161fb7b8
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
R13: 0000000000004040 R14: 00007f02dc12bc00 R15: 00007f02dc21b900

-- 
Jens Axboe
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Oleg Nesterov @ 2020-04-07 16:19 UTC
To: Jens Axboe; +Cc: io-uring, peterz

On 04/07, Jens Axboe wrote:
>
> On 4/7/20 4:39 AM, Oleg Nesterov wrote:
> >
> > IIUC, this is needed for the next change, which adds task_work_run() to
> > io_ring_ctx_wait_and_kill(), right?
>
> Right - so you'd rather I localize that check there instead? Can certainly
> do that.

I am still not sure we need this check at all... probably this is because
I don't understand the problem.

> > Could you explain how the exiting task can call io_ring_ctx_wait_and_kill()
> > after it has passed exit_task_work()?
>
> Sure, here's a trace where it happens:

but this task has not passed exit_task_work(),

> __task_work_run+0x66/0xa0
> io_ring_ctx_wait_and_kill+0x14e/0x3c0
> io_uring_release+0x1c/0x20
> __fput+0xaa/0x200
> __task_work_run+0x66/0xa0
> do_exit+0x9cf/0xb40

So task_work_run() is called recursively from
exit_task_work()->task_work_run(). See my other email; this is wrong
with or without this series. And that is why I think task_work_run()
hits work_exited.

Could you explain why io_ring_ctx_wait_and_kill() needs task_work_run()?

Oleg.
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Jens Axboe @ 2020-04-07 16:59 UTC
To: Oleg Nesterov; +Cc: io-uring, peterz

On 4/7/20 9:19 AM, Oleg Nesterov wrote:
> I am still not sure we need this check at all... probably this is because
> I don't understand the problem.

Probably because I'm not explaining it very well... Let me try.

io_uring uses the task_work to handle completion of poll requests. Either
an explicit poll, or one done implicitly because someone did a recv/send
(or whatever) on a socket that wasn't ready. When we get the poll
waitqueue callback, we queue up task_work to handle the completion of it.
These can come in at any time, obviously, as space or data becomes
available.

If the task is exiting, our task_work_add() fails, and we queue with
someone else. But there seems to be a case where it does get queued
locally, and then io_uring doesn't know if it's safe to run task_work or
not. Sometimes that manifests itself in hitting the RIP == 0 case that I
included here. With the work_pending && work != exit_work check in place,
it works fine.

> but this task has not passed exit_task_work(),

But it's definitely hitting callback->func == NULL, which is the
exit_work. So if it's not from past exit_task_work(), where is it from?

> So task_work_run() is called recursively from
> exit_task_work()->task_work_run(). See my other email; this is
> wrong with or without this series. And that is why I think
> task_work_run() hits work_exited.

I see your newer email on this, I'll go read it.

> Could you explain why io_ring_ctx_wait_and_kill() needs
> task_work_run()?

Hopefully the above helped! If I'm way off somehow, cluebat away.

-- 
Jens Axboe
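For readers without the io_uring background, the wakeup path Jens
describes looks roughly like the sketch below. This is a paraphrase, not
the actual fs/io_uring.c code; in particular, queue_poll_fallback() is a
made-up name for the "queue with someone else" fallback:

        static int io_async_wake_sketch(struct io_kiocb *req, __poll_t mask)
        {
                struct task_struct *tsk = req->task;

                req->result = mask;
                init_task_work(&req->task_work, io_poll_task_func);
                if (task_work_add(tsk, &req->task_work, true)) {
                        /*
                         * tsk has passed exit_task_work(): task_work_add()
                         * saw work_exited and returned -ESRCH, so punt the
                         * completion somewhere that can still run it.
                         */
                        queue_poll_fallback(req);   /* hypothetical helper */
                }
                return 1;
        }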
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Oleg Nesterov @ 2020-04-07 17:43 UTC
To: Jens Axboe; +Cc: io-uring, peterz

Jens, I'll read your email tomorrow, just one note for now ...

On 04/07, Jens Axboe wrote:
>
> On 4/7/20 9:19 AM, Oleg Nesterov wrote:
> >
> > but this task has not passed exit_task_work(),
>
> But it's definitely hitting callback->func == NULL, which is the
> exit_work. So if it's not from past exit_task_work(), where is it from?

I guess it comes from the task_work_run() added by the next patch ;)

> I see your newer email on this, I'll go read it.

please look at the "bad_work_func" example.

Oleg.
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Peter Zijlstra @ 2020-04-07 12:47 UTC
To: Jens Axboe; +Cc: io-uring, Oleg Nesterov, viro

You seem to have lost Oleg and Al from the Cc list..

On Mon, Apr 06, 2020 at 01:48:51PM -0600, Jens Axboe wrote:
> If task_work has already been run on task exit, we don't always know
> if it's safe to run again. Check for task_work_exited in the
> task_work_pending() helper. This makes it less fragile when called
> from the exit files path, for example.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>  include/linux/task_work.h | 4 +++-
>  kernel/task_work.c        | 8 ++++----
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/task_work.h b/include/linux/task_work.h
> index 54c911bbf754..24f977a8fc35 100644
> --- a/include/linux/task_work.h
> +++ b/include/linux/task_work.h
> @@ -7,6 +7,8 @@
>  
>  typedef void (*task_work_func_t)(struct callback_head *);
>  
> +extern struct callback_head task_work_exited;
> +
>  static inline void
>  init_task_work(struct callback_head *twork, task_work_func_t func)
>  {
> @@ -19,7 +21,7 @@ void __task_work_run(void);
>  
>  static inline bool task_work_pending(void)
>  {
> -	return current->task_works;
> +	return current->task_works && current->task_works != &task_work_exited;
>  }

Hurmph.. not sure I like this. It inlines that second condition into
every caller of task_work_run() even though for pretty much all of them
this is impossible.
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Jens Axboe @ 2020-04-07 15:43 UTC
To: Peter Zijlstra; +Cc: io-uring, Oleg Nesterov, viro

On 4/7/20 5:47 AM, Peter Zijlstra wrote:
>
> You seem to have lost Oleg and Al from the Cc list..

I'll add them for v2, I did point Oleg at it!

> >  static inline bool task_work_pending(void)
> >  {
> > -	return current->task_works;
> > +	return current->task_works && current->task_works != &task_work_exited;
> >  }
>
> Hurmph.. not sure I like this. It inlines that second condition into
> every caller of task_work_run() even though for pretty much all of them
> this is impossible.

Oleg had the same concern, and I agree with both of you. Would you prefer
we just leave task_work_run() as:

static inline void task_work_run(void)
{
	if (current->task_works)
		__task_work_run();
}

and then have the io_uring caller do:

if (current->task_works != &task_work_exited)
	task_work_run();

instead?

-- 
Jens Axboe
* [PATCH 3/4] task_work: kill current->task_works checking in callers
From: Jens Axboe @ 2020-04-06 19:48 UTC
To: io-uring; +Cc: peterz, Jens Axboe

If the callsite cares, use task_work_pending(). If not, just call
task_work_run() unconditionally; the check is now inlined, so this
doesn't add any extra overhead.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/io-wq.c                |  7 ++-----
 fs/io_uring.c             | 20 ++++++++------------
 include/linux/tracehook.h |  3 +--
 kernel/signal.c           |  7 +++----
 4 files changed, 14 insertions(+), 23 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 4023c9846860..5bee3f5f67e1 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -717,8 +717,7 @@ static int io_wq_manager(void *data)
 		complete(&wq->done);
 
 	while (!kthread_should_stop()) {
-		if (current->task_works)
-			task_work_run();
+		task_work_run();
 
 		for_each_node(node) {
 			struct io_wqe *wqe = wq->wqes[node];
@@ -742,9 +741,7 @@ static int io_wq_manager(void *data)
 		schedule_timeout(HZ);
 	}
 
-	if (current->task_works)
-		task_work_run();
-
+	task_work_run();
 	return 0;
 err:
 	set_bit(IO_WQ_BIT_ERROR, &wq->state);
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 79bd22289d73..1579390c7c53 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5967,8 +5967,7 @@ static int io_sq_thread(void *data)
 			if (!list_empty(&ctx->poll_list) ||
 			    (!time_after(jiffies, timeout) && ret != -EBUSY &&
 			    !percpu_ref_is_dying(&ctx->refs))) {
-				if (current->task_works)
-					task_work_run();
+				task_work_run();
 				cond_resched();
 				continue;
 			}
@@ -6000,8 +5999,8 @@ static int io_sq_thread(void *data)
 					finish_wait(&ctx->sqo_wait, &wait);
 					break;
 				}
-				if (current->task_works) {
-					task_work_run();
+				if (task_work_pending()) {
+					__task_work_run();
 					finish_wait(&ctx->sqo_wait, &wait);
 					continue;
 				}
@@ -6024,8 +6023,7 @@ static int io_sq_thread(void *data)
 		timeout = jiffies + ctx->sq_thread_idle;
 	}
 
-	if (current->task_works)
-		task_work_run();
+	task_work_run();
 
 	set_fs(old_fs);
 	if (cur_mm) {
@@ -6094,9 +6092,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	do {
 		if (io_cqring_events(ctx, false) >= min_events)
 			return 0;
-		if (!current->task_works)
+		if (!task_work_pending())
 			break;
-		task_work_run();
+		__task_work_run();
 	} while (1);
 
 	if (sig) {
@@ -6117,8 +6115,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	do {
 		prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
 						TASK_INTERRUPTIBLE);
-		if (current->task_works)
-			task_work_run();
+		task_work_run();
 		if (io_should_wake(&iowq, false))
 			break;
 		schedule();
@@ -7467,8 +7464,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	int submitted = 0;
 	struct fd f;
 
-	if (current->task_works)
-		task_work_run();
+	task_work_run();
 
 	if (flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP))
 		return -EINVAL;
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 36fb3bbed6b2..608a2d12bc14 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -184,8 +184,7 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
 	 * hlist_add_head(task->task_works);
 	 */
 	smp_mb__after_atomic();
-	if (unlikely(current->task_works))
-		task_work_run();
+	task_work_run();
 
 #ifdef CONFIG_KEYS_REQUEST_CACHE
 	if (unlikely(current->cached_requested_key)) {
diff --git a/kernel/signal.c b/kernel/signal.c
index e58a6c619824..d62b7a3f2045 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2271,8 +2271,8 @@ static void ptrace_do_notify(int signr, int exit_code, int why)
 void ptrace_notify(int exit_code)
 {
 	BUG_ON((exit_code & (0x7f | ~0xffff)) != SIGTRAP);
-	if (unlikely(current->task_works))
-		task_work_run();
+
+	task_work_run();
 
 	spin_lock_irq(&current->sighand->siglock);
 	ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED);
@@ -2529,8 +2529,7 @@ bool get_signal(struct ksignal *ksig)
 	struct signal_struct *signal = current->signal;
 	int signr;
 
-	if (unlikely(current->task_works))
-		task_work_run();
+	task_work_run();
 
 	if (unlikely(uprobe_deny_signal()))
 		return false;
-- 
2.26.0
* [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-06 19:48 UTC
To: io-uring; +Cc: peterz, Jens Axboe

We could have triggered task work when we removed poll completions. It's
now safe to unconditionally call task_work_run(), since it checks for the
exited task work; do so in case we have pending items there. Ensure this
happens before flushing the CQ ring overflow.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/io_uring.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1579390c7c53..183bd28761e3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7295,10 +7295,13 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 		io_wq_cancel_all(ctx->io_wq);
 
 	io_iopoll_reap_events(ctx);
+	idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx);
+	task_work_run();
+
 	/* if we failed setting up the ctx, we might not have any rings */
 	if (ctx->rings)
 		io_cqring_overflow_flush(ctx, true);
-	idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx);
+
 	wait_for_completion(&ctx->completions[0]);
 	io_ring_ctx_free(ctx);
 }
-- 
2.26.0
* [PATCHSET v2] io_uring and task_work interactions
From: Jens Axboe @ 2020-04-07 16:02 UTC
To: io-uring; +Cc: viro

There's a case in io_uring where we want to run the task_work on ring
exit, but we can't easily do that as we don't know whether task_works
holds any work, or whether the exit work has already been queued.

The first two patches are just cleanups, the third makes task_work_exited
externally visible, and the fourth lets io_uring call task_work_run() off
the fops->release() path to fix an issue where we potentially need to
flush work there.

Since v1:
- Don't add the task_work_exited check to the fast path
- Ensure exit_task_work() always runs

-- 
Jens Axboe
* [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-07 16:02 UTC
To: io-uring; +Cc: viro, Jens Axboe, Oleg Nesterov, Peter Zijlstra

We could have triggered task work when we removed poll completions. Be
sure to check if we're called off the exit path, as the exit task work
could have already been queued. If it has, then the poll flush will have
been queued to the io-wq helper thread, so there's no need to run the
work here. Ensure we do this before flushing the CQ ring overflow, in
case the poll flush puts us into overflow mode.

Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
 fs/io_uring.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7fb51c383e51..4e760b7cd772 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7293,10 +7293,15 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 		io_wq_cancel_all(ctx->io_wq);
 
 	io_iopoll_reap_events(ctx);
+	idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx);
+
+	if (current->task_works != &task_work_exited)
+		task_work_run();
+
 	/* if we failed setting up the ctx, we might not have any rings */
 	if (ctx->rings)
 		io_cqring_overflow_flush(ctx, true);
-	idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx);
+
 	wait_for_completion(&ctx->completions[0]);
 	io_ring_ctx_free(ctx);
 }
-- 
2.26.0
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Oleg Nesterov @ 2020-04-07 16:24 UTC
To: Jens Axboe; +Cc: io-uring, viro, Peter Zijlstra

On 04/07, Jens Axboe wrote:
>
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -7293,10 +7293,15 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
>  		io_wq_cancel_all(ctx->io_wq);
>
>  	io_iopoll_reap_events(ctx);
> +	idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx);
> +
> +	if (current->task_works != &task_work_exited)
> +		task_work_run();

this is still wrong, please see the email I sent a minute ago.

Oleg.
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Oleg Nesterov @ 2020-04-07 16:38 UTC
To: Jens Axboe; +Cc: io-uring, viro, Peter Zijlstra

On 04/07, Oleg Nesterov wrote:
>
> On 04/07, Jens Axboe wrote:
> >
> > +	if (current->task_works != &task_work_exited)
> > +		task_work_run();
>
> this is still wrong, please see the email I sent a minute ago.

Let me try to explain in case it was not clear. Let's forget about
io_uring.

	void bad_work_func(struct callback_head *cb)
	{
		task_work_run();
	}

	...

	init_task_work(&my_work, bad_work_func);

	task_work_add(task, &my_work);

If the "task" above is exiting, the kernel will crash, because the 2nd
task_work_run() called by bad_work_func() will install work_exited; then
we return to the task_work_run() that was called by exit_task_work(),
which will notice ->task_works != NULL, restart the main loop, and
execute work_exited->fn == NULL.

Again, if we want to allow task_work_run() in do_exit() paths we need
something like below. But I still do not understand why we need this :/

Oleg.

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index bd9a6a91c097..c9f36d233c39 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -15,11 +15,16 @@ init_task_work(struct callback_head *twork, task_work_func_t func)
 
 int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
-void task_work_run(void);
+void __task_work_run(bool is_exit);
+
+static inline void task_work_run(void)
+{
+	__task_work_run(false);
+}
 
 static inline void exit_task_work(struct task_struct *task)
 {
-	task_work_run();
+	__task_work_run(true);
 }
 
 #endif	/* _LINUX_TASK_WORK_H */
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 825f28259a19..7b26203a583e 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -87,7 +87,7 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
  * it exits. In the latter case task_work_add() can no longer add the
  * new work after task_work_run() returns.
  */
-void task_work_run(void)
+void __task_work_run(bool is_exit)
 {
 	struct task_struct *task = current;
 	struct callback_head *work, *head, *next;
@@ -101,7 +101,7 @@ void task_work_run(void)
 			head = NULL;
 			work = READ_ONCE(task->task_works);
 			if (!work) {
-				if (task->flags & PF_EXITING)
+				if (is_exit)
 					head = &work_exited;
 				else
 					break;
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-07 20:30 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/7/20 9:38 AM, Oleg Nesterov wrote:
> If the "task" above is exiting, the kernel will crash, because the 2nd
> task_work_run() called by bad_work_func() will install work_exited; then
> we return to the task_work_run() that was called by exit_task_work(),
> which will notice ->task_works != NULL, restart the main loop, and
> execute work_exited->fn == NULL.
>
> Again, if we want to allow task_work_run() in do_exit() paths we need
> something like below. But I still do not understand why we need this :/

The crash I sent was from the exit path. I don't think we need to run
the task_work for that case, as the ordering should imply that we either
queue the work with the task (if not exiting), and it'll get run just
fine, or we queue it with another task. For both those cases, there's no
need to run the local task work.

io_uring exit removes the pending poll requests, but what if (for a
non-exit invocation) we get poll requests completing before they are
torn down? Now we have task_work queued up that won't get run, because
we are in the task_work handler for the __fput(). For this case, we
need to run the task work.

But I can't tell them apart easily, hence I don't know when it's safe
to run it. That's what I'm trying to solve by exposing task_work_exited
so I can check for that specifically. Not really a great solution as
it doesn't tell me which of the cases I'm in, but at least it tells me
if it's safe to run the task work?

-- 
Jens Axboe
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-07 20:39 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/7/20 1:30 PM, Jens Axboe wrote:
> But I can't tell them apart easily, hence I don't know when it's safe
> to run it. That's what I'm trying to solve by exposing task_work_exited
> so I can check for that specifically. Not really a great solution as
> it doesn't tell me which of the cases I'm in, but at least it tells me
> if it's safe to run the task work?

It's also possible I totally mis-analyzed it, and it really is back to
"just" being an ordering issue that I then work around by re-running
the task_work within the handler.

-- 
Jens Axboe
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Oleg Nesterov @ 2020-04-08 18:40 UTC
To: Jens Axboe; +Cc: io-uring, viro, Peter Zijlstra

Jens, I am sorry. I tried to understand your explanations but I can't :/
Just in case, I know nothing about io_uring.

However, I strongly believe that

  - the "task_work_exited" check in 4/4 can't help; the kernel
    will crash anyway if a task-work callback runs with
    current->task_works == &task_work_exited.

  - this check is not needed with the patch I sent,
    UNLESS io_ring_ctx_wait_and_kill() can be called by the exiting
    task AFTER it passes exit_task_work(), but I don't see how this
    is possible.

Let's forget this problem; let's assume that task_work_run() is always
safe. I still can not understand why io_ring_ctx_wait_and_kill() needs
to call task_work_run().

On 04/07, Jens Axboe wrote:
>
> io_uring exit removes the pending poll requests, but what if (for a
> non-exit invocation) we get poll requests completing before they are
> torn down? Now we have task_work queued up that won't get run,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

this must not be possible. If task_work is queued it will run, or we
have another bug.

> because we
> are in the task_work handler for the __fput().

this doesn't matter...

> For this case, we
> need to run the task work.

This is what I fail to understand :/

Oleg.
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-08 18:48 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/8/20 11:40 AM, Oleg Nesterov wrote:
> Jens, I am sorry. I tried to understand your explanations but I can't :/
> Just in case, I know nothing about io_uring.
>
> Let's forget this problem; let's assume that task_work_run() is always
> safe. I still can not understand why io_ring_ctx_wait_and_kill() needs
> to call task_work_run().
>
> > Now we have task_work queued up that won't get run,
>
> this must not be possible. If task_work is queued it will run, or we
> have another bug.
>
> > For this case, we
> > need to run the task work.
>
> This is what I fail to understand :/

Actually debugging this just now to attempt to get to the bottom of it.
I'm running with Peter's "put fput work at the end at task_work_run
time" patch (with a head == NULL check that was missing). I get a hang
on the wait_for_completion() on io_uring exit, and if I dump the
task_work, this is what I get:

dump_work: dump cb
cb=ffff88bff25589b8, func=ffffffff812f7310   <- io_poll_task_func()
cb=ffff88bfdd164600, func=ffffffff812925e0   <- some __fput()
cb=ffff88bfece13cb8, func=ffffffff812f7310   <- io_poll_task_func()
cb=ffff88bff78393b8, func=ffffffff812b2c40

and we hang because io_poll_task_func() got queued twice on this task
_after_ we yanked the current list of work.

I'm adding some more debug items to figure out why this is, just wanted
to let you know that I'm currently looking into this and will provide
more data when I have it.

-- 
Jens Axboe
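The dump above presumably comes from a debug helper along these lines.
This is a hypothetical reconstruction, since the actual debug patch
wasn't posted:

        static void dump_work(struct task_struct *tsk)
        {
                struct callback_head *cb;

                pr_info("dump_work: dump cb\n");
                /* debug only: walking ->next races with task_work_add() */
                for (cb = READ_ONCE(tsk->task_works); cb; cb = cb->next)
                        pr_info("cb=%px, func=%px\n", cb, cb->func);
        }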
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-08 19:06 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/8/20 11:48 AM, Jens Axboe wrote:
> On 4/8/20 11:40 AM, Oleg Nesterov wrote:
>> However, I strongly believe that
>>
>>   - the "task_work_exited" check in 4/4 can't help; the kernel
>>     will crash anyway if a task-work callback runs with
>>     current->task_works == &task_work_exited.
>>
>>   - this check is not needed with the patch I sent,
>>     UNLESS io_ring_ctx_wait_and_kill() can be called by the exiting
>>     task AFTER it passes exit_task_work(), but I don't see how this
>>     is possible.
>
> Actually debugging this just now to attempt to get to the bottom of it.
> I'm running with Peter's "put fput work at the end at task_work_run
> time" patch (with a head == NULL check that was missing). I get a hang
> on the wait_for_completion() on io_uring exit, and if I dump the
> task_work, this is what I get:
>
> dump_work: dump cb
> cb=ffff88bff25589b8, func=ffffffff812f7310   <- io_poll_task_func()
> cb=ffff88bfdd164600, func=ffffffff812925e0   <- some __fput()
> cb=ffff88bfece13cb8, func=ffffffff812f7310   <- io_poll_task_func()
> cb=ffff88bff78393b8, func=ffffffff812b2c40
>
> and we hang because io_poll_task_func() got queued twice on this task
> _after_ we yanked the current list of work.
>
> I'm adding some more debug items to figure out why this is, just wanted
> to let you know that I'm currently looking into this and will provide
> more data when I have it.

Here's some more data. I added a WARN_ON_ONCE() for task->flags &
PF_EXITING on task_work_add() success, and it triggers with the
following backtrace:

[  628.291061] RIP: 0010:__io_async_wake+0x14a/0x160
[  628.300452] Code: 8b b8 c8 00 00 00 e8 75 df 00 00 ba 01 00 00 00 48 89 ee 48 89 c7 49 89 c6 e8 82 dd de ff e9 59 ff ff ff 0f 0b e9 52 ff ff ff <0f> 0b e9 40 ff ff ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00
[  628.337930] RSP: 0018:ffffc9000c830a40 EFLAGS: 00010002
[  628.348365] RAX: 0000000000000000 RBX: ffff88bfe85fc200 RCX: ffff88bff58491b8
[  628.362610] RDX: 0000000000000001 RSI: ffff88bfe85fc2b8 RDI: ffff889fc929f000
[  628.376855] RBP: ffff88bfe85fc2b8 R08: 00000000000000c3 R09: ffffc9000c830ad0
[  628.391087] R10: 0000000000000000 R11: ffff889ff01000a0 R12: ffffc9000c830ad0
[  628.405317] R13: ffff889fb405fc00 R14: ffff889fc929f000 R15: ffff88bfe85fc200
[  628.419549] FS:  0000000000000000(0000) GS:ffff889fff5c0000(0000) knlGS:0000000000000000
[  628.435706] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  628.447178] CR2: 00007f40a3c4b8f0 CR3: 0000000002409002 CR4: 00000000003606e0
[  628.461427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  628.475675] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  628.489924] Call Trace:
[  628.494810]  <IRQ>
[  628.498835]  ? __io_queue_sqe.part.97+0x750/0x750
[  628.508229]  ? tcp_ack_update_rtt.isra.55+0x113/0x430
[  628.518320]  __wake_up_common+0x71/0x140
[  628.526152]  __wake_up_common_lock+0x7c/0xc0
[  628.534681]  sock_def_readable+0x3c/0x70
[  628.542516]  tcp_data_queue+0x2d9/0xb50
[  628.550175]  tcp_rcv_established+0x1ce/0x620
[  628.558703]  ? sk_filter_trim_cap+0x4f/0x200
[  628.567232]  tcp_v6_do_rcv+0xbe/0x3b0
[  628.574545]  tcp_v6_rcv+0xa8d/0xb20
[  628.581516]  ip6_protocol_deliver_rcu+0xb4/0x450
[  628.590736]  ip6_input_finish+0x11/0x20
[  628.598396]  ip6_input+0xa0/0xb0
[  628.604845]  ? tcp_v6_early_demux+0x90/0x140
[  628.613371]  ? tcp_v6_early_demux+0xdb/0x140
[  628.621902]  ? ip6_rcv_finish_core.isra.21+0x66/0x90
[  628.631816]  ipv6_rcv+0xc0/0xd0
[  628.638092]  __netif_receive_skb_one_core+0x50/0x70
[  628.647833]  netif_receive_skb_internal+0x2f/0xa0
[  628.657226]  napi_gro_receive+0xe7/0x150
[  628.665068]  mlx5e_handle_rx_cqe+0x8c/0x100
[  628.673416]  mlx5e_poll_rx_cq+0xef/0x95b
[  628.681252]  mlx5e_napi_poll+0xe2/0x610
[  628.688913]  net_rx_action+0x132/0x360
[  628.696403]  __do_softirq+0xd3/0x2e6
[  628.703545]  irq_exit+0xa5/0xb0
[  628.709816]  do_IRQ+0x79/0xd0
[  628.715744]  common_interrupt+0xf/0xf
[  628.723057]  </IRQ>
[  628.727256] RIP: 0010:cpuidle_enter_state+0xac/0x410

which means that we've successfully added the task_work while the
process is exiting. Maybe I can work around this by checking myself
instead of relying on task_work_add() finding work_exited on the list.

-- 
Jens Axboe
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Oleg Nesterov @ 2020-04-08 20:17 UTC
To: Jens Axboe; +Cc: io-uring, viro, Peter Zijlstra

On 04/08, Jens Axboe wrote:
>
> Here's some more data. I added a WARN_ON_ONCE() for task->flags &
> PF_EXITING on task_work_add() success, and it triggers with the
> following backtrace:
...
> which means that we've successfully added the task_work while the
> process is exiting.

but this is fine, task_work_add(task) can succeed if task->flags &
PF_EXITING.

task_work_add(task, work) should only fail if this "task" has already
passed exit_task_work(). Because if this task has already passed
exit_task_work(), nothing else can flush this work and call work->func().

Oleg.
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-08 20:25 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/8/20 1:17 PM, Oleg Nesterov wrote:
> but this is fine, task_work_add(task) can succeed if task->flags &
> PF_EXITING.
>
> task_work_add(task, work) should only fail if this "task" has already
> passed exit_task_work(). Because if this task has already passed
> exit_task_work(), nothing else can flush this work and call work->func().

So the question remains, we basically have this:

	A				B
	task_work_run(tsk)
					task_work_add(tsk, io_poll_task_func())
	process cbs
	wait_for_completion()

with the last wait needing to flush the work added on the B side, since
that isn't part of the initial list.

I don't think I can fully close that race _without_ re-running task work
there. Could do something a la:

	A				B
	mark context "dead"
	task_work_run(tsk)
					if (context dead)
						task_work_add(helper, io_poll_task_func())
					else
						task_work_add(tsk, io_poll_task_func())
	process cbs
	wait_for_completion()

which would do the trick, but I still need to flush work after having
marked the context dead.

-- 
Jens Axboe
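On the B side, the second diagram amounts to something like the sketch
below. All the names here (ctx->dead, req->fallback_work) are assumptions
for illustration, not existing fields:

        static void io_queue_poll_completion(struct io_kiocb *req)
        {
                init_task_work(&req->task_work, io_poll_task_func);

                if (READ_ONCE(req->ctx->dead) ||
                    task_work_add(req->task, &req->task_work, true)) {
                        /*
                         * Ring teardown has started, or the task can no
                         * longer run task_work: hand the completion to a
                         * helper thread so the request reference is still
                         * dropped.
                         */
                        schedule_work(&req->fallback_work);
                }
        }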
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-08 21:19 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/8/20 1:25 PM, Jens Axboe wrote:
> I don't think I can fully close that race _without_ re-running task work
> there. Could do something a la:
>
>	A				B
>	mark context "dead"
>	task_work_run(tsk)
>					if (context dead)
>						task_work_add(helper, io_poll_task_func())
>					else
>						task_work_add(tsk, io_poll_task_func())
>	process cbs
>	wait_for_completion()
>
> which would do the trick, but I still need to flush work after having
> marked the context dead.

Actually, I guess it's not enough to re-run the work; we could also have
ordering issues if we have io_poll_task_func() after the fput of the
ring. Maybe this could all work just fine if we just make the ring exit
non-blocking... Testing.

-- 
Jens Axboe
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Oleg Nesterov @ 2020-04-09 18:50 UTC
To: Jens Axboe; +Cc: io-uring, viro, Peter Zijlstra

On 04/08, Jens Axboe wrote:
>
> So the question remains, we basically have this:
>
>	A				B
>	task_work_run(tsk)
>					task_work_add(tsk, io_poll_task_func())
>	process cbs
>	wait_for_completion()
>
> with the last wait needing to flush the work added on the B side, since
> that isn't part of the initial list.

I don't understand you, even remotely :/

maybe you can write some pseudo-code?

who does wait_for_completion(), a callback? or this "tsk" after it does
task_work_run()? Who does complete()? How can this wait_for_completion()
help to flush the work added on the B side? And why do you need to do
something special to flush that work?

Could you also explain the comment above task_work_add() in
__io_async_wake()?

	If this fails, then the task is exiting.

OK,

	If that is the case, then the exit check

which exit check?

	will ultimately cancel these work items.

what does this mean? there is nothing to cancel if task_work_add() fails,
I guess this means something else...

Oleg.
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-10 0:29 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/9/20 11:50 AM, Oleg Nesterov wrote:
> I don't understand you, even remotely :/
>
> maybe you can write some pseudo-code?

Sorry, I guess my explanation skills aren't stellar, or perhaps they
assume too much prior knowledge about this. I just wasted about a day on
debugging this because I had Peter's patch applied, and I should have
reviewed that more carefully. So a lot of the hangs were due to just
missing the running of some of the work.

> who does wait_for_completion(), a callback? or this "tsk" after it does
> task_work_run()? Who does complete()? How can this wait_for_completion()
> help to flush the work added on the B side? And why do you need to do
> something special to flush that work?

I'll try to explain this. Each io_uring operation has a request
associated with it. Each request grabs a reference to the ctx, and the
ctx can't exit before we reach zero references (obviously). When we
enter ctx teardown, we wait to hit zero references. Each ctx is
basically just a file descriptor, which means that we most often end up
waiting for zero references off the fput() path.

Some requests can spawn task_work items. If we have ordering issues on
the task_works list, we can have the fput() ordered before items that
need to complete. These items are what ultimately put the request, so
you end up in a situation where you block waiting for requests to
complete, but these requests can't complete until the current work has
completed. This is the deadlock.

Once I got rid of Peter's patch, I solved this by just making the
wait-and-free operation go async. This allows any dependent work to run
and complete just fine. Here's the patch:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.7&id=2baf397719af3a8121fcaba964470950356a4a7a

and I could perhaps make this smarter by checking if current->task_works
!= NULL, but I don't think there's much point to that. The important
part is that we complete the fput() task_work without blocking, so the
remaining work gets a chance to run.

Hope this helps! As mentioned in the commit message, we could have a
separate list of task_works for io_uring, which is what I originally
proposed. But I can also work around it like in the patch referenced
above.

-- 
Jens Axboe
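The referenced patch makes the final wait happen off a workqueue rather
than in the fput() task_work. Roughly the following shape, with the
exit_work field assumed from the description rather than copied from the
commit:

        static void io_ring_exit_work(struct work_struct *work)
        {
                struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx,
                                                       exit_work);

                /*
                 * Off the workqueue we can block: any task_work still queued
                 * on the original task is free to run and drop its ctx
                 * references, so the completion can make progress.
                 */
                wait_for_completion(&ctx->completions[0]);
                io_ring_ctx_free(ctx);
        }

        static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
        {
                /* ... cancel and reap as before, but don't block here ... */
                INIT_WORK(&ctx->exit_work, io_ring_exit_work);
                queue_work(system_wq, &ctx->exit_work);
        }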