* [PATCHSET 0/4] io_uring and task_work interactions
From: Jens Axboe @ 2020-04-06 19:48 UTC
To: io-uring; +Cc: peterz

There's a case in io_uring where we want to run the task_work on ring
exit, but we can't easily do that as we don't know whether task_works
holds any work, or whether the exit work has already been queued.

Here are two prep patches that add a task_work_pending() helper and make
task_work_run() check it. Then we can remove those checks from the
callers, and io_uring can run any pending work on ring exit if it needs
to.

-- 
Jens Axboe
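The net effect on call sites (formalized later in patch 3/4) is sketched
below; these are before/after fragments, not complete functions:

        /* Before: every call site open-codes the pending check. */
        if (current->task_works)
                task_work_run();

        /*
         * After patch 1/4: task_work_run() is an inline wrapper that
         * performs the check itself, so callers can invoke it
         * unconditionally at no extra cost.
         */
        task_work_run();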
* [PATCH 1/4] task_work: add task_work_pending() helper
From: Jens Axboe @ 2020-04-06 19:48 UTC
To: io-uring; +Cc: peterz, Jens Axboe

Various callsites currently check current->task_works != NULL to know
when to run work. Add a helper that we use internally for that. This is
in preparation for also not running if the exit work has been queued.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/task_work.h | 13 ++++++++++++-
 kernel/task_work.c        |  2 +-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index bd9a6a91c097..54c911bbf754 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -15,7 +15,18 @@ init_task_work(struct callback_head *twork, task_work_func_t func)
 
 int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
-void task_work_run(void);
+void __task_work_run(void);
+
+static inline bool task_work_pending(void)
+{
+	return current->task_works;
+}
+
+static inline void task_work_run(void)
+{
+	if (task_work_pending())
+		__task_work_run();
+}
 
 static inline void exit_task_work(struct task_struct *task)
 {
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 825f28259a19..9620333423a3 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -87,7 +87,7 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
  * it exits. In the latter case task_work_add() can no longer add the
  * new work after task_work_run() returns.
  */
-void task_work_run(void)
+void __task_work_run(void)
 {
 	struct task_struct *task = current;
 	struct callback_head *work, *head, *next;
-- 
2.26.0
* Re: [PATCH 1/4] task_work: add task_work_pending() helper
From: Oleg Nesterov @ 2020-04-07 11:28 UTC
To: Jens Axboe; +Cc: io-uring, peterz

On 04/06, Jens Axboe wrote:
>
> +static inline bool task_work_pending(void)
> +{
> +	return current->task_works;
> +}
> +
> +static inline void task_work_run(void)
> +{
> +	if (task_work_pending())
> +		__task_work_run();
> +}

No, this is wrong. exit_task_work() must always call __task_work_run()
to install work_exited.

This helper (and 3/4) probably makes sense, but please change
exit_task_work() to use __task_work_run() then.

Oleg.
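Concretely, the fix being asked for amounts to something like this in
include/linux/task_work.h — a sketch only, reusing the __task_work_run()
split from patch 1/4:

        static inline void exit_task_work(struct task_struct *task)
        {
                /*
                 * Skip the task_work_pending() shortcut: even with an
                 * empty list, the exiting task must run __task_work_run()
                 * so it installs the work_exited sentinel and fences off
                 * further task_work_add().
                 */
                __task_work_run();
        }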
* Re: [PATCH 1/4] task_work: add task_work_pending() helper
From: Jens Axboe @ 2020-04-07 15:43 UTC
To: Oleg Nesterov; +Cc: io-uring, peterz

On 4/7/20 4:28 AM, Oleg Nesterov wrote:
> On 04/06, Jens Axboe wrote:
>>
>> +static inline bool task_work_pending(void)
>> +{
>> +	return current->task_works;
>> +}
>> +
>> +static inline void task_work_run(void)
>> +{
>> +	if (task_work_pending())
>> +		__task_work_run();
>> +}
>
> No, this is wrong. exit_task_work() must always call __task_work_run()
> to install work_exited.
>
> This helper (and 3/4) probably makes sense but please change
> exit_task_work() to use __task_work_run() then.

Good catch, thanks. I'll make the change.

-- 
Jens Axboe
* [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Jens Axboe @ 2020-04-06 19:48 UTC
To: io-uring; +Cc: peterz, Jens Axboe

If task_work has already been run on task exit, we don't always know
if it's safe to run again. Check for task_work_exited in the
task_work_pending() helper. This makes it less fragile when called
from the exit files path, for example.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/task_work.h | 4 +++-
 kernel/task_work.c        | 8 ++++----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 54c911bbf754..24f977a8fc35 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -7,6 +7,8 @@
 
 typedef void (*task_work_func_t)(struct callback_head *);
 
+extern struct callback_head task_work_exited;
+
 static inline void
 init_task_work(struct callback_head *twork, task_work_func_t func)
 {
@@ -19,7 +21,7 @@ void __task_work_run(void);
 
 static inline bool task_work_pending(void)
 {
-	return current->task_works;
+	return current->task_works && current->task_works != &task_work_exited;
 }
 
 static inline void task_work_run(void)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 9620333423a3..d6a8b4ab4858 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -3,7 +3,7 @@
 #include <linux/task_work.h>
 #include <linux/tracehook.h>
 
-static struct callback_head work_exited; /* all we need is ->next == NULL */
+struct callback_head task_work_exited = { };
 
 /**
  * task_work_add - ask the @task to execute @work->func()
@@ -31,7 +31,7 @@ task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 
 	do {
 		head = READ_ONCE(task->task_works);
-		if (unlikely(head == &work_exited))
+		if (unlikely(head == &task_work_exited))
 			return -ESRCH;
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);
@@ -95,14 +95,14 @@ void __task_work_run(void)
 	for (;;) {
 		/*
 		 * work->func() can do task_work_add(), do not set
-		 * work_exited unless the list is empty.
+		 * task_work_exited unless the list is empty.
 		 */
 		do {
 			head = NULL;
 			work = READ_ONCE(task->task_works);
 			if (!work) {
 				if (task->flags & PF_EXITING)
-					head = &work_exited;
+					head = &task_work_exited;
 				else
 					break;
 			}
-- 
2.26.0
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Oleg Nesterov @ 2020-04-07 11:39 UTC
To: Jens Axboe; +Cc: io-uring, peterz

On 04/06, Jens Axboe wrote:
>
> +extern struct callback_head task_work_exited;
> +
>  static inline void
>  init_task_work(struct callback_head *twork, task_work_func_t func)
>  {
> @@ -19,7 +21,7 @@ void __task_work_run(void);
>
>  static inline bool task_work_pending(void)
>  {
> -	return current->task_works;
> +	return current->task_works && current->task_works != &task_work_exited;
 	                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Well, this penalizes all the current users; they can't hit work_exited.

IIUC, this is needed for the next change, which adds task_work_run() to
io_ring_ctx_wait_and_kill(), right?

Could you explain how the exiting task can call io_ring_ctx_wait_and_kill()
after it has passed exit_task_work()?

Oleg.
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Oleg Nesterov @ 2020-04-07 11:54 UTC
To: Jens Axboe; +Cc: io-uring, peterz

On 04/07, Oleg Nesterov wrote:
>
> On 04/06, Jens Axboe wrote:
> >
> > +extern struct callback_head task_work_exited;
> > +
> >  static inline bool task_work_pending(void)
> >  {
> > -	return current->task_works;
> > +	return current->task_works && current->task_works != &task_work_exited;
>  	                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Well, this penalizes all the current users; they can't hit work_exited.
>
> IIUC, this is needed for the next change, which adds task_work_run() to
> io_ring_ctx_wait_and_kill(), right?
>
> Could you explain how the exiting task can call io_ring_ctx_wait_and_kill()
> after it has passed exit_task_work()?

Note also that currently we assume that task_work_run() must not be
called by the exiting process (except via exit_task_work). If io_uring
sets the precedent, we need to change the PF_EXITING logic in
task_work_run().

Oleg.
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Jens Axboe @ 2020-04-07 15:40 UTC
To: Oleg Nesterov; +Cc: io-uring, peterz

On 4/7/20 4:39 AM, Oleg Nesterov wrote:
> Well, this penalizes all the current users; they can't hit work_exited.
>
> IIUC, this is needed for the next change, which adds task_work_run() to
> io_ring_ctx_wait_and_kill(), right?

Right - so you'd rather I localize that check there instead? Can certainly
do that.

> Could you explain how the exiting task can call io_ring_ctx_wait_and_kill()
> after it has passed exit_task_work()?

Sure, here's a trace where it happens:

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 0 P4D 0
Oops: 0010 [#1] SMP
CPU: 51 PID: 7290 Comm: mc_worker Kdump: loaded Not tainted 5.2.9-03696-gf2db01aa1e97 #190
Hardware name: Quanta Leopard ORv2-DDR4/Leopard ORv2-DDR4, BIOS F06_3B17 03/16/2018
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffc9002721bc78 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff82d10ff0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffc9002721bc60 RDI: ffffffff82d10ff0
RBP: ffff889fd220e8f0 R08: 0000000000000000 R09: ffffffff812f1000
R10: ffff88bfa5fcb100 R11: 0000000000000000 R12: ffff889fd220e200
R13: ffff889fd220e92c R14: ffffffff82d10ff0 R15: 0000000000000000
FS:  00007f03161ff700(0000) GS:ffff88bfff9c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 0000000002409004 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __task_work_run+0x66/0xa0
 io_ring_ctx_wait_and_kill+0x14e/0x3c0
 io_uring_release+0x1c/0x20
 __fput+0xaa/0x200
 __task_work_run+0x66/0xa0
 do_exit+0x9cf/0xb40
 do_group_exit+0x3a/0xa0
 get_signal+0x152/0x800
 do_signal+0x36/0x640
 ? __audit_syscall_exit+0x23c/0x290
 exit_to_usermode_loop+0x65/0x100
 do_syscall_64+0xd4/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f77057fe8ce
Code: Bad RIP value.
RSP: 002b:00007f03161f8960 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
RAX: 000000000000002a RBX: 00007f03161f8a30 RCX: 00007f77057fe8ce
RDX: 0000000000004040 RSI: 00007f03161f8a30 RDI: 00000000000057a4
RBP: 00007f03161f8980 R08: 0000000000000000 R09: 00007f03161fb7b8
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
R13: 0000000000004040 R14: 00007f02dc12bc00 R15: 00007f02dc21b900

-- 
Jens Axboe
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Oleg Nesterov @ 2020-04-07 16:19 UTC
To: Jens Axboe; +Cc: io-uring, peterz

On 04/07, Jens Axboe wrote:
>
> On 4/7/20 4:39 AM, Oleg Nesterov wrote:
> >
> > IIUC, this is needed for the next change, which adds task_work_run() to
> > io_ring_ctx_wait_and_kill(), right?
>
> Right - so you'd rather I localize that check there instead? Can certainly
> do that.

I am still not sure we need this check at all... probably this is because
I don't understand the problem.

> > Could you explain how the exiting task can call io_ring_ctx_wait_and_kill()
> > after it has passed exit_task_work()?
>
> Sure, here's a trace where it happens:

but this task has not passed exit_task_work(),

> __task_work_run+0x66/0xa0
> io_ring_ctx_wait_and_kill+0x14e/0x3c0
> io_uring_release+0x1c/0x20
> __fput+0xaa/0x200
> __task_work_run+0x66/0xa0
> do_exit+0x9cf/0xb40

So task_work_run() is called recursively from
exit_task_work()->task_work_run(). See my other email; this is wrong
with or without this series. And that is why I think task_work_run()
hits work_exited.

Could you explain why io_ring_ctx_wait_and_kill() needs task_work_run()?

Oleg.
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Jens Axboe @ 2020-04-07 16:59 UTC
To: Oleg Nesterov; +Cc: io-uring, peterz

On 4/7/20 9:19 AM, Oleg Nesterov wrote:
> I am still not sure we need this check at all... probably this is because
> I don't understand the problem.

Probably because I'm not explaining it very well... Let me try.

io_uring uses the task_work to handle completion of poll requests. Either
an explicit poll, or one done implicitly because someone did a recv/send
(or whatever) on a socket that wasn't ready. When we get the poll
waitqueue callback, we queue up task_work to handle the completion of it.
These can come in at any time, obviously, as space or data becomes
available.

If the task is exiting, our task_work_add() fails, and we queue with
someone else. But there seems to be a case where it does get queued
locally, and then io_uring doesn't know if it's safe to run task_work or
not. Sometimes that manifests itself in hitting the RIP == 0 case that I
included here. With the work_pending && work != exit_work check in place,
it works fine.

> but this task has not passed exit_task_work(),

But it's definitely hitting callback->func == NULL, which is the
exit_work. So if it's not from past exit_task_work(), where is it from?

> So task_work_run() is called recursively from
> exit_task_work()->task_work_run(). See my other email; this is
> wrong with or without this series. And that is why I think
> task_work_run() hits work_exited.

I see your newer email on this, I'll go read it.

> Could you explain why io_ring_ctx_wait_and_kill() needs
> task_work_run()?

Hopefully the above helped! If I'm way off somehow, cluebat away.

-- 
Jens Axboe
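For readers without the io_uring background, the wakeup path Jens
describes looks roughly like the sketch below. This is a paraphrase, not
the actual fs/io_uring.c code; in particular, queue_poll_fallback() is a
made-up name for the "queue with someone else" fallback:

        static int io_async_wake_sketch(struct io_kiocb *req, __poll_t mask)
        {
                struct task_struct *tsk = req->task;

                req->result = mask;
                init_task_work(&req->task_work, io_poll_task_func);
                if (task_work_add(tsk, &req->task_work, true)) {
                        /*
                         * tsk has passed exit_task_work(): task_work_add()
                         * saw work_exited and returned -ESRCH, so punt the
                         * completion somewhere that can still run it.
                         */
                        queue_poll_fallback(req);   /* hypothetical helper */
                }
                return 1;
        }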
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Oleg Nesterov @ 2020-04-07 17:43 UTC
To: Jens Axboe; +Cc: io-uring, peterz

Jens, I'll read your email tomorrow, just one note for now ...

On 04/07, Jens Axboe wrote:
>
> On 4/7/20 9:19 AM, Oleg Nesterov wrote:
> >
> > but this task has not passed exit_task_work(),
>
> But it's definitely hitting callback->func == NULL, which is the
> exit_work. So if it's not from past exit_task_work(), where is it from?

I guess it comes from the task_work_run() added by the next patch ;)

> I see your newer email on this, I'll go read it.

please look at the "bad_work_func" example.

Oleg.
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Peter Zijlstra @ 2020-04-07 12:47 UTC
To: Jens Axboe; +Cc: io-uring, Oleg Nesterov, viro

You seem to have lost Oleg and Al from the Cc list..

On Mon, Apr 06, 2020 at 01:48:51PM -0600, Jens Axboe wrote:
> If task_work has already been run on task exit, we don't always know
> if it's safe to run again. Check for task_work_exited in the
> task_work_pending() helper. This makes it less fragile when called
> from the exit files path, for example.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>  include/linux/task_work.h | 4 +++-
>  kernel/task_work.c        | 8 ++++----
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/task_work.h b/include/linux/task_work.h
> index 54c911bbf754..24f977a8fc35 100644
> --- a/include/linux/task_work.h
> +++ b/include/linux/task_work.h
> @@ -7,6 +7,8 @@
>  
>  typedef void (*task_work_func_t)(struct callback_head *);
>  
> +extern struct callback_head task_work_exited;
> +
>  static inline void
>  init_task_work(struct callback_head *twork, task_work_func_t func)
>  {
> @@ -19,7 +21,7 @@ void __task_work_run(void);
>  
>  static inline bool task_work_pending(void)
>  {
> -	return current->task_works;
> +	return current->task_works && current->task_works != &task_work_exited;
>  }

Hurmph.. not sure I like this. It inlines that second condition into
every caller of task_work_run() even though for pretty much all of them
this is impossible.
* Re: [PATCH 2/4] task_work: don't run task_work if task_work_exited is queued
From: Jens Axboe @ 2020-04-07 15:43 UTC
To: Peter Zijlstra; +Cc: io-uring, Oleg Nesterov, viro

On 4/7/20 5:47 AM, Peter Zijlstra wrote:
>
> You seem to have lost Oleg and Al from the Cc list..

I'll add them for v2, I did point Oleg at it!

> >  static inline bool task_work_pending(void)
> >  {
> > -	return current->task_works;
> > +	return current->task_works && current->task_works != &task_work_exited;
> >  }
>
> Hurmph.. not sure I like this. It inlines that second condition into
> every caller of task_work_run() even though for pretty much all of them
> this is impossible.

Oleg had the same concern, and I agree with both of you. Would you prefer
we just leave task_work_run() as:

static inline void task_work_run(void)
{
	if (current->task_works)
		__task_work_run();
}

and then have the io_uring caller do:

if (current->task_works != &task_work_exited)
	task_work_run();

instead?

-- 
Jens Axboe
* [PATCH 3/4] task_work: kill current->task_works checking in callers
From: Jens Axboe @ 2020-04-06 19:48 UTC
To: io-uring; +Cc: peterz, Jens Axboe

If the callsite cares, use task_work_pending(). If not, just call
task_work_run() unconditionally; the check is now inlined, so this
doesn't add any extra overhead.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/io-wq.c                |  7 ++-----
 fs/io_uring.c             | 20 ++++++++------------
 include/linux/tracehook.h |  3 +--
 kernel/signal.c           |  7 +++----
 4 files changed, 14 insertions(+), 23 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 4023c9846860..5bee3f5f67e1 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -717,8 +717,7 @@ static int io_wq_manager(void *data)
 		complete(&wq->done);
 
 	while (!kthread_should_stop()) {
-		if (current->task_works)
-			task_work_run();
+		task_work_run();
 
 		for_each_node(node) {
 			struct io_wqe *wqe = wq->wqes[node];
@@ -742,9 +741,7 @@ static int io_wq_manager(void *data)
 		schedule_timeout(HZ);
 	}
 
-	if (current->task_works)
-		task_work_run();
-
+	task_work_run();
 	return 0;
 err:
 	set_bit(IO_WQ_BIT_ERROR, &wq->state);
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 79bd22289d73..1579390c7c53 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5967,8 +5967,7 @@ static int io_sq_thread(void *data)
 			if (!list_empty(&ctx->poll_list) ||
 			    (!time_after(jiffies, timeout) && ret != -EBUSY &&
 			    !percpu_ref_is_dying(&ctx->refs))) {
-				if (current->task_works)
-					task_work_run();
+				task_work_run();
 				cond_resched();
 				continue;
 			}
@@ -6000,8 +5999,8 @@ static int io_sq_thread(void *data)
 					finish_wait(&ctx->sqo_wait, &wait);
 					break;
 				}
-				if (current->task_works) {
-					task_work_run();
+				if (task_work_pending()) {
+					__task_work_run();
 					finish_wait(&ctx->sqo_wait, &wait);
 					continue;
 				}
@@ -6024,8 +6023,7 @@ static int io_sq_thread(void *data)
 		timeout = jiffies + ctx->sq_thread_idle;
 	}
 
-	if (current->task_works)
-		task_work_run();
+	task_work_run();
 
 	set_fs(old_fs);
 	if (cur_mm) {
@@ -6094,9 +6092,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	do {
 		if (io_cqring_events(ctx, false) >= min_events)
 			return 0;
-		if (!current->task_works)
+		if (!task_work_pending())
 			break;
-		task_work_run();
+		__task_work_run();
 	} while (1);
 
 	if (sig) {
@@ -6117,8 +6115,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	do {
 		prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
 						TASK_INTERRUPTIBLE);
-		if (current->task_works)
-			task_work_run();
+		task_work_run();
 		if (io_should_wake(&iowq, false))
 			break;
 		schedule();
@@ -7467,8 +7464,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	int submitted = 0;
 	struct fd f;
 
-	if (current->task_works)
-		task_work_run();
+	task_work_run();
 
 	if (flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP))
 		return -EINVAL;
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 36fb3bbed6b2..608a2d12bc14 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -184,8 +184,7 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
 	 * hlist_add_head(task->task_works);
 	 */
 	smp_mb__after_atomic();
-	if (unlikely(current->task_works))
-		task_work_run();
+	task_work_run();
 
 #ifdef CONFIG_KEYS_REQUEST_CACHE
 	if (unlikely(current->cached_requested_key)) {
diff --git a/kernel/signal.c b/kernel/signal.c
index e58a6c619824..d62b7a3f2045 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2271,8 +2271,8 @@ static void ptrace_do_notify(int signr, int exit_code, int why)
 void ptrace_notify(int exit_code)
 {
 	BUG_ON((exit_code & (0x7f | ~0xffff)) != SIGTRAP);
-	if (unlikely(current->task_works))
-		task_work_run();
+
+	task_work_run();
 
 	spin_lock_irq(&current->sighand->siglock);
 	ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED);
@@ -2529,8 +2529,7 @@ bool get_signal(struct ksignal *ksig)
 	struct signal_struct *signal = current->signal;
 	int signr;
 
-	if (unlikely(current->task_works))
-		task_work_run();
+	task_work_run();
 
 	if (unlikely(uprobe_deny_signal()))
 		return false;
-- 
2.26.0
* [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-06 19:48 UTC
To: io-uring; +Cc: peterz, Jens Axboe

We could have triggered task work when we removed poll completions. It's
now safe to unconditionally call task_work_run(), since it checks for the
exited task work; do so in case we have pending items there. Ensure this
happens before flushing the CQ ring overflow.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/io_uring.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1579390c7c53..183bd28761e3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7295,10 +7295,13 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 		io_wq_cancel_all(ctx->io_wq);
 
 	io_iopoll_reap_events(ctx);
+	idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx);
+	task_work_run();
+
 	/* if we failed setting up the ctx, we might not have any rings */
 	if (ctx->rings)
 		io_cqring_overflow_flush(ctx, true);
-	idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx);
+
 	wait_for_completion(&ctx->completions[0]);
 	io_ring_ctx_free(ctx);
 }
-- 
2.26.0
* [PATCHSET v2] io_uring and task_work interactions
From: Jens Axboe @ 2020-04-07 16:02 UTC
To: io-uring; +Cc: viro

There's a case in io_uring where we want to run the task_work on ring
exit, but we can't easily do that as we don't know whether task_works
holds any work, or whether the exit work has already been queued.

The first two patches are just cleanups, the third makes task_work_exited
externally visible, and the fourth lets io_uring call task_work_run() off
the fops->release() path to fix an issue where we potentially need to
flush work there.

Since v1:
- Don't add the task_work_exited check to the fast path
- Ensure exit_task_work() always runs

-- 
Jens Axboe
* [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-07 16:02 UTC
To: io-uring; +Cc: viro, Jens Axboe, Oleg Nesterov, Peter Zijlstra

We could have triggered task work when we removed poll completions. Be
sure to check if we're called off the exit path, as the exit task work
could have already been queued. If it has, then the poll flush will have
been queued to the io-wq helper thread, so there's no need to run the
work here. Ensure we do this before flushing the CQ ring overflow, in
case the poll flush puts us into overflow mode.

Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
 fs/io_uring.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7fb51c383e51..4e760b7cd772 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7293,10 +7293,15 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 		io_wq_cancel_all(ctx->io_wq);
 
 	io_iopoll_reap_events(ctx);
+	idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx);
+
+	if (current->task_works != &task_work_exited)
+		task_work_run();
+
 	/* if we failed setting up the ctx, we might not have any rings */
 	if (ctx->rings)
 		io_cqring_overflow_flush(ctx, true);
-	idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx);
+
 	wait_for_completion(&ctx->completions[0]);
 	io_ring_ctx_free(ctx);
 }
-- 
2.26.0
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Oleg Nesterov @ 2020-04-07 16:24 UTC
To: Jens Axboe; +Cc: io-uring, viro, Peter Zijlstra

On 04/07, Jens Axboe wrote:
>
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -7293,10 +7293,15 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
>  		io_wq_cancel_all(ctx->io_wq);
>
>  	io_iopoll_reap_events(ctx);
> +	idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx);
> +
> +	if (current->task_works != &task_work_exited)
> +		task_work_run();

this is still wrong, please see the email I sent a minute ago.

Oleg.
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Oleg Nesterov @ 2020-04-07 16:38 UTC
To: Jens Axboe; +Cc: io-uring, viro, Peter Zijlstra

On 04/07, Oleg Nesterov wrote:
>
> On 04/07, Jens Axboe wrote:
> >
> > +	if (current->task_works != &task_work_exited)
> > +		task_work_run();
>
> this is still wrong, please see the email I sent a minute ago.

Let me try to explain in case it was not clear. Let's forget about
io_uring.

	void bad_work_func(struct callback_head *cb)
	{
		task_work_run();
	}

	...

	init_task_work(&my_work, bad_work_func);

	task_work_add(task, &my_work);

If the "task" above is exiting, the kernel will crash, because the 2nd
task_work_run() called by bad_work_func() will install work_exited; then
we return to the task_work_run() that was called by exit_task_work(),
which will notice ->task_works != NULL, restart the main loop, and
execute work_exited->fn == NULL.

Again, if we want to allow task_work_run() in do_exit() paths we need
something like below. But I still do not understand why we need this :/

Oleg.

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index bd9a6a91c097..c9f36d233c39 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -15,11 +15,16 @@ init_task_work(struct callback_head *twork, task_work_func_t func)
 
 int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
-void task_work_run(void);
+void __task_work_run(bool is_exit);
+
+static inline void task_work_run(void)
+{
+	__task_work_run(false);
+}
 
 static inline void exit_task_work(struct task_struct *task)
 {
-	task_work_run();
+	__task_work_run(true);
 }
 
 #endif	/* _LINUX_TASK_WORK_H */
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 825f28259a19..7b26203a583e 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -87,7 +87,7 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
  * it exits. In the latter case task_work_add() can no longer add the
  * new work after task_work_run() returns.
  */
-void task_work_run(void)
+void __task_work_run(bool is_exit)
 {
 	struct task_struct *task = current;
 	struct callback_head *work, *head, *next;
@@ -101,7 +101,7 @@ void task_work_run(void)
 			head = NULL;
 			work = READ_ONCE(task->task_works);
 			if (!work) {
-				if (task->flags & PF_EXITING)
+				if (is_exit)
 					head = &work_exited;
 				else
 					break;
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-07 20:30 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/7/20 9:38 AM, Oleg Nesterov wrote:
> If the "task" above is exiting, the kernel will crash, because the 2nd
> task_work_run() called by bad_work_func() will install work_exited; then
> we return to the task_work_run() that was called by exit_task_work(),
> which will notice ->task_works != NULL, restart the main loop, and
> execute work_exited->fn == NULL.
>
> Again, if we want to allow task_work_run() in do_exit() paths we need
> something like below. But I still do not understand why we need this :/

The crash I sent was from the exit path. I don't think we need to run
the task_work for that case, as the ordering should imply that we either
queue the work with the task (if not exiting), and it'll get run just
fine, or we queue it with another task. For both those cases, there's no
need to run the local task work.

io_uring exit removes the pending poll requests, but what if (for a
non-exit invocation) we get poll requests completing before they are
torn down? Now we have task_work queued up that won't get run, because
we are in the task_work handler for the __fput(). For this case, we
need to run the task work.

But I can't tell them apart easily, hence I don't know when it's safe
to run it. That's what I'm trying to solve by exposing task_work_exited
so I can check for that specifically. Not really a great solution as
it doesn't tell me which of the cases I'm in, but at least it tells me
if it's safe to run the task work?

-- 
Jens Axboe
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-07 20:39 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/7/20 1:30 PM, Jens Axboe wrote:
> But I can't tell them apart easily, hence I don't know when it's safe
> to run it. That's what I'm trying to solve by exposing task_work_exited
> so I can check for that specifically. Not really a great solution as
> it doesn't tell me which of the cases I'm in, but at least it tells me
> if it's safe to run the task work?

It's also possible I totally mis-analyzed it, and it really is back to
"just" being an ordering issue that I then work around by re-running
the task_work within the handler.

-- 
Jens Axboe
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Oleg Nesterov @ 2020-04-08 18:40 UTC
To: Jens Axboe; +Cc: io-uring, viro, Peter Zijlstra

Jens, I am sorry. I tried to understand your explanations but I can't :/
Just in case, I know nothing about io_uring.

However, I strongly believe that

  - the "task_work_exited" check in 4/4 can't help; the kernel
    will crash anyway if a task-work callback runs with
    current->task_works == &task_work_exited.

  - this check is not needed with the patch I sent,
    UNLESS io_ring_ctx_wait_and_kill() can be called by the exiting
    task AFTER it passes exit_task_work(), but I don't see how this
    is possible.

Let's forget this problem; let's assume that task_work_run() is always
safe. I still can not understand why io_ring_ctx_wait_and_kill() needs
to call task_work_run().

On 04/07, Jens Axboe wrote:
>
> io_uring exit removes the pending poll requests, but what if (for a
> non-exit invocation) we get poll requests completing before they are
> torn down? Now we have task_work queued up that won't get run,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

this must not be possible. If task_work is queued it will run, or we
have another bug.

> because we
> are in the task_work handler for the __fput().

this doesn't matter...

> For this case, we
> need to run the task work.

This is what I fail to understand :/

Oleg.
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-08 18:48 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/8/20 11:40 AM, Oleg Nesterov wrote:
> Jens, I am sorry. I tried to understand your explanations but I can't :/
> Just in case, I know nothing about io_uring.
>
> Let's forget this problem; let's assume that task_work_run() is always
> safe. I still can not understand why io_ring_ctx_wait_and_kill() needs
> to call task_work_run().
>
> > Now we have task_work queued up that won't get run,
>
> this must not be possible. If task_work is queued it will run, or we
> have another bug.
>
> > For this case, we
> > need to run the task work.
>
> This is what I fail to understand :/

Actually debugging this just now to attempt to get to the bottom of it.
I'm running with Peter's "put fput work at the end at task_work_run
time" patch (with a head == NULL check that was missing). I get a hang
on the wait_for_completion() on io_uring exit, and if I dump the
task_work, this is what I get:

dump_work: dump cb
cb=ffff88bff25589b8, func=ffffffff812f7310   <- io_poll_task_func()
cb=ffff88bfdd164600, func=ffffffff812925e0   <- some __fput()
cb=ffff88bfece13cb8, func=ffffffff812f7310   <- io_poll_task_func()
cb=ffff88bff78393b8, func=ffffffff812b2c40

and we hang because io_poll_task_func() got queued twice on this task
_after_ we yanked the current list of work.

I'm adding some more debug items to figure out why this is, just wanted
to let you know that I'm currently looking into this and will provide
more data when I have it.

-- 
Jens Axboe
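The dump above presumably comes from a debug helper along these lines.
This is a hypothetical reconstruction, since the actual debug patch
wasn't posted:

        static void dump_work(struct task_struct *tsk)
        {
                struct callback_head *cb;

                pr_info("dump_work: dump cb\n");
                /* debug only: walking ->next races with task_work_add() */
                for (cb = READ_ONCE(tsk->task_works); cb; cb = cb->next)
                        pr_info("cb=%px, func=%px\n", cb, cb->func);
        }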
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-08 19:06 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/8/20 11:48 AM, Jens Axboe wrote:
> On 4/8/20 11:40 AM, Oleg Nesterov wrote:
>> However, I strongly believe that
>>
>>   - the "task_work_exited" check in 4/4 can't help; the kernel
>>     will crash anyway if a task-work callback runs with
>>     current->task_works == &task_work_exited.
>>
>>   - this check is not needed with the patch I sent,
>>     UNLESS io_ring_ctx_wait_and_kill() can be called by the exiting
>>     task AFTER it passes exit_task_work(), but I don't see how this
>>     is possible.
>
> Actually debugging this just now to attempt to get to the bottom of it.
> I'm running with Peter's "put fput work at the end at task_work_run
> time" patch (with a head == NULL check that was missing). I get a hang
> on the wait_for_completion() on io_uring exit, and if I dump the
> task_work, this is what I get:
>
> dump_work: dump cb
> cb=ffff88bff25589b8, func=ffffffff812f7310   <- io_poll_task_func()
> cb=ffff88bfdd164600, func=ffffffff812925e0   <- some __fput()
> cb=ffff88bfece13cb8, func=ffffffff812f7310   <- io_poll_task_func()
> cb=ffff88bff78393b8, func=ffffffff812b2c40
>
> and we hang because io_poll_task_func() got queued twice on this task
> _after_ we yanked the current list of work.
>
> I'm adding some more debug items to figure out why this is, just wanted
> to let you know that I'm currently looking into this and will provide
> more data when I have it.

Here's some more data. I added a WARN_ON_ONCE() for task->flags &
PF_EXITING on task_work_add() success, and it triggers with the
following backtrace:

[  628.291061] RIP: 0010:__io_async_wake+0x14a/0x160
[  628.300452] Code: 8b b8 c8 00 00 00 e8 75 df 00 00 ba 01 00 00 00 48 89 ee 48 89 c7 49 89 c6 e8 82 dd de ff e9 59 ff ff ff 0f 0b e9 52 ff ff ff <0f> 0b e9 40 ff ff ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00
[  628.337930] RSP: 0018:ffffc9000c830a40 EFLAGS: 00010002
[  628.348365] RAX: 0000000000000000 RBX: ffff88bfe85fc200 RCX: ffff88bff58491b8
[  628.362610] RDX: 0000000000000001 RSI: ffff88bfe85fc2b8 RDI: ffff889fc929f000
[  628.376855] RBP: ffff88bfe85fc2b8 R08: 00000000000000c3 R09: ffffc9000c830ad0
[  628.391087] R10: 0000000000000000 R11: ffff889ff01000a0 R12: ffffc9000c830ad0
[  628.405317] R13: ffff889fb405fc00 R14: ffff889fc929f000 R15: ffff88bfe85fc200
[  628.419549] FS:  0000000000000000(0000) GS:ffff889fff5c0000(0000) knlGS:0000000000000000
[  628.435706] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  628.447178] CR2: 00007f40a3c4b8f0 CR3: 0000000002409002 CR4: 00000000003606e0
[  628.461427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  628.475675] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  628.489924] Call Trace:
[  628.494810]  <IRQ>
[  628.498835]  ? __io_queue_sqe.part.97+0x750/0x750
[  628.508229]  ? tcp_ack_update_rtt.isra.55+0x113/0x430
[  628.518320]  __wake_up_common+0x71/0x140
[  628.526152]  __wake_up_common_lock+0x7c/0xc0
[  628.534681]  sock_def_readable+0x3c/0x70
[  628.542516]  tcp_data_queue+0x2d9/0xb50
[  628.550175]  tcp_rcv_established+0x1ce/0x620
[  628.558703]  ? sk_filter_trim_cap+0x4f/0x200
[  628.567232]  tcp_v6_do_rcv+0xbe/0x3b0
[  628.574545]  tcp_v6_rcv+0xa8d/0xb20
[  628.581516]  ip6_protocol_deliver_rcu+0xb4/0x450
[  628.590736]  ip6_input_finish+0x11/0x20
[  628.598396]  ip6_input+0xa0/0xb0
[  628.604845]  ? tcp_v6_early_demux+0x90/0x140
[  628.613371]  ? tcp_v6_early_demux+0xdb/0x140
[  628.621902]  ? ip6_rcv_finish_core.isra.21+0x66/0x90
[  628.631816]  ipv6_rcv+0xc0/0xd0
[  628.638092]  __netif_receive_skb_one_core+0x50/0x70
[  628.647833]  netif_receive_skb_internal+0x2f/0xa0
[  628.657226]  napi_gro_receive+0xe7/0x150
[  628.665068]  mlx5e_handle_rx_cqe+0x8c/0x100
[  628.673416]  mlx5e_poll_rx_cq+0xef/0x95b
[  628.681252]  mlx5e_napi_poll+0xe2/0x610
[  628.688913]  net_rx_action+0x132/0x360
[  628.696403]  __do_softirq+0xd3/0x2e6
[  628.703545]  irq_exit+0xa5/0xb0
[  628.709816]  do_IRQ+0x79/0xd0
[  628.715744]  common_interrupt+0xf/0xf
[  628.723057]  </IRQ>
[  628.727256] RIP: 0010:cpuidle_enter_state+0xac/0x410

which means that we've successfully added the task_work while the
process is exiting. Maybe I can work around this by checking myself
instead of relying on task_work_add() finding work_exited on the list.

-- 
Jens Axboe
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Oleg Nesterov @ 2020-04-08 20:17 UTC
To: Jens Axboe; +Cc: io-uring, viro, Peter Zijlstra

On 04/08, Jens Axboe wrote:
>
> Here's some more data. I added a WARN_ON_ONCE() for task->flags &
> PF_EXITING on task_work_add() success, and it triggers with the
> following backtrace:
...
> which means that we've successfully added the task_work while the
> process is exiting.

but this is fine, task_work_add(task) can succeed if task->flags &
PF_EXITING.

task_work_add(task, work) should only fail if this "task" has already
passed exit_task_work(). Because if this task has already passed
exit_task_work(), nothing else can flush this work and call work->func().

Oleg.
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-08 20:25 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/8/20 1:17 PM, Oleg Nesterov wrote:
> but this is fine, task_work_add(task) can succeed if task->flags &
> PF_EXITING.
>
> task_work_add(task, work) should only fail if this "task" has already
> passed exit_task_work(). Because if this task has already passed
> exit_task_work(), nothing else can flush this work and call work->func().

So the question remains, we basically have this:

	A				B
	task_work_run(tsk)
					task_work_add(tsk, io_poll_task_func())
	process cbs
	wait_for_completion()

with the last wait needing to flush the work added on the B side, since
that isn't part of the initial list.

I don't think I can fully close that race _without_ re-running task work
there. Could do something a la:

	A				B
	mark context "dead"
	task_work_run(tsk)
					if (context dead)
						task_work_add(helper, io_poll_task_func())
					else
						task_work_add(tsk, io_poll_task_func())
	process cbs
	wait_for_completion()

which would do the trick, but I still need to flush work after having
marked the context dead.

-- 
Jens Axboe
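On the B side, the second diagram amounts to something like the sketch
below. All the names here (ctx->dead, req->fallback_work) are assumptions
for illustration, not existing fields:

        static void io_queue_poll_completion(struct io_kiocb *req)
        {
                init_task_work(&req->task_work, io_poll_task_func);

                if (READ_ONCE(req->ctx->dead) ||
                    task_work_add(req->task, &req->task_work, true)) {
                        /*
                         * Ring teardown has started, or the task can no
                         * longer run task_work: hand the completion to a
                         * helper thread so the request reference is still
                         * dropped.
                         */
                        schedule_work(&req->fallback_work);
                }
        }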
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-08 21:19 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/8/20 1:25 PM, Jens Axboe wrote:
> I don't think I can fully close that race _without_ re-running task work
> there. Could do something a la:
>
>	A				B
>	mark context "dead"
>	task_work_run(tsk)
>					if (context dead)
>						task_work_add(helper, io_poll_task_func())
>					else
>						task_work_add(tsk, io_poll_task_func())
>	process cbs
>	wait_for_completion()
>
> which would do the trick, but I still need to flush work after having
> marked the context dead.

Actually, I guess it's not enough to re-run the work; we could also have
ordering issues if we have io_poll_task_func() after the fput of the
ring. Maybe this could all work just fine if we just make the ring exit
non-blocking... Testing.

-- 
Jens Axboe
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Oleg Nesterov @ 2020-04-09 18:50 UTC
To: Jens Axboe; +Cc: io-uring, viro, Peter Zijlstra

On 04/08, Jens Axboe wrote:
>
> So the question remains, we basically have this:
>
>	A				B
>	task_work_run(tsk)
>					task_work_add(tsk, io_poll_task_func())
>	process cbs
>	wait_for_completion()
>
> with the last wait needing to flush the work added on the B side, since
> that isn't part of the initial list.

I don't understand you, even remotely :/

maybe you can write some pseudo-code?

who does wait_for_completion(), a callback? or this "tsk" after it does
task_work_run()? Who does complete()? How can this wait_for_completion()
help to flush the work added on the B side? And why do you need to do
something special to flush that work?

Could you also explain the comment above task_work_add() in
__io_async_wake()?

	If this fails, then the task is exiting.

OK,

	If that is the case, then the exit check

which exit check?

	will ultimately cancel these work items.

what does this mean? there is nothing to cancel if task_work_add() fails,
I guess this means something else...

Oleg.
* Re: [PATCH 4/4] io_uring: flush task work before waiting for ring exit
From: Jens Axboe @ 2020-04-10 0:29 UTC
To: Oleg Nesterov; +Cc: io-uring, viro, Peter Zijlstra

On 4/9/20 11:50 AM, Oleg Nesterov wrote:
> I don't understand you, even remotely :/
>
> maybe you can write some pseudo-code?

Sorry, I guess my explanation skills aren't stellar, or perhaps they
assume too much prior knowledge about this. I just wasted about a day on
debugging this because I had Peter's patch applied, and I should have
reviewed that more carefully. So a lot of the hangs were due to just
missing the running of some of the work.

> who does wait_for_completion(), a callback? or this "tsk" after it does
> task_work_run()? Who does complete()? How can this wait_for_completion()
> help to flush the work added on the B side? And why do you need to do
> something special to flush that work?

I'll try to explain this. Each io_uring operation has a request
associated with it. Each request grabs a reference to the ctx, and the
ctx can't exit before we reach zero references (obviously). When we
enter ctx teardown, we wait to hit zero references. Each ctx is
basically just a file descriptor, which means that we most often end up
waiting for zero references off the fput() path.

Some requests can spawn task_work items. If we have ordering issues on
the task_works list, we can have the fput() ordered before items that
need to complete. These items are what ultimately put the request, so
you end up in a situation where you block waiting for requests to
complete, but these requests can't complete until the current work has
completed. This is the deadlock.

Once I got rid of Peter's patch, I solved this by just making the
wait-and-free operation go async. This allows any dependent work to run
and complete just fine. Here's the patch:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.7&id=2baf397719af3a8121fcaba964470950356a4a7a

and I could perhaps make this smarter by checking if current->task_works
!= NULL, but I don't think there's much point to that. The important
part is that we complete the fput() task_work without blocking, so the
remaining work gets a chance to run.

Hope this helps! As mentioned in the commit message, we could have a
separate list of task_works for io_uring, which is what I originally
proposed. But I can also work around it like in the patch referenced
above.

-- 
Jens Axboe
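The referenced patch makes the final wait happen off a workqueue rather
than in the fput() task_work. Roughly the following shape, with the
exit_work field assumed from the description rather than copied from the
commit:

        static void io_ring_exit_work(struct work_struct *work)
        {
                struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx,
                                                       exit_work);

                /*
                 * Off the workqueue we can block: any task_work still queued
                 * on the original task is free to run and drop its ctx
                 * references, so the completion can make progress.
                 */
                wait_for_completion(&ctx->completions[0]);
                io_ring_ctx_free(ctx);
        }

        static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
        {
                /* ... cancel and reap as before, but don't block here ... */
                INIT_WORK(&ctx->exit_work, io_ring_exit_work);
                queue_work(system_wq, &ctx->exit_work);
        }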