From: Jens Axboe <[email protected]>
To: Peter Zijlstra <[email protected]>, Jann Horn <[email protected]>
Cc: io-uring <[email protected]>,
Glauber Costa <[email protected]>,
Pavel Begunkov <[email protected]>
Subject: Re: [PATCH 7/9] io_uring: add per-task callback handler
Date: Fri, 21 Feb 2020 06:49:16 -0800 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 2/21/20 3:47 AM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2020 at 11:02:16PM +0100, Jann Horn wrote:
>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <[email protected]> wrote:
>>>
>>> For poll requests, it's not uncommon to link a read (or write) after
>>> the poll to execute immediately after the file is marked as ready.
>>> Since the poll completion is called inside the waitqueue wake up handler,
>>> we have to punt that linked request to async context. This slows down
>>> the processing, and actually means it's faster to not use a link for this
>>> use case.
>>>
>>> We also run into problems if the completion_lock is contended, as we're
>>> doing a different lock ordering than the issue side is. Hence we have
>>> to do trylock for completion, and if that fails, go async. Poll removal
>>> needs to go async as well, for the same reason.
>>>
>>> eventfd notification needs special case as well, to avoid stack blowing
>>> recursion or deadlocks.
>>>
>>> These are all deficiencies that were inherited from the aio poll
>>> implementation, but I think we can do better. When a poll completes,
>>> simply queue it up in the task poll list. When the task completes the
>>> list, we can run dependent links inline as well. This means we never
>>> have to go async, and we can remove a bunch of code associated with
>>> that, and optimizations to try and make that run faster. The diffstat
>>> speaks for itself.
>> [...]
>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
>>> +static void io_poll_task_func(struct callback_head *cb)
>>> {
>>> - struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
>>> + struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
>>> + struct io_kiocb *nxt = NULL;
>>>
>> [...]
>>> + io_poll_task_handler(req, &nxt);
>>> + if (nxt)
>>> + __io_queue_sqe(nxt, NULL);
>>
>> This can now get here from anywhere that calls schedule(), right?
>> Which means that this might almost double the required kernel stack
>> size, if one codepath exists that calls schedule() while near the
>> bottom of the stack and another codepath exists that goes from here
>> through the VFS and again uses a big amount of stack space? This is a
>> somewhat ugly suggestion, but I wonder whether it'd make sense to
>> check whether we've consumed over 25% of stack space, or something
>> like that, and if so, directly punt the request.
>
> I'm still completely confused as to how io_uring works, and consequently
> the ramifications of all this.
>
> But I thought to understand that these sched_work things were only
> queued on tasks that were stuck waiting on POLL (or its io_uring
> equivalent). Earlier patches were explicitly running things from
> io_cqring_wait(), which might have given me this impression.
That is correct.
> The above seems to suggest this is not the case. Which then does indeed
> lead to all the worries expressed by Jann. All sorts of nasty nesting is
> possible with this.
>
> Can someone please spell this out for me?
Let me try with an example - the tl;dr is that a task wants to eg read
from a socket, so it issues an io_uring recv(). We always do these
non-blocking; if there's no data there, the task gets -EAGAIN on the
attempt. What would happen in the previous code is that the task would
then offload the recv() to a worker thread, and the worker thread would
block waiting on the receive. This is sub-optimal, in that it both
requires a thread offload and keeps a thread alive waiting for that
data to come in.
This, instead, arms a poll handler for the task. When we get notified of
data availability, we queue a work item that will then perform the
recv(). This is what is offloaded to the sched_work list currently.
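To make that sequence concrete, here's a userspace analogue of the same
idea using plain socket calls: try the receive non-blocking, and on
-EAGAIN wait for readiness instead of parking a thread in a blocking
recv(). The helper name and shape are illustrative only, not io_uring
internals:

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Non-blocking attempt first; on EAGAIN, "arm the poll handler" by
 * waiting for POLLIN, then perform the recv() inline once readiness
 * fires. Hypothetical helper, mirroring the flow described above.
 */
static ssize_t recv_poll_retry(int fd, void *buf, size_t len)
{
	ssize_t ret = recv(fd, buf, len, MSG_DONTWAIT);

	if (ret >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
		return ret;	/* completed (or hard failure) on first try */

	/* No data yet: wait for readiness rather than blocking in recv() */
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	if (poll(&pfd, 1, -1) < 0)
		return -1;

	/* Readiness fired: the retried recv() now completes inline */
	return recv(fd, buf, len, MSG_DONTWAIT);
}
```

The io_uring version does the equivalent of the poll() step inside the
waitqueue machinery, so no task or worker thread sits blocked in between.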
> Afaict the req->tsk=current thing is set for whomever happens to run
> io_poll_add_prep(), which is either a sys_io_uring_enter() or an io-wq
> thread afaict.
>
> But I'm then unsure what happens to that thread afterwards.
>
> Jens, what exactly is the benefit of running this on every random
> schedule() vs in io_cqring_wait() ? Or even, since io_cqring_wait() is
> the very last thing the syscall does, task_work.
I took a step back and I think we can just use the task work, which
makes this a lot less complicated in terms of locking and schedule
state. Ran some quick testing with the below and it works for me.
I'm going to re-spin based on this and just dump the sched_work
addition.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 81aa3959f326..413ac86d7882 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3529,7 +3529,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
* the exit check will ultimately cancel these work items. Hence we
* don't need to check here and handle it specifically.
*/
- sched_work_add(tsk, &req->sched_work);
+ task_work_add(tsk, &req->sched_work, true);
wake_up_process(tsk);
return 1;
}
@@ -5367,9 +5367,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
do {
if (io_cqring_events(ctx, false) >= min_events)
return 0;
- if (!current->sched_work)
+ if (!current->task_works)
break;
- sched_work_run();
+ task_work_run();
} while (1);
if (sig) {
@@ -5392,6 +5392,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
TASK_INTERRUPTIBLE);
if (io_should_wake(&iowq, false))
break;
+ if (current->task_works) {
+ task_work_run();
+ if (io_should_wake(&iowq, false))
+ break;
+ continue;
+ }
schedule();
if (signal_pending(current)) {
ret = -EINTR;
@@ -6611,7 +6617,7 @@ static void __io_uring_cancel_task(struct task_struct *tsk,
{
struct callback_head *head;
- while ((head = sched_work_cancel(tsk, func)) != NULL) {
+ while ((head = task_work_cancel(tsk, func)) != NULL) {
struct io_kiocb *req;
req = container_of(head, struct io_kiocb, sched_work);
@@ -6886,7 +6892,7 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
struct io_kiocb *req;
hlist_for_each_entry(req, list, hash_node)
- seq_printf(m, " req=%lx, op=%d, tsk list=%d\n", (long) req, req->opcode, req->task->sched_work != NULL);
+ seq_printf(m, " req=%lx, op=%d, tsk list=%d\n", (long) req, req->opcode, req->task->task_works != NULL);
}
spin_unlock_irq(&ctx->completion_lock);
mutex_unlock(&ctx->uring_lock);
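For anyone following along without the kernel sources handy, the
task_work add/run pair the diff switches to boils down to a per-task
callback list that the task itself drains. A minimal single-threaded
sketch of that pattern (not the real kernel implementation, which also
handles notification and locking):

```c
#include <stddef.h>

/* Same shape as the kernel's struct callback_head */
struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *);
};

/* Stand-in for current->task_works; one list per task in the kernel */
static struct callback_head *task_works;

/* Queue a callback on the (pretend) task's work list */
static void task_work_add(struct callback_head *work)
{
	work->next = task_works;
	task_works = work;
}

/* Drain the list: the task runs its own pending callbacks inline */
static void task_work_run(void)
{
	struct callback_head *head = task_works;

	task_works = NULL;
	while (head) {
		struct callback_head *next = head->next;

		head->func(head);
		head = next;
	}
}
```

The point of the wait-loop hunk above is exactly this drain: run any
queued completions before going to sleep, so a wakeup queued just before
schedule() is never left sitting on the list.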
--
Jens Axboe