From: Jens Axboe <[email protected]>
To: Jann Horn <[email protected]>
Cc: io-uring <[email protected]>,
Glauber Costa <[email protected]>,
Peter Zijlstra <[email protected]>,
Pavel Begunkov <[email protected]>
Subject: Re: [PATCH 5/7] io_uring: add per-task callback handler
Date: Sat, 22 Feb 2020 07:09:31 -0800 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 2/22/20 7:41 AM, Jens Axboe wrote:
> On 2/21/20 4:00 PM, Jann Horn wrote:
>> On Fri, Feb 21, 2020 at 11:56 PM Jann Horn <[email protected]> wrote:
>>> On Fri, Feb 21, 2020 at 10:46 PM Jens Axboe <[email protected]> wrote:
>>>> For poll requests, it's not uncommon to link a read (or write) after
>>>> the poll to execute immediately after the file is marked as ready.
>>>> Since the poll completion is called inside the waitqueue wake up handler,
>>>> we have to punt that linked request to async context. This slows down
>>>> the processing, and actually means it's faster to not use a link for this
>>>> use case.
>>>>
>>>> We also run into problems if the completion_lock is contended, as we're
>>>> doing a different lock ordering than the issue side is. Hence we have
>>>> to do trylock for completion, and if that fails, go async. Poll removal
>>>> needs to go async as well, for the same reason.
>>>>
>>>> eventfd notification needs special case as well, to avoid stack blowing
>>>> recursion or deadlocks.
>>>>
>>>> These are all deficiencies that were inherited from the aio poll
>>>> implementation, but I think we can do better. When a poll completes,
>>>> simply queue it up in the task poll list. When the task completes the
>>>> list, we can run dependent links inline as well. This means we never
>>>> have to go async, and we can remove a bunch of code associated with
>>>> that, and optimizations to try and make that run faster. The diffstat
>>>> speaks for itself.
>>> [...]
>>>> @@ -3637,8 +3587,8 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>>>> {
>>>> struct io_kiocb *req = wait->private;
>>>> struct io_poll_iocb *poll = &req->poll;
>>>> - struct io_ring_ctx *ctx = req->ctx;
>>>> __poll_t mask = key_to_poll(key);
>>>> + struct task_struct *tsk;
>>>>
>>>> /* for instances that support it check for an event match first: */
>>>> if (mask && !(mask & poll->events))
>>>> @@ -3646,46 +3596,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>>>>
>>>> list_del_init(&poll->wait.entry);
>>>>
>>> [...]
>>>> + tsk = req->task;
>>>> + req->result = mask;
>>>> + init_task_work(&req->task_work, io_poll_task_func);
>>>> + task_work_add(tsk, &req->task_work, true);
>>>> + wake_up_process(tsk);
>>>> return 1;
>>>> }
>>>
>>> Let's say userspace has some code like this:
>>>
>>> [prepare two uring requests: one POLL and a RECVMSG linked behind it]
>>> // submit requests
>>> io_uring_enter(uring_fd, 2, 0, 0, NULL, 0);
>>> // wait for something to happen, either a completion event from uring
>>> or input from stdin
>>> struct pollfd fds[] = {
>>> { .fd = 0, .events = POLLIN },
>>> { .fd = uring_fd, .events = POLLIN }
>>> };
>>> while (1) {
>>> poll(fds, 2, -1);
>>> if (fds[0].revents) {
>>> [read stuff from stdin]
>>> }
>>> if (fds[1].revents) {
>>> [fetch completions from shared memory]
>>> }
>>> }
>>>
>>> If userspace has reached the poll() by the time the uring POLL op
>>> completes, I think you'll wake up the do_poll() loop while it is in
>>> poll_schedule_timeout(); then it will do another iteration, see that
>>> no signals are pending and none of the polled files have become ready,
>>> and go to sleep again. So things are stuck until the io_uring fd
>>> signals that it is ready.
>>>
>>> The options I see are:
>>>
>>> - Tell the kernel to go through signal delivery code, which I think
>>> will cause the pending syscall to actually abort and return to
>>> userspace (which I think is kinda gross). You could maybe add a
>>> special case where that doesn't happen if the task is already in
>>> io_uring_enter() and waiting for CQ events.
>>> - Forbid eventfd notifications, ensure that the ring's ->poll handler
>>> reports POLLIN when work items are pending for userspace, and then
>>> rely on the fact that those work items will be picked up when
>>> returning from the poll syscall. Unfortunately, this gets a bit messy
>>> when you're dealing with multiple threads that access the same ring,
>>> since then you'd have to ensure that *any* thread can pick up this
>>> work, and that that doesn't mismatch how the uring instance is shared
>>> between threads; but you could probably engineer your way around this.
>>> For userspace, this whole thing just means "POLLIN may be spurious".
>>> - Like the previous item, except you tell userspace that if it gets
>>> POLLIN (or some special poll status like POLLRDBAND) and sees nothing
>>> in the completion queue, it should call io_uring_enter() to process
>>> the work. This addresses the submitter-is-not-completion-reaper
>>> scenario without having to add some weird version of task_work that
>>> will be processed by the first thread, but you'd get some extra
>>> syscalls.
>>
>> ... or I suppose you could punt to worker context if anyone uses the
>> ring's ->poll handler or has an eventfd registered, if you don't
>> expect high-performance users to do those things.
>
> Good points, thanks Jann. We have some precedence in the area of
> requiring the application to enter the kernel, that's how the CQ ring
> overflow is handled as well. For liburing users, that'd be trivial to
> hide, for the raw interface that's not necessarily the case. I'd hate to
> make the feature opt-in rather than just generally available.
>
> I'll try and play with some ideas in this area and see how it falls out.
I wonder if the below is enough - it'll trigger a poll and eventfd
wakeup, if we add work. If current->task_works != NULL, we could also
set POLLPRI to make it explicit why this happened, that seems like a
better fit than POLLRDBAND.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6f7ae3eab21f..dba2f0e1ae6a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3501,6 +3501,18 @@ static void io_async_queue_proc(struct file *file, struct wait_queue_head *head,
__io_queue_proc(&pt->req->apoll->poll, pt, head);
}
+static void io_task_work_notify(struct io_ring_ctx *ctx,
+ struct task_struct *tsk)
+{
+ if (wq_has_sleeper(&ctx->cq_wait)) {
+ wake_up_interruptible(&ctx->cq_wait);
+ kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
+ }
+ if (ctx->cq_ev_fd)
+ eventfd_signal(ctx->cq_ev_fd, 0);
+ wake_up_process(tsk);
+}
+
static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
__poll_t mask, task_work_func_t func)
{
@@ -3523,7 +3535,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
* don't need to check here and handle it specifically.
*/
task_work_add(tsk, &req->task_work, true);
- wake_up_process(tsk);
+ io_task_work_notify(req->ctx, tsk);
return 1;
}
@@ -6488,7 +6500,7 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
if (READ_ONCE(ctx->rings->sq.tail) - ctx->cached_sq_head !=
ctx->rings->sq_ring_entries)
mask |= EPOLLOUT | EPOLLWRNORM;
- if (io_cqring_events(ctx, false))
+ if (io_cqring_events(ctx, false) || current->task_works != NULL)
mask |= EPOLLIN | EPOLLRDNORM;
return mask;
--
Jens Axboe
next prev parent reply other threads:[~2020-02-22 15:09 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-02-21 21:45 [PATCHSET v2 0/7] io_uring: use polled async retry Jens Axboe
2020-02-21 21:46 ` [PATCH 1/7] io_uring: consider any io_read/write -EAGAIN as final Jens Axboe
2020-02-21 21:46 ` [PATCH 2/7] io_uring: io_accept() should hold on to submit reference on retry Jens Axboe
2020-02-21 21:46 ` [PATCH 3/7] task_work_run: don't take ->pi_lock unconditionally Jens Axboe
2020-02-21 21:46 ` [PATCH 4/7] io_uring: store io_kiocb in wait->private Jens Axboe
2020-02-21 21:46 ` [PATCH 5/7] io_uring: add per-task callback handler Jens Axboe
2020-02-21 22:56 ` Jann Horn
2020-02-21 23:00 ` Jann Horn
2020-02-22 14:41 ` Jens Axboe
2020-02-22 15:09 ` Jens Axboe [this message]
2020-02-22 18:49 ` Jens Axboe
2020-02-22 18:54 ` Jens Axboe
2020-02-23 3:30 ` Jens Axboe
2020-02-21 21:46 ` [PATCH 6/7] io_uring: mark requests that we can do poll async in io_op_defs Jens Axboe
2020-02-21 21:46 ` [PATCH 7/7] io_uring: use poll driven retry for files that support it Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox