From: Jens Axboe <[email protected]>
To: Jann Horn <[email protected]>
Cc: io-uring <[email protected]>,
Glauber Costa <[email protected]>,
Peter Zijlstra <[email protected]>,
Pavel Begunkov <[email protected]>
Subject: Re: [PATCH 7/9] io_uring: add per-task callback handler
Date: Fri, 21 Feb 2020 12:18:33 -0800 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <CAG48ez0wObFg58RyZW4eOiJP39xSxeU9VedOD2OJQqSFV7dAwg@mail.gmail.com>
On 2/21/20 12:24 PM, Jann Horn wrote:
> On Fri, Feb 21, 2020 at 6:32 PM Jens Axboe <[email protected]> wrote:
>> On 2/20/20 6:29 PM, Jann Horn wrote:
>>> On Fri, Feb 21, 2020 at 12:22 AM Jens Axboe <[email protected]> wrote:
>>>> On 2/20/20 4:12 PM, Jann Horn wrote:
>>>>> On Fri, Feb 21, 2020 at 12:00 AM Jens Axboe <[email protected]> wrote:
>>>>>> On 2/20/20 3:23 PM, Jann Horn wrote:
>>>>>>> On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <[email protected]> wrote:
>>>>>>>> On 2/20/20 3:02 PM, Jann Horn wrote:
>>>>>>>>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <[email protected]> wrote:
>>>>>>>>>> For poll requests, it's not uncommon to link a read (or write) after
>>>>>>>>>> the poll to execute immediately after the file is marked as ready.
>>>>>>>>>> Since the poll completion is called inside the waitqueue wake up handler,
>>>>>>>>>> we have to punt that linked request to async context. This slows down
>>>>>>>>>> the processing, and actually means it's faster to not use a link for this
>>>>>>>>>> use case.
>>>>> [...]
>>>>>>>>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
>>>>>>>>>> +static void io_poll_task_func(struct callback_head *cb)
>>>>>>>>>> {
>>>>>>>>>> - struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
>>>>>>>>>> + struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
>>>>>>>>>> + struct io_kiocb *nxt = NULL;
>>>>>>>>>>
>>>>>>>>> [...]
>>>>>>>>>> + io_poll_task_handler(req, &nxt);
>>>>>>>>>> + if (nxt)
>>>>>>>>>> + __io_queue_sqe(nxt, NULL);
>>>>>>>>>
>>>>>>>>> This can now get here from anywhere that calls schedule(), right?
>>>>>>>>> Which means that this might almost double the required kernel stack
>>>>>>>>> size, if one codepath exists that calls schedule() while near the
>>>>>>>>> bottom of the stack and another codepath exists that goes from here
>>>>>>>>> through the VFS and again uses a big amount of stack space? This is a
>>>>>>>>> somewhat ugly suggestion, but I wonder whether it'd make sense to
>>>>>>>>> check whether we've consumed over 25% of stack space, or something
>>>>>>>>> like that, and if so, directly punt the request.
>>>>> [...]
>>>>>>>>> Also, can we recursively hit this point? Even if __io_queue_sqe()
>>>>>>>>> doesn't *want* to block, the code it calls into might still block on a
>>>>>>>>> mutex or something like that, at which point the mutex code would call
>>>>>>>>> into schedule(), which would then again hit sched_out_update() and get
>>>>>>>>> here, right? As far as I can tell, this could cause unbounded
>>>>>>>>> recursion.
>>>>>>>>
>>>>>>>> The sched_work items are pruned before being run, so that can't happen.
>>>>>>>
>>>>>>> And is it impossible for new ones to be added in the meantime if a
>>>>>>> second poll operation completes in the background just when we're
>>>>>>> entering __io_queue_sqe()?
>>>>>>
>>>>>> True, that can happen.
>>>>>>
>>>>>> I wonder if we just prevent the recursion whether we can ignore most
>>>>>> of it. Eg never process the sched_work list if we're not at the top
>>>>>> level, so to speak.
>>>>>>
>>>>>> This should also prevent the deadlock that you mentioned with FUSE
>>>>>> in the next email that just rolled in.
>>>>>
>>>>> But there the first ->read_iter could be from outside io_uring. So you
>>>>> don't just have to worry about nesting inside an already-running uring
>>>>> work; you also have to worry about nesting inside more or less
>>>>> anything else that might be holding mutexes. So I think you'd pretty
>>>>> much have to whitelist known-safe schedule() callers, or something
>>>>> like that.
>>>>
>>>> I'll see if I can come up with something for that. Ideally any issue
>>>> with IOCB_NOWAIT set should be honored, and trylock etc should be used.
>>>
>>> Are you sure? For example, an IO operation typically copies data to
>>> userspace, which can take pagefaults. And those should be handled
>>> synchronously even with IOCB_NOWAIT set, right? And the page fault
>>> code can block on mutexes (like the mmap_sem) or even wait for a
>>> blocking filesystem operation (via file mappings) or for userspace
>>> (via userfaultfd or FUSE mappings).
>>
>> Yeah that's a good point. The more I think about it, the less I think
>> the scheduler invoked callback is going to work. We need to be able to
>> manage the context of when we are called, see later messages on the
>> task_work usage instead.
>>
>>>> But I don't think we can fully rely on that, we need something a bit
>>>> more solid...
>>>>
>>>>> Taking a step back: Do you know why this whole approach brings the
>>>>> kind of performance benefit you mentioned in the cover letter? 4x is a
>>>>> lot... Is it that expensive to take a trip through the scheduler?
>>>>> I wonder whether the performance numbers for the echo test would
>>>>> change if you commented out io_worker_spin_for_work()...
>>>>
>>>> If anything, I expect the spin removal to make it worse. There's really
>>>> no magic there on why it's faster, if you offload work to a thread that
>>>> is essentially sync, then you're going to take a huge hit in
>>>> performance. It's the difference between:
>>>>
>>>> 1) Queue work with thread, wake up thread
>>>> 2) Thread wakes, starts work, goes to sleep.
>>>
>>> If we go to sleep here, then the other side hasn't yet sent us
>>> anything, so up to this point, it shouldn't have any impact on the
>>> measured throughput, right?
>>>
>>>> 3) Data available, thread is woken, does work
>>>
>>> This is the same in the other case: Data is available, the
>>> application's thread is woken and does the work.
>>>
>>>> 4) Thread signals completion of work
>>>
>>> And this is also basically the same, except that in the worker-thread
>>> case, we have to go through the scheduler to reach userspace, while
>>> with this patch series, we can signal "work is completed" and return
>>> to userspace without an extra trip through the scheduler.
>>
>> There's a big difference between:
>>
>> - Task needs to do work, task goes to sleep on it, task is woken
>>
>> and
>>
>> - Task needs to do work, task passes work to thread. Task goes to sleep.
>> Thread wakes up, tries to do work, goes to sleep. Thread is woken,
>> does work, notifies task. Task is woken up.
>>
>> If you've ever done any sort of thread poll (userspace or otherwise),
>> this is painful, and particularly so when you're only keeping one
>> work item in flight. That kind of pipeline is rife with bubbles. If we
>> can have multiple items in flight, then we start to gain ground due to
>> the parallelism.
>>
>>> I could imagine this optimization having some performance benefit, but
>>> I'm still sceptical about it buying a 4x benefit without some more
>>> complicated reason behind it.
>>
>> I just re-ran the testing, this time on top of the current tree, where
>> instead of doing the task/sched_work_add() we simply queue for async.
>> This should be an even better case than before, since hopefully the
>> thread will not need to go to sleep to process the work, it'll complete
>> without blocking. For an echo test setup over a socket, this approach
>> yields about 45-48K requests per second. This, btw, is with the io-wq
>> spin removed. Using the callback method where the task itself does the
>> work, 175K-180K requests per second.
>
> Huh. So that's like, what, somewhere on the order of 7.6 microseconds
> or somewhere around 15000 cycles overhead for shoving a request
> completion event from worker context over to a task, assuming that
> you're running at something around 2GHz? Well, I guess that's a little
> more than twice as much time as it takes to switch from one blocked
> thread to another via eventfd (including overhead from syscall and CPU
> mitigations and stuff), so I guess it's not completely unreasonable...
This is on my laptop, running the kernel in kvm for testing. So it's not
a beefy setup:
Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
> Anyway, I'll stop nagging about this since it sounds like you're going
> to implement this in a less unorthodox way now. ^^
I'll post the updated series later today, processing off ->task_works
instead. I do agree that this is much saner than trying to entangle task
state on schedule() entry/exit, and it seems to work just as well in my
testing.
--
Jens Axboe
next prev parent reply other threads:[~2020-02-21 20:18 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-02-20 20:31 [PATCHSET 0/9] io_uring: use polled async retry Jens Axboe
2020-02-20 20:31 ` [PATCH 1/9] io_uring: consider any io_read/write -EAGAIN as final Jens Axboe
2020-02-20 20:31 ` [PATCH 2/9] io_uring: io_accept() should hold on to submit reference on retry Jens Axboe
2020-02-20 20:31 ` [PATCH 3/9] sched: move io-wq/workqueue worker sched in/out into helpers Jens Axboe
2020-02-20 20:31 ` [PATCH 4/9] task_work_run: don't take ->pi_lock unconditionally Jens Axboe
2020-02-20 20:31 ` [PATCH 5/9] kernel: abstract out task work helpers Jens Axboe
2020-02-20 21:07 ` Peter Zijlstra
2020-02-20 21:08 ` Jens Axboe
2020-02-20 20:31 ` [PATCH 6/9] sched: add a sched_work list Jens Axboe
2020-02-20 21:17 ` Peter Zijlstra
2020-02-20 21:53 ` Jens Axboe
2020-02-20 22:02 ` Jens Axboe
2020-02-20 20:31 ` [PATCH 7/9] io_uring: add per-task callback handler Jens Axboe
2020-02-20 22:02 ` Jann Horn
2020-02-20 22:14 ` Jens Axboe
2020-02-20 22:18 ` Jens Axboe
2020-02-20 22:25 ` Jann Horn
2020-02-20 22:23 ` Jens Axboe
2020-02-20 22:38 ` Jann Horn
2020-02-20 22:56 ` Jens Axboe
2020-02-20 22:58 ` Jann Horn
2020-02-20 23:02 ` Jens Axboe
2020-02-20 22:23 ` Jann Horn
2020-02-20 23:00 ` Jens Axboe
2020-02-20 23:12 ` Jann Horn
2020-02-20 23:22 ` Jens Axboe
2020-02-21 1:29 ` Jann Horn
2020-02-21 17:32 ` Jens Axboe
2020-02-21 19:24 ` Jann Horn
2020-02-21 20:18 ` Jens Axboe [this message]
2020-02-20 22:56 ` Jann Horn
2020-02-21 10:47 ` Peter Zijlstra
2020-02-21 14:49 ` Jens Axboe
2020-02-21 15:02 ` Jann Horn
2020-02-21 16:12 ` Peter Zijlstra
2020-02-21 16:23 ` Peter Zijlstra
2020-02-21 20:13 ` Jens Axboe
2020-02-21 13:51 ` Pavel Begunkov
2020-02-21 14:50 ` Jens Axboe
2020-02-21 18:30 ` Pavel Begunkov
2020-02-21 19:10 ` Jens Axboe
2020-02-21 19:22 ` Pavel Begunkov
2020-02-23 6:00 ` Jens Axboe
2020-02-23 6:26 ` Jens Axboe
2020-02-23 11:02 ` Pavel Begunkov
2020-02-23 14:49 ` Jens Axboe
2020-02-23 14:58 ` Jens Axboe
2020-02-23 15:07 ` Jens Axboe
2020-02-23 18:04 ` Pavel Begunkov
2020-02-23 18:06 ` Jens Axboe
2020-02-23 17:55 ` Pavel Begunkov
2020-02-20 20:31 ` [PATCH 8/9] io_uring: mark requests that we can do poll async in io_op_defs Jens Axboe
2020-02-20 20:31 ` [PATCH 9/9] io_uring: use poll driven retry for files that support it Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox