From: Pavel Begunkov <[email protected]>
To: Jens Axboe <[email protected]>, [email protected]
Subject: Re: [PATCHSET 0/4] Use io_wq_work_list for task_work
Date: Wed, 27 Mar 2024 18:04:22 +0000
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On 3/27/24 16:36, Jens Axboe wrote:
> On 3/27/24 7:33 AM, Pavel Begunkov wrote:
>> On 3/26/24 18:42, Jens Axboe wrote:
>>> Hi,
>>>
>>> This converts the deferred, normal, and fallback task_work to use a
>>> normal io_wq_work_list, rather than an llist.
>>>
>>> The main motivation behind this is to get rid of the need to reverse
>>> the list once it's deleted and run. I tested this basic conversion of
>>> just switching it from an llist to an io_wq_work_list with a spinlock,
>>> and I don't see any benefits from the lockless list. And for cases where
>>> we get a bursty addition of task_work, this approach is faster as it
>>> avoids the need to iterate the list upfront while reversing it.
>>
>> I'm curious how you benchmarked it including accounting of irq/softirq
>> where tw add usually happens?
>
> Performance based and profiles. I tested send zc with small packets, as
> that is task_work intensive and exhibits the bursty behavior I mentioned
> in the patch / cover letter. And normal storage IO, IRQ driven.
I assume IRQs are firing on random CPUs then, unless you configured
it otherwise. In which case it should be bouncing the cacheline
around, and that peeking at the prev req also needs to fetch it
from RAM / further caches. Unless there is enough time for TCP to
batch them.
> For send zc, we're spending about 2% of the time doing list reversal,
> and I've seen it as high as 5% in other testing. And as that test is CPU
I've seen something similar before, but for me it was overhead
shifted from __io_run_local_work() fetching requests into the
reversal touching all of them. So I'd assume there should be a
corresponding change in __io_run_local_work() total cycles
(incl. children).
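For reference, that's this pattern, a simplified sketch of the
current llist consume side (not the exact upstream code, the ts
handling and retry logic are omitted):

	struct llist_node *node = llist_del_all(&ctx->work_llist);

	/* llist_del_all() hands the nodes back in LIFO order, so
	 * we pay an extra O(n) pass over every request just to
	 * restore FIFO order before running anything. */
	node = llist_reverse_order(node);
	while (node) {
		struct io_kiocb *req = container_of(node,
					struct io_kiocb,
					io_task_work.node);

		node = node->next;
		req->io_task_work.func(req, ts);
	}

Every node touched in that reverse pass is a potential cache miss
before the work loop even starts.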
> bound, performance is up about 2% as well.
Did you by any chance count how many items were in the list?
An average or so would do.
> With the patches, task work adding accounts for about 0.25% of the
> cycles, before it's about 0.66%.
i.e. the spinlock version is faster. How come? It's the same
cmpxchg inside the spinlock, often with cache misses, but with irq
on/off on top. The only diff I can remember is that peek into the
prev req.
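To put the comparison side by side, a rough sketch of the two
producer paths (names here are illustrative, not the actual
patchset code, and io_task_work.node is assumed to be the right
node type in each case):

	/* lockless variant: a single cmpxchg loop, no irq toggling */
	static void tw_add_llist(struct io_ring_ctx *ctx,
				 struct io_kiocb *req)
	{
		llist_add(&req->io_task_work.node, &ctx->work_llist);
	}

	/* locked variant: an atomic of the same kind inside the
	 * lock, plus the irq disable/enable on top */
	static void tw_add_locked(struct io_ring_ctx *ctx,
				  struct io_kiocb *req)
	{
		unsigned long flags;

		spin_lock_irqsave(&ctx->work_lock, flags);
		wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
		spin_unlock_irqrestore(&ctx->work_lock, flags);
	}

Both do an atomic RMW on a possibly remote cacheline, which is why
the numbers surprise me.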
> We're spending a bit more time in __io_run_local_work(), but I think
> that's deceptive as we have to disable/enable interrupts now. If an
> interrupt triggers on the unlock, that time tends to be attributed there
> in terms of cycles.
Hmm, I think if the run_local_work runtime doesn't change you'd
statistically get the same number of interrupt "items" hitting it,
they would just be condensed more into the irq-off section. Or are
you accounting for some irq delivery / hw differences?
>> One known problem with the current list approach I mentioned several
>> times before is that it peeks at the previous queued tw to count them.
>> It's not nice, but that can be easily done with cmpxchg double. I
>> wonder how much of an issue is that.
>
> That's more of a wart than a real issue though, but this approach
Assuming tw add executes on random CPUs, that would be an
additional fetch on every tw add; I wouldn't disregard it right
away.
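For the record, that's roughly what the counting in the current
io_req_local_work_add() looks like (simplified sketch from memory,
the lazy-wake handling is dropped):

	struct llist_node *head = READ_ONCE(ctx->work_llist.first);
	unsigned nr_tw, nr_tw_prev;

	do {
		nr_tw_prev = 0;
		if (head) {
			struct io_kiocb *first = container_of(head,
						struct io_kiocb,
						io_task_work.node);

			/* the peek in question: one extra fetch of
			 * a possibly cold remote cacheline on every
			 * single tw add */
			nr_tw_prev = READ_ONCE(first->nr_tw);
		}
		nr_tw = nr_tw_prev + 1;
		req->nr_tw = nr_tw;
		req->io_task_work.node.next = head;
	} while (!try_cmpxchg(&ctx->work_llist.first, &head,
			      &req->io_task_work.node));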
> obviously doesn't do that. And then we can drop the rcu section around
> adding local task_work. Not a huge deal, but still nice.
I don't see how. After you queue the req it might be immediately
executed, dropping the ctx ref that was previously protecting ctx.
The rcu section is what protects ctx after that point.
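I.e., a hand-wavy sketch of the ordering problem (details omitted,
the exact wakeup path may differ):

	rcu_read_lock();
	llist_add(&req->io_task_work.node, &ctx->work_llist);
	/* from here on the req may already have run and the ctx
	 * ref it held may be gone; the rcu read section is what
	 * keeps ctx itself alive for the wakeup below */
	wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
	rcu_read_unlock();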
>>> And this is less code and simpler, so I'd prefer to go that route.
>>
>> I'm not sure it's less code, if you return optimisations that I
>> believe were killed, see comments to patch 2, it might turn out to
>> be even bulkier and not that simpler.
>
> It's still considerably less:
>
> 3 files changed, 59 insertions(+), 84 deletions(-)
>
> though that isn't conclusive by itself, as e.g. io_llist_xchg() goes away
> which has some comments. But I do think the resulting code is simpler
> and more straight forward.
--
Pavel Begunkov