From: Pavel Begunkov <[email protected]>
To: Jens Axboe <[email protected]>, [email protected]
Subject: Re: [PATCHSET 0/4] Use io_wq_work_list for task_work
Date: Wed, 27 Mar 2024 18:04:22 +0000
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>

On 3/27/24 16:36, Jens Axboe wrote:
> On 3/27/24 7:33 AM, Pavel Begunkov wrote:
>> On 3/26/24 18:42, Jens Axboe wrote:
>>> Hi,
>>>
>>> This converts the deferred, normal, and fallback task_work to use a
>>> normal io_wq_work_list, rather than an llist.
>>>
>>> The main motivation behind this is to get rid of the need to reverse
>>> the list once it's deleted and run. I tested this basic conversion of
>>> just switching it from an llist to an io_wq_work_list with a spinlock,
>>> and I don't see any benefits from the lockless list. And for cases where
>>> we get a bursty addition of task_work, this approach is faster as it
>>> avoids the need to iterate the list upfront while reversing it.
>>
>> I'm curious how you benchmarked it including accounting of irq/softirq
>> where tw add usually happens?
> 
> Performance runs and profiles. I tested send zc with small packets, as
> that is task_work intensive and exhibits the bursty behavior I mentioned
> in the patch / cover letter. And normal storage IO, IRQ driven.

I assume IRQs are firing on random CPUs then, unless you configured
the affinity. In that case the add should be bouncing the cacheline
around, and the peek at the prev request also has to fetch it from
RAM / farther caches. Unless there is enough time for TCP to batch them.
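
To make it concrete, the peek I mean is roughly this (paraphrasing
the current io_req_local_work_add() from memory, so not exact):

    struct llist_node *head;
    unsigned nr_tw, nr_tw_prev;

    do {
        head = READ_ONCE(ctx->work_llist.first);
        nr_tw_prev = 0;
        if (head) {
            struct io_kiocb *first;

            first = container_of(head, struct io_kiocb,
                                 io_task_work.node);
            /* pulls in the previous request's cacheline, likely
             * cold when adding from another CPU or from irq */
            nr_tw_prev = READ_ONCE(first->nr_tw);
        }
        nr_tw = nr_tw_prev + 1;
        req->nr_tw = nr_tw;
        req->io_task_work.node.next = head;
    } while (!try_cmpxchg(&ctx->work_llist.first, &head,
                          &req->io_task_work.node));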

> For send zc, we're spending about 2% of the time doing list reversal,
> and I've seen it as high as 5% in other testing. And as that test is CPU

I've seen similar before, but for me it was overhead shifted from
__io_run_local_work() fetching requests into the reversal touching
all of them. In that case I'd expect a corresponding change in
__io_run_local_work()'s total cycles (incl. children).
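
I.e. roughly this bit (simplified, ignoring the fake-node /
io_llist_xchg() details):

    struct llist_node *node = llist_del_all(&ctx->work_llist);

    /* entries were pushed LIFO, restore submission order; this
     * walk touches every request up front, which is where the
     * reversal cycles (and cache misses) end up */
    node = llist_reverse_order(node);
    while (node) {
        struct llist_node *next = node->next;
        struct io_kiocb *req = container_of(node, struct io_kiocb,
                                            io_task_work.node);

        req->io_task_work.func(req, ts);
        node = next;
    }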

> bound, performance is up about 2% as well.

Did you count by any chance how many items there were in the
list? An average or so.

> With the patches, task work adding accounts for about 0.25% of the
> cycles, before it's about 0.66%.

I.e. the spinlock is faster. How come? The spinlock does the same
cmpxchg, often with cache misses, but with irq on/off on top. The
only difference I can remember is that peek into the prev req.
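
For comparison, the spinlocked add in your series is something like
this (field names approximated from the patches, treat as a sketch):

    unsigned long flags;

    spin_lock_irqsave(&ctx->work_lock, flags);
    wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
    ctx->work_items++;
    spin_unlock_irqrestore(&ctx->work_lock, flags);

Same contended cacheline for the locked cmpxchg, plus the irq
toggling, but no dereference of the previous request.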

> We're spending a bit more time in __io_run_local_work(), but I think
> that's deceptive as we have to disable/enable interrupts now. If an
> interrupt triggers on the unlock, that time tends to be attributed there
> in terms of cycles.

Hmm, I think if the run_local_work runtime doesn't change you'd
statistically get the same number of interrupt "items" hitting it,
they would just be condensed more into the irq-off section. Or are
you accounting for some irq delivery / hw differences?

>> One known problem with the current list approach I mentioned several
>> times before is that it peeks at the previous queued tw to count them.
>> It's not nice, but that can be easily done with cmpxchg double. I
>> wonder how much of an issue that is.
> 
> That's more of a wart than a real issue though, but this approach

Assuming tw add executes on random CPUs, that would be an additional
fetch on every tw add, so I wouldn't disregard it right away.
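
To illustrate the cmpxchg double idea: keep the pending count next
to the head and update both in a single 16-byte cmpxchg, so the add
side never dereferences the old head. Untested sketch with made-up
field/type names, assuming the arch provides try_cmpxchg128 and the
head can be kept 16-byte aligned:

    struct tw_head {
        struct llist_node   *first;
        unsigned long       nr;
    } __aligned(16);

    union tw_head_u {
        struct tw_head  h;
        u128            full;
    };

    union tw_head_u old, new;

    old.h.first = READ_ONCE(ctx->work_head.h.first);
    old.h.nr = READ_ONCE(ctx->work_head.h.nr);
    do {
        req->io_task_work.node.next = old.h.first;
        new.h.first = &req->io_task_work.node;
        new.h.nr = old.h.nr + 1;    /* count without touching prev */
    } while (!try_cmpxchg128(&ctx->work_head.full, &old.full, new.full));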

> obviously doesn't do that. And then we can drop the rcu section around
> adding local task_work. Not a huge deal, but still nice.

I don't see how. After you queue the req it might be executed
immediately, dropping the ctx ref that was previously protecting
ctx. The rcu section is what protects ctx after that point.
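
The window is roughly here (sketching with the spinlock variant,
the llist version has the same issue):

    spin_lock_irqsave(&ctx->work_lock, flags);
    wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
    spin_unlock_irqrestore(&ctx->work_lock, flags);

    /* From this point the submitter task can find the req, run the
     * tw and drop the ctx ref we were relying on, so the accesses
     * below can race with ctx going away unless something else pins
     * it; that's what the rcu read section was for. */
    if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
        atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
    wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);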

>>> And this is less code and simpler, so I'd prefer to go that route.
>>
>> I'm not sure it's less code. If you bring back the optimisations
>> that I believe were killed (see comments to patch 2), it might
>> turn out to be even bulkier and not that much simpler.
> 
> It's still considerably less:
> 
>   3 files changed, 59 insertions(+), 84 deletions(-)
> 
> though that isn't conclusive by itself, as e.g. io_llist_xchg() goes away
> which has some comments. But I do think the resulting code is simpler
> and more straight forward.

-- 
Pavel Begunkov

Thread overview: 14+ messages
2024-03-26 18:42 [PATCHSET 0/4] Use io_wq_work_list for task_work Jens Axboe
2024-03-26 18:42 ` [PATCH 1/4] io_uring: use the right type for work_llist empty check Jens Axboe
2024-03-26 18:42 ` [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list Jens Axboe
2024-03-27 13:24   ` Pavel Begunkov
2024-03-27 15:45     ` Jens Axboe
2024-03-27 16:37       ` Jens Axboe
2024-03-27 17:28         ` Pavel Begunkov
2024-03-27 17:34           ` Jens Axboe
2024-03-26 18:42 ` [PATCH 3/4] io_uring: switch fallback work to io_wq_work_list Jens Axboe
2024-03-26 18:42 ` [PATCH 4/4] io_uring: switch normal task_work " Jens Axboe
2024-03-27 13:33 ` [PATCHSET 0/4] Use io_wq_work_list for task_work Pavel Begunkov
2024-03-27 16:36   ` Jens Axboe
2024-03-27 17:05     ` Jens Axboe
2024-03-27 18:04     ` Pavel Begunkov [this message]
