From: Dylan Yudaken <[email protected]>
To: "[email protected]" <[email protected]>,
	"[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: [GIT PULL] io_uring updates for 6.1-rc1
Date: Sat, 8 Oct 2022 18:12:00 +0000
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>

On Fri, 2022-10-07 at 10:30 -0600, Jens Axboe wrote:
> On 10/7/22 10:07 AM, Linus Torvalds wrote:
> > On Mon, Oct 3, 2022 at 7:18 AM Jens Axboe <[email protected]> wrote:
> > > 
> > > - Series that adds support for more directly managed task_work
> > >   running. This is beneficial for real world applications that
> > >   end up issuing lots of system calls as part of handling work.
> > 
> > While I agree with the concept, I'm not convinced this is done the
> > right way.
> > 
> > It looks very much like it was done in a "this is perfect for
> > benchmarks" mode.
> > 
> > I think you should consider it much more similar to plugging (both
> > network and disk IO). In particular, I think that you'll find that
> > once you have random events like memory allocations blocking in
> > other places, you actually will want to unplug early, so that you
> > don't end up sleeping with unstarted work to do.
> > 
> > And the reason I say this code looks like "made for benchmarks" is
> > that you'll basically never see those kinds of issues when you just
> > run some benchmark that is tuned for this.  For the benchmark, you
> > just want the user to control exactly when to start the load,
> > because you control pretty much everything.
> > 
> > And then real life happens, and you have situations where you get
> > those odd hiccups from other things going on, and you wonder "why
> > was no IO taking place?"
> > 
> > Maybe I'm misreading the code, but it looks to me that the deferred
> > io_uring work is basically deferred completely synchronously.
> > 
> > I've pulled this, and maybe I'm misreading it. Or maybe there's
> > some reason why io_uring is completely different from all the other
> > situations where we've ever wanted to do this kind of plugging for
> > batching, but I really doubt that io_uring is magically different...
> 
> I'll try and address these separately.
> 
> It's interesting that you suspect it's made for benchmarks. In
> practice, this came about from very much the opposite angle -
> benchmarks were fine, but production code for Thrift was showing
> cases where the io_uring backend didn't perform as well as the epoll
> one. Dylan did a lot of debugging and head scratching here, because
> it wasn't one of those "let's profile and see what it is - oh yep,
> this needs to be improved" kind of cases. Benchmarks are easy because
> they are very much targeted - if you write something that tries to
> behave like Thrift, then it too will perform fine. One of the key
> differences is that production code actually does a bunch of things
> when processing a request, issuing other system calls. A benchmark
> does not.
> 
> When the backend issues a receive and no data is available, an
> internal poll trigger is used to know when we can actually receive
> data. When that triggers, task_work is queued to do the actual
> receive. That's considered the fast part, because it's basically
> just copying the data.

Just want to point out that "just copying the data" is not actually
all that fast for some workloads (e.g. when a burst of very large
socket receives arrives).

This actually compounds the problem: while processing the big
receives, more packets might arrive that also need processing, which
delays things further.

This was actually the scenario that was breaking: socket sends were
waiting to be pushed out, but got queued behind a burst of receives,
which added noticeable response latency at high load. But since
io_uring has a pretty good idea of when the user will want to process
completions, it made sense to me to defer the async work until just
before that point, rather than doing bits of it piecemeal.
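
For reference, here is a minimal sketch of what opting in looks like
from userspace (liburing; the flag names are the ones merged for 6.1,
while the socket fd, buffer size, and queue depth are just
illustration, with error handling elided):

	#include <liburing.h>

	static char buf[64 * 1024];

	/* sockfd is assumed to be a connected socket */
	void recv_deferred(int sockfd)
	{
		struct io_uring ring;
		struct io_uring_params p = { 0 };
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;

		/* DEFER_TASKRUN requires a single submitter. With it,
		 * completion-side task_work is no longer flushed on
		 * every kernel/user transition; it is batched until
		 * the task explicitly waits for completions. */
		p.flags = IORING_SETUP_SINGLE_ISSUER |
			  IORING_SETUP_DEFER_TASKRUN;
		io_uring_queue_init_params(64, &ring, &p);

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_recv(sqe, sockfd, buf, sizeof(buf), 0);
		io_uring_submit(&ring);

		/* Unrelated syscalls made while handling a request no
		 * longer drag this work forward; it all runs here,
		 * right before completions are reaped. */
		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
		io_uring_queue_exit(&ring);
	}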

> task_work is tied to exiting to userspace, which means that basically
> anything the backend does that isn't strict CPU processing will end
> up flushing the task_work. This really hurts efficiency at certain
> rates of load. The problem was worse when task_work actually
> triggered a forced kernel enter/exit via TWA_SIGNAL doing an IPI to
> reschedule if it was running in userspace, but it was still an issue
> even with just deferring it to be run whenever a transition happened
> anyway.
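
Worth noting: the intermediate step described here shipped separately
as IORING_SETUP_COOP_TASKRUN in 5.19 - in terms of the sketch above,
just a different flag:

	/* Cooperative mode: no forced IPI/reschedule to run the
	 * task_work, but it still runs on the next kernel/user
	 * transition. */
	p.flags = IORING_SETUP_COOP_TASKRUN;

That removes the IPI cost, but the work is still flushed on every
transition, which is the part DEFER_TASKRUN then addressed.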
> 
> I do agree that you have a point on it being somewhat similar to
> plugging in the sense that you would probably want this flushed if
> you get scheduled out anyway. For production loads where you end up
> being resource constrained (not a rare occurrence...), we want them
> run before putting the task to sleep. We'll look into making a tweak
> like that; it seems like the right thing to do.

I hadn't considered this strongly enough, but I think it makes total
sense.

The deferred work is mainly a latency win, and if the task is put to
sleep that win is gone anyway, so it may as well process the IO.
Especially since there might be write/send ops where the CQE latency
is less important than just getting the IO done. My assumption was
that deferred work for sends is rare enough (it would require the send
buffer to be full and poll to be armed) that I hadn't noticed it.

I'll take a look into getting a patch for this done.
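
Roughly the shape I have in mind (pseudocode only - the hook point and
the names here are illustrative, not lifted from any real tree):

	/* If the task is about to schedule out with deferred
	 * completions still pending, run them first: once we sleep,
	 * the batching win is gone anyway, and queued sends/receives
	 * should not sit idle behind the nap. */
	static void io_flush_deferred_before_sleep(struct io_ring_ctx *ctx)
	{
		if (!llist_empty(&ctx->work_llist))
			io_run_local_work(ctx);
	}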

> 
> I'm sure Dylan can chime in on the above too once he's back, to add
> some more data to this change.
> 

Thanks,
Dylan
