* [GIT PULL] io_uring updates for 6.1-rc1
@ 2022-10-03 14:18 Jens Axboe
2022-10-07 16:07 ` Linus Torvalds
2022-10-07 17:00 ` pr-tracker-bot
0 siblings, 2 replies; 5+ messages in thread
From: Jens Axboe @ 2022-10-03 14:18 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring
Hi Linus,
Here are the io_uring updates and fixes that should go into this
release:
- Series that adds supported for more directly managed task_work
running. This is beneficial for real world applications that end up
issuing lots of system calls as part of handling work. Normal
task_work will always execute as we transition in and out of the
kernel, even for "unrelated" system calls. It's more efficient to
defer the handling of io_uring's deferred work until the application
wants it to be run, generally in batches. As part of ongoing work to
write an io_uring network backend for Thrift, this has been shown to
greatly improve performance. (Dylan)
- Series adding IOPOLL support for passthrough (Kanchan)
- Improvements and fixes to the send zero-copy support (Pavel)
- Partial IO handling fixes (Pavel)
- CQE ordering fixes around CQ ring overflow (Pavel)
- Support sendto() for non-zc as well (Pavel)
- Support sendmsg for zerocopy (Pavel)
- Networking iov_iter fix (Stefan)
- Misc fixes and cleanups (Pavel, me)
Please pull!
The following changes since commit 521a547ced6477c54b4b0cc206000406c221b4d6:
Linux 6.0-rc6 (2022-09-18 13:44:14 -0700)
are available in the Git repository at:
git://git.kernel.dk/linux.git tags/for-6.1/io_uring-2022-10-03
for you to fetch changes up to 108893ddcc4d3aa0a4a02aeb02d478e997001227:
io_uring/net: fix notif cqe reordering (2022-09-29 17:46:04 -0600)
----------------------------------------------------------------
for-6.1/io_uring-2022-10-03
----------------------------------------------------------------
Dylan Yudaken (9):
eventfd: guard wake_up in eventfd fs calls as well
io_uring: remove unnecessary variable
io_uring: introduce io_has_work
io_uring: do not run task work at the start of io_uring_enter
io_uring: add IORING_SETUP_DEFER_TASKRUN
io_uring: move io_eventfd_put
io_uring: signal registered eventfd to process deferred task work
io_uring: trace local task work run
io_uring: allow buffer recycling in READV
Jens Axboe (9):
io_uring: cleanly separate request types for iopoll
io_uring: add local task_work run helper that is entered locked
io_uring: ensure iopoll runs local task work as well
fs: add batch and poll flags to the uring_cmd_iopoll() handler
io_uring/fdinfo: get rid of unnecessary is_cqe32 variable
io_uring/fdinfo: fix sqe dumping for IORING_SETUP_SQE128
io_uring: ensure local task_work marks task as running
io_uring/rw: defer fsnotify calls to task context
io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
Kanchan Joshi (4):
fs: add file_operations->uring_cmd_iopoll
io_uring: add iopoll infrastructure for io_uring_cmd
block: export blk_rq_is_poll
nvme: wire up async polling for io passthrough commands
Pavel Begunkov (33):
io_uring: kill an outdated comment
io_uring: use io_cq_lock consistently
io_uring/net: reshuffle error handling
io_uring/net: use async caches for async prep
io_uring/net: io_async_msghdr caches for sendzc
io_uring/net: add non-bvec sg chunking callback
io_uring/net: refactor io_sr_msg types
io_uring/net: use io_sr_msg for sendzc
io_uring: further limit non-owner defer-tw cq waiting
io_uring: disallow defer-tw run w/ no submitters
io_uring/iopoll: fix unexpected returns
io_uring/iopoll: unify tw breaking logic
io_uring: add fast path for io_run_local_work()
io_uring: remove unused return from io_disarm_next
io_uring: add custom opcode hooks on fail
io_uring/rw: don't lose partial IO result on fail
io_uring/net: don't lose partial send/recv on fail
io_uring/net: don't lose partial send_zc on fail
io_uring/net: refactor io_setup_async_addr
io_uring/net: support non-zerocopy sendto
io_uring/net: rename io_sendzc()
io_uring/net: combine fail handlers
io_uring/net: zerocopy sendmsg
selftest/net: adjust io_uring sendzc notif handling
io_uring/net: fix UAF in io_sendrecv_fail()
io_uring: fix CQE reordering
io_uring/net: fix cleanup double free free_iov init
io_uring/rw: fix unexpected link breakage
io_uring/rw: don't lose short results on io_setup_async_rw()
io_uring/net: don't skip notifs for failed requests
io_uring/net: fix non-zc send with address
io_uring/net: don't update msg_name if not provided
io_uring/net: fix notif cqe reordering
Stefan Metzmacher (1):
io_uring/net: fix fast_iov assignment in io_setup_async_msg()
block/blk-mq.c | 3 +-
drivers/nvme/host/core.c | 1 +
drivers/nvme/host/ioctl.c | 77 +++++-
drivers/nvme/host/multipath.c | 1 +
drivers/nvme/host/nvme.h | 4 +
fs/eventfd.c | 10 +-
include/linux/blk-mq.h | 1 +
include/linux/eventfd.h | 2 +-
include/linux/fs.h | 2 +
include/linux/io_uring.h | 8 +-
include/linux/io_uring_types.h | 4 +
include/linux/sched.h | 2 +-
include/trace/events/io_uring.h | 29 ++
include/uapi/linux/io_uring.h | 8 +
io_uring/cancel.c | 2 +-
io_uring/fdinfo.c | 48 ++--
io_uring/io_uring.c | 304 ++++++++++++++++----
io_uring/io_uring.h | 62 ++++-
io_uring/kbuf.h | 12 -
io_uring/net.c | 308 +++++++++++++++------
io_uring/net.h | 12 +-
io_uring/opdef.c | 44 ++-
io_uring/opdef.h | 1 +
io_uring/rsrc.c | 2 +-
io_uring/rw.c | 189 +++++++------
io_uring/rw.h | 1 +
io_uring/timeout.c | 13 +-
io_uring/timeout.h | 2 +-
io_uring/uring_cmd.c | 11 +-
tools/testing/selftests/net/io_uring_zerocopy_tx.c | 22 +-
30 files changed, 859 insertions(+), 326 deletions(-)
--
Jens Axboe
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [GIT PULL] io_uring updates for 6.1-rc1
2022-10-03 14:18 [GIT PULL] io_uring updates for 6.1-rc1 Jens Axboe
@ 2022-10-07 16:07 ` Linus Torvalds
2022-10-07 16:30 ` Jens Axboe
2022-10-07 17:00 ` pr-tracker-bot
1 sibling, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2022-10-07 16:07 UTC (permalink / raw)
To: Jens Axboe, Dylan Yudaken; +Cc: io-uring
On Mon, Oct 3, 2022 at 7:18 AM Jens Axboe <[email protected]> wrote:
>
> - Series that adds supported for more directly managed task_work
> running. This is beneficial for real world applications that end up
> issuing lots of system calls as part of handling work.
While I agree with the concept, I'm not convinced this is done the right way.
It looks very much like it was done in a "this is perfect for benchmarks" mode.
I think you should consider it much more similar to plugging (both
network and disk IO). In particular, I think that you'll find that
once you have random events like memory allocations blocking in other
places, you actually will want to unplug early, so that you don't end
up sleeping with unstarted work to do.
And the reason I say this code looks like "made for benchmarks" is
that you'll basically never see those kinds of issues when you just
run some benchmark that is tuned for this. For the benchmark, you
just want the user to control exactly when to start the load, because
you control pretty much everything.
And then real life happens, and you have situations where you get
those odd hiccups from other things going on, and you wonder "why was
no IO taking place?"
Maybe I'm misreading the code, but it looks to me that the deferred
io_uring work is basically deferred completely synchronously.
I've pulled this, and maybe I'm misreading it. Or maybe there's some
reason why io_uring is completely different from all the other
situations where we've ever wanted to do this kind of plugging for
batching, but I really doubt that io_uring is magically different...
Linus
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [GIT PULL] io_uring updates for 6.1-rc1
2022-10-07 16:07 ` Linus Torvalds
@ 2022-10-07 16:30 ` Jens Axboe
2022-10-08 18:12 ` Dylan Yudaken
0 siblings, 1 reply; 5+ messages in thread
From: Jens Axboe @ 2022-10-07 16:30 UTC (permalink / raw)
To: Linus Torvalds, Dylan Yudaken; +Cc: io-uring
On 10/7/22 10:07 AM, Linus Torvalds wrote:
> On Mon, Oct 3, 2022 at 7:18 AM Jens Axboe <[email protected]> wrote:
>>
>> - Series that adds supported for more directly managed task_work
>> running. This is beneficial for real world applications that end up
>> issuing lots of system calls as part of handling work.
>
> While I agree with the concept, I'm not convinced this is done the
> right way.
>
> It looks very much like it was done in a "this is perfect for
> benchmarks" mode.
>
> I think you should consider it much more similar to plugging (both
> network and disk IO). In particular, I think that you'll find that
> once you have random events like memory allocations blocking in other
> places, you actually will want to unplug early, so that you don't end
> up sleeping with unstarted work to do.
>
> And the reason I say this code looks like "made for benchmarks" is
> that you'll basically never see those kinds of issues when you just
> run some benchmark that is tuned for this. For the benchmark, you
> just want the user to control exactly when to start the load, because
> you control pretty much everything.
>
> And then real life happens, and you have situations where you get
> those odd hiccups from other things going on, and you wonder "why was
> no IO taking place?"
>
> Maybe I'm misreading the code, but it looks to me that the deferred
> io_uring work is basically deferred completely synchronously.
>
> I've pulled this, and maybe I'm misreading it. Or maybe there's some
> reason why io_uring is completely different from all the other
> situations where we've ever wanted to do this kind of plugging for
> batching, but I really doubt that io_uring is magically different...
I'll try and address these separately.
It's interesting that you suspect it's made for benchmarks. In practice,
this came about from very much the opposite angle - benchmarks were
fine, but production code for Thrift was showing cases where the
io_uring backend didn't perform as well as the epoll one. Dylan did a
lot of debugging and head scratching here, because it wasn't one of
those "let's profile and see what it is - oh yep, this needs to be
improved" kind of cases. Benchmark are easy because they are very much
targeted - if you write something that tries to behave like thrift, then
it too will perform fine. One of the key differences is that
production code actually does a bunch of things when processing a
request, issuing other system calls. A benchmark does not.
When the backend issues a receive and no data is available, an internal
poll trigger is used to know when we can actually receive data. When
that triggers, task_work is queued to do the actual receive. That's
considered the fast part, because it's basically just copying the data.
task_work is tied to exiting to userspace, which means that basically
anything the backend does that isn't strict CPU processing will end up
flushing the task_work. This really hurts efficiencies at certain rates
of load. The problem was worse when task_work actually triggered a
forced kernel enter/exit via TWA_SIGNAL doing an IPI to reschedule if it
was running in userspace, but it was still an issue even with just
deferring it to be run whenever a transition happened anyway.
I do agree that you have a point on it being somewhat similar to
plugging in the sense that you would probably want this flushed if you
get scheduled out anyway. For production loads where you end up being
resource constrained (not a rare occurence...), we want them run before
putting the task to sleep. We'll look into making a tweak like that, it
seems like the right thing to do.
I'm sure Dylan can chime in on the above too once he's back, to add some
more data to this change.
--
Jens Axboe
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [GIT PULL] io_uring updates for 6.1-rc1
2022-10-03 14:18 [GIT PULL] io_uring updates for 6.1-rc1 Jens Axboe
2022-10-07 16:07 ` Linus Torvalds
@ 2022-10-07 17:00 ` pr-tracker-bot
1 sibling, 0 replies; 5+ messages in thread
From: pr-tracker-bot @ 2022-10-07 17:00 UTC (permalink / raw)
To: Jens Axboe; +Cc: Linus Torvalds, io-uring
The pull request you sent on Mon, 3 Oct 2022 08:18:29 -0600:
> git://git.kernel.dk/linux.git tags/for-6.1/io_uring-2022-10-03
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/5853a7b5512c3017f64ca26494bd7361a12d6992
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [GIT PULL] io_uring updates for 6.1-rc1
2022-10-07 16:30 ` Jens Axboe
@ 2022-10-08 18:12 ` Dylan Yudaken
0 siblings, 0 replies; 5+ messages in thread
From: Dylan Yudaken @ 2022-10-08 18:12 UTC (permalink / raw)
To: [email protected], [email protected]; +Cc: [email protected]
On Fri, 2022-10-07 at 10:30 -0600, Jens Axboe wrote:
> On 10/7/22 10:07 AM, Linus Torvalds wrote:
> > On Mon, Oct 3, 2022 at 7:18 AM Jens Axboe <[email protected]> wrote:
> > >
> > > - Series that adds supported for more directly managed task_work
> > > running. This is beneficial for real world applications that
> > > end up
> > > issuing lots of system calls as part of handling work.
> >
> > While I agree with the concept, I'm not convinced this is done the
> > right way.
> >
> > It looks very much like it was done in a "this is perfect for
> > benchmarks" mode.
> >
> > I think you should consider it much more similar to plugging (both
> > network and disk IO). In particular, I think that you'll find that
> > once you have random events like memory allocations blocking in
> > other
> > places, you actually will want to unplug early, so that you don't
> > end
> > up sleeping with unstarted work to do.
> >
> > And the reason I say this code looks like "made for benchmarks" is
> > that you'll basically never see those kinds of issues when you just
> > run some benchmark that is tuned for this. For the benchmark, you
> > just want the user to control exactly when to start the load,
> > because
> > you control pretty much everything.
> >
> > And then real life happens, and you have situations where you get
> > those odd hiccups from other things going on, and you wonder "why
> > was
> > no IO taking place?"
> >
> > Maybe I'm misreading the code, but it looks to me that the deferred
> > io_uring work is basically deferred completely synchronously.
> >
> > I've pulled this, and maybe I'm misreading it. Or maybe there's
> > some
> > reason why io_uring is completely different from all the other
> > situations where we've ever wanted to do this kind of plugging for
> > batching, but I really doubt that io_uring is magically
> > different...
>
> I'll try and address these separately.
>
> It's interesting that you suspect it's made for benchmarks. In
> practice,
> this came about from very much the opposite angle - benchmarks were
> fine, but production code for Thrift was showing cases where the
> io_uring backend didn't perform as well as the epoll one. Dylan did a
> lot of debugging and head scratching here, because it wasn't one of
> those "let's profile and see what it is - oh yep, this needs to be
> improved" kind of cases. Benchmark are easy because they are very
> much
> targeted - if you write something that tries to behave like thrift,
> then
> it too will perform fine. One of the key differences is that
> production code actually does a bunch of things when processing a
> request, issuing other system calls. A benchmark does not.
>
> When the backend issues a receive and no data is available, an
> internal
> poll trigger is used to know when we can actually receive data. When
> that triggers, task_work is queued to do the actual receive. That's
> considered the fast part, because it's basically just copying the
> data.
Just want to point out that "just copying the data" is not actually all
that fast for some workloads (eg a burst of very large socket receives
arrives).
This actually compounds the problem, as while processing the big
receives, more packets might arrive needing to be processed, which
further delays things.
This was actually the scenario that was breaking: socket sends were
waiting to be pushed out, but got queued behind a burst of receives
which added noticeable response latency at high load. But since
io_uring has a pretty good idea of when the user will want to process
completions it made sense to me to defer the aync work until just
before that point, rather than piecemeal doing bit of it.
> task_work is tied to exiting to userspace, which means that basically
> anything the backend does that isn't strict CPU processing will end
> up
> flushing the task_work. This really hurts efficiencies at certain
> rates
> of load. The problem was worse when task_work actually triggered a
> forced kernel enter/exit via TWA_SIGNAL doing an IPI to reschedule if
> it
> was running in userspace, but it was still an issue even with just
> deferring it to be run whenever a transition happened anyway.
>
> I do agree that you have a point on it being somewhat similar to
> plugging in the sense that you would probably want this flushed if
> you
> get scheduled out anyway. For production loads where you end up being
> resource constrained (not a rare occurence...), we want them run
> before
> putting the task to sleep. We'll look into making a tweak like that,
> it
> seems like the right thing to do.
I hadn't considered this strongly enough but it makes total sense I
think.
The defered work is mainly a latency win, and if the task is put to
sleep that win has gone anyway, it may as well process IO. Especially
since there might be write/send ops where the CQE latency is less
important than just getting the IO done. My assumption using deferred
work for sends is rare enough (as it would require the send buffer to
be full and poll to be armed), that I hadn't noticed it.
I'll take a look into getting a patch for this done.
>
> I'm sure Dylan can chime in on the above too once he's back, to add
> some
> more data to this change.
>
Thanks,
Dylan
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2022-10-08 18:12 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-10-03 14:18 [GIT PULL] io_uring updates for 6.1-rc1 Jens Axboe
2022-10-07 16:07 ` Linus Torvalds
2022-10-07 16:30 ` Jens Axboe
2022-10-08 18:12 ` Dylan Yudaken
2022-10-07 17:00 ` pr-tracker-bot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox