* [GIT PULL] io_uring updates for 5.18-rc1
@ 2022-03-18 21:59 Jens Axboe
  2022-03-22  0:25 ` pr-tracker-bot
  [not found] ` <20220326122838.19d7193f@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
  0 siblings, 2 replies; 11+ messages in thread
From: Jens Axboe @ 2022-03-18 21:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring

Hi Linus,

io_uring updates for the 5.18-rc1 merge window. This pull request contains:

- Fixes for current file position. Still doesn't have the f_pos_lock
  sorted, but it's a step in the right direction (Dylan)

- Tracing updates (Dylan, Stefan)

- Improvements to io-wq locking (Hao)

- Improvements for provided buffers (me, Pavel)

- Support for registered file descriptors (me, Xiaoguang)

- Support for ring messages (me)

- Poll improvements (me)

- Fix for fixed buffers and non-iterator reads/writes (me)

- Support for NAPI on sockets (Olivier)

- Ring quiesce improvements (Usama)

- Misc fixes (Olivier, Pavel)

Will merge cleanly. Please pull!

The following changes since commit ffb217a13a2eaf6d5bd974fc83036a53ca69f1e2:

  Linux 5.17-rc7 (2022-03-06 14:28:31 -0800)

are available in the Git repository at:

  git://git.kernel.dk/linux-block.git tags/for-5.18/io_uring-2022-03-18

for you to fetch changes up to 5e929367468c8f97cd1ffb0417316cecfebef94b:

  io_uring: terminate manual loop iterator loop correctly for non-vecs (2022-03-18 11:42:48 -0600)

----------------------------------------------------------------
for-5.18/io_uring-2022-03-18
----------------------------------------------------------------

Dylan Yudaken (5):
      io_uring: remove duplicated calls to io_kiocb_ppos
      io_uring: update kiocb->ki_pos at execution time
      io_uring: do not recalculate ppos unnecessarily
      io_uring: documentation fixup
      io_uring: make tracing format consistent

Hao Xu (3):
      io-wq: decouple work_list protection from the big wqe->lock
      io-wq: reduce acct->lock crossing functions lock/unlock
      io-wq: use IO_WQ_ACCT_NR rather than hardcoded number

Jens Axboe (15):
      io_uring: add support for registering ring file descriptors
      io_uring: speedup provided buffer handling
      io_uring: add support for IORING_OP_MSG_RING command
      io_uring: retry early for reads if we can poll
      io_uring: ensure reads re-import for selected buffers
      io_uring: recycle provided buffers if request goes async
      io_uring: allow submissions to continue on error
      io_uring: remove duplicated member check for io_msg_ring_prep()
      io_uring: recycle apoll_poll entries
      io_uring: move req->poll_refs into previous struct hole
      io_uring: cache req->apoll->events in req->cflags
      io_uring: cache poll/double-poll state with a request flag
      io_uring: manage provided buffers strictly ordered
      io_uring: don't check unrelated req->open.how in accept request
      io_uring: terminate manual loop iterator loop correctly for non-vecs

Nathan Chancellor (1):
      io_uring: Fix use of uninitialized ret in io_eventfd_register()

Olivier Langlois (3):
      io_uring: Remove unneeded test in io_run_task_work_sig()
      io_uring: minor io_cqring_wait() optimization
      io_uring: Add support for napi_busy_poll

Pavel Begunkov (8):
      io_uring: normilise naming for fill_cqe*
      io_uring: refactor timeout cancellation cqe posting
      io_uring: extend provided buf return to fails
      io_uring: fix provided buffer return on failure for kiocb_done()
      io_uring: remove extra barrier for non-sqpoll iopoll
      io_uring: shuffle io_eventfd_signal() bits around
      io_uring: thin down io_commit_cqring()
      io_uring: fold evfd signalling under a slower path

Stefan Roesch (2):
      io-uring: add __fill_cqe function
      io-uring: Make tracepoints consistent.

Usama Arif (5):
      io_uring: remove trace for eventfd
      io_uring: avoid ring quiesce while registering/unregistering eventfd
      io_uring: avoid ring quiesce while registering async eventfd
      io_uring: avoid ring quiesce while registering restrictions and enabling rings
      io_uring: remove ring quiesce for io_uring_register

 fs/io-wq.c                      |  114 ++--
 fs/io_uring.c                   | 1251 ++++++++++++++++++++++++++++++---------
 include/linux/io_uring.h        |    5 +-
 include/trace/events/io_uring.h |  333 +++++------
 include/uapi/linux/io_uring.h   |   17 +-
 5 files changed, 1200 insertions(+), 520 deletions(-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-18 21:59 [GIT PULL] io_uring updates for 5.18-rc1 Jens Axboe
@ 2022-03-22  0:25 ` pr-tracker-bot
  [not found] ` <20220326122838.19d7193f@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
  1 sibling, 0 replies; 11+ messages in thread
From: pr-tracker-bot @ 2022-03-22  0:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, io-uring

The pull request you sent on Fri, 18 Mar 2022 15:59:16 -0600:

> git://git.kernel.dk/linux-block.git tags/for-5.18/io_uring-2022-03-18

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/af472a9efdf65cbb3398cb6478ec0e89fbc84109

Thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [GIT PULL] io_uring updates for 5.18-rc1
  [not found] ` <[email protected]>
@ 2022-06-01  6:59 ` Olivier Langlois
  2022-06-01 16:24   ` Jakub Kicinski
  2022-06-01 18:09   ` Linus Torvalds
  0 siblings, 2 replies; 11+ messages in thread
From: Olivier Langlois @ 2022-06-01  6:59 UTC (permalink / raw)
  To: Jakub Kicinski, Jens Axboe; +Cc: Linus Torvalds, io-uring

On Sat, 2022-03-26 at 14:30 -0700, Jakub Kicinski wrote:
> On Sat, 26 Mar 2022 15:06:40 -0600 Jens Axboe wrote:
> > On 3/26/22 2:57 PM, Jens Axboe wrote:
> > > > I'd also like to have a conversation about continuing to use
> > > > the socket as a proxy for NAPI_ID, NAPI_ID is exposed to user
> > > > space now. io_uring being a new interface I wonder if it's not
> > > > better to let the user specify the request parameters directly.
> > > 
> > > Definitely open to something that makes more sense, given we don't
> > > have to shoehorn things through the regular API for NAPI with
> > > io_uring.
> > 
> > The most appropriate is probably to add a way to get/set NAPI
> > settings on a per-io_uring basis, eg through io_uring_register(2).
> > It's a bit more difficult if they have to be per-socket, as the
> > polling happens off what would normally be the event wait path.
> > 
> > What did you have in mind?
> 
> Not sure I fully comprehend what the current code does. IIUC it uses
> the socket and the caches its napi_id, presumably because it doesn't
> want to hold a reference on the socket?

Again, the io_uring napi busy_poll integration is strongly inspired
by the epoll implementation, which caches a single napi_id. I guess
that I did reverse engineer the rationale justifying the epoll design
decisions.

If you were to busy poll receive queues for a socket set containing
hundreds of thousands of sockets, would you rather scan the whole
socket set to retrieve which queues to poll, or simply iterate through
a list containing a dozen or so ids?

> This may give the user a false impression that the polling follows
> the socket. NAPIs may get reshuffled underneath on pretty random
> reconfiguration / recovery events (random == driver dependent).

There is nothing random. When a socket is added to the poll set, its
receive queue is added to the short list of queues to poll.

A very common usage pattern among networking applications is to
reinsert the socket into the polling set after each polling event. In
recognition of this pattern, and to avoid allocating/deallocating
memory to modify the napi_id list all the time, each napi id is kept
in the list until a very long period of inactivity is reached, at
which point it is finally removed to stop the receive queue busy
polling.

> I'm not entirely clear how the thing is supposed to be used with TCP
> socket, as from a quick grep it appears that listening sockets don't
> get napi_id marked at all.
> 
> The commit mentions a UDP benchmark, Olivier can you point me to more
> info on the use case? I'm mostly familiar with NAPI busy poll with
> XDP sockets, where it's pretty obvious.

https://github.com/lano1106/io_uring_udp_ping

IDK what else I can tell you. I chose to unit test the new feature
with a UDP app because it was the simplest setup for testing. AFAIK,
the ultimate goal of busy polling is to minimize latency in packet
reception, and the NAPI busy polling code should not treat packets
differently whether they are UDP or TCP or whatever type of frames
the NIC does receive...

> My immediate reaction is that we should either explicitly call out
> NAPI instances by id in uAPI, or make sure we follow the socket in
> every case. Also we can probably figure out an easy way of avoiding
> the hash table lookups and cache a pointer to the NAPI struct.

That is an interesting idea.

If this is something that the NAPI API would offer, I would gladly use
it to avoid the hash lookup, but IMHO, while I see it as a very
interesting improvement, hopefully this should not block my patch...

^ permalink raw reply	[flat|nested] 11+ messages in thread
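[Editor's note] Olivier's eviction scheme can be sketched in plain C. The following is a hypothetical user-space illustration of the idea he describes — a small fixed-size id list, a timestamp refreshed on activity, and eviction only after a long idle period — not the actual fs/io_uring.c code; the names `napi_touch`/`napi_gc` and the constants are invented for the sketch:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NAPI_IDS 16
#define STALE_AFTER  60   /* seconds of inactivity before eviction */

struct napi_entry {
	unsigned int id;
	long last_seen;   /* timestamp of last poll activity */
	bool in_use;
};

static struct napi_entry cache[MAX_NAPI_IDS];

/* Record activity for a napi id, inserting it if not yet tracked. */
static void napi_touch(unsigned int id, long now)
{
	int free_slot = -1;

	for (int i = 0; i < MAX_NAPI_IDS; i++) {
		if (cache[i].in_use && cache[i].id == id) {
			cache[i].last_seen = now;
			return;
		}
		if (!cache[i].in_use && free_slot < 0)
			free_slot = i;
	}
	if (free_slot >= 0) {
		cache[free_slot].id = id;
		cache[free_slot].last_seen = now;
		cache[free_slot].in_use = true;
	}
}

/* Evict entries idle longer than STALE_AFTER; returns the live count. */
static int napi_gc(long now)
{
	int live = 0;

	for (int i = 0; i < MAX_NAPI_IDS; i++) {
		if (!cache[i].in_use)
			continue;
		if (now - cache[i].last_seen > STALE_AFTER)
			cache[i].in_use = false;
		else
			live++;
	}
	return live;
}
```

Busy polling then iterates only the handful of live entries rather than rescanning the whole socket set, which is the trade-off Olivier argues for above.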
* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01  6:59 ` Olivier Langlois
@ 2022-06-01 16:24   ` Jakub Kicinski
  2022-06-01 18:09   ` Linus Torvalds
  1 sibling, 0 replies; 11+ messages in thread
From: Jakub Kicinski @ 2022-06-01 16:24 UTC (permalink / raw)
  To: Olivier Langlois; +Cc: Jens Axboe, Linus Torvalds, io-uring

On Wed, 01 Jun 2022 02:59:12 -0400 Olivier Langlois wrote:
> > I'm not entirely clear how the thing is supposed to be used with TCP
> > socket, as from a quick grep it appears that listening sockets don't
> > get napi_id marked at all.
> > 
> > The commit mentions a UDP benchmark, Olivier can you point me to more
> > info on the use case? I'm mostly familiar with NAPI busy poll with
> > XDP sockets, where it's pretty obvious.
> 
> https://github.com/lano1106/io_uring_udp_ping
> 
> IDK what else I can tell you. I choose to unit test the new feature
> with an UDP app because it was the simplest setup for testing. AFAIK,
> the ultimate goal of busy polling is to minimize latency in packets
> reception and the NAPI busy polling code should not treat differently
> packets whether they are UDP or TCP or whatever the type of frames the
> NIC does receive...

IDK how you use the busy polling, so I'm asking you to describe what
your app does. You said elsewhere that you don't have a dedicated
thread per queue, so it's not a server app (polling for requests) but
a client app (polling for responses)?

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01  6:59 ` Olivier Langlois
  2022-06-01 16:24   ` Jakub Kicinski
@ 2022-06-01 18:09   ` Linus Torvalds
  2022-06-01 18:21     ` Jens Axboe
  1 sibling, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2022-06-01 18:09 UTC (permalink / raw)
  To: Olivier Langlois; +Cc: Jakub Kicinski, Jens Axboe, io-uring

On Tue, May 31, 2022 at 11:59 PM Olivier Langlois
<[email protected]> wrote:
>
> Again, the io_uring napi busy_poll integration is strongly inspired
> from epoll implementation which caches a single napi_id.

Note that since epoll is the worst possible implementation of a
horribly bad idea, and one of the things I would really want people to
kill off, "it's designed based on epoll" is about the worst possible
explanation for anything at all.

Epoll is the CVS of kernel interfaces: look at it, cry, run away, and
try to avoid making that mistake ever again.

I'm looking forward to the day when we can just delete all epoll code,
but io_uring may be making that even worse, in how it has then exposed
epoll as an io_uring operation. That was probably a *HORRIBLE* mistake.

(For the two prime issues with epoll: epoll recursion and the
completely invalid expectations of what an "edge" in the edge
triggering is. But there are other mistakes in there, with the
lifetime of the epoll waitqueues having been nasty problems several
times, because of how it doesn't follow any of the normal poll()
rules, and made a mockery of any sane interfaces).

            Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread
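[Editor's note] The "invalid expectations of what an 'edge' is" that Linus mentions can be demonstrated directly with a pipe. With EPOLLET, an event is reported on the transition to readable, not while data remains buffered, so a consumer that wakes up but does not drain the fd gets nothing from the next wait even though bytes are still there. A minimal Linux-only demonstration using the uapi (not kernel code); the function name is invented for this sketch:

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Returns the number of events the second epoll_wait() reports after
 * the fd was made readable once and deliberately not drained.
 * With EPOLLET the expected answer is 0: no new edge, no new event.
 */
static int et_second_wait_events(void)
{
	int pfd[2], ep, n;
	struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
	struct epoll_event out;

	if (pipe(pfd) < 0 || (ep = epoll_create1(0)) < 0)
		return -1;
	ev.data.fd = pfd[0];
	if (epoll_ctl(ep, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
		return -1;

	if (write(pfd[1], "x", 1) != 1)       /* edge: empty -> readable */
		return -1;
	if (epoll_wait(ep, &out, 1, 0) != 1)  /* first wait sees the edge */
		return -1;

	/* Deliberately skip the read(), then poll again: no new edge. */
	n = epoll_wait(ep, &out, 1, 0);
	close(pfd[0]);
	close(pfd[1]);
	close(ep);
	return n;
}
```

A level-triggered watch (the default, without EPOLLET) would keep reporting the fd as readable on every wait until the byte is consumed.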
* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01 18:09   ` Linus Torvalds
@ 2022-06-01 18:21     ` Jens Axboe
  2022-06-01 18:28       ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2022-06-01 18:21 UTC (permalink / raw)
  To: Linus Torvalds, Olivier Langlois; +Cc: Jakub Kicinski, io-uring

On 6/1/22 12:09 PM, Linus Torvalds wrote:
> I'm looking forward to the day when we can just delete all epoll code,
> but io_uring may be a making that even worse, in how it has then
> exposed epoll as an io_uring operation. That was probably a *HORRIBLE*
> mistake.

Of the added opcodes in io_uring, that one I'm actually certain never
ended up getting used. I see no reason why we can't just deprecate it
and eventually just wire it up to io_eopnotsupp().

IOW, that won't be the one holding us back killing epoll.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01 18:21     ` Jens Axboe
@ 2022-06-01 18:28       ` Linus Torvalds
  2022-06-01 18:34         ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2022-06-01 18:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Olivier Langlois, Jakub Kicinski, io-uring

On Wed, Jun 1, 2022 at 11:21 AM Jens Axboe <[email protected]> wrote:
>
> Of the added opcodes in io_uring, that one I'm actually certain never
> ended up getting used. I see no reason why we can't just deprecate it
> and eventually just wire it up to io_eopnotsupp().
>
> IOW, that won't be the one holding us back killing epoll.

That really would be lovely.

I think io_uring at least in theory might have the potential to _help_
kill epoll, since I suspect a lot of epoll users might well prefer
io_uring instead.

I say "in theory", because it does require that io_uring itself
doesn't keep any of the epoll code alive, but also because we've seen
over and over that people just don't migrate to newer interfaces
because it's just too much work and the old ones still work..

Of course, we haven't exactly helped things - right now the whole
EPOLL thing is "default y" and behind an EXPERT define, so people
aren't even asked if they want it. Because it used to be one of those
things everybody enabled because it was new and shiny and cool.

And sadly, there are a few things that epoll really shines at, so I
suspect that will never really change ;(

            Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01 18:28       ` Linus Torvalds
@ 2022-06-01 18:34         ` Jens Axboe
  2022-06-01 18:52           ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2022-06-01 18:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Olivier Langlois, Jakub Kicinski, io-uring

On 6/1/22 12:28 PM, Linus Torvalds wrote:
> On Wed, Jun 1, 2022 at 11:21 AM Jens Axboe <[email protected]> wrote:
>>
>> Of the added opcodes in io_uring, that one I'm actually certain never
>> ended up getting used. I see no reason why we can't just deprecate it
>> and eventually just wire it up to io_eopnotsupp().
>>
>> IOW, that won't be the one holding us back killing epoll.
> 
> That really would be lovely.
> 
> I think io_uring at least in theory might have the potential to _help_
> kill epoll, since I suspect a lot of epoll users might well prefer
> io_uring instead.
> 
> I say "in theory", because it does require that io_uring itself
> doesn't keep any of the epoll code alive, but also because we've seen
> over and over that people just don't migrate to newer interfaces
> because it's just too much work and the old ones still work..
> 
> Of course, we haven't exactly helped things - right now the whole
> EPOLL thing is "default y" and behind a EXPERT define, so people
> aren't even asked if they want it. Because it used to be one of those
> things everybody enabled because it was new and shiny and cool.
> 
> And sadly, there are a few things that epoll really shines at, so I
> suspect that will never really change ;(

I think there are two ways that io_uring can help kill epoll:

1) As a basic replacement as an event notifier. I'm not a huge fan of
these conversions in general, as they just swap one readiness notifier
for another one. Hence they don't end up taking full advantage of what
io_uring has to offer. But they are easy, and event libraries obviously
often take this approach.

2) From scratch implementations or actual adoptions in applications
will switch from an epoll driven readiness model to the io_uring
completion model. These are the conversions that I am the most excited
about, as they end up using the (imho) better model that io_uring has
to offer.

But as a first step, let's just mark it deprecated with a pr_warn() for
5.20 and then plan to kill it off whenever a suitable number of
releases have passed since that addition.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 11+ messages in thread
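[Editor's note] The readiness-vs-completion distinction Jens draws can be sketched in a few lines of C. Only the readiness side below uses a real interface (poll(2)); the completion side is a toy in-process stand-in for a ring that executes the read eagerly at submit time — it is not the io_uring API, and the `toy_*` names are invented for this sketch:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Readiness model (epoll/poll style): two steps per I/O.  The kernel
 * tells you the fd is readable; you then issue the read yourself. */
static ssize_t readiness_read(int fd, char *buf, size_t len)
{
	struct pollfd p = { .fd = fd, .events = POLLIN };

	if (poll(&p, 1, -1) != 1)   /* step 1: wait for "readable" */
		return -1;
	return read(fd, buf, len);  /* step 2: perform the I/O */
}

/* Completion model (io_uring style), simulated in-process: the caller
 * submits a read and later collects a finished result.  A real ring
 * queues the request; this toy executes it immediately on submit. */
struct toy_completion {
	ssize_t res;
};

static void toy_submit_read(struct toy_completion *c, int fd,
			    char *buf, size_t len)
{
	c->res = read(fd, buf, len);  /* data lands in buf before "wait" */
}

static ssize_t toy_wait(struct toy_completion *c)
{
	return c->res;                /* one wakeup, I/O already done */
}
```

The point of the second model is that the wakeup hands back completed I/O rather than a hint to go do the I/O yourself, which is why conversions that merely use io_uring as a poll replacement (Jens's case 1) leave most of the benefit on the table.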
* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01 18:34         ` Jens Axboe
@ 2022-06-01 18:52           ` Linus Torvalds
  2022-06-01 19:10             ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2022-06-01 18:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Olivier Langlois, Jakub Kicinski, io-uring

On Wed, Jun 1, 2022 at 11:34 AM Jens Axboe <[email protected]> wrote:
>
> But as a first step, let's just mark it deprecated with a pr_warn() for
> 5.20 and then plan to kill it off whenever a suitable amount of relases
> have passed since that addition.

I'd love to, but it's not actually realistic as things stand now.
epoll() is used in a *lot* of random libraries. A "pr_warn()" would
just be senseless noise, I bet.

No, there's a reason that EPOLL is still there, still 'default y',
even though I dislike it and think it was a mistake, and we've had
several nasty bugs related to it over the years.

It really can be a very useful system call, it's just that it really
doesn't work the way the actual ->poll() interface was designed, and
it kind of hijacks it in ways that mostly work, but they have subtle
lifetime issues that you don't see with a regular select/poll because
those will always tear down the wait queues.

Realistically, the proper fix to epoll is likely to make it explicit,
and make files and drivers that want to support it have to actually
opt in. Because a lot of the problems have been due to epoll() looking
*exactly* like a regular poll/select to a driver or a filesystem, but
having those very subtle extended requirements.

(And no, the extended requirements aren't generally onerous, and
regular ->poll() works fine for 99% of all cases. It's just that
occasionally, special users are then fooled about special contexts).

In other words, it's a bit like our bad old days when "splice()" ended
up falling back to regular ->read()/->write() implementations with
set_fs(KERNEL_DS).

Yes, that worked fine for 99% of all cases, and we did it for years,
but it also caused several really nasty issues for when the read/write
actor did something slightly unusual.

So I may dislike epoll quite intensely, but I don't think we can
*really* get rid of it. But we might be able to make it a bit more
controlled.

But so far every time it has caused issues, we've worked around it by
fixing it up in the particular driver or whatever that ended up being
triggered by epoll semantics.

            Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01 18:52           ` Linus Torvalds
@ 2022-06-01 19:10             ` Jens Axboe
  2022-06-01 19:20               ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2022-06-01 19:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Olivier Langlois, Jakub Kicinski, io-uring

On 6/1/22 12:52 PM, Linus Torvalds wrote:
> On Wed, Jun 1, 2022 at 11:34 AM Jens Axboe <[email protected]> wrote:
>>
>> But as a first step, let's just mark it deprecated with a pr_warn() for
>> 5.20 and then plan to kill it off whenever a suitable amount of relases
>> have passed since that addition.
> 
> I'd love to, but it's not actually realistic as things stand now.
> epoll() is used in a *lot* of random libraries. A "pr_warn()" would
> just be senseless noise, I bet.

I mean only for the IORING_OP_EPOLL_CTL opcode, which is the only epoll
connection we have in there. It'd be jumping the gun to do it for the
epoll_ctl syscall for sure... And I really have no personal skin in
that game, other than having a better alternative. But that's obviously
a long pole type of deprecation.

> No, there's a reason that EPOLL is still there, still 'default y',
> even though I dislike it and think it was a mistake, and we've had
> several nasty bugs related to it over the years.
> 
> It really can be a very useful system call, it's just that it really
> doesn't work the way the actual ->poll() interface was designed, and
> it kind of hijacks it in ways that mostly work, but the have subtle
> lifetime issues that you don't see with a regular select/poll because
> those will always tear down the wait queues.
> 
> Realistically, the proper fix to epoll is likely to make it explicit,
> and make files and drivers that want to support it have to actually
> opt in. Because a lot of the problems have been due to epoll() looking
> *exactly* like a regular poll/select to a driver or a filesystem, but
> having those very subtle extended requirements.
> 
> (And no, the extended requirements aren't generally onerous, and
> regular ->poll() works fine for 99% of all cases. It's just that
> occasionally, special users are then fooled about special contexts).

It's not an uncommon approach to make the initial adoption /
implementation more palatable, though it commonly then also ends up
being a mistake. I've certainly been guilty of that myself too...

> In other words, it's a bit like our bad old days when "splice()" ended
> up falling back to regular ->read()/->write() implementations with
> set_fs(KERNEL_DS). Yes, that worked fine for 99% of all cases, and we
> did it for years, but it also caused several really nasty issues for
> when the read/write actor did something slightly unusual.

Unfortunately that particular change I just had to deal with, and
noticed that we're up to more than two handfuls of fixes for that and
I bet we're not done. Not saying it wasn't the right choice in terms
of sanity, but it has been more painful than I thought it would be.

> So I may dislike epoll quite intensely, but I don't think we can
> *really* get rid of it. But we might be able to make it a bit more
> controlled.
> 
> But so far every time it has caused issues, we've worked around it by
> fixing it up in the particular driver or whatever that ended up being
> triggered by epoll semantics.

The io_uring side of the epoll management I'm very sure can go in a few
releases, and a pr_warn_once() for 5.20 is the right choice. epoll
itself, probably not even down the line, though I am hoping we can
continue to move people off of it. Maybe in another 20 years :-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 11+ messages in thread
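[Editor's note] The pr_warn_once() idiom Jens refers to can be illustrated in user space. This is a simplified, single-threaded rendition of the pattern (the kernel's version sits on its printk once-only machinery and handles concurrency); the function name is invented for the sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Warn the first time a deprecated operation is used, then stay
 * silent: the first call emits the message and returns 1, every
 * later call is a no-op returning 0. */
static int warn_once_deprecated(const char *what)
{
	static bool warned;

	if (warned)
		return 0;       /* already warned, stay quiet */
	warned = true;
	fprintf(stderr, "%s is deprecated and will be removed\n", what);
	return 1;               /* warning emitted on this call */
}
```

Wired into an opcode handler, this nags each booted kernel exactly once per offending feature instead of flooding the log for every request, which is what makes a deprecation warning tolerable for something libraries might still call.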
* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01 19:10             ` Jens Axboe
@ 2022-06-01 19:20               ` Linus Torvalds
  0 siblings, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2022-06-01 19:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Olivier Langlois, Jakub Kicinski, io-uring

On Wed, Jun 1, 2022 at 12:10 PM Jens Axboe <[email protected]> wrote:
>
> I mean only for the IORING_OP_EPOLL_CTL opcode, which is the only epoll
> connection we have in there.

Ok, that removal sounds fine to me. Thanks.

            Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread