public inbox for [email protected]
 help / color / mirror / Atom feed
* [GIT PULL] io_uring updates for 5.18-rc1
@ 2022-03-18 21:59 Jens Axboe
  2022-03-22  0:25 ` pr-tracker-bot
  2022-03-26 19:28 ` Jakub Kicinski
  0 siblings, 2 replies; 24+ messages in thread
From: Jens Axboe @ 2022-03-18 21:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring

Hi Linus,

io_uring updates for the 5.18-rc1 merge window. This pull request
contains:

- Fixes for current file position. Still doesn't have the f_pos_lock
  sorted, but it's a step in the right direction (Dylan)

- Tracing updates (Dylan, Stefan)

- Improvements to io-wq locking (Hao)

- Improvements for provided buffers (me, Pavel)

- Support for registered file descriptors (me, Xiaoguang)

- Support for ring messages (me)

- Poll improvements (me)

- Fix for fixed buffers and non-iterator reads/writes (me

- Support for NAPI on sockets (Olivier)

- Ring quiesce improvements (Usama)

- Misc fixes (Olivier, Pavel)

Will merge cleanly. Please pull!



The following changes since commit ffb217a13a2eaf6d5bd974fc83036a53ca69f1e2:

  Linux 5.17-rc7 (2022-03-06 14:28:31 -0800)

are available in the Git repository at:

  git://git.kernel.dk/linux-block.git tags/for-5.18/io_uring-2022-03-18

for you to fetch changes up to 5e929367468c8f97cd1ffb0417316cecfebef94b:

  io_uring: terminate manual loop iterator loop correctly for non-vecs (2022-03-18 11:42:48 -0600)

----------------------------------------------------------------
for-5.18/io_uring-2022-03-18

----------------------------------------------------------------
Dylan Yudaken (5):
      io_uring: remove duplicated calls to io_kiocb_ppos
      io_uring: update kiocb->ki_pos at execution time
      io_uring: do not recalculate ppos unnecessarily
      io_uring: documentation fixup
      io_uring: make tracing format consistent

Hao Xu (3):
      io-wq: decouple work_list protection from the big wqe->lock
      io-wq: reduce acct->lock crossing functions lock/unlock
      io-wq: use IO_WQ_ACCT_NR rather than hardcoded number

Jens Axboe (15):
      io_uring: add support for registering ring file descriptors
      io_uring: speedup provided buffer handling
      io_uring: add support for IORING_OP_MSG_RING command
      io_uring: retry early for reads if we can poll
      io_uring: ensure reads re-import for selected buffers
      io_uring: recycle provided buffers if request goes async
      io_uring: allow submissions to continue on error
      io_uring: remove duplicated member check for io_msg_ring_prep()
      io_uring: recycle apoll_poll entries
      io_uring: move req->poll_refs into previous struct hole
      io_uring: cache req->apoll->events in req->cflags
      io_uring: cache poll/double-poll state with a request flag
      io_uring: manage provided buffers strictly ordered
      io_uring: don't check unrelated req->open.how in accept request
      io_uring: terminate manual loop iterator loop correctly for non-vecs

Nathan Chancellor (1):
      io_uring: Fix use of uninitialized ret in io_eventfd_register()

Olivier Langlois (3):
      io_uring: Remove unneeded test in io_run_task_work_sig()
      io_uring: minor io_cqring_wait() optimization
      io_uring: Add support for napi_busy_poll

Pavel Begunkov (8):
      io_uring: normilise naming for fill_cqe*
      io_uring: refactor timeout cancellation cqe posting
      io_uring: extend provided buf return to fails
      io_uring: fix provided buffer return on failure for kiocb_done()
      io_uring: remove extra barrier for non-sqpoll iopoll
      io_uring: shuffle io_eventfd_signal() bits around
      io_uring: thin down io_commit_cqring()
      io_uring: fold evfd signalling under a slower path

Stefan Roesch (2):
      io-uring: add __fill_cqe function
      io-uring: Make tracepoints consistent.

Usama Arif (5):
      io_uring: remove trace for eventfd
      io_uring: avoid ring quiesce while registering/unregistering eventfd
      io_uring: avoid ring quiesce while registering async eventfd
      io_uring: avoid ring quiesce while registering restrictions and enabling rings
      io_uring: remove ring quiesce for io_uring_register

 fs/io-wq.c                      |  114 ++--
 fs/io_uring.c                   | 1251 ++++++++++++++++++++++++++++++---------
 include/linux/io_uring.h        |    5 +-
 include/trace/events/io_uring.h |  333 +++++------
 include/uapi/linux/io_uring.h   |   17 +-
 5 files changed, 1200 insertions(+), 520 deletions(-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-18 21:59 [GIT PULL] io_uring updates for 5.18-rc1 Jens Axboe
@ 2022-03-22  0:25 ` pr-tracker-bot
  2022-03-26 19:28 ` Jakub Kicinski
  1 sibling, 0 replies; 24+ messages in thread
From: pr-tracker-bot @ 2022-03-22  0:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, io-uring

The pull request you sent on Fri, 18 Mar 2022 15:59:16 -0600:

> git://git.kernel.dk/linux-block.git tags/for-5.18/io_uring-2022-03-18

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/af472a9efdf65cbb3398cb6478ec0e89fbc84109

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-18 21:59 [GIT PULL] io_uring updates for 5.18-rc1 Jens Axboe
  2022-03-22  0:25 ` pr-tracker-bot
@ 2022-03-26 19:28 ` Jakub Kicinski
  2022-03-26 19:47   ` Jens Axboe
  2022-06-01  6:58   ` Olivier Langlois
  1 sibling, 2 replies; 24+ messages in thread
From: Jakub Kicinski @ 2022-03-26 19:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, io-uring

On Fri, 18 Mar 2022 15:59:16 -0600 Jens Axboe wrote:
> - Support for NAPI on sockets (Olivier)

Were we CCed on these patches? I don't remember seeing them, 
and looks like looks like it's inventing it's own constants
instead of using the config APIs we have.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-26 19:28 ` Jakub Kicinski
@ 2022-03-26 19:47   ` Jens Axboe
  2022-03-26 20:06     ` Jakub Kicinski
  2022-06-01  6:58   ` Olivier Langlois
  1 sibling, 1 reply; 24+ messages in thread
From: Jens Axboe @ 2022-03-26 19:47 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Linus Torvalds, io-uring, Olivier Langlois

On 3/26/22 1:28 PM, Jakub Kicinski wrote:
> On Fri, 18 Mar 2022 15:59:16 -0600 Jens Axboe wrote:
>> - Support for NAPI on sockets (Olivier)
> 
> Were we CCed on these patches? I don't remember seeing them, 
> and looks like looks like it's inventing it's own constants
> instead of using the config APIs we have.

Don't know if it was ever posted on the netdev list, but the
patches have been discussed for 6-9 months on the io_uring
list.

Which constants are you referring to? Only odd one I see is
NAPI_TIMEOUT, other ones are using the sysctl bits. If we're
missing something here, do speak up and we'll make sure it's
consistent with the regular NAPI.

Adding Olivier who wrote the NAPI support.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-26 19:47   ` Jens Axboe
@ 2022-03-26 20:06     ` Jakub Kicinski
  2022-03-26 20:57       ` Jens Axboe
  2022-06-01  6:58       ` Olivier Langlois
  0 siblings, 2 replies; 24+ messages in thread
From: Jakub Kicinski @ 2022-03-26 20:06 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, io-uring, Olivier Langlois

On Sat, 26 Mar 2022 13:47:24 -0600 Jens Axboe wrote:
> On 3/26/22 1:28 PM, Jakub Kicinski wrote:
> > On Fri, 18 Mar 2022 15:59:16 -0600 Jens Axboe wrote:  
> >> - Support for NAPI on sockets (Olivier)  
> > 
> > Were we CCed on these patches? I don't remember seeing them, 
> > and looks like looks like it's inventing it's own constants
> > instead of using the config APIs we have.  
> 
> Don't know if it was ever posted on the netdev list

Hm, maybe I don't understand how things are supposed to work. 
I'm surprised you're unfazed.

> but the patches have been discussed for 6-9 months on the io_uring
> list.

You may need explain to me how that's relevant :) 
The point is the networking maintainers have not seen it.

> Which constants are you referring to? Only odd one I see is
> NAPI_TIMEOUT, other ones are using the sysctl bits. If we're
> missing something here, do speak up and we'll make sure it's
> consistent with the regular NAPI.

SO_BUSY_POLL_BUDGET, 8 is quite low for many practical uses.
I'd also like to have a conversation about continuing to use
the socket as a proxy for NAPI_ID, NAPI_ID is exposed to user
space now. io_uring being a new interface I wonder if it's not 
better to let the user specify the request parameters directly.

> Adding Olivier who wrote the NAPI support.
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-26 20:06     ` Jakub Kicinski
@ 2022-03-26 20:57       ` Jens Axboe
  2022-03-26 21:06         ` Jens Axboe
  2022-06-01  6:58       ` Olivier Langlois
  1 sibling, 1 reply; 24+ messages in thread
From: Jens Axboe @ 2022-03-26 20:57 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Linus Torvalds, io-uring, Olivier Langlois

On 3/26/22 2:06 PM, Jakub Kicinski wrote:
> On Sat, 26 Mar 2022 13:47:24 -0600 Jens Axboe wrote:
>> On 3/26/22 1:28 PM, Jakub Kicinski wrote:
>>> On Fri, 18 Mar 2022 15:59:16 -0600 Jens Axboe wrote:  
>>>> - Support for NAPI on sockets (Olivier)  
>>>
>>> Were we CCed on these patches? I don't remember seeing them, 
>>> and looks like looks like it's inventing it's own constants
>>> instead of using the config APIs we have.  
>>
>> Don't know if it was ever posted on the netdev list
> 
> Hm, maybe I don't understand how things are supposed to work. 
> I'm surprised you're unfazed.

I'm surprised you're this surprised, to be honest. It's not like someone
has been sneaking in core networking bits behind your back.

>> but the patches have been discussed for 6-9 months on the io_uring
>> list.
> 
> You may need explain to me how that's relevant :) 
> The point is the networking maintainers have not seen it.

Sorry, should've expanded on that. It's been discussed on that list for
a while, and since it was just an io_uring feature, I guess nobody
considered that it would've been great to have netdev take a look at it.
For me personally, not because I think networking need to sign off on
it, but because if it is missing using some tunables that are relevant
for NAPI outside of io_uring, we of course need to be consistent there.

>> Which constants are you referring to? Only odd one I see is
>> NAPI_TIMEOUT, other ones are using the sysctl bits. If we're
>> missing something here, do speak up and we'll make sure it's
>> consistent with the regular NAPI.
> 
> SO_BUSY_POLL_BUDGET, 8 is quite low for many practical uses.

OK. I've just started doing some network tuning, but haven't gotten
around to specifically targeting NAPI just yet. Will do so soon.

> I'd also like to have a conversation about continuing to use
> the socket as a proxy for NAPI_ID, NAPI_ID is exposed to user
> space now. io_uring being a new interface I wonder if it's not 
> better to let the user specify the request parameters directly.

Definitely open to something that makes more sense, given we don't have
to shoehorn things through the regular API for NAPI with io_uring.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-26 20:57       ` Jens Axboe
@ 2022-03-26 21:06         ` Jens Axboe
  2022-03-26 21:30           ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Jens Axboe @ 2022-03-26 21:06 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Linus Torvalds, io-uring, Olivier Langlois

On 3/26/22 2:57 PM, Jens Axboe wrote:
>> I'd also like to have a conversation about continuing to use
>> the socket as a proxy for NAPI_ID, NAPI_ID is exposed to user
>> space now. io_uring being a new interface I wonder if it's not 
>> better to let the user specify the request parameters directly.
> 
> Definitely open to something that makes more sense, given we don't
> have to shoehorn things through the regular API for NAPI with
> io_uring.

The most appropriate is probably to add a way to get/set NAPI settings
on a per-io_uring basis, eg through io_uring_register(2). It's a bit
more difficult if they have to be per-socket, as the polling happens off
what would normally be the event wait path.

What did you have in mind?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-26 21:06         ` Jens Axboe
@ 2022-03-26 21:30           ` Jakub Kicinski
  2022-03-30 23:30             ` Jakub Kicinski
                               ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Jakub Kicinski @ 2022-03-26 21:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, io-uring, Olivier Langlois

On Sat, 26 Mar 2022 15:06:40 -0600 Jens Axboe wrote:
> On 3/26/22 2:57 PM, Jens Axboe wrote:
> >> I'd also like to have a conversation about continuing to use
> >> the socket as a proxy for NAPI_ID, NAPI_ID is exposed to user
> >> space now. io_uring being a new interface I wonder if it's not 
> >> better to let the user specify the request parameters directly.  
> > 
> > Definitely open to something that makes more sense, given we don't
> > have to shoehorn things through the regular API for NAPI with
> > io_uring.  
> 
> The most appropriate is probably to add a way to get/set NAPI settings
> on a per-io_uring basis, eg through io_uring_register(2). It's a bit
> more difficult if they have to be per-socket, as the polling happens off
> what would normally be the event wait path.
> 
> What did you have in mind?

Not sure I fully comprehend what the current code does. IIUC it uses
the socket and the caches its napi_id, presumably because it doesn't
want to hold a reference on the socket?

This may give the user a false impression that the polling follows 
the socket. NAPIs may get reshuffled underneath on pretty random
reconfiguration / recovery events (random == driver dependent).

I'm not entirely clear how the thing is supposed to be used with TCP
socket, as from a quick grep it appears that listening sockets don't
get napi_id marked at all.

The commit mentions a UDP benchmark, Olivier can you point me to more
info on the use case? I'm mostly familiar with NAPI busy poll with XDP
sockets, where it's pretty obvious.

My immediate reaction is that we should either explicitly call out NAPI
instances by id in uAPI, or make sure we follow the socket in every
case. Also we can probably figure out an easy way of avoiding the hash
table lookups and cache a pointer to the NAPI struct.

In any case, let's look in detail on Monday :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-26 21:30           ` Jakub Kicinski
@ 2022-03-30 23:30             ` Jakub Kicinski
  2022-03-31  0:44               ` Jens Axboe
  2022-06-01  6:59             ` Olivier Langlois
  2022-06-01  8:01             ` [GIT PULL] io_uring updates for 5.18-rc1 Olivier Langlois
  2 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2022-03-30 23:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, io-uring, Olivier Langlois

On Sat, 26 Mar 2022 14:30:49 -0700 Jakub Kicinski wrote:
> In any case, let's look in detail on Monday :)

I have too many questions, best if we discuss this over the code.

Can we get the change reverted and cross-posted to netdev?

Thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-30 23:30             ` Jakub Kicinski
@ 2022-03-31  0:44               ` Jens Axboe
  0 siblings, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2022-03-31  0:44 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Linus Torvalds, io-uring, Olivier Langlois

On 3/30/22 5:30 PM, Jakub Kicinski wrote:
> On Sat, 26 Mar 2022 14:30:49 -0700 Jakub Kicinski wrote:
>> In any case, let's look in detail on Monday :)
> 
> I have too many questions, best if we discuss this over the code.
> 
> Can we get the change reverted and cross-posted to netdev?

I'm not going to revert this upfront, but I can certainly start
a discussion on netdev so we can poke at it. If we end up
deciding that the API definitely needs to change, then by all
means we'll revert it before 5.18-final. So far it all seems a
bit handwavy to me, so let's discuss it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-26 19:28 ` Jakub Kicinski
  2022-03-26 19:47   ` Jens Axboe
@ 2022-06-01  6:58   ` Olivier Langlois
  2022-06-01 17:04     ` Jakub Kicinski
  1 sibling, 1 reply; 24+ messages in thread
From: Olivier Langlois @ 2022-06-01  6:58 UTC (permalink / raw)
  To: Jakub Kicinski, Jens Axboe; +Cc: Linus Torvalds, io-uring

On Sat, 2022-03-26 at 12:28 -0700, Jakub Kicinski wrote:
> On Fri, 18 Mar 2022 15:59:16 -0600 Jens Axboe wrote:
> > - Support for NAPI on sockets (Olivier)
> 
> Were we CCed on these patches? I don't remember seeing them, 
> and looks like looks like it's inventing it's own constants
> instead of using the config APIs we have.

Jakub,

No, you werent CCed on those patches. No one, including myself, did
think that you would want to review code not in your subsystem and
running scripts/get_maintainer.pl on the patch did not list your group.
If not receiving an invitation did upset you, I am sorry. This was not
intentional

Can you be more explicit about the constants and the config API you are
refering to?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-26 20:06     ` Jakub Kicinski
  2022-03-26 20:57       ` Jens Axboe
@ 2022-06-01  6:58       ` Olivier Langlois
  1 sibling, 0 replies; 24+ messages in thread
From: Olivier Langlois @ 2022-06-01  6:58 UTC (permalink / raw)
  To: Jakub Kicinski, Jens Axboe; +Cc: Linus Torvalds, io-uring

On Sat, 2022-03-26 at 13:06 -0700, Jakub Kicinski wrote:
> On Sat, 26 Mar 2022 13:47:24 -0600 Jens Axboe wrote:
> 
> > Which constants are you referring to? Only odd one I see is
> > NAPI_TIMEOUT, other ones are using the sysctl bits. If we're
> > missing something here, do speak up and we'll make sure it's
> > consistent with the regular NAPI.
> 
> SO_BUSY_POLL_BUDGET, 8 is quite low for many practical uses.
> I'd also like to have a conversation about continuing to use
> the socket as a proxy for NAPI_ID, NAPI_ID is exposed to user
> space now. io_uring being a new interface I wonder if it's not 
> better to let the user specify the request parameters directly.
> 
> 
My napi busy poll integration is strongly inspired from epoll code as
its persistent context is much closer to the io_uring situation than
what select/poll code is doing.

For instance, BUSY_POLL_BUDGET constant is taken straight from epoll
code.

I am a little bit surprised about your questioning. If BUSY_POLL_BUDGET
is quite low for many practical uses, how is it ok to use for epoll
code?

If 8 is not a good default value, may I suggest that you change the
define value?

TBH, I didn't find the documentation about the busy poll budget
parameter so I did assume that existing code was doing the right
thing...

For your other suggestion, I do not think that it is a good idea to let
user specify the request napi ids to busy poll because it would make
the io_uring interface more complex without being even sure that this
is something that people want or need.

select/poll implementation examine each and every sockets on every call
and it can afford to do it since it is rebuilding the polling set every
time through sock_poll().

epoll code does not want to do that as it would defeat its purpose and
it relies on the busy poll global setting. Also, epoll code makes a
pretty bold assumption that its users desiring busy polling will be
willing to create an epoll set per receive queue and presumably run
each set in a dedicated thread.

In:
https://legacy.netdevconf.info/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf

Linux-4.12 changes:
epoll() support was added by Sridhar Samudrala and Alexander Duyck,
with the assumption that an application using epoll() and busy polling
would first make sure that it would classify sockets based on their
receive queue (NAPI ID), and use at least one epoll fd per receive
queue.

To me, this is a very big burden placed on the shoulders of their users
as not every application design can accomodate this requirement. For
instance, I have an Intel igb nic with 8 receive queues. I am not
running my app on a 24 cores machine where I can easily allocate 8
threads just for the networking I/O. I sincerely feel that epoll busy
poll integration has been specifically tailored for the patch authors
needs without all the usability concerns that you appear to have for
the io_uring implementation.

I went beyond what epoll offers by allowing the busy polling of several
receive queues from a single ring.

When I did mention my interest in a napi busy poll to the io_uring list
but I did not know how to proceed due to several unknown issues, Jens
did encourage to give it a shot and in that context, my design goal has
been to keep the initial implementation reasonable and simple.

One concession that could be done to address your concern, it is that
the socket receive queues added to the list of queues busy polled could
be further narrowed by using sk_can_busy_loop() instead of just
checking net_busy_loop_on().

Would that be a satisfactory compromise to you and your team?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-26 21:30           ` Jakub Kicinski
  2022-03-30 23:30             ` Jakub Kicinski
@ 2022-06-01  6:59             ` Olivier Langlois
  2022-06-01 16:24               ` Jakub Kicinski
  2022-06-01 18:09               ` Linus Torvalds
  2022-06-01  8:01             ` [GIT PULL] io_uring updates for 5.18-rc1 Olivier Langlois
  2 siblings, 2 replies; 24+ messages in thread
From: Olivier Langlois @ 2022-06-01  6:59 UTC (permalink / raw)
  To: Jakub Kicinski, Jens Axboe; +Cc: Linus Torvalds, io-uring

On Sat, 2022-03-26 at 14:30 -0700, Jakub Kicinski wrote:
> On Sat, 26 Mar 2022 15:06:40 -0600 Jens Axboe wrote:
> > On 3/26/22 2:57 PM, Jens Axboe wrote:
> > > > I'd also like to have a conversation about continuing to use
> > > > the socket as a proxy for NAPI_ID, NAPI_ID is exposed to user
> > > > space now. io_uring being a new interface I wonder if it's not 
> > > > better to let the user specify the request parameters
> > > > directly.  
> > > 
> > > Definitely open to something that makes more sense, given we
> > > don't
> > > have to shoehorn things through the regular API for NAPI with
> > > io_uring.  
> > 
> > The most appropriate is probably to add a way to get/set NAPI
> > settings
> > on a per-io_uring basis, eg through io_uring_register(2). It's a
> > bit
> > more difficult if they have to be per-socket, as the polling
> > happens off
> > what would normally be the event wait path.
> > 
> > What did you have in mind?
> 
> Not sure I fully comprehend what the current code does. IIUC it uses
> the socket and the caches its napi_id, presumably because it doesn't
> want to hold a reference on the socket?

Again, the io_uring napi busy_poll integration is strongly inspired
from epoll implementation which caches a single napi_id.

I guess that I did reverse engineer the rational justifying the epoll
design decisions.

If you were to busy poll receive queues for a socket set containing
hundreds of thousands of sockets, would you rather scan the whole
socket set to retrieve which queues to poll or simple iterate through a
list containing a dozen of so of ids?
> 
> This may give the user a false impression that the polling follows 
> the socket. NAPIs may get reshuffled underneath on pretty random
> reconfiguration / recovery events (random == driver dependent).

There is nothing random. When a socket is added to the poll set, its
receive queue is added to the short list of queues to poll.

A very common usage pattern among networking applications it is to
reinsert the socket into the polling set after each polling event. In
recognition to this pattern and to avoid allocating/deallocating memory
to modify the napi_id list all the time, each napi id is kept in the
list until a very long period of inactivity is reached where it is
finally removed to stop the receive queue busy polling.
> 
> I'm not entirely clear how the thing is supposed to be used with TCP
> socket, as from a quick grep it appears that listening sockets don't
> get napi_id marked at all.
> 
> The commit mentions a UDP benchmark, Olivier can you point me to more
> info on the use case? I'm mostly familiar with NAPI busy poll with
> XDP
> sockets, where it's pretty obvious.

https://github.com/lano1106/io_uring_udp_ping

IDK what else I can tell you. I choose to unit test the new feature
with an UDP app because it was the simplest setup for testing. AFAIK,
the ultimate goal of busy polling is to minimize latency in packets
reception and the NAPI busy polling code should not treat differently
packets whether they are UDP or TCP or whatever the type of frames the
NIC does receive...
> 
> My immediate reaction is that we should either explicitly call out
> NAPI
> instances by id in uAPI, or make sure we follow the socket in every
> case. Also we can probably figure out an easy way of avoiding the
> hash
> table lookups and cache a pointer to the NAPI struct.
> 
That is an interesting idea. If this is something that NAPI API would
offer, I would gladly use that to avoid the hash lookup but IMHO, I see
it as a very interesting improvement but hopefully this should not
block my patch...


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-03-26 21:30           ` Jakub Kicinski
  2022-03-30 23:30             ` Jakub Kicinski
  2022-06-01  6:59             ` Olivier Langlois
@ 2022-06-01  8:01             ` Olivier Langlois
  2 siblings, 0 replies; 24+ messages in thread
From: Olivier Langlois @ 2022-06-01  8:01 UTC (permalink / raw)
  To: Jakub Kicinski, Jens Axboe; +Cc: Linus Torvalds, io-uring

On Sat, 2022-03-26 at 14:30 -0700, Jakub Kicinski wrote:
> On Sat, 26 Mar 2022 15:06:40 -0600 Jens Axboe wrote:
> > On 3/26/22 2:57 PM, Jens Axboe wrote:
> > > > I'd also like to have a conversation about continuing to use
> > > > the socket as a proxy for NAPI_ID, NAPI_ID is exposed to user
> > > > space now. io_uring being a new interface I wonder if it's not 
> > > > better to let the user specify the request parameters
> > > > directly.  
> > > 
> > > Definitely open to something that makes more sense, given we
> > > don't
> > > have to shoehorn things through the regular API for NAPI with
> > > io_uring.  
> > 
> > The most appropriate is probably to add a way to get/set NAPI
> > settings
> > on a per-io_uring basis, eg through io_uring_register(2). It's a
> > bit
> > more difficult if they have to be per-socket, as the polling
> > happens off
> > what would normally be the event wait path.
> > 
> > What did you have in mind?
> 
> Not sure I fully comprehend what the current code does. IIUC it uses
> the socket and the caches its napi_id, presumably because it doesn't
> want to hold a reference on the socket?
> 
> This may give the user a false impression that the polling follows 
> the socket. NAPIs may get reshuffled underneath on pretty random
> reconfiguration / recovery events (random == driver dependent).
> 
> I'm not entirely clear how the thing is supposed to be used with TCP
> socket, as from a quick grep it appears that listening sockets don't
> get napi_id marked at all.
> 
> The commit mentions a UDP benchmark, Olivier can you point me to more
> info on the use case? I'm mostly familiar with NAPI busy poll with
> XDP
> sockets, where it's pretty obvious.
> 
> My immediate reaction is that we should either explicitly call out
> NAPI
> instances by id in uAPI, or make sure we follow the socket in every
> case. Also we can probably figure out an easy way of avoiding the
> hash
> table lookups and cache a pointer to the NAPI struct.
> 
> In any case, let's look in detail on Monday :)

On second reading of this email, my understanding of it has become
clearer.

1. epoll design is exposed to the same NAPIs reshuffling and it seems
to be resilient to that event
2. By assuming the NAPI functions manage well the case where they are
passed an expired NAPI id, io_uring integration would rapidly drop the
expired ids and start using the new ones.
3. With my igb nic, after 2 months of non-stop usage, I have never ever
seen napi ids change. The id values appear to solely depend on
CONFIG_NR_CPUS value
4. If random napi ids reshuffling is a thing, this kinda eliminate the
option of skipping the hash lookup by storing directly the pointer.
With the reshuffling thing, storing a pointer seems like a dangerous
proposal...

Greetings,

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01  6:59             ` Olivier Langlois
@ 2022-06-01 16:24               ` Jakub Kicinski
  2022-06-01 18:09               ` Linus Torvalds
  1 sibling, 0 replies; 24+ messages in thread
From: Jakub Kicinski @ 2022-06-01 16:24 UTC (permalink / raw)
  To: Olivier Langlois; +Cc: Jens Axboe, Linus Torvalds, io-uring

On Wed, 01 Jun 2022 02:59:12 -0400 Olivier Langlois wrote:
> > I'm not entirely clear how the thing is supposed to be used with TCP
> > socket, as from a quick grep it appears that listening sockets don't
> > get napi_id marked at all.
> > 
> > The commit mentions a UDP benchmark, Olivier can you point me to more
> > info on the use case? I'm mostly familiar with NAPI busy poll with
> > XDP
> > sockets, where it's pretty obvious.  
> 
> https://github.com/lano1106/io_uring_udp_ping
> 
> IDK what else I can tell you. I choose to unit test the new feature
> with an UDP app because it was the simplest setup for testing. AFAIK,
> the ultimate goal of busy polling is to minimize latency in packets
> reception and the NAPI busy polling code should not treat differently
> packets whether they are UDP or TCP or whatever the type of frames the
> NIC does receive...

IDK how you use the busy polling, so I'm asking you to describe what
your app does. You said elsewhere that you don't have dedicated thread
per queue so it's not a server app (polling for requests) but a client
app (polling for responses)?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01  6:58   ` Olivier Langlois
@ 2022-06-01 17:04     ` Jakub Kicinski
  0 siblings, 0 replies; 24+ messages in thread
From: Jakub Kicinski @ 2022-06-01 17:04 UTC (permalink / raw)
  To: Olivier Langlois; +Cc: Jens Axboe, Linus Torvalds, io-uring

On Wed, 01 Jun 2022 02:58:25 -0400 Olivier Langlois wrote:
> On Sat, 2022-03-26 at 12:28 -0700, Jakub Kicinski wrote:
> > On Fri, 18 Mar 2022 15:59:16 -0600 Jens Axboe wrote:  
> > > - Support for NAPI on sockets (Olivier)  
> > 
> > Were we CCed on these patches? I don't remember seeing them, 
> > and looks like looks like it's inventing it's own constants
> > instead of using the config APIs we have.  
> 
> No, you werent CCed on those patches. No one, including myself, did
> think that you would want to review code not in your subsystem and
> running scripts/get_maintainer.pl on the patch did not list your group.
> If not receiving an invitation did upset you, I am sorry. This was not
> intentional

All the talk about teams, groups and APIs. Like EXPORT_SYMBOL() is
supposed to be about crossing organizational boundaries.

Seems pretty obvious to me that netdev people and authors of the
original busy polling may have guidance and advice on networking uAPIs.

Maybe thinking more open and collaborative approach leads to better
results is naive. I'm certainly not trying to mark the territory or
anything like that.

> Can you be more explicit about the constants and the config API you are
> refering to?

I think I was talking about the setsockopts which can be used to modify
the polling parameters. I'm not 100% sure now, it was 2 months ago.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01  6:59             ` Olivier Langlois
  2022-06-01 16:24               ` Jakub Kicinski
@ 2022-06-01 18:09               ` Linus Torvalds
  2022-06-01 18:21                 ` Jens Axboe
  1 sibling, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2022-06-01 18:09 UTC (permalink / raw)
  To: Olivier Langlois; +Cc: Jakub Kicinski, Jens Axboe, io-uring

On Tue, May 31, 2022 at 11:59 PM Olivier Langlois
<[email protected]> wrote:
>
> Again, the io_uring napi busy_poll integration is strongly inspired
> from epoll implementation which caches a single napi_id.

Note that since epoll is the worst possible implementation of a
horribly bad idea, and one of the things I would really want people to
kill off, that "it's designed based on epoll" is about the worst
possible explanation fo anything at all.

Epoll is the CVS of kernel interfaces: look at it, cry, run away, and
try to avoid making that mistake ever again.

I'm looking forward to the day when we can just delete all epoll code,
but io_uring may be a making that even worse, in how it has then
exposed epoll as an io_uring operation. That was probably a *HORRIBLE*
mistake.

(For the two prime issues with epoll: epoll recursion and the
completely invalid expectations of what an "edge" in the edge
triggering is. But there are other mistakes in there, with the
lifetime of the epoll waitqueues having been nasty problems several
times, because of how it doesn't follow any of the normal poll()
rules, and made a mockery of any sane interfaces).

            Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01 18:09               ` Linus Torvalds
@ 2022-06-01 18:21                 ` Jens Axboe
  2022-06-01 18:28                   ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Jens Axboe @ 2022-06-01 18:21 UTC (permalink / raw)
  To: Linus Torvalds, Olivier Langlois; +Cc: Jakub Kicinski, io-uring

On 6/1/22 12:09 PM, Linus Torvalds wrote:
> I'm looking forward to the day when we can just delete all epoll code,
> but io_uring may be a making that even worse, in how it has then
> exposed epoll as an io_uring operation. That was probably a *HORRIBLE*
> mistake.

Of the added opcodes in io_uring, that one I'm actually certain never
ended up getting used. I see no reason why we can't just deprecate it
and eventually just wire it up to io_eopnotsupp().

IOW, that won't be the one holding us back killing epoll.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01 18:21                 ` Jens Axboe
@ 2022-06-01 18:28                   ` Linus Torvalds
  2022-06-01 18:34                     ` Jens Axboe
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2022-06-01 18:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Olivier Langlois, Jakub Kicinski, io-uring

On Wed, Jun 1, 2022 at 11:21 AM Jens Axboe <[email protected]> wrote:
>
> Of the added opcodes in io_uring, that one I'm actually certain never
> ended up getting used. I see no reason why we can't just deprecate it
> and eventually just wire it up to io_eopnotsupp().
>
> IOW, that won't be the one holding us back killing epoll.

That really would be lovely.

I think io_uring at least in theory might have the potential to _help_
kill epoll, since I suspect a lot of epoll users might well prefer
io_uring instead.

I say "in theory", because it does require that io_uring itself
doesn't keep any of the epoll code alive, but also because we've seen
over and over that people just don't migrate to newer interfaces
because it's just too much work and the old ones still work..

Of course, we haven't exactly helped things - right now the whole
EPOLL thing is "default y" and behind a EXPERT define, so people
aren't even asked if they want it. Because it used to be one of those
things everybody enabled because it was new and shiny and cool.

And sadly, there are a few things that epoll really shines at, so I
suspect that will never really change ;(

                  Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01 18:28                   ` Linus Torvalds
@ 2022-06-01 18:34                     ` Jens Axboe
  2022-06-01 18:52                       ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Jens Axboe @ 2022-06-01 18:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Olivier Langlois, Jakub Kicinski, io-uring

On 6/1/22 12:28 PM, Linus Torvalds wrote:
> On Wed, Jun 1, 2022 at 11:21 AM Jens Axboe <[email protected]> wrote:
>>
>> Of the added opcodes in io_uring, that one I'm actually certain never
>> ended up getting used. I see no reason why we can't just deprecate it
>> and eventually just wire it up to io_eopnotsupp().
>>
>> IOW, that won't be the one holding us back killing epoll.
> 
> That really would be lovely.
> 
> I think io_uring at least in theory might have the potential to _help_
> kill epoll, since I suspect a lot of epoll users might well prefer
> io_uring instead.
> 
> I say "in theory", because it does require that io_uring itself
> doesn't keep any of the epoll code alive, but also because we've seen
> over and over that people just don't migrate to newer interfaces
> because it's just too much work and the old ones still work..
> 
> Of course, we haven't exactly helped things - right now the whole
> EPOLL thing is "default y" and behind a EXPERT define, so people
> aren't even asked if they want it. Because it used to be one of those
> things everybody enabled because it was new and shiny and cool.
> 
> And sadly, there are a few things that epoll really shines at, so I
> suspect that will never really change ;(

I think there are two ways that io_uring can help kill epoll:

1) As a basic replacement as an event notifier. I'm not a huge fan of
   these conversions in general, as they just swap one readiness
   notifier for another one. Hence they don't end up taking full
   advantage of that io_uring has to offer. But they are easy and event
   libraries obviously often take this approach.

2) From scratch implementations or actual adoptions in applications will
   switch from an epoll driven readiness model to the io_uring
   completion model. These are the conversion that I am the most excited
   about, as the end up using the (imho) better model that io_uring has
   to offer.

But as a first step, let's just mark it deprecated with a pr_warn() for
5.20 and then plan to kill it off whenever a suitable amount of relases
have passed since that addition.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01 18:34                     ` Jens Axboe
@ 2022-06-01 18:52                       ` Linus Torvalds
  2022-06-01 19:10                         ` Jens Axboe
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2022-06-01 18:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Olivier Langlois, Jakub Kicinski, io-uring

On Wed, Jun 1, 2022 at 11:34 AM Jens Axboe <[email protected]> wrote:
>
> But as a first step, let's just mark it deprecated with a pr_warn() for
> 5.20 and then plan to kill it off whenever a suitable amount of relases
> have passed since that addition.

I'd love to, but it's not actually realistic as things stand now.
epoll() is used in a *lot* of random libraries. A "pr_warn()" would
just be senseless noise, I bet.

No, there's a reason that EPOLL is still there, still 'default y',
even though I dislike it and think it was a mistake, and we've had
several nasty bugs related to it over the years.

It really can be a very useful system call, it's just that it really
doesn't work the way the actual ->poll() interface was designed, and
it kind of hijacks it in ways that mostly work, but the have subtle
lifetime issues that you don't see with a regular select/poll because
those will always tear down the wait queues.

Realistically, the proper fix to epoll is likely to make it explicit,
and make files and drivers that want to support it have to actually
opt in. Because a lot of the problems have been due to epoll() looking
*exactly* like a regular poll/select to a driver or a filesystem, but
having those very subtle extended requirements.

(And no, the extended requirements aren't generally onerous, and
regular ->poll() works fine for 99% of all cases. It's just that
occasionally, special users are then fooled about special contexts).

In other words, it's a bit like our bad old days when "splice()" ended
up falling back to regular ->read()/->write() implementations with
set_fs(KERNEL_DS). Yes, that worked fine for 99% of all cases, and we
did it for years, but it also caused several really nasty issues for
when the read/write actor did something slightly unusual.

So I may dislike epoll quite intensely, but I don't think we can
*really* get rid of it. But we might be able to make it a bit more
controlled.

But so far every time it has caused issues, we've worked around it by
fixing it up in the particular driver or whatever that ended up being
triggered by epoll semantics.

                Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01 18:52                       ` Linus Torvalds
@ 2022-06-01 19:10                         ` Jens Axboe
  2022-06-01 19:20                           ` Linus Torvalds
  2022-08-16 15:53                           ` Deprecation of IORING_OP_EPOLL_CTL (Re: [GIT PULL] io_uring updates for 5.18-rc1) Stefan Metzmacher
  0 siblings, 2 replies; 24+ messages in thread
From: Jens Axboe @ 2022-06-01 19:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Olivier Langlois, Jakub Kicinski, io-uring

On 6/1/22 12:52 PM, Linus Torvalds wrote:
> On Wed, Jun 1, 2022 at 11:34 AM Jens Axboe <[email protected]> wrote:
>>
>> But as a first step, let's just mark it deprecated with a pr_warn() for
>> 5.20 and then plan to kill it off whenever a suitable amount of relases
>> have passed since that addition.
> 
> I'd love to, but it's not actually realistic as things stand now.
> epoll() is used in a *lot* of random libraries. A "pr_warn()" would
> just be senseless noise, I bet.

I mean only for the IORING_OP_EPOLL_CTL opcode, which is the only epoll
connection we have in there. It'd be jumping the gun to do it for the
epoll_ctl syscall for sure... And I really have no personal skin in that
game, other than having a better alternative. But that's obviously a
long pole type of deprecation.

> No, there's a reason that EPOLL is still there, still 'default y',
> even though I dislike it and think it was a mistake, and we've had
> several nasty bugs related to it over the years.
> 
> It really can be a very useful system call, it's just that it really
> doesn't work the way the actual ->poll() interface was designed, and
> it kind of hijacks it in ways that mostly work, but the have subtle
> lifetime issues that you don't see with a regular select/poll because
> those will always tear down the wait queues.
> 
> Realistically, the proper fix to epoll is likely to make it explicit,
> and make files and drivers that want to support it have to actually
> opt in. Because a lot of the problems have been due to epoll() looking
> *exactly* like a regular poll/select to a driver or a filesystem, but
> having those very subtle extended requirements.
> 
> (And no, the extended requirements aren't generally onerous, and
> regular ->poll() works fine for 99% of all cases. It's just that
> occasionally, special users are then fooled about special contexts).

It's not an uncommon approach to make the initial adoption /
implementation more palatable, though commonly then also ends up being a
mistake. I've certainly been guilty of that myself too...

> In other words, it's a bit like our bad old days when "splice()" ended
> up falling back to regular ->read()/->write() implementations with
> set_fs(KERNEL_DS). Yes, that worked fine for 99% of all cases, and we
> did it for years, but it also caused several really nasty issues for
> when the read/write actor did something slightly unusual.

Unfortunately that particular change I just had to deal with, and
noticed that we're up to more than two handfuls of fixes for that and I
bet we're not done. Not saying it wasn't the right choice in terms of
sanity, but it has been more painful than I thought it would be.

> So I may dislike epoll quite intensely, but I don't think we can
> *really* get rid of it. But we might be able to make it a bit more
> controlled.
> 
> But so far every time it has caused issues, we've worked around it by
> fixing it up in the particular driver or whatever that ended up being
> triggered by epoll semantics.

The io_uring side of the epoll management I'm very sure can go in a few
releases, and a pr_warn_once() for 5.20 is the right choice. epoll
itself, probably not even down the line, though I am hoping we can
continue to move people off of it. Maybe in another 20 years :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [GIT PULL] io_uring updates for 5.18-rc1
  2022-06-01 19:10                         ` Jens Axboe
@ 2022-06-01 19:20                           ` Linus Torvalds
  2022-08-16 15:53                           ` Deprecation of IORING_OP_EPOLL_CTL (Re: [GIT PULL] io_uring updates for 5.18-rc1) Stefan Metzmacher
  1 sibling, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2022-06-01 19:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Olivier Langlois, Jakub Kicinski, io-uring

On Wed, Jun 1, 2022 at 12:10 PM Jens Axboe <[email protected]> wrote:
>
> I mean only for the IORING_OP_EPOLL_CTL opcode, which is the only epoll
> connection we have in there.

Ok, that removal sounds fine to me. Thanks.

             Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Deprecation of IORING_OP_EPOLL_CTL (Re: [GIT PULL] io_uring updates for 5.18-rc1)
  2022-06-01 19:10                         ` Jens Axboe
  2022-06-01 19:20                           ` Linus Torvalds
@ 2022-08-16 15:53                           ` Stefan Metzmacher
  1 sibling, 0 replies; 24+ messages in thread
From: Stefan Metzmacher @ 2022-08-16 15:53 UTC (permalink / raw)
  To: Jens Axboe, Linus Torvalds; +Cc: Olivier Langlois, Jakub Kicinski, io-uring


Hi Jens, hi Linus,

I just noticed this commit:

commit 61a2732af4b0337f7e36093612c846e9f5962965
Author: Jens Axboe <[email protected]>
Date:   Wed Jun 1 12:36:42 2022 -0600

     io_uring: deprecate epoll_ctl support

     As far as we know, nobody ever adopted the epoll_ctl management via
     io_uring. Deprecate it now with a warning, and plan on removing it in
     a later kernel version. When we do remove it, we can revert the following
     commits as well:

     39220e8d4a2a ("eventpoll: support non-blocking do_epoll_ctl() calls")
     58e41a44c488 ("eventpoll: abstract out epoll_ctl() handler")

     Suggested-by: Linus Torvalds <[email protected]>
     Link: https://lore.kernel.org/io-uring/CAHk-=wiTyisXBgKnVHAGYCNvkmjk=50agS2Uk6nr+n3ssLZg2w@mail.gmail.com/
     Signed-off-by: Jens Axboe <[email protected]>

>> On Wed, Jun 1, 2022 at 11:34 AM Jens Axboe <[email protected]> wrote:
>>>
>>> But as a first step, let's just mark it deprecated with a pr_warn() for
>>> 5.20 and then plan to kill it off whenever a suitable amount of relases
>>> have passed since that addition.
>>
>> I'd love to, but it's not actually realistic as things stand now.
>> epoll() is used in a *lot* of random libraries. A "pr_warn()" would
>> just be senseless noise, I bet.
> 
> I mean only for the IORING_OP_EPOLL_CTL opcode, which is the only epoll
> connection we have in there.

+       pr_warn_once("%s: epoll_ctl support in io_uring is deprecated and will "
+                    "be removed in a future Linux kernel version.\n",
+                    current->comm);

I don't think application writer will ever notice such warning in log files.

Wouldn't it be better to rename IORING_OP_EPOLL_CTL to IORING_OP_EPOLL_CTL_DEPRECATED,
so that developers notice the problem and can complain, because their application
doesn't compile.

As we don't know about any application using it, most likely nobody will ever notice that rename.

If I haven't noticed that commit, by reading
https://lwn.net/SubscriberLink/903487/5fddc7bb8e3bdcdd/
I may have started to use it in future within Samba's libtevent.

While researching on it I also noticed that
.prep = io_eopnotsupp_prep, is not paired with .not_supported = 1;
so that io_probe() will still report IO_URING_OP_SUPPORTED.
I think io_uring_optable_init() should assert that both are in sync.

>> No, there's a reason that EPOLL is still there, still 'default y',
>> even though I dislike it and think it was a mistake, and we've had
>> several nasty bugs related to it over the years.
>>
>> It really can be a very useful system call, it's just that it really
>> doesn't work the way the actual ->poll() interface was designed, and
>> it kind of hijacks it in ways that mostly work, but the have subtle
>> lifetime issues that you don't see with a regular select/poll because
>> those will always tear down the wait queues.
>>
>> Realistically, the proper fix to epoll is likely to make it explicit,
>> and make files and drivers that want to support it have to actually
>> opt in. Because a lot of the problems have been due to epoll() looking
>> *exactly* like a regular poll/select to a driver or a filesystem, but
>> having those very subtle extended requirements.
>>
>> (And no, the extended requirements aren't generally onerous, and
>> regular ->poll() works fine for 99% of all cases. It's just that
>> occasionally, special users are then fooled about special contexts).

Currently we're using epoll_wait() as central blocking syscall,
we use it with level triggering and only ask for a single
struct epoll_event at a time. And we use the raw memory pointer
values as epoll_event.data.ptr in order to find the related
in memory structure belonging to the event.

This works very nicely because if more than one file descriptor
is ready, calling epoll_wait() seems traverse through the ready list,
which generated fair results.

We avoid asking for more events in a single epoll_wait() calls,
because we often have multiple file descriptor belonging together
to a high level application request. And the event handler for
one file descriptor may close another file descriptor of changes it
in some other way, which means the information of a 2nd epoll_event
from a single epoll_wait(), is not unlikely be stale.

So for userspace epoll seems to be a very useful interface
(at least with level based triggering, I never understood edge-triggering).

Are the epoll problems you were discussion mostly about edge-triggering
and/or only kernel internals are also about the userspace intercace?

I'm now wondering how we would be able to move to
io_uring_enter() as central blocking syscall.

Over time we'll try to use native calls like, IORING_OP_SENDMSG
and others. But we'll ever have to support poll based notifications
in order to try wait for file descriptor state changes.

So I looked at IORING_OP_POLL_ADD/REMOVE and noticed the recent
addition of IORING_POLL_ADD_LEVEL.

But I'm having a hard time understanding how this will work
at runtime.

I'm seeing some interaction with task work and vfs_poll() but I'm
not really sure what it means in the end.

Taskwork (without IORING_SETUP_DEFER_TASKRUN) seems to run after each syscall.

In case I did 10.000 IORING_OP_POLL_ADD calls to monitor 10.000 file descriptor,
does it means after each syscall there's a loop of 10.000 vfs_poll() calls?

What is the event that lets IORING_POLL_ADD_LEVEL trigger the next cqe?

And with IORING_POLL_ADD_LEVEL does it means that each loop will try to post
10.000 cqes (over and over again), if all 10.000 file descriptors are ready and would state that way.

This would likely overflow the cqe array without a chance to recover itself.
Also the information in the already posted cqes might be already out of date
when they arrive in the application. While with epoll_wait(maxevents=1) I'm sure
the information of the single event I'm extracting is recent.

Dealing with cqe overflows it seems to be very nasty.
And from userspace it seems to be easier to take single fd from epoll_create()
and monitor that with IORING_OP_POLL_ADD and use IORING_OP_EPOLL_CTL to do
modification while still calling epoll_wait(maxevents=1, timwout=0) to get just
one event.

You see that I'm very likely (hopefully) have the wrong picture in mind how
epoll_wait could be avoided, so it would very nice if someone could explain
it to me.

I need to make the transition io_uring_enter() as central syscall first
without any native IORING_OP_* calls, as there's no way I can convert all Samba
code at once, and I also don't intend to try it at all for most parts, as we need
to stay portable to platforms like freebsd and aix where we only have plain poll().
So any hints how to do that plain replacement would be nice.

When that's done I can start to convert performance critical in the fileserver
to native IORING_OP_* calls.

Sorry for the long mail and thanks for any comments in advance!
metze

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-08-16 15:55 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-03-18 21:59 [GIT PULL] io_uring updates for 5.18-rc1 Jens Axboe
2022-03-22  0:25 ` pr-tracker-bot
2022-03-26 19:28 ` Jakub Kicinski
2022-03-26 19:47   ` Jens Axboe
2022-03-26 20:06     ` Jakub Kicinski
2022-03-26 20:57       ` Jens Axboe
2022-03-26 21:06         ` Jens Axboe
2022-03-26 21:30           ` Jakub Kicinski
2022-03-30 23:30             ` Jakub Kicinski
2022-03-31  0:44               ` Jens Axboe
2022-06-01  6:59             ` Olivier Langlois
2022-06-01 16:24               ` Jakub Kicinski
2022-06-01 18:09               ` Linus Torvalds
2022-06-01 18:21                 ` Jens Axboe
2022-06-01 18:28                   ` Linus Torvalds
2022-06-01 18:34                     ` Jens Axboe
2022-06-01 18:52                       ` Linus Torvalds
2022-06-01 19:10                         ` Jens Axboe
2022-06-01 19:20                           ` Linus Torvalds
2022-08-16 15:53                           ` Deprecation of IORING_OP_EPOLL_CTL (Re: [GIT PULL] io_uring updates for 5.18-rc1) Stefan Metzmacher
2022-06-01  8:01             ` [GIT PULL] io_uring updates for 5.18-rc1 Olivier Langlois
2022-06-01  6:58       ` Olivier Langlois
2022-06-01  6:58   ` Olivier Langlois
2022-06-01 17:04     ` Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox