Problems replacing epoll with io

public inbox for [email protected]
 help / color / mirror / Atom feed

* Problems replacing epoll with io_uring in tevent
@ 2022-10-18 14:42 Stefan Metzmacher
  2022-10-26 16:00 ` Stefan Metzmacher
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Metzmacher @ 2022-10-18 14:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Linus Torvalds, Samba Technical

Hi Jens,

here's first summary of the problems I hit when trying to
add an io_uring backend to Samba's libtevent.

BTW: It would be nice to get some feedback to my mail from August 16th 2022:
"Deprecation of IORING_OP_EPOLL_CTL (Re: [GIT PULL] io_uring updates for 5.18-rc1)"
https://lore.kernel.org/io-uring/[email protected]/
@Linus, that's basically the reason I cc'ed you...

First an overview of what features tevent needs from the os and
provides for its consumers, then how I tried to use io_uring, followed by the problems I hit):
(skip to 9. if you just want to see the problems)

1. tevent is basically looping around a tevent_loop_once() function,
    which will use epoll_wait() or poll() as blocking function in the backend.
    It only invokes a single event handler, which is very important for us
    as it avoids a whole class of use after free problem we had in the earlier days.

    There's a tevent_loop_wait() wrapper, which will typically called
    by main() in order to have an endless server loop.

2. tevent has support for the following event handlers:
    - 'immediate': It's a way to call something directly in the next
                   tevent_loop_once() iteration.
    - 'timer':     It's a way to call something at a specific time
    - 'signal':    It's a way to call something when a signal e.g. SIGHUP arrived
    - 'fd':        It's a way to get notified with TEVENT_FD_READ and/or TEVENT_FD_WRITE
                   on a given file descriptor

    'immediate', 'timer' and 'signal' events are handled in the core tevent code
    and only if none of them is ready the backend is called to wait for fd events
    to get ready. The backend is passed the timeout for the next timer.

3. 'fd' events operate with the following properties:
    a) level triggering mode: TEVENT_FD_READ/TEVENT_FD_WRITE are delivered over and over again,
       if the handler doesn't consume or send data, you would get 100% cpu spinning, but you can't
       miss any event (which could happen with edge triggering)
    b) all registered fd handlers are called in a fair fashion, they are part of a linked list
       and rotated to the end after each invocation.
    c) as written above only a single fd event is reported per tevent_loop_once()
    d) when the file descriptor is closed using close() the event handler will no longer trigger
    e) we allow separate handlers for TEVENT_FD_READ and TEVENT_FD_WRITE for the same file descriptor
    f) there's a hook into the backend to set/clear the TEVENT_FD_READ/TEVENT_FD_WRITE on
       an existing event handler state (tevent_fd_get/set_flags())

4. A single process may have more than a single central/global tevent context instance.
    a) Separate instances might be allocated and may registered the same file descriptors as
       other instances.
    b) The individual instances might be used just temporary or never while
       they may be allocated for a long time. It means that tevent_loop_once() may not be called
       for a long time.

5. On linux we use epoll:
    a) We use epoll_wait() with maxevents=1, in order to avoid stale epoll_event.user_data,
       as the fd event handler for one fd may change/remove the state of another one.

    b) When we get EEXIST from EPOLL_CTL_ADD, we merge/dispatch multiple events for the same
       file descriptor in user space.

    c) Without epoll, we use poll() generating the pollfd array and dispatch based on the
       sorted/rotated list.

With that background I created an io_uring based backend. I modeled it like this:

6. Data model:
    a) Per tevent_context I have private struct samba_io_uring, wrapping struct io_uring and a list of
       samba_io_uring_submission structures.

    b) struct samba_io_uring_completion basically maps 'uint64_t user_data' by using its own pointer
       address to a callback function pointer and a private callback argument.

    c) struct samba_io_uring_submission basically wraps struct io_uring_sqe and has a pointer
       a struct samba_io_uring_completion in order to fill sqe.user_data.

7. The core tevent_loop_once() logic is this:

    a) loop over the queued samba_io_uring->submissions and move them into lowlevel
       kernel ring based on io_uring_sq_space_left/io_uring_get_sqe
       (we also call samba_io_uring_submission->submission_fn() in order to
       have a way to capture a timestamp for profiling or have a last chance
       to cancel the submission)

    b) call io_uring_submit_and_wait_timeout() waiting for 1 cqe or a timeout

    c) for the resulting cqe we lookup the samba_io_uring_completion and call
       samba_io_uring_completion->completion_fn()

    This loop will also allow generic io_uring operations like IORING_OP_SENDMSG
    and others (unrelated to what tevent normally provides). This will be used
    in order to improve the performance is performance critical code with io_uring
    aware code.

8. The mapping of fd events to IORING_OP_POLL_ADD/REMOVE

    By default we won't have io_uring aware code as most of
    it is not performance critical and we want to be portable also
    to non linux environments. So as a first step everything still needs
    to function while just exchanging the tevent backend.

    a) In order to represent what all kernels provide I only
       used the basic IORING_OP_POLL_ADD/REMOVE (without any update)

    b) In order to provide the level triggered behavior outlined in (3.a above)
       we need to loop over IORING_OP_POLL_ADD operations.
       If the requested poll mask is already ready, IORING_OP_POLL_ADD returns
       immediately. If it's not ready edge triggering (EPOLLET) will trigger a single
       (EPOLLONESHOT) completion. But as we call IORING_OP_POLL_ADD again,
       we'll get the effective level triggering, which we need.

    c) IORING_OP_POLL_ADD is queued into samba_io_uring->submissions
       while 7.a will construct the final sqe via the submission_fn
       reflecting the current POLLIN/OUT flags.

    d) When the TEVENT_FD_READ/WRITE flags change (via tevent_fd_set_flags()
       we may change the pending submission (before submission_fn() was called)

    e) If IORING_OP_POLL_ADD is already pending in the kernel
       we use IORING_OP_POLL_REMOVE to remove it hard linked
       with a new IORING_OP_POLL_ADD representing the current POLLIN/OUT flags.
       For that to work reliable I toggle between 2 IORING_OP_POLL_ADD completions.

9. The above works mostly, but manual testing and our massive automated regression tests
    found the following problems:

    a) Related to https://github.com/axboe/liburing/issues/684 I was also wondering
       about the return value of io_uring_submit_and_wait_timeout(),
       but in addition I noticed that the timeout parameter doesn't work
       as expected, the function will wait for two times of the timeout value.
       I hacked a fix here:
       https://git.samba.org/?p=metze/samba/wip.git;a=commitdiff;h=06fec644dd9f5748952c8b875878e0e1b0000d33

    b) The major show stopper is that IORING_OP_POLL_ADD calls fget(), while
       it's pending. Which means that a close() on the related file descriptor
       is not able to remove the last reference! This is a problem for points 3.d,
       4.a and 4.b from above.

       I doubt IORING_ASYNC_CANCEL_FD would be able to be used as there's not always
       code being triggered around a raw close() syscall, which could do a sync cancel.

       For now I plan to epoll_ctl (or IORING_OP_EPOLL_CTL) and only
       register the fd from epoll_create() with IORING_OP_POLL_ADD
       or I keep epoll_wait() as blocking call and register the io_uring fd
       with epoll.

       I looked at the related epoll code and found that it uses
       a list in struct file->f_ep to keep the reference, which gets
       detached also via eventpoll_release_file() called from __fput()

       Would it be possible move IORING_OP_POLL_ADD to use a similar model
       so that close() will causes a cqe with -ECANCELED?

    c) A simple pipe based performance test shows the following numbers:
       - 'poll':               Got 232387.31 pipe events/sec
       - 'epoll':              Got 251125.25 pipe events/sec
       - 'samba_io_uring_ev':  Got 210998.77 pipe events/sec
       So the io_uring backend is even slower than the 'poll' backend.
       I guess the reason is the constant re-submission of IORING_OP_POLL_ADD.
       My hope would be that IORING_POLL_ADD_MULTI + IORING_POLL_ADD_LEVEL
       would be able to avoid the performance problem with samba_io_uring_ev
       compared to epoll.

       I looked at how epoll implements level triggered notifications:
       The key is that is maintains two logical lists:
       - interest list with all registered file descriptor in the "epoll set"
         each registration is also registered into the file's waitqueue via
         init_poll_funcptr() -> vfs_poll() -> [sock_]poll_wait()
       - ready list, this is filled by the callback passed to init_poll_funcptr(),
         which is triggered when there's an "edge"/state change on the fd.
       The thing is that epoll_wait() fills the passed epoll_event array
       by traversing the ready list. For each element in the ready list
       we call vfs_poll() to re-check the most recent state before putting
       in the result array. For level triggered registrations the entry is move
       to the end of the ready list in order to provide fair results, otherwise
       the entry is remove from the ready list, for one shot entries it's also
       removed from the interest list.

       In order to implement level triggering in io_uring
       we would need to have some kind of ready list and have a way
       to let the caller configure a number of cqe's which should be generated
       during a single io_uring_enter() syscall based on the ready list,
       without such a value we'd constantly overflow the cqe array.

       As a site note the problem with your IORING_POLL_ADD_LEVEL was this:
       - IORING_OP_POLL_ADD does check the current value with vfs_poll,
         but it never triggers the io_poll_can_finish_inline case,
         so it will *always* wait for the next edge triggering to happen
         in the background.

         So it means it's move a deferred edge triggering and has nothing to
         do with level triggering (see 3.a).

         Even if I allow the io_poll_can_finish_inline case, I don't get
         level triggering, I tried it with there commits:
         https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=4f89a3fb02c1e4ea4650ea6f9fa9fd642453d2b2
         https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=4d10a69d9925f546214f9437aef424bade9c5aaa
         https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=306e743af055fea105df792c2756a0a81a95871a
         setting the io_uring_poll_always_finish_now option to true...


As summary I think 9.a (io_uring_submit_and_wait_timeout) should be trivial to fix.

In order to get the best performance it would be great to get 9.b and (most likely) 9.c
addressed.

Sorry for the long mail, but I hope we can figure out how to move forward.

Thanks!
metze

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2022-10-18 14:42 Problems replacing epoll with io_uring in tevent Stefan Metzmacher
@ 2022-10-26 16:00 ` Stefan Metzmacher
  2022-10-26 17:08   ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Metzmacher @ 2022-10-26 16:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Linus Torvalds, Samba Technical

Hi Jens,

> 9. The above works mostly, but manual testing and our massive automated regression tests
>     found the following problems:
> 
>     a) Related to https://github.com/axboe/liburing/issues/684 I was also wondering
>        about the return value of io_uring_submit_and_wait_timeout(),
>        but in addition I noticed that the timeout parameter doesn't work
>        as expected, the function will wait for two times of the timeout value.
>        I hacked a fix here:
>        https://git.samba.org/?p=metze/samba/wip.git;a=commitdiff;h=06fec644dd9f5748952c8b875878e0e1b0000d33

Thanks for doing an upstream fix for the problem.

>     b) The major show stopper is that IORING_OP_POLL_ADD calls fget(), while
>        it's pending. Which means that a close() on the related file descriptor
>        is not able to remove the last reference! This is a problem for points 3.d,
>        4.a and 4.b from above.
> 
>        I doubt IORING_ASYNC_CANCEL_FD would be able to be used as there's not always
>        code being triggered around a raw close() syscall, which could do a sync cancel.
> 
>        For now I plan to epoll_ctl (or IORING_OP_EPOLL_CTL) and only
>        register the fd from epoll_create() with IORING_OP_POLL_ADD
>        or I keep epoll_wait() as blocking call and register the io_uring fd
>        with epoll.
> 
>        I looked at the related epoll code and found that it uses
>        a list in struct file->f_ep to keep the reference, which gets
>        detached also via eventpoll_release_file() called from __fput()
> 
>        Would it be possible move IORING_OP_POLL_ADD to use a similar model
>        so that close() will causes a cqe with -ECANCELED?

I'm currently trying to prototype for an IORING_POLL_CANCEL_ON_CLOSE
flag that can be passed to POLL_ADD. With that we'll register
the request in &req->file->f_uring_poll (similar to the file->f_ep list for epoll)
Then we only get a real reference to the file during the call to
vfs_poll() otherwise we drop the fget/fput reference and rely on
an io_uring_poll_release_file() (similar to eventpoll_release_file())
to cancel our registered poll request.

>     c) A simple pipe based performance test shows the following numbers:
>        - 'poll':               Got 232387.31 pipe events/sec
>        - 'epoll':              Got 251125.25 pipe events/sec
>        - 'samba_io_uring_ev':  Got 210998.77 pipe events/sec
>        So the io_uring backend is even slower than the 'poll' backend.
>        I guess the reason is the constant re-submission of IORING_OP_POLL_ADD.

Added some feature autodetection today and I'm now using
IORING_SETUP_COOP_TASKRUN, IORING_SETUP_TASKRUN_FLAG,
IORING_SETUP_SINGLE_ISSUER and IORING_SETUP_DEFER_TASKRUN if supported
by the kernel.

On a 6.1 kernel this improved the performance a lot, it's now faster
than the epoll backend.

The key flag is IORING_SETUP_DEFER_TASKRUN. On a different system than above
I'm getting the following numbers:
- epoll:                                    Got 114450.16 pipe events/sec
- poll:                                     Got 105872.52 pipe events/sec
- samba_io_uring_ev-without-defer_taskrun': Got  95564.22 pipe events/sec
- samba_io_uring_ev-with-defer_taskrun':    Got 122853.85 pipe events/sec

>        My hope would be that IORING_POLL_ADD_MULTI + IORING_POLL_ADD_LEVEL
>        would be able to avoid the performance problem with samba_io_uring_ev
>        compared to epoll.

I've started with a IORING_POLL_ADD_MULTI + IORING_POLL_ADD_LEVEL prototype,
but it's not very far yet and due to the IORING_SETUP_DEFER_TASKRUN
speedup, I'll postpone working on IORING_POLL_ADD_LEVEL.

metze


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2022-10-26 16:00 ` Stefan Metzmacher
@ 2022-10-26 17:08   ` Jens Axboe
  2022-10-26 17:41     ` Pavel Begunkov
                       ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Jens Axboe @ 2022-10-26 17:08 UTC (permalink / raw)
  To: Stefan Metzmacher; +Cc: io-uring, Linus Torvalds, Samba Technical

On 10/26/22 10:00 AM, Stefan Metzmacher wrote:
> Hi Jens,
> 
>> 9. The above works mostly, but manual testing and our massive automated regression tests
>>     found the following problems:
>>
>>     a) Related to https://github.com/axboe/liburing/issues/684 I was also wondering
>>        about the return value of io_uring_submit_and_wait_timeout(),
>>        but in addition I noticed that the timeout parameter doesn't work
>>        as expected, the function will wait for two times of the timeout value.
>>        I hacked a fix here:
>>        https://git.samba.org/?p=metze/samba/wip.git;a=commitdiff;h=06fec644dd9f5748952c8b875878e0e1b0000d33
> 
> Thanks for doing an upstream fix for the problem.

No problem - have you been able to test the current repo in general? I want to
cut a 2.3 release shortly, but since that particular change impacts any kind of
cqe waiting, would be nice to have a bit more confidence in it.

>>     b) The major show stopper is that IORING_OP_POLL_ADD calls fget(), while
>>        it's pending. Which means that a close() on the related file descriptor
>>        is not able to remove the last reference! This is a problem for points 3.d,
>>        4.a and 4.b from above.
>>
>>        I doubt IORING_ASYNC_CANCEL_FD would be able to be used as there's not always
>>        code being triggered around a raw close() syscall, which could do a sync cancel.
>>
>>        For now I plan to epoll_ctl (or IORING_OP_EPOLL_CTL) and only
>>        register the fd from epoll_create() with IORING_OP_POLL_ADD
>>        or I keep epoll_wait() as blocking call and register the io_uring fd
>>        with epoll.
>>
>>        I looked at the related epoll code and found that it uses
>>        a list in struct file->f_ep to keep the reference, which gets
>>        detached also via eventpoll_release_file() called from __fput()
>>
>>        Would it be possible move IORING_OP_POLL_ADD to use a similar model
>>        so that close() will causes a cqe with -ECANCELED?
> 
> I'm currently trying to prototype for an IORING_POLL_CANCEL_ON_CLOSE
> flag that can be passed to POLL_ADD. With that we'll register
> the request in &req->file->f_uring_poll (similar to the file->f_ep list for epoll)
> Then we only get a real reference to the file during the call to
> vfs_poll() otherwise we drop the fget/fput reference and rely on
> an io_uring_poll_release_file() (similar to eventpoll_release_file())
> to cancel our registered poll request.

Yes, this is a bit tricky as we hold the file ref across the operation. I'd
be interested in seeing your approach to this, and also how it would
interact with registered files...

>>     c) A simple pipe based performance test shows the following numbers:
>>        - 'poll':               Got 232387.31 pipe events/sec
>>        - 'epoll':              Got 251125.25 pipe events/sec
>>        - 'samba_io_uring_ev':  Got 210998.77 pipe events/sec
>>        So the io_uring backend is even slower than the 'poll' backend.
>>        I guess the reason is the constant re-submission of IORING_OP_POLL_ADD.
> 
> Added some feature autodetection today and I'm now using
> IORING_SETUP_COOP_TASKRUN, IORING_SETUP_TASKRUN_FLAG,
> IORING_SETUP_SINGLE_ISSUER and IORING_SETUP_DEFER_TASKRUN if supported
> by the kernel.
> 
> On a 6.1 kernel this improved the performance a lot, it's now faster
> than the epoll backend.
> 
> The key flag is IORING_SETUP_DEFER_TASKRUN. On a different system than above
> I'm getting the following numbers:
> - epoll:                                    Got 114450.16 pipe events/sec
> - poll:                                     Got 105872.52 pipe events/sec
> - samba_io_uring_ev-without-defer_taskrun': Got  95564.22 pipe events/sec
> - samba_io_uring_ev-with-defer_taskrun':    Got 122853.85 pipe events/sec

Any chance you can do a run with just IORING_SETUP_COOP_TASKRUN set? I'm
curious how big of an impact the IPI elimination is, where it slots in
compared to the defer taskrun and the default settings.

>>        My hope would be that IORING_POLL_ADD_MULTI + IORING_POLL_ADD_LEVEL
>>        would be able to avoid the performance problem with samba_io_uring_ev
>>        compared to epoll.
> 
> I've started with a IORING_POLL_ADD_MULTI + IORING_POLL_ADD_LEVEL prototype,
> but it's not very far yet and due to the IORING_SETUP_DEFER_TASKRUN
> speedup, I'll postpone working on IORING_POLL_ADD_LEVEL.

OK

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2022-10-26 17:08   ` Jens Axboe
@ 2022-10-26 17:41     ` Pavel Begunkov
  2022-10-27  8:18       ` Stefan Metzmacher
  2022-10-27  8:05     ` Stefan Metzmacher
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2022-10-26 17:41 UTC (permalink / raw)
  To: Jens Axboe, Stefan Metzmacher; +Cc: io-uring, Linus Torvalds, Samba Technical

On 10/26/22 18:08, Jens Axboe wrote:
> On 10/26/22 10:00 AM, Stefan Metzmacher wrote:
>> Hi Jens,
[...]
>>>      b) The major show stopper is that IORING_OP_POLL_ADD calls fget(), while
>>>         it's pending. Which means that a close() on the related file descriptor
>>>         is not able to remove the last reference! This is a problem for points 3.d,
>>>         4.a and 4.b from above.
>>>
>>>         I doubt IORING_ASYNC_CANCEL_FD would be able to be used as there's not always
>>>         code being triggered around a raw close() syscall, which could do a sync cancel.
>>>
>>>         For now I plan to epoll_ctl (or IORING_OP_EPOLL_CTL) and only
>>>         register the fd from epoll_create() with IORING_OP_POLL_ADD
>>>         or I keep epoll_wait() as blocking call and register the io_uring fd
>>>         with epoll.
>>>
>>>         I looked at the related epoll code and found that it uses
>>>         a list in struct file->f_ep to keep the reference, which gets
>>>         detached also via eventpoll_release_file() called from __fput()
>>>
>>>         Would it be possible move IORING_OP_POLL_ADD to use a similar model
>>>         so that close() will causes a cqe with -ECANCELED?
>>
>> I'm currently trying to prototype for an IORING_POLL_CANCEL_ON_CLOSE
>> flag that can be passed to POLL_ADD. With that we'll register
>> the request in &req->file->f_uring_poll (similar to the file->f_ep list for epoll)
>> Then we only get a real reference to the file during the call to
>> vfs_poll() otherwise we drop the fget/fput reference and rely on
>> an io_uring_poll_release_file() (similar to eventpoll_release_file())
>> to cancel our registered poll request.
> 
> Yes, this is a bit tricky as we hold the file ref across the operation. I'd
> be interested in seeing your approach to this, and also how it would
> interact with registered files...

Not sure I mentioned before but shutdown(2) / IORING_OP_SHUTDOWN
usually helps. Is there anything keeping you from doing that?
Do you only poll sockets or pipes as well?


>>>      c) A simple pipe based performance test shows the following numbers:
>>>         - 'poll':               Got 232387.31 pipe events/sec
>>>         - 'epoll':              Got 251125.25 pipe events/sec
>>>         - 'samba_io_uring_ev':  Got 210998.77 pipe events/sec
>>>         So the io_uring backend is even slower than the 'poll' backend.
>>>         I guess the reason is the constant re-submission of IORING_OP_POLL_ADD.
>>
>> Added some feature autodetection today and I'm now using
>> IORING_SETUP_COOP_TASKRUN, IORING_SETUP_TASKRUN_FLAG,
>> IORING_SETUP_SINGLE_ISSUER and IORING_SETUP_DEFER_TASKRUN if supported
>> by the kernel.
>>
>> On a 6.1 kernel this improved the performance a lot, it's now faster
>> than the epoll backend.
>>
>> The key flag is IORING_SETUP_DEFER_TASKRUN. On a different system than above
>> I'm getting the following numbers:
>> - epoll:                                    Got 114450.16 pipe events/sec
>> - poll:                                     Got 105872.52 pipe events/sec
>> - samba_io_uring_ev-without-defer_taskrun': Got  95564.22 pipe events/sec
>> - samba_io_uring_ev-with-defer_taskrun':    Got 122853.85 pipe events/sec
> 
> Any chance you can do a run with just IORING_SETUP_COOP_TASKRUN set? I'm
> curious how big of an impact the IPI elimination is, where it slots in
> compared to the defer taskrun and the default settings.

And if it doesn't take too much time to test, it would also be interesting
to see if there is any impact from IORING_SETUP_SINGLE_ISSUER alone,
without TASKRUN flags.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2022-10-26 17:08   ` Jens Axboe
  2022-10-26 17:41     ` Pavel Begunkov
@ 2022-10-27  8:05     ` Stefan Metzmacher
  2022-10-27 19:25       ` Stefan Metzmacher
  2022-10-27  8:51     ` Stefan Metzmacher
  2022-10-27 19:54     ` Stefan Metzmacher
  3 siblings, 1 reply; 14+ messages in thread
From: Stefan Metzmacher @ 2022-10-27  8:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Linus Torvalds, Samba Technical

Hi Jens,

> No problem - have you been able to test the current repo in general? I want to
> cut a 2.3 release shortly, but since that particular change impacts any kind of
> cqe waiting, would be nice to have a bit more confidence in it.

At least the timing bug is still fixed (as with my change).

>> I'm currently trying to prototype for an IORING_POLL_CANCEL_ON_CLOSE
>> flag that can be passed to POLL_ADD. With that we'll register
>> the request in &req->file->f_uring_poll (similar to the file->f_ep list for epoll)
>> Then we only get a real reference to the file during the call to
>> vfs_poll() otherwise we drop the fget/fput reference and rely on
>> an io_uring_poll_release_file() (similar to eventpoll_release_file())
>> to cancel our registered poll request.
> 
> Yes, this is a bit tricky as we hold the file ref across the operation. I'd
> be interested in seeing your approach to this, and also how it would
> interact with registered files...

Here's my current patch:
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=b9cccfac515739fc279c6eec87ce655a96f94685
It compiles, but I haven't tested it yet. And I'm not sure if the locking is done correctly...

>>>      c) A simple pipe based performance test shows the following numbers:
>>>         - 'poll':               Got 232387.31 pipe events/sec
>>>         - 'epoll':              Got 251125.25 pipe events/sec
>>>         - 'samba_io_uring_ev':  Got 210998.77 pipe events/sec
>>>         So the io_uring backend is even slower than the 'poll' backend.
>>>         I guess the reason is the constant re-submission of IORING_OP_POLL_ADD.
>>
>> Added some feature autodetection today and I'm now using
>> IORING_SETUP_COOP_TASKRUN, IORING_SETUP_TASKRUN_FLAG,
>> IORING_SETUP_SINGLE_ISSUER and IORING_SETUP_DEFER_TASKRUN if supported
>> by the kernel.
>>
>> On a 6.1 kernel this improved the performance a lot, it's now faster
>> than the epoll backend.
>>
>> The key flag is IORING_SETUP_DEFER_TASKRUN. On a different system than above
>> I'm getting the following numbers:
>> - epoll:                                    Got 114450.16 pipe events/sec
>> - poll:                                     Got 105872.52 pipe events/sec
>> - samba_io_uring_ev-without-defer_taskrun': Got  95564.22 pipe events/sec
>> - samba_io_uring_ev-with-defer_taskrun':    Got 122853.85 pipe events/sec
> 
> Any chance you can do a run with just IORING_SETUP_COOP_TASKRUN set? I'm
> curious how big of an impact the IPI elimination is, where it slots in
> compared to the defer taskrun and the default settings.

There's no real difference between these:

- no flag
- IORING_SETUP_COOP_TASKRUN|IORING_SETUP_TASKRUN_FLAG
- IORING_SETUP_SINGLE_ISSUER
- IORING_SETUP_COOP_TASKRUN|IORING_SETUP_TASKRUN_FLAG|IORING_SETUP_SINGLE_ISSUER


only these make it fast:
- IORING_SETUP_SINGLE_ISSUER|IORING_SETUP_DEFER_TASKRUN
- IORING_SETUP_COOP_TASKRUN|IORING_SETUP_TASKRUN_FLAG|IORING_SETUP_SINGLE_ISSUER|IORING_SETUP_DEFER_TASKRUN

metze

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2022-10-26 17:41     ` Pavel Begunkov
@ 2022-10-27  8:18       ` Stefan Metzmacher
  0 siblings, 0 replies; 14+ messages in thread
From: Stefan Metzmacher @ 2022-10-27  8:18 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Linus Torvalds, Samba Technical

Hi Pavel,

>>> I'm currently trying to prototype for an IORING_POLL_CANCEL_ON_CLOSE
>>> flag that can be passed to POLL_ADD. With that we'll register
>>> the request in &req->file->f_uring_poll (similar to the file->f_ep list for epoll)
>>> Then we only get a real reference to the file during the call to
>>> vfs_poll() otherwise we drop the fget/fput reference and rely on
>>> an io_uring_poll_release_file() (similar to eventpoll_release_file())
>>> to cancel our registered poll request.
>>
>> Yes, this is a bit tricky as we hold the file ref across the operation. I'd
>> be interested in seeing your approach to this, and also how it would
>> interact with registered files...
> 
> Not sure I mentioned before but shutdown(2) / IORING_OP_SHUTDOWN
> usually helps. Is there anything keeping you from doing that?

The thing is that the tevent backend has no control over what
callers do and there's no way to audit all high level samba code,
everything needs to work exactly as it does with the poll and epoll backends.

Only high level code that actively/directly uses io_uring features could cope
with io_uring specific behavior.

As a side note I recently work on bug related to half closed tcp sockets,
see https://bugzilla.samba.org/show_bug.cgi?id=15202
It seems that shutdown() (even with SHUT_RDWR) behaves differently
for tcp sockets compared to a (final) close().

> Do you only poll sockets or pipes as well?

Anything that's supported by poll needs to work.

>>> The key flag is IORING_SETUP_DEFER_TASKRUN. On a different system than above
>>> I'm getting the following numbers:
>>> - epoll:                                    Got 114450.16 pipe events/sec
>>> - poll:                                     Got 105872.52 pipe events/sec
>>> - samba_io_uring_ev-without-defer_taskrun': Got  95564.22 pipe events/sec
>>> - samba_io_uring_ev-with-defer_taskrun':    Got 122853.85 pipe events/sec
>>
>> Any chance you can do a run with just IORING_SETUP_COOP_TASKRUN set? I'm
>> curious how big of an impact the IPI elimination is, where it slots in
>> compared to the defer taskrun and the default settings.
> 
> And if it doesn't take too much time to test, it would also be interesting
> to see if there is any impact from IORING_SETUP_SINGLE_ISSUER alone,
> without TASKRUN flags.

See the other mail to Jens.

metze


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2022-10-26 17:08   ` Jens Axboe
  2022-10-26 17:41     ` Pavel Begunkov
  2022-10-27  8:05     ` Stefan Metzmacher
@ 2022-10-27  8:51     ` Stefan Metzmacher
  2022-10-27 12:12       ` Jens Axboe
  2022-10-27 19:54     ` Stefan Metzmacher
  3 siblings, 1 reply; 14+ messages in thread
From: Stefan Metzmacher @ 2022-10-27  8:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Hi Jens,

> No problem - have you been able to test the current repo in general? I want to
> cut a 2.3 release shortly, but since that particular change impacts any kind of
> cqe waiting, would be nice to have a bit more confidence in it.

Is 2.3 designed to be useful for 6.0 or also 6.1?

Maybe wait for IORING_SEND_ZC_REPORT_USAGE to arrive?

metze


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2022-10-27  8:51     ` Stefan Metzmacher
@ 2022-10-27 12:12       ` Jens Axboe
  2022-10-27 18:35         ` Stefan Metzmacher
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2022-10-27 12:12 UTC (permalink / raw)
  To: Stefan Metzmacher; +Cc: io-uring

On 10/27/22 2:51 AM, Stefan Metzmacher wrote:
> Hi Jens,
> 
>> No problem - have you been able to test the current repo in general? I want to
>> cut a 2.3 release shortly, but since that particular change impacts any kind of
>> cqe waiting, would be nice to have a bit more confidence in it.
> 
> Is 2.3 designed to be useful for 6.0 or also 6.1?

2.3 should be uptodate as of 6.0, don't want to release a new version
that has bits from a kernel release that hasn't happened yet. The
plan is to roughly do a liburing release at the same cadence as the
kernel releases. Not that they are necessarily linked, but some features
do obviously happen in lockstep like that.

> Maybe wait for IORING_SEND_ZC_REPORT_USAGE to arrive?

That'll be 2.4.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2022-10-27 12:12       ` Jens Axboe
@ 2022-10-27 18:35         ` Stefan Metzmacher
  0 siblings, 0 replies; 14+ messages in thread
From: Stefan Metzmacher @ 2022-10-27 18:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Am 27.10.22 um 14:12 schrieb Jens Axboe:
> On 10/27/22 2:51 AM, Stefan Metzmacher wrote:
>> Hi Jens,
>>
>>> No problem - have you been able to test the current repo in general? I want to
>>> cut a 2.3 release shortly, but since that particular change impacts any kind of
>>> cqe waiting, would be nice to have a bit more confidence in it.
>>
>> Is 2.3 designed to be useful for 6.0 or also 6.1?
> 
> 2.3 should be uptodate as of 6.0, don't want to release a new version
> that has bits from a kernel release that hasn't happened yet. The
> plan is to roughly do a liburing release at the same cadence as the
> kernel releases. Not that they are necessarily linked, but some features
> do obviously happen in lockstep like that.
> 
>> Maybe wait for IORING_SEND_ZC_REPORT_USAGE to arrive?
> 
> That'll be 2.4.

Ok, fine.

metze


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2022-10-27  8:05     ` Stefan Metzmacher
@ 2022-10-27 19:25       ` Stefan Metzmacher
  2022-12-28 16:19         ` Stefan Metzmacher
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Metzmacher @ 2022-10-27 19:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Samba Technical, Linus Torvalds, io-uring

Hi Jens,

>>> I'm currently trying to prototype for an IORING_POLL_CANCEL_ON_CLOSE
>>> flag that can be passed to POLL_ADD. With that we'll register
>>> the request in &req->file->f_uring_poll (similar to the file->f_ep list for epoll)
>>> Then we only get a real reference to the file during the call to
>>> vfs_poll() otherwise we drop the fget/fput reference and rely on
>>> an io_uring_poll_release_file() (similar to eventpoll_release_file())
>>> to cancel our registered poll request.
>>
>> Yes, this is a bit tricky as we hold the file ref across the operation. I'd
>> be interested in seeing your approach to this, and also how it would
>> interact with registered files...
> 
> Here's my current patch:
> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=b9cccfac515739fc279c6eec87ce655a96f94685
> It compiles, but I haven't tested it yet. And I'm not sure if the locking is done correctly...

It doesn't deadlock nor blow up immediately :-)
And it does fix the problem I had.

So what do you think about that patch?
Am I doing stupid things there?

These points might be changed:
- returning -EBADF instead of -ECANCELED
   might be better and allow the caller to avoid
   retrying.
- I guess we could use a single linked list, but
   I'm mostly used to how struct list_head works.
   And I want something that works first.
- We may find a better name than IORING_POLL_CANCEL_ON_CLOSE
- struct io_poll is completely full, as well as io_kiocb->flags
   (maybe we should move flags to 64 bit?),
   so we need to use some other generic struct io_kiocb space,
   which might also be good in order make it possible to keep io_poll_add()
   and io_arm_poll_handler() in common.
   But we may have the new field a bit differently. Note that
   struct io_kiocb (without this patch) still has 32 free bytes before
   4 64 byte cachelines are filled. With my patch 24 bytes are left...
- In struct file it might be possible to share a reference list with
   with the epoll code, where each element can indicate it epoll
   or io_uring is used.

I'm pasting it below in order to make it easier to get comments...

metze

  fs/file_table.c                |   3 ++
  include/linux/fs.h             |   1 +
  include/linux/io_uring.h       |  12 +++++
  include/linux/io_uring_types.h |   4 ++
  include/uapi/linux/io_uring.h  |   1 +
  io_uring/opdef.c               |   1 +
  io_uring/poll.c                | 100 ++++++++++++++++++++++++++++++++++++++++-
  io_uring/poll.h                |   1 +
  8 files changed, 122 insertions(+), 1 deletion(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index dd88701e54a9..cad408e9c0f5 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -16,6 +16,7 @@
  #include <linux/security.h>
  #include <linux/cred.h>
  #include <linux/eventpoll.h>
+#include <linux/io_uring.h>
  #include <linux/rcupdate.h>
  #include <linux/mount.h>
  #include <linux/capability.h>
@@ -147,6 +148,7 @@ static struct file *__alloc_file(int flags, const struct cred *cred)
  	}

  	atomic_long_set(&f->f_count, 1);
+	INIT_LIST_HEAD(&f->f_uring_poll);
  	rwlock_init(&f->f_owner.lock);
  	spin_lock_init(&f->f_lock);
  	mutex_init(&f->f_pos_lock);
@@ -309,6 +311,7 @@ static void __fput(struct file *file)
  	 * in the file cleanup chain.
  	 */
  	eventpoll_release(file);
+	io_uring_poll_release(file);
  	locks_remove_file(file);

  	ima_file_free(file);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e654435f1651..7f99efa7a1dc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -972,6 +972,7 @@ struct file {
  	/* Used by fs/eventpoll.c to link all the hooks to this file */
  	struct hlist_head	*f_ep;
  #endif /* #ifdef CONFIG_EPOLL */
+	struct list_head	f_uring_poll;
  	struct address_space	*f_mapping;
  	errseq_t		f_wb_err;
  	errseq_t		f_sb_err; /* for syncfs */
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 43bc8a2edccf..c931ea92c29a 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -61,6 +61,15 @@ static inline void io_uring_free(struct task_struct *tsk)
  	if (tsk->io_uring)
  		__io_uring_free(tsk);
  }
+
+void io_uring_poll_release_file(struct file *file);
+static inline void io_uring_poll_release(struct file *file)
+{
+	if (likely(list_empty_careful(&file->f_uring_poll)))
+		return;
+
+	io_uring_poll_release_file(file);
+}
  #else
  static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
  			      struct iov_iter *iter, void *ioucmd)
@@ -92,6 +101,9 @@ static inline const char *io_uring_get_opcode(u8 opcode)
  {
  	return "";
  }
+static inline void io_uring_poll_release(struct file *file)
+{
+}
  #endif

  #endif
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index f5b687a787a3..2373e01c57e7 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -547,8 +547,12 @@ struct io_kiocb {
  	union {
  		/* used by request caches, completion batching and iopoll */
  		struct io_wq_work_node	comp_list;
+		struct {
  		/* cache ->apoll->events */
  		__poll_t apoll_events;
+		u8 poll_cancel_on_close:1;
+		struct list_head		f_uring_poll_entry;
+		};
  	};
  	atomic_t			refs;
  	atomic_t			poll_refs;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index a2ce8ba7abb5..fe311667cb8c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -276,6 +276,7 @@ enum io_uring_op {
  #define IORING_POLL_UPDATE_EVENTS	(1U << 1)
  #define IORING_POLL_UPDATE_USER_DATA	(1U << 2)
  #define IORING_POLL_ADD_LEVEL		(1U << 3)
+#define IORING_POLL_CANCEL_ON_CLOSE	(1U << 4)

  /*
   * ASYNC_CANCEL flags.
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 34b08c87ffa5..540ee55961a3 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -131,6 +131,7 @@ const struct io_op_def io_op_defs[] = {
  		.name			= "POLL_ADD",
  		.prep			= io_poll_add_prep,
  		.issue			= io_poll_add,
+		.cleanup		= io_poll_cleanup,
  	},
  	[IORING_OP_POLL_REMOVE] = {
  		.audit_skip		= 1,
diff --git a/io_uring/poll.c b/io_uring/poll.c
index 0d9f49c575e0..d4ccf2f2e815 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -163,6 +163,19 @@ static inline void io_poll_remove_entry(struct io_poll *poll)

  static void io_poll_remove_entries(struct io_kiocb *req)
  {
+	if (!list_empty_careful(&req->f_uring_poll_entry)) {
+		spin_lock(&req->file->f_lock);
+		list_del_init_careful(&req->f_uring_poll_entry);
+		/*
+		 * upgrade to a full reference again,
+		 * it will be released in the common
+		 * cleanup code via io_put_file().
+		 */
+		if (!(req->flags & REQ_F_FIXED_FILE))
+			WARN_ON_ONCE(!get_file_rcu(req->file));
+		spin_unlock(&req->file->f_lock);
+	}
+
  	/*
  	 * Nothing to do if neither of those flags are set. Avoid dipping
  	 * into the poll/apoll/double cachelines if we can.
@@ -199,6 +212,54 @@ enum {
  	IOU_POLL_REMOVE_POLL_USE_RES = 2,
  };

+static inline struct file *io_poll_get_additional_file_ref(struct io_kiocb *req,
+							   unsigned issue_flags)
+{
+	if (!(req->poll_cancel_on_close))
+		return NULL;
+
+	if (unlikely(!req->file))
+		return NULL;
+
+	req->flags |= REQ_F_NEED_CLEANUP;
+
+	if (list_empty_careful(&req->f_uring_poll_entry)) {
+		/*
+		 * This first time we need to add ourself to the
+		 * file->f_uring_poll.
+		 */
+		spin_lock(&req->file->f_lock);
+		list_add_tail(&req->f_uring_poll_entry, &req->file->f_uring_poll);
+		spin_unlock(&req->file->f_lock);
+		if (!(req->flags & REQ_F_FIXED_FILE)) {
+			/*
+			 * If it's not a fixed file,
+			 * we can allow the caller to drop the existing
+			 * reference.
+			 */
+			return req->file;
+		}
+		/*
+		 * For fixed files we grab an additional reference
+		 */
+	}
+
+	io_ring_submit_lock(req->ctx, issue_flags);
+	if (unlikely(!req->file)) {
+		io_ring_submit_unlock(req->ctx, issue_flags);
+		return NULL;
+	}
+	rcu_read_lock();
+	if (unlikely(!get_file_rcu(req->file))) {
+		req->file = NULL;
+		req->cqe.fd = -1;
+		io_poll_mark_cancelled(req);
+	}
+	rcu_read_unlock();
+	io_ring_submit_unlock(req->ctx, issue_flags);
+	return req->file;
+}
+
  /*
   * All poll tw should go through this. Checks for poll events, manages
   * references, does rewait, etc.
@@ -230,7 +291,12 @@ static int io_poll_check_events(struct io_kiocb *req, bool *locked)
  		/* the mask was stashed in __io_poll_execute */
  		if (!req->cqe.res) {
  			struct poll_table_struct pt = { ._key = req->apoll_events };
+			unsigned issue_flags = (!*locked) ? IO_URING_F_UNLOCKED : 0;
+			struct file *file_to_put = io_poll_get_additional_file_ref(req, issue_flags);
+			if (unlikely(!req->file))
+				return -ECANCELED;
  			req->cqe.res = vfs_poll(req->file, &pt) & req->apoll_events;
+			io_put_file(file_to_put);
  		}

  		if ((unlikely(!req->cqe.res)))
@@ -499,6 +565,7 @@ static int __io_arm_poll_handler(struct io_kiocb *req,
  				 unsigned issue_flags)
  {
  	struct io_ring_ctx *ctx = req->ctx;
+	struct file *file_to_put;
  	int v;

  	INIT_HLIST_NODE(&req->hash_node);
@@ -506,6 +573,7 @@ static int __io_arm_poll_handler(struct io_kiocb *req,
  	io_init_poll_iocb(poll, mask, io_poll_wake);
  	poll->file = req->file;
  	req->apoll_events = poll->events;
+	INIT_LIST_HEAD(&req->f_uring_poll_entry);

  	ipt->pt._key = mask;
  	ipt->req = req;
@@ -529,7 +597,11 @@ static int __io_arm_poll_handler(struct io_kiocb *req,
  	if (issue_flags & IO_URING_F_UNLOCKED)
  		req->flags &= ~REQ_F_HASH_LOCKED;

+	file_to_put = io_poll_get_additional_file_ref(req, issue_flags);
+	if (unlikely(!req->file))
+		return -ECANCELED;
  	mask = vfs_poll(req->file, &ipt->pt) & poll->events;
+	io_put_file(file_to_put);

  	if (unlikely(ipt->error || !ipt->nr_entries)) {
  		io_poll_remove_entries(req);
@@ -857,11 +929,17 @@ int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  	if (sqe->buf_index || sqe->off || sqe->addr)
  		return -EINVAL;
  	flags = READ_ONCE(sqe->len);
-	if (flags & ~IORING_POLL_ADD_MULTI)
+	if (flags & ~(IORING_POLL_ADD_MULTI|IORING_POLL_CANCEL_ON_CLOSE))
  		return -EINVAL;
  	if ((flags & IORING_POLL_ADD_MULTI) && (req->flags & REQ_F_CQE_SKIP))
  		return -EINVAL;

+	if (flags & IORING_POLL_CANCEL_ON_CLOSE) {
+		req->poll_cancel_on_close = 1;
+	} else {
+		req->poll_cancel_on_close = 0;
+	}
+
  	poll->events = io_poll_parse_events(sqe, flags);
  	return 0;
  }
@@ -963,3 +1041,23 @@ void io_apoll_cache_free(struct io_cache_entry *entry)
  {
  	kfree(container_of(entry, struct async_poll, cache));
  }
+
+void io_uring_poll_release_file(struct file *file)
+{
+	struct io_kiocb *req, *next;
+
+	list_for_each_entry_safe(req, next, &file->f_uring_poll, f_uring_poll_entry) {
+		io_ring_submit_lock(req->ctx, IO_URING_F_UNLOCKED);
+		io_poll_mark_cancelled(req);
+		list_del_init_careful(&req->f_uring_poll_entry);
+		io_poll_remove_entries(req);
+		req->file = NULL;
+		io_poll_execute(req, 0);
+		io_ring_submit_unlock(req->ctx, IO_URING_F_UNLOCKED);
+	}
+}
+
+void io_poll_cleanup(struct io_kiocb *req)
+{
+	io_poll_remove_entries(req);
+}
diff --git a/io_uring/poll.h b/io_uring/poll.h
index ef25c26fdaf8..43e6b877f1bc 100644
--- a/io_uring/poll.h
+++ b/io_uring/poll.h
@@ -27,6 +27,7 @@ struct async_poll {

  int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
  int io_poll_add(struct io_kiocb *req, unsigned int issue_flags);
+void io_poll_cleanup(struct io_kiocb *req);

  int io_poll_remove_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
  int io_poll_remove(struct io_kiocb *req, unsigned int issue_flags);


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2022-10-26 17:08   ` Jens Axboe
                       ` (2 preceding siblings ...)
  2022-10-27  8:51     ` Stefan Metzmacher
@ 2022-10-27 19:54     ` Stefan Metzmacher
  3 siblings, 0 replies; 14+ messages in thread
From: Stefan Metzmacher @ 2022-10-27 19:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Linus Torvalds, Samba Technical

Hi Jens,

>> I'm currently trying to prototype for an IORING_POLL_CANCEL_ON_CLOSE
>> flag that can be passed to POLL_ADD. With that we'll register
>> the request in &req->file->f_uring_poll (similar to the file->f_ep list for epoll)
>> Then we only get a real reference to the file during the call to
>> vfs_poll() otherwise we drop the fget/fput reference and rely on
>> an io_uring_poll_release_file() (similar to eventpoll_release_file())
>> to cancel our registered poll request.
> 
> Yes, this is a bit tricky as we hold the file ref across the operation. I'd
> be interested in seeing your approach to this, and also how it would
> interact with registered files...

It should work fine with fixed files, but I haven't tested it.

But from reading the code I'm wondering what happens in general with pending
requests on a closed fixed file? There's no referencing in io_file_get_fixed(),
or is it done via io_req_set_rsrc_node() together with the
io_rsrc_node_switch_start()/io_queue_rsrc_removal()/io_rsrc_node_switch()
within io_fixed_fd_remove()?

But IORING_POLL_CANCEL_ON_CLOSE doesn't have any effect together with
REQ_F_FIXED_FILE (in my current code). Maybe a io_fixed_fd_remove()
should call __io_async_cancel with IORING_ASYNC_CANCEL_FD_FIXED.

I was also thinking if any pending could be canceled by a close(),
because they all have the registered in the struct file list...
But that might be overkill and io_uring aware applications can just use
IORING_ASYNC_CANCEL_FD_* explicitly. Also fixed files are also only
used by io_uring aware code.

metze

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2022-10-27 19:25       ` Stefan Metzmacher
@ 2022-12-28 16:19         ` Stefan Metzmacher
  2023-01-18 15:56           ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Metzmacher @ 2022-12-28 16:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Samba Technical, Linus Torvalds, io-uring

Hi Jens,

any change to get some feedback on these?
https://lore.kernel.org/io-uring/[email protected]/
and
https://lore.kernel.org/io-uring/[email protected]/

Thanks in advance!
metze

Am 27.10.22 um 21:25 schrieb Stefan Metzmacher:
> Hi Jens,
> 
>>>> I'm currently trying to prototype for an IORING_POLL_CANCEL_ON_CLOSE
>>>> flag that can be passed to POLL_ADD. With that we'll register
>>>> the request in &req->file->f_uring_poll (similar to the file->f_ep list for epoll)
>>>> Then we only get a real reference to the file during the call to
>>>> vfs_poll() otherwise we drop the fget/fput reference and rely on
>>>> an io_uring_poll_release_file() (similar to eventpoll_release_file())
>>>> to cancel our registered poll request.
>>>
>>> Yes, this is a bit tricky as we hold the file ref across the operation. I'd
>>> be interested in seeing your approach to this, and also how it would
>>> interact with registered files...
>>
>> Here's my current patch:
>> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=b9cccfac515739fc279c6eec87ce655a96f94685
>> It compiles, but I haven't tested it yet. And I'm not sure if the locking is done correctly...
> 
> It doesn't deadlock nor blow up immediately :-)
> And it does fix the problem I had.
> 
> So what do you think about that patch?
> Am I doing stupid things there?
> 
> These points might be changed:
> - returning -EBADF instead of -ECANCELED
>    might be better and allow the caller to avoid
>    retrying.
> - I guess we could use a single linked list, but
>    I'm mostly used to how struct list_head works.
>    And I want something that works first.
> - We may find a better name than IORING_POLL_CANCEL_ON_CLOSE
> - struct io_poll is completely full, as well as io_kiocb->flags
>    (maybe we should move flags to 64 bit?),
>    so we need to use some other generic struct io_kiocb space,
>    which might also be good in order make it possible to keep io_poll_add()
>    and io_arm_poll_handler() in common.
>    But we may have the new field a bit differently. Note that
>    struct io_kiocb (without this patch) still has 32 free bytes before
>    4 64 byte cachelines are filled. With my patch 24 bytes are left...
> - In struct file it might be possible to share a reference list with
>    with the epoll code, where each element can indicate it epoll
>    or io_uring is used.
> 
> I'm pasting it below in order to make it easier to get comments...
> 
> metze
> 
>   fs/file_table.c                |   3 ++
>   include/linux/fs.h             |   1 +
>   include/linux/io_uring.h       |  12 +++++
>   include/linux/io_uring_types.h |   4 ++
>   include/uapi/linux/io_uring.h  |   1 +
>   io_uring/opdef.c               |   1 +
>   io_uring/poll.c                | 100 ++++++++++++++++++++++++++++++++++++++++-
>   io_uring/poll.h                |   1 +
>   8 files changed, 122 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/file_table.c b/fs/file_table.c
> index dd88701e54a9..cad408e9c0f5 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -16,6 +16,7 @@
>   #include <linux/security.h>
>   #include <linux/cred.h>
>   #include <linux/eventpoll.h>
> +#include <linux/io_uring.h>
>   #include <linux/rcupdate.h>
>   #include <linux/mount.h>
>   #include <linux/capability.h>
> @@ -147,6 +148,7 @@ static struct file *__alloc_file(int flags, const struct cred *cred)
>       }
> 
>       atomic_long_set(&f->f_count, 1);
> +    INIT_LIST_HEAD(&f->f_uring_poll);
>       rwlock_init(&f->f_owner.lock);
>       spin_lock_init(&f->f_lock);
>       mutex_init(&f->f_pos_lock);
> @@ -309,6 +311,7 @@ static void __fput(struct file *file)
>        * in the file cleanup chain.
>        */
>       eventpoll_release(file);
> +    io_uring_poll_release(file);
>       locks_remove_file(file);
> 
>       ima_file_free(file);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e654435f1651..7f99efa7a1dc 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -972,6 +972,7 @@ struct file {
>       /* Used by fs/eventpoll.c to link all the hooks to this file */
>       struct hlist_head    *f_ep;
>   #endif /* #ifdef CONFIG_EPOLL */
> +    struct list_head    f_uring_poll;
>       struct address_space    *f_mapping;
>       errseq_t        f_wb_err;
>       errseq_t        f_sb_err; /* for syncfs */
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index 43bc8a2edccf..c931ea92c29a 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -61,6 +61,15 @@ static inline void io_uring_free(struct task_struct *tsk)
>       if (tsk->io_uring)
>           __io_uring_free(tsk);
>   }
> +
> +void io_uring_poll_release_file(struct file *file);
> +static inline void io_uring_poll_release(struct file *file)
> +{
> +    if (likely(list_empty_careful(&file->f_uring_poll)))
> +        return;
> +
> +    io_uring_poll_release_file(file);
> +}
>   #else
>   static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
>                     struct iov_iter *iter, void *ioucmd)
> @@ -92,6 +101,9 @@ static inline const char *io_uring_get_opcode(u8 opcode)
>   {
>       return "";
>   }
> +static inline void io_uring_poll_release(struct file *file)
> +{
> +}
>   #endif
> 
>   #endif
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index f5b687a787a3..2373e01c57e7 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -547,8 +547,12 @@ struct io_kiocb {
>       union {
>           /* used by request caches, completion batching and iopoll */
>           struct io_wq_work_node    comp_list;
> +        struct {
>           /* cache ->apoll->events */
>           __poll_t apoll_events;
> +        u8 poll_cancel_on_close:1;
> +        struct list_head        f_uring_poll_entry;
> +        };
>       };
>       atomic_t            refs;
>       atomic_t            poll_refs;
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index a2ce8ba7abb5..fe311667cb8c 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -276,6 +276,7 @@ enum io_uring_op {
>   #define IORING_POLL_UPDATE_EVENTS    (1U << 1)
>   #define IORING_POLL_UPDATE_USER_DATA    (1U << 2)
>   #define IORING_POLL_ADD_LEVEL        (1U << 3)
> +#define IORING_POLL_CANCEL_ON_CLOSE    (1U << 4)
> 
>   /*
>    * ASYNC_CANCEL flags.
> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> index 34b08c87ffa5..540ee55961a3 100644
> --- a/io_uring/opdef.c
> +++ b/io_uring/opdef.c
> @@ -131,6 +131,7 @@ const struct io_op_def io_op_defs[] = {
>           .name            = "POLL_ADD",
>           .prep            = io_poll_add_prep,
>           .issue            = io_poll_add,
> +        .cleanup        = io_poll_cleanup,
>       },
>       [IORING_OP_POLL_REMOVE] = {
>           .audit_skip        = 1,
> diff --git a/io_uring/poll.c b/io_uring/poll.c
> index 0d9f49c575e0..d4ccf2f2e815 100644
> --- a/io_uring/poll.c
> +++ b/io_uring/poll.c
> @@ -163,6 +163,19 @@ static inline void io_poll_remove_entry(struct io_poll *poll)
> 
>   static void io_poll_remove_entries(struct io_kiocb *req)
>   {
> +    if (!list_empty_careful(&req->f_uring_poll_entry)) {
> +        spin_lock(&req->file->f_lock);
> +        list_del_init_careful(&req->f_uring_poll_entry);
> +        /*
> +         * upgrade to a full reference again,
> +         * it will be released in the common
> +         * cleanup code via io_put_file().
> +         */
> +        if (!(req->flags & REQ_F_FIXED_FILE))
> +            WARN_ON_ONCE(!get_file_rcu(req->file));
> +        spin_unlock(&req->file->f_lock);
> +    }
> +
>       /*
>        * Nothing to do if neither of those flags are set. Avoid dipping
>        * into the poll/apoll/double cachelines if we can.
> @@ -199,6 +212,54 @@ enum {
>       IOU_POLL_REMOVE_POLL_USE_RES = 2,
>   };
> 
> +static inline struct file *io_poll_get_additional_file_ref(struct io_kiocb *req,
> +                               unsigned issue_flags)
> +{
> +    if (!(req->poll_cancel_on_close))
> +        return NULL;
> +
> +    if (unlikely(!req->file))
> +        return NULL;
> +
> +    req->flags |= REQ_F_NEED_CLEANUP;
> +
> +    if (list_empty_careful(&req->f_uring_poll_entry)) {
> +        /*
> +         * This first time we need to add ourself to the
> +         * file->f_uring_poll.
> +         */
> +        spin_lock(&req->file->f_lock);
> +        list_add_tail(&req->f_uring_poll_entry, &req->file->f_uring_poll);
> +        spin_unlock(&req->file->f_lock);
> +        if (!(req->flags & REQ_F_FIXED_FILE)) {
> +            /*
> +             * If it's not a fixed file,
> +             * we can allow the caller to drop the existing
> +             * reference.
> +             */
> +            return req->file;
> +        }
> +        /*
> +         * For fixed files we grab an additional reference
> +         */
> +    }
> +
> +    io_ring_submit_lock(req->ctx, issue_flags);
> +    if (unlikely(!req->file)) {
> +        io_ring_submit_unlock(req->ctx, issue_flags);
> +        return NULL;
> +    }
> +    rcu_read_lock();
> +    if (unlikely(!get_file_rcu(req->file))) {
> +        req->file = NULL;
> +        req->cqe.fd = -1;
> +        io_poll_mark_cancelled(req);
> +    }
> +    rcu_read_unlock();
> +    io_ring_submit_unlock(req->ctx, issue_flags);
> +    return req->file;
> +}
> +
>   /*
>    * All poll tw should go through this. Checks for poll events, manages
>    * references, does rewait, etc.
> @@ -230,7 +291,12 @@ static int io_poll_check_events(struct io_kiocb *req, bool *locked)
>           /* the mask was stashed in __io_poll_execute */
>           if (!req->cqe.res) {
>               struct poll_table_struct pt = { ._key = req->apoll_events };
> +            unsigned issue_flags = (!*locked) ? IO_URING_F_UNLOCKED : 0;
> +            struct file *file_to_put = io_poll_get_additional_file_ref(req, issue_flags);
> +            if (unlikely(!req->file))
> +                return -ECANCELED;
>               req->cqe.res = vfs_poll(req->file, &pt) & req->apoll_events;
> +            io_put_file(file_to_put);
>           }
> 
>           if ((unlikely(!req->cqe.res)))
> @@ -499,6 +565,7 @@ static int __io_arm_poll_handler(struct io_kiocb *req,
>                    unsigned issue_flags)
>   {
>       struct io_ring_ctx *ctx = req->ctx;
> +    struct file *file_to_put;
>       int v;
> 
>       INIT_HLIST_NODE(&req->hash_node);
> @@ -506,6 +573,7 @@ static int __io_arm_poll_handler(struct io_kiocb *req,
>       io_init_poll_iocb(poll, mask, io_poll_wake);
>       poll->file = req->file;
>       req->apoll_events = poll->events;
> +    INIT_LIST_HEAD(&req->f_uring_poll_entry);
> 
>       ipt->pt._key = mask;
>       ipt->req = req;
> @@ -529,7 +597,11 @@ static int __io_arm_poll_handler(struct io_kiocb *req,
>       if (issue_flags & IO_URING_F_UNLOCKED)
>           req->flags &= ~REQ_F_HASH_LOCKED;
> 
> +    file_to_put = io_poll_get_additional_file_ref(req, issue_flags);
> +    if (unlikely(!req->file))
> +        return -ECANCELED;
>       mask = vfs_poll(req->file, &ipt->pt) & poll->events;
> +    io_put_file(file_to_put);
> 
>       if (unlikely(ipt->error || !ipt->nr_entries)) {
>           io_poll_remove_entries(req);
> @@ -857,11 +929,17 @@ int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>       if (sqe->buf_index || sqe->off || sqe->addr)
>           return -EINVAL;
>       flags = READ_ONCE(sqe->len);
> -    if (flags & ~IORING_POLL_ADD_MULTI)
> +    if (flags & ~(IORING_POLL_ADD_MULTI|IORING_POLL_CANCEL_ON_CLOSE))
>           return -EINVAL;
>       if ((flags & IORING_POLL_ADD_MULTI) && (req->flags & REQ_F_CQE_SKIP))
>           return -EINVAL;
> 
> +    if (flags & IORING_POLL_CANCEL_ON_CLOSE) {
> +        req->poll_cancel_on_close = 1;
> +    } else {
> +        req->poll_cancel_on_close = 0;
> +    }
> +
>       poll->events = io_poll_parse_events(sqe, flags);
>       return 0;
>   }
> @@ -963,3 +1041,23 @@ void io_apoll_cache_free(struct io_cache_entry *entry)
>   {
>       kfree(container_of(entry, struct async_poll, cache));
>   }
> +
> +void io_uring_poll_release_file(struct file *file)
> +{
> +    struct io_kiocb *req, *next;
> +
> +    list_for_each_entry_safe(req, next, &file->f_uring_poll, f_uring_poll_entry) {
> +        io_ring_submit_lock(req->ctx, IO_URING_F_UNLOCKED);
> +        io_poll_mark_cancelled(req);
> +        list_del_init_careful(&req->f_uring_poll_entry);
> +        io_poll_remove_entries(req);
> +        req->file = NULL;
> +        io_poll_execute(req, 0);
> +        io_ring_submit_unlock(req->ctx, IO_URING_F_UNLOCKED);
> +    }
> +}
> +
> +void io_poll_cleanup(struct io_kiocb *req)
> +{
> +    io_poll_remove_entries(req);
> +}
> diff --git a/io_uring/poll.h b/io_uring/poll.h
> index ef25c26fdaf8..43e6b877f1bc 100644
> --- a/io_uring/poll.h
> +++ b/io_uring/poll.h
> @@ -27,6 +27,7 @@ struct async_poll {
> 
>   int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
>   int io_poll_add(struct io_kiocb *req, unsigned int issue_flags);
> +void io_poll_cleanup(struct io_kiocb *req);
> 
>   int io_poll_remove_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
>   int io_poll_remove(struct io_kiocb *req, unsigned int issue_flags);
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2022-12-28 16:19         ` Stefan Metzmacher
@ 2023-01-18 15:56           ` Jens Axboe
  2023-02-01 20:29             ` Stefan Metzmacher
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2023-01-18 15:56 UTC (permalink / raw)
  To: Stefan Metzmacher; +Cc: Samba Technical, Linus Torvalds, io-uring

On 12/28/22 9:19?AM, Stefan Metzmacher wrote:
> Hi Jens,
> 
> any change to get some feedback on these?
> https://lore.kernel.org/io-uring/[email protected]/
> and
> https://lore.kernel.org/io-uring/[email protected]/
> 
> Thanks in advance!

Finally getting around to this after the break...

I think your initial patch looks reasonable for doing cancel-on-close.
Can you resubmit it against for-6.3/io_uring so we can get it moving
forward, hopefully?

That would also be a good point to discuss the fixed file case as well,
as ideally this should obviously work on both types.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problems replacing epoll with io_uring in tevent
  2023-01-18 15:56           ` Jens Axboe
@ 2023-02-01 20:29             ` Stefan Metzmacher
  0 siblings, 0 replies; 14+ messages in thread
From: Stefan Metzmacher @ 2023-02-01 20:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Samba Technical, Linus Torvalds, io-uring

Hi Jens,

>> any change to get some feedback on these?
>> https://lore.kernel.org/io-uring/[email protected]/
>> and
>> https://lore.kernel.org/io-uring/[email protected]/
>>
>> Thanks in advance!
> 
> Finally getting around to this after the break...
> 
> I think your initial patch looks reasonable for doing cancel-on-close.
> Can you resubmit it against for-6.3/io_uring so we can get it moving
> forward, hopefully?
> 
> That would also be a good point to discuss the fixed file case as well,
> as ideally this should obviously work on both types.

I rebased on for-6.3/io_uring from over a week ago:
https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/oem-6.X-metze

root@ub1704-167:~/samba.git# uname -a
Linux ub1704-167 6.2.0-rc5-metze.01-00809-g3ffcd1640c8d #1 SMP PREEMPT_DYNAMIC Mon Jan 23 22:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I modified the test a bit in order to only test fd events (without signal events),
with this samba code:
https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master-io-uring-native-ops
https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=0ddfc6ac8f0bf7c33d0e273b45d9df1951b6452b

It shows that the epoll backend is still the fasted, see below.
I know that the samba_io_uring_ev_native is not really optimized in userspace,
but I fear the limiting factor is the need to re-issue IORING_OP_POLL_ADD
after each completion.

Having IORING_POLL_ADD_MULTI together with a working IORING_POLL_ADD_LEVEL,
might provide something that would be faster than the epoll backend,
but that would mean we would need to maintain a ready list and
issue only a limited amount of completions from the ready list
per io_uring_enter() in order to avoid overflowing the completion queue.
But I'm not sure how to implement that myself...

So IORING_POLL_CANCEL_ON_CLOSE is sadly not enough to be useful for me,
do you think I should submit it anyway even if it's unclear if samba will make use of it
any time soon?

metze

root@ub1704-167:~/samba.git# time bin/smbtorture //a/b local.event.samba_io_uring_ev_native.context
smbtorture 4.19.0pre1-DEVELOPERBUILD
Using seed 1675281295
time: 2023-02-01 19:54:55.272582
test: context
time: 2023-02-01 19:54:55.275092
backend 'samba_io_uring_ev_native' - test_event_context
Got 1000000.00 pipe events
Got 295844.07 pipe events/sec
time: 2023-02-01 19:54:58.655572
success: context

real    0m3,472s
user    0m1,861s
sys     0m1,609s

root@ub1704-167:~/samba.git# time bin/smbtorture //a/b local.event.epoll.context
smbtorture 4.19.0pre1-DEVELOPERBUILD
Using seed 1675281298
time: 2023-02-01 19:54:58.739744
test: context
time: 2023-02-01 19:54:58.741575
backend 'epoll' - test_event_context
Got 1000000.00 pipe events
Got 326759.90 pipe events/sec
time: 2023-02-01 19:55:01.802051
success: context

real    0m3,147s
user    0m1,926s
sys     0m1,218s

root@ub1704-167:~/samba.git# time bin/smbtorture //a/b local.event.poll.context
smbtorture 4.19.0pre1-DEVELOPERBUILD
Using seed 1675281930
time: 2023-02-01 20:05:30.685121
test: context
time: 2023-02-01 20:05:30.686870
backend 'poll' - test_event_context
Got 1000000.00 pipe events
Got 275666.78 pipe events/sec
time: 2023-02-01 20:05:34.314512
success: context

real    0m3,713s
user    0m1,799s
sys     0m1,911s



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-02-01 20:30 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-10-18 14:42 Problems replacing epoll with io_uring in tevent Stefan Metzmacher
2022-10-26 16:00 ` Stefan Metzmacher
2022-10-26 17:08   ` Jens Axboe
2022-10-26 17:41     ` Pavel Begunkov
2022-10-27  8:18       ` Stefan Metzmacher
2022-10-27  8:05     ` Stefan Metzmacher
2022-10-27 19:25       ` Stefan Metzmacher
2022-12-28 16:19         ` Stefan Metzmacher
2023-01-18 15:56           ` Jens Axboe
2023-02-01 20:29             ` Stefan Metzmacher
2022-10-27  8:51     ` Stefan Metzmacher
2022-10-27 12:12       ` Jens Axboe
2022-10-27 18:35         ` Stefan Metzmacher
2022-10-27 19:54     ` Stefan Metzmacher

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox