public inbox for [email protected]
From: Pavel Begunkov <[email protected]>
To: Ming Lei <[email protected]>, Kevin Wolf <[email protected]>
Cc: Jens Axboe <[email protected]>,
	[email protected], [email protected]
Subject: Re: [PATCH 5/9] io_uring: support SQE group
Date: Mon, 29 Apr 2024 16:48:37 +0100	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <Zihi3nDAJg1s7Cws@fedora>

On 4/24/24 02:39, Ming Lei wrote:
> On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote:
>> Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
>>> On 4/7/24 7:03 PM, Ming Lei wrote:
>>>> An SQE group is defined as a chain of SQEs starting with the first SQE that
>>>> has IOSQE_EXT_SQE_GROUP set and ending with the first subsequent SQE that
>>>> doesn't have it set; it is similar to a chain of linked SQEs.
>>>>
>>>> The first SQE is the group leader, and the other SQEs are group members. The
>>>> group leader is always freed after all members have completed. Group members
>>>> aren't submitted until the group leader has completed, and there is no
>>>> dependency among group members. IOSQE_IO_LINK can't be set for group members,
>>>> and the same applies to IOSQE_IO_DRAIN.
>>>>
>>>> Typically the group leader provides or produces a resource and the other
>>>> members consume it. Take a multiple-backup scenario: the first SQE reads data
>>>> from the source file into a fixed buffer, and the other SQEs write data from
>>>> that same buffer into several destination files. An SQE group completes this
>>>> task very efficiently: 1) the read SQE and all write SQEs can be submitted in
>>>> a single syscall, with no need to submit the read SQE first and wait for it
>>>> to complete; 2) the write SQEs don't have to be linked together, so they can
>>>> be issued to the destination files concurrently. The application is also
>>>> simplified considerably this way.
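
For illustration, a minimal userspace sketch of the multiple-backup case above.
It assumes the ext-flags mechanism and IOSQE_EXT_SQE_GROUP flag proposed by this
series (the sqe->ext_flags field and the flag value below are assumptions;
neither exists in mainline liburing or the kernel), so it is a sketch of the
proposed semantics rather than working code:

/*
 * Minimal sketch, not working code: sqe->ext_flags and IOSQE_EXT_SQE_GROUP
 * are taken from the RFC's proposal and are not in mainline liburing/kernel.
 */
#include <liburing.h>

#define IOSQE_EXT_SQE_GROUP	(1U << 0)	/* assumed value, per the RFC */

static int backup_range(struct io_uring *ring, int src_fd,
			const int *dst_fds, int nr_dst,
			void *fixed_buf, unsigned len, __u64 off)
{
	struct io_uring_sqe *sqe;
	int i;

	/* Group leader: read the range from the source file into the fixed
	 * buffer (registered at buf_index 0). */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read_fixed(sqe, src_fd, fixed_buf, len, off, 0);
	sqe->ext_flags = IOSQE_EXT_SQE_GROUP;	/* assumed field from the RFC */

	/* Group members: write the same buffer to every destination; they are
	 * only issued once the leader has completed, then run concurrently. */
	for (i = 0; i < nr_dst; i++) {
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_write_fixed(sqe, dst_fds[i], fixed_buf, len, off, 0);
		if (i < nr_dst - 1)
			sqe->ext_flags = IOSQE_EXT_SQE_GROUP;
		/* the last member leaves the flag clear, ending the group */
	}

	/* The read and all writes go out in a single submission. */
	return io_uring_submit(ring);
}
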
>>>>
>>>> Another use case is supporting generic device zero copy:
>>>>
>>>> - the lead SQE provides a device buffer, which is owned by the device or the
>>>>    kernel and must not cross into userspace, otherwise a malicious
>>>>    application could easily cause a leak or a panic
>>>>
>>>> - member SQEs read or write concurrently against the buffer provided by the
>>>>    lead SQE
>>>
>>> In concept, this looks very similar to "sqe bundles" that I played with
>>> in the past:
>>>
>>> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
>>>
>>> Didn't look too closely yet at the implementation, but in spirit it's
>>> about the same in that the first entry is processed first, and there's
>>> no ordering implied between the rest of the members of the bundle /
>>> group.
>>
>> When I first read this patch, I wondered if it wouldn't make sense to
>> allow linking a group with subsequent requests, e.g. first having a few
>> requests that run in parallel and once all of them have completed
>> continue with the next linked one sequentially.
>>
>> For SQE bundles, you reused the LINK flag, which doesn't easily allow
>> this. Ming's patch uses a new flag for groups, so the interface would be
>> more obvious, you simply set the LINK flag on the last member of the
>> group (or on the leader, doesn't really matter). Of course, this doesn't
>> mean it has to be implemented now, but there is a clear way forward if
>> it's wanted.
> 
> Reusing LINK for bundles breaks existing link chains (a BUNDLE linked into an
> existing link chain), so I think it may not work.
> 
> The link rule is explicit for an SQE group:
> 
> - only the group leader can set the link flag, and it applies to the whole
> group: the next SQE in the link chain won't be started until the previous
> linked SQE group has completed
> 
> - the link flag can't be set for group members
> 
> Also, an SQE group doesn't restrict async for either the group leader or the
> members.
> 
> SQE group vs. link & async is covered in the liburing test code in the last
> patch; a minimal sketch of the linking rule follows below.
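
A minimal sketch of that linking rule, under the same assumptions as the
earlier sketch (the sqe->ext_flags field and IOSQE_EXT_SQE_GROUP bit are taken
from the proposal, not mainline): the leader carries IOSQE_IO_LINK, the sole
member ends the group, and the following fsync is the next element of the link
chain, started only after the whole group completes.

/*
 * Sketch only: sqe->ext_flags / IOSQE_EXT_SQE_GROUP are proposed by the RFC
 * and not available in mainline liburing or the kernel.
 */
static void group_then_fsync(struct io_uring *ring, int src_fd, int dst_fd,
			     void *buf, unsigned len, __u64 off)
{
	struct io_uring_sqe *sqe;

	/* Group leader: only it may carry IOSQE_IO_LINK, and the link then
	 * applies to the whole group. */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, src_fd, buf, len, off);
	sqe->ext_flags = IOSQE_EXT_SQE_GROUP;	/* assumed field from the RFC */
	io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

	/* Sole member: no group flag, so the group ends here; members must
	 * not set IOSQE_IO_LINK themselves. */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_write(sqe, dst_fd, buf, len, off);

	/* Next element of the link chain: not started until the whole group
	 * (read + write) has completed. */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_fsync(sqe, dst_fd, 0);

	io_uring_submit(ring);
}
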
> 
>>
>> The part that looks a bit arbitrary in Ming's patch is that the group
>> leader is always completed before the rest starts. It makes perfect
>> sense in the context that this series is really after (enabling zero
>> copy for ublk), but it doesn't really allow the case you mention in the
>> SQE bundle commit message, running everything in parallel and getting a
>> single CQE for the whole group.
> 
> I think it should be easy to cover bundles in this way, for example by adding
> a new op IORING_OP_BUNDLE as Jens did, and implementing a single CQE for the
> whole group/bundle.
> 
>>
>> I suppose you could hack around the sequential nature of the first
>> request by using an extra NOP as the group leader - which isn't any
>> worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
>> the group completion would still be missing. (Of course, removing the
>> sequential first operation would mean that ublk wouldn't have the buffer
>> ready any more when the other requests try to use it, so that would
>> defeat the purpose of the series...)
>>
>> I wonder if we can still combine both approaches and create some
>> generally useful infrastructure and not something where it's visible
>> that it was designed mostly for ublk's special case and other use cases
>> just happened to be enabled as a side effect.
> 
> SQE group is actually a generic interface; please see the multiple-copy
> example (copying one range of a file to multiple destinations in a single
> syscall) in the last patch. It can also support generic device zero copy: any
> device-internal buffer can be linked with io_uring operations in this way,
> which can't be done with traditional splice/pipe.
> 
> I guess it could be used for network Rx zero copy too, but that may depend on
> the actual network Rx use case.

I doubt it. With storage, the same data can be read twice; a socket recv
consumes the data. Locking a buffer over the duration of another IO doesn't
really sound plausible, and the same goes for returning a buffer back. It would
be different if you could read the buffer into userspace when something goes
wrong, but perhaps you remember the fused discussion.

-- 
Pavel Begunkov


Thread overview: 38+ messages
2024-04-08  1:03 [RFC PATCH 0/9] io_uring: support sqe group and provide group kbuf Ming Lei
2024-04-08  1:03 ` [PATCH 1/9] io_uring: net: don't check sqe->__pad2[0] for send zc Ming Lei
2024-04-08  1:03 ` [PATCH 2/9] io_uring: support user sqe ext flags Ming Lei
2024-04-22 18:16   ` Jens Axboe
2024-04-23 13:57     ` Ming Lei
2024-04-29 15:24       ` Pavel Begunkov
2024-04-30  3:43         ` Ming Lei
2024-04-30 12:00           ` Pavel Begunkov
2024-04-30 12:56             ` Ming Lei
2024-04-30 14:10               ` Pavel Begunkov
2024-04-30 15:46                 ` Ming Lei
2024-05-02 14:22                   ` Pavel Begunkov
2024-05-04  1:19                     ` Ming Lei
2024-04-08  1:03 ` [PATCH 3/9] io_uring: add helper for filling cqes in __io_submit_flush_completions() Ming Lei
2024-04-08  1:03 ` [PATCH 4/9] io_uring: add one output argument to io_submit_sqe Ming Lei
2024-04-08  1:03 ` [PATCH 5/9] io_uring: support SQE group Ming Lei
2024-04-22 18:27   ` Jens Axboe
2024-04-23 13:08     ` Kevin Wolf
2024-04-24  1:39       ` Ming Lei
2024-04-25  9:27         ` Kevin Wolf
2024-04-26  7:53           ` Ming Lei
2024-04-26 17:05             ` Kevin Wolf
2024-04-29  3:34               ` Ming Lei
2024-04-29 15:48         ` Pavel Begunkov [this message]
2024-04-30  3:07           ` Ming Lei
2024-04-29 15:32       ` Pavel Begunkov
2024-04-30  3:03         ` Ming Lei
2024-04-30 12:27           ` Pavel Begunkov
2024-04-30 15:00             ` Ming Lei
2024-05-02 14:09               ` Pavel Begunkov
2024-05-04  1:56                 ` Ming Lei
2024-05-02 14:28               ` Pavel Begunkov
2024-04-24  0:46     ` Ming Lei
2024-04-08  1:03 ` [PATCH 6/9] io_uring: support providing sqe group buffer Ming Lei
2024-04-08  1:03 ` [PATCH 7/9] io_uring/uring_cmd: support provide group kernel buffer Ming Lei
2024-04-08  1:03 ` [PATCH 8/9] ublk: support provide io buffer Ming Lei
2024-04-08  1:03 ` [RFC PATCH 9/9] liburing: support sqe ext_flags & sqe group Ming Lei
2024-04-19  0:55 ` [RFC PATCH 0/9] io_uring: support sqe group and provide group kbuf Ming Lei
